Computer Science 320 S2

Computer Science 320 S2 (2019)
Assignment 4
Due date: Sep 28, 2019, 23:59
Answer all of the following questions. There are 10 points, which contribute 5% of your total course
marks. Submit a properly typeset PDF file (LaTeX preferred) of your answers to Canvas before the
deadline. There is no automarker for the Python program. To aid the markers, scanned handwritten
solutions and late submissions are NOT accepted.
Email spam filtering
You are implementing a new email spam filter for the University of Auckland. Let A be the set of
trusted email addresses. If an email comes from one of these trusted addresses, it is not spam;
otherwise, it is spam. Since the server memory is very limited, you cannot keep the list of trusted
addresses in memory (if an email address requires 25 bytes on average, it would take 25 GB to store
A in memory).
After searching on Google, you have found a very memory-efficient solution: instead of
keeping A in memory, you construct a bit array B of size n representing the set A. You choose
n = 8,000,000,000 bits so that you need only 1 GB of memory. The construction is as follows.
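The memory figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal gigabytes (1 GB = 10^9 bytes), and the variable names are illustrative rather than taken from the assignment.

```python
# Back-of-the-envelope check of the memory figures quoted in the text.
BYTES_PER_ADDRESS = 25          # average address size stated in the text
LIST_BYTES = 25 * 10**9         # 25 GB, assuming decimal gigabytes
N_BITS = 8_000_000_000          # size n of the bit array B

# Number of trusted addresses implied by the 25 GB figure.
implied_addresses = LIST_BYTES // BYTES_PER_ADDRESS
# Memory needed for B when 8 bits are packed into each byte.
bit_array_bytes = N_BITS // 8

print(implied_addresses)            # 1000000000
print(bit_array_bytes / 10**9)      # 1.0
print(N_BITS / implied_addresses)   # 8.0
```

Under these assumptions the 25 GB figure corresponds to about one billion addresses, and B provides roughly 8 bits per trusted address.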
• B is initialized to all 0s.
• You choose a hash function h : a ↦ [0, n). In other words, the input of h is a string a (e.g. an
email address) and the hash value is an integer i with 0 ≤ i < n.
• For each trusted email address a ∈ A, you hash a into one of the n buckets and set that bit to 1. In
other words, you simply set B[h(a)] = 1.
The filtering mechanism works as follows.
• When you receive a new email from the address a′, you compute the hash value h(a′).
• If B[h(a′)] = 1, you consider that this email is not spam and let it go through.
• If B[h(a′)] = 0, you consider that this email is spam and discard it.
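The construction and the filtering rule can be sketched in Python as follows. This is a minimal illustration, not the required solution: the SHA-256-based hash, the helper names `build_bit_array` and `gets_through`, the tiny value of n, and the example addresses are all assumptions made for the example.

```python
import hashlib

def h(address: str, n: int) -> int:
    """Hash a string to an integer in [0, n). SHA-256 is an illustrative choice."""
    digest = hashlib.sha256(address.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n

def build_bit_array(trusted, n: int) -> bytearray:
    """Return B as a packed bit array with B[h(a)] = 1 for every trusted a."""
    bits = bytearray((n + 7) // 8)        # n bits, all initialized to 0
    for a in trusted:
        i = h(a, n)
        bits[i >> 3] |= 1 << (i & 7)      # set bit i
    return bits

def gets_through(bits: bytearray, address: str, n: int) -> bool:
    """An email passes the filter exactly when B[h(address)] = 1."""
    i = h(address, n)
    return bool(bits[i >> 3] & (1 << (i & 7)))

n = 1024                                                   # small n for the demo
trusted = ["alice@auckland.ac.nz", "bob@auckland.ac.nz"]   # hypothetical addresses
B = build_bit_array(trusted, n)
print(all(gets_through(B, a, n) for a in trusted))         # True
```

Packing the bits into a `bytearray` keeps the memory use at about n/8 bytes, which matches the spirit of the 1 GB budget in the text.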
Theoretical questions for measuring the performance of the filtering (5 pts):
1. Show that a new email from an address a ∈ A always gets through (1 pt).
2. Given any position 0 ≤ i < n, what is the probability that B[i] = 1? (2 pts)
3. Given a spam email from an address a′ ∉ A, what is the probability that it gets through? (2 pts)
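One common way to approach these questions is sketched below. It is not an official solution. It assumes that h maps each address independently and uniformly at random to one of the n positions, and it introduces the symbol m = |A| for the number of trusted addresses, which does not appear in the assignment text.

```latex
% Assumptions: h is uniform and independent across addresses; m = |A|.
\begin{align*}
\Pr\bigl[B[i] = 0\bigr] &= \left(1 - \frac{1}{n}\right)^{m}, \\
\Pr\bigl[B[i] = 1\bigr] &= 1 - \left(1 - \frac{1}{n}\right)^{m} \approx 1 - e^{-m/n}, \\
\Pr\bigl[\text{spam } a' \notin A \text{ gets through}\bigr]
  &= \Pr\bigl[B[h(a')] = 1\bigr] = 1 - \left(1 - \frac{1}{n}\right)^{m}.
\end{align*}
% With m / n = 1/8 (for example m = 10^9, n = 8 \times 10^9):
% 1 - e^{-1/8} \approx 0.1175.
```

The first question needs no probability: every a ∈ A had B[h(a)] set to 1 during construction and bits are never cleared, so the lookup for a trusted address always finds a 1.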
Practical implementation for measuring the performance of the filtering (5 pts):
Write a Python script to implement this email spam filtering technique in a scaled-down setting. Your
data structure B has size n = 8,000,000 bits. For simplicity, assume that trusted email addresses
and spam email addresses are represented as integers. In particular, the trusted email addresses are
A = {1, 2, ..., 1,000,000} and a spam email address is any integer x > 1,000,000. You are free to
choose your hash function to hash an integer into the range [0, n).
1. Verify that any email from an address 1 ≤ a ≤ 1,000,000 always gets through (1 pt).
2. Compute the probability that a spam email goes through your filter in this setting (1 pt).
3. Generate 1000 random integers x > 1,000,000 as spam email addresses and compute the number of
spam emails that go through your filter. Compare this value with your theoretical value from step
2 (3 pts).
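A possible shape for such a script is sketched below. It is one illustration under stated assumptions, not a reference solution: the splitmix64-style mixing function `h`, the fixed seed, the upper bound 2^40 for spam integers, and all helper names are choices made for this example. The theoretical value assumes the hash behaves like a uniform random function.

```python
import random

N = 8_000_000            # bits in B (scaled setting from the text)
M = 1_000_000            # trusted addresses are the integers 1..M
MASK64 = (1 << 64) - 1

def h(x: int) -> int:
    """Splitmix64-style integer mixer reduced to [0, N); an illustrative choice."""
    z = (x + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return (z ^ (z >> 31)) % N

# Build B as a packed bit array: one byte holds 8 bits.
B = bytearray(N // 8)
for a in range(1, M + 1):
    i = h(a)
    B[i >> 3] |= 1 << (i & 7)

def gets_through(x: int) -> bool:
    """An address passes exactly when its bit in B is set."""
    i = h(x)
    return bool((B[i >> 3] >> (i & 7)) & 1)

# Step 1: every trusted address must pass.
all_trusted_pass = all(gets_through(a) for a in range(1, M + 1))

# Step 2: theoretical pass probability for spam, assuming uniform hashing.
p_theory = 1 - (1 - 1 / N) ** M

# Step 3: empirical count over 1000 random spam addresses (seed is arbitrary).
rng = random.Random(320)
spam = [rng.randrange(M + 1, 2**40) for _ in range(1000)]
passed = sum(gets_through(x) for x in spam)

print(all_trusted_pass)
print(f"theoretical pass probability: {p_theory:.4f}")
print(f"spam passed: {passed} of {len(spam)}")
```

The observed count varies with the hash function and the random sample, so it should be compared with 1000 × p_theory only up to sampling noise.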


Original article: https://www.cnblogs.com/clga/p/11604328.html
