Computer Science 320 S2 (2019)
Assignment 4
Due date Sep 28, 2019 23:59pm
Answer all of the following questions. There are 10 points, which contribute 5% of your total course
marks. Submit a properly typesetted pdf file (LATEX preferred) of your answers to Canvas before the
deadline. There is no automarker for the Python program. To aid the markers, scanned handwritten
solutions or late submissions are NOT accepted.
Email spam filtering
You are implementing a new email spam filtering for the University of Auckland. If the email comes from
one of these trusted addresses, it is not a spam. Otherwise, it will be a spam. Since the server memory
is very limited, you can not keep the list of trusted addresses on its memory (if an email address requires
25 bytes on average, it would take 25 GB to store A in the memory).
After searching on Google, you have found a very memory-efficient solution. That is, instead of
keeping A on memory, you will construct a bit array B of size n representing for the set A. You choose
n = 8000, 000, 000 bits so that you need only 1 GB of memory. The construction is as follows.
• B is initialized by 0s.
• You choose a hash function h : a 7→ [0, n). In other words, the input of h will be an string a (e.g.
email address) and the hash value will be an integer ranging between 0 and n.
• For each trusted email address a ∈ A, you hash a into one of n buckets, and set that bit to 1. In
order words, you simply set B[h(a)] = 1.
The filtering mechanism works as follows.
• When you receive a new email from the address a0, you compute the hash value h(a0).• If B[h(a0)] = 1, you consider that this email is not a spam and let it go through.• If B[h(a0)] = 0, you consider that this email is a spam and discard it.
Theoretical questions for measuring the performance of the filtering (5 pts):
1. Illustrate that if a new email from the address a ∈ A, it always gets through (1 pts).
2. Given any position 0 ≤ i < n, what is the probability that B[i] = 1 (2 pts).
3. Given a spam email from the address a
0 ∈/ A, what is the probability that it gets through (2 pts).
Practical implementation for measuring the performance of the filtering (5 pts):
Write a Python script to implement this email spam filtering technique given the scaled setting. Your
data structure B has size n = 8000, 000 bits. For simplicity, assume that your trusted email addresses
and spam email are presented as integers. In particular, trusted email addresses A = {1, 2, ..., 1000000}
and spam email addresses are any integer x > 1000, 000. You are free to choose your hash function to
hash an integer into the range [0, n).
1. Verify that any email from the address 1 ≤ a ≤ 1000, 000 it always gets through (1 pts).
2. Compute the probability that a spam email going through your filter given this setting. (1 pts).
3. Generate 1000 random integers x > 1000, 000 as spam email addresses and compute the number of
spam emails going through your filter. Verify this value with your theoretical value from the step
2 (3 pts).
因为专业,所以值得信赖。如有需要,请加QQ:99515681 或邮箱:[email protected]
微信:codehelp
原文地址:https://www.cnblogs.com/clga/p/11604328.html