An attribute dataset
- Consider a two-attribute dataset with two classes, where attributes x1 and x2
take values from the set V = {10, 20, 30, 40, 50}. For each class, assume each
attribute follows a multinomial distribution in which all attribute values
have non-zero probability (p(xi = v|C) > 0 ∀ i, v). In a Python notebook, seed
the random number generator with a value of 1 and immediately generate
such a dataset with 60k instances (equally distributed among the two
classes), picking values of p(xi = v|C). Plot the resulting data to visualize
what attribute distributions you have created. One way (not necessarily the
best) is in 2-D with different color points for each class (you’ll have a lot of
points, so consider using the marker style ‘.’). If you do this, then many
points will lie on top of one another (because the attribute values are
chosen from a small set), so you’ll want to spread them out by adding in
some random “jitter”, and sample a subset of them so that they aren’t
packed too densely (also consider using an alpha < 1 to make the points
transparent). Be sure to do this only for the visualization, not for the real
data. Play with your attribute value distributions until you feel that you
can “usually” predict where one class should occur in the 2-D space.
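A minimal sketch of one way to do this, using NumPy and Matplotlib. The probabilities p(xi = v|C) below are purely hypothetical placeholders; you should pick your own until the classes separate visually:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)  # seed the generator with 1, as required

values = np.array([10, 20, 30, 40, 50])
n_per_class = 30_000  # 60k total, equally distributed between the two classes

# Hypothetical p(xi = v | C): every value gets non-zero probability.
p = {
    0: {"x1": [0.50, 0.25, 0.13, 0.07, 0.05], "x2": [0.40, 0.30, 0.15, 0.10, 0.05]},
    1: {"x1": [0.05, 0.07, 0.13, 0.25, 0.50], "x2": [0.05, 0.10, 0.15, 0.30, 0.40]},
}

X_parts, y_parts = [], []
for c in (0, 1):
    x1 = rng.choice(values, size=n_per_class, p=p[c]["x1"])
    x2 = rng.choice(values, size=n_per_class, p=p[c]["x2"])
    X_parts.append(np.column_stack([x1, x2]))
    y_parts.append(np.full(n_per_class, c))
X, y = np.vstack(X_parts), np.concatenate(y_parts)

# Visualization only: subsample, add jitter, use '.' markers and alpha < 1.
idx = rng.choice(len(X), size=3000, replace=False)
jitter = rng.uniform(-3, 3, size=(len(idx), 2))
for c, color in ((0, "tab:blue"), (1, "tab:orange")):
    mask = y[idx] == c
    pts = X[idx][mask] + jitter[mask]
    plt.plot(pts[:, 0], pts[:, 1], ".", color=color, alpha=0.3, label=f"class {c}")
plt.legend()
plt.savefig("attribute_scatter.png")
```

Note that the jitter and subsampling are applied only to the plotted copy; `X` and `y` stay untouched for the experiments that follow.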
Continuing from above, use your 60k stratified dataset to generate 20
smaller datasets of sizes 30, 60, 100, 300, 600, 1000, and 3000 using
non-overlapping slices of the full dataset. Now, perform a set of experiments
to estimate the values p(xi = v|C) used to generate the sample. Show how the
estimates improve as the partitions grow. For simplicity, just pick one class
to present, either c0 or c1.
- For the following question, begin by dividing the email dataset from PS1
into a “train” and “test” set, where the “test” set is simply the final 540 email
messages in numerical order (from their name).
Determine, for each word that appears in the training set, the mutual
information with the class variable using the procedure described in the
McCallum98 paper. The end result should be a list giving each word and its
mutual information content (as a float value), in sorted order. Save this data
to a file.
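One way to sketch this computation, treating each word as a binary "appears in the document" variable and measuring its mutual information with the class. The toy corpus and the output filename `mi_ranking.txt` are stand-ins for the real email data and whatever filename you choose:

```python
import math
from collections import defaultdict

def mutual_information(docs, labels):
    """docs: list of token lists; labels: 0/1 class per document.
    Treats each word W as a binary presence variable and computes I(W; C)."""
    n = len(docs)
    n_c = [labels.count(0), labels.count(1)]
    df = defaultdict(lambda: [0, 0])  # per-class document frequency
    for toks, c in zip(docs, labels):
        for w in set(toks):
            df[w][c] += 1
    mi = {}
    for w, (d0, d1) in df.items():
        total = 0.0
        p_w = (d0 + d1) / n  # P(word present)
        for c in (0, 1):
            d = (d0, d1)[c]
            p_c = n_c[c] / n
            for present, count in ((1, d), (0, n_c[c] - d)):
                p_joint = count / n
                if p_joint == 0.0:
                    continue  # 0 * log(...) -> 0 by convention
                p_marg = p_w if present else 1.0 - p_w
                total += p_joint * math.log2(p_joint / (p_marg * p_c))
        mi[w] = total
    return sorted(mi.items(), key=lambda kv: -kv[1])

# Toy stand-in for the email training set.
docs = [["free", "money", "now"], ["free", "offer"],
        ["meeting", "agenda"], ["agenda", "notes"]]
labels = [0, 0, 1, 1]  # 0 = spam, 1 = ham

ranked = mutual_information(docs, labels)
with open("mi_ranking.txt", "w") as f:
    for w, score in ranked:
        f.write(f"{w}\t{score}\n")
```

A word that appears in every document of one class and never in the other (like "free" above) carries maximal information about the class, so it sorts to the top.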
Implement the multi-variate Bernoulli model from the McCallum paper using equations 1, 2, 3 and 4. For equation 2, be sure that your implementation adds together log likelihoods instead of multiplying probabilities. Plot the
performance of the classifier on the training data, and on the testing data using
vocabulary size of 50, 100, 1000, and all words from your ranked calculations in
the previous step.
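A compact sketch of the multi-variate Bernoulli model, assuming documents have already been converted to a binary word-presence matrix over your chosen vocabulary. The equation numbers in the comments refer to the McCallum paper; the Laplace smoothing follows its equation 4:

```python
import numpy as np

def train_bernoulli_nb(X, y):
    """X: (n_docs, |V|) 0/1 word-presence matrix; y: 0/1 class labels.
    Returns log class priors (eq. 1) and Laplace-smoothed p(w_t|c) (eq. 4)."""
    log_prior, cond = [], []
    for c in (0, 1):
        Xc = X[y == c]
        log_prior.append(np.log(len(Xc) / len(X)))
        cond.append((1 + Xc.sum(axis=0)) / (2 + len(Xc)))  # Laplace smoothing
    return np.array(log_prior), np.array(cond)

def predict_bernoulli_nb(X, log_prior, cond):
    # Eq. 2 as a sum of logs: each absent word also contributes log(1 - p).
    log_lik = X @ np.log(cond).T + (1 - X) @ np.log(1 - cond).T
    return np.argmax(log_lik + log_prior, axis=1)  # MAP class

# Tiny sanity check: word 0 indicates class 0, word 1 indicates class 1.
X_train = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_train = np.array([0, 0, 1, 1])
log_prior, cond = train_bernoulli_nb(X_train, y_train)
print(predict_bernoulli_nb(X_train, log_prior, cond))
```

Working in log space, as the question requires, avoids the numeric underflow you would get from multiplying thousands of small per-word probabilities.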
Implement the multinomial model using equations 1, 5, and 6 from the
McCallum paper. Note that the leading terms in equation 5 (i.e., P(|di|)|di|!)
can be dropped, as this value is identical across both classes. Similarly, the
denominator in equation 5 (Nit!) can also be dropped, as these terms together
just act as a constant scaling factor. Plot your results as in the previous part.
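The same sketch adapted to the multinomial model, now assuming a word-count matrix rather than a presence matrix. The dropped factorial terms from equation 5 never appear because, as noted above, they are constant across classes and cannot change the argmax:

```python
import numpy as np

def train_multinomial_nb(X, y):
    """X: (n_docs, |V|) word-count matrix; y: 0/1 class labels.
    Returns log class priors (eq. 1) and Laplace-smoothed log p(w_t|c) (eq. 6)."""
    V = X.shape[1]
    log_prior, log_cond = [], []
    for c in (0, 1):
        Xc = X[y == c]
        log_prior.append(np.log(len(Xc) / len(X)))
        counts = Xc.sum(axis=0)
        log_cond.append(np.log((1 + counts) / (V + counts.sum())))
    return np.array(log_prior), np.array(log_cond)

def predict_multinomial_nb(X, log_prior, log_cond):
    # Eq. 5 with the P(|d_i|)|d_i|! and N_it! terms dropped: each word's count
    # simply scales its log conditional probability.
    return np.argmax(X @ log_cond.T + log_prior, axis=1)

# Tiny sanity check: word 0 dominates class-0 documents, word 1 class 1.
X_train = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
y_train = np.array([0, 0, 1, 1])
log_prior, log_cond = train_multinomial_nb(X_train, y_train)
print(predict_multinomial_nb(X_train, log_prior, log_cond))
```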
Revisit the Email training set. Grab the text document SPAMTrain.label which
contains (label, filename) pairs. The label 0 corresponds to the SPAM class,
while a 1 corresponds to the HAM (normal email) class. In this question, we’ll
build multiple classifiers to compare against the earlier Bayesian methods.
Preliminaries: begin with the code from PS2 in which you divided the SPAM data
set into training and test subsets where the test subset included the final 540
email messages in numerical order based on their name.
- Baseline: Use scikit-learn’s CountVectorizer to create vocabularies, V,
of size 50, 100, 1000, and all words from the training set by setting the
max_features appropriately. Then, use this vectorizer in conjunction with
the KNeighborsClassifier to build a classification pipeline (with the Pipeline
class or the make_pipeline helper). Find the accuracy for
K = 1, 5, 11, 21, 51, and 101 when: (a) trained and tested on the training
data; and (b) trained on the training data and tested on the test data.
Recalling that KNeighbors essentially memorizes the data, write one sentence
explaining why the accuracy for part (a) isn’t always 100% here.
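The pipeline itself is only a few lines. The corpus below is a tiny stand-in for the email split, and the loops use small K and |V| values so the sketch runs on it; the real run would use the grids named in the question:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus; the real code would load the train/test email split.
train_docs = ["free money offer now", "free offer click here",
              "cheap meds free shipping", "win money now",
              "meeting agenda attached", "notes from the meeting",
              "project status update", "lunch plans for friday"]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = spam, 1 = ham

for vocab_size in (50, None):          # None keeps all words
    for k in (1, 3):                   # the assignment uses 1, 5, 11, 21, 51, 101
        pipe = make_pipeline(CountVectorizer(max_features=vocab_size),
                             KNeighborsClassifier(n_neighbors=k))
        pipe.fit(train_docs, train_labels)
        acc = pipe.score(train_docs, train_labels)
        print(f"|V|={vocab_size}, K={k}: train accuracy {acc:.2f}")
```

Note for part (a): with K > 1, a training point is outvoted by its neighbors whenever they mostly belong to the other class, and distinct documents can collide onto identical count vectors, so training accuracy need not be 100%.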
Scikit-learn’s GridSearchCV is used to explore a set of parameters. For
this problem, run the same experiment as above, but instead of setting your
choice of V and K in a loop, use scikit-learn’s GridSearchCV class to do
that. You’ll need to give it the lists of viable parameters for the
KNeighborsClassifier and for the CountVectorizer. Call the GridSearchCV’s fit
method on the training data to find the best parameter
setting.
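A sketch of the GridSearchCV version, again on a tiny stand-in corpus (the parameter values and `cv=2` are shrunk to fit it; the real grids are the |V| and K lists from the previous part). Grid parameters are addressed as `stepname__paramname`:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

train_docs = ["free money offer now", "free offer click here",
              "cheap meds free shipping", "win money now",
              "meeting agenda attached", "notes from the meeting",
              "project status update", "lunch plans for friday"]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([("vec", CountVectorizer()),
                 ("knn", KNeighborsClassifier())])
param_grid = {
    "vec__max_features": [50, None],   # real grid: 50, 100, 1000, None
    "knn__n_neighbors": [1, 3],        # real grid: 1, 5, 11, 21, 51, 101
}
search = GridSearchCV(pipe, param_grid, cv=2)  # cv=2 only for the toy corpus
search.fit(train_docs, train_labels)
print(search.best_params_, search.best_score_)
```

GridSearchCV scores each parameter combination by cross-validation on the training data, so `best_params_` reflects held-out folds rather than memorized training accuracy.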
Repeat the GridSearch as above, but this time change the pipeline. Previously,
the pipeline contained (1) CountVectorizer and (2) KNeighborsClassifier.
Modify it to include PCA (via TruncatedSVD), giving the following three steps:
(1) CountVectorizer (allowing all words into the vocabulary);
(2) TruncatedSVD; (3) KNeighborsClassifier. Search through
parameters considering 1, 5, 11, 21, 51, and 101 neighbors and 50, 100, 200,
and 1000 principal components. Compare the results to your previous
step.
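Extending the pipeline is just a matter of inserting the TruncatedSVD step between the vectorizer and the classifier. As before, the corpus and grid values below are shrunken stand-ins (TruncatedSVD requires n_components smaller than the number of features, and with a toy corpus that forces tiny values):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

train_docs = ["free money offer now", "free offer click here",
              "cheap meds free shipping", "win money now",
              "meeting agenda attached", "notes from the meeting",
              "project status update", "lunch plans for friday"]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([("vec", CountVectorizer()),   # all words in the vocabulary
                 ("svd", TruncatedSVD()),
                 ("knn", KNeighborsClassifier())])
param_grid = {
    "svd__n_components": [2, 3],   # real grid: 50, 100, 200, 1000
    "knn__n_neighbors": [1, 3],    # real grid: 1, 5, 11, 21, 51, 101
}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(train_docs, train_labels)
print(search.best_params_, search.best_score_)
```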
Repeat the process above, changing the pipeline in the following
ways: (1) replace the KNeighborsClassifier with a LinearSVC (a support vector
machine with a linear decision boundary); (2) replace the CountVectorizer
with a TfidfVectorizer (set sublinear_tf to True).
- Load in the text from the SPAM dataset and create a histogram showing the
distribution of message lengths. Additionally, print the maximum, mean,
and median message lengths. Note that the maximum message length is
quite long.
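A sketch of the histogram and summary statistics. Random lognormal lengths stand in for the real measured message lengths, since email text lengths are typically heavy-tailed in just this way:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Stand-in data: the real code would measure len(text) for each email.
rng = np.random.default_rng(0)
lengths = rng.lognormal(mean=6.0, sigma=1.0, size=2000).astype(int)

plt.hist(lengths, bins=50)
plt.xlabel("message length")
plt.ylabel("count")
plt.savefig("length_hist.png")

print("max:   ", lengths.max())
print("mean:  ", lengths.mean())
print("median:", np.median(lengths))
```

With a heavy-tailed distribution like this, the maximum sits far above the mean, and the mean above the median, which is exactly the pattern the question asks you to notice.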
Convolutional Neural Network: Use the architecture from question 1 (the
CNN) to train on the spam detection problem. You’ll need to pick a length
for the representation of the document; use 1.5× the mean document
length (in words) and truncate documents to that length. Use the
50-dimensional GloVe embeddings (instead of the 100-dimensional embeddings
in the tutorial), and turn training off on the embedding layer (as in the
tutorial). Document
your performance after 10 epochs. Then show a graph of training and
validation performance from epochs 2 through 100.
Compare the performance of the CNN to a simple network that just uses
the GloVe embeddings (without convolution). For this step, you’ll use an
architecture that is similar to this (relatively old) tutorial. Specifically,
you’ll need an input layer and embedding layer as in question 3, but then
these should go into a GlobalAveragePooling1D layer followed by two
Dense layers. The Dense layers should have a size of 50 and then 1.
Begin by providing an explanation of what this model’s architecture is
actually doing. Then, as before, show a graph of training and validation
performance from epochs 2 through 100. Justify an appropriate stopping
point for this model. Illustrate the performance difference between this
model and the model from the previous question, each at the stopping point
you have selected.
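A Keras sketch of this pooled-embedding architecture. The sequence length and vocabulary size below are hypothetical placeholders, and a real run would load the 50-dimensional GloVe weights into the Embedding layer (here left randomly initialized but frozen, to show the `trainable=False` setting):

```python
import numpy as np
from tensorflow import keras

seq_len, vocab_size, embed_dim = 200, 5000, 50  # placeholders; embed_dim matches GloVe-50

inputs = keras.Input(shape=(seq_len,))
x = keras.layers.Embedding(vocab_size, embed_dim, trainable=False)(inputs)
x = keras.layers.GlobalAveragePooling1D()(x)      # mean of the word vectors
x = keras.layers.Dense(50, activation="relu")(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

Architecturally, the GlobalAveragePooling1D layer collapses the sequence to the mean of its word embeddings, so the classifier sees a single 50-dimensional "average word" per document: a continuous bag-of-words that ignores word order entirely, which is the key contrast with the CNN.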