An attribute dataset

  1. Consider a two-attribute dataset with two classes, where attributes x1 and
    x2 take values from the set V = {10, 20, 30, 40, 50}. For each class, assume
    each attribute follows a multinomial distribution such that all attribute
    values have non-zero probability (p(xi = v|C) > 0 ∀i, v). In a Python
    notebook, seed
    the random number generator with a value of 1 and immediately generate
    such a dataset with 60k instances (equally distributed among the two
    classes), picking values of p(xi = v|C). Plot the resulting data to visualize
    what attribute distributions you have created. One way (not necessarily the
    best) is in 2-D with different color points for each class (you’ll have a lot of
    points, so consider using the marker style ‘.’). If you do this, then many
    points will lie on top of one another (because the attribute values are
    chosen from a small set), so you’ll want to spread them out by adding in
    some random “jitter”, and sample a subset of them so that they aren’t
    packed too densely (also consider using an alpha < 1 to make the points
    transparent). Be sure to do this only for the visualization, not for the
    real data. Play with the attribute value distributions until you feel that
    you can “usually” predict where one class should occur in the 2-D space.
    Continuing from above, use your 60k stratified dataset to generate 20
    smaller datasets of sizes 30, 60, 100, 300, 600, 1000, and 3000 using
    non-overlapping slices of the full dataset. Now, perform a set of
    experiments to estimate the values p(xi = v|C) used to generate the sample,
    and show how the estimates improve as the partitions grow. For simplicity,
    pick just one class to present, either c0 or c1.
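
    A minimal sketch of one possible approach, assuming numpy and matplotlib;
    the probability tables p0 and p1 below are illustrative placeholders, not
    prescribed values:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)          # seed the generator with 1
values = np.array([10, 20, 30, 40, 50])

# One probability vector per attribute per class (rows sum to 1).
p0 = np.array([[0.50, 0.25, 0.15, 0.05, 0.05],   # p(x1 = v | c0)
               [0.05, 0.05, 0.15, 0.25, 0.50]])  # p(x2 = v | c0)
p1 = np.array([[0.05, 0.05, 0.15, 0.25, 0.50],
               [0.50, 0.25, 0.15, 0.05, 0.05]])

n = 30000                                # 30k per class -> 60k total
X0 = np.column_stack([rng.choice(values, n, p=p) for p in p0])
X1 = np.column_stack([rng.choice(values, n, p=p) for p in p1])

# Visualization only: subsample, add jitter, and use transparency.
idx = rng.choice(n, 2000, replace=False)
for X, color in ((X0, 'tab:blue'), (X1, 'tab:orange')):
    J = X[idx] + rng.normal(0, 1.5, size=(len(idx), 2))  # jitter
    plt.plot(J[:, 0], J[:, 1], '.', color=color, alpha=0.2)
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()

# Estimate p(x1 = v | c0) from non-overlapping slices of the class-0 data.
start = 0
for m in (30, 60, 100, 300, 600, 1000, 3000):
    sl = X0[start:start + m, 0]
    start += m
    print(m, np.round([(sl == v).mean() for v in values], 3))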
  2. For the following question, begin by dividing the email dataset from PS1
    into a “train” and “test” set, where the “test” set is simply the final 540
    email messages in numerical order (from their names).
    Determine, for each word that appears in the training set, the mutual
    information with the class variable using the procedure described in the
    McCallum98 paper. The end result should be a list indicating each word and
    its mutual information content (as a float value) in sorted order. Save
    this data as a file.
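
    A hedged sketch of the ranking step, using the binary word-presence form of
    mutual information from the paper; `docs` and `labels` are placeholder
    names assumed to hold the tokenized training messages and their 0/1 classes:

```python
import math
from collections import Counter

def rank_by_mutual_information(docs, labels):
    # docs: list of per-message token lists; labels: parallel 0/1 list
    n = len(docs)
    p_c = {c: labels.count(c) / n for c in (0, 1)}
    df = {c: Counter() for c in (0, 1)}       # per-class document frequency
    for toks, c in zip(docs, labels):
        df[c].update(set(toks))
    mi = {}
    for w in set().union(*map(set, docs)):
        p_w = (df[0][w] + df[1][w]) / n       # P(word present)
        s = 0.0
        for c in (0, 1):
            p_wc = df[c][w] / n               # P(present, class c)
            # f = 1 (present) and f = 0 (absent) terms of I(C; W)
            for joint, marg in ((p_wc, p_w), (p_c[c] - p_wc, 1 - p_w)):
                if joint > 0:
                    s += joint * math.log(joint / (marg * p_c[c]))
        mi[w] = s
    return sorted(mi.items(), key=lambda kv: -kv[1])
```

    Writing the sorted (word, score) pairs out one per line satisfies the
    save-to-file requirement.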
    Implement the multi-variate Bernoulli model from the McCallum paper using
    equations 1, 2, 3, and 4. For equation 2, be sure that your implementation
    adds together log likelihoods instead of multiplying probabilities. Plot
    the performance of the classifier on the training data and on the testing
    data using vocabulary sizes of 50, 100, 1000, and all words from your
    ranked calculations in the previous step.
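
    A minimal sketch of the Bernoulli model, assuming `B_train` is an
    (n_docs x |V|) 0/1 word-presence matrix over the chosen vocabulary and `y`
    a 0/1 numpy label vector (both placeholder names):

```python
import numpy as np

def train_bernoulli(B_train, y):
    priors, cond = {}, {}
    for c in (0, 1):
        Bc = B_train[y == c]
        priors[c] = np.log(len(Bc) / len(B_train))        # eq. 3 class prior
        cond[c] = (1 + Bc.sum(axis=0)) / (2 + len(Bc))    # eq. 4 (Laplace)
    return priors, cond

def predict_bernoulli(priors, cond, B):
    # eq. 2 evaluated as a sum of log likelihoods, not a product
    scores = [priors[c]
              + B @ np.log(cond[c])
              + (1 - B) @ np.log(1 - cond[c]) for c in (0, 1)]
    return np.argmax(np.column_stack(scores), axis=1)     # eq. 1 argmax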
    Implement the multinomial model using equations 1, 5, and 6 from the
    McCallum paper. Note that the leading terms in equation 5 (i.e.,
    P(|di|)|di|!) can be dropped, as this value is identical across both
    classes. Similarly, the denominator in equation 5 (Nit!) can also be
    dropped, as these terms together just act as a constant scaling factor.
    Plot your results as in the previous part.
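
    The multinomial counterpart, sketched under the assumption that `N_train`
    is an (n_docs x |V|) term-count matrix; the P(|di|)|di|! and Nit! terms of
    equation 5 are dropped as noted above:

```python
import numpy as np

def train_multinomial(N_train, y):
    priors, logp = {}, {}
    V = N_train.shape[1]
    for c in (0, 1):
        Nc = N_train[y == c]
        priors[c] = np.log(len(Nc) / len(N_train))
        logp[c] = np.log((1 + Nc.sum(axis=0)) / (V + Nc.sum()))  # eq. 6
    return priors, logp

def predict_multinomial(priors, logp, N):
    scores = [priors[c] + N @ logp[c] for c in (0, 1)]  # log of eq. 5 core
    return np.argmax(np.column_stack(scores), axis=1)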
    Revisit the email training set. Grab the text document SPAMTrain.label,
    which contains (label, filename) pairs. The label 0 corresponds to the
    SPAM class, while a 1 corresponds to the HAM (normal email) class. In this
    part, we’ll build multiple classifiers to compare against the earlier
    Bayesian methods.
    Preliminaries: begin with the code from PS2 in which you divided the SPAM
    dataset into training and test subsets, where the test subset included the
    final 540 email messages in numerical order based on their names.
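
    A small sketch of that split, assuming SPAMTrain.label sits in the working
    directory and that sorting by filename reproduces the numerical order:

```python
# Each line of SPAMTrain.label is a "label filename" pair.
pairs = [line.split() for line in open('SPAMTrain.label')]
pairs.sort(key=lambda p: p[1])          # assumes names sort numerically
train, test = pairs[:-540], pairs[-540:]
y_train = [int(label) for label, _ in train]
y_test = [int(label) for label, _ in test]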
  3. Baseline: Use scikit-learn’s CountVectorizer to create vocabularies, V,
    of size 50, 100, 1000, and all words from the training set by setting
    max_features appropriately. Then, use this vectorizer in conjunction with
    the KNeighborsClassifier to build a classification pipeline (with the
    Pipeline class or the make_pipeline helper). Find the accuracy for
    K = 1, 5, 11, 21, 51, and 101 when: (a) trained and tested on the training
    data; and (b) trained on the training data and tested on the test data.
    Recalling that KNeighbors essentially memorizes the data, write one
    sentence explaining why accuracy for part (a) isn’t always 100% here.
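
    One way to set this up, assuming `train_texts`/`test_texts` hold the raw
    message strings and `y_train`/`y_test` the labels from the preliminaries:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

for V in (50, 100, 1000, None):                 # None keeps all words
    for K in (1, 5, 11, 21, 51, 101):
        clf = make_pipeline(CountVectorizer(max_features=V),
                            KNeighborsClassifier(n_neighbors=K))
        clf.fit(train_texts, y_train)
        print(V, K,
              clf.score(train_texts, y_train),  # (a) train on train
              clf.score(test_texts, y_test))    # (b) train/test split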
    Scikit-learn’s GridSearchCV is used to explore a set of parameters. For
    this problem, run the same experiment as above, but instead of setting
    your choices of V and K in a loop, use scikit-learn’s GridSearchCV class
    to do that. You’ll need to give it the list of viable parameters for the
    KNeighborsClassifier and for the CountVectorizer. Call the GridSearchCV’s
    fit method on the training data to find the best parameter setting.
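
    A sketch of the same search expressed through GridSearchCV; the step names
    follow make_pipeline’s lower-cased class-name convention:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CountVectorizer(), KNeighborsClassifier())
param_grid = {
    'countvectorizer__max_features': [50, 100, 1000, None],
    'kneighborsclassifier__n_neighbors': [1, 5, 11, 21, 51, 101],
}
grid = GridSearchCV(pipe, param_grid)
grid.fit(train_texts, y_train)          # texts/labels as in the baseline
print(grid.best_params_, grid.best_score_)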
    Repeat the grid search as above, but this time change the pipeline.
    Previously, the pipeline contained (1) CountVectorizer and (2)
    KNeighborsClassifier. Modify it to include PCA (via TruncatedSVD) with the
    following three steps: (1) CountVectorizer (allowing all words into the
    vocabulary); (2) TruncatedSVD; (3) KNeighborsClassifier. Search through
    parameters considering 1, 5, 11, 21, 51, and 101 neighbors and 50, 100,
    200, and 1000 principal components. Compare the results to your previous
    step.
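
    The modified three-step pipeline might look like this (same assumed
    `train_texts`/`y_train` as before):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CountVectorizer(),        # all words kept
                     TruncatedSVD(),
                     KNeighborsClassifier())
param_grid = {
    'truncatedsvd__n_components': [50, 100, 200, 1000],
    'kneighborsclassifier__n_neighbors': [1, 5, 11, 21, 51, 101],
}
grid = GridSearchCV(pipe, param_grid)
grid.fit(train_texts, y_train)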
    Repeat the process above, changing the pipeline in the following ways:
    (1) replace the KNeighborsClassifier with a LinearSVC (a support vector
    machine with a linear decision boundary); (2) replace the CountVectorizer
    with a TfidfVectorizer (set sublinear_tf to True).
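
    With the two substitutions applied (TruncatedSVD kept from the previous
    pipeline), a sketch of the search:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

pipe = make_pipeline(TfidfVectorizer(sublinear_tf=True),
                     TruncatedSVD(),
                     LinearSVC())
grid = GridSearchCV(pipe, {'truncatedsvd__n_components': [50, 100, 200, 1000]})
grid.fit(train_texts, y_train)
print(grid.best_params_, grid.score(test_texts, y_test))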
  4. Load in the text from the SPAM dataset and create a histogram showing the
    distribution of message lengths. Additionally, print the maximum, mean,
    and median message lengths. Note that the maximum message length is quite
    long.
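
    A quick look at the length distribution, assuming `texts` is the list of
    raw message strings loaded above:

```python
import numpy as np
import matplotlib.pyplot as plt

lengths = np.array([len(t.split()) for t in texts])
plt.hist(lengths, bins=100)
plt.xlabel('message length (words)')
plt.ylabel('count')
plt.show()
print(lengths.max(), lengths.mean(), np.median(lengths))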
    Convolutional Neural Network: Use the architecture from question 1 (the
    CNN) to train on the spam detection problem. You’ll need to pick a length
    for the representation of the document: use 1.5x the mean document length
    (in words) and truncate documents to that length. Use the 50-dimension
    GloVe embeddings (instead of the 100-dimension ones in the tutorial), and
    turn training off on the embedding layer (as in the tutorial). Document
    your performance after 10 epochs. Then show a graph of training and
    validation performance from epochs 2 through 100.
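
    A hedged Keras sketch of the setup; the actual CNN comes from question 1,
    which is not reproduced here, so the Conv1D settings below are
    placeholders, as are `mean_len`, `vocab_size`, `emb` (a vocab_size x 50
    GloVe matrix aligned to the tokenizer), and the padded `X_train`/`X_test`:

```python
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models, initializers

max_len = int(1.5 * mean_len)        # mean_len: mean doc length in words

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, 50,
                     embeddings_initializer=initializers.Constant(emb),
                     trainable=False),            # frozen GloVe layer
    layers.Conv1D(128, 5, activation='relu'),     # placeholder Conv settings
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
hist = model.fit(X_train, y_train, epochs=100,
                 validation_data=(X_test, y_test))

# Training vs. validation accuracy for epochs 2 through 100
plt.plot(range(2, 101), hist.history['accuracy'][1:], label='train')
plt.plot(range(2, 101), hist.history['val_accuracy'][1:], label='validation')
plt.xlabel('epoch')
plt.legend()
plt.show()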
    Compare the performance of the CNN to a simple network that just uses the
    GloVe embeddings (without convolution). For this step, you’ll use an
    architecture similar to that of a (relatively old) tutorial. Specifically,
    you’ll need an input layer and embedding layer as in question 3, but then
    these should go into a GlobalAveragePooling1D layer followed by two Dense
    layers. The Dense layers should have sizes of 50 and then 1. Begin by
    providing an explanation of what this model’s architecture is actually
    doing. Then, as before, show a graph of training and validation
    performance from epochs 2 through 100. Justify an appropriate stopping
    point for this model. Illustrate the performance difference between this
    model and the model from the previous question, each at the stopping point
    you have selected.
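
    A sketch of the convolution-free comparison model, with the same assumed
    inputs as the CNN above (`emb`, `vocab_size`, `max_len`):

```python
from tensorflow.keras import layers, models, initializers

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, 50,
                     embeddings_initializer=initializers.Constant(emb),
                     trainable=False),
    layers.GlobalAveragePooling1D(),   # average the word vectors per message
    layers.Dense(50, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
```

    Because the pooled representation is just the unweighted mean of a
    message’s word vectors, this model discards word order entirely, which is
    the key contrast with the convolutional model.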
