Marketing campaign to influence the opinion of voters

  1. How to influence the opinion of people when on a budget. (15 points)
    Apparently it is in fashion to use social networks, like facebook, to influence voters in elections.
    In this problem, you will consider a simplified model of influence in social networks: the social
    network is represented by an undirected graph G = (V, E) and for a set of nodes S ⊆ V , we
    denote by N(S) the set of their neighbors:
    N(S) = {v ∈ V | ∃u ∈ S, (u, v) ∈ E}.
    The influence I(S) of a set of nodes is measured by I(S) = |N(S)|.
    Imagine you work now for facebook and you are given the dataset facebookgraph.txt (available in the same canvas folder). This dataset is in fact a subgraph of the real Facebook social
    graph. Each line in the file contains the id of two users, indicating that these two users are
    friends on Facebook.
    a. As the designer of a marketing campaign to influence the opinion of voters, your goal
    is to find a subset S ⊆ V of at most K nodes whose influence is maximal. Write a
    mathematical model to solve this problem. Can you solve the problem for the data you
    were given using MATLAB or AMPL or SCIP? If not, can you write a practical method
    to give an approximation to the optimum?
    b. Using any of your models/methods from part (a) write a computer program which,
    given the social network described in the dataset and a budget K ∈ N
    +, returns an
    approximately optimal set of nodes S for the influence function I(S). The function
    should return both the users to influence and the value (total amount of influence)
    obtained.
    c. Plot the influence I(S) obtained by your function as a function of the budget K. E.g.,
    what happens when K = 1 and K = |V |?
    d. In the definition of I(S) a node with largest degree is one of the most influential persons,
    i.e., he/she is a VIP. Can you think of a way to adapt the pagerank algorithm to identify
    VIP nodes in the graph? Can you invent of a third additional way to define who are the
    VIPs in this graph? What else could you use to measure influence?
    e. A vertex-cover of a graph G is a set S of vertices of G such that each edge of G is incident
    with at least one vertex of S. The vertex cover number τ (G) is the size of a minimum
    vertex cover in G. A dominating set for a graph is a subset D of vertices such that every
    vertex not in D is adjacent to at least one member of D. The domination number γ(G)
    is the smallest dominating set. Are there any relationships among the influence function
    I(S), τ (G), and γ(G)? Explain.
    Formulate a discrete models that given a graph find a vertex cover with smallest number
    of vertices and a smallest dominating set. Explain the reasoning on your variables and
    constraints.
  2. Clustering and pairing senators by their political ideas (12 points)
    The zip-file 2-congress-datafiles.zip (in the same canvas folder) contains two csv files with
    (real!) voting data. You only need to use the one named 114 USA-congress.csv. I call this
    file F114, and it has information of how all USA senators (in the 114th congress) voted in
    15 different bills. 1 means voted YES, 0 means voted NO, 0.5 means ABSTAINED.
    We always assume senators of the same party vote similarly, but let us analyze the votes
    using mathematics and uncover some information!
    Clustering is an operation in data science where we divide data up into a fixed number of
    clusters while trying to ensure that the items in each cluster are as similar as possible (e.g., if
    we cluster by political party we presumably get similar voting records, similar opinions). Kmeans is a clustering algorithm that allows you to cluster into K clusters based on distances.
    It is implemented in MATLAB already.
    a. Use K-means on F114, with K = 2 to get a clustering in two groups. Obviously, do
    not use the labels of the real affiliations I gave you! Choose one to be called democrats, the other to be the republicans (do not look at the original file). Make your best
    mathematical guess of the party affiliation.
    You have obtained a classification of senators into two classes. Does your classification
    coincide exactly with the real party affiliation? Or are there surprises? To measure this
    compute the false-positive rate and false-negative rate of your classification:
    false-positive rate := number of senators incorrectly labeled as democrats
    number of democratic senators
    false-negative rate := number of senators incorrectly labeled as republicans
    number of republican senators
    b. What if you cluster with K = 3? That would after all mean there are three factions in
    the senate (moderates, conservatives, liberals?).
    c. Use part of the data to train a SVM model to classify the senators, and test the model
    using the unused data. Report your results.
    d. Clustering can also be model as an integer optimization problem. Write such a model
    and code it in AMPL or SCIP. Run it in the senator’s data.
  3. Recommendation Systems. (13 points). The Movielens (https://movielens.org/) is
    a website for movie recommendations. It is run by a research lab at University of Minnesota.
    The zip file ml-latest-small.zip (in the same canvas folder) contains 100836 ratings from
    610 users to 9742 movies. More information of this dataset can be found here: http://
    files.grouplens.org/datasets/movielens/ml-latest-small-README.html. Rearrange
    the ratings to a matrix with size 610 × 9724 (there are some movies that did not receive any
    rating and they were not included). Denote this matrix as M. Each row of M corresponds
    to a user, and each column of M corresponds to a movie. In this matrix there are 100836
    entries (ratings) are given, that is only 100836/(610×9724) = 1.7% of the entries. So, 98.3%
    ratings are missing. You are asked to predict the missing ratings. Denote the set of indices
    of the known entries in the matrix as Ω, and denote the indices set of the unknowns as Ω. ¯
    That is:
    Ω := {(i, j) | Mij is known}, Ω := ¯ {(i, j) | Mij is not known}
    You can’t use all known entries in Ω to train your model, because you need some testing data
    to test whether your model and algorithm perform well. Therefore, you need to partition Ω
    to two parts: Ωtrain and Ωtest, i.e., Ω = Ωtrain S
    Ωtest. Entries corresponding to Ωtrain are
    used as training data, and entries corresponding to Ωtest are used as testing data.
    a. The first model is a matrix factorization model. Denote m = 610 and n = 9724. The
    model is:
    min
    U∈Rm×r,V ∈Rr×n
    kPΩtrain (UV − M)k
    2
    F
    , (MatFac)
    where r < m, and the operator PΩtrain : R
    m×n
    7→ R
    m×n
    is defined as:
    [PΩtrain (Z)]ij =

    Zij , if (i, j) ∈ Ωtrain
    0, otherwise
    Implement the gradient method (with a fixed step size) to solve (MatFac). Randomly
    choose p% indices of Ω as Ωtrain and the remaining (1−p%) as Ωtest. Measure the qualify
    of (U

    , V ∗
    ) returned by your algorithm using the following mean squared error:
    MSE :=
    1
    |Ωtest|
    X
    (i,j)∈Ωtest
    ((U
    ∗V

    )ij − Mij )
    2
    For different combinations of (p, r), report MSE in table(s). You can choose p =
    {99, 95, 90, 80, 70, 60, 50, . . .}, and r = {1, 10, 20, 50, 100, 200, 300}. Moreover, report the
    effect of the step size of the gradient method to the results.
    b. In (MatFac) we used Frobenius norm in the objective. We can change it to 1 norm. min U∈Rm×r,V ∈Rr×n kPΩtrain (UV − M)k1. (MatFacL1) The1 norm of a matrix Z is defined as kZk1 =
    P
    ij |Zij |, that is, the sum of absolute
    values of all entries. (MatFacL1) can’t be solved by gradient method, because the
    absolute value function is not differentiable. But you can still solve it approximately
    by slightly perturbing the absolute value function. For a scalar z, you can approximate
    |z| by |z| ≈ √
    z
    2 + with a very small positive . Note that the latter function is
    differentiable for z. Now by replacing all absolute values in the objective of (MatFacL1)
    by this formula, you get a differentiable objective function. Implement the gradient
    method (with a fixed step size) to solve (MatFacL1). Report the same everything as
    what you did in part (a).
    c. Discuss the two models (MatFac) and (MatFacL1), which one is better? Under what
    situation (MatFac) is better? Under what situation (MatFacL1) is better?