EECS4404 Intr to ML/PR
Project 1

Note: This is the project for 4404 students only. You need to work individually on this project. You must use this LaTeX template to write up your report. Remember to fill in your information (name, student number, email) above. Submit your code/scripts (*.zip) and a project report (*.pdf, maximum 6 pages) via eClass before the deadline. No late submission will be accepted. No handwriting is accepted. Direct your queries to Hui Jiang ([email protected]).
Latent Semantic Analysis for Natural Language Processing
In this project, you will use a text corpus, called the English Wikipedia Dump, to construct document-word matrices and then use the latent semantic analysis (LSA) technique to
factorize the matrices to derive word representations, a.k.a. word embeddings or word vectors.
You will first use the derived word vectors to investigate semantic similarity between different
words, based on Pearson's correlation coefficient obtained by comparing the cosine distances
between word vectors with human-assigned similarity scores in a data set called WordSim353
(http://www.cse.yorku.ca/~hj/wordsim353_human_scores.txt). Furthermore, the derived
word vectors will be visualized in a 2-dimensional space using the t-SNE method to inspect the
semantic relationship among English words. In this project, you will implement several machine learning methods to factorize large sparse matrices to study how to produce meaningful
word representations for natural language processing.
1. Use a small data set, called enwiki8 (downloaded from http://www.cse.yorku.ca/~hj/enwiki8.txt.zip), to construct a document-word frequency matrix as in Figure 7.7. In this project, treat each line (one paragraph) as a document. Construct the matrix in a sparse format for the top 10,000 most frequent words in enwiki8 plus all words in WordSim353. (A sketch appears after this list.)
2. First use a standard SVD procedure from a linear algebra library to factorize the sparse document-word matrix, truncating it to k = 20, 50, 100. Examine the running time and memory consumption of SVD. (A sketch appears after this list.)
3. You can choose either of the following two algorithms to implement matrix factorization from scratch (see Note 1):
(a) The alternating Algorithm 7.6; or
(b) The stochastic gradient descent (SGD) method in Exercise 7.5.
Use your implementation to factorize the document-word matrix for k = 20, 50, 100, and examine the running time and memory consumption. (An SGD sketch appears after this list.)
4. Investigate the quality of the derived word vectors based on their correlation with human-assigned similarity scores. For each pair of words in WordSim353, compute the cosine distance between their word vectors, and then compute Pearson's correlation coefficient between these cosine distances and the human scores. Tune your learning hyperparameters towards a higher correlation. (An evaluation sketch appears after this list.)
Note 1: You are only allowed to use basic linear algebra routines, such as matrix multiplication, matrix inversion, and eigenvalue/eigenvector computation.
5. Visualize the above word representations for the top 300 most frequent words in enwiki8 using the t-SNE method (see Note 2), projecting each set of word vectors into a 2-dimensional space. Investigate how these 300 word representations are distributed and inspect whether semantically relevant words are located closer together in the space. Explain why. (A t-SNE sketch appears after this list.)
6. Refer to [1] to re-construct the document-word matrix based on positive pointwise mutual information (PPMI). Repeat steps 2-5 above to see how much the performance improves. (A PPMI sketch appears after this list.)
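
A minimal sketch of how step 1 might be implemented in Python is given below; the file name enwiki8.txt, the whitespace tokenizer, and the helper name build_count_matrix are illustrative assumptions, not part of the handout.

from collections import Counter

import numpy as np
from scipy.sparse import lil_matrix

def build_count_matrix(corpus_path="enwiki8.txt", wordsim_words=frozenset(), top_n=10000):
    # Each line of enwiki8 is one paragraph and is treated as one document.
    docs = [line.split() for line in open(corpus_path, encoding="utf-8")]

    # Vocabulary: the top_n most frequent words plus every WordSim353 word.
    freq = Counter(w for doc in docs for w in doc)
    vocab = [w for w, _ in freq.most_common(top_n)]
    vocab_set = set(vocab)
    vocab += [w for w in wordsim_words if w not in vocab_set]
    word2id = {w: i for i, w in enumerate(vocab)}

    # Fill a sparse document-word matrix of raw counts.
    X = lil_matrix((len(docs), len(vocab)), dtype=np.float64)
    for d, doc in enumerate(docs):
        for w, c in Counter(doc).items():
            if w in word2id:
                X[d, word2id[w]] = c
    return X.tocsr(), word2id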
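
A minimal sketch of step 2, assuming scipy.sparse.linalg.svds as the "standard SVD procedure"; taking the right singular vectors scaled by the singular values as the word vectors is one common convention, not something the handout prescribes.

from scipy.sparse.linalg import svds

def svd_word_vectors(X, k=100):
    # Rank-k truncated SVD of the (num_docs x num_words) sparse matrix: X ~= U diag(s) Vt.
    U, s, Vt = svds(X, k=k)
    # Word i is represented by row i of Vt.T, scaled column-wise by the singular values.
    return Vt.T * s  # shape (num_words, k)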
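
For option (b) of step 3, a from-scratch SGD sketch over the nonzero entries of the matrix is shown below, using only numpy. The squared-error objective with L2 regularization, and the learning rate, regularization weight, and epoch count, are illustrative assumptions rather than the exact formulation of Exercise 7.5.

import numpy as np

def sgd_factorize(X, k=50, lr=0.01, reg=0.01, epochs=10, seed=0):
    # Factorize X ~= W @ H.T; the rows of H serve as the word vectors.
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    W = 0.1 * rng.standard_normal((n_docs, k))   # document factors
    H = 0.1 * rng.standard_normal((n_words, k))  # word factors

    rows, cols = X.nonzero()
    vals = np.asarray(X[rows, cols]).ravel()
    for _ in range(epochs):
        for idx in rng.permutation(len(vals)):
            d, w = rows[idx], cols[idx]
            err = vals[idx] - W[d] @ H[w]
            wd = W[d].copy()  # keep the pre-update document factor for the H update
            W[d] += lr * (err * H[w] - reg * W[d])
            H[w] += lr * (err * wd - reg * H[w])
    return W, H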
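
A minimal sketch of the step 4 evaluation. The "word1 word2 score" per-line layout of wordsim353_human_scores.txt is an assumption about the file format, and cosine similarity is used here (switching to cosine distance only flips the sign of the correlation).

import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def wordsim_correlation(word_vecs, word2id, pairs_path="wordsim353_human_scores.txt"):
    sims, human = [], []
    for line in open(pairs_path, encoding="utf-8"):
        w1, w2, score = line.split()[:3]
        if w1 in word2id and w2 in word2id:
            sims.append(cosine(word_vecs[word2id[w1]], word_vecs[word2id[w2]]))
            human.append(float(score))
    r, _ = pearsonr(sims, human)  # Pearson correlation with the human scores
    return r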
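
For step 5, a sketch using scikit-learn's t-SNE and matplotlib (any t-SNE library is allowed per Note 2); leaving the perplexity and other t-SNE settings at their defaults is an assumption.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(word_vecs, top300_words, word2id, out_path="tsne_top300.png"):
    # Project the 300 chosen word vectors into 2-D and label each point with its word.
    ids = [word2id[w] for w in top300_words]
    xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(word_vecs[ids])
    plt.figure(figsize=(12, 12))
    plt.scatter(xy[:, 0], xy[:, 1], s=5)
    for (x, y), w in zip(xy, top300_words):
        plt.annotate(w, (x, y), fontsize=7)
    plt.savefig(out_path, dpi=200)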
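
For step 6, a sketch of the PPMI reweighting described in [1], PPMI(d, w) = max(0, log(P(d, w) / (P(d) P(w)))), applied to the sparse count matrix; treating documents and words as the two marginals follows the document-word setting of this project.

import numpy as np
from scipy.sparse import csr_matrix

def ppmi(X):
    X = csr_matrix(X, dtype=np.float64)
    total = X.sum()
    p_doc = np.asarray(X.sum(axis=1)).ravel() / total    # P(d)
    p_word = np.asarray(X.sum(axis=0)).ravel() / total   # P(w)

    C = X.tocoo()
    p_joint = C.data / total                             # P(d, w) on the nonzero cells
    pmi = np.log(p_joint / (p_doc[C.row] * p_word[C.col]))
    return csr_matrix((np.maximum(pmi, 0.0), (C.row, C.col)), shape=C.shape)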
What to submit?
You need to submit all of the code written for this project. Please provide clear instructions on how to repeat your experiments in a separate readme file. You also need to submit a project report (in PDF, maximum 6 pages) summarizing what you have done in terms of algorithm development and experimental fine-tuning; report the best settings for each case and discuss your findings from this project.
References
[1] Peter D. Turney and Patrick Pantel. "From Frequency to Meaning: Vector Space Models of Semantics." Journal of Artificial Intelligence Research 37.1 (Jan. 2010), pp. 141-188. (https://arxiv.org/abs/1003.1141)
Note 2: You can use any existing t-SNE library.
Your Report (maximum 6 pages):
Algorithm Development
1. SVD vs SGD Matrix Factorization
To compute the SVD, I used the scipy library, as it allows choosing how many components to keep. The running times for k = 20, 50, and 100, along with the Pearson correlation coefficients between the SVD-based similarities and the human scores, are shown in Figure 1. As shown, the algorithm takes roughly 0.15*k seconds to compute the factorization. Memory usage can get very high; using a sparse matrix made it possible to compute the full factorizations, but it still required far more memory than stochastic gradient descent.
On the other hand, the statistics for computing the matrix factorization with stochastic gradient descent are shown in Figure 1 (right). As shown, it runs much quicker than SVD. Memory usage was also much lower than what SVD requires, especially for the larger matrices. (A profiling sketch appears after Figure 1.)
Figure 1: Computing the matrix factorization with SVD (left, using the scipy linalg library) and with SGD (right), for k = 20, 50, 100
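
A minimal sketch of how the run-time and memory figures reported above could be collected is shown below; note that tracemalloc only tracks Python-level allocations, so the memory it reports for the LAPACK/ARPACK work inside svds is a rough proxy at best.

import time
import tracemalloc
from scipy.sparse.linalg import svds

def profile_svd(X, k):
    tracemalloc.start()
    t0 = time.perf_counter()
    U, s, Vt = svds(X, k=k)          # the factorization being timed
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak / 1e6       # seconds, approximate peak MB

# for k in (20, 50, 100):
#     print(k, profile_svd(X, k))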
Experimental Fine-tuning
1. SGD Matrix Factorization
After initially obtaining the Pearson correlation coefficients for each n × k matrix, I found that the correlation values were very low, even compared to the SVD coefficients, so I decided to test the algorithm with different learning rates, different regularization parameter values, and different step sizes, and to check for convergence at different points. I found that higher values of k required smaller parameter values and larger step sizes, because otherwise the entries would grow large more easily. I also tuned the step size adaptively instead of changing it by a fixed amount each iteration: based on the change in loss, the step size would either grow or shrink (a sketch of this rule appears at the end of this section).
2. Positive Pointwise Mutual Information
Running time and memory usage did not change much, if at all, after switching to PPMI for either algorithm; accuracy, however, improved greatly. As seen in the figure below, the Pearson correlation coefficient was much higher when we used the document-word matrix constructed with PPMI.
Figure 2: SVD matrix factorization of the PPMI-weighted matrix for k = 20, 50, 100
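
A minimal sketch of the loss-based step-size rule described in the fine-tuning section above (a "bold driver" style heuristic); the 1.05 and 0.5 factors are illustrative, not the values actually used in the experiments.

def adapt_learning_rate(lr, loss, prev_loss, grow=1.05, shrink=0.5):
    # Grow the step size slightly while the loss keeps decreasing,
    # and cut it sharply as soon as the loss goes back up.
    if prev_loss is None or loss < prev_loss:
        return lr * grow
    return lr * shrink

# Inside the SGD loop, after each epoch:
# lr = adapt_learning_rate(lr, epoch_loss, prev_loss)
# prev_loss = epoch_loss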
Semantically relevant words
1. Semantically Relevant Words
Yes, semantically relevant words are located closer to each other in the space, as seen in Figure 3. This is because they are often used in the same context; e.g., "white" is beside "person", whereas words like "men" and "years" are at opposite ends. These semantically relevant words form "communities" of words that are often used in the same context. If we used more than the top 300 words we would see much larger communities, but since we only used the top 300, the communities are fairly small.
Figure 3: Visual representation of the top 300 most frequent words