Vector space models are a common approach used in Natural Language Processing (NLP) to represent text as a set of numerical vectors. These vectors can then be used in various NLP tasks, such as text classification, information retrieval, and machine translation.
In a vector space model, each document or text is represented as a vector in a high-dimensional space. The dimensions of the vector correspond to the features or attributes of the text, such as the frequency of words or their presence/absence. The values in the vector indicate the strength or importance of each feature for that particular text.
There are several techniques for creating vector space models, including:
- Bag-of-Words Model: In this model, each text is represented as a set of words, and the frequency of each word is used to create a vector representation. This model ignores the order and context of words in the text.
- TF-IDF Model: This model takes into account both the frequency of words in a text and their importance in the corpus. Words that occur frequently in a text but are rare in the corpus are given higher importance.
- Word Embeddings: This is a more recent approach that represents words as dense, low-dimensional vectors, which are learned from large amounts of text data using techniques like neural networks. The vector representation captures the semantic relationships between words, and can be used for tasks like word similarity and analogy.
Overall, vector space models provide a powerful way to represent text data in a form that can be easily processed by machine learning algorithms.
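As a quick illustration of the first two techniques above, here is a minimal sketch using scikit-learn's CountVectorizer and TfidfVectorizer on a tiny made-up corpus (one of several possible ways to build these representations):

```python
# A minimal sketch of bag-of-words and TF-IDF vectors with scikit-learn.
# The tiny corpus below is invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat is sitting on the mat",
    "the dog is sitting on the rug",
    "cats and dogs are popular pets",
]

# Bag-of-words: raw word counts, ignoring order and context.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X_bow.toarray())

# TF-IDF: counts reweighted so words that are rare in the corpus get more weight.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```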
Let’s say you have two questions: “Where are you going?” and “Where are you from?” These sentences contain the same words except for the last one, yet they have different meanings. On the other hand, suppose you have two more questions that use completely different words but mean the same thing. Vector space models help you determine whether two questions are similar in meaning even when they don’t share the same words, and whether they differ in meaning even when they do. They can be used to identify similarity for question answering, paraphrasing, and summarization. Vector space models also allow you to capture dependencies between words.
With vector-based models, you’ll be able to capture these and many other types of relationships between different sets of words. Vector space models are used in information extraction to answer questions such as who, what, where, and how, as well as in machine translation and chatbot programming, among many other applications.
This is one of the most fundamental concepts in NLP. When using vector space models, representations are built from the context around each word in the text, and this context captures the relative meaning. In short, vector space models allow you to represent words and documents as vectors.
Word by Word and Word by Doc.
Word by Word Design: In this approach, the text data is processed one word at a time. This involves breaking down the text into individual words and then analyzing each word independently to extract its meaning, context, and relevance to the overall text. This approach is commonly used in tasks such as text classification, sentiment analysis, and named entity recognition.
For example, in the sentence “The cat is sitting on the mat”, the word by word approach would analyze each word separately to understand their meaning and context. The analysis might involve determining that “cat” and “mat” are nouns, “sitting” is a verb, “the” is a determiner, and so on.
We will start by exploring the word by word design. Assume that you are trying to come up with a vector that will represent a certain word. One possible design is to create a matrix where each row and column corresponds to a word in your vocabulary. You can then iterate over a document and count the number of times each word shows up next to each other word, keeping track of these counts in the matrix. In the video I spoke about a parameter k; you can think of k as the bandwidth that decides whether two words count as being next to each other or not.
In the example above, you can see how we keep track of the number of times words occur together within a certain distance k. At the end, you can represent the word “data” as the vector v = [2, 1, 1, 0].
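Here is a minimal sketch of how such a co-occurrence matrix could be built, assuming a toy two-sentence corpus chosen so that the row for “data” reproduces the counts above:

```python
# A minimal sketch of a word-by-word co-occurrence matrix with window size k.
# The toy corpus and k = 2 are chosen purely for illustration.
from collections import defaultdict

corpus = ["I like simple data", "I prefer simple raw data"]
k = 2  # two words are "next to each other" if they are at most k positions apart

cooc = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    tokens = sentence.lower().split()
    for i, word in enumerate(tokens):
        # Count every other word within k positions of the current word.
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                cooc[word][tokens[j]] += 1

# The row for "data" is its vector representation in this design:
# counts for 'simple', 'raw', 'like' (and implicitly 0 for 'i'),
# matching v = [2, 1, 1, 0] above.
print(dict(cooc["data"]))
```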
Word by Document Design: In this approach, the entire text is analyzed as a single unit, rather than analyzing individual words in isolation. This approach is commonly used in tasks such as text summarization, document classification, and topic modeling.
For example, in the case of document classification, the word by doc approach would analyze the entire text of a document to understand its overall theme, tone, and purpose. This would involve looking at factors such as the frequency of certain words or phrases, the presence of particular topics, and the overall structure of the text. Overall, both the word by word and word by doc approaches have their own strengths and weaknesses, and the choice of which approach to use depends on the specific task and the nature of the text data being analyzed.
You can now apply the same concept and map words to documents. The rows could correspond to words and the columns to documents. The numbers in the matrix correspond to the number of times each word showed up in the document.
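A minimal sketch of such a word-by-document matrix, with made-up counts for two words across three document categories:

```python
# A minimal sketch of a word-by-document matrix.
# Rows are words, columns are document categories; the counts are made up.
import numpy as np

words = ["data", "film"]
categories = ["entertainment", "economy", "machine_learning"]

# counts[i, j] = number of times words[i] appears in documents of categories[j]
counts = np.array([
    [500, 6620, 9320],   # "data"
    [7000, 4000, 1000],  # "film"
])

# The column for a category is that category's vector representation.
entertainment = counts[:, categories.index("entertainment")]
print(entertainment)  # [ 500 7000]
```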
You can represent the entertainment category as the vector v = [500, 7000]. You can then also compare categories with a simple plot.
Euclidean Distance
Euclidean distance is a measure of the distance between two points in a vector space, commonly used in machine learning and natural language processing (NLP) for comparing the similarity of two vectors.
In NLP, we often represent text documents as vectors, where each dimension represents a feature or a term. The Euclidean distance between two such vectors can be computed as the square root of the sum of the squared differences between their corresponding feature values:
You can generalize finding the distance between two points (A, B) to the distance between two n-dimensional vectors as follows:
d(x, y) = sqrt((x1 - y1)² + (x2 - y2)² + … + (xn - yn)²)
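For instance, a minimal NumPy sketch of this formula (the vectors are made up for illustration):

```python
# A minimal sketch of Euclidean distance between two document vectors.
import numpy as np

v = np.array([1, 6, 8])
w = np.array([0, 4, 6])

# d(v, w) = sqrt(sum((v_i - w_i)^2)), i.e. the norm of the difference vector.
d = np.linalg.norm(v - w)
print(d)  # 3.0
```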
This distance metric can be used for a variety of tasks in NLP, such as text classification, clustering, and information retrieval, among others.
Cosine Similarity
Cosine similarity and Euclidean distance are both popular measures for comparing the similarity between two vectors in natural language processing (NLP).
Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. It gives a measure of the similarity between the direction of the two vectors, irrespective of their magnitude. Cosine similarity ranges from -1 (exact opposite directions) to 1 (exact same direction). In NLP, cosine similarity is often used to compare the similarity of text documents represented as vectors, where each dimension represents a word or a feature.
Euclidean distance, on the other hand, measures the straight-line distance between two vectors in a multi-dimensional space, and it depends on both their magnitudes and their directions. Euclidean distance is 0 when the two vectors are identical and grows without bound as the vectors become more dissimilar. In NLP, Euclidean distance is also used to compare the similarity of text documents represented as vectors, where each dimension represents a word or a feature.
When comparing these two measures, there are some key differences to consider:
- Magnitude vs. direction: Cosine similarity only considers the direction of the vectors, while Euclidean distance considers both magnitude and direction.
- Scaling: Cosine similarity is invariant to scaling, while Euclidean distance is affected by scaling. This means that if we scale the vectors by a constant factor, the cosine similarity will remain unchanged, while the Euclidean distance will change.
- Interpretability: Cosine similarity can be more interpretable than Euclidean distance in some cases, since it is based on the angle between the vectors. This can be helpful in understanding the semantic similarity between text documents.
- Dimensionality: Cosine similarity is generally more robust to high-dimensional data than Euclidean distance. This is because in high-dimensional spaces, the distance between any two points tends to become similar, making it difficult to distinguish between similar and dissimilar points using Euclidean distance.
In summary, both cosine similarity and Euclidean distance have their strengths and weaknesses, and their choice depends on the specific task and the nature of the data being analyzed.
One issue with Euclidean distance is that it is not always the type of similarity metric we are looking for. For example, when comparing large documents to smaller ones with Euclidean distance, you can get a misleading result. Look at the diagram below:
Intuitively, the food corpus and the agriculture corpus are more similar because they contain words in similar proportions. However, the food corpus is much smaller than the agriculture corpus, while the history corpus is comparable in size to it. As a result, even though the history corpus and the agriculture corpus cover different topics, the Euclidean distance between them (d2) is smaller than the distance between the agriculture and food corpora (d1). Hence d2 < d1.
To solve this problem, we look at the cosine of the angle between the vectors. This allows us to compare the angles β and α.
Recall that cos(β) = (v · w) / (||v|| ||w||). If v and w point in the same direction, the numerator equals the denominator, so cos(β) = 1; this corresponds to β = 0. On the other hand, the dot product of two orthogonal (perpendicular) vectors is 0, which takes place when β = 90°.
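A minimal NumPy sketch of cosine similarity versus Euclidean distance, using made-up vectors where one corpus is simply a scaled-up version of the other:

```python
# A minimal sketch of cosine similarity between two corpus vectors.
import numpy as np

v = np.array([100, 80])    # word counts in a small corpus (made-up numbers)
w = np.array([1000, 800])  # same word proportions, much larger corpus

cos_beta = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
print(cos_beta)  # ~1.0 -- same direction, despite very different sizes

# Compare with Euclidean distance, which is dominated by the size difference.
print(np.linalg.norm(v - w))  # ~1152.6
```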
Manipulating Words in Vector Spaces
You can use word vectors to actually extract patterns and identify certain structures in your text. For example:
You can use the word vectors for Russia, USA, and Washington DC to compute a vector that is very similar to that of Moscow, for example Russia + (Washington DC - USA) ≈ Moscow. You can then take the cosine similarity of that vector with all the other word vectors you have, and you will see that the vector for Moscow is the closest.
Note that the distance (and direction) between a country and its capital is roughly the same across countries. Hence manipulating word vectors allows you to identify patterns in the text.
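Here is a minimal sketch of this idea, using invented 2-D "word vectors" purely for illustration (real embeddings are learned from data and have far more dimensions):

```python
# A minimal sketch of the country-capital analogy with toy 2-D word vectors.
# These embeddings are invented purely for illustration; real word vectors
# would be learned from a large corpus.
import numpy as np

vectors = {
    "USA": np.array([5.0, 6.0]),
    "Washington": np.array([10.0, 5.0]),
    "Russia": np.array([5.0, 5.0]),
    "Moscow": np.array([9.0, 3.0]),
    "Japan": np.array([4.0, 3.0]),
    "Tokyo": np.array([8.5, 2.0]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Predicted capital of Russia: Russia + (Washington - USA).
prediction = vectors["Russia"] + (vectors["Washington"] - vectors["USA"])

# Find the closest known vector, excluding the words used in the query.
candidates = [w for w in vectors if w not in {"Russia", "USA", "Washington"}]
best = max(candidates, key=lambda w: cosine(prediction, vectors[w]))
print(best)  # "Moscow" with these toy vectors
```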
Visualization and PCA
Principal component analysis (PCA) is an unsupervised learning algorithm that can be used to reduce the dimension of your data, which in turn allows you to visualize it. It does this by projecting the data onto new directions (principal components) that retain as much of the variance as possible. Here is a concrete example of PCA:
Note that when doing PCA on this data, you will see that oil & gas are close to one another and town & city are also close to one another. To plot the data you can use PCA to go from d>2 dimensions to d=2.
Those are the results of plotting a couple of vectors in two dimensions. Note that words with similar part of speech (POS) tags are next to one another. This is because many of the training algorithms learn words by identifying the neighboring words. Thus, words with similar POS tags tend to be found in similar locations. An interesting insight is that synonyms and antonyms tend to be found next to each other in the plot. Why is that the case?
PCA Algorithm
PCA (Principal Component Analysis) is a widely used dimensionality reduction technique in machine learning, including in Natural Language Processing (NLP). The primary goal of PCA is to reduce the number of variables while retaining most of the information in the original dataset.
In NLP, we can use PCA to reduce the dimensionality of a text corpus. The high-dimensional nature of text data makes it difficult to perform various NLP tasks such as text classification, clustering, and information retrieval. By reducing the dimensionality, we can reduce the computational complexity and improve the performance of these tasks.
Here is a step-by-step guide to applying PCA to NLP:
- Preprocessing: First, we need to preprocess the text corpus. This includes removing stop words, stemming or lemmatization, and converting the text into a numerical representation, such as TF-IDF or Bag of Words.
- Computing the covariance matrix: Next, we compute the covariance matrix of the preprocessed corpus. The covariance matrix describes the relationship between the variables in the data.
- Computing the eigenvectors and eigenvalues: We then compute the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the directions of maximum variance in the data, and the eigenvalues represent the amount of variance explained by each eigenvector.
- Choosing the number of principal components: We need to choose the number of principal components to retain. One common approach is to retain the top k components that explain a certain percentage of the total variance in the data.
- Transforming the data: Finally, we transform the preprocessed corpus into the reduced dimensional space by multiplying it with the selected eigenvectors.
PCA can be applied to a variety of NLP tasks, including text classification, topic modeling, and information retrieval. However, it’s important to note that PCA may not always improve the performance of these tasks, and in some cases, it may even reduce performance. Therefore, it’s essential to evaluate the performance of the PCA model on the specific task and dataset before using it in practice.
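As a minimal sketch of the steps above with scikit-learn (the tiny corpus and the choice of two components are made up for illustration):

```python
# A minimal sketch of the steps above: TF-IDF vectors reduced with PCA.
# The corpus and the number of components are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

corpus = [
    "oil and gas prices rose sharply",
    "the town built a new city hall",
    "gas exports drive the oil economy",
    "the city expanded into the old town",
]

# Step 1: preprocess and convert the text into numerical vectors (TF-IDF here).
X = TfidfVectorizer(stop_words="english").fit_transform(corpus).toarray()

# Steps 2-5: PCA centers the data, finds the directions of maximum variance,
# and projects the data onto the top components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                    # (4, 2)
print(pca.explained_variance_ratio_) # variance explained by each component
```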
PCA is commonly used to reduce the dimension of your data. Intuitively the model collapses the data across principal components. You can think of the first principal component (in a 2D dataset) as the line where there is the most amount of variance. You can then collapse the data points on that line. Hence you went from 2D to 1D. You can generalize this intuition to several dimensions.
Steps to Compute PCA:
- Mean normalize your data
- Compute the covariance matrix
- Compute the SVD of your covariance matrix: [U, S, V] = svd(Σ). The columns of U are the eigenvectors and the diagonal entries of S are the corresponding eigenvalues.
- You can then use the first n columns of U to get your new data by computing X U[:, 0:n], as shown in the sketch below.
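A minimal NumPy sketch of these steps on made-up data:

```python
# A minimal sketch of PCA via the covariance matrix and SVD, on made-up data.
import numpy as np

np.random.seed(0)
X = np.random.randn(100, 3)  # 100 samples, 3 features (made-up data)

# Step 1: mean-normalize the data.
X_centered = X - X.mean(axis=0)

# Step 2: compute the covariance matrix.
Sigma = np.cov(X_centered, rowvar=False)

# Step 3: SVD of the covariance matrix: the columns of U are the eigenvectors,
# and S holds the corresponding eigenvalues in decreasing order.
U, S, Vt = np.linalg.svd(Sigma)

# Step 4: project onto the first n principal components.
n = 2
X_reduced = X_centered @ U[:, :n]

print(X_reduced.shape)  # (100, 2)
print(S)                # eigenvalues = variance captured by each component
```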
Reference
Younes Bensouda Mourri, Natural Language Processing with Classification and Vector Spaces
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute these slides for commercial purposes. You may make copies of these slides and use or distribute them for educational purposes as long as you cite DeepLearning.AI as the source of the slides.
For the rest of the details of the license, see