Vector Space Model and TF-IDF
May 16, 2020
There are many similarity models floating in the universe, each holding different assumptions and priorities. Here we will explore the most commonly used one, the TF-IDF Vector Space Model.
VSM (Vector Space Model)
A vector space model is a model where each term of the query is treated as a vector dimension. It assumes that each term is orthogonal to all other terms, which means terms are modeled as occurring in documents independently. When evaluating a query against a document, one vector represents the query and another represents the document. The cosine similarity of the two vectors can be used to represent the relevance of the document to the query. A cosine value of 0 means that the query and document vectors are orthogonal and have no match (i.e., none of the query terms appear in the document). Cosine similarity is often preferred over Euclidean distance because it measures the angle between two vectors, capturing their direction, while Euclidean distance captures magnitude. Two vectors can be far apart by Euclidean distance (imagine a short vector and a long vector) yet have a small angle between them.
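To make that concrete, here is a minimal Go sketch of cosine similarity (the cosine function name and the example vectors are mine, just for illustration): a short vector and a long vector that point in nearly the same direction are far apart by Euclidean distance but have a cosine close to 1.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors:
// the dot product divided by the product of their magnitudes.
// 1 means the vectors point in the same direction; 0 means they are
// orthogonal (no query terms appear in the document).
func cosine(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	short := []float64{1, 1}
	long := []float64{10, 9} // far from short by Euclidean distance, but nearly the same direction
	fmt.Println(cosine(short, long)) // ~0.999
}
```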
Let’s see how this can play out. We could set each coordinate in the document vector to be 1 if the document contains the term for the dimension, otherwise set it to 0. The query vector would be all 1’s, assuming each term has the same weight. Here’s an example:
- query: can my cat eat chicken
  - vector: [1, 1, 1, 1, 1]
- document1: A cat can eat chicken. Chicken is part of a cat's diet.
  - vector: [1, 0, 1, 1, 1]
  - score: 4
- document2: The cat ran past the chicken to try to eat my mouse.
  - vector: [0, 1, 1, 1, 1]
  - score: 4
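Here is a minimal Go sketch of this presence/absence scoring (the binaryVector helper name and the hard-coded token lists are my own, assuming the simple lowercased, punctuation-stripped tokenization described later in the post):

```go
package main

import "fmt"

// binaryVector sets each coordinate to 1 if the document contains the query
// term for that dimension, otherwise 0. The score is the dot product with an
// all-ones query vector, i.e. the number of query terms found in the document.
func binaryVector(query, doc []string) ([]int, int) {
	present := make(map[string]bool, len(doc))
	for _, t := range doc {
		present[t] = true
	}
	vec := make([]int, len(query))
	score := 0
	for i, term := range query {
		if present[term] {
			vec[i] = 1
			score++
		}
	}
	return vec, score
}

func main() {
	// Dimensions follow the query terms: [can, my, cat, eat, chicken].
	query := []string{"can", "my", "cat", "eat", "chicken"}

	// Document tokens after lowercasing and stripping punctuation.
	doc1 := []string{"a", "cat", "can", "eat", "chicken", "chicken", "is", "part", "of", "a", "cats", "diet"}
	doc2 := []string{"the", "cat", "ran", "past", "the", "chicken", "to", "try", "to", "eat", "my", "mouse"}

	for _, doc := range [][]string{doc1, doc2} {
		vec, score := binaryVector(query, doc)
		fmt.Println(vec, score) // [1 0 1 1 1] 4, then [0 1 1 1 1] 4
	}
}
```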
This model would rank document1 and document2 as equally relevant to query. It would also rank a document that mentions a query term once in a footnote as relevant as a document that uses a query term repeatedly throughout the text. Hmm… the rankings do not seem right. We should be able to calculate different weights for a term in a document given its frequency and other factors that could affect its relevancy. Enter TF-IDF! A plausible way to account for various factors that the current model does not. Instead of using 1 if the term is present, perhaps we can use the TF-IDF value in the document vector.
TF-IDF (Term Frequency-Inverse Document Frequency)
Term Frequency-Inverse Document Frequency is a numerical statistic that represents how influential a word is in defining the “relevancy” of a document in a corpus. The TF-IDF value increases proportionally with how frequently a word appears in a document (TF) and is offset by the number of documents in the corpus that also contain that word (IDF). The offset accounts for the fact that if a word, such as “the” or “is”, appears often in many documents, it is probably not as influential in defining the “uniqueness” or “relevancy” of a specific document. Another way of thinking about it: if you have a corpus that is all about cats and you search “can my cat eat chicken”, perhaps “cat” is not as important as the other terms in your search, such as “chicken”, and documents containing “chicken” should be more relevant to your search than documents merely containing “cat”.
TF-IDF is an intuitive concept, and it works well under the assumption that each document in the corpus has roughly the same length. However, if documents in the corpus have varying lengths, TF-IDF alone does not account for that. Perhaps we could add additional weights to this model to account for varying document lengths and control how much influence length should have on the overall relevance value of a given document.
In Lucene’s implementation of TF-IDF, ClassicSimilarity, an additional bias, lengthNorm, was introduced to account for document length. lengthNorm gave significant bias towards matching shorter documents over longer documents. It assumed that if a term occurs the same number of times in two documents, but one is shorter than the other, then the shorter one is more “concentrated” and therefore scores higher in relevancy than the longer document. The implementation looked like this:
IDF * TF * lengthNorm
- IDF implemented as log((docCount+1)/(docFreq+1)) + 1, where docFreq is the number of documents containing the term and docCount is the total number of documents
- TF implemented as sqrt(freq), where freq is the number of times the term occurs in the document
- lengthNorm implemented as 1/sqrt(length), where length is the number of terms in the document
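Here is a rough Go sketch of those three factors (the helper names are mine, and this omits the query-time factors Lucene also applies, such as boosts and query normalization):

```go
package main

import (
	"fmt"
	"math"
)

// idf: log((docCount+1)/(docFreq+1)) + 1
func idf(docFreq, docCount int) float64 {
	return math.Log(float64(docCount+1)/float64(docFreq+1)) + 1
}

// tf: sqrt(freq), where freq is how many times the term occurs in the document.
func tf(freq int) float64 {
	return math.Sqrt(float64(freq))
}

// lengthNorm: 1/sqrt(length), where length is the number of terms in the document.
func lengthNorm(length int) float64 {
	return 1 / math.Sqrt(float64(length))
}

// weight combines the three factors for a single term in a single document.
func weight(freq, docFreq, docCount, length int) float64 {
	return idf(docFreq, docCount) * tf(freq) * lengthNorm(length)
}

func main() {
	// A term that occurs twice in a 12-term document and appears in 1 of 2 documents.
	fmt.Println(weight(2, 1, 2, 12))
}
```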
Keep in mind, this implementation accounts for document length, but it does not let you adjust how much influence length has on the overall relevance value.
Let’s see how using TF-IDF weighting, based on Lucene’s implementation, changes the relevancy scores in our earlier example.
- query: can my cat eat chicken
- document1: A cat can eat chicken. Chicken is part of a cat's diet.
  - vector: [0.2886751345948129, 0, 0.1716274399381882, 0.1716274399381882, 0.24271785323595957]
  - score: 0.8746478677071488
- document2: The cat ran past the chicken to try to eat my mouse.
  - vector: [0, 0.2886751345948129, 0.1716274399381882, 0.1716274399381882, 0.1716274399381882]
  - score: 0.8035574544093774
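The full weighting lives in the gist linked below; as a sanity check, if the query vector stays all 1’s, the score is just the dot product of the query and document vectors, i.e. the sum of the document’s weights, which lines up with the scores above. A minimal sketch of that final step in Go (the score helper name is mine):

```go
package main

import "fmt"

// score is the dot product of the query vector and the document's TF-IDF
// vector. With an all-ones query vector this reduces to summing the
// document's weights.
func score(queryVec, docVec []float64) float64 {
	var s float64
	for i := range queryVec {
		s += queryVec[i] * docVec[i]
	}
	return s
}

func main() {
	queryVec := []float64{1, 1, 1, 1, 1}
	doc1Vec := []float64{0.2886751345948129, 0, 0.1716274399381882, 0.1716274399381882, 0.24271785323595957}
	fmt.Println(score(queryVec, doc1Vec)) // ~0.8746
}
```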
TF-IDF Implementation: https://gist.github.com/mtham8/8f93db2443b6214dc0734c01d029b80c
Playground: https://play.golang.org/p/XZOC8fHUyWs
We can see here that document1 is ranked higher than document2 in relevancy to query using TF-IDF weighting. Hooray! This ranking is closer to what we are looking for. There were a few text cleaning steps I had to do beforehand, i.e., lowercasing all the terms and omitting some punctuation (periods). I also tokenized the documents, i.e., splitting the sentences into terms. These transformations are common preprocessing steps, and depending on the corpus and search needs, one might preprocess the text differently.
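A minimal sketch of that preprocessing in Go might look like this (the tokenize helper is my own name for it, assuming only lowercasing, dropping periods and apostrophes, and splitting on whitespace):

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize lowercases the text, drops periods and apostrophes, and splits on
// whitespace to produce the terms used to build a document vector.
func tokenize(text string) []string {
	text = strings.ToLower(text)
	text = strings.NewReplacer(".", "", "'", "").Replace(text)
	return strings.Fields(text)
}

func main() {
	fmt.Println(tokenize("A cat can eat chicken. Chicken is part of a cat's diet."))
	// [a cat can eat chicken chicken is part of a cats diet]
}
```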
Limitations
There are some limitations to using the TF-IDF VSM. IDF is calculated not from relevance information but from matching terms. Documents with similar context but different term vocabulary would not be considered a match, resulting in false negatives. The positions of words within a document are lost in the vector space representation. The model assumes the terms are independent. Under this assumption, documents are a bag of words and order does not matter. Often that is not the case. Terms can be dependent in situations such as:
- polysemy: the same terms used in different contexts, e.g. “I love my cat” vs. “cat lady”, which affects precision because irrelevant documents can be retrieved due to term matching.
- synonymity: different terms used in the same contexts, e.g. “feline” vs. “cat”, which affects recall because documents that are relevant would not be retrieved.
- ordering: the same terms used in different positions with different meanings, e.g. “college junior” vs. “junior college”, which affects both precision and recall.
Nonetheless, TF-IDF VSM is popular because it can go pretty far! Every corpus has its flavor. I invite you to try TF-IDF VSM, with its limitations in mind so that the surprises left are the unknowns. 🚀
Sources
- https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java
- https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
- https://nlp.stanford.edu/IR-book/html/htmledition/the-binary-independence-model-1.html
- https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
- https://www.elastic.co/blog/found-similarity-in-elasticsearch
- http://www.minerazzi.com/tutorials/term-vector-3.pdf