Vector Space Model and TF-IDF
May 16, 2020
There are many similarity models floating in the universe, each holding different assumptions and priorities. Here we will explore the most commonly used one, the TF-IDF Vector Space Model.
VSM (Vector Space Model)
A vector space model is a model where each term of the query is treated as a vector dimension. It assumes that each term is orthogonal to all other terms, which means terms are modeled as occurring in documents independently. When evaluating a query against a document, one vector represents the query and another represents the document. The cosine similarity of the two vectors can be used to represent the relevance of the document to the query. A cosine value of 0 means that the query and document vectors are orthogonal and have no match (i.e., none of the query terms appear in the document). Cosine similarity is often preferred over Euclidean distance because it measures the angle between two vectors, capturing their direction, while Euclidean distance captures magnitude. Two vectors can be far apart by Euclidean distance (imagine a short vector and a long vector) yet have a small angle between them.
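To make that concrete, here is a minimal Go sketch of cosine similarity (the cosine function name and the example vectors are mine, just for illustration): a short vector and a long vector that point in nearly the same direction are far apart by Euclidean distance but have a cosine close to 1.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors:
// the dot product divided by the product of their magnitudes.
// 1 means the vectors point in the same direction; 0 means they are
// orthogonal (no query terms appear in the document).
func cosine(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	short := []float64{1, 1}
	long := []float64{10, 9} // far from short by Euclidean distance, but nearly the same direction
	fmt.Println(cosine(short, long)) // ~0.999
}
```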
Let’s see how this can play out. We could set each coordinate in the document vector to be 1 if the document contains the term for the dimension, otherwise set it to 0. The query vector would be all 1’s, assuming each term has the same weight. Here’s an example:
- query: can my cat eat chicken
  - vector: [1, 1, 1, 1, 1]
- document1: A cat can eat chicken. Chicken is part of a cat's diet.
  - vector: [1, 0, 1, 1, 1]
  - score: 4
- document2: The cat ran past the chicken to try to eat my mouse.
  - vector: [0, 1, 1, 1, 1]
  - score: 4
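Here is a minimal Go sketch of this presence/absence scoring (the binaryVector helper name and the hard-coded token lists are my own, assuming the simple lowercased, punctuation-stripped tokenization described later in the post):

```go
package main

import "fmt"

// binaryVector sets each coordinate to 1 if the document contains the query
// term for that dimension, otherwise 0. The score is the dot product with an
// all-ones query vector, i.e. the number of query terms found in the document.
func binaryVector(query, doc []string) ([]int, int) {
	present := make(map[string]bool, len(doc))
	for _, t := range doc {
		present[t] = true
	}
	vec := make([]int, len(query))
	score := 0
	for i, term := range query {
		if present[term] {
			vec[i] = 1
			score++
		}
	}
	return vec, score
}

func main() {
	// Dimensions follow the query terms: [can, my, cat, eat, chicken].
	query := []string{"can", "my", "cat", "eat", "chicken"}

	// Document tokens after lowercasing and stripping punctuation.
	doc1 := []string{"a", "cat", "can", "eat", "chicken", "chicken", "is", "part", "of", "a", "cats", "diet"}
	doc2 := []string{"the", "cat", "ran", "past", "the", "chicken", "to", "try", "to", "eat", "my", "mouse"}

	for _, doc := range [][]string{doc1, doc2} {
		vec, score := binaryVector(query, doc)
		fmt.Println(vec, score) // [1 0 1 1 1] 4, then [0 1 1 1 1] 4
	}
}
```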
This model would rank document1 and document2 as equally relevant to query. It would also rank a document that mentions a query term once in a footnote as relevant as a document that uses a query term repeatedly throughout the text. Hmm… the rankings do not seem right. We should be able to calculate different weights for a term in a document given its frequency and other factors that could affect its relevancy. Enter TF-IDF! A plausible way to account for various factors that the current model does not. Instead of using 1 if the term is present, perhaps we can use the TF-IDF value in the document vector.
TF-IDF (Term Frequency-Inverse Document Frequency)
Term Frequency-Inverse Document Frequency is a numerical statistic that represents how influential a word is in defining the “relevancy” of a document in a corpus. The TF-IDF value increases proportionally with how frequently a word appears in a document (TF) and is offset by the number of documents in the corpus that also contain that word (IDF). The offset accounts for the fact that if a word, such as “the” or “is”, appears often in many documents, it is probably not as influential in defining the “uniqueness” or “relevancy” of a specific document. Another way of thinking about it: if you have a corpus that is all about cats and you search “can my cat eat chicken”, perhaps “cat” is not as important as the other terms in your search, such as “chicken”, and documents containing “chicken” should be more relevant to your search than documents merely containing “cat”.
TF-IDF is an intuitive concept, and it works well under the assumption that each document in the corpus has roughly the same length. However, if documents in the corpus have varying lengths, TF-IDF alone does not account for that. Perhaps we could add additional weights to this model to account for varying document lengths and control how much influence length should have on the overall relevance value of a given document.
In Lucene’s implementation of TF-IDF, ClassicSimilarity, an additional bias, lengthNorm, was introduced to account for document length. lengthNorm gave significant bias towards matching shorter documents over longer documents. It assumed that if a term occurs the same number of times in two documents, but one is shorter than the other, then the shorter one is more “concentrated” and therefore scores higher in relevancy than the longer document. The implementation looked like this:
IDF * TF * lengthNorm
- IDF implemented as log((docCount+1)/(docFreq+1)) + 1, where docFreq is the number of documents containing the term and docCount is the total number of documents
- TF implemented as sqrt(freq), where freq is the number of times the term occurs in the document
- lengthNorm implemented as 1/sqrt(length), where length is the number of terms in the document
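Here is a rough Go sketch of those three factors (the helper names are mine, and this omits the query-time factors Lucene also applies, such as boosts and query normalization):

```go
package main

import (
	"fmt"
	"math"
)

// idf: log((docCount+1)/(docFreq+1)) + 1
func idf(docFreq, docCount int) float64 {
	return math.Log(float64(docCount+1)/float64(docFreq+1)) + 1
}

// tf: sqrt(freq), where freq is how many times the term occurs in the document.
func tf(freq int) float64 {
	return math.Sqrt(float64(freq))
}

// lengthNorm: 1/sqrt(length), where length is the number of terms in the document.
func lengthNorm(length int) float64 {
	return 1 / math.Sqrt(float64(length))
}

// weight combines the three factors for a single term in a single document.
func weight(freq, docFreq, docCount, length int) float64 {
	return idf(docFreq, docCount) * tf(freq) * lengthNorm(length)
}

func main() {
	// A term that occurs twice in a 12-term document and appears in 1 of 2 documents.
	fmt.Println(weight(2, 1, 2, 12))
}
```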
Keep in mind, this implementation accounts for document length, but it does not let you adjust how much influence length has on the overall relevance value.
Let’s see how using TF-IDF weighting, based on Lucene’s implementation, changes the relevancy scores in our earlier example.
- query: can my cat eat chicken
- document1: A cat can eat chicken. Chicken is part of a cat's diet.
  - vector: [0.2886751345948129, 0, 0.1716274399381882, 0.1716274399381882, 0.24271785323595957]
  - score: 0.8746478677071488
- document2: The cat ran past the chicken to try to eat my mouse.
  - vector: [0, 0.2886751345948129, 0.1716274399381882, 0.1716274399381882, 0.1716274399381882]
  - score: 0.8035574544093774
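The full weighting lives in the gist linked below; as a sanity check, if the query vector stays all 1’s, the score is just the dot product of the query and document vectors, i.e. the sum of the document’s weights, which lines up with the scores above. A minimal sketch of that final step in Go (the score helper name is mine):

```go
package main

import "fmt"

// score is the dot product of the query vector and the document's TF-IDF
// vector. With an all-ones query vector this reduces to summing the
// document's weights.
func score(queryVec, docVec []float64) float64 {
	var s float64
	for i := range queryVec {
		s += queryVec[i] * docVec[i]
	}
	return s
}

func main() {
	queryVec := []float64{1, 1, 1, 1, 1}
	doc1Vec := []float64{0.2886751345948129, 0, 0.1716274399381882, 0.1716274399381882, 0.24271785323595957}
	fmt.Println(score(queryVec, doc1Vec)) // ~0.8746
}
```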
TF-IDF Implementation: https://gist.github.com/mtham8/8f93db2443b6214dc0734c01d029b80c
Playground: https://play.golang.org/p/XZOC8fHUyWs
We can see here that document1 is ranked higher than document2 in relevancy to query using TF-IDF weighting. Hooray! This ranking is closer to what we are looking for. There were a few text cleaning steps I had to do beforehand, i.e., lowercasing all the terms and omitting some punctuation (periods). I also tokenized the documents, i.e., splitting the sentences into terms. These transformations are common preprocessing steps, and depending on the corpus and search needs, one might preprocess the text differently.
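A minimal sketch of that preprocessing in Go might look like this (the tokenize helper is my own name for it, assuming only lowercasing, dropping periods and apostrophes, and splitting on whitespace):

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize lowercases the text, drops periods and apostrophes, and splits on
// whitespace to produce the terms used to build a document vector.
func tokenize(text string) []string {
	text = strings.ToLower(text)
	text = strings.NewReplacer(".", "", "'", "").Replace(text)
	return strings.Fields(text)
}

func main() {
	fmt.Println(tokenize("A cat can eat chicken. Chicken is part of a cat's diet."))
	// [a cat can eat chicken chicken is part of a cats diet]
}
```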
Limitations
There are some limitations to using the TF-IDF VSM. IDF is calculated not from relevance information but from matching terms. Documents with similar context but different term vocabulary would not be considered a match, resulting in false negatives. The positions of words within a document are lost in the vector space representation. The model assumes the terms are independent. Under this assumption, documents are a bag of words and order does not matter. Often that is not the case. Terms can be dependent in situations such as:
- polysemy: the same terms used in different contexts, e.g. “I love my cat” vs. “cat lady”, which affects precision because irrelevant documents can be retrieved due to term matching.
- synonymity: different terms used in the same contexts, e.g. “feline” vs. “cat”, which affects recall because documents that are relevant would not be retrieved.
- ordering: the same terms used in different positions with different meanings, e.g. “college junior” vs. “junior college”, which affects both precision and recall.
Nonetheless, TF-IDF VSM is popular because it can go pretty far! Every corpus has its flavor. I invite you to try TF-IDF VSM, with its limitations in mind so that the surprises left are the unknowns. 🚀
Sources
- https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java
- https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
- https://nlp.stanford.edu/IR-book/html/htmledition/the-binary-independence-model-1.html
- https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
- https://www.elastic.co/blog/found-similarity-in-elasticsearch
- http://www.minerazzi.com/tutorials/term-vector-3.pdf