TF-IDF & Cosine Similarity

TF-IDF: Term Frequency-Inverse Document Frequency

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document within a collection or corpus. It is widely used in information retrieval and text mining because it weights a word's frequency in a document against how often that word appears across the whole corpus. In essence, a term's TF-IDF score increases proportionally with the number of times the term appears in a document and is offset by its frequency across the corpus, which corrects for the fact that some words are simply common everywhere.

  • Term Frequency (TF): This measures how frequently a term occurs in a document. There are several ways to calculate term frequency, with the simplest being the raw count of a term in a document. The term frequency is often normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of a word in the document).
  • Inverse Document Frequency (IDF): This measures how important a term is within the whole corpus. The more common a word is, the lower its IDF.

tf(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}

idf(t,D) = \log\frac{N}{|\left\{ d \in D : t \in d \right\}| + 1} + 1

tf\text{-}idf(t,d,D) = tf(t,d) \cdot idf(t,D)
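
A minimal from-scratch sketch of these formulas (assuming natural logarithms and simple whitespace tokenization; the helper names are illustrative, and the toy corpus mirrors the one used with scikit-learn below):

import math
from collections import Counter

def tf(term, doc):
    # Term frequency: occurrences of term in doc / total tokens in doc
    tokens = doc.split()
    return Counter(tokens)[term] / len(tokens)

def idf(term, docs):
    # Smoothed inverse document frequency: log(N / (df + 1)) + 1
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / (df + 1)) + 1

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    "the quick brown fox jumped over the lazy dog",
    "the dog jumped over the moon",
    "the quick brown fox",
]
# In "the quick brown fox", the rarer term "fox" outweighs the ubiquitous "the"
print(tf_idf("fox", docs[2], docs))  # ~0.25
print(tf_idf("the", docs[2], docs))  # ~0.18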

Code Implementation:

# tf(t, d)        = occurrences of term t in document d / total number of words in document d
# df(t)           = number of documents containing term t
# idf(t, D)       = log(N / (df(t) + 1)) + 1
# tf-idf(t, d, D) = tf(t, d) * idf(t, D)
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    "the quick brown fox jumped over the lazy dog",
    "the dog jumped over the moon",
    "the quick brown fox",
]
tf_idf_model = TfidfVectorizer()
tf_idf_vector = tf_idf_model.fit_transform(corpus)
tf_idf_array = tf_idf_vector.toarray()

# Print TF-IDF Vectors
print(tf_idf_array)

# Display the TF-IDF vectors as a DataFrame labeled with the vocabulary
words_set = tf_idf_model.get_feature_names_out()
df_tf_idf = pd.DataFrame(tf_idf_array, columns=words_set)
pd.set_option('display.max_rows', None)  # show all rows
print(df_tf_idf)
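
Once fitted, the vectorizer can also score unseen text against the vocabulary and IDF weights it learned from the corpus. A short sketch (the new sentence is illustrative):

# Score a new document without re-fitting; words outside the learned vocabulary are ignored
new_doc = ["the lazy dog jumped"]
new_vec = tf_idf_model.transform(new_doc)
print(pd.DataFrame(new_vec.toarray(), columns=words_set))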

Cosine Similarity

Cosine similarity measures how similar two vectors are, irrespective of their magnitude, by looking at the angle between them. It is commonly used in text analysis and other high-dimensional data spaces; for non-negative vectors such as TF-IDF representations, it ranges from 0 (no overlap) to 1 (identical direction), which makes it a natural way to compare documents.

sim(D_1, D_2) = \frac{\overrightarrow{D_1} \cdot \overrightarrow{D_2}}{|\overrightarrow{D_1}| \times |\overrightarrow{D_2}|}

Code Implementation:

import numpy as np

def cosine_similarity(A, B):
    # Cosine of the angle between A and B:
    # dot product divided by the product of the vectors' Euclidean norms
    dot_product = np.dot(A, B)
    norm_A = np.linalg.norm(A)
    norm_B = np.linalg.norm(B)
    return dot_product / (norm_A * norm_B)

# Example vectors (stand-ins for two documents' TF-IDF vectors)
vector_A = np.array([1, 2, 3])
vector_B = np.array([4, 5, 6])

# Compute cosine similarity
similarity = cosine_similarity(vector_A, vector_B)
print(f"Cosine Similarity: {similarity}")