TF-IDF & Cosine Similarity

TF-IDF: Term Frequency-Inverse Document Frequency

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document within a collection or corpus. It is widely used in information retrieval and text mining because it weights a word's frequency in a document against how often that word appears across the whole corpus. In essence, a term's TF-IDF score increases proportionally with the number of times the term appears in a document and is offset by its frequency across the corpus, which corrects for the fact that some words are simply common everywhere.

  • Term Frequency (TF): This measures how frequently a term occurs in a document. There are several ways to calculate term frequency, with the simplest being the raw count of a term in a document. The term frequency is often normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of a word in the document).
  • Inverse Document Frequency (IDF): This measures how important a term is within the whole corpus. The more common a word is, the lower its IDF.

tf(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}

idf(t,D) = \log\frac{N}{|\left\{ d \in D : t \in d \right\}| + 1} + 1

tf\text{-}idf(t,d,D) = tf(t,d) \cdot idf(t,D)
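
A minimal from-scratch sketch of these formulas (assuming natural logarithms and simple whitespace tokenization; the helper names are illustrative, and the toy corpus mirrors the one used with scikit-learn below):

import math
from collections import Counter

def tf(term, doc):
    # Term frequency: occurrences of term in doc / total tokens in doc
    tokens = doc.split()
    return Counter(tokens)[term] / len(tokens)

def idf(term, docs):
    # Smoothed inverse document frequency: log(N / (df + 1)) + 1
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / (df + 1)) + 1

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    "the quick brown fox jumped over the lazy dog",
    "the dog jumped over the moon",
    "the quick brown fox",
]
# In "the quick brown fox", the rarer term "fox" outweighs the ubiquitous "the"
print(tf_idf("fox", docs[2], docs))  # ~0.25
print(tf_idf("the", docs[2], docs))  # ~0.18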

Code Implementation:

# tf(t, d)        = occurrences of term t in document d / total number of words in document d
# df(t)           = number of documents containing term t
# idf(t, D)       = log(N / (df(t) + 1)) + 1
# tf-idf(t, d, D) = tf(t, d) * idf(t, D)
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    "the quick brown fox jumped over the lazy dog",
    "the dog jumped over the moon",
    "the quick brown fox",
]
tf_idf_model = TfidfVectorizer()
tf_idf_vector = tf_idf_model.fit_transform(corpus)
tf_idf_array = tf_idf_vector.toarray()

# Print TF-IDF Vectors
print(tf_idf_array)

# Display the TF-IDF vectors as a DataFrame labeled with the vocabulary
words_set = tf_idf_model.get_feature_names_out()
df_tf_idf = pd.DataFrame(tf_idf_array, columns=words_set)
pd.set_option('display.max_rows', None)  # show all rows
print(df_tf_idf)
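
Once fitted, the vectorizer can also score unseen text against the vocabulary and IDF weights it learned from the corpus. A short sketch (the new sentence is illustrative):

# Score a new document without re-fitting; words outside the learned vocabulary are ignored
new_doc = ["the lazy dog jumped"]
new_vec = tf_idf_model.transform(new_doc)
print(pd.DataFrame(new_vec.toarray(), columns=words_set))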

Cosine Similarity

Cosine similarity measures how similar two vectors are, irrespective of their magnitude, by looking at the angle between them. It is commonly used in text analysis and other high-dimensional data spaces; for non-negative vectors such as TF-IDF representations, it ranges from 0 (no overlap) to 1 (identical direction), which makes it a natural way to compare documents.

sim(D_1, D_2) = \frac{\overrightarrow{D_1} \cdot \overrightarrow{D_2}}{|\overrightarrow{D_1}| \times |\overrightarrow{D_2}|}

Code Implementation:

import numpy as np

def cosine_similarity(A, B):
    # Cosine of the angle between A and B:
    # dot product divided by the product of the vectors' Euclidean norms
    dot_product = np.dot(A, B)
    norm_A = np.linalg.norm(A)
    norm_B = np.linalg.norm(B)
    return dot_product / (norm_A * norm_B)

# Example vectors (stand-ins for two documents' TF-IDF vectors)
vector_A = np.array([1, 2, 3])
vector_B = np.array([4, 5, 6])

# Compute cosine similarity
similarity = cosine_similarity(vector_A, vector_B)
print(f"Cosine Similarity: {similarity}")