TF-IDF (Term Frequency-Inverse Document Frequency) Explained: Principles and Python Implementation

2025/1/4 9:14:58 Source: https://blog.csdn.net/shizheng_Li/article/details/144772394

TF-IDF Algorithm Explained: Understanding and Applications

TF-IDF (Term Frequency-Inverse Document Frequency) is a commonly used algorithm in information retrieval and text mining, widely applied in search engines, recommendation systems, and various text analysis fields. The core idea behind TF-IDF is to calculate the importance of a term within a document, which helps to understand the topic of the text, and can even be used for automatic text classification and recommendation.

1. Definition of TF-IDF

TF-IDF consists of two components: TF (Term Frequency) and IDF (Inverse Document Frequency). Together, they reflect the importance of a term in a document.

  • TF (Term Frequency): This measures how frequently a term appears in a document. The formula is as follows:

    $$\text{TF}(t, d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of terms in document } d}$$

    Here, $t$ represents the term, and $d$ represents the document. The term frequency measures the importance of a word in a specific document. Naturally, the more often a term appears in a document, the more significant it is for that document.

  • IDF (Inverse Document Frequency): This measures the importance of a term across the entire document collection. The formula is as follows:

    $$\text{IDF}(t, D) = \log \left( \frac{N}{\text{Number of documents containing term } t} \right)$$

    Where $N$ is the total number of documents in the collection $D$. The more documents that contain the term $t$, the lower the IDF value. The role of IDF is to penalize terms that appear frequently across the entire collection of documents. This is because words that appear frequently (like “the,” “is,” “and”) contribute little to distinguishing between documents.

  • TF-IDF Value: The TF-IDF value is the product of TF and IDF, which represents the combined importance of a term in a document:

    $$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$

    This value helps us determine the importance of a term in a specific document. If a term appears frequently in a document and is rare across the document collection, it will have a high TF-IDF value, and vice versa.
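The three formulas above can be checked by hand on a tiny corpus. The following is a minimal sketch, assuming a made-up three-document corpus of pre-tokenized, lowercase words; the helper names `tf`, `idf`, and `tf_idf` are illustrative and simply mirror the definitions in this section.

```python
import math

# Toy corpus (hypothetical): each document is a list of lowercase tokens.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

def tf(term, doc):
    """Occurrences of `term` in `doc` divided by the total number of terms in `doc`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of N over the number of documents containing `term`.

    Assumes `term` occurs in at least one document (otherwise df would be 0).
    """
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

first = corpus[0]
for term in ["the", "cat", "sat"]:
    print(f"{term}: TF={tf(term, first):.4f}  IDF={idf(term, corpus):.4f}  "
          f"TF-IDF={tf_idf(term, first, corpus):.4f}")
```

In this toy corpus "the" is the most frequent word in the first document, yet its TF-IDF is 0 because it occurs in every document, while "cat" (one document) scores higher than "sat" (two documents).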

2. Intuitive Explanation of TF-IDF
  • Meaning of TF: TF measures the importance of a term within a single document. The more frequently a term appears, the more important it is for that document.

  • Meaning of IDF: IDF penalizes terms that appear in many documents. This is because these terms (like “of,” “the,” “in,” etc.) are not helpful in distinguishing different documents. By applying IDF, we decrease the weight of such common words, and increase the weight of terms that are rare across the collection but frequent in a specific document.

  • Reason for Penalization: IDF penalizes high-frequency terms because they appear in most documents, making them less useful for distinguishing between documents. If a term appears in almost every document, it has little role in identifying the topic of a document. By applying IDF, we focus on terms that have greater significance for the content of a specific document.
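The penalty described above can be made concrete by tabulating IDF against document frequency. This is a small sketch assuming a hypothetical collection of 1000 documents and the natural logarithm; the variable names are illustrative.

```python
import math

N = 1000  # hypothetical collection size

# IDF = log(N / df): the more documents contain a term, the smaller its IDF.
idf_by_df = {df: math.log(N / df) for df in (1, 10, 100, 500, 1000)}

for df, value in idf_by_df.items():
    print(f"df={df:>4}  IDF={value:.4f}")
```

A term found in a single document gets the largest weight, and a term found in all 1000 documents gets an IDF of exactly 0, so it contributes nothing to any TF-IDF score.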

3. Applications of TF-IDF

TF-IDF is widely used in various fields, especially in large companies and technology products. Here are some typical applications:

  • Search Engines: Search engines (such as Google and Bing) can use TF-IDF-style term weighting to match user query terms with webpage content. When a user enters a query, the engine scores each webpage by the TF-IDF values of the query terms it contains and ranks higher-scoring pages as more relevant.

  • Recommendation Systems: E-commerce platforms (such as Amazon and Taobao) can use TF-IDF to analyze keywords in product descriptions and recommend related products. For example, when a user views a particular smartphone, the system can recommend related accessories or other phones whose descriptions have similar TF-IDF vectors.

  • Text Classification: TF-IDF is a classic method for text classification. It represents each text as a feature vector in which every word is weighted by its importance, helping machine learning algorithms distinguish between different categories of text. Tasks like news classification and sentiment analysis commonly use TF-IDF features.

  • Spam Email Filtering: Email services can use TF-IDF to turn the content of each email into a weighted feature vector and pass it to a classifier that decides whether the email is spam. Terms that are frequent in spam but rare in ordinary mail receive high weights in spam messages, which makes them useful signals for the classifier.
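The text classification and spam filtering use cases can be sketched with scikit-learn. This is only an illustration under stated assumptions: the six training emails and their labels are invented, and pairing `TfidfVectorizer` with a `MultinomialNB` classifier is one common baseline choice, not a method described in the article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data; 1 = spam, 0 = not spam.
train_texts = [
    "win a free prize now",
    "free money click now",
    "claim your free prize today",
    "meeting agenda for monday",
    "please review the project report",
    "lunch with the project team on monday",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline first converts text to TF-IDF vectors, then fits the classifier on them.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["free prize now", "project meeting on monday"])
print(predictions)
```

With so little data the model only picks up on word overlap with each class; a realistic filter would need far more examples and careful evaluation.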

4. TF-IDF in Large Companies
  • Google: Scoring how well a page's text matches the query terms is a basic ingredient of web search, and TF-IDF is the classic way to compute such a score. A search engine like Google's can use it as one relevance signal, alongside others such as link analysis.

  • Amazon: A product recommendation system like Amazon's can compare the TF-IDF vector of each product description with those of other products and build recommendation lists from the most similar items.

  • Microsoft: Document classification and natural language processing features (such as automatic classification of Office documents) can use TF-IDF to identify a document's keywords and their importance, and categorize documents accordingly.

  • Netflix: A recommender like Netflix's can apply TF-IDF to user review text to identify the keywords that characterize each movie, and use them for personalized recommendations based on user interests.
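The idea of comparing item descriptions to build recommendation lists can be sketched as follows. This is a minimal illustration, assuming four invented product descriptions; the `recommend` helper and the product names are hypothetical and do not reflect any company's actual system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical product descriptions, for illustration only.
products = {
    "phone_a": "smartphone with 128GB storage and dual camera",
    "phone_b": "smartphone with 256GB storage and triple camera",
    "case": "protective case for smartphone",
    "blender": "kitchen blender with glass jar",
}

names = list(products)
matrix = TfidfVectorizer().fit_transform(list(products.values()))
similarity = cosine_similarity(matrix)  # pairwise similarity between all items

def recommend(name, top_n=2):
    """Return the `top_n` items most similar to `name`, excluding the item itself."""
    row = similarity[names.index(name)]
    ranked = sorted(zip(names, row), key=lambda pair: pair[1], reverse=True)
    return [(other, float(score)) for other, score in ranked if other != name][:top_n]

print(recommend("phone_a"))
```

Because the two phone descriptions share most of their words, `phone_b` ranks first for `phone_a`.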

5. Conclusion

TF-IDF is a simple yet efficient text analysis algorithm that, by combining term frequency and inverse document frequency, helps us extract the most representative terms from text. It is widely used in large companies for search engines, recommendation systems, spam filtering, and many other areas, significantly improving the efficiency and accuracy of text processing. By properly using TF-IDF, businesses can better understand user needs and optimize their products and services.

Python Example for TF-IDF Algorithm

To illustrate how a search engine can use TF-IDF to judge relevance, we can work through a practical Python example. Suppose we have some simple webpage contents and a query; we convert them to TF-IDF vectors and compare them to determine which webpage is most relevant to the query.

1. Install Necessary Libraries

We can use TfidfVectorizer from sklearn to compute the TF-IDF values and perform simple similarity calculations to judge the relevance of a query to webpages. First, you need to install scikit-learn:

pip install scikit-learn

2. Implementation Code

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assume we have content from three webpages
documents = [
    "Google is a search engine that helps you find websites.",
    "Google also provides email services through Gmail.",
    "Amazon is an online store that sells various products.",
]

# The query (e.g., what the user is searching for)
query = ["search engine and websites"]

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Combine the documents and query into a list to calculate TF-IDF together
all_documents = documents + query

# Compute the TF-IDF matrix (one row per text; the last row is the query)
tfidf_matrix = vectorizer.fit_transform(all_documents)

# Calculate cosine similarity between the query and each document
cosine_similarities = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])

# Output the similarity score between the query and each document
for i, score in enumerate(cosine_similarities[0]):
    print(f"Document {i+1} similarity: {score:.4f}")

# Choose the most relevant document (the one with the highest cosine similarity)
best_match_index = cosine_similarities.argmax()
print(f"The most relevant document is Document {best_match_index + 1}")

3. Code Explanation

  • Documents: We have three simple webpages with different content. From these webpages, we want to find the most relevant one.

  • Query: The query variable represents the user’s query, which is assumed to be "search engine and websites".

  • TF-IDF Calculation: We use TfidfVectorizer to compute the TF-IDF values. The fit_transform method learns the vocabulary and IDF weights from the documents and the query together and returns a TF-IDF matrix with one row per text; the last row is the query.

  • Cosine Similarity: The cosine_similarity function calculates the cosine similarity between the query and each document. Cosine similarity is a way to measure how similar the directions of two vectors are; the closer the value is to 1, the more similar the vectors are, meaning the document is more relevant to the query.

  • Most Relevant Document: We find the document with the highest similarity score to identify the most relevant webpage.

4. Running the Code

Running the above code should print output like this:

Document 1 similarity: 0.4247
Document 2 similarity: 0.0000
Document 3 similarity: 0.0000
The most relevant document is Document 1
Explanation of Results:
  • Document 1 similarity: The similarity between the query and Document 1 is about 0.42, because the document contains the query terms "search", "engine", and "websites".
  • Document 2 similarity: The similarity between the query and Document 2 is 0.0000, because the document shares no terms with the query.
  • Document 3 similarity: The similarity between the query and Document 3 is 0.0000 (completely irrelevant), for the same reason.

In the end, the code determines that Document 1 (the webpage describing Google as a search engine) is the most relevant to the query because it has the highest TF-IDF cosine similarity.

5. Practical Application

In real-world applications, this method can be extended to a large number of webpages and user queries. A search engine can compute the TF-IDF similarity between a user query and its indexed webpages and return the highest-scoring ones to the user. This kind of text matching is a basic building block of search relevance.

On its own, however, TF-IDF only measures word overlap between the query and a page. Production search engines such as Google combine text relevance with other signals and techniques, such as PageRank link analysis and machine learning models, to further improve the relevance and accuracy of search results.
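One way to sketch the scaling idea is to fit the vectorizer on the documents once and only transform each incoming query, then keep the top-scoring results. The `top_k` helper below is a hypothetical name, and fitting on the documents alone is a design choice of this sketch rather than what the article's example does; it relies on `TfidfVectorizer` L2-normalizing rows by default, so a dot product equals cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_k(query, documents, k=2):
    """Return (document index, score) pairs for the k best-matching documents."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(documents)  # vocabulary and IDF come from the documents only
    query_vector = vectorizer.transform([query])      # words unseen in the documents are ignored
    # Rows are L2-normalized by default, so the dot product is the cosine similarity.
    scores = linear_kernel(query_vector, doc_matrix).ravel()
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

documents = [
    "Google is a search engine that helps you find websites.",
    "Google also provides email services through Gmail.",
    "Amazon is an online store that sells various products.",
]
results = top_k("search engine and websites", documents)
print(results)
```

In a real system the fitted vectorizer and document matrix would be built once and reused across queries instead of being recomputed inside the function.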

A Complete TF-IDF Algorithm Implementation from Scratch in Python

Here is a full example of how to implement the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm from scratch, covering the calculation of TF (Term Frequency), IDF (Inverse Document Frequency), and the resulting TF-IDF.

1. Data Preparation

We use some simple documents to simulate a small document set (e.g., web page content). Each document is split into a list of words, and these word lists are used to calculate the TF-IDF values.

2. Python Code Implementation

import math
from collections import Counter

# Calculate Term Frequency (TF)
def compute_tf(document):
    tf = {}
    word_count = len(document)
    word_frequency = Counter(document)
    for word, count in word_frequency.items():
        tf[word] = count / word_count
    return tf

# Calculate Inverse Document Frequency (IDF)
def compute_idf(documents):
    idf = {}
    total_documents = len(documents)
    # For each document, calculate the frequency of words
    for document in documents:
        for word in set(document):  # Use set to avoid counting the same word multiple times
            if word not in idf:
                # Calculate the number of documents containing the word
                doc_containing_word = sum(1 for doc in documents if word in doc)
                idf[word] = math.log(total_documents / doc_containing_word)
    return idf

# Calculate TF-IDF
def compute_tfidf(documents):
    tfidf = []
    # Calculate IDF
    idf = compute_idf(documents)
    for document in documents:
        tf = compute_tf(document)
        tfidf_document = {}
        for word in document:
            tfidf_document[word] = tf[word] * idf.get(word, 0)  # Calculate TF-IDF value
        tfidf.append(tfidf_document)
    return tfidf

# Example document set
documents = [
    "google is a search engine".split(),
    "google provides various services".split(),
    "amazon is an online store".split(),
]

# Calculate TF-IDF values for each document
tfidf_results = compute_tfidf(documents)

# Output TF-IDF results for each document
for i, tfidf in enumerate(tfidf_results):
    print(f"Document {i+1} TF-IDF:")
    for word, score in tfidf.items():
        print(f"  {word}: {score:.4f}")
    print()

3. Code Explanation

  • Calculating TF:
    The compute_tf function calculates the term frequency (TF) for each word in a document. TF is the number of times a word appears in the document divided by the total number of words in the document.

    tf[word] = count / word_count
    
  • Calculating IDF:
    The compute_idf function calculates the inverse document frequency (IDF) for each word in the entire document set. IDF is calculated by the formula:

    $$\text{IDF}(t, D) = \log \left( \frac{N}{\text{Number of documents containing the word } t} \right)$$

    Where $N$ is the total number of documents. The more documents contain the word $t$, the smaller its IDF value, and vice versa.

  • Calculating TF-IDF:
    The compute_tfidf function combines the TF and IDF to calculate the TF-IDF for each word in a document. The formula is:

    $$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$

    For each word in a document, we multiply its term frequency (TF) in that document by its inverse document frequency (IDF) across the document set to obtain its TF-IDF value.
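The `compute_idf` function rescans the whole document set for every new word it meets. As a possible refinement, document frequencies can be counted in a single pass with a `Counter`. The sketch below is an assumed alternative that should give the same IDF values; the name `compute_idf_single_pass` is hypothetical.

```python
import math
from collections import Counter

def compute_idf_single_pass(documents):
    """Compute log(N / df) for every word, counting document frequencies in one pass."""
    document_frequency = Counter()
    for document in documents:
        document_frequency.update(set(document))  # each document counts a word at most once
    total_documents = len(documents)
    return {word: math.log(total_documents / df) for word, df in document_frequency.items()}

documents = [
    "google is a search engine".split(),
    "google provides various services".split(),
    "amazon is an online store".split(),
]
idf = compute_idf_single_pass(documents)
print(f"google: {idf['google']:.4f}, search: {idf['search']:.4f}")
```

Here "google" appears in two of the three documents, so its IDF is log(3/2), while "search" appears in only one and gets log(3).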

4. Example Output

Running the above code prints:

Document 1 TF-IDF:
  google: 0.0811
  is: 0.0811
  a: 0.2197
  search: 0.2197
  engine: 0.2197

Document 2 TF-IDF:
  google: 0.1014
  provides: 0.2747
  various: 0.2747
  services: 0.2747

Document 3 TF-IDF:
  amazon: 0.2197
  is: 0.0811
  an: 0.2197
  online: 0.2197
  store: 0.2197

Output Explanation:
  • TF-IDF values: For each document, the TF-IDF value indicates how significant each word is for that document. For example, “google” appears in both Document 1 and Document 2, so its IDF is log(3/2) ≈ 0.4055, lower than the log(3) ≈ 1.0986 of words that appear in only one document. As a result, its TF-IDF score in Document 1 (0.0811) is lower than that of “search” or “engine” (0.2197). A word would receive an IDF of 0 only if it appeared in every document.
  • Combining TF and IDF: By combining TF and IDF, we can assess the importance of each word in the context of a particular document. Words that appear frequently in a document but are rare across other documents will have a higher TF-IDF score.
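The from-scratch implementation uses the plain formula log(N / df), so its numbers are not directly comparable with those of scikit-learn. With its default settings, `TfidfVectorizer` uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, applies it to raw term counts, and then L2-normalizes each row. The sketch below only contrasts the two IDF formulas; the function names are illustrative.

```python
import math

def idf_plain(n_docs, df):
    """IDF as defined in this article: log(N / df)."""
    return math.log(n_docs / df)

def idf_smoothed(n_docs, df):
    """Smoothed variant: log((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 3
for df in (1, 2, 3):
    print(f"df={df}: plain={idf_plain(n_docs, df):.4f}  "
          f"smoothed={idf_smoothed(n_docs, df):.4f}")
```

With the smoothed form, a word that appears in every document still keeps a weight of 1 instead of dropping to 0, and a document frequency of 0 no longer causes a division by zero.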

5. Extensions

This implementation is a simple example, and there are several ways to extend it:

  • Handling larger datasets: This implementation works for small datasets. For larger datasets, optimizations like parallel computing or more efficient data structures may be necessary.
  • Removing stopwords: To improve the quality of TF-IDF calculations, you can remove common stopwords (e.g., “is”, “the”, “and”) from the text.
  • Other text preprocessing: You could add preprocessing steps like lowercasing, stemming, or lemmatization to improve the TF-IDF scores and make the algorithm more robust.
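The stopword and preprocessing extensions can be sketched as a small tokenization step that runs before the TF-IDF calculation. This is a minimal illustration: the stopword set is a tiny hypothetical sample, the regular expression only handles unaccented English letters, and `preprocess` is an illustrative name.

```python
import re

# A tiny hypothetical stopword list; real applications typically use a much larger one.
STOPWORDS = {"a", "an", "and", "is", "the", "that"}

def preprocess(text):
    """Lowercase the text, extract runs of letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]

print(preprocess("Google is a search engine."))
```

The resulting token lists can be passed to `compute_tfidf` in place of the `.split()` calls used in the example document set.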

This basic implementation provides a good starting point for understanding how TF-IDF works and can be adapted for more complex applications.

Postscript

Completed in Shanghai at 16:34 on December 27, 2024, with the assistance of the GPT4o mini large language model.
