How can I compare the word frequencies of two text files?

Posted 2024-09-30 05:30:49

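How can I compare the word frequencies of two text files in Python? If a word appears in both file1 and file2, it should be listed only once, but its frequencies should not be added together; the entry should look like {'The':3,5}, where 3 is the frequency in file 1 and 5 is the frequency in file 2. If a word appears in only one of the files, the count for the other file should be 0. Please help. Here is what I have so far: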

import operator
f1=open('file1.txt','r') #file 1
f2=open('file2.txt','r') #file 2

wordlist=[]
wordlist2=[]
for line in f1:
    for word in line.split():
        wordlist.append(word)

for line in f2:
    for word in line.split():
        wordlist2.append(word)

worddictionary = {}
for word in wordlist:
    if word in worddictionary:
        worddictionary[word] += 1
    else:
        worddictionary[word] = 1

worddictionary2 = {}
for word in wordlist2:
    if word in worddictionary2:
        worddictionary2[word] += 1
    else:
        worddictionary2[word] = 1

print(worddictionary)
print(worddictionary2)

3 Answers

You may find the following demo program a good starting point for getting word frequencies from a file:

#! /usr/bin/env python3
import collections
import pathlib
import pprint
import re
import sys


def main():
    # sys.argv[0] is this script's own path, so the demo counts its own words
    freq = get_freq(sys.argv[0])
    pprint.pprint(freq)


def get_freq(path):
    if isinstance(path, str):
        path = pathlib.Path(path)
    return collections.Counter(
        match.group() for match in re.finditer(r'\b\w+\b', path.open().read())
    )


if __name__ == '__main__':
    main()

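In particular, use the get_freq function to obtain a Counter object that tells you how often each word occurs. Your program can call get_freq several times with different file names, and you should find that Counter objects behave much like the dictionaries you were using before.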
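
As a rough sketch of how two Counter objects could be combined into the per-file layout the question asks for: the helper names count_words and compare_counts and the sample strings are made up for illustration, and the strings stand in for the contents of file1.txt and file2.txt.

```python
import collections
import re


def count_words(text):
    # Same word pattern as get_freq, applied to a string instead of a file.
    return collections.Counter(re.findall(r'\b\w+\b', text))


def compare_counts(first, second):
    # A Counter returns 0 for a missing key, so words that occur in only
    # one of the inputs need no special handling.
    return {word: (first[word], second[word])
            for word in first.keys() | second.keys()}


# Hypothetical sample texts standing in for the two files.
a = count_words('The cat saw The dog')
b = count_words('The dog barked')
result = compare_counts(a, b)
print(result['The'])     # (2, 1)
print(result['barked'])  # (0, 1)
```

Each value is a (count in first, count in second) tuple, which is one way to represent the {'The':3,5} shape from the question as a valid Python value.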

Edit: I misunderstood the question; the code below now fits your problem.

f1 = open('file1.txt','r') #file 1
f2 = open('file2.txt','r') #file 2

wordList = {}

for line in f1.readlines(): #for each line in lines (file.readlines() returns a list)
    for word in line.split(): #for each word in each line
        if(not word in wordList): #if the word is not already in our dictionary
            wordList[word] = 0 #Add the word to the dictionary

for line in f2.readlines(): #for each line in lines (file.readlines() returns a list)
    for word in line.split(): #for each word in each line
        if(word in wordList): #if the word is already in our dictionary
            wordList[word] = wordList[word]+1 #add one to its value

f1.close() #close files
f2.close()

f1 = open('file1.txt','r') #Have to re-open because we are at the end of the file.
#there might be an easier way of doing this

for line in f1.readlines(): #Removing keys whose values are 0
    for word in line.split(): #for each word in each line
        try:
            if(wordList[word] == 0): #if its value is 0
                del wordList[word] #remove it from the dictionary
            else:
                wordList[word] = wordList[word]+1 #if its value is not 0, add one to it for each occurrence in file1
        except KeyError:
            pass #the word was already deleted, so there is no wordList[word]
f1.close()

print(wordList)

Add the words of the first file, and if a word is also in the second file, add one to its value. Then check each word and delete it if its value is 0.

This cannot be done by iterating over the dictionary directly, because deleting entries changes its size during iteration.
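
One common workaround, sketched here with a made-up helper name drop_zero_counts and made-up sample data, is to loop over a snapshot of the keys rather than over the dictionary itself, so that deletions do not disturb the iteration:

```python
def drop_zero_counts(counts):
    # list(counts) copies the keys up front. Deleting from `counts` while
    # looping over `counts` itself raises RuntimeError in Python 3.
    for word in list(counts):
        if counts[word] == 0:
            del counts[word]
    return counts


# Hypothetical counts; 'cat' and 'saw' have a zero count.
sample = {'the': 3, 'cat': 0, 'dog': 2, 'saw': 0}
print(drop_zero_counts(sample))  # {'the': 3, 'dog': 2}
```

With this approach the file would not need to be re-opened just to find the keys to delete.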

Here is how to do it for multiple files (more complicated):

[code block missing from the source page]

Edit: here is a more general way to do this for any list of files (explanation in the comments):

f1=open('file1.txt','r') #file 1
f2=open('file2.txt','r') #file 2

file_list = [f1, f2] # This would hold all your open files
num_files = len(file_list)

frequencies = {} # We'll just make one dictionary to hold the frequencies

for i, f in enumerate(file_list): # Loop over the files, keeping an index i
    for line in f: # Get the lines of that file
        for word in line.split(): # Get the words of that file
            if not word in frequencies:
                frequencies[word] = [0 for _ in range(num_files)] # make a list of 0's for any word you haven't seen yet   one 0 for each file

            frequencies[word][i] += 1 # Increment the frequency count for that word and file

print(frequencies)

Following the code you wrote, here is how to create a combined dictionary:

[code block missing from the source page]
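
One possible way to build such a combined dictionary, not necessarily the answerer's original code, is sketched below. It assumes worddictionary and worddictionary2 have already been filled as in the question; the small literal dictionaries and the helper name combine are illustrative only.

```python
def combine(first, second):
    # Visit every word seen in either dictionary; dict.get supplies 0
    # for a word that is absent from one of them.
    combined = {}
    for word in set(first) | set(second):
        combined[word] = (first.get(word, 0), second.get(word, 0))
    return combined


# Hypothetical stand-ins for the dictionaries built in the question.
worddictionary = {'The': 3, 'cat': 1}
worddictionary2 = {'The': 5, 'dog': 2}

combined = combine(worddictionary, worddictionary2)
print(combined['The'])  # (3, 5)
print(combined['cat'])  # (1, 0)
```

Each word appears once as a key, with its frequency in file 1 first and its frequency in file 2 second.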
