How do I compare the word frequencies of two text files?

import operator

f1 = open('file1.txt', 'r')  # file 1
f2 = open('file2.txt', 'r')  # file 2

wordlist = []
wordlist2 = []

for line in f1:
    for word in line.split():
        wordlist.append(word)

for line in f2:
    for word in line.split():
        wordlist2.append(word)

worddictionary = {}
for word in wordlist:
    if word in worddictionary:
        worddictionary[word] += 1
    else:
        worddictionary[word] = 1

worddictionary2 = {}
for word in wordlist2:
    if word in worddictionary2:
        worddictionary2[word] += 1
    else:
        worddictionary2[word] = 1

print(worddictionary)
print(worddictionary2)

3 Answers

User

Answer 1 · edited 2024-09-30 05:30:49

You may find the following demo program a good starting point for getting the word frequencies of a file:

#! /usr/bin/env python3
import collections
import pathlib
import pprint
import re
import sys


def main():
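    # sys.argv[0] is this script's own path, so the demo counts the words
    # in its own source; pass any other file path to get_freq instead.
    freq = get_freq(sys.argv[0])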
    pprint.pprint(freq)


def get_freq(path):
    if isinstance(path, str):
        path = pathlib.Path(path)
    return collections.Counter(
        match.group() for match in re.finditer(r'\b\w+\b', path.read_text())
    )


if __name__ == '__main__':
    main()

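In particular, you will want to use the get_freq function to obtain a Counter object that tells you how often each word occurs. Your program can call get_freq several times with different file names, and you should find that Counter objects behave much like the dictionaries you were using before.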
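
Once each file has its own Counter, the two can be compared directly. The following is a minimal sketch that assumes the text is already in memory; `count_words`, `only_in_first`, `shared`, and the sample strings are illustrative names and data, not part of the answer above:

```python
import collections
import re


def count_words(text):
    """Count words in a string, using the same regex as get_freq."""
    return collections.Counter(re.findall(r'\b\w+\b', text))


# Sample strings standing in for the contents of two files (made up).
freq1 = count_words('the cat sat on the mat')
freq2 = count_words('the dog sat on the log')

# Counter subtraction keeps only positive counts:
# words that are more frequent in the first text than in the second.
only_in_first = freq1 - freq2

# Counter intersection keeps the minimum count of each shared word.
shared = freq1 & freq2

print(only_in_first)
print(shared)
```

Subtraction and intersection are only two of the possible comparisons; which one is appropriate depends on what "compare" should mean for your files.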

User

Answer 2 · edited 2024-09-30 05:30:49

Edit: I misunderstood the question; the code now addresses your problem.

f1 = open('file1.txt','r') #file 1
f2 = open('file2.txt','r') #file 2

wordList = {}

for line in f1.readlines(): #for each line in lines (file.readlines() returns a list)
    for word in line.split(): #for each word in each line
        if word not in wordList: #if the word is not already in our dictionary
            wordList[word] = 0 #add the word to the dictionary

for line in f2.readlines(): #for each line in lines (file.readlines() returns a list)
    for word in line.split(): #for each word in each line
        if word in wordList: #if the word is already in our dictionary
            wordList[word] = wordList[word]+1 #add one to its value

f1.close() #close files
f2.close()

f1 = open('file1.txt','r') #Have to re-open because we are at the end of the file.
#There might be an easier way of doing this

for line in f1.readlines(): #Removing keys whose values are 0
    for word in line.split(): #for each word in each line
        try:
            if wordList[word] == 0: #if its value is 0
                del wordList[word] #remove it from the dictionary
            else:
                wordList[word] = wordList[word]+1 #if its value is not 0, add one to it for each occurrence in file1
        except KeyError:
            pass #the word was already removed from the dictionary
f1.close()

print(wordList)

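This adds the words from the first file; whenever one of those words occurs in the second file, one is added to its value. Then each word is checked and removed if its value is 0.

This cannot be done by iterating over the dictionary itself, because deleting keys would change its size during the iteration.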
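
The restriction can be seen in a small sketch: deleting keys while looping over the dictionary itself raises a RuntimeError, whereas looping over a snapshot of the keys is safe. The dictionary contents and the name `counts` are made up for illustration:

```python
counts = {'apple': 0, 'pear': 2, 'plum': 0}

# Deleting while iterating over the dictionary itself fails.
try:
    for word in counts:
        if counts[word] == 0:
            del counts[word]
except RuntimeError as error:
    message = str(error)
    print('Failed:', message)

# Iterating over a snapshot of the keys allows deletion.
for word in list(counts):
    if counts[word] == 0:
        del counts[word]

print(counts)
```

Using `list(counts)` would also avoid re-reading the first file just to find the keys to remove.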

Here is how to implement this for multiple files (more complex):

[code block not preserved in the source]
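
One possible generalization, offered as a sketch rather than a reconstruction of the answer's own code: count each file separately, keep only the words present in every file, and sum their counts. `common_word_totals` is a hypothetical helper, and the texts are passed in as strings so the example runs without any files on disk:

```python
import collections


def common_word_totals(texts):
    """Return summed counts for the words that occur in every text."""
    counters = [collections.Counter(text.split()) for text in texts]
    if not counters:
        return {}
    # Words present in all texts: intersect the key sets.
    common = set(counters[0])
    for counter in counters[1:]:
        common &= set(counter)
    # Sum each common word's count across all texts.
    return {word: sum(c[word] for c in counters) for word in common}


# Made-up sample texts standing in for file contents.
totals = common_word_totals(['a b b c', 'b c d', 'c b e'])
print(totals)
```

To apply this to real files, each file's contents could be read into a string first and the resulting list passed to the helper.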

User

Answer 3 · edited 2024-09-30 05:30:49

Edit: here is a more general way to do this for any list of files (explanation in the comments):

f1=open('file1.txt','r') #file 1
f2=open('file2.txt','r') #file 2

file_list = [f1, f2] # This would hold all your open files
num_files = len(file_list)

frequencies = {} # We'll just make one dictionary to hold the frequencies

for i, f in enumerate(file_list): # Loop over the files, keeping an index i
    for line in f: # Get the lines of that file
        for word in line.split(): # Get the words of that file
            if word not in frequencies:
                frequencies[word] = [0 for _ in range(num_files)] # make a list of 0's for any word you haven't seen yet, one 0 for each file

            frequencies[word][i] += 1 # Increment the frequency count for that word and file

print(frequencies)

for f in file_list: # Close the files once the counting is done
    f.close()

Following the code you wrote, here is how to create a combined dictionary:

[code block not preserved in the source]
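
A minimal sketch of one way to combine the two dictionaries from the question, not necessarily the answer's own code: map every word seen in either file to a pair of counts. The names `worddictionary` and `worddictionary2` come from the question; `combined` and the sample counts are illustrative assumptions:

```python
# Sample counts standing in for the dictionaries built in the question.
worddictionary = {'the': 2, 'cat': 1}
worddictionary2 = {'the': 1, 'dog': 3}

combined = {}
# The union of the keys covers words that appear in either file.
for word in set(worddictionary) | set(worddictionary2):
    # dict.get returns 0 for a word that is missing from one of the files.
    combined[word] = (worddictionary.get(word, 0), worddictionary2.get(word, 0))

print(combined)
```

Each value is a (count in file 1, count in file 2) pair, so a word that is absent from one file shows up with a 0 on that side.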
