Most common words between two files using Python

Published 2024-09-29 21:29:43


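I am new to Python and am trying to write a script that finds the most common words between two files. I can find the most common words in each file individually, but I am not sure how to count, say, the 5 most common words that appear in both files. The words should be common to both files, and their frequency across the two files should also be the highest.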

import re
from collections import Counter

# Punctuation to strip. The hyphen is placed last so it is a literal character;
# in the middle of the class, ")-=" would be read as a range that also removes digits.
PUNCT = re.compile(r'[,.<;:)=!>_(?"-]')

# Words of the first file, with punctuation removed and lowercased
with open("test3.txt", "r") as hfFile:
    words1 = PUNCT.sub('', hfFile.read()).lower().split()

# test2.txt is read here but not used in the rest of this snippet
with open('test2.txt', 'r') as f:
    sWords = [line.strip() for line in f]

# Words of the second file
with open("test4.txt", "r") as tsFile:
    words = PUNCT.sub('', tsFile.read()).lower().split()

#print (words)
mc = Counter(words).most_common()
mc2 = Counter(words1).most_common()

print(len(mc))
print(len(mc2))

Below are samples of the test3 and test4 files. test3:

^{pr2}$

test4:

Essays are generally scholarly pieces of writing giving the author's own argument, but the definition is vague, overlapping with those of an article, a pamphlet and a short story.

Essays can consist of a number of elements, including: literary criticism, political manifestos, learned arguments, observations of daily life, recollections, and reflections of the author. Almost all modern essays are written in prose, but works in verse have been dubbed essays (e.g. Alexander Pope's An Essay on Criticism and An Essay on Man). While brevity usually defines an essay, voluminous works like John Locke's An Essay Concerning Human Understanding and Thomas Malthus's An Essay on the Principle of Population are counterexamples. In some countries (e.g., the United States and Canada), essays have become a major part of formal education. Secondary students are taught structured essay formats to improve their writing skills, and admission essays are often used by universities in selecting applicants and, in the humanities and social sciences, as a way of assessing the performance of students during final exams.

2 answers

The question is ambiguous.

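You might be asking for the most common words across the two files combined, so that, for example, a word that appears 10000 times in file1 and once in file2 counts as appearing 10001 times. In that case: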

mc = Counter(words) + Counter(words1) # or Counter(chain(words, words1))
mos = mc.most_common(5)
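
As a minimal sketch of this first interpretation, the two word lists below are made-up stand-ins for the contents of the files (they are not from the question). It also shows that adding the counters and counting the chained lists give the same totals:

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-ins for the word lists read from the two files
words = ["the", "cat", "the", "dog", "the"]
words1 = ["the", "dog", "dog", "bird"]

# Adding two Counters sums the counts word by word
combined = Counter(words) + Counter(words1)

# Counting the chained lists gives the same totals
combined_chained = Counter(chain(words, words1))

top2 = combined.most_common(2)
print(top2)  # [('the', 4), ('dog', 3)]
```

Note that a word such as "cat", which occurs in only one file, still takes part in this ranking.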

Or you might be asking for the most common words in one file that also appear at least once in the other file:

^{pr2}$
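
The snippet for this case did not survive in the text above. One possible sketch of this interpretation, again with made-up word lists and hypothetical names such as `seen_in_other`, ranks words by their count in the first file only and uses the second file purely as a filter:

```python
from collections import Counter

# Hypothetical stand-ins for the word lists read from the two files
words = ["the", "cat", "the", "dog", "cat", "the",
         "fish", "fish", "fish", "fish"]
words1 = ["the", "dog", "dog", "bird", "cat"]

mc = Counter(words)
seen_in_other = set(words1)

# Keep the first file's counts, but only for words that also occur in the second
filtered = Counter({w: c for w, c in mc.items() if w in seen_in_other})

top2 = filtered.most_common(2)
print(top2)  # [('the', 3), ('cat', 2)]
```

Here "fish" is the most frequent word in the first list, but it is dropped because it never occurs in the second.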

Or the most common words across both files combined, but only counting words that appear at least once in each file:

^{3}$
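
The snippet for this case is also missing from the text above. A minimal sketch of this interpretation, assuming made-up word lists and a hypothetical `in_both` helper set, sums the counts and then keeps only the words present in both counters:

```python
from collections import Counter

# Hypothetical stand-ins for the word lists read from the two files
words = ["the", "cat", "the", "dog", "fish", "fish", "fish", "fish", "fish"]
words1 = ["the", "dog", "dog", "dog", "bird"]

mc, mc1 = Counter(words), Counter(words1)
total = mc + mc1                  # summed counts over both files
in_both = mc.keys() & mc1.keys()  # words that occur in each file

# Rank by combined count, restricted to the words found in both files
shared = Counter({w: total[w] for w in in_both})

top2 = shared.most_common(2)
print(top2)  # [('dog', 4), ('the', 3)]
```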

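There are probably other ways to interpret it. If you can state the rule in unambiguous English, translating it into Python should be easy; if you cannot, it will be impossible.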


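It looks like you are using {cd2} rather than reading the code in the answers. When you call most_common() on a Counter, you get back a list, not a Counter. Just... don't do that; use the code shown here.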
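
A short sketch of why this matters, using a made-up three-word list: the operators used in these answers are defined on Counter objects, so they fail on the list that most_common() returns.

```python
from collections import Counter

# Hypothetical word list, for illustration only
words = ["a", "b", "a"]

counts = Counter(words)        # a Counter, supports +, & and |
ranked = counts.most_common()  # a plain list of (word, count) tuples

print(type(counts).__name__, type(ranked).__name__)  # Counter list

# Set-style operators are not defined for lists, so this raises TypeError
try:
    ranked & ranked
except TypeError as exc:
    error = exc
    print("TypeError:", exc)
```

Keeping the Counter objects around and calling most_common() only on the final result avoids the problem.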

You can simply take the intersection of the Counter objects with the & operator:

mc = Counter(words)
mc2 = Counter(words1)
total=mc&mc2
mos=total.most_common(N)

Example:

^{pr2}$
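
The original example is missing from the text above. As a stand-in, here is a minimal sketch with made-up word lists showing what the intersection keeps:

```python
from collections import Counter

# Hypothetical stand-ins for the word lists read from the two files
mc = Counter(["the", "the", "the", "dog", "dog", "cat"])
mc2 = Counter(["the", "the", "dog", "bird"])

# & keeps only words present in both counters, each with the smaller count
total = mc & mc2

top2 = total.most_common(2)
print(top2)  # [('the', 2), ('dog', 1)]
```

Words that occur in only one of the lists ("cat" and "bird") do not appear in the result.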

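Note, however, that & returns the minimum count between the counters. You can also use the union operator |, which returns the maximum count, and then use a simple dict comprehension to get the maximum counts of the common words: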

^{3}$
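
The original snippet is missing here as well. A sketch of the idea, with made-up counts and a hypothetical name `max_common`, might look like this:

```python
from collections import Counter

# Hypothetical counts standing in for the two files
mc = Counter({"the": 3, "dog": 2, "cat": 1})
mc2 = Counter({"the": 2, "dog": 5, "bird": 1})

total = mc & mc2   # minimum counts, common words only
union = mc | mc2   # maximum counts, every word from either counter

# Restrict the maximum counts to the words that occur in both counters
max_common = {w: c for w, c in union.items() if w in total}

print(max_common)  # {'the': 3, 'dog': 5}
```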

Finally, if you need the summed counts of the common words, you can add your counters together:

>>> s=Counter(max)+t
>>> s
Counter({'t': 10, 'a': 8, 'h': 7})
