How do I find the most frequently occurring sets of word pairs in a file using Python?

"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"

def collect_pairs(file):
    pair_counter = Counter()
    for line in open(file):
        unique_tokens = sorted(set(line))
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    print pair_counter

file = ('myfileComb.txt')
p=collect_pairs(file)

2 answers

User

Answer 1 · edited 2024-10-01 19:31:53

Depending on how large your corpus is, you could start with something like this:

from itertools import combinations
from collections import Counter

def collect_pairs(lines):
    pair_counter = Counter()
    for line in lines:
        # each line must already be a list or tuple of tokens;
        # set() drops duplicates within a line, and sorted() ensures
        # one word of a pair always comes before the other
        unique_tokens = sorted(set(line))
        combos = combinations(unique_tokens, 2)
        pair_counter.update(combos)
    return pair_counter

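The result is a Counter that maps each sorted pair of tokens to the number of lines in which both tokens appear.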
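As a rough illustration, here is the same counting logic applied to the sample rows from the question, with each row already split into a tuple of tokens. The name `sample_rows` is made up for this sketch, and the function is repeated so that the snippet runs on its own.

```python
from collections import Counter
from itertools import combinations


def collect_pairs(lines):
    """Count how many lines contain each unordered pair of tokens."""
    pair_counter = Counter()
    for line in lines:
        unique_tokens = sorted(set(line))
        pair_counter.update(combinations(unique_tokens, 2))
    return pair_counter


# Sample rows from the question, already parsed into tuples (assumption).
sample_rows = [
    ("485", "AlterNet", "Statistics", "Estimation", "Narnia", "Two and half men"),
    ("717", "I like Sheen", "Narnia", "Statistics", "Estimation"),
    ("633", "MachineLearning", "AI", "I like Cars, but I also like bikes"),
    ("717", "I like Sheen", "MachineLearning", "regression", "AI"),
    ("136", "MachineLearning", "AI", "TopGear"),
]

pair_counts = collect_pairs(sample_rows)
print(pair_counts.most_common(1))  # [(('AI', 'MachineLearning'), 3)]
```

The pair that appears on the most lines comes first. Every pair is stored in sorted order, so ('AI', 'MachineLearning') and its reverse are never counted separately.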

Do you want to include the numbers in these combinations? You didn't specifically mention excluding them, so I've included them here.
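If the leading numbers turn out to be record IDs that should not take part in the pairs, one possible variant filters them out before building the combinations. This is only a sketch: it assumes that "number" means a token made up entirely of digits, and the function name `collect_pairs_no_ids` is made up for illustration.

```python
from collections import Counter
from itertools import combinations


def collect_pairs_no_ids(rows):
    """Count token pairs per row, ignoring tokens that consist only of digits."""
    pair_counter = Counter()
    for row in rows:
        # Assumption: purely numeric tokens such as "717" are IDs, not words.
        tokens = sorted({token for token in row if not token.isdigit()})
        pair_counter.update(combinations(tokens, 2))
    return pair_counter


rows = [
    ("717", "I like Sheen", "Narnia"),
    ("717", "I like Sheen", "AI"),
]
counts = collect_pairs_no_ids(rows)
print(counts)
```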

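Edit: using a file object

The function you posted in your first attempt above is very close to working. You only need to turn each line (a string) into a tuple or list. Assuming your data looks exactly like what you posted above (every term wrapped in quotes and separated by commas), I'd suggest a simple fix: use ast.literal_eval. (Otherwise, you may need some kind of regular expression.) See below for a modified version that uses ast.literal_eval: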

from itertools import combinations
from collections import Counter
import ast

def collect_pairs(file_name):
    pair_counter = Counter()
    with open(file_name) as f:  # the file is closed when the block ends
        for line in f:  # each line is simply one long string; you need a list or tuple
            if not line.strip():
                continue  # skip blank lines, which literal_eval cannot parse
            # literal_eval turns the quoted, comma-separated line into a tuple;
            # set() drops duplicates and sorted() fixes the order within each pair
            unique_tokens = sorted(set(ast.literal_eval(line.strip())))
            pair_counter.update(combinations(unique_tokens, 2))
    return pair_counter  # return the actual Counter object

Now you can test it like this:

file_name = 'myfileComb.txt'
p = collect_pairs(file_name)
print(p.most_common(10))  # for example
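If the lines are not valid Python literals, the standard library's csv module may be a simpler alternative to a regular expression. This is a swapped-in technique rather than what the answer itself uses. The sketch below assumes double-quoted, comma-separated fields with optional spaces after the commas, as in the sample data. The name `collect_pairs_csv` is made up for illustration.

```python
import csv
from collections import Counter
from itertools import combinations


def collect_pairs_csv(lines):
    """Count token pairs per line, parsing each line with csv.reader."""
    pair_counter = Counter()
    # skipinitialspace=True ignores the blank after a comma, so the
    # quote that follows it is still treated as the start of a quoted field.
    for row in csv.reader(lines, skipinitialspace=True):
        tokens = sorted(set(row))
        pair_counter.update(combinations(tokens, 2))
    return pair_counter


sample_lines = [
    '"717","I like Sheen", "Narnia", "Statistics", "Estimation"',
    '"633","MachineLearning","AI","I like Cars, but I also like bikes"',
]
counts = collect_pairs_csv(sample_lines)
print(counts.most_common(3))
```

csv.reader accepts any iterable of strings, so an open file object works as well; the csv documentation recommends opening such files with newline="". Commas inside a quoted field, as in "I like Cars, but I also like bikes", stay part of that field.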

User

Answer 2 · edited 2024-10-01 19:31:53

There is not much you can do other than count all the pairs.

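The obvious optimizations are to remove duplicate words and synonyms as early as possible, to apply stemming (anything that reduces the number of distinct tokens helps!), and to count only pairs (a, b) where a < b (in your example, count either statistics,narnia or the reversed pair, but not both!).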
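The token-reduction step could look roughly like the sketch below. It only lower-cases and trims tokens and applies a small synonym table. The `SYNONYMS` mapping and the function names are hypothetical. A real stemmer from an NLP library could be called inside `normalize` in the same place.

```python
from collections import Counter
from itertools import combinations

# Hypothetical synonym table: maps a variant to its canonical token.
SYNONYMS = {"ml": "machinelearning"}


def normalize(token):
    """Lower-case and trim a token, then map known synonyms to one form."""
    token = token.strip().lower()
    return SYNONYMS.get(token, token)


def collect_normalized_pairs(rows):
    """Count pairs (a, b) with a < b after normalizing every token."""
    pair_counter = Counter()
    for row in rows:
        tokens = sorted({normalize(token) for token in row})
        # combinations() of a sorted sequence only yields pairs with a < b.
        pair_counter.update(combinations(tokens, 2))
    return pair_counter


rows = [("MachineLearning", "AI"), ("ML", "ai ")]
counts = collect_normalized_pairs(rows)
print(counts)
```

Both rows collapse to the same pair once the tokens are normalized, so fewer distinct pairs have to be kept in memory.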

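If you run out of memory, make two passes. In the first pass, use one or more hash functions to build a candidate filter. In the second pass, count only the words that pass this filter (MinHash/LSH-style filtering).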
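One possible reading of the two-pass idea is sketched below. The first pass hashes every pair into a fixed-size array of bucket counters, and the second pass keeps exact counts only for pairs whose bucket reached a threshold. This is an interpretation, not necessarily what the answer had in mind. The names `frequent_pairs`, `min_count` and `num_buckets` are made up, and `rows` is assumed to be something that can be iterated twice, such as a list.

```python
import zlib
from collections import Counter
from itertools import combinations


def _bucket(pair, num_buckets):
    """Map a pair of tokens to a bucket index with a stable hash."""
    key = "\x00".join(pair).encode("utf-8")
    return zlib.crc32(key) % num_buckets


def _pairs(row):
    """Yield each unordered pair of distinct tokens in a row exactly once."""
    return combinations(sorted(set(row)), 2)


def frequent_pairs(rows, min_count=2, num_buckets=1 << 16):
    """Return exact counts for pairs that occur in at least min_count rows."""
    # Pass 1: fixed-size memory, one integer per bucket instead of per pair.
    bucket_counts = [0] * num_buckets
    for row in rows:
        for pair in _pairs(row):
            bucket_counts[_bucket(pair, num_buckets)] += 1

    # Pass 2: a pair can only be frequent if its bucket is frequent.
    pair_counter = Counter()
    for row in rows:
        for pair in _pairs(row):
            if bucket_counts[_bucket(pair, num_buckets)] >= min_count:
                pair_counter[pair] += 1

    # Hash collisions can let rare pairs through, so filter once more.
    return Counter({p: c for p, c in pair_counter.items() if c >= min_count})


rows = [
    ("485", "AlterNet", "Statistics", "Estimation", "Narnia", "Two and half men"),
    ("717", "I like Sheen", "Narnia", "Statistics", "Estimation"),
    ("633", "MachineLearning", "AI", "I like Cars, but I also like bikes"),
    ("717", "I like Sheen", "MachineLearning", "regression", "AI"),
    ("136", "MachineLearning", "AI", "TopGear"),
]
result = frequent_pairs(rows, min_count=2)
print(result.most_common())
```

The filter can produce false positives but never false negatives, because a bucket's count is always at least the count of any single pair that hashes into it.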

This is an embarrassingly parallel problem, so it is also easy to distribute across multiple threads or machines.
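A minimal sketch of that split-and-merge structure follows: each worker counts the pairs in its own chunk of rows, and the partial Counters are summed at the end. The function names and the `chunk_size` parameter are made up for illustration. Threads are used here only to keep the example simple. In standard CPython the global interpreter lock generally prevents threads from speeding up CPU-bound work, so a process pool or separate machines would be the more likely choice in practice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def count_chunk(rows):
    """Count token pairs for one chunk of rows."""
    counter = Counter()
    for row in rows:
        counter.update(combinations(sorted(set(row)), 2))
    return counter


def count_pairs_parallel(rows, workers=4, chunk_size=1000):
    """Split rows into chunks, count each chunk in a worker, merge results."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)  # Counter.update adds counts together
    return total


rows = [
    ("633", "MachineLearning", "AI", "I like Cars, but I also like bikes"),
    ("717", "I like Sheen", "MachineLearning", "regression", "AI"),
    ("136", "MachineLearning", "AI", "TopGear"),
]
totals = count_pairs_parallel(rows, workers=2, chunk_size=2)
print(totals.most_common(1))  # [(('AI', 'MachineLearning'), 3)]
```

Because each row is counted independently, the merged totals are the same as those of a single sequential pass.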
