Finding word frequency by the number of documents a word occurs in

Posted 2024-09-30 01:26:10


For each word, I need the number of documents in which that word occurs.

Example:

data = ["This is my pen","That is his pen","This is not my pen"]

Expected output:

{'This':2,'is': 3,'my': 2,'pen':3}
{'That':1,'is': 3,'his': 1,'pen':3}
{'This':2,'is': 3,'not': 1,'my': 2,'pen':3}

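My attempt:

from collections import Counter

for sent in data:
    for word in sent.split():
        if word in sent:
            # Counts words within this one sentence only, not across documents
            windoc = dict(Counter(sent.split()))
            print(windoc)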

3 answers

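Counting each word at most once per document:

import collections

data = ["This is my pen my pen my pen","That is his pen","This is not my pen"]
# Materialize the per-document word sets as a list: they are iterated twice below
deduped = [set(d.split()) for d in data]
# Document frequency: the number of documents that contain each word
freq = collections.Counter(w for d in deduped for w in d)
result = [{w: freq[w] for w in d} for d in deduped]

You first need to deduplicate the words within each document (see deduped above). deduped has to be a list rather than a generator because it is iterated twice: a generator would be exhausted after building freq, which would leave result empty. The cost is one intermediate set of words per document.

Alternatively, you can implement your own counter. Writing your own counter is generally not a good idea, but it may be worth doing if memory consumption really matters and you want to avoid the intermediate sets held in deduped.

Either way, time and memory complexity are linear in the total number of words.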

Output (key order may differ, since sets are unordered):

[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
 {'That': 1, 'his': 1, 'is': 3, 'pen': 3},
 {'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]
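
The hand-rolled counter mentioned in this answer could look something like the sketch below. It is not part of the original answer: the function name document_frequency and the last_seen dictionary are hypothetical names chosen for illustration. Instead of building a set per document, it remembers the index of the last document in which each word was counted. Note that last_seen is itself an extra dictionary, so whether this actually saves memory depends on the data.

```python
def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    freq = {}       # word -> number of documents containing it
    last_seen = {}  # word -> index of the last document where it was counted
    for i, doc in enumerate(docs):
        for word in doc.split():
            # Count a word only the first time it shows up in document i
            if last_seen.get(word) != i:
                last_seen[word] = i
                freq[word] = freq.get(word, 0) + 1
    return freq


data = ["This is my pen my pen my pen", "That is his pen", "This is not my pen"]
freq = document_frequency(data)
# Repeated words in a document collapse to a single key in the dict
result = [{w: freq[w] for w in doc.split()} for doc in data]
print(result)
```

Because the per-document dictionaries are built from doc.split() rather than from a set, their keys follow the word order of each sentence.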

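You can build a dictionary from all the available sentences that records, for each word, how many sentences contain it, and then construct the desired output from it. Here is a working example:

Given the input documents: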

In [1]: documents 
Out[1]: ['This is my pen', 'That is his pen', 'This is not my pen']

Build the word frequency dictionary:

In [2]: d = {}
    ...: for sent in documents:
    ...:     for word in set(sent.split()):    
    ...:         d[word] = d.get(word, 0) + 1
    ...: 

Then construct the desired output:

In [3]: result = []
    ...: for sent in documents:
    ...:     result.append({word: d[word] for word in sent.split()})
    ...:     

In [4]: result 
Out[4]: 
[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
 {'That': 1, 'his': 1, 'is': 3, 'pen': 3},
 {'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]

So, altogether, the code looks like this:

documents = ['This is my pen', 'That is his pen', 'This is not my pen']
d = {}
# construct the words frequencies dictionary
for sent in documents:
    for word in set(sent.split()):    
        d[word] = d.get(word, 0) + 1

# format the output in the desired format
result = []
for sent in documents:
    result.append({word: d[word] for word in sent.split()})

from collections import Counter

data = ["This is my pen is is","That is his pen pen pen pen","This is not my pen"]

d = []
for s in data:
    for word in set(s.split()):
        d.append(word)

wordCount = Counter(d)
for item in data:
    result = {}
    for word in item.split():
        result[word] = wordCount[word]
    print(result)

Output:

{'This': 2, 'is': 3, 'my': 2, 'pen': 3}
{'That': 1, 'is': 3, 'his': 1, 'pen': 3}
{'This': 2, 'is': 3, 'not': 1, 'my': 2, 'pen': 3}
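
For reuse, the same idea can be wrapped in a function. This is only a sketch under the same assumptions as the answers here (words are split on whitespace and compared case-sensitively); the name per_document_frequencies is hypothetical and does not appear in any of the answers. It relies on Counter.update, which adds one to the count of each element of the iterable it is given.

```python
from collections import Counter


def per_document_frequencies(docs):
    """Return one dict per document, mapping each of its words to the
    number of documents that contain that word."""
    doc_freq = Counter()
    for doc in docs:
        # A set ensures each word is counted at most once per document
        doc_freq.update(set(doc.split()))
    return [{word: doc_freq[word] for word in doc.split()} for doc in docs]


data = ["This is my pen is is", "That is his pen pen pen pen", "This is not my pen"]
rows = per_document_frequencies(data)
for row in rows:
    print(row)
```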
