Find word frequency based on the number of documents each word occurs in

Desired output:

{'This': 2, 'is': 3, 'my': 2, 'pen': 3}
{'That': 1, 'is': 3, 'his': 1, 'pen': 3}
{'This': 2, 'is': 3, 'not': 1, 'my': 2, 'pen': 3}

Current attempt:

for sent in documents:
    for word in sent.split():
        if word in sent:
            windoc = dict(Counter(sent.split()))
            print(windoc)

3 Answers

User

Answer 1 · Edited 2024-09-30 01:26:10

Taking into account that each word must be counted at most once per document:

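import collections

data = ["This is my pen my pen my pen", "That is his pen", "This is not my pen"]
# Deduplicate the words of each document lazily; a generator can be consumed only once.
deduped = (set(d.split()) for d in data)
# Document frequency: the number of documents each word appears in.
freq = collections.Counter(w for d in deduped for w in d)
# deduped is exhausted at this point, so split each document again.
result = [{w: freq[w] for w in set(d.split())} for d in data]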

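You need to deduplicate the words of each document first (see deduped above). I made deduped a generator to avoid building an intermediate list of sets, but it still produces one intermediate word set per document.

Alternatively, you could implement your own counter. Rolling your own counter is generally not a good idea, but it may be worth it if memory consumption really matters and you want to avoid the intermediate sets created while iterating over the deduped generator.

Either way, time and memory complexity are both linear in the total number of words.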

Output:

[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
 {'That': 1, 'his': 1, 'is': 3, 'pen': 3},
 {'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]
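
The hand-rolled counter mentioned above could look roughly like the following. This is only a sketch, not the answerer's code: the function name document_frequency and the last_seen bookkeeping are assumptions made for illustration. Instead of building a set per document, it remembers the index of the last document in which each word was counted. Whether this actually saves memory depends on the data, since last_seen grows with the vocabulary and split() still builds a token list per document.

```python
def document_frequency(documents):
    """Count how many documents contain each word, without a set per document."""
    freq = {}
    last_seen = {}  # word -> index of the last document in which it was counted
    for i, doc in enumerate(documents):
        for word in doc.split():
            # Count the word only the first time it shows up in document i.
            if last_seen.get(word) != i:
                last_seen[word] = i
                freq[word] = freq.get(word, 0) + 1
    return freq


data = ["This is my pen my pen my pen", "That is his pen", "This is not my pen"]
freq = document_frequency(data)
# Repeated words in a document simply overwrite the same key.
result = [{w: freq[w] for w in d.split()} for d in data]
print(result)
```

It should produce the same per-document dictionaries as the output above, with keys in order of first appearance rather than sorted.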

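User

Answer 2 · Edited 2024-09-30 01:26:10

You can build a dictionary from all the available sentences to hold the word frequencies, and then construct the desired output from it. Here is a working example:

Given the input documents: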

In [1]: documents 
Out[1]: ['This is my pen', 'That is his pen', 'This is not my pen']

Build the word-frequency dictionary:

In [2]: d = {}
    ...: for sent in documents:
    ...:     for word in set(sent.split()):    
    ...:         d[word] = d.get(word, 0) + 1
    ...:

Then construct the desired output:

In [3]: result = []
    ...: for sent in documents:
    ...:     result.append({word: d[word] for word in sent.split()})
    ...:     

In [4]: result 
Out[4]: 
[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
 {'That': 1, 'his': 1, 'is': 3, 'pen': 3},
 {'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]

So, overall, the code looks like this:

documents = ['This is my pen', 'That is his pen', 'This is not my pen']
d = {}
# construct the words frequencies dictionary
for sent in documents:
    for word in set(sent.split()):    
        d[word] = d.get(word, 0) + 1

# format the output in the desired format
result = []
for sent in documents:
    result.append({word: d[word] for word in sent.split()})
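
If the same steps are needed in more than one place, they could be wrapped in a function. The sketch below is one possible packaging, not part of the original answer; the name words_with_document_frequency is made up for illustration. The usage example includes a repeated word to show that set() keeps it from being counted twice within one sentence.

```python
def words_with_document_frequency(documents):
    """Return one dict per sentence mapping each word to the number of sentences containing it."""
    d = {}
    # Build the word-frequency dictionary, counting each word once per sentence.
    for sent in documents:
        for word in set(sent.split()):
            d[word] = d.get(word, 0) + 1
    # Format the output: one dictionary per sentence.
    return [{word: d[word] for word in sent.split()} for sent in documents]


# "a" occurs twice in the first sentence but is counted once for it.
print(words_with_document_frequency(["a b a", "b c"]))
# [{'a': 1, 'b': 2}, {'b': 2, 'c': 1}]
```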

User

Answer 3 · Edited 2024-09-30 01:26:10

from collections import Counter

data = ["This is my pen is is","That is his pen pen pen pen","This is not my pen"]

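# Collect each word once per sentence, so repeats within a sentence are not counted.
d = []
for s in data:
    for word in set(s.split()):
        d.append(word)

# wordCount[word] is the number of sentences that contain word.
wordCount = Counter(d)
for item in data:
    result = {}
    for word in item.split():
        result[word] = wordCount[word]
    print(result)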

Output:

{'This': 2, 'is': 3, 'my': 2, 'pen': 3}
{'That': 1, 'is': 3, 'his': 1, 'pen': 3}
{'This': 2, 'is': 3, 'not': 1, 'my': 2, 'pen': 3}
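
To see why the set() call matters for this input, the short comparison below counts the same data both ways. It is an illustrative sketch; the names raw_counts and doc_freq are not from the answer.

```python
from collections import Counter

data = ["This is my pen is is", "That is his pen pen pen pen", "This is not my pen"]

# Counting every token: repeated words inside one sentence inflate the totals.
raw_counts = Counter(word for s in data for word in s.split())

# Counting each word once per sentence: the number of sentences containing it.
doc_freq = Counter(word for s in data for word in set(s.split()))

print(raw_counts["is"], doc_freq["is"])    # 5 3
print(raw_counts["pen"], doc_freq["pen"])  # 6 3
```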
