Creating dicts from a larger corpus

Posted 2024-10-02 00:29:35

Male | A programmer who enjoys writing Python code.

I have 30,000 messages:

corpus = [
    "hello world", 
    "i like mars", 
    "a planet called venus", 
    ... , 
    "it's all pcj500"]

I have tokenized them and built a word_set containing all the unique words:

word_lists = [text.split(" ") for text in corpus]
>>> [['hello', 'world'],
    ['i', 'like', 'mars'],
    ['a', 'planet', 'called', 'venus'],
    ...,
    ["it's", 'all', 'pcj500']]

word_set = set().union(*word_lists)
>>> {'hello', 'world', 'i', 'like', ..., 'pcj500'}

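I am trying to create a list of dictionaries in which:
1. each word in word_set is a key, with an initial count of 0 as its value
2. if a word in word_set appears in a word_list in word_lists, its value is the corresponding count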

For step 1, this is what I did:

tmp = corpus[:10]
word_dicts = []
for i in range(len(tmp)):
    word_dicts.append(dict.fromkeys(list(word_set)[:30], 0))

word_dicts
>>> [{'hello': 0,
  'world': 0,
  'mars': 0,
  'venus': 0,
  'explore': 0,
  'space': 0}]

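Question:

How can I perform the dict.fromkeys operation over all items in word_set for every text in the corpus? With the whole corpus I run out of memory. There should be a better way, but I cannot find it myself.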

Tags: in hello world it corpus lists like word

1 answer

User
#1 · Posted 2024-10-02 00:29:35

You can use defaultdict or Counter from collections, which create keys lazily instead of storing a zero for every word up front. For example:
from collections import Counter

word_dicts = []
for words_list in word_lists:
    word_dicts.append(Counter(words_list))
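
To make the memory argument concrete, here is a minimal sketch of the Counter approach. The three-message corpus and the `dense_row` helper are hypothetical and not part of the original question. Each Counter stores only the words that actually occur in its message, and a full row of zeros is built only for the one message that needs it.

```python
from collections import Counter

# Hypothetical three-message corpus standing in for the 30,000 messages.
corpus = ["hello world", "i like mars", "hello hello venus"]

word_lists = [text.split(" ") for text in corpus]
word_set = set().union(*word_lists)

# One sparse Counter per message: only the words that occur are stored.
word_dicts = [Counter(words) for words in word_lists]

# A missing word reads as 0, and the lookup does not insert the key.
assert word_dicts[0]["mars"] == 0
assert "mars" not in word_dicts[0]


def dense_row(counts, vocabulary):
    """Expand one Counter into a full {word: count} dict (hypothetical helper)."""
    return {word: counts[word] for word in vocabulary}


# Build a dense row on demand instead of keeping 30,000 of them in memory.
vocabulary = sorted(word_set)
first_row = dense_row(word_dicts[0], vocabulary)
```

Counter is probably the better fit here if the counts are only read afterwards: looking up a missing key in a `defaultdict(int)` inserts that key, so the dict grows on every miss, whereas Counter returns 0 without storing anything.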
