Wrong number of items when creating a list

from nltk.corpus import PlaintextCorpusReader
corpus = PlaintextCorpusReader('C:\CorpusData\Polit_Speeches_by_Gender_POS', '.*\.txt')
documents = [(list(ngrams(corpus.words(fileid), 2)), gender) for gender in [f[47] for f in corpus.fileids()] for fileid in corpus.fileids()]

1 Answer

User

#1 · Posted 2024-06-26 00:16:05

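If I am reading your program correctly, you want one tuple per document: the document's list of bigrams together with its "gender", which is the character at index 47 of the fileid.

The list comprehension that builds documents has two for clauses: the first iterates over the inner list comprehension (the gender labels) and the second iterates over corpus.fileids(). When a Python list comprehension has two for clauses, it runs through the entire second iterable for every value of the first, so with N files you get N × N items instead of N. A small example shows this: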

>>> print([(a, b) for a in [1, 2] for b in [1, 2]])
[(1, 1), (1, 2), (2, 1), (2, 2)]

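In this case the double iteration can be avoided by taking index 47 directly from each file ID as it comes out of corpus.fileids(), using fileid[47] in place of the separate f[47] loop. That way each fileid is visited exactly once and is paired with its own label: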

documents = [(list(ngrams(corpus.words(fileid), 2)), fileid[47]) for fileid in corpus.fileids()]
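The difference in item counts can be checked without the corpus on disk. The following is only a minimal sketch: the file IDs are made up, and the label is assumed to sit at index 0 of each hypothetical ID (the real code reads it from index 47).

```python
# Hypothetical file IDs for illustration only; the label is assumed to be
# at index 0 here, whereas the original code reads it from index 47.
fileids = ["f_speech1.txt", "m_speech2.txt", "f_speech3.txt"]

# Two `for` clauses: every label is paired with every file ID.
crossed = [(fileid, gender)
           for gender in [f[0] for f in fileids]
           for fileid in fileids]

# One `for` clause: each file ID is paired only with its own label.
paired = [(fileid, fileid[0]) for fileid in fileids]

print(len(crossed))  # 9
print(len(paired))   # 3
```

With three files the two-clause version yields nine items, and most of them carry a label taken from a different file.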

The whole program then becomes quite simple:

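from nltk.util import ngrams
from nltk.corpus import PlaintextCorpusReader
# Raw strings keep the backslashes in the Windows path and in the regex literal.
corpus = PlaintextCorpusReader(r'C:\CorpusData\Polit_Speeches_by_Gender_POS', r'.*\.txt')
# One (bigrams, gender) tuple per file; the gender label is index 47 of the file ID.
documents = [(list(ngrams(corpus.words(fileid), 2)), fileid[47]) for fileid in corpus.fileids()]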
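To see roughly what a single entry of documents looks like, here is a small pure-Python sketch. It is an illustration under assumptions, not the NLTK code itself: the token list and file ID are invented, the label is assumed to be at index 0 of the hypothetical ID, and zip over adjacent tokens stands in for ngrams(..., 2).

```python
# Invented tokens and file ID; in the real program the tokens come from
# corpus.words(fileid) and the label from fileid[47].
tokens = ["we", "shall", "not", "fail"]
fileid = "f_speech1.txt"

# Pair each token with its successor. This stands in for ngrams(tokens, 2),
# which also yields consecutive pairs.
bigrams = list(zip(tokens, tokens[1:]))

# One entry: (list of bigrams, label).
entry = (bigrams, fileid[0])
print(entry)
# ([('we', 'shall'), ('shall', 'not'), ('not', 'fail')], 'f')
```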
