NLTK PlaintextCorpusReader raises AssertionError when counting sentences/paragraphs across multiple text files

Posted 2024-10-03 13:16:43

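I am analyzing about 40 text files with PlaintextCorpusReader. The files load without trouble, but when I try to count the sentences/paragraphs/words in each file using .sents(), it raises an AssertionError.

When I skip counting the sentences and words, the code runs and builds a DataFrame perfectly. But I need those counts for my analysis.

Here is the Python code: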

nm_speeches = []  # Empty list to store the content of all speeches
for path in paths:
    corpusReader = nltk.corpus.PlaintextCorpusReader(folder, '.*\.txt')
    number_of_sents = len(corpusReader.sents())
    # opening the files and converting it to a dictionary

The error:

AssertionError                            Traceback (most recent call last)
<ipython-input-44-00f3eb2ee014> in <module>()
      2 for path in paths:
      3     corpusReader = nltk.corpus.PlaintextCorpusReader(folder, '.*\.txt')
----> 4     number_of_sents = len(corpusReader.sents())
      5     # opening the files and converting it to a dictionary
      6     with open(path, encoding="utf-8") as speech_file:

~\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in __len__(self)
    378         if len(self._offsets) <= len(self._pieces):
    379             # Iterate to the end of the corpus.
--> 380             for tok in self.iterate_from(self._offsets[-1]): pass
    381 
    382         return self._offsets[-1]

~\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in iterate_from(self, start_tok)
    400 
    401             # Get everything we can from this piece.
--> 402             for tok in piece.iterate_from(max(0, start_tok-offset)):
    403                 yield tok
    404 

~\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in iterate_from(self, start_tok)
    299                 self.read_block.__name__)
    300             num_toks = len(tokens)
--> 301             new_filepos = self._stream.tell()
    302             assert new_filepos > filepos, (
    303                 'block reader %s() should consume at least 1 byte (filepos=%d)' %

~\Anaconda3\lib\site-packages\nltk\data.py in tell(self)
   1366             check1 = self._incr_decode(self.stream.read(50))[0]
   1367             check2 = ''.join(self.linebuffer)
-> 1368             assert check1.startswith(check2) or check2.startswith(check1)
   1369 
   1370         # Return to our original filepos (so we don't have to throw

AssertionError: 

I expect it to count the number of sentences and run successfully.
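
The assertion fails inside the decoding check in nltk's stream reader, so one possible cause (an assumption, not something the traceback confirms) is that at least one of the files is not valid in the encoding the reader expects. A minimal sketch for narrowing that down, using only the standard library: it tries to decode every matching file and reports the ones that fail. The helper name find_undecodable_files and its parameters are hypothetical, chosen for illustration.

```python
from pathlib import Path


def find_undecodable_files(folder, pattern="*.txt", encoding="utf-8"):
    """Return (file name, error message) pairs for files that fail to decode.

    Hypothetical helper: it only checks whether each file's bytes decode
    cleanly with the given encoding; it does not call nltk at all.
    """
    problems = []
    for path in sorted(Path(folder).glob(pattern)):
        try:
            # Decode the raw bytes strictly so any invalid sequence raises.
            path.read_bytes().decode(encoding)
        except UnicodeDecodeError as exc:
            problems.append((path.name, str(exc)))
    return problems
```

Calling find_undecodable_files(folder) with the same folder passed to PlaintextCorpusReader lists any files worth inspecting. If some turn up, re-saving them as UTF-8, or passing a matching encoding= argument to PlaintextCorpusReader, may be worth trying. If none turn up, the cause is likely something else.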

