NLTK custom categorized corpus is not reading files

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader

mr = CategorizedPlaintextCorpusReader('C:\mycorpus', r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')
len(mr.categories())

2 Answers

User

#1 · edited 2024-09-27 07:27:19

I think this part of your code looks a bit odd:

cat_pattern=r'(neg|pos)/.*'

You are on an MS-DOS-based system (Windows, I guess), where folders are separated with \ rather than / (unless I have misunderstood the question).

User

#2 · edited 2024-09-27 07:27:19

I am on Linux, and the following modification of your code (using toy corpus files) works correctly for me:

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader

import os


mr = CategorizedPlaintextCorpusReader(
    '/home/ely/programming/nltk-test/mycorpus',
    r'(?!\.).*\.txt',
    cat_pattern=os.path.join(r'(neg|pos)', '.*')
)

print(len(mr.categories()))

This suggests that, on Windows, the problem is the / used as the file system separator inside the cat_pattern string.

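Using os.path.join as in my example (or pathlib, if you are on Python 3) would be a good workaround: the pattern no longer depends on the operating system, and you avoid mixing regex escape backslashes with file system separators.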
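To see what os.path.join would actually hand to the reader on each platform, here is a small sketch that uses the standard library's posixpath and ntpath modules (the POSIX and Windows implementations behind os.path) so both results can be printed on any machine. The file ID 'neg/review1.txt' is a made-up example, and the idea that the category comes from the first capture group of re.match is an assumption about how the pattern is applied, not something confirmed in this thread.

```python
import ntpath      # Windows path rules; importable on any OS
import posixpath   # POSIX path rules; importable on any OS
import re


def build_cat_pattern(path_module):
    """Join the category group and the wildcard the way os.path.join would."""
    return path_module.join(r'(neg|pos)', '.*')


posix_pattern = build_cat_pattern(posixpath)
windows_pattern = build_cat_pattern(ntpath)

print(posix_pattern)
print(windows_pattern)

# Hypothetical file ID, written with a forward slash.
sample_fileid = 'neg/review1.txt'

for pattern in (posix_pattern, windows_pattern):
    match = re.match(pattern, sample_fileid)
    # Assumption: the category is whatever the first capture group matched.
    print(pattern, '->', match.group(1) if match else None)
```

In the Windows-style result the backslash lands directly in front of a dot, so re reads it as an escaped literal dot rather than as a path separator. Printing the pattern before passing it to the reader makes that easy to spot.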

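In fact, you can use this approach wherever a file system separator appears in an argument string. It is generally a good habit: it keeps the code portable and avoids hand-built path strings.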
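As one way to apply that habit, the sketch below builds a throwaway toy corpus with pathlib and checks which category a pattern would assign to each file. It uses only the standard library, not NLTK itself. The directory and file names are made up for illustration, and matching the pattern against forward-slash relative paths is an assumption about how file IDs are represented.

```python
import re
import tempfile
from pathlib import Path

# Build a throwaway toy corpus: <tmp>/mycorpus/{neg,pos}/*.txt (hypothetical names).
root = Path(tempfile.mkdtemp()) / 'mycorpus'
for category, name in [('neg', 'bad.txt'), ('pos', 'good.txt')]:
    (root / category).mkdir(parents=True, exist_ok=True)
    (root / category / name).write_text('toy review text', encoding='utf-8')

# Relative paths rendered with forward slashes on every OS.
fileids = sorted(p.relative_to(root).as_posix() for p in root.rglob('*.txt'))
print(fileids)

# Apply the question's category pattern to each relative path.
cat_pattern = r'(neg|pos)/.*'
categories = set()
for fileid in fileids:
    match = re.match(cat_pattern, fileid)
    if match:
        categories.add(match.group(1))
print(sorted(categories))
```

If the printed list of relative paths is empty, the root path or the file pattern is the more likely culprit than cat_pattern. The same root could then be passed to the reader as str(root).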
