Getting rid of .DS_Store files while creating a corpus from scratch

import os
import random
import string

import nltk
nltk.download('cmudict')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

from nltk import word_tokenize
from nltk.corpus import cmudict
from nltk.corpus import stopwords
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.stem.wordnet import WordNetLemmatizer

corpusdir = '/Users/username/nltk_data/corpusfilename'
corp = PlaintextCorpusReader(corpusdir, '.*')
corp.fileids()  # gives me 6 fileids, 5 existing and one .DS_Store
corp.sents()  # error: 'utf-8' codec can't decode byte 0xd5 in position 161: invalid continuation byte

2 answers

User

Answer 1 · edited 2024-09-27 21:24:55


find . -name ".DS_Store" -delete


The command above deletes every .DS_Store file in the current directory and its subdirectories.
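If you would rather do the cleanup from Python before building the corpus, the same idea can be sketched with the standard library. This is only a rough equivalent of the `find` command, not part of the original answer; the function name `delete_ds_store` is made up for illustration.

```python
import os


def delete_ds_store(root):
    """Delete every file named .DS_Store under root; return the removed paths.

    Hypothetical helper, roughly mirroring: find . -name ".DS_Store" -delete
    """
    removed = []
    # os.walk visits root and every subdirectory below it.
    for dirpath, _dirnames, filenames in os.walk(root):
        if ".DS_Store" in filenames:
            target = os.path.join(dirpath, ".DS_Store")
            os.remove(target)
            removed.append(target)
    return removed
```

You would call it as `delete_ds_store(corpusdir)` before creating the `PlaintextCorpusReader`. Keep in mind that Finder may recreate the file later, so filtering by file name pattern (see the other answer) is the more durable fix.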

User

Answer 2 · edited 2024-09-27 21:24:55

From Wikipedia:

In the Apple macOS operating system, .DS_Store is a file that stores custom attributes of its containing folder, such as the position of icons or the choice of a background image.

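So a .DS_Store file can turn up in any folder. Because it is not a plain-text file, trying to decode it as UTF-8 is the likely cause of the error you get from corp.sents().

In this line: corp = PlaintextCorpusReader(corpusdir, '.*') you can choose which files end up in the corpus.

The second argument, '.*', is a regular expression that selects the files to use. According to the documentation, this argument can be "a list or regexp specifying the fileids in this corpus".

So in your case you can change '.*', which matches everything, to '.*\.txt', which matches any run of characters followed by a literal "." and "txt". Alternatively, if you know the name of every file you need, you can pass a list of file names, such as ['file1.txt', 'file2.txt'].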
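To see what the two patterns would select, here is a small standard-library sketch. It assumes the pattern has to match the whole file ID, and the file names in `names` (apart from `.DS_Store`) are invented for the example; it only imitates the selection step and does not call NLTK.

```python
import re


def select_fileids(names, pattern):
    """Return the names that the pattern matches in full (illustrative helper)."""
    return [name for name in names if re.fullmatch(pattern, name)]


# Hypothetical directory listing; only .DS_Store comes from the question.
names = [".DS_Store", "file1.txt", "file2.txt", "notes.txt.bak"]

print(select_fileids(names, r".*"))       # every name, including .DS_Store
print(select_fileids(names, r".*\.txt"))  # ['file1.txt', 'file2.txt']
```

With the pattern checked, the reader call from the question would become corp = PlaintextCorpusReader(corpusdir, r'.*\.txt'). The raw-string prefix keeps the backslash from being treated as a string escape.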
