<p>@annabanazzi provided a code snippet here: <a href="https://gist.github.com/sloria/6407257" rel="nofollow noreferrer">https://gist.github.com/sloria/6407257</a></p>
<pre><code>import os, glob
from textblob import TextBlob as tb  # assumed import; the snippet uses tb without defining it

folder = "/path/to/folder/"
os.chdir(folder)
files = glob.glob("*.txt")  # Makes a list of all .txt files in folder
bloblist = []
for file1 in files:
    with open(file1, 'r') as f:
        data = f.read()  # Reads document content into a string
        document = tb(data.decode("utf-8"))  # Makes TextBlob object (Python 2: bytes to unicode)
        bloblist.append(document)
</code></pre>
<p>I adapted it for my own use under Python 3.</p>
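<p>A minimal sketch of what a Python 3 adaptation of the snippet above could look like. This is an illustration under stated assumptions, not the exact code used: the function name <code>load_documents</code> and its <code>make_doc</code> parameter are hypothetical. The key difference from Python 2 is that a file opened in text mode already returns <code>str</code>, so the <code>.decode("utf-8")</code> call goes away.</p>

```python
import glob
import os


def load_documents(folder, pattern="*.txt", make_doc=str):
    """Read every file in `folder` matching `pattern` and wrap its text with `make_doc`.

    `load_documents` and `make_doc` are hypothetical names used for illustration.
    """
    documents = []
    # sorted() gives a stable order; glob itself returns files in arbitrary order
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        # In Python 3, text mode decodes the bytes for us, so no .decode("utf-8")
        with open(path, "r", encoding="utf-8") as f:
            documents.append(make_doc(f.read()))
    return documents
```

<p>Passing <code>make_doc=tb</code> (after <code>from textblob import TextBlob as tb</code>) should produce TextBlob objects, as in the original snippet; the default <code>str</code> simply keeps the raw text.</p>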
<hr/>
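<p><strong>Update 1:</strong></p>
<p>I have personally had trouble with the Python glob module, because I often (i) work with filenames that have no extension (e.g. 01) and (ii) want to recurse into nested directories.</p>
<p>At first glance, the "glob" approach looks like a simple solution. However, when I iterate over the paths that glob returns, I often hit errors such as</p>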
<pre><code>IsADirectoryError: [Errno 21] Is a directory: ...
</code></pre>
<p>when the loop reaches a directory name (rather than a filename) returned by glob.</p>
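<p>For reference, glob can be made to cope with both issues, at the cost of an explicit filter. The sketch below is one possible way to do it; <code>list_files</code> is a hypothetical helper name. It relies on <code>os.path.isfile</code> to drop directories and on the <code>recursive=True</code> flag (Python 3.5+) to descend into subdirectories.</p>

```python
import glob
import os


def list_files(folder, pattern="*", recursive=False):
    """Return only regular files matched by glob, skipping directories.

    `list_files` is a hypothetical helper, not part of the glob module.
    """
    if recursive:
        # "**" matches any depth of subdirectories when recursive=True
        search = os.path.join(folder, "**", pattern)
    else:
        search = os.path.join(folder, pattern)
    matches = glob.glob(search, recursive=recursive)
    # glob matches names, not file types, so directories must be filtered out
    return sorted(p for p in matches if os.path.isfile(p))
```

<p>Note that glob patterns such as <code>*</code> do not match names starting with a dot, so hidden files are skipped by this sketch.</p>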
<p>In my opinion, the following approach is more robust and takes only a little more effort:</p>
<pre><code>import os

bloblist = []

def make_corpus(input_dir):
    # os.walk lists directories and files separately, so only files get opened
    for root, subdirs, files in os.walk(input_dir):
        for filename in files:
            path = os.path.join(root, filename)
            print('file:', path)
            with open(path) as f:
                for line in f:
                    # print(line, end='')
                    bloblist.append(line)
    # print('bloblist:\n', bloblist)
    print('len(bloblist):', len(bloblist), '\n')

make_corpus('input')  ## 'input' = input dir
</code></pre>
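<p>The same traversal can also be sketched with <code>pathlib</code>, returning the result instead of appending to a module-level list. This is an alternative I am suggesting for illustration, not part of the approach above; <code>read_corpus</code> is a hypothetical name, and the sketch assumes every file is UTF-8 text.</p>

```python
from pathlib import Path


def read_corpus(input_dir):
    """Return {relative path: file text} for every regular file under input_dir.

    `read_corpus` is a hypothetical helper; files are assumed to be UTF-8 text.
    """
    base = Path(input_dir)
    corpus = {}
    for path in sorted(base.rglob("*")):
        # rglob("*") also yields directories, so keep regular files only
        if path.is_file():
            corpus[str(path.relative_to(base))] = path.read_text(encoding="utf-8")
    return corpus
```

<p>Keying the result by relative path makes it easy to tell which document came from which file, including extensionless names such as <code>01</code>.</p>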
<hr/>
<p><strong>Update 2:</strong></p>
<p>One last approach (the Linux shell <code>find</code> command, adapted for use in Python 3):</p>
<pre><code>import os
import sh  ## pip install sh
from textblob import TextBlob as tb  ## assumed import; tb is used below

def make_corpus(input_dir):
    '''find (here) matches filenames, excludes directory names'''
    corpus = []
    file_list = []
    ## _iter=True makes sh yield the command output line by line
    #FILES = sh.find(input_dir, '-type', 'f', '-iname', '*.txt', _iter=True)    ## find all .txt files
    FILES = sh.find(input_dir, '-type', 'f', '-iname', '*', _iter=True)    ## find any file
    for filename in FILES:    ## caveat: files in FILES are '\n'-terminated ...
        #print(filename, end='')
        # file_list.append(filename)    ## when printed, each filename ends with '\n'
        filename = filename.rstrip('\n')    ## ... this addresses that issue
        file_list.append(filename)
        with open(filename) as f:
            #print('file:', filename)
            #
            # for general use:
            #for line in f:
                #print(line)
                #corpus.append(line)
            #
            # for this particular example (Question, above):
            data = f.read()
            document = tb(data)
            corpus.append(document)
    print('file_list:', file_list)
    print('corpus length (documents):', len(corpus))
    os.makedirs('output', exist_ok=True)    ## open() does not create missing directories
    with open('output/corpus', 'w') as f:    ## write to file
        for document in corpus:
            f.write(str(document))    ## write() needs a str, not a TextBlob
</code></pre>
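<p>Where the <code>sh</code> package or a Unix <code>find</code> binary is not available, roughly the same file listing can be sketched in pure Python. The helper name <code>find_files</code> is hypothetical, and the behaviour is only an approximation of <code>find -type f -iname</code>: for example, <code>os.walk</code> also reports symlinks to files, which <code>-type f</code> would exclude.</p>

```python
import fnmatch
import os


def find_files(input_dir, pattern="*"):
    """Roughly mimic `find input_dir -type f -iname pattern` without a subprocess.

    `find_files` is a hypothetical helper written for illustration.
    """
    matches = []
    for root, _subdirs, files in os.walk(input_dir):
        for name in files:
            # Lower-casing both sides makes the match case-insensitive, like -iname
            if fnmatch.fnmatchcase(name.lower(), pattern.lower()):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

<p>The returned paths carry no trailing newline, so the <code>rstrip('\n')</code> step is not needed with this variant.</p>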