<p>After some years of figuring this out, here is the updated tutorial:</p>
<p><strong>How to create an NLTK corpus from a directory of text files?</strong></p>
<p>The main idea is to make use of the <a href="http://nltk.org/api/nltk.corpus.html"><strong>nltk.corpus.reader</strong></a> package. If you have a directory of text files in English, it is best to use the <a href="http://nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.plaintext.PlaintextCorpusReader"><strong>PlaintextCorpusReader</strong></a>.</p>
<p>If your directory looks like this:</p>
<pre><code>newcorpus/
    file1.txt
    file2.txt
    ...
</code></pre>
<p>then these few lines of code are all you need to get a corpus:</p>
<pre><code>from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpusdir = 'newcorpus/'  # Directory of the corpus.
newcorpus = PlaintextCorpusReader(corpusdir, '.*')
</code></pre>
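<p>The second argument to <code>PlaintextCorpusReader</code> is a regular expression over the file ids, so you can restrict the corpus to, say, only the <code>.txt</code> files (a small sketch, using the file names from the directory above):</p>
<pre><code># Only files ending in .txt become part of the corpus.
newcorpus = PlaintextCorpusReader(corpusdir, r'.*\.txt')
print(newcorpus.fileids())  # e.g. ['file1.txt', 'file2.txt']
</code></pre>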
<p><strong>NOTE:</strong> the default <code>nltk.tokenize.sent_tokenize()</code> and <code>nltk.tokenize.word_tokenize()</code> are used to split the texts into sentences and words; these functions are built for English and may NOT work for all languages (see the sketch at the end of this answer for how to plug in your own tokenizers).</p>
<p>Here is the full code: creating the test text files, creating a corpus with NLTK, and accessing the corpus at different levels:</p>
<pre><code>import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Let's create a corpus with 2 texts in different textfiles.
txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
corpus = [txt1, txt2]

# Make a new dir for the corpus.
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)

# Output the files into the directory.
filename = 0
for text in corpus:
    filename += 1
    with open(corpusdir + str(filename) + '.txt', 'w') as fout:
        print(text, file=fout)

# Check that our corpus does exist and the files are correct.
assert os.path.isdir(corpusdir)
for infile, text in zip(sorted(os.listdir(corpusdir)), corpus):
    with open(corpusdir + infile, 'r') as fin:
        assert fin.read().strip() == text.strip()

# Create a new corpus by specifying the parameters:
# (1) the directory of the new corpus
# (2) the fileids of the corpus
# NOTE: in this case the fileids are simply the filenames.
newcorpus = PlaintextCorpusReader('newcorpus/', '.*')

# Access each file in the corpus.
for infile in sorted(newcorpus.fileids()):
    print(infile)  # The fileid of each file.
    with newcorpus.open(infile) as fin:  # Opens the file.
        print(fin.read().strip())  # Prints the content of the file.
print()

# Access the plaintext; outputs a pure string.
print(newcorpus.raw().strip())
print()

# Access paragraphs in the corpus. (list of list of list of strings)
# NOTE: the reader automatically sentence- and word-tokenizes the text
# with the default English tokenizers (see the note above).
#
# Each element in the outermost list is a paragraph,
# each paragraph contains sentence(s), and
# each sentence contains token(s).
print(newcorpus.paras())
print()

# To access paragraphs of a specific fileid.
print(newcorpus.paras(newcorpus.fileids()[0]))

# Access sentences in the corpus. (list of list of strings)
# NOTE: the texts are flattened into sentences that contain tokens.
print(newcorpus.sents())
print()

# To access sentences of a specific fileid.
print(newcorpus.sents(newcorpus.fileids()[0]))

# Access just the tokens/words in the corpus. (list of strings)
print(newcorpus.words())

# To access tokens of a specific fileid.
print(newcorpus.words(newcorpus.fileids()[0]))
</code></pre>
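<p>Once loaded, the corpus behaves like any other NLTK corpus. For example, a quick frequency count over all its tokens (a small usage sketch):</p>
<pre><code>from nltk import FreqDist

fdist = FreqDist(newcorpus.words())
print(fdist.most_common(5))  # The 5 most frequent tokens in the corpus.
</code></pre>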
<p>Finally, to read a directory of texts and create an NLTK corpus in another language, you must first make sure that you have python-callable word tokenization and sentence tokenization modules that take string input and produce output like this:</p>
<pre><code>>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
>>> sent_tokenize(txt1)
['This is a foo bar sentence.', 'And this is the first txtfile in the corpus.']
>>> word_tokenize(sent_tokenize(txt1)[0])
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']
</code></pre>
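<p>Once you have such tokenizers, you can pass them to <code>PlaintextCorpusReader</code> through its <code>word_tokenizer</code> and <code>sent_tokenizer</code> parameters, which expect objects with a <code>tokenize()</code> method. A minimal sketch, where a naive whitespace word tokenizer and a one-sentence-per-line splitter stand in for your real language-specific tokenizers:</p>
<pre><code>from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Hypothetical stand-ins: replace these with real tokenizers for your language.
word_tok = RegexpTokenizer(r'\S+')  # Tokens are runs of non-whitespace.

class LineSentTokenizer:
    def tokenize(self, text):
        # Naive: one sentence per non-empty line; swap in a proper splitter.
        return [line for line in text.splitlines() if line.strip()]

newcorpus = PlaintextCorpusReader('newcorpus/', '.*',
                                  word_tokenizer=word_tok,
                                  sent_tokenizer=LineSentTokenizer())
print(newcorpus.sents())  # Now tokenized with your own tokenizers.
</code></pre>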