How do I output NLTK chunks to a file?

#! /usr/bin/python2.7

import nltk
import re
import codecs

xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]

def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged

        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""

        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)

        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)

        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser2.parse(tagged)

        #print chunked1
        #print chunked2
        #print chunked3

        # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:
        #     for i,line in enumerate(chunked1):
        #         if "JJ" in line:
        #             outfile.write(line)
        #         elif "NNP" in line:
        #             outfile.write(line)

processLanguage()

#store cleaned file into variable
with open('path\to\file.txt', 'r') as infile:
    xstring = infile.readlines()
infile.close

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])

for i, line in enumerate(xstring):
    line = remove_non_ascii(line)

#tokenize and tag text
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        print tokenized
        print tagged

processLanguage()

2 Answers

User

Answer 1 · edited 2024-10-03 21:34:08

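Your code has several problems, but the main culprit is that your for loop does not modify the contents of xstring.

I will address all the issues in your code here:

You cannot write a path like this with single \ characters, because \t will be interpreted as a tab and \f as a form feed character. You must double them. I know it is only an example here, but this kind of confusion comes up often: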

with open('path\\to\\file.txt', 'r') as infile:
    xstring = infile.readlines()
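
To see the escape problem concretely, here is a minimal sketch that compares the single-backslash literal with three safer spellings. The path is the placeholder from the question, not a real file, and nothing is opened; the raw-string and forward-slash variants are additions that the answer itself does not mention.

```python
# Sketch only: the path is the placeholder from the question; no file is opened.
broken = 'path\to\file.txt'      # \t becomes a tab, \f becomes a form feed
doubled = 'path\\to\\file.txt'   # escaped backslashes, as recommended above
raw = r'path\to\file.txt'        # raw string: backslashes are kept literally
forward = 'path/to/file.txt'     # forward slashes are also accepted by open() on Windows

print(repr(broken))              # shows the tab and form feed that replaced \t and \f
print(doubled == raw)            # the doubled and raw spellings are the same string
print(len(broken), len(raw))     # the broken literal is two characters shorter
```

Any of the last three spellings avoids the accidental escape sequences; which one to use is mostly a matter of taste.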

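The infile.close line that follows in your code is wrong. It does not call the close method, so it does nothing at all. Moreover, your file has already been closed by the with clause. If you see this line in any answer, anywhere, please downvote that answer and comment that file.close is wrong and should be file.close().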
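
Both claims can be checked with a short experiment. The sketch below uses a throwaway temporary file (an assumption made so the example is self-contained) and inspects the file object's closed attribute.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
fd, tmp_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

with open(tmp_path, 'w') as infile:
    infile.write('hello\n')
    infile.close                      # attribute lookup only: the method is never called
    still_open = not infile.closed    # so the file is still open at this point

closed_after_with = infile.closed     # the with block closed the file on exit
print(still_open, closed_after_with)

os.remove(tmp_path)
```

Since the with block already closes the file, the simplest fix is to delete the infile.close line rather than add the parentheses.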

The following function should work, but note that it replaces every non-ASCII character with ' ', which will break words such as naïve and café:

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])

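
To make that caveat concrete, the sketch below runs the question's replace-with-a-space approach on two accented words and compares it with a hypothetical alternative, strip_accents, which is not part of the original answer. The alternative relies on unicodedata.normalize from the standard library and is only one possible way to keep the base letters.

```python
import unicodedata

def remove_non_ascii(line):
    # Same approach as in the question: every non-ASCII character becomes a space.
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in line)

def strip_accents(line):
    # Hypothetical alternative (not from the original answer): decompose accented
    # letters into base letter + combining mark, then drop whatever non-ASCII
    # code points remain.
    decomposed = unicodedata.normalize('NFKD', line)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

text = u'a na\u00efve caf\u00e9'
print(repr(remove_non_ascii(text)))   # accented letters turn into spaces
print(repr(strip_accents(text)))      # accented letters keep their base letter
```

Note that strip_accents silently drops characters that have no ASCII decomposition, so it is not a general solution either.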
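But here is why your code fails with a Unicode exception: you are not modifying the elements of xstring at all. You compute the line with the non-ASCII characters removed, but that is a new value that is never stored back into the list:

for i, line in enumerate(xstring):
    line = remove_non_ascii(line)

Instead, it should be:

for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)

or, my preferred and more Pythonic version:

xstring = [ remove_non_ascii(line) for line in xstring ]
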
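
The difference between rebinding the loop variable and assigning through the index can be seen in a small, self-contained sketch. The two-element list stands in for the lines read from the file and is an assumption made for illustration.

```python
def remove_non_ascii(line):
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in line)

# Stand-in for the lines read from the file (illustrative data).
xstring = [u'caf\u00e9', u'plain']

# Rebinding the loop variable: the name "line" points at a new string,
# but the list still holds the original one.
for i, line in enumerate(xstring):
    line = remove_non_ascii(line)
unchanged = list(xstring)

# Assigning through the index: the list itself is updated.
for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)

print(unchanged)
print(xstring)
```

After the first loop the list still contains the accented string; only the second loop (or the list comprehension) actually cleans it.
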
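That said, these Unicode errors arise mainly because you are using Python 2.7 to handle pure Unicode text, an area where recent Python 3 versions are far ahead. If you are just starting this task, I recommend upgrading to Python 3.4+ soon.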

User
Answer 2 · edited 2024-10-03 21:34:08

First, watch this video: https://www.youtube.com/watch?v=0Ef9GudbxXY
Now, for the proper answer:
import re
import io

from nltk import pos_tag, word_tokenize, sent_tokenize, RegexpParser

xstring = u"An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."

chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
chunkParser1 = RegexpParser(chunkGram1)

# Split into sentences first, then tokenize, tag and chunk each sentence.
chunked = [chunkParser1.parse(pos_tag(word_tokenize(sent)))
           for sent in sent_tokenize(xstring)]

with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        # str() returns text on Python 3; Python 2.7 needs unicode() here (see below).
        fout.write(str(chunk)+'\n\n')
If you must stick with Python 2.7:
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(unicode(chunk)+'\n\n')
[out]:
alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(unicode(chunk)+'\n\n')
NameError: name 'unicode' is not defined
If you must stick with py2.7, the following is strongly recommended, since it runs on both Python 2 and Python 3:
from six import text_type

with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(text_type(chunk)+'\n\n')
[out]:
alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
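
The output above contains the whole parse tree. If only the chunked phrases are wanted in the file, one possible approach is to filter the tree's subtrees by label before writing. This is a minimal sketch for Python 3, not something taken from the answer: the hand-tagged tokens, the temporary output path, and the one-phrase-per-line format are all assumptions made for illustration. Hand-tagged tokens are used so the example does not depend on the pos_tag model being installed.

```python
import io
import os
import tempfile

from nltk import RegexpParser

# Hand-tagged tokens (an assumption, standing in for pos_tag(word_tokenize(...))).
tagged = [('An', 'DT'), ('electronic', 'JJ'), ('library', 'NN'),
          ('is', 'VBZ'), ('a', 'DT'), ('focused', 'JJ'), ('collection', 'NN')]

# Same grammar as chunkGram1 in the question.
parser = RegexpParser(r"""Chunk: {<JJ\w?>*<NN>}""")
tree = parser.parse(tagged)

# Keep only the subtrees labelled 'Chunk' and join their words into phrases.
phrases = [' '.join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees(filter=lambda t: t.label() == 'Chunk')]

# Hypothetical output location; replace with a real path as needed.
out_path = os.path.join(tempfile.mkdtemp(), 'chunks.txt')
with io.open(out_path, 'w', encoding='utf8') as fout:
    for phrase in phrases:
        fout.write(phrase + u'\n')

print(phrases)
```

Writing plain phrases instead of str(chunk) drops the tags and the tree brackets, which may or may not be what the downstream step needs.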
