How do I output NLTK chunks to a file?

Posted 2024-10-03 21:34:08

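I have a Python script in which I use the nltk library to parse, tokenize, tag, and chunk some text, say random text from the web.

I need to format the output of chunked1, chunked2, and chunked3 and write it to a file. They are of type class 'nltk.tree.Tree'.

More specifically, I only need to write the lines that match the regular expressions chunkGram1, chunkGram2, and chunkGram3.

How can I do that?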

#! /usr/bin/python2.7

import nltk
import re
import codecs

xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]


def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged

        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""

        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)

        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)

        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser3.parse(tagged)

        #print chunked1
        #print chunked2
        #print chunked3

        # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:

            # for i,line in enumerate(chunked1):
                # if "JJ" in line:
                    # outfile.write(line)
                # elif "NNP" in line:
                    # outfile.write(line)



processLanguage()

Currently, when I try to run it, I get an error:

[error output not preserved]

Edit: After @alvas's answer, I managed to do what I wanted. Now, however, I would like to know how to strip all non-ASCII characters from a text corpus. Example:

#store cleaned file into variable
with open('path\to\file.txt', 'r') as infile:
    xstring = infile.readlines()
infile.close

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])

for i, line in enumerate(xstring):
    line = remove_non_ascii(line)

#tokenize and tag text
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        print tokenized
        print tagged
processLanguage()

The snippet above was taken from another answer on S/O, but it does not seem to work. What could be the problem? The error I get is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
not in range(128)

2 Answers

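Your code has several problems, although the main culprit is that your for loop does not modify the contents of xstring.

I will address all the issues in your code here:

You cannot write a path like this with single \ characters, because \t will be interpreted as a tab and \f as a form feed. You have to double them. I know this is only an example, but this kind of confusion comes up often: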

with open('path\\to\\file.txt', 'r') as infile:
    xstring = infile.readlines()
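To make the escaping issue concrete, here is a small sketch that compares three spellings of the same placeholder path. The path is only the example string from the question, not a real file, and the raw-string form is an extra option that the answer itself does not mention.

```python
# In an ordinary string literal, \t and \f are escape sequences.
broken = 'path\to\file.txt'      # silently contains a tab and a form feed
doubled = 'path\\to\\file.txt'   # backslashes escaped by doubling them
raw = r'path\to\file.txt'        # raw string: backslashes are kept literally

print(repr(broken))                      # shows the tab and form feed characters
print(doubled == raw)                    # both spell the intended path
print('\t' in broken, '\f' in broken)    # the escapes really are in there
```

Doubling the backslashes and using a raw string give the same result; forward slashes are a third option that Python's open generally accepts on Windows as well.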

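The infile.close line that follows is wrong. It does not call the close method and in fact does nothing. Besides, your file is already closed by the with clause. If you see this line in any answer, anywhere, please downvote that answer and comment that file.close is wrong and should be file.close().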
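The difference between referencing close and calling it can be checked directly. The sketch below uses a throwaway temporary file (an assumption made only so the example is self-contained) and inspects the file object's closed attribute at each step.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on any real path.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

infile = open(path, 'r')
infile.close                       # only looks up the method; nothing is called
still_open = not infile.closed     # the file is still open at this point
infile.close()                     # the parentheses perform the call
closed_after_call = infile.closed

with open(path, 'r') as infile2:
    lines = infile2.readlines()
closed_by_with = infile2.closed    # the with block closed the file on exit

os.remove(path)
print(still_open, closed_after_call, closed_by_with)
```

All three flags come out true, which is why an explicit close after a with block is redundant even when it is written correctly.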

The following should work, but be aware that it replaces every non-ASCII character with ' ', which will break words such as naïve and café:

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])
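To illustrate the caveat, the sketch below runs the question's remove_non_ascii on two accented words. For contrast it also shows a different technique, Unicode NFKD decomposition via the standard unicodedata module, which is not part of the original answer; strip_accents is a hypothetical helper name.

```python
import unicodedata

def remove_non_ascii(line):
    # Approach from the question: every non-ASCII character becomes a space.
    return ''.join([c if ord(c) < 128 else ' ' for c in line])

def strip_accents(line):
    # Hypothetical alternative (not from the answer): split accented letters
    # into base letter + combining mark, then drop what is still non-ASCII.
    decomposed = unicodedata.normalize('NFKD', line)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

text = u'a na\u00efve caf\u00e9'
print(repr(remove_non_ascii(text)))   # accented letters turn into spaces
print(repr(strip_accents(text)))      # accented letters keep their base letter
```

The decomposition route only helps for characters that have an ASCII base letter; anything else is dropped without a replacement, so whether it suits a given corpus is a judgment call.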

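But here is why your code fails with a Unicode exception: you never modify the elements of xstring. That is, you compute each line with the non-ASCII characters removed, but that is a new value which is never stored back into the list: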

for i, line in enumerate(xstring):
   line = remove_non_ascii(line)

Instead, it should be:

for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)

Or, my favorite, the more Pythonic way:

xstring = [ remove_non_ascii(line) for line in xstring ]
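The effect of rebinding the loop variable versus assigning through the index can be seen on a two-element list. This is a minimal sketch; the sample strings are made up for the demonstration.

```python
def remove_non_ascii(line):
    return ''.join([c if ord(c) < 128 else ' ' for c in line])

xstring = [u'caf\u00e9\n', u'plain\n']   # made-up sample lines

# Rebinding the loop variable leaves the list untouched.
for i, line in enumerate(xstring):
    line = remove_non_ascii(line)
unchanged = list(xstring)

# Assigning through the index stores the cleaned value in the list.
for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)

print(unchanged)   # still contains the accented character
print(xstring)     # accented character replaced by a space
```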

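These Unicode errors, though, occur mainly because you are using Python 2.7 to process Unicode text, an area where recent Python 3 versions are far ahead. If you are just starting on this task, I recommend upgrading to Python 3.4+ as soon as possible.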
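One way to sidestep the decode error, sketched here under the assumption that the corpus file is UTF-8 encoded, is to decode it while reading instead of stripping characters afterwards. The temporary file and its content are made up for the demonstration; io.open behaves the same way on Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Throwaway UTF-8 file standing in for the corpus; the content is made up.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf8') as fout:
    fout.write(u'A na\u00efve caf\u00e9 \u2013 test\n')

# Decoding at read time yields text lines rather than raw bytes.
with io.open(path, 'r', encoding='utf8') as infile:
    xstring = infile.readlines()

os.remove(path)
print(repr(xstring[0]))
```

The byte 0xe2 in the reported error is commonly the first byte of a UTF-8 encoded dash or curly quote, which fits the assumption that the input is UTF-8, although the question does not confirm the file's encoding.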

First, watch this video: https://www.youtube.com/watch?v=0Ef9GudbxXY

Now, on to the actual answer:

import re
import io

from nltk import pos_tag, word_tokenize, sent_tokenize, RegexpParser


xstring = u"An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."


chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
chunkParser1 = RegexpParser(chunkGram1)

chunked = [chunkParser1.parse(pos_tag(word_tokenize(sent))) 
            for sent in sent_tokenize(xstring)]

with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(str(chunk)+'\n\n')

[out]:

[output not preserved]

If you must stick with Python 2.7:

with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(unicode(chunk)+'\n\n')

[out]:

alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(unicode(chunk)+'\n\n')
NameError: name 'unicode' is not defined
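The NameError arises because Python 3 has no unicode builtin; its str is already the text type. A minimal sketch of a version-agnostic alias using only the standard language follows, with text_type chosen here as an illustrative name.

```python
try:
    text_type = unicode   # Python 2: the text type is called unicode
except NameError:
    text_type = str       # Python 3: str is already the text type

# Converting any object to text now works on both major versions.
print(text_type(42) + u'\n')
```

The six import shown next provides an equivalent alias without the try/except.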

And, if you must stick with py2.7, the following is strongly recommended:

from six import text_type
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(text_type(chunk)+'\n\n')

[out]:

alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile 
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile 
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
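The question also asked for only the matched chunks to be written, whereas the code above writes the whole tree. A minimal sketch of one way to do that, assuming NLTK 3 (where trees expose label(), leaves(), and subtrees()), is shown below. The tokens are tagged by hand so that no tagger model is needed, and the temporary output file is a hypothetical stand-in for the real destination.

```python
import io
import os
import tempfile

from nltk import RegexpParser

# Hand-tagged tokens so the sketch needs no tagger model; in the pipeline
# above they would come from pos_tag(word_tokenize(sent)).
tagged = [(u'An', u'DT'), (u'electronic', u'JJ'), (u'library', u'NN'),
          (u'is', u'VBZ'), (u'a', u'DT'), (u'focused', u'JJ'),
          (u'collection', u'NN')]

parser = RegexpParser(r"""Chunk: {<JJ\w?>*<NN>}""")
tree = parser.parse(tagged)

# Keep only the subtrees labelled "Chunk"; all other tokens are skipped.
matched = [u' '.join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees(filter=lambda t: t.label() == 'Chunk')]

fd, path = tempfile.mkstemp(suffix='.txt')   # hypothetical output location
os.close(fd)
with io.open(path, 'w', encoding='utf8') as fout:
    for phrase in matched:
        fout.write(phrase + u'\n')

# Read the file back to confirm what was written, then clean up.
with io.open(path, 'r', encoding='utf8') as fin:
    written = fin.read()
os.remove(path)
print(written)
```

Writing one phrase per line is a design choice made for this sketch; writing str(subtree) instead would keep the bracketed tag notation shown in the output above.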
