Ghostscript not found in NLTK?

Posted 2024-10-01 17:23:28


I was playing around with NLTK and ran into a problem when I tried to use the chunk module:

import nltk as nk
Sentence  = "Betty Botter bought some butter, but she said the butter is  bitter, I f I put it in my batter, it will make my batter bitter."
tokens = nk.word_tokenize(Sentence)
tagged = nk.pos_tag(tokens)
entities = nk.chunk.ne_chunk(tagged) 

The code runs fine, but when I type

>>> entities

I get the following error message:

Out[2]: Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])
Traceback (most recent call last):

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\IPython\core\formatters.py", line 343, in __call__
return method()

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\tree.py", line 726, in _repr_png_
subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 602, in find_binary
binary_names, url, verbose))

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 596, in find_binary_iter
url, verbose):

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 567, in find_file_iter
raise LookupError('\n\n%s\n%s\n%s' % (div, msg, div))

LookupError: 

===========================================================================
NLTK was unable to find the gs file!
Use software specific configuration paramaters or set the PATH environment variable.
===========================================================================

According to this post, the solution is to install Ghostscript, since the chunker is trying to use it to display a parse tree and is looking for one of three binaries:

file_names=['gs', 'gswin32c.exe', 'gswin64c.exe']

to use. But even though I installed Ghostscript and can now find the binary with a Windows search, I still get the same error.

What do I need to fix or update?


Additional path information:

import os; print(os.environ['PATH'])

returns:

C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Program Files (x86)\Parallels\Parallels Tools\Applications;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Oracle\RPAS14.1\RpasServer\bin;C:\Oracle\RPAS14.1\RpasServer\applib;C:\Program Files (x86)\Java\jre7\bin;C:\Program Files (x86)\Java\jre7\bin\client;C:\Program Files (x86)\Java\jre7\lib;C:\Program Files (x86)\Java\jre7\jre\bin\client;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;  

3 Answers

Download gs.exe from "https://www.ghostscript.com/download/gsdnld.html" and add its path to the Environment Variables.

The path will probably be under

C:\Program Files\

(in my system it looks like "C:\Program Files\gs\gs9.21\bin")

To add it to the environment variables:

control panel->system and security->system->advanced system settings->Environment Variables->(in system variables scroll down and double click on path)->

then add the copied path

(in my case "C:\Program Files\gs\gs9.21\bin")

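P.S.: Don't forget to put a semicolon (;) before the path you append. Do not delete the existing path and simply put yours in its place, or you may run into trouble and need to restore a backup :)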

In my case, when I added the path the way alvas's answer does but wrote it as a plain (non-raw) string literal, the result was:

'C:\\Program Files\\gs\\gs9.27\x08in'

That is not correct (the \b in the literal was read as a backspace character, \x08), so I changed it to path_to_gs = 'C:/Program Files/gs/gs9.27/bin', and it worked.
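The backspace problem can be reproduced without Ghostscript installed. The snippet below is only an illustration (the variable names are mine, not from either answer); it compares a plain string literal, a raw string literal, and forward slashes:

```python
# In a plain string literal, "\b" is the backspace character (\x08), so the
# final "\bin" component of the path is silently corrupted.
plain = "C:\\Program Files\\gs\\gs9.27\bin"   # only the last backslash is left unescaped
raw = r"C:\Program Files\gs\gs9.27\bin"       # raw string: backslashes are kept literally
forward = "C:/Program Files/gs/gs9.27/bin"    # forward slashes need no escaping

print(repr(plain))    # the repr shows \x08 where "\b" was written
print(repr(raw))
print(repr(forward))
```

Either the raw string or the forward-slash form should be safe to append to PATH on Windows.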

In short:

Instead of >>> entities, do this:

>>> print(entities.__repr__())

Or add the Ghostscript bin directory to PATH before displaying entities, as described under Solution 2 below.

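The problem is that you are trying to display the output of ne_chunk, and that fires Ghostscript to get both the string and the drawing representation of the NE-tagged sentence, which is an nltk.tree.Tree object. Ghostscript is required so that the drawing widget can be used to visualize it.

Let's walk through this step by step.

First, when you use ne_chunk, you can import it directly at the top level: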

from nltk import ne_chunk

It's also advisable to import the names you need into your namespace, i.e.:

from nltk import word_tokenize, pos_tag, ne_chunk

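When you use ne_chunk, it comes from https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py

It's unclear what kind of object the pickle loads, but after some inspection we find that there is only one built-in NE chunker that isn't rule-based. Since the name of the pickle binary mentions maxent, we can assume it's a statistical chunker, so it most probably comes from the NEChunkParser object: https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py. That file also contains ACE data API functions, which matches the name of the pickle binary.

Now, whenever you use the ne_chunk function, it is actually calling NEChunkParser.parse(), which returns an nltk.tree.Tree object: https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py#L118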

class NEChunkParser(ChunkParserI):
    """
    Expected input: list of pos-tagged words
    """
    def __init__(self, train):
        self._train(train)

    def parse(self, tokens):
        """
        Each token should be a pos-tagged word
        """
        tagged = self._tagger.tag(tokens)
        tree = self._tagged_to_parse(tagged)
        return tree

    def _train(self, corpus):
        # Convert to tagged sequence
        corpus = [self._parse_to_tagged(s) for s in corpus]

        self._tagger = NEChunkParserTagger(train=corpus)

    def _tagged_to_parse(self, tagged_tokens):
        """
        Convert a list of tagged tokens to a chunk-parse tree.
        """
        sent = Tree('S', [])

        for (tok,tag) in tagged_tokens:
            if tag == 'O':
                sent.append(tok)
            elif tag.startswith('B-'):
                sent.append(Tree(tag[2:], [tok]))
            elif tag.startswith('I-'):
                if (sent and isinstance(sent[-1], Tree) and
                    sent[-1].label() == tag[2:]):
                    sent[-1].append(tok)
                else:
                    sent.append(Tree(tag[2:], [tok]))
        return sent
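To make the grouping logic of _tagged_to_parse easier to follow, here is a minimal stand-alone sketch of the same idea that does not need NLTK. The Chunk class and the bio_to_chunks function are hypothetical names of mine, tokens are plain strings rather than (word, POS) pairs, and the example tags are made up:

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Chunk:
    """Hypothetical stand-in for nltk.tree.Tree: a label plus its tokens."""
    label: str
    tokens: List[str] = field(default_factory=list)


def bio_to_chunks(tagged_tokens: List[Tuple[str, str]]) -> List[Union[str, Chunk]]:
    """Group (token, BIO tag) pairs into plain tokens and labeled chunks."""
    sent: List[Union[str, Chunk]] = []
    for tok, tag in tagged_tokens:
        if tag == "O":
            # Outside any entity: keep the bare token.
            sent.append(tok)
        elif tag.startswith("B-"):
            # Beginning of an entity: always open a new chunk.
            sent.append(Chunk(tag[2:], [tok]))
        elif tag.startswith("I-"):
            # Inside an entity: extend the previous chunk if the label matches,
            # otherwise open a new chunk.
            last = sent[-1] if sent else None
            if isinstance(last, Chunk) and last.label == tag[2:]:
                last.tokens.append(tok)
            else:
                sent.append(Chunk(tag[2:], [tok]))
    return sent


# Made-up tags, for illustration only.
example = [("Betty", "B-PERSON"), ("Botter", "I-PERSON"), ("bought", "O"), ("butter", "O")]
print(bio_to_chunks(example))
```

The real method builds nltk.tree.Tree nodes instead, which is why the object returned by ne_chunk carries the _repr_png_ method discussed next.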

If we take a look at the nltk.tree.Tree object, that is where the Ghostscript problem appears, when it tries to call the _repr_png_ function: https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L702

def _repr_png_(self):
    """
    Draws and outputs in PNG for ipython.
    PNG is used instead of PDF, since it can be displayed in the qt console and
    has wider browser support.
    """
    import os
    import base64
    import subprocess
    import tempfile
    from nltk.draw.tree import tree_to_treesegment
    from nltk.draw.util import CanvasFrame
    from nltk.internals import find_binary
    _canvas_frame = CanvasFrame()
    widget = tree_to_treesegment(_canvas_frame.canvas(), self)
    _canvas_frame.add_widget(widget)
    x, y, w, h = widget.bbox()
    # print_to_file uses scrollregion to set the width and height of the pdf.
    _canvas_frame.canvas()['scrollregion'] = (0, 0, w, h)
    with tempfile.NamedTemporaryFile() as file:
        in_path = '{0:}.ps'.format(file.name)
        out_path = '{0:}.png'.format(file.name)
        _canvas_frame.print_to_file(in_path)
        _canvas_frame.destroy_widget(widget)
        subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +
                        '-q -dEPSCrop -sDEVICE=png16m -r90 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dSAFER -dBATCH -dNOPAUSE -sOutputFile={0:} {1:}'
                        .format(out_path, in_path).split())
        with open(out_path, 'rb') as sr:
            res = sr.read()
        os.remove(in_path)
        os.remove(out_path)
        return base64.b64encode(res).decode()

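But note that it is strange for the Python interpreter to trigger _repr_png_ instead of __repr__ when you type >>> entities at the prompt (see Purpose of Python's __repr__). That can't be how the native CPython interpreter works when it prints the representation of an object, so we take a look at IPython.core.formatters, and we see that it allows _repr_png_ to be fired: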

class PNGFormatter(BaseFormatter):
    """A PNG formatter.
    To define the callables that compute the PNG representation of your
    objects, define a :meth:`_repr_png_` method or use the :meth:`for_type`
    or :meth:`for_type_by_name` methods to register functions that handle
    this.
    The return value of this formatter should be raw PNG data, *not*
    base64 encoded.
    """
    format_type = Unicode('image/png')

    print_method = ObjectName('_repr_png_')

    _return_type = (bytes, unicode_type)

And we see that when IPython initializes a DisplayFormatter object, it tries to activate all the formatters: https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L66

def _formatters_default(self):
    """Activate the default formatters."""
    formatter_classes = [
        PlainTextFormatter,
        HTMLFormatter,
        MarkdownFormatter,
        SVGFormatter,
        PNGFormatter,
        PDFFormatter,
        JPEGFormatter,
        LatexFormatter,
        JSONFormatter,
        JavascriptFormatter
    ]
    d = {}
    for cls in formatter_classes:
        f = cls(parent=self)
        d[f.format_type] = f
    return d
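The effect of activating every formatter can be mimicked with a small sketch. This is not IPython's actual implementation; FakeTree, PRINT_METHODS and format_all are hypothetical names used only to show why the plain-text Out[2] line still appears while the PNG formatter fails:

```python
class FakeTree:
    """Hypothetical object with a working __repr__ and a failing _repr_png_."""

    def __repr__(self):
        return "Tree('S', [])"

    def _repr_png_(self):
        # Stands in for NLTK failing to locate the gs binary.
        raise LookupError("NLTK was unable to find the gs file!")


# Hypothetical mapping of MIME type to the method a formatter would look up.
PRINT_METHODS = {"text/plain": "__repr__", "image/png": "_repr_png_"}


def format_all(obj):
    """Try every formatter; collect successful results and errors separately."""
    results, errors = {}, {}
    for mime, method_name in PRINT_METHODS.items():
        method = getattr(obj, method_name, None)
        if method is None:
            continue
        try:
            results[mime] = method()
        except Exception as exc:
            errors[mime] = exc
    return results, errors


results, errors = format_all(FakeTree())
print(results)  # only the plain-text representation succeeded
print(errors)   # the PNG attempt failed with a LookupError
```

Under this reading, the traceback in the question comes from the PNG attempt alone, and the tree itself was computed correctly.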

Note that outside of IPython, in the native CPython interpreter, only __repr__ is called, not _repr_png_:

>>> from nltk import ne_chunk
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> Sentence  = "Betty Botter bought some butter, but she said the butter is  bitter, I f I put it in my batter, it will make my batter bitter."
>>> sentence  = "Betty Botter bought some butter, but she said the butter is  bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sentence)))
>>> entities
Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])

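So now the solution:

Solution 1

When printing out the string output of ne_chunk, you can use

>>> print(entities.__repr__())

instead of >>> entities. That way IPython explicitly calls only __repr__ instead of trying all possible formatters.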

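Solution 2

If you really need to use _repr_png_ to visualize the Tree object, then we need to figure out how to make the Ghostscript binary visible through the environment variables that NLTK checks.

In your case, it seems that the default nltk.internals is unable to find the binary. More specifically, we're referring to https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L599

If we go back to https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L726, we see that it is trying to look in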

env_vars=['PATH']

And when NLTK tries to initialize its environment variables, it looks at os.environ.

Note that find_binary calls find_binary_iter, which calls find_file_iter, which tries to resolve the env_vars by fetching os.environ.
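A quick way to check whether any of the binary names NLTK looks for is reachable from the current process is to search PATH yourself. The sketch below uses the standard library's shutil.which rather than NLTK's own lookup code; find_gs is a hypothetical helper name, and the binary names are the ones listed in the question:

```python
import shutil


def find_gs(binary_names=("gs", "gswin32c.exe", "gswin64c.exe"), path=None):
    """Return the first matching executable found on PATH, or None."""
    for name in binary_names:
        # With path=None, shutil.which searches the PATH environment variable.
        found = shutil.which(name, path=path)
        if found:
            return found
    return None


print(find_gs())  # None means the running interpreter cannot see Ghostscript
```

If this prints None inside IPython even though Ghostscript is installed, the directory containing the binary is probably missing from the PATH that the process sees.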

So if we add it to the PATH:

>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"  # raw string, so \b is not read as a backspace
>>> os.environ['PATH'] += os.pathsep + path_to_gs

Now this should work in IPython:

>>> entities
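Running the += line repeatedly in one session appends the same directory again each time. A small helper can avoid that; append_to_path is a hypothetical name, the comparison is an exact string match (so differently cased spellings of the same Windows path are not detected), and the change only affects the current Python process:

```python
import os


def append_to_path(directory, environ=os.environ):
    """Append directory to environ['PATH'] unless it is already listed."""
    current = environ.get("PATH", "")
    if directory in current.split(os.pathsep):
        return current  # already present, nothing to do
    environ["PATH"] = current + os.pathsep + directory if current else directory
    return environ["PATH"]


# Path taken from the answer above; adjust it to the installed Ghostscript version.
append_to_path(r"C:\Program Files\gs\gs9.19\bin")
```

For a permanent fix, the directory still has to be added through the system's Environment Variables dialog, and IPython restarted so that it picks up the new value.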
