<p><strong>In short</strong>:</p>
<p>Instead of <code>>>> entities</code>, do the following:</p>
<pre><code>>>> print(entities.__repr__())
</code></pre>
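<p>Or make the Ghostscript binary discoverable by adding its directory to <code>PATH</code>, as described in the second solution below.</p>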
<hr/>
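<p><strong>In long</strong>:</p>
<p>The problem is that you are trying to print the output of <code>ne_chunk</code>, which is an <code>nltk.tree.Tree</code> object. Doing so triggers Ghostscript, because both a string and a drawn representation of the NE-tagged sentence are requested, and Ghostscript is what lets the drawing widget render the tree.</p>
<p>Let's go through it step by step.</p>
<p>First, when you use <code>ne_chunk</code>, you can import it directly at the top level:</p>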
<pre><code>from nltk import ne_chunk
</code></pre>
<p>It is advisable to use namespaces for your imports, i.e.:</p>
<pre><code>from nltk import word_tokenize, pos_tag, ne_chunk
</code></pre>
<p>When you use <code>ne_chunk</code>, it comes from <a href="https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py" rel="nofollow noreferrer">https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py</a></p>
<p>It is unclear what kind of function the pickle loads, but after some inspection we find that there is only one built-in NE chunker that is not rule-based. Since the name of the pickle binary contains "maxent", we can assume it is a statistical chunker, so it most probably comes from the <code>NEChunkParser</code> object: <a href="https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py" rel="nofollow noreferrer">https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py</a>. That module also contains ACE data API functions, which matches the name of the pickle binary.</p>
<p>Now, whenever you use the <code>ne_chunk</code> function, it is actually calling <code>NEChunkParser.parse()</code>, which returns an <code>nltk.tree.Tree</code> object: <a href="https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py#L118" rel="nofollow noreferrer">https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py#L118</a></p>
<pre><code>class NEChunkParser(ChunkParserI):
    """
    Expected input: list of pos-tagged words
    """
    def __init__(self, train):
        self._train(train)

    def parse(self, tokens):
        """
        Each token should be a pos-tagged word
        """
        tagged = self._tagger.tag(tokens)
        tree = self._tagged_to_parse(tagged)
        return tree

    def _train(self, corpus):
        # Convert to tagged sequence
        corpus = [self._parse_to_tagged(s) for s in corpus]
        self._tagger = NEChunkParserTagger(train=corpus)

    def _tagged_to_parse(self, tagged_tokens):
        """
        Convert a list of tagged tokens to a chunk-parse tree.
        """
        sent = Tree('S', [])
        for (tok,tag) in tagged_tokens:
            if tag == 'O':
                sent.append(tok)
            elif tag.startswith('B-'):
                sent.append(Tree(tag[2:], [tok]))
            elif tag.startswith('I-'):
                if (sent and isinstance(sent[-1], Tree) and
                    sent[-1].label() == tag[2:]):
                    sent[-1].append(tok)
                else:
                    sent.append(Tree(tag[2:], [tok]))
        return sent
</code></pre>
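<p>To make the <code>_tagged_to_parse</code> step concrete, here is a minimal sketch of the same B-/I-/O grouping logic. It is not NLTK's code: the function name <code>bio_to_chunks</code> is hypothetical, and plain lists stand in for <code>nltk.tree.Tree</code> so that it runs without NLTK installed.</p>

```python
# Toy version of the B-/I-/O grouping step. Plain lists of the form
# [label, [tokens...]] stand in for nltk.tree.Tree (an assumption made
# only so the sketch has no dependencies).
def bio_to_chunks(tagged_tokens):
    """Group (token, bio_tag) pairs into chunks.

    'O' tokens are kept as bare items, 'B-X' starts a new chunk X, and
    'I-X' extends the previous chunk if it has label X (otherwise it
    starts a new chunk, mirroring the fallback in the code above).
    """
    sent = []
    for tok, tag in tagged_tokens:
        if tag == 'O':
            sent.append(tok)
        elif tag.startswith('B-'):
            sent.append([tag[2:], [tok]])
        elif tag.startswith('I-'):
            last = sent[-1] if sent else None
            if isinstance(last, list) and last[0] == tag[2:]:
                last[1].append(tok)
            else:
                sent.append([tag[2:], [tok]])
    return sent


example = [
    (('Betty', 'NNP'), 'B-PERSON'),
    (('Botter', 'NNP'), 'I-PERSON'),
    (('bought', 'VBD'), 'O'),
]
print(bio_to_chunks(example))
```

<p>The two tokens tagged <code>B-PERSON</code> and <code>I-PERSON</code> end up in one <code>PERSON</code> chunk, while the <code>O</code> token stays at the top level, which is the shape of the tree that <code>ne_chunk</code> hands back.</p>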
<p>If we take a look at the <a href="https://github.com/nltk/nltk/blob/develop/nltk/tree.py" rel="nofollow noreferrer"><code>nltk.tree.Tree</code></a> object, the Ghostscript problem appears when it tries to call the <code>_repr_png_</code> function: <a href="https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L702" rel="nofollow noreferrer">https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L702</a>:</p>
<pre><code>def _repr_png_(self):
    """
    Draws and outputs in PNG for ipython.
    PNG is used instead of PDF, since it can be displayed in the qt console and
    has wider browser support.
    """
    import os
    import base64
    import subprocess
    import tempfile
    from nltk.draw.tree import tree_to_treesegment
    from nltk.draw.util import CanvasFrame
    from nltk.internals import find_binary
    _canvas_frame = CanvasFrame()
    widget = tree_to_treesegment(_canvas_frame.canvas(), self)
    _canvas_frame.add_widget(widget)
    x, y, w, h = widget.bbox()
    # print_to_file uses scrollregion to set the width and height of the pdf.
    _canvas_frame.canvas()['scrollregion'] = (0, 0, w, h)
    with tempfile.NamedTemporaryFile() as file:
        in_path = '{0:}.ps'.format(file.name)
        out_path = '{0:}.png'.format(file.name)
        _canvas_frame.print_to_file(in_path)
        _canvas_frame.destroy_widget(widget)
        subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +
                        '-q -dEPSCrop -sDEVICE=png16m -r90 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dSAFER -dBATCH -dNOPAUSE -sOutputFile={0:} {1:}'
                        .format(out_path, in_path).split())
        with open(out_path, 'rb') as sr:
            res = sr.read()
        os.remove(in_path)
        os.remove(out_path)
        return base64.b64encode(res).decode()
</code></pre>
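<p>But note that, strangely, when you use <code>>>> entities</code> at the interpreter, <code>_repr_png_</code> is triggered instead of <code>__repr__</code> (see <a href="https://stackoverflow.com/questions/1984162/purpose-of-pythons-repr">Purpose of Python's __repr__</a>). That cannot be how the native CPython interpreter works when it prints the representation of an object, so we take a look at <code>IPython.core.formatters</code> and see that it allows <code>_repr_png_</code> to be fired through its <code>PNGFormatter</code>:</p>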
<pre><code>class PNGFormatter(BaseFormatter):
    """A PNG formatter.
    To define the callables that compute the PNG representation of your
    objects, define a :meth:`_repr_png_` method or use the :meth:`for_type`
    or :meth:`for_type_by_name` methods to register functions that handle
    this.
    The return value of this formatter should be raw PNG data, *not*
    base64 encoded.
    """
    format_type = Unicode('image/png')
    print_method = ObjectName('_repr_png_')
    _return_type = (bytes, unicode_type)
</code></pre>
<p>We can also see that when IPython initializes a <code>DisplayFormatter</code> object, it tries to activate all the formatters: <a href="https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L66" rel="nofollow noreferrer">https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L66</a></p>
<pre><code>def _formatters_default(self):
    """Activate the default formatters."""
    formatter_classes = [
        PlainTextFormatter,
        HTMLFormatter,
        MarkdownFormatter,
        SVGFormatter,
        PNGFormatter,
        PDFFormatter,
        JPEGFormatter,
        LatexFormatter,
        JSONFormatter,
        JavascriptFormatter
    ]
    d = {}
    for cls in formatter_classes:
        f = cls(parent=self)
        d[f.format_type] = f
    return d
</code></pre>
<p>Note that outside of <code>IPython</code>, in the native CPython interpreter, only <code>__repr__</code> is called, not <code>_repr_png_</code>:</p>
<pre><code>>>> from nltk import ne_chunk
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> Sentence = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> sentence = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sentence)))
>>> entities
Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])
</code></pre>
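<p>The difference between the two front ends can be illustrated with a toy model. This is only a sketch under stated assumptions: <code>ToyTree</code>, <code>plain_display</code> and <code>rich_display</code> are hypothetical names, not NLTK or IPython APIs, and IPython's real formatters apply their own error handling, whereas this sketch simply lets the error propagate.</p>

```python
class ToyTree:
    """Hypothetical stand-in for an object with a plain and a PNG repr."""

    def __repr__(self):
        return "ToyTree('S', [])"

    def _repr_png_(self):
        # The real Tree method shells out to Ghostscript at this point;
        # here a missing binary is simulated by raising.
        raise LookupError("gs binary not found")


def plain_display(obj):
    """Model of a plain REPL: only repr() is consulted."""
    return repr(obj)


def rich_display(obj, method_names=("__repr__", "_repr_png_")):
    """Model of a rich front end: call every representation method found."""
    results = {}
    for name in method_names:
        method = getattr(obj, name, None)
        if callable(method):
            results[name] = method()  # an exception raised here propagates
    return results


tree = ToyTree()
print(plain_display(tree))  # never touches _repr_png_

try:
    rich_display(tree)
except LookupError as err:
    print("rich display failed:", err)
```

<p>The plain path succeeds because it never reaches <code>_repr_png_</code>, which is why asking for the <code>__repr__</code> output explicitly sidesteps the Ghostscript lookup.</p>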
<hr/>
<p>So now, the solutions:</p>
<p><strong>Solution 1</strong>:</p>
<p>To print out the string output of <code>ne_chunk</code>, you can use</p>
<pre><code>>>> print(entities.__repr__())
</code></pre>
<p>This way IPython should explicitly call only <code>__repr__</code>, instead of calling all possible formatters as it does for <code>>>> entities</code>.</p>
<p><strong>Solution 2</strong>:</p>
<p>If you really need to use <code>_repr_png_</code> to visualize the <code>Tree</code> object, then we need to figure out how to add the Ghostscript binary to NLTK's environment variables.</p>
<p>In your case, it seems that the default <code>nltk.internals</code> is unable to find the binary. More specifically, we are referring to <a href="https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L599" rel="nofollow noreferrer">https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L599</a></p>
<p>If we go back to <a href="https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L726" rel="nofollow noreferrer">https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L726</a>, we see that it is trying to look for</p>
<pre><code>env_vars=['PATH']
</code></pre>
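<p>When NLTK initializes its environment variables, it looks at <code>os.environ</code>.</p>
<p>Note that <code>find_binary</code> calls <code>find_binary_iter</code>, which calls <code>find_file_iter</code>, which in turn tries to look up the <code>env_vars</code> by fetching <code>os.environ</code>.</p>
<p>So if we add the Ghostscript directory to the path:</p>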
<pre><code>>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"  # raw string, so "\b" is not read as a backspace
>>> os.environ['PATH'] += os.pathsep + path_to_gs
</code></pre>
<p>With that in place, evaluating <code>entities</code> in IPython should now work.</p>
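<p>Before retrying in IPython, it can help to confirm that the binary is actually visible on <code>PATH</code>. The following is a minimal sketch using only the standard library; <code>add_to_path</code> and <code>find_ghostscript</code> are hypothetical helper names, and the check relies on <code>shutil.which</code> rather than NLTK's own <code>find_binary</code>, so the two may disagree in edge cases.</p>

```python
import os
import shutil


def add_to_path(directory):
    """Append directory to this process's PATH unless it is already listed."""
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        os.environ["PATH"] = os.pathsep.join(entries + [directory])


def find_ghostscript(names=("gs", "gswin64c", "gswin32c")):
    """Return the first Ghostscript executable found on PATH, or None."""
    for name in names:
        found = shutil.which(name)
        if found:
            return found
    return None


# The install location is the one used in the example above; adjust it
# to wherever Ghostscript lives on your machine.
add_to_path(r"C:\Program Files\gs\gs9.19\bin")
print(find_ghostscript())
```

<p>If this prints <code>None</code>, the directory is probably wrong or Ghostscript is not installed. Note that changing <code>os.environ</code> only affects the current Python process.</p>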