<p>Firstly, I assume that you got the code from the old <code>NLTK</code> chapter 05: <a href="https://nltk.googlecode.com/svn/trunk/doc/book/ch05.py" rel="noreferrer">https://nltk.googlecode.com/svn/trunk/doc/book/ch05.py</a>, in particular you are looking at this section: <a href="http://pastebin.com/EC8fFqLU" rel="noreferrer">http://pastebin.com/EC8fFqLU</a></p>
<p>Now, let's look at the confusion matrix in <code>NLTK</code>. Try this:</p>
<pre><code>from nltk.metrics import ConfusionMatrix

ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
cm = ConfusionMatrix(ref, tagged)
print(cm)
</code></pre>
<p>[out]:</p>
<pre><code>    | D         |
    | E I J N V |
    | T N J N B |
----+-----------+
DET |&lt;3&gt;. . . . |
 IN | .&lt;1&gt;. . . |
 JJ | . . . 1 . |
 NN | . . .&lt;3&gt;1 |
 VB | . . . .&lt;1&gt;|
----+-----------+
(row = reference; col = test)
</code></pre>
<p>The numbers embedded inside <code>&lt;&gt;</code> are the true positives (tp). From the example above, you can see that one <code>JJ</code> from the reference was wrongly tagged as <code>NN</code> in the tagged output, i.e. it counts as one false positive for <code>NN</code> and one false negative for <code>JJ</code>.</p>
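<p>As a sanity check, the cells of that matrix are just counts of (reference, test) tag pairs, which you can reproduce in plain Python without <code>NLTK</code> (a minimal sketch, not the library's implementation):</p>

```python
from collections import Counter

ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()

# Each (reference, test) pair is one cell of the confusion matrix.
pairs = Counter(zip(ref, tagged))

print(pairs[('JJ', 'NN')])    # the one JJ mistagged as NN -> 1
print(pairs[('NN', 'VB')])    # the one NN mistagged as VB -> 1
print(pairs[('DET', 'DET')])  # DET tagged correctly 3 times -> 3
```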
<p>To access the confusion matrix (for computing precision/recall/fscore), you can get the false negatives, false positives and true positives like this:</p>
<pre><code>from collections import Counter

labels = set('DET NN VB IN JJ'.split())

true_positives = Counter()
false_negatives = Counter()
false_positives = Counter()

for i in labels:
    for j in labels:
        if i == j:
            true_positives[i] += cm[i, j]
        else:
            false_negatives[i] += cm[i, j]
            false_positives[j] += cm[i, j]

print("TP:", sum(true_positives.values()), true_positives)
print("FN:", sum(false_negatives.values()), false_negatives)
print("FP:", sum(false_positives.values()), false_positives)
</code></pre>
<p>[out]:</p>
<pre><code>TP: 8 Counter({'DET': 3, 'NN': 3, 'VB': 1, 'IN': 1, 'JJ': 0})
FN: 2 Counter({'NN': 1, 'JJ': 1, 'VB': 0, 'DET': 0, 'IN': 0})
FP: 2 Counter({'VB': 1, 'NN': 1, 'DET': 0, 'JJ': 0, 'IN': 0})
</code></pre>
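<p>Note that TP + FN sums to the total number of tokens, so the 8 true positives out of 10 tokens correspond to the overall token-level accuracy (this check is an addition, not part of the original example):</p>

```python
ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET JJ NN NN IN DET NN'.split()[:0] or 'DET VB VB DET NN NN NN IN DET NN'.split()

# Overall accuracy: positions where reference and test tags agree.
correct = sum(1 for r, t in zip(ref, tagged) if r == t)
accuracy = correct / float(len(ref))
print(accuracy)  # 8 correct out of 10 tokens -> 0.8
```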
<p>To compute the Fscore per label:</p>
<pre><code>for i in sorted(labels):
    if true_positives[i] == 0:
        fscore = 0
    else:
        precision = true_positives[i] / float(true_positives[i] + false_positives[i])
        recall = true_positives[i] / float(true_positives[i] + false_negatives[i])
        fscore = 2 * (precision * recall) / (precision + recall)
    print(i, fscore)
</code></pre>
<p>[out]:</p>
<pre><code>DET 1.0
IN 1.0
JJ 0
NN 0.75
VB 0.6666666666666666
</code></pre>
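<p>If you want a single summary number, a common choice is the macro-averaged F-score, i.e. the unweighted mean of the per-label F-scores above (this averaging step is an addition to the original example; sketched here in plain Python so it runs without <code>NLTK</code>):</p>

```python
from collections import Counter

ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
labels = set('DET NN VB IN JJ'.split())

# Tally tp/fp/fn directly from the aligned tag pairs.
tp, fp, fn = Counter(), Counter(), Counter()
for r, t in zip(ref, tagged):
    if r == t:
        tp[r] += 1
    else:
        fn[r] += 1
        fp[t] += 1

fscores = {}
for i in labels:
    if tp[i] == 0:
        fscores[i] = 0.0
    else:
        precision = tp[i] / float(tp[i] + fp[i])
        recall = tp[i] / float(tp[i] + fn[i])
        fscores[i] = 2 * precision * recall / (precision + recall)

# Macro-average: every label counts equally, regardless of frequency.
macro_fscore = sum(fscores.values()) / len(labels)
print(round(macro_fscore, 4))  # (1.0 + 1.0 + 0 + 0.75 + 0.6667) / 5 -> 0.6833
```

<p>Macro-averaging treats rare labels (like the single <code>JJ</code> here) the same as frequent ones, which is why the zero F-score for <code>JJ</code> pulls the average down noticeably.</p>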
<p>I hope the above clears up the usage of the confusion matrix in <code>NLTK</code>. Here's the full code for the example above:</p>
<pre><code>from collections import Counter
from nltk.metrics import ConfusionMatrix

ref = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
cm = ConfusionMatrix(ref, tagged)
print(cm)

labels = set('DET NN VB IN JJ'.split())

true_positives = Counter()
false_negatives = Counter()
false_positives = Counter()

for i in labels:
    for j in labels:
        if i == j:
            true_positives[i] += cm[i, j]
        else:
            false_negatives[i] += cm[i, j]
            false_positives[j] += cm[i, j]

print("TP:", sum(true_positives.values()), true_positives)
print("FN:", sum(false_negatives.values()), false_negatives)
print("FP:", sum(false_positives.values()), false_positives)
print()

for i in sorted(labels):
    if true_positives[i] == 0:
        fscore = 0
    else:
        precision = true_positives[i] / float(true_positives[i] + false_positives[i])
        recall = true_positives[i] / float(true_positives[i] + false_negatives[i])
        fscore = 2 * (precision * recall) / (precision + recall)
    print(i, fscore)
</code></pre>