How to use the confusion matrix module in NLTK?


I followed the NLTK book when using the confusion matrix, but the ConfusionMatrix output looks strange.

import nltk
from nltk.corpus import brown

# Empirically examine where the tagger is making mistakes
test_tags = [tag for sent in brown.sents(categories='editorial')
    for (word, tag) in t2.tag(sent)]
gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
print(nltk.ConfusionMatrix(gold_tags, test_tags))

Can anyone explain how to use the confusion matrix?


2 Answers

Here is a real text classifier that works with both sklearn and NLTK:

from collections import defaultdict

import nltk
from sklearn import metrics

# `classifier` and `testset` are assumed to come from your own NLTK
# classifier training, e.g. an nltk.NaiveBayesClassifier and a list of
# (feature_dict, label) pairs.
refsets = defaultdict(set)
testsets = defaultdict(set)
labels = []
tests = []
for i, (feats, label) in enumerate(testset):
    refsets[label].add(i)                    # gold label for item i
    observed = classifier.classify(feats)    # predicted label for item i
    testsets[observed].add(i)
    labels.append(label)
    tests.append(observed)

print(metrics.confusion_matrix(labels, tests))  # sklearn confusion matrix
print(nltk.ConfusionMatrix(labels, tests))      # NLTK confusion matrix
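Since refsets and testsets map each label to a set of item indices, you can also feed them straight into NLTK's set-based scorers; a minimal sketch, reusing the variables built above:

from nltk.metrics import precision, recall, f_measure

# Per-label scores from the reference/test index sets built above.
for label in refsets:
    print(label,
          precision(refsets[label], testsets[label]),
          recall(refsets[label], testsets[label]),
          f_measure(refsets[label], testsets[label]))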

First, I assume you got the code from Chapter 05 of the old NLTK book: https://nltk.googlecode.com/svn/trunk/doc/book/ch05.py, and in particular you are looking at this section: http://pastebin.com/EC8fFqLU
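For reference, the t2 in your snippet is that chapter's bigram tagger with backoff; a rough sketch of how the book builds it (training on the Brown 'news' category) would be:

import nltk
from nltk.corpus import brown

# Rough reconstruction of ch05's tagger pipeline: default -> unigram -> bigram.
brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)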

Now, let's look at the confusion matrix in NLTK. Try:

from nltk.metrics import ConfusionMatrix
ref  = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
cm = ConfusionMatrix(ref, tagged)
print(cm)

[out]:

    | D         |
    | E I J N V |
    | T N J N B |
----+-----------+
DET |<3>. . . . |
 IN | .<1>. . . |
 JJ | . .<.>1 . |
 NN | . . .<3>1 |
 VB | . . . .<1>|
----+-----------+
(row = reference; col = test)

The numbers wrapped in <> are the true positives (tp). From the example above, you can see that one JJ from the reference was wrongly tagged as NN in the tagged output. That counts as one false positive for NN and one false negative for JJ.
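Since ConfusionMatrix supports indexing with a (reference, test) pair (the loops below rely on this), you can spot-check individual cells:

# cm[reference_tag, test_tag] gives the count in that cell.
print(cm['JJ', 'NN'])    # 1: one JJ in the reference was tagged as NN
print(cm['NN', 'VB'])    # 1: one NN in the reference was tagged as VB
print(cm['DET', 'DET'])  # 3: DET was tagged correctly three times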

To work with the confusion matrix (for calculating precision/recall/F-score), you can get at the false negatives, false positives and true positives like this:

from collections import Counter

labels = set('DET NN VB IN JJ'.split())

true_positives = Counter()
false_negatives = Counter()
false_positives = Counter()

for i in labels:
    for j in labels:
        if i == j:
            true_positives[i] += cm[i,j]
        else:
            false_negatives[i] += cm[i,j]
            false_positives[j] += cm[i,j]

print "TP:", sum(true_positives.values()), true_positives
print "FN:", sum(false_negatives.values()), false_negatives
print "FP:", sum(false_positives.values()), false_positives

[out]:

TP: 8 Counter({'DET': 3, 'NN': 3, 'VB': 1, 'IN': 1, 'JJ': 0})
FN: 2 Counter({'NN': 1, 'JJ': 1, 'VB': 0, 'DET': 0, 'IN': 0})
FP: 2 Counter({'VB': 1, 'NN': 1, 'DET': 0, 'JJ': 0, 'IN': 0})
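From those counters you can also pool micro-averaged precision and recall over all labels; here both come out the same because every token receives exactly one tag:

# Micro-averaged scores pooled over all labels.
tp = sum(true_positives.values())   # 8
fp = sum(false_positives.values())  # 2
fn = sum(false_negatives.values())  # 2

print("micro P:", tp / float(tp + fp))  # 0.8
print("micro R:", tp / float(tp + fn))  # 0.8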

To compute the F-score for each label:

for i in sorted(labels):
    if true_positives[i] == 0:
        fscore = 0
    else:
        precision = true_positives[i] / float(true_positives[i]+false_positives[i])
        recall = true_positives[i] / float(true_positives[i]+false_negatives[i])
        fscore = 2 * (precision * recall) / float(precision + recall)
    print(i, fscore)

[out]:

DET 1.0
IN 1.0
JJ 0
NN 0.75
VB 0.6666666666666666
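If you want a single figure across labels, the macro-averaged F-score is just the unweighted mean of the per-label scores computed above; a small sketch:

# Macro-average: unweighted mean of the per-label F-scores.
fscores = {}
for i in sorted(labels):
    if true_positives[i] == 0:
        fscores[i] = 0.0
    else:
        precision = true_positives[i] / float(true_positives[i] + false_positives[i])
        recall = true_positives[i] / float(true_positives[i] + false_negatives[i])
        fscores[i] = 2 * (precision * recall) / (precision + recall)

print(sum(fscores.values()) / len(fscores))  # roughly 0.683 for this example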

I hope the above clears up how the confusion matrix is used in NLTK. Here is the full code for the example above:

from collections import Counter
from nltk.metrics import ConfusionMatrix

ref  = 'DET NN VB DET JJ NN NN IN DET NN'.split()
tagged = 'DET VB VB DET NN NN NN IN DET NN'.split()
cm = ConfusionMatrix(ref, tagged)

print(cm)

labels = set('DET NN VB IN JJ'.split())

true_positives = Counter()
false_negatives = Counter()
false_positives = Counter()

for i in labels:
    for j in labels:
        if i == j:
            true_positives[i] += cm[i,j]
        else:
            false_negatives[i] += cm[i,j]
            false_positives[j] += cm[i,j]

print "TP:", sum(true_positives.values()), true_positives
print "FN:", sum(false_negatives.values()), false_negatives
print "FP:", sum(false_positives.values()), false_positives
print 

for i in sorted(labels):
    if true_positives[i] == 0:
        fscore = 0
    else:
        precision = true_positives[i] / float(true_positives[i]+false_positives[i])
        recall = true_positives[i] / float(true_positives[i]+false_negatives[i])
        fscore = 2 * (precision * recall) / float(precision + recall)
    print(i, fscore)
