Which features help classify sentence endings? (Sequence classification)

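Background:

I'm using pycrfsuite to perform sequence classification and find the end of the first sentence, as follows:

From the Brown corpus, I join every two sentences together and get their POS tags. Then I label each token with 'S' if a space follows it and with 'P' if a period follows it. Finally, I remove the period between the sentences and lowercase the token that follows it. I end up with something like this: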

Input:

    data = ['I love Harry Potter.', 'It is my favorite book.']

Output:

    sent = [('I', 'PRP'), ('love', 'VBP'), ('Harry', 'NNP'), ('Potter', 'NNP'),
            ('it', 'PRP'), ('is', 'VBZ'), ('my', 'PRP$'), ('favorite', 'JJ'),
            ('book', 'NN')]
    labels = ['S', 'S', 'S', 'P', 'S', 'S', 'S', 'S', 'S']
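Roughly, the labeling step looks like the sketch below. It is a simplified illustration rather than the full code: `strip_final_period` and `merge_pair` are names made up for this example, and it assumes each sentence is already tokenized and POS-tagged, with the final period as its own `('.', '.')` token.

```python
def strip_final_period(tagged):
    """Drop a trailing ('.', tag) token, if there is one."""
    if tagged and tagged[-1][0] == ".":
        return tagged[:-1]
    return tagged


def merge_pair(first, second):
    """Join two tagged sentences into one token sequence with S/P labels.

    The last token of the first sentence gets 'P' (a period followed it);
    every other token gets 'S'. The first token of the second sentence is
    lowercased so that capitalization does not give the boundary away.
    """
    first = strip_final_period(first)
    second = strip_final_period(second)

    if second:
        word, tag = second[0]
        second = [(word.lower(), tag)] + second[1:]

    sent = first + second
    labels = ["S"] * len(sent)
    if first:
        labels[len(first) - 1] = "P"
    return sent, labels


first = [("I", "PRP"), ("love", "VBP"), ("Harry", "NNP"),
         ("Potter", "NNP"), (".", ".")]
second = [("It", "PRP"), ("is", "VBZ"), ("my", "PRP$"),
          ("favorite", "JJ"), ("book", "NN"), (".", ".")]

sent, labels = merge_pair(first, second)
```

With the two example sentences, `sent` and `labels` come out the same as the output shown above.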

Currently, I extract the following general features:

    def word2features2(sent, i):
        word = sent[i][0]
        postag = sent[i][1]

        # Common features for all words
        features = [
            'bias',
            'word.lower=' + word.lower(),
            'word[-3:]=' + word[-3:],
            'word[-2:]=' + word[-2:],
            'word.isupper=%s' % word.isupper(),
            'word.isdigit=%s' % word.isdigit(),
            'postag=' + postag
        ]

        # Features for words that are not at the beginning of a document
        if i > 0:
            word1 = sent[i-1][0]
            postag1 = sent[i-1][1]
            features.extend([
                '-1:word.lower=' + word1.lower(),
                '-1:word.isupper=%s' % word1.isupper(),
                '-1:word.isdigit=%s' % word1.isdigit(),
                '-1:postag=' + postag1
            ])
        else:
            # Indicate that it is the 'beginning of a sentence'
            features.append('BOS')

        # Features for words that are not at the end of a document
        if i < len(sent)-1:
            word1 = sent[i+1][0]
            postag1 = sent[i+1][1]
            features.extend([
                '+1:word.lower=' + word1.lower(),
                '+1:word.isupper=%s' % word1.isupper(),
                '+1:word.isdigit=%s' % word1.isdigit(),
                '+1:postag=' + postag1
            ])
        else:
            # Indicate that it is the 'end of a sentence'
            features.append('EOS')

        return features

I then train the CRF with these parameters:

    import pycrfsuite

    trainer = pycrfsuite.Trainer(verbose=True)

    # Submit training data to the trainer
    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)

    # Set the parameters of the model
    trainer.set_params({
        # coefficient for L1 penalty
        'c1': 0.1,

        # coefficient for L2 penalty
        'c2': 0.01,

        # maximum number of iterations
        'max_iterations': 200,

        # whether to include transitions that
        # are possible, but not observed
        'feature.possible_transitions': True
    })

    trainer.train('crf.model')

Results:

The classification report shows:

                  precision    recall  f1-score   support

               S       0.99      1.00      0.99    214627
               P       0.81      0.57      0.67      5734

       micro avg       0.99      0.99      0.99    220361
       macro avg       0.90      0.79      0.83    220361
    weighted avg       0.98      0.99      0.98    220361

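Is there any way to edit word2features2() (or any other part) to improve the model?

Here is a link to the full code as it stands today.

Also, I'm just a beginner in NLP, so I would greatly appreciate any feedback, links to relevant or useful resources, and very simple explanations. Thanks a lot!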

1 Answer

User

#1 · Posted on 2024-10-03 02:41:00

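By the nature of the problem, your classes are highly imbalanced, so I would suggest using a weighted loss in which errors on the P label cost more than errors on the S label. I think the problem may be that, with both classes weighted equally, the classifier does not pay enough attention to the P labels, because they contribute very little to the total loss.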
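To make the idea concrete, here is a minimal sketch in plain Python. It is an illustration only: the helper names `balanced_class_weights` and `weighted_nll` are made up, the inverse-frequency heuristic is just one common choice, and a CRF actually optimizes a sequence-level loss rather than this token-level one. How (or whether) such weights can be passed to a particular CRF library depends on that library and is not shown here.

```python
import math


def balanced_class_weights(support):
    """Inverse-frequency weights: total / (n_classes * count) per label."""
    total = sum(support.values())
    n_classes = len(support)
    return {label: total / (n_classes * count)
            for label, count in support.items()}


def weighted_nll(gold_labels, gold_probs, weights):
    """Mean weighted negative log-likelihood over tokens.

    gold_probs[i] is the probability the model assigned to the
    correct label of token i.
    """
    total = sum(-weights[label] * math.log(prob)
                for label, prob in zip(gold_labels, gold_probs))
    return total / len(gold_labels)


# Support counts taken from the classification report in the question
support = {"S": 214627, "P": 5734}
weights = balanced_class_weights(support)

# One confident S token and one uncertain P token
gold = ["S", "P"]
probs = [0.9, 0.5]
plain = weighted_nll(gold, probs, {"S": 1.0, "P": 1.0})
weighted = weighted_nll(gold, probs, weights)
```

With these counts, a P token weighs roughly 37 times as much as an S token, so the uncertain P prediction dominates `weighted` while it barely registers in `plain`.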

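Another thing you can try is hyperparameter optimization. Make sure you optimize for the macro F1 score, since it gives both classes equal weight regardless of how many support instances each one has.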
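A minimal sketch of such a search, written in plain Python so it does not depend on any particular library. The `train_and_predict(c1, c2)` callable is hypothetical: you would write it yourself so that it trains a model with the given `c1` and `c2` penalties and returns the predicted labels for a held-out set, flattened into one list.

```python
from itertools import product


def f1_for_label(y_true, y_pred, label):
    """F1 score for one label, computed from flat lists of labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=("S", "P")):
    """Unweighted mean of per-label F1, so rare labels count equally."""
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


def grid_search(train_and_predict, y_true, c1_values, c2_values):
    """Try every (c1, c2) pair and keep the one with the best macro F1.

    Returns a (score, c1, c2) tuple.
    """
    best = None
    for c1, c2 in product(c1_values, c2_values):
        score = macro_f1(y_true, train_and_predict(c1, c2))
        if best is None or score > best[0]:
            best = (score, c1, c2)
    return best


# Why macro F1: a model that never predicts 'P' is 90% accurate on this
# toy data, yet its macro F1 is below 0.5.
y_true = ["S"] * 9 + ["P"]
always_s = ["S"] * 10
baseline = macro_f1(y_true, always_s)
```

Select the hyperparameters on a held-out validation set rather than on the test set, so the final report stays an honest estimate.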
