How can I improve NLTK's sentence tokenization?

Posted 2024-10-03 09:20:52


I came across this passage on Wikipedia:

An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action.

I used NLTK's nltk.sent_tokenize to get the sentences. It returns:

['An ambitious campus expansion plan was proposed by Fr.',
 'Vernon F. Gallagher in 1952.',
 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.',
 'It was during the tenure of Fr.',
 "Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]

NLTK handles initials such as the J. in Henry J. McAnulty correctly, but it fails on the abbreviation Fr. and splits the sentence in two at that point.

The correct tokenization would be:

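['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.',
 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.',
 "It was during the tenure of Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]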

How can I improve the tokenizer's performance?


1 answer
User
#1 · Posted 2024-10-03 09:20:52

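The great thing about the Kiss and Strunk (2006) Punkt algorithm is that it is unsupervised. So given a new text, you can retrain the model on it and then apply it to that text, e.g.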

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
>>> text = "An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."

# Training a new model with the text.
>>> tokenizer = PunktSentenceTokenizer()
>>> tokenizer.train(text)
<nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>

# It automatically learns the abbreviations.
>>> tokenizer._params.abbrev_types
{'f', 'fr', 'j'}

# Use the customized tokenizer.
>>> tokenizer.tokenize(text)
['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]

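If there is not enough data to produce good statistics when retraining the model, you can also supply a predefined list of abbreviations before training; see How to avoid NLTK's sentence tokenizer splitting on abbreviations?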
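A minimal sketch of the predefined-abbreviation approach, assuming the abbreviations are supplied by hand rather than learned. Entries in `PunktParameters.abbrev_types` are lowercase and written without the trailing period. The set below is an illustrative choice for this passage, not a list taken from the linked question.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

text = (
    "An ambitious campus expansion plan was proposed by Fr. Vernon F. "
    "Gallagher in 1952. Assumption Hall, the first student dormitory, was "
    "opened in 1954, and Rockwell Hall was dedicated in November 1958, "
    "housing the schools of business and law. It was during the tenure of "
    "F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put "
    "to action."
)

# Hand-picked abbreviations (an assumption for this passage):
# lowercase, without the trailing period.
punkt_params = PunktParameters()
punkt_params.abbrev_types = {"fr", "f", "j"}

# Build the tokenizer from these parameters instead of training on text.
tokenizer = PunktSentenceTokenizer(punkt_params)
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)
```

Nothing is learned from data here, so any other abbreviations in your text (dr, mr, and so on) would have to be added to the set by hand.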

