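I have some texts that I have been playing around with: Latin documents, each with a summary of its content in English. I am trying to perform NER on both texts in order to extract dates, places and persons. I started with the English part, thinking it should be easier, and I am using chunking. The dates are not being recognised, and not all entities are captured. Is there a way to customise the output to make it more precise? Here is my code sample: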
text = 'Thursday, 3 September 1467. The Jew Azar Nifusi leases his fields called Ta Xellula and Gnien Hagem in the district of Dejr is-Safsaf for ten years to Nicolaus Delia and his son Lemus for the price of eight salme of wheat each harvest-time. The tenants also bind themselves to give Nifusi each year ten salme of brushwood and two salme of straw. On his part the Jew promised to build a surrounding wall for the fields at his own expense.'
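import nltk

# Split into sentences, tokenize each sentence, then POS-tag the tokens.
sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
# binary=True labels every entity simply as 'NE' rather than PERSON, GPE, etc.
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

entity_names = []
for tree in chunked_sentences:
    # extract_entity_names is defined elsewhere (not shown here)
    entity_names.extend(extract_entity_names(tree))
print(set(entity_names))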
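The helper `extract_entity_names` is called above but its definition is not included. A minimal sketch of what such a helper might look like, assuming it walks the chunk tree recursively and collects every subtree labelled `NE` (the label the chunker uses with `binary=True`). The body below is an assumption, not the asker's actual code:

```python
def extract_entity_names(tree):
    """Collect the text of every subtree labelled 'NE' (hypothetical helper)."""
    names = []
    # Leaves are plain (word, tag) tuples and have no label() method,
    # so only tree nodes pass this check.
    if hasattr(tree, "label"):
        if tree.label() == "NE":
            # Join the words of the entity, dropping the POS tags.
            names.append(" ".join(word for word, _tag in tree))
        else:
            # Not an entity node (e.g. the sentence root 'S'): recurse.
            for child in tree:
                names.extend(extract_entity_names(child))
    return names
```

With `binary=False` the labels would be `PERSON`, `GPE` and so on instead of `NE`, so the comparison would need adjusting.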
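This is the output I get:
[output not preserved]
I was expecting at least the date, the Jew, Azar Nifusi, Ta Xellula, Gnien Hagem, Dejr is-Safsaf, Nicolaus Delia and Lemus to be extracted. Any help?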
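As for the date, one option that does not depend on the chunker at all is a small regular expression. The sketch below assumes dates are written like `Thursday, 3 September 1467` (optional weekday, day, English month name, year). `find_dates`, `DATE_RE` and the pattern itself are illustrative choices, not part of NLTK:

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
WEEKDAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"

# Assumed date shape: [Weekday, ] day month year
DATE_RE = re.compile(
    r"\b(?:(?:%s),\s+)?(\d{1,2})\s+(%s)\s+(\d{3,4})\b" % (WEEKDAYS, MONTHS)
)

def find_dates(text):
    """Return (matched_text, day, month, year) for each date-like span."""
    return [
        (m.group(0), int(m.group(1)), m.group(2), int(m.group(3)))
        for m in DATE_RE.finditer(text)
    ]
```

Other date formats in the summaries would need their own patterns.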
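Using this one line of code will get you the date and the other entities. It is in tree format, but I assume you can extract the contents in a cleaner format yourself afterwards.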
Output:
[output not preserved]
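To get from the tree to a cleaner format, something along these lines might do. It is only a sketch: `group_entities` is a made-up name, and it assumes a flat tree such as `nltk.ne_chunk` returns with the default `binary=False`, where entity subtrees carry labels like `PERSON` or `GPE` and ordinary tokens are plain `(word, tag)` tuples.

```python
from collections import defaultdict

def group_entities(tree):
    """Group entity strings by chunk label (hypothetical helper)."""
    grouped = defaultdict(list)
    for node in tree:
        # Ordinary tokens are (word, tag) tuples without a label() method;
        # only entity subtrees are kept.
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node)
            grouped[node.label()].append(text)
    return dict(grouped)
```

Called as `group_entities(nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text))))`, this should give a dict keyed by entity label, with the list of matching strings under each key.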