Extracting dates, people and places from Latin and English text

Published 2024-10-02 18:26:06


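I have some texts I have been playing around with, each consisting of an English summary of the content followed by the Latin. I am trying to perform NER on both texts to extract dates, places and people. I started with the English part, thinking it should be easier, and I am using chunking. The dates are not recognized, and not all entities are captured. Is there a way to customize the output so that it is more accurate? Here is a sample of my code: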

import nltk
text = 'Thursday, 3 September 1467. The Jew Azar Nifusi leases his fields called Ta Xellula and Gnien Hagem in the district of Dejr is-Safsaf for ten years to Nicolaus Delia and his son Lemus for the price of eight salme of wheat each harvest-time. The tenants also bind themselves to give Nifusi each year ten salme of brushwood and two salme of straw. On his part the Jew promised to build a surrounding wall for the fields at his own expense.'

# extract_entity_names is not shown in the original post; this is an assumed, commonly used definition.
def extract_entity_names(tree):
    names = []
    if hasattr(tree, 'label'):
        if tree.label() == 'NE':
            names.append(' '.join(child[0] for child in tree))
        else:
            for child in tree:
                names.extend(extract_entity_names(child))
    return names

sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
entity_names = []
for tree in chunked_sentences:
    entity_names.extend(extract_entity_names(tree))
print(set(entity_names))

This is the output I get:

[output not preserved in the source]

I was expecting at least the date, the Jew, Azar Nifusi, Ta Xellula, Gnien Hagem, Dejr is-Safsaf, Nicolaus Delia and Lemus to be extracted. Any help?


1 answer
User
#1 · Posted 2024-10-02 18:26:06

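You can get the date and the other entities with this one line of code. The result is in tree format, but I assume you can extract the contents into a cleaner format yourself afterwards.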

from nltk import ne_chunk, pos_tag, word_tokenize
ne_chunk(pos_tag(word_tokenize(text)))

Output:

[output not preserved in the source]
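If it helps, here is a minimal sketch of one way to flatten that tree into (label, text) pairs. The helper name `collect_entities` is my own (hypothetical). It assumes the tree is an `nltk.Tree` as returned by `ne_chunk`, where entity chunks are subtrees exposing `label()` and `leaves()` and ordinary tokens are plain `(word, tag)` tuples.

```python
def collect_entities(tree):
    """Return (label, text) pairs for each labelled chunk directly under the root.

    Hypothetical helper, not part of NLTK. Assumes `tree` behaves like the
    nltk.Tree returned by ne_chunk.
    """
    entities = []
    for node in tree:
        # Entity chunks are subtrees with a label() method;
        # ordinary tokens are plain (word, tag) tuples and are skipped.
        if hasattr(node, "label"):
            words = [word for word, tag in node.leaves()]
            entities.append((node.label(), " ".join(words)))
    return entities

# Intended usage (requires NLTK and its tokenizer, tagger and chunker data):
#   from nltk import ne_chunk, pos_tag, word_tokenize
#   tree = ne_chunk(pos_tag(word_tokenize(text)))
#   print(collect_entities(tree))
```

With the default `binary=False`, the labels are entity types such as PERSON or GPE, so the pairs can be grouped by label afterwards.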

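In case the chunker still misses the date, a plain regular expression can serve as a fallback for dates written like the one in the sample text. This is only a sketch: `find_dates`, `DATE_PATTERN` and the month and weekday lists are hypothetical names of mine, and the pattern only covers English dates of the form "weekday, day month year" (weekday optional), not Latin ones.

```python
import re

# Assumption: dates use full English month names and an optional weekday.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
WEEKDAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"

# Optional weekday, day of month, month name, then a 3- or 4-digit year.
DATE_PATTERN = re.compile(
    r"\b(?:(?:%s),?\s+)?(\d{1,2})\s+(%s)\s+(\d{3,4})\b" % (WEEKDAYS, MONTHS)
)

def find_dates(text):
    """Return each matched date as a (full_match, day, month, year) tuple."""
    return [(m.group(0), int(m.group(1)), m.group(2), int(m.group(3)))
            for m in DATE_PATTERN.finditer(text)]

sample = "Thursday, 3 September 1467. The Jew Azar Nifusi leases his fields."
print(find_dates(sample))
# [('Thursday, 3 September 1467', 3, 'September', 1467)]
```

The dates found this way can simply be merged with whatever the chunker returns for people and places.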