Extracting dates, persons and places from Latin and English text

Posted 2024-10-02 18:26:06


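I have some texts in Latin that I have been playing around with, and I have a summary of their contents in English. I am trying to perform NER on both texts to extract dates, places and persons. I started with the English part, thinking it should be easier, and used chunking. The dates are not being recognized, and not all entities are being captured. Is there a way to customize the output so that it is more precise? Here is a sample of my code: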

import nltk

text = 'Thursday, 3 September 1467. The Jew Azar Nifusi leases his fields called Ta Xellula and Gnien Hagem in the district of Dejr is-Safsaf for ten years to Nicolaus Delia and his son Lemus for the price of eight salme of wheat each harvest-time. The tenants also bind themselves to give Nifusi each year ten salme of brushwood and two salme of straw. On his part the Jew promised to build a surrounding wall for the fields at his own expense.'

# extract_entity_names is not shown in the post; this is an assumed definition.
def extract_entity_names(tree):
    # With binary=True every entity subtree is labelled 'NE'.
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == 'NE']

sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

entity_names = []
for tree in chunked_sentences:
    entity_names.extend(extract_entity_names(tree))
print(set(entity_names))

This is the output I get:

[output not preserved in the source]

I was expecting at least the date, the Jew, Azar Nifusi, Ta Xellula, Gnien Hagem, Dejr is-Safsaf, Nicolaus Delia and Lemus to be extracted. Any help?
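Since the date is the part that is not being recognized here, one possible workaround is to pull dates out with a regular expression instead of relying on the chunker. This is only a minimal sketch: it assumes dates are always written like "Thursday, 3 September 1467" (optional weekday, day, English month name, year), and `MONTHS`, `DATE_RE` and `find_dates` are names made up for this example.

```python
import re

# English month names; an assumption about how dates appear in the summaries.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Optional weekday followed by a comma, then day, month name, and a 3-4 digit year.
DATE_RE = re.compile(
    r"(?:(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day,\s+)?"
    r"\d{1,2}\s+(?:" + MONTHS + r")\s+\d{3,4}"
)


def find_dates(text):
    """Return every substring of text that matches the date pattern."""
    return [match.group(0) for match in DATE_RE.finditer(text)]


sample = "Thursday, 3 September 1467. The Jew Azar Nifusi leases his fields."
print(find_dates(sample))  # ['Thursday, 3 September 1467']
```

A pattern like this would not carry over to the Latin originals as written; the month names and date layout there would need their own pattern.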


1 Answer

User

#1 · Posted 2024-10-02 18:26:06

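This one line of code will get you the dates and the other information. The result is in tree format, but I assume you can extract the content into a cleaner format yourself afterwards.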

from nltk import ne_chunk, pos_tag, word_tokenize
print(ne_chunk(pos_tag(word_tokenize(text))))

Output:

[output not preserved in the source]
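To get from the tree to a cleaner format, something along these lines might work. It is a sketch rather than a definitive solution: `tree_to_entities` and `extract_entities` are names made up for this example, and it assumes the tree is flat, as `ne_chunk` normally returns it, with entity subtrees that expose `.label()` and `.leaves()` and plain `(word, tag)` tuples for everything else. The NLTK tokenizer, tagger and chunker data must already be downloaded for `extract_entities` to run.

```python
def tree_to_entities(tree):
    """Flatten a chunk tree into a list of (label, text) pairs."""
    entities = []
    for node in tree:
        # Entity chunks are subtrees with a label; ordinary tokens are
        # plain (word, pos_tag) tuples and are skipped.
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


def extract_entities(text):
    """Run the one-liner from the answer and flatten its result."""
    # Imported here so tree_to_entities stays usable without NLTK data.
    from nltk import ne_chunk, pos_tag, word_tokenize
    return tree_to_entities(ne_chunk(pos_tag(word_tokenize(text))))
```

Calling `extract_entities(text)` would then give a list of pairs such as a label and the matching words, which is easier to filter by label (for example, keeping only `PERSON` entries) than the raw tree.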
