Improving human name extraction with NLTK

import nltk
from nameparser.parser import HumanName


def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary=False)
    person_list = []
    person = []
    name = ""
    # NLTK 3 exposes the chunk type via label(); NLTK 2 used the .node attribute
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1:  # avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
    return person_list


text = """
Some economists have responded positively to Bitcoin, including
Francois R. Velde, senior economist of the Federal Reserve in Chicago
who described it as "an elegant solution to the problem of creating a
digital currency." In November 2013 Richard Branson announced that
Virgin Galactic would accept Bitcoin as payment, saying that he had
invested in Bitcoin and found it "fascinating how a whole new global
currency has been created", encouraging others to also invest in
Bitcoin. Other economists commenting on Bitcoin have been critical.
Economist Paul Krugman has suggested that the structure of the
currency incentivizes hoarding and that its value derives from the
expectation that others will accept it as payment. Economist Larry
Summers has expressed a "wait and see" attitude when it comes to
Bitcoin. Nick Colas, a market strategist for ConvergEx Group, has
remarked on the effect of increasing use of Bitcoin and its restricted
supply, noting, "When incremental adoption meets relatively fixed
supply, it should be no surprise that prices go up. And that’s exactly
what is happening to BTC prices."
"""

names = get_human_names(text)
print("LAST, FIRST")
for name in names:
    last_first = HumanName(name).last + ', ' + HumanName(name).first
    print(last_first)

3 Answers

User

Answer 1 · Edited 2024-06-13 15:20:05

For anyone else looking, I found this article useful: http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code

>>> import nltk
>>> def extract_entities(text):
...     for sent in nltk.sent_tokenize(text):
...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
...             # Named-entity chunks are Tree objects with a label(); plain tokens are tuples
...             if hasattr(chunk, 'label'):
...                 print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))
...

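User

Answer 2 · Edited 2024-06-13 15:20:05

I have to agree with the comment that "make my code better" is not a great fit for this site, but I can point you in a direction that lets you dig deeper.

Take a look at the Stanford Named Entity Recognizer (NER). Its binding is already included in NLTK v2.0, but you have to download some core files. Here is a script that can do all of that for you.

I wrote this script: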

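import nltk
# NLTK 2.0 name; newer NLTK 3.x releases rename this class to StanfordNERTagger
from nltk.tag.stanford import NERTagger

# Arguments: path to the trained CRF model, then path to the Stanford NER jar
st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        # Keep only tokens the tagger labels as part of a person name
        if tag[1] == 'PERSON':
            print(tag)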

The output is not too bad:

('Francois', 'PERSON')
('R.', 'PERSON')
('Velde', 'PERSON')
('Richard', 'PERSON')
('Branson', 'PERSON')
('Virgin', 'PERSON')
('Galactic', 'PERSON')
('Bitcoin', 'PERSON')
('Bitcoin', 'PERSON')
('Paul', 'PERSON')
('Krugman', 'PERSON')
('Larry', 'PERSON')
('Summers', 'PERSON')
('Bitcoin', 'PERSON')
('Nick', 'PERSON')
('Colas', 'PERSON')
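
Because the tagger labels each token separately, the output above lists name parts one by one. One way to merge consecutive PERSON tokens back into full names might look like the sketch below. `group_person_tokens` is a hypothetical helper, not part of NLTK, and it assumes the input is the full list of (token, tag) pairs for one sentence, with non-entity tokens tagged `'O'`.

```python
from itertools import groupby


def group_person_tokens(tagged_tokens):
    """Merge runs of consecutive PERSON-tagged tokens into full names.

    tagged_tokens: list of (token, tag) pairs for a single sentence.
    """
    names = []
    # groupby yields one run per stretch of adjacent tokens sharing the same key
    for is_person, run in groupby(tagged_tokens, key=lambda pair: pair[1] == 'PERSON'):
        if is_person:
            names.append(' '.join(token for token, _ in run))
    return names


# Hand-written sample in the same (token, tag) shape as the output above
sample = [('Economist', 'O'), ('Paul', 'PERSON'), ('Krugman', 'PERSON'),
          ('has', 'O'), ('suggested', 'O')]
print(group_person_tokens(sample))
```

Note that this only joins tokens; false positives such as "Virgin Galactic" would be merged as well and still need filtering.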

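Hope this helps.

User

Answer 3 · Edited 2024-06-13 15:20:05

You could try resolving the names you find and checking whether they appear in a database such as freebase.com. Either get the data locally and query it (it is in RDF), or use Google's API: https://developers.google.com/freebase/v1/getting-started. Most large companies, geographic locations, and so on (which your snippet would probably pick up) could then be discarded based on the Freebase data.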
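
The filtering step might be sketched as follows. This assumes the known non-person entities have already been loaded into a local set, for example from a downloaded data dump. `KNOWN_NON_PERSONS` and `drop_known_non_persons` are hypothetical names used only for illustration, and no real Freebase API call is made.

```python
# Hypothetical local lookup table standing in for a knowledge-base query
KNOWN_NON_PERSONS = {'virgin galactic', 'bitcoin', 'federal reserve'}


def drop_known_non_persons(candidates, known_non_persons=KNOWN_NON_PERSONS):
    """Keep only candidate names that are not listed as a company, place, etc."""
    return [name for name in candidates
            if name.strip().lower() not in known_non_persons]


candidates = ['Richard Branson', 'Virgin Galactic', 'Bitcoin', 'Paul Krugman']
print(drop_known_non_persons(candidates))
```

Matching here is a case-insensitive exact comparison; a real lookup would likely need to handle aliases and partial matches.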
