Regex approach to capture both 1-word and 2-word proper nouns

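3 answers

User

Answer 1 · edited 2024-06-26 12:49:53

I would use an NLP tool; the most popular one for Python seems to be nltk. Regular expressions really are not the right approach here... There is an example on the front page of the nltk website, linked from an earlier answer and copied below: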

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> entities = nltk.chunk.ne_chunk(tagged)

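entities now holds the words tagged with Penn Treebank part-of-speech tags, with recognized named entities grouped into labeled subtrees.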

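User

Answer 2 · edited 2024-06-26 12:49:53

This is not entirely correct, but it will match most of what you are looking for. The exception is On, which it also picks up as a false positive.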

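import re
text = """
#'On its 25th anniversary, Ashoka',

#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',
"""
# One capitalized word, optionally followed by a second capitalized word.
proper_noun_regex = r'([A-Z]{1}[a-z]{1,}(\s[A-Z]{1}[a-z]{1,})?)'
p = re.compile(proper_noun_regex)
# With two capturing groups, findall returns (full match, second word) tuples.
matches = p.findall(text)

print(matches)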

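Output:

[('On', ''), ('Ashoka', ''), ('Shift Series', ' Series'), ('Compass Partners', ' Partners'), ('Kenneth Cole', ' Cole')]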

Then you could implement a filter that goes over this list and drops the false positives.

def filter_false_positive(unfiltered_matches):
    filtered_matches = []
    black_list = ["an","on","in","foo","bar"] #etc
    for match in unfiltered_matches:
        if match.lower() not in black_list:
            filtered_matches.append(match)
    return filtered_matches

Or, because Python is cool:

def filter_false_positive(unfiltered_matches):
    black_list = ["an","on","in","foo","bar"] #etc
    return [match for match in unfiltered_matches if match.lower() not in black_list]

You can use it like this:

# CONTINUED FROM THE CODE ABOVE
matches = [i[0] for i in matches]  # keep only the full match from each tuple
matches = filter_false_positive(matches)
print(matches)

Which gives the final output:

['Ashoka', 'Shift Series', 'Compass Partners', 'Kenneth Cole']
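The i[0] step is needed because the pattern contains capturing groups, so findall returns tuples. As a minimal sketch (the names PROPER_NOUN, BLACKLIST, and find_proper_nouns are made up here for illustration and are not part of the answer above), a non-capturing group makes findall return plain strings, and a set keeps the blacklist lookup cheap:

```python
import re

# Same idea as the pattern above, but with a non-capturing group (?:...)
# so that findall() returns whole matches as strings instead of tuples.
PROPER_NOUN = re.compile(r'[A-Z][a-z]+(?:\s[A-Z][a-z]+)?')

# Hypothetical blacklist of common words that are often capitalized.
BLACKLIST = frozenset(["an", "on", "in", "foo", "bar"])


def find_proper_nouns(text):
    """Return 1- and 2-word capitalized matches that are not blacklisted."""
    return [m for m in PROPER_NOUN.findall(text) if m.lower() not in BLACKLIST]


text = ("On its 25th anniversary, Ashoka\n"
        "at the Shift Series national conference, Compass Partners "
        "and fashion designer Kenneth Cole")
print(find_proper_nouns(text))
# ['Ashoka', 'Shift Series', 'Compass Partners', 'Kenneth Cole']
```

Note that only the whole match is compared against the blacklist, so a two-word match that starts with a blacklisted word (for example "On Monday") still gets through.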

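Deciding whether a word is capitalized because it starts a sentence or because it is a proper noun is not a trivial problem.

'Kenneth Cole is a brand name.' vs. 'Can I eat something now?' vs. 'An English man had tea'

Cases like these are quite hard, so unless you have some other criterion for recognizing proper nouns (a blacklist, a database, and so on), it will not be easy. Regex is great, but I do not think it can interpret English at the grammatical level in any trivial way...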
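To make the ambiguity concrete, here is a rough sketch that flags each capitalized match as sentence-initial or not. The helper name candidates_with_position and the rule "preceded by nothing or by ., ! or ?" are assumptions chosen for illustration; real sentence splitting is harder than this.

```python
import re

# One capitalized word, optionally followed by a second capitalized word.
CAPITALIZED = re.compile(r'\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)?')


def candidates_with_position(text):
    """Return (match, at_sentence_start) pairs for capitalized matches."""
    results = []
    for m in CAPITALIZED.finditer(text):
        # Naive rule: a match starts a sentence if nothing precedes it,
        # or if the preceding non-space character ends a sentence.
        before = text[:m.start()].rstrip()
        at_start = before == "" or before[-1] in ".!?"
        results.append((m.group(), at_start))
    return results


text = ("Kenneth Cole is a brand name. Can I eat something now? "
        "An English man had tea with Kenneth Cole.")
for match, at_start in candidates_with_position(text):
    label = "sentence start, ambiguous" if at_start else "mid-sentence"
    print(match, "->", label)
```

Only the final "Kenneth Cole" is reported as mid-sentence. The first "Kenneth Cole", "Can", and "An English" all sit at a sentence start, and capitalization alone cannot tell which of them are proper nouns.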

That said, good luck!

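User

Answer 3 · edited 2024-06-26 12:49:53

What you are trying to do here is known as "named entity recognition" in natural language processing. If you really want a method that finds proper nouns, you may want to consider stepping up to named entity recognition. Thankfully, the nltk library has some easy-to-use functions for this: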

import nltk
s2 = 'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole'
tokens2 = nltk.word_tokenize(s2)
tags = nltk.pos_tag(tokens2)
res = nltk.ne_chunk(tags)

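Result: res is an nltk Tree in which the recognized entities appear as labeled subtrees (for example PERSON or ORGANIZATION); print(res) shows them.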
