How to prevent proper nouns from being stemmed

Published 2024-06-23 03:23:00


I am trying to write a keyword-extraction program using the Stanford POS tagger and NER. For keyword extraction I am only interested in proper nouns. This is the basic approach:

  1. Clean the data by removing anything that is not a letter
  2. Remove stop words
  3. Stem each word
  4. Determine the POS tag of each word
  5. If the POS tag is a noun, feed it into the NER
  6. The NER then determines whether the word is a person, an organization, or a location

Sample code:

import re

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

docText = "Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London"

words = re.split(r"\W+", docText)

stops = set(stopwords.words("english"))

# remove stop words from the list
words = [w for w in words if w not in stops and len(w) > 2]

# Stemming
pstem = PorterStemmer()

words = [pstem.stem(w) for w in words]

nounsWeWant = set(['NN', 'NNS', 'NNP', 'NNPS'])

finalWords = []

stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')

for w in words:
    if stp.tag([w.lower()])[0][1] not in nounsWeWant:
        finalWords.append(w.lower())
    else:
        finalWords.append(w)

finalString = " ".join(finalWords)
print(finalString)

tagged = stn.tag(finalWords)
print(tagged)

This gives me:

(output omitted)

Clearly I do not want Boeing to get stemmed (to Boe), nor Company (to Compani). But I do need stemming on the input, because it may contain terms like Performing, and I have seen words like Performing get tagged as proper nouns by the NER and hence classified as an Organization. So first I convert the word to lowercase and check whether its POS tag is a noun. If it is, I keep the word as-is; if not, I convert it to lowercase before adding it to the final word list, which is then passed to the NER.
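One way to realize the idea above is to reverse the order of steps 3 and 4: POS-tag first, and stem only the tokens that are not proper nouns. A minimal sketch, where the hard-coded `tagged` list stands in for real tagger output and `crude_stem` is a toy stand-in for `PorterStemmer.stem` (both are assumptions for illustration, not the Stanford tools themselves):

```python
def crude_stem(word):
    # Toy stemmer: strips a couple of common suffixes.
    # A stand-in for PorterStemmer.stem so the sketch runs alone.
    for suf in ("ing", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def stem_non_proper(tagged):
    # tagged: list of (word, pos) pairs, as a POS tagger would return.
    out = []
    for word, pos in tagged:
        if pos in ("NNP", "NNPS"):
            out.append(word)                    # keep proper nouns intact
        else:
            out.append(crude_stem(word.lower()))  # stem everything else
    return out

tagged = [("Boeing", "NNP"), ("manages", "VBZ"), ("aircraft", "NN")]
print(stem_non_proper(tagged))
# → ['Boeing', 'manage', 'aircraft']
```

Because the proper-noun check happens before stemming, Boeing survives untouched while ordinary words are still normalized.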

Do you have any idea how to avoid stemming proper nouns?


1 Answer

Use the full Stanford CoreNLP pipeline for your NLP toolchain, and avoid rolling your own tokenizer, cleaner, POS tagger, etc.; home-grown preprocessing does not play well with the NER tool.

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip
unzip stanford-corenlp-full-2015-12-09.zip
cd stanford-corenlp-full-2015-12-09
echo "Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London" > test.txt
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file test.txt
cat test.txt.out 

[out]:

(output omitted)

Or to get JSON output:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file test.txt -outputFormat json

If you really need a Python wrapper, see https://github.com/smilli/py-corenlp:

$ cd stanford-corenlp-full-2015-12-09
$ export CLASSPATH=protobuf.jar:joda-time.jar:jollyday.jar:xom-1.2.10.jar:stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:slf4j-api.jar 
$ java -mx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer &
$ cd
$ git clone https://github.com/smilli/py-corenlp.git
$ cd py-corenlp
$ python
>>> from corenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ("Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London")
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner',  'outputFormat': 'json'})
>>> output
{u'sentences': [{u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE', u'index': 0, u'tokens': [{u'index': 1, u'word': u'Jack', u'lemma': u'Jack', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 4, u'characterOffsetBegin': 0, u'originalText': u'Jack', u'ner': u'PERSON', u'before': u''}, {u'index': 2, u'word': u'Frost', u'lemma': u'Frost', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 10, u'characterOffsetBegin': 5, u'originalText': u'Frost', u'ner': u'PERSON', u'before': u' '}, {u'index': 3, u'word': u'works', u'lemma': u'work', u'after': u' ', u'pos': u'VBZ', u'characterOffsetEnd': 16, u'characterOffsetBegin': 11, u'originalText': u'works', u'ner': u'O', u'before': u' '}, {u'index': 4, u'word': u'for', u'lemma': u'for', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 20, u'characterOffsetBegin': 17, u'originalText': u'for', u'ner': u'O', u'before': u' '}, {u'index': 5, u'word': u'Boeing', u'lemma': u'Boeing', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 27, u'characterOffsetBegin': 21, u'originalText': u'Boeing', u'ner': u'ORGANIZATION', u'before': u' '}, {u'index': 6, u'word': u'Company', u'lemma': u'Company', u'after': u'', u'pos': u'NNP', u'characterOffsetEnd': 35, u'characterOffsetBegin': 28, u'originalText': u'Company', u'ner': u'ORGANIZATION', u'before': u' '}, {u'index': 7, u'word': u'.', u'lemma': u'.', u'after': u' ', u'pos': u'.', u'characterOffsetEnd': 36, u'characterOffsetBegin': 35, u'originalText': u'.', u'ner': u'O', u'before': u''}]}, {u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE', u'index': 1, u'tokens': [{u'index': 1, u'word': u'He', u'lemma': u'he', u'after': u' ', u'pos': u'PRP', u'characterOffsetEnd': 39, u'characterOffsetBegin': 37, u'originalText': u'He', u'ner': u'O', u'before': u' '}, {u'index': 2, u'word': u'manages', u'lemma': u'manage', u'after': u' ', u'pos': u'VBZ', u'characterOffsetEnd': 47, u'characterOffsetBegin': 40, u'originalText': u'manages', u'ner': u'O', u'before': u' '}, {u'index': 3, u'after': u' ', 
u'word': u'5', u'lemma': u'5', u'normalizedNER': u'5.0', u'pos': u'CD', u'characterOffsetEnd': 49, u'characterOffsetBegin': 48, u'originalText': u'5', u'ner': u'NUMBER', u'before': u' '}, {u'index': 4, u'word': u'aircraft', u'lemma': u'aircraft', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 58, u'characterOffsetBegin': 50, u'originalText': u'aircraft', u'ner': u'O', u'before': u' '}, {u'index': 5, u'word': u'and', u'lemma': u'and', u'after': u' ', u'pos': u'CC', u'characterOffsetEnd': 62, u'characterOffsetBegin': 59, u'originalText': u'and', u'ner': u'O', u'before': u' '}, {u'index': 6, u'word': u'their', u'lemma': u'they', u'after': u' ', u'pos': u'PRP$', u'characterOffsetEnd': 68, u'characterOffsetBegin': 63, u'originalText': u'their', u'ner': u'O', u'before': u' '}, {u'index': 7, u'word': u'crew', u'lemma': u'crew', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 73, u'characterOffsetBegin': 69, u'originalText': u'crew', u'ner': u'O', u'before': u' '}, {u'index': 8, u'word': u'in', u'lemma': u'in', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 76, u'characterOffsetBegin': 74, u'originalText': u'in', u'ner': u'O', u'before': u' '}, {u'index': 9, u'word': u'London', u'lemma': u'London', u'after': u'', u'pos': u'NNP', u'characterOffsetEnd': 83, u'characterOffsetBegin': 77, u'originalText': u'London', u'ner': u'LOCATION', u'before': u' '}]}]}
>>> annotated_sent0 = output['sentences'][0]
>>> for token in annotated_sent0['tokens']:
...     print token['word'], token['lemma'], token['pos'], token['ner']
... 
Jack Jack NNP PERSON
Frost Frost NNP PERSON
works work VBZ O
for for IN O
Boeing Boeing NNP ORGANIZATION
Company Company NNP ORGANIZATION
. . . O

This is probably the output you want:

>>> " ".join(token['lemma'] for token in annotated_sent0['tokens'])
Jack Frost work for Boeing Company
>>> " ".join(token['word'] for token in annotated_sent0['tokens'])
Jack Frost works for Boeing Company
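If the goal is specifically proper-noun keywords, the JSON response above already carries everything needed: filter tokens on the `pos` field and read the entity type from `ner`. A sketch over a hand-copied subset of that response (the `output` dict below is trimmed for brevity; the field names match the CoreNLP JSON shown above):

```python
# Trimmed subset of the CoreNLP JSON response, so the filter runs stand-alone.
output = {"sentences": [{"tokens": [
    {"word": "Jack",    "pos": "NNP", "ner": "PERSON"},
    {"word": "Frost",   "pos": "NNP", "ner": "PERSON"},
    {"word": "works",   "pos": "VBZ", "ner": "O"},
    {"word": "Boeing",  "pos": "NNP", "ner": "ORGANIZATION"},
    {"word": "Company", "pos": "NNP", "ner": "ORGANIZATION"},
]}]}

def proper_noun_keywords(output):
    # Keep only tokens tagged as proper nouns (NNP/NNPS); the NER label
    # says whether each one is a PERSON, ORGANIZATION, or LOCATION.
    return [(t["word"], t["ner"])
            for sent in output["sentences"]
            for t in sent["tokens"]
            if t["pos"] in ("NNP", "NNPS")]

print(proper_noun_keywords(output))
# → [('Jack', 'PERSON'), ('Frost', 'PERSON'), ('Boeing', 'ORGANIZATION'), ('Company', 'ORGANIZATION')]
```

Since the pipeline never stems, the surface forms Boeing and Company come out intact, which is exactly what the question asked for.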

If you want a wrapper that ships with NLTK, you will have to wait a while until this issue gets resolved ;p
