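<p>I would recommend a tokenizer that has some ability to recognize and distinguish proper nouns. spaCy is quite versatile, and its default pipeline does a decent job of this.</p>
<p>Treating a list of names as if they were stop words is hazardous. Let me illustrate:</p>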
<pre><code>import spacy
import pandas as pd

# Load spaCy's small English pipeline.
nlp = spacy.load("en_core_web_sm")

texts = [
    "Melissa's home was clean and spacious. I would love to visit again soon.",
    "Kevin was nice and Kevin's home had a huge parking spaces.",
    "Bill sold a work of art to Art and gave him a bill",
]

# Collect one row per token, keyed by the index of its source text.
tokenList = []
for i, sentence in enumerate(texts):
    doc = nlp(sentence)
    for token in doc:
        tokenList.append([i, token.text, token.lemma_, token.pos_, token.tag_, token.dep_])

tokenDF = pd.DataFrame(tokenList, columns=["i", "text", "lemma", "POS", "tag", "dep"]).set_index("i")
</code></pre>
<p>The first two sentences are easy, and spaCy tags the proper nouns as "PROPN":
<a href="https://i.stack.imgur.com/eh8Pa.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/eh8Pa.png" alt="enter image description here"/></a></p>
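<p>The third sentence captures the problem: many people's names are also ordinary words. spaCy's default tagger is not perfect, but it does a respectable job on both sides of the task: it does not remove names when they are used as regular words (e.g. a bill of goods, a work of art), and it does identify them when they are used as names. (You can see that it got one of the references to Art, the person, wrong.)</p>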
<p><a href="https://i.stack.imgur.com/0KbI9.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/0KbI9.png" alt="enter image description here"/></a></p>
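<p>To make the contrast concrete, here is a minimal sketch of the two filtering strategies. The helper names <code>drop_listed_names</code> and <code>drop_proper_nouns</code> are hypothetical, introduced only for illustration; they rely solely on the <code>token.text</code> and <code>token.pos_</code> attributes already used above. The demo at the bottom assumes spaCy and the <code>en_core_web_sm</code> model are installed and is skipped otherwise.</p>

```python
def drop_listed_names(doc, names):
    """Stop-word style: drop every token whose text matches a listed name.

    This also drops ordinary words that happen to be spelled like a name.
    """
    lowered = {name.lower() for name in names}
    return [token.text for token in doc if token.text.lower() not in lowered]


def drop_proper_nouns(doc):
    """POS-based: drop only the tokens the tagger labelled as proper nouns."""
    return [token.text for token in doc if token.pos_ != "PROPN"]


if __name__ == "__main__":
    try:
        import spacy
        nlp = spacy.load("en_core_web_sm")
    except (ImportError, OSError):
        # spaCy or the model is not installed; skip the demo.
        print("spaCy or en_core_web_sm not available, skipping demo")
    else:
        doc = nlp("Bill sold a work of art to Art and gave him a bill")
        # The name list removes "bill" and "art" wherever they occur.
        print(drop_listed_names(doc, {"Bill", "Art"}))
        # The POS filter keeps whatever the tagger did not label PROPN.
        print(drop_proper_nouns(doc))
```

<p>The POS-based filter is only as good as the tagger, so a mis-tagged name (like the one in the third sentence above) will still slip through; the name list, on the other hand, cannot tell "a bill" from "Bill" at all.</p>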