
Posted 2024-05-05 20:46:10


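Is there a Python library that contains (English) personal names? If not, what is a good way to remove personal names from every document in a corpus? Here is a simple example: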

texts=['Melissa\'s home was clean and spacious. I would love to visit again soon.','Kevin was nice and Kevin\'s home had a huge parking spaces.'] 



import re

# load a set of known names (over 18,000) from a file, one name per line
with open('names.txt', 'r') as f:
    NAMES = set(f.read().splitlines())

# your list of texts
texts=["Melissa's home was clean and spacious. I would love to visit again soon.",
"Kevin was nice and Kevin's home had a huge parking spaces."]

# join the texts into one string
texts = ' | '.join(texts)

# find all the words that look like names
pattern = r"(\b[A-Z][a-z]+('s)?\b)"
found_names = re.findall(pattern, texts)

# strip the possessive 's and remove duplicates
found_names = {name[0].replace("'s", "") for name in found_names}

# remove all the words that look like names but are not included in the NAMES
found_names = [name for name in found_names if name in NAMES]

# loop through the found names and remove every name from the texts
for name in found_names:
    # \b keeps a short name from matching inside a longer word; ('s)? also removes the possessive form
    texts = re.sub(r"\b" + re.escape(name) + r"('s)?\b", "", texts)

# split the joined string back into a list of texts
texts = texts.split(' | ')



Output:
[' home was clean and spacious. I would love to visit again soon.',
' was nice and  home had a huge parking spaces.']
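
The same idea can also be packaged as a function that processes each text separately, so the texts never have to be joined and split again. This is only a sketch: `remove_names` is a hypothetical helper, and the small name set below is a stand-in for the contents of names.txt.

```python
import re


def remove_names(texts, names):
    """Return copies of the texts with every known name (and a trailing 's) removed."""
    if not names:
        return list(texts)
    # One alternation of all names; re.escape guards against regex metacharacters.
    alternation = "|".join(re.escape(name) for name in names)
    # \b on both sides so that "Art" does not match inside "Arthur" or "Artists".
    pattern = re.compile(r"\b(?:" + alternation + r")(?:'s)?\b")
    return [pattern.sub("", text) for text in texts]


# Stand-in for the names loaded from names.txt (assumption for illustration).
sample_names = {"Melissa", "Kevin", "Art"}
sample_texts = [
    "Melissa's home was clean.",
    "Kevin was nice and Kevin's home was big.",
    "Artists like Arthur.",
]
cleaned = remove_names(sample_texts, sample_names)
print(cleaned)
```

A single compiled pattern scans each text once, which may matter when the name list has thousands of entries. Like the original code, this only removes names that appear in the list, and it will also remove ordinary words that happen to be on it.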





import spacy
import pandas as pd
nlp = spacy.load("en_core_web_sm")
texts=["Melissa's home was clean and spacious. I would love to visit again soon.",
       "Kevin was nice and Kevin's home had a huge parking spaces.",
       "Bill sold a work of art to Art and gave him a bill"]
tokenList = []
for i, sentence in enumerate(texts):
    doc = nlp(sentence)
    for token in doc:
        tokenList.append([i, token.text, token.lemma_, token.pos_, token.tag_, token.dep_])
tokenDF = pd.DataFrame(tokenList, columns=["i", "text", "lemma", "POS", "tag", "dep"]).set_index("i")

So the first two sentences are easy: spaCy tags the names as proper nouns (POS "PROPN").
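
Building on the PROPN tag, one possible way to actually remove the names is to rebuild each text from its tokens and skip the proper nouns. The sketch below is an assumption about how that could look, not part of the original answer: `strip_proper_nouns`, `strip_with_spacy`, and `fake` are hypothetical helpers, and the demo tokens are annotated by hand rather than produced by the model.

```python
from types import SimpleNamespace


def strip_proper_nouns(tokens):
    """Rebuild the text from tokens, dropping proper nouns and a possessive 's that follows one."""
    kept = []
    after_name = False
    for tok in tokens:
        if tok.pos_ == "PROPN":
            after_name = True
            continue
        if after_name and tok.tag_ == "POS":  # possessive ending such as 's
            after_name = False
            continue
        after_name = False
        kept.append(tok.text_with_ws)
    return "".join(kept)


def strip_with_spacy(texts, model="en_core_web_sm"):
    """Apply the filter to real spaCy docs (needs spaCy and the model installed)."""
    import spacy  # imported lazily so the filter above can be tried without spaCy
    nlp = spacy.load(model)
    return [strip_proper_nouns(doc) for doc in nlp.pipe(texts)]


# Hand-annotated stand-ins for spaCy tokens (an assumption for illustration,
# not real model output) for the text "Kevin's home was clean."
def fake(text, pos, tag, ws=" "):
    return SimpleNamespace(text_with_ws=text + ws, pos_=pos, tag_=tag)


demo_tokens = [
    fake("Kevin", "PROPN", "NNP", ws=""),
    fake("'s", "PART", "POS"),
    fake("home", "NOUN", "NN"),
    fake("was", "AUX", "VBD"),
    fake("clean", "ADJ", "JJ", ws=""),
    fake(".", "PUNCT", ".", ws=""),
]
print(strip_proper_nouns(demo_tokens))  # home was clean.
```

PROPN also covers places, brands, and other proper nouns, so checking `tok.ent_type_ == "PERSON"` instead may be closer to "personal names only". Either way, the result depends on the tagger, which is why a sentence such as the third one, where "Bill" and "Art" are names while "bill" and "art" are common nouns, is the harder case.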


