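I am trying to extract all person names from a file path. My approach is to split the file path into individual words, apply NLTK's part-of-speech tagging to identify proper nouns, and then use the ne_chunk function to identify people.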

    import nltk
    import re

    def extract_entities(y):
        # make an empty list to receive the results of the operation
        AggPeople = []
        # split the file path by backslashes
        for y in y.split("\\"):
            # separate each segment into words, attach nltk POS tags (e.g. NNP),
            # then attach more specific nltk entity labels (e.g. PERSON)
            for chunk in nltk.ne_chunk(nltk.pos_tag(re.findall(r"[\w]+", y))) :
                # filter out everything but the PERSON labels
                if hasattr(chunk, 'label') and chunk.label() == "PERSON":
                    # collect the words of each PERSON chunk into the list
                    AggPeople.append(' '.join(c[0] for c in chunk.leaves()).capitalize())
        # filter out words you don't want
        AggPeople = [x for x in AggPeople if (x not in ['Schedules','Old'])]
        # get rid of duplicates with 'set'
        return set(AggPeople)

    # raw string, so the backslashes in the path are not treated as escape sequences
    text = r"O:\Country\Province\District\city\Cricket, Jimmy (Y1617F)\Old Schedules\Cricket, Jimmy (78655) Golick doo wop 7 Sept 2016.xlsx"
    print(extract_entities(text))

The problem is that the result is "Jimmy y1617f", and I would like it to be "Jimmy".
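
To see where the stray `y1617f` can come from, it may help to look at the tokens each path segment produces before any tagging happens. The following is only a rough sketch using the standard library: `path_tokens` and its `drop_digits` flag are hypothetical names made up for illustration, the path is a shortened version of the one above, and I have not checked how such a pre-filter changes what `ne_chunk` returns.

```python
import re

def path_tokens(path, drop_digits=False):
    """Split a Windows-style path on backslashes and tokenise each segment."""
    segments = []
    for segment in path.split("\\"):
        # Same word pattern as in the question: runs of letters, digits, underscores.
        tokens = re.findall(r"\w+", segment)
        if drop_digits:
            # Hypothetical pre-filter: skip any token containing a digit, e.g. "Y1617F".
            tokens = [t for t in tokens if not any(ch.isdigit() for ch in t)]
        segments.append(tokens)
    return segments

# Shortened version of the example path (assumption: same segment structure).
path = r"O:\Country\Cricket, Jimmy (Y1617F)\Old Schedules"
print(path_tokens(path))
print(path_tokens(path, drop_digits=True))
```

Because `\w+` treats `Y1617F` as an ordinary word, it reaches the tagger alongside `Cricket` and `Jimmy`; dropping digit-bearing tokens first is one possible way to keep it out of any chunk.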
I think nltk.ne_chunk groups words in a way that makes sense for ordinary text, but not for file paths. To work around this, I tried to define my own equivalent of nltk.ne_chunk, as follows:

    import nltk
    import re
    from nltk import RegexpParser

    def extract_entities(y):
        AggPeople = []
        patterns= r"<NP:{<NNP>+}"
        chunker = RegexpParser(patterns)
        print(chunker)
        for y in y.split("\\"):
            for chunk in chunker(nltk.pos_tag(re.findall(r"[\w]+", y))) :
                if hasattr(chunk, 'label') and chunk.label() == "PERSON":
                    AggPeople.append(' '.join(c[0] for c in chunk.leaves()).capitalize())
        AggPeople = [x for x in AggPeople if (x not in ['Schedules','Old'])]
        return set(AggPeople)

I get the following error:

    'RegexpParser' object is not callable

Full traceback:

    chunk.RegexpParser with 1 stages:
    RegexpChunkParser with 1 rules:
        <ChunkRule: '<NNP>'>
    Traceback (most recent call last):
      File "<ipython-input-282-cb323eff63b4>", line 1, in <module>
        runfile('C:/Users//.spyder-py3/ExtractingNames.py', wdir='C:/Users//.spyder-py3')
      File "C:\spydercustomize.py", line 827, in runfile
        execfile(filename, namespace)
      File "C:\spydercustomize.py", line 110, in execfile
        exec(compile(f.read(), filename, 'exec'), namespace)
      File "C:/Users//.spyder-py3/ExtractingNames.py", line 32, in <module>
        print(extract_entities(text))
      File "C:/Users//.spyder-py3/ExtractingNames.py", line 23, in extract_entities
        for chunk in chunker(nltk.pos_tag(re.findall(r"[\w]+", y))) :
    TypeError: 'RegexpParser' object is not callable
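
On the error itself: a `RegexpParser` instance is not a function, so it is normally used through its `parse()` method rather than called directly. Below is a minimal sketch of that usage, not a full fix for the script. The `(word, tag)` pairs are written by hand to stand in for `nltk.pos_tag` output (an assumption, so the snippet runs without any tagger models), and the grammar `NP: {<NNP>+}` is my guess at what the pattern above was meant to be.

```python
from nltk import RegexpParser, Tree

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output (assumed tags).
tagged = [("Cricket", "NNP"), ("Jimmy", "NNP"), ("Y1617F", "CD")]

# Grammar: one or more consecutive proper nouns form a chunk labelled "NP".
chunker = RegexpParser(r"NP: {<NNP>+}")

# Use parse() instead of calling the object; it returns an nltk Tree.
tree = chunker.parse(tagged)

# Chunks are subtrees; tokens outside any chunk stay as plain (word, tag) tuples.
names = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree
    if isinstance(subtree, Tree) and subtree.label() == "NP"
]
print(names)
```

Note that chunks produced by a grammar like this carry the label defined in the grammar (`NP` here), so a check for `chunk.label() == "PERSON"` would not match them. With these hand-written tags, the chunk covers `Cricket` and `Jimmy` but not `Y1617F`; what happens on the real path depends on the tags `nltk.pos_tag` actually assigns.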