How do I find the same job title when it is written in different ways?

Posted 2024-10-01 07:20:41

I want to find people who have a particular job title on their résumé, but they may have written it differently, for example:

Marketing Research Coordinator
Market Researching Coordinator
Markets Research Coordinator
Market Researches Coordinator
Marketing Research Coordinator
Markets Researchers Coordinator
Market Researcher Coordinators
Marketing Researcher Coordinators
...

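If I match with ==, I won't get good results, and stemming and lemmatization also have trouble recognizing these differences.
Another option is to use a similarity measure between the two strings (which is discussed in this question), but that would be very time-consuming and may not be a good approach; choosing a threshold is a further problem with it.
Does anyone have a smart idea?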


Tags: method, job title, market, marketing, lemmatization, research, title, stemming
3 answers

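I don't accept that stemming and lemmatization don't work! You can tokenize your input and then stem each token. For 'Marketing', if the language is set correctly (check the language setting of your stemming package), you will get 'market'. You should also make sure that stemming is applied to both operands of the comparison in your if statement.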

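If there are spelling problems or minor differences, you can use the Levenshtein package and accept inputs whose similarity ratio exceeds some threshold T.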
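To make the threshold idea concrete, here is a minimal sketch that uses only the standard library instead of the Levenshtein package. The function names and the 0.8 cut-off are illustrative assumptions, not values from the answer; tune the threshold on your own data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def same_title(a: str, b: str, threshold: float = 0.8) -> bool:
    """Accept the pair when the ratio reaches the (hypothetical) threshold T."""
    return similarity(a, b) >= threshold


print(same_title("Marketing Research Coordinator", "Markets Research Coordinator"))
print(same_title("Marketing Research Coordinator", "Senior Advertising Analyst"))
```

The comparison is quadratic in the string lengths, which is negligible for short job titles but adds up if every résumé is compared against every other one.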

For example:

from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
# Stemming a single token works; stemming a multi-word string does not.
print("the stem of marketing:", p_stemmer.stem('Marketing'))
print("the stem of marketing research:", p_stemmer.stem('Marketing Research'))

The result is:

the stem of marketing: 'market' (correct)

the stem of marketing research: 'marketing research' (not what we want)

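As you can see, if the input is not tokenized first, the stemmer does not work as expected.

I suggest combining all of these (tokenization, stemming, and Levenshtein).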
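The combination suggested above could be sketched roughly as follows. To keep the example dependency-free, it swaps in two stand-ins that are assumptions of this sketch, not the answer's exact tools: a crude suffix stripper (crude_stem) in place of a real stemmer such as NLTK's PorterStemmer, and difflib.SequenceMatcher.ratio() in place of a Levenshtein ratio. The 0.9 threshold is also hypothetical.

```python
import re
from difflib import SequenceMatcher

# Crude, illustrative suffix list; a real stemmer handles far more cases.
SUFFIXES = ("ings", "ing", "ers", "er", "es", "s")


def crude_stem(token: str) -> str:
    """Strip one known suffix, keeping at least four characters of the token."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[:-len(suffix)]
    return token


def normalize(title: str) -> str:
    """Lowercase, tokenize on letters, stem each token, and rejoin."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return " ".join(crude_stem(t) for t in tokens)


def same_title(a: str, b: str, threshold: float = 0.9) -> bool:
    """Fuzzy-compare the normalized forms so small typos are still tolerated."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


main_job = "Marketing Research Coordinator"
for job in ["Markets Researchers Coordinator", "Marketing Researcher Executive"]:
    print(job, "->", normalize(job), same_title(main_job, job))
```

Because every title is normalized once, the normalized string can also be used as a dictionary key to group résumés, which avoids pairwise comparisons for the exact-after-stemming cases.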

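You can use the Python package textdistance to compute a normalized similarity between strings, and keep only the candidates whose similarity is above some threshold.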

import textdistance

main_job = 'Marketing Research Coordinator'

other_jobs = ['Market Researching Coordinator', 'Markets Research Coordinator', 
              'Market Researches Coordinator', 'Marketing Research Coordinator', 
              'Markets Researchers Coordinator', 'Market Researcher Coordinators',
              'Marketing Researcher Coordinators', 'Marketing Researcher Executive',
              'Senior Advertising Analyst']

for job in other_jobs:
    distance = textdistance.jaccard.normalized_similarity(main_job, job)
    print(f'Similarity "{main_job}" & "{job}": {distance:.3f}')

Output:

Similarity "Marketing Research Coordinator" & "Market Researching Coordinator": 1.000
Similarity "Marketing Research Coordinator" & "Markets Research Coordinator": 0.871
Similarity "Marketing Research Coordinator" & "Market Researches Coordinator": 0.844
Similarity "Marketing Research Coordinator" & "Marketing Research Coordinator": 1.000
Similarity "Marketing Research Coordinator" & "Markets Researchers Coordinator": 0.794
Similarity "Marketing Research Coordinator" & "Market Researcher Coordinators": 0.818
Similarity "Marketing Research Coordinator" & "Marketing Researcher Coordinators": 0.909
Similarity "Marketing Research Coordinator" & "Marketing Researcher Executive": 0.579
Similarity "Marketing Research Coordinator" & "Senior Advertising Analyst": 0.436

Look at the last two examples: the titles that are not really the same job score noticeably lower.
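The 1.000 for "Market Researching Coordinator" is worth a second look. The scores above are consistent with a character-level, bag-of-characters Jaccard index, which ignores order: moving "ing" from one word to another leaves the bag unchanged. A minimal standard-library sketch under that assumption (char_jaccard is an illustrative name, not part of textdistance):

```python
from collections import Counter


def char_jaccard(a: str, b: str) -> float:
    """Multiset Jaccard over characters: |A & B| / |A | B|, order is ignored."""
    ca, cb = Counter(a), Counter(b)
    intersection = sum((ca & cb).values())  # per-character minimum counts
    union = sum((ca | cb).values())         # per-character maximum counts
    return intersection / union if union else 1.0


main_job = "Marketing Research Coordinator"
for job in ["Market Researching Coordinator",
            "Markets Research Coordinator",
            "Market Researches Coordinator"]:
    print(f"{job}: {char_jaccard(main_job, job):.3f}")
```

If word order or word boundaries matter for your data, it may be safer to compare stemmed tokens rather than raw characters before applying the threshold.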

Use the regular expression pattern below and check whether the job title matches.

import re
pattern = r'Market(\w*?) Research(\w*?) Coordinator'
print('Enter job title')
job_title = input()
if re.search(pattern, job_title):
    print('Job title matching with Market Research Coordinator')
else:
    print('Job title not matching with Market Research Coordinator')
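A non-interactive variant of the same idea, sketched under a few assumptions of my own: it uses re.fullmatch so that longer titles containing the phrase are rejected, allows a suffix on "Coordinator" as well, and ignores case. The names TITLE_RE and matches are hypothetical.

```python
import re

# Each word may carry any suffix (Markets, Researching, Coordinators, ...).
TITLE_RE = re.compile(r"Market\w*\s+Research\w*\s+Coordinator\w*", re.IGNORECASE)


def matches(title: str) -> bool:
    """True when the whole title is a variant of 'Market Research Coordinator'."""
    return TITLE_RE.fullmatch(title.strip()) is not None


titles = [
    "Markets Researchers Coordinator",
    "marketing research coordinators",
    "Marketing Researcher Executive",
    "Senior Marketing Research Coordinator",
]
for title in titles:
    print(f"{title!r}: {matches(title)}")
```

The pattern is deliberately loose, so it would also accept something like "Marketplace Research Coordinator", and it has to be written by hand for every title you care about.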
