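<p>You can use the Python package <code>textdistance</code> to compute a normalized similarity between strings, and keep only the strings whose similarity is above a chosen threshold.</p>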
<pre><code>import textdistance
main_job = 'Marketing Research Coordinator'
other_jobs = ['Market Researching Coordinator', 'Markets Research Coordinator',
'Market Researches Coordinator', 'Marketing Research Coordinator',
'Markets Researchers Coordinator', 'Market Researcher Coordinators',
'Marketing Researcher Coordinators', 'Marketing Researcher Executive',
'Senior Advertising Analyst']
for job in other_jobs:
    # Normalized similarity in [0, 1]; higher means more similar
    similarity = textdistance.jaccard.normalized_similarity(main_job, job)
    print(f'Similarity "{main_job}" & "{job}": {similarity:.3f}')
</code></pre>
<pre><code>Similarity "Marketing Research Coordinator" & "Market Researching Coordinator": 1.000
Similarity "Marketing Research Coordinator" & "Markets Research Coordinator": 0.871
Similarity "Marketing Research Coordinator" & "Market Researches Coordinator": 0.844
Similarity "Marketing Research Coordinator" & "Marketing Research Coordinator": 1.000
Similarity "Marketing Research Coordinator" & "Markets Researchers Coordinator": 0.794
Similarity "Marketing Research Coordinator" & "Market Researcher Coordinators": 0.818
Similarity "Marketing Research Coordinator" & "Marketing Researcher Coordinators": 0.909
Similarity "Marketing Research Coordinator" & "Marketing Researcher Executive": 0.579
Similarity "Marketing Research Coordinator" & "Senior Advertising Analyst": 0.436
</code></pre>
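<p>Look at the last two examples: they score well below the others (0.579 and 0.436), so a similarity threshold would filter them out.</p>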
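<p>The thresholding step itself is not shown above, so here is a minimal sketch of it. To keep it runnable without installing anything, it uses a hypothetical helper <code>jaccard_similarity</code> built on <code>collections.Counter</code>, which treats each string as a multiset of characters; this appears to match the scores printed above (for example 0.871 and 0.579), but it is a stand-in, not the library's own code. The cutoff <code>THRESHOLD = 0.75</code> and the shortened job list are assumptions chosen for illustration.</p>

```python
from collections import Counter


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over character multisets, in the range [0, 1].

    Hypothetical stand-in for textdistance.jaccard.normalized_similarity.
    """
    counts_a, counts_b = Counter(a), Counter(b)
    intersection = sum((counts_a & counts_b).values())  # per-character minimum
    union = sum((counts_a | counts_b).values())         # per-character maximum
    return intersection / union if union else 1.0


THRESHOLD = 0.75  # assumed cutoff; tune it for your own data

main_job = 'Marketing Research Coordinator'
other_jobs = ['Markets Research Coordinator',
              'Marketing Research Coordinator',
              'Marketing Researcher Executive',
              'Senior Advertising Analyst']

# Keep only the titles that are similar enough to main_job
similar_jobs = [job for job in other_jobs
                if jaccard_similarity(main_job, job) >= THRESHOLD]
print(similar_jobs)
```

<p>If <code>textdistance</code> is installed, replacing the helper call with <code>textdistance.jaccard.normalized_similarity(main_job, job)</code> should give the same filtering. Note that a character-based score ignores order, which is why "Market Researching Coordinator" scores 1.000 above; if that matters, an edit-distance based measure may be a better fit.</p>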