谢谢你的帮助。我相信这是一个常见的问题,但我无法找到一个解决方案,以便解决这种局部形式的问题。我是一个新的程序员,非常感谢任何帮助
我有两组关于医疗保健公司的数据。df1
中的数据杂乱无章并且包含空值,而df2
中的数据要完整得多
我需要匹配df1
和df2
中的公司,确定是否存在匹配,如果不是直接匹配,则确定匹配的接近程度。这两个集合都有成千上万的公司,每天都在变化/更新,所以我正在尝试构建一个可以扩展的东西
以下是我迄今为止尝试过的一个可复制的程序:
import pandas as pd
from fuzzywuzzy import process
data1 = [['1001', 'Lutheran Family Hospital', 'Omaha', 'NE'],
['1020', 'Lutheran Family Hospital', 'Dallas', 'TX'],
['1021', 'Lutheran Regional Family Hospital', 'Des Plaines', 'IL'],
['1002', 'Independent Health', 'Fairbanks', 'AK'],
['1003', 'Lucky You Community Clinic', '', ''],
['1004', 'Belmont General Hospital', 'Belmont', 'CA'],
['1005', 'Louisiana Chiro', 'Lafayette', 'LA'],
['1006', 'Steven, Even', 'Chicago', 'IL'],
['1007', 'Kind Kare 4 Kids', 'New Mexico', 'New Mexico'],
['1008', 'Independence Mem', '', ''],
['1009', 'Gerald Griffin Health', 'Missoula', 'Montana'],
['1010', 'INTERNAL MED', 'CHARLESTON', 'SC'],
['1011', 'Belmont Hospital', '', ''],
['1012', 'Belmont Gnrl', 'Belmont', 'CA'],
['1013', 'St Mary Rehab', '', ''],
['1014', 'Saint Mary Med Center', 'Los Angeles', 'California'],
['1025', "St. Mary's Of Lourdes Regional Medical Center", 'Lincoln', 'NE'],
['1015', 'Bryan Bennington, MD', 'Huntsville', 'AL']]
data2 = [['1', 'Lutheran General Hospital', 'Fort Wayne', 'IN'],
['2', 'Lutheran Family Hospital', 'Omaha', 'NE'],
['3', 'Independence Memorial Health', 'Fairbanks', 'AK'],
['4', 'Lucky-You Community Clinic', 'New York', 'NY'],
['5', 'Belmont General Hospital', 'Belmont', 'CA'],
['6', 'Lafayette Joints R Us (DBA Louisiana Best Chiropractic)', 'Lafayette', 'LA'],
['7', 'Even Steven, MD', 'Chicago', 'IL'],
['8', 'Kind Kare 4 Kids, LLC Inc (FKA The Kindest Care)', 'Albequerque', 'NM'],
['9', 'The Best Doctor Group', 'Philadelphia', 'PA'],
['10', 'Internal Medical Group, PLLC', 'Charleston', 'SC'],
['11', "Saint Mary's Holy Name Rehabilitation", 'Lexington', 'KY'],
['12', 'St. Mary Regional Medical Center', 'Los Angeles', 'CA'],
['13', 'Advanced Outpatient Surgical Center', 'Seattle', 'WA']]
df1 = pd.DataFrame(data1, columns=['ID', 'Org_Name', 'City', 'State'])
df2 = pd.DataFrame(data2, columns=['ID', 'Org_Name', 'City', 'State'])
i = 0
scorethreshold = 80
df1["fuzzy"] = 0
for x in df1.Org_Name:
noun,score,record = process.extractOne(x,df2.Org_Name)
if score > scorethreshold:
df1.loc[i,'fuzzy'] = noun
else:
df1.loc[i,'fuzzy'] = None
i = i + 1
以上结果如下:
+----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+
| | ID | Org_Name | City | State | fuzzy |
+----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+
| 0 | 1001 | Lutheran Family Hospital | Omaha | NE | Lutheran Family Hospital |
| 1 | 1020 | Lutheran Family Hospital | Dallas | TX | Lutheran Family Hospital |
| 2 | 1021 | Lutheran Regional Family Hospital | Des Plaines | IL | Lutheran Family Hospital |
| 3 | 1002 | Independent Health | Fairbanks | AK | Independence Memorial Health |
| 4 | 1003 | Lucky You Community Clinic | | | Lucky-You Community Clinic |
| 5 | 1004 | Belmont General Hospital | Belmont | CA | Belmont General Hospital |
| 6 | 1005 | Louisiana Chiro | Lafayette | LA | Lafayette Joints R Us (DBA Louisiana Best Chiropractic) |
| 7 | 1006 | Steven, Even | Chicago | IL | Even Steven, MD |
| 8 | 1007 | Kind Kare 4 Kids | New Mexico | New Mexico | Kind Kare 4 Kids, LLC Inc (FKA The Kindest Care) |
| 9 | 1008 | Independence Mem | | | Independence Memorial Health |
| 10 | 1009 | Gerald Griffin Health | Missoula | Montana | |
| 11 | 1010 | INTERNAL MED | CHARLESTON | SC | Internal Medical Group, PLLC |
| 12 | 1011 | Belmont Hospital | | | Lutheran General Hospital |
| 13 | 1012 | Belmont Gnrl | Belmont | CA | Belmont General Hospital |
| 14 | 1013 | St Mary Rehab | | | Saint Mary's Holy Name Rehabilitation |
| 15 | 1014 | Saint Mary Med Center | Los Angeles | California | Saint Mary's Holy Name Rehabilitation |
| 16 | 1025 | St. Mary's Of Lourdes Regional Medical Center | Lincoln | NE | St. Mary Regional Medical Center |
| 17 | 1015 | Bryan Bennington, MD | Huntsville | AL | |
+----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+
然而,我正在尝试创建一种方法,通过它,我不仅可以确定公司名称是否匹配,还可以确定城市和州是否匹配,以及所有这些匹配的紧密程度。我试图创建一个更像这样的输出,其中Fuzzy_ID
表示匹配项的索引位置,Matched?
表示布尔判断:
+---+------+-----------------------------------+-------------+-------+----------+------------+----------+
| | ID | Org_Name | City | State | Fuzzy_ID | Score | Matched? |
+---+------+-----------------------------------+-------------+-------+----------+------------+----------+
| 0 | 1001 | Lutheran Family Hospital | Omaha | NE | 2 | 100 | YES |
| 1 | 1020 | Lutheran Family Hospital | Dallas | TX | 2 | some_score | NO |
| 2 | 1021 | Lutheran Regional Family Hospital | Des Plaines | IL | 2 | some_score | NO |
| 3 | 1002 | Independent Health | Fairbanks | AK | 3 | some_score | YES |
| 4 | 1003 | Lucky You Community Clinic | | | 4 | some_score | YES |
+---+------+-----------------------------------+-------------+-------+----------+------------+----------+
如何做到这一点?有哪些方法更适合于需要完成的任务?非常感谢您的帮助
这项任务相当困难,涉及许多步骤,但至少 我试图阐述一些一般原则
从整理状态列开始。 如果某个地方有一个州的全名,用州代码替换它
也许你还应该花点时间来澄清df1中的“无状态”案例, 作为清理数据的另一个步骤
然后,对于df1中的每个行,尝试在df2中找到最佳匹配行。 为此,请使用以下步骤:
使用process.extract,在df2中找到一个最佳匹配池,按名称, 对于当前行,假设限制和得分截止的某些值。 如果行包含状态,则只签入df2此状态的行。 将找到的每个匹配项的匹配比率另存为name\u ratio
对于上述池中的每个项目,计算城市列中的WRatio, 另存为城市比率
使用一些聚合公式,为每个匹配计算总\u比率 从姓名比率和城市比率。 我也不知道这个公式应该是什么
以最大的总比值进行匹配,但是如果这个(最佳)比值 低于某个总比值\u截止值,假设当前行有不匹配
当然,你还是要去试验一些特殊的价值观 参数并查看其值的更改如何影响最终结果
相关问题 更多 >
编程相关推荐