如何使用多准则和模糊逻辑匹配数据帧中的条目?

2024-05-18 11:41:11 发布

您现在位置:Python中文网/ 问答频道 /正文

谢谢你的帮助。我相信这是一个常见的问题,但我无法找到一个解决方案,以便解决这种局部形式的问题。我是一个新的程序员,非常感谢任何帮助

我有两组关于医疗保健公司的数据。df1中的数据杂乱无章并且包含空值,而df2中的数据要完整得多

我需要匹配df1df2中的公司,确定是否存在匹配,如果不是直接匹配,则确定匹配的接近程度。这两个集合都有成千上万的公司,每天都在变化/更新,所以我正在尝试构建一个可以扩展的东西

以下是我迄今为止尝试过的一个可复制的程序:

import pandas as pd
from fuzzywuzzy import process

data1 = [['1001', 'Lutheran Family Hospital', 'Omaha', 'NE'],
         ['1020', 'Lutheran Family Hospital', 'Dallas', 'TX'],
         ['1021', 'Lutheran Regional Family Hospital', 'Des Plaines', 'IL'],
         ['1002', 'Independent Health', 'Fairbanks', 'AK'],
         ['1003', 'Lucky You Community Clinic', '', ''],
         ['1004', 'Belmont General Hospital', 'Belmont', 'CA'],
         ['1005', 'Louisiana Chiro', 'Lafayette', 'LA'],
         ['1006', 'Steven, Even', 'Chicago', 'IL'],
         ['1007', 'Kind Kare 4 Kids', 'New Mexico', 'New Mexico'],
         ['1008', 'Independence Mem', '', ''],
         ['1009', 'Gerald Griffin Health', 'Missoula', 'Montana'],
         ['1010', 'INTERNAL MED', 'CHARLESTON', 'SC'],
         ['1011', 'Belmont Hospital', '', ''],
         ['1012', 'Belmont Gnrl', 'Belmont', 'CA'],
         ['1013', 'St Mary Rehab', '', ''],
         ['1014', 'Saint Mary Med Center', 'Los Angeles', 'California'],
         ['1025', "St. Mary's Of Lourdes Regional Medical Center", 'Lincoln', 'NE'],
         ['1015', 'Bryan Bennington, MD', 'Huntsville', 'AL']]

data2 = [['1', 'Lutheran General Hospital', 'Fort Wayne', 'IN'],
         ['2', 'Lutheran Family Hospital', 'Omaha', 'NE'],
         ['3', 'Independence Memorial Health', 'Fairbanks', 'AK'],
         ['4', 'Lucky-You Community Clinic', 'New York', 'NY'],
         ['5', 'Belmont General Hospital', 'Belmont', 'CA'],
         ['6', 'Lafayette Joints R Us (DBA Louisiana Best Chiropractic)', 'Lafayette', 'LA'],
         ['7', 'Even Steven, MD', 'Chicago', 'IL'],
         ['8', 'Kind Kare 4 Kids, LLC Inc (FKA The Kindest Care)', 'Albequerque', 'NM'],
         ['9', 'The Best Doctor Group', 'Philadelphia', 'PA'],
         ['10', 'Internal Medical Group, PLLC', 'Charleston', 'SC'],
         ['11', "Saint Mary's Holy Name Rehabilitation", 'Lexington', 'KY'],
         ['12', 'St. Mary Regional Medical Center', 'Los Angeles', 'CA'],
         ['13', 'Advanced Outpatient Surgical Center', 'Seattle', 'WA']]

df1 = pd.DataFrame(data1, columns=['ID', 'Org_Name', 'City', 'State'])
df2 = pd.DataFrame(data2, columns=['ID', 'Org_Name', 'City', 'State'])

i = 0
scorethreshold = 80
df1["fuzzy"] = 0
for x in df1.Org_Name:
    noun,score,record = process.extractOne(x,df2.Org_Name)
    if score > scorethreshold:     
        df1.loc[i,'fuzzy'] = noun
    else:
        df1.loc[i,'fuzzy'] = None
    i = i + 1

以上结果如下:

+----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+
|    |  ID  |                   Org_Name                    |    City     |   State    |                          fuzzy                          |
+----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+
|  0 | 1001 | Lutheran Family Hospital                      | Omaha       | NE         | Lutheran Family Hospital                                |
|  1 | 1020 | Lutheran Family Hospital                      | Dallas      | TX         | Lutheran Family Hospital                                |
|  2 | 1021 | Lutheran Regional Family Hospital             | Des Plaines | IL         | Lutheran Family Hospital                                |
|  3 | 1002 | Independent Health                            | Fairbanks   | AK         | Independence Memorial Health                            |
|  4 | 1003 | Lucky You Community Clinic                    |             |            | Lucky-You Community Clinic                              |
|  5 | 1004 | Belmont General Hospital                      | Belmont     | CA         | Belmont General Hospital                                |
|  6 | 1005 | Louisiana Chiro                               | Lafayette   | LA         | Lafayette Joints R Us (DBA Louisiana Best Chiropractic) |
|  7 | 1006 | Steven, Even                                  | Chicago     | IL         | Even Steven, MD                                         |
|  8 | 1007 | Kind Kare 4 Kids                              | New Mexico  | New Mexico | Kind Kare 4 Kids, LLC Inc (FKA The Kindest Care)        |
|  9 | 1008 | Independence Mem                              |             |            | Independence Memorial Health                            |
| 10 | 1009 | Gerald Griffin Health                         | Missoula    | Montana    |                                                         |
| 11 | 1010 | INTERNAL MED                                  | CHARLESTON  | SC         | Internal Medical Group, PLLC                            |
| 12 | 1011 | Belmont Hospital                              |             |            | Lutheran General Hospital                               |
| 13 | 1012 | Belmont Gnrl                                  | Belmont     | CA         | Belmont General Hospital                                |
| 14 | 1013 | St Mary Rehab                                 |             |            | Saint Mary's Holy Name Rehabilitation                   |
| 15 | 1014 | Saint Mary Med Center                         | Los Angeles | California | Saint Mary's Holy Name Rehabilitation                   |
| 16 | 1025 | St. Mary's Of Lourdes Regional Medical Center | Lincoln     | NE         | St. Mary Regional Medical Center                        |
| 17 | 1015 | Bryan Bennington, MD                          | Huntsville  | AL         |                                                         |
+----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+

然而,我正在尝试创建一种方法,通过它,我不仅可以确定公司名称是否匹配,还可以确定城市和州是否匹配,以及所有这些匹配的紧密程度。我试图创建一个更像这样的输出,其中Fuzzy_ID表示匹配项的索引位置,Matched?表示布尔判断:

+---+------+-----------------------------------+-------------+-------+----------+------------+----------+
|   |  ID  |             Org_Name              |    City     | State | Fuzzy_ID |   Score    | Matched? |
+---+------+-----------------------------------+-------------+-------+----------+------------+----------+
| 0 | 1001 | Lutheran Family Hospital          | Omaha       | NE    |        2 | 100        | YES      |
| 1 | 1020 | Lutheran Family Hospital          | Dallas      | TX    |        2 | some_score | NO       |
| 2 | 1021 | Lutheran Regional Family Hospital | Des Plaines | IL    |        2 | some_score | NO       |
| 3 | 1002 | Independent Health                | Fairbanks   | AK    |        3 | some_score | YES      |
| 4 | 1003 | Lucky You Community Clinic        |             |       |        4 | some_score | YES      |
+---+------+-----------------------------------+-------------+-------+----------+------------+----------+

如何做到这一点?有哪些方法更适合于需要完成的任务?非常感谢您的帮助


Tags: namefamilyilcageneralcenterdf1st
1条回答
网友
1楼 · 发布于 2024-05-18 11:41:11

这项任务相当困难,涉及许多步骤,但至少 我试图阐述一些一般原则

从整理状态列开始。 如果某个地方有一个州的全名,用州代码替换它

也许你还应该花点时间来澄清df1中的“无状态”案例, 作为清理数据的另一个步骤

然后,对于df1中的每个,尝试在df2中找到最佳匹配行。 为此,请使用以下步骤:

  1. 使用process.extract,在df2中找到一个最佳匹配池,按名称, 对于当前,假设限制得分截止的某些值。 如果包含状态,则只签入df2此状态的行。 将找到的每个匹配项的匹配比率另存为name\u ratio

  2. 对于上述池中的每个项目,计算城市列中的WRatio, 另存为城市比率

  3. 使用一些聚合公式,为每个匹配计算总\u比率姓名比率城市比率。 我也不知道这个公式应该是什么

  4. 以最大的总比值进行匹配,但是如果这个(最佳)比值 低于某个总比值\u截止值,假设当前不匹配

当然,你还是要去试验一些特殊的价值观 参数并查看其值的更改如何影响最终结果

相关问题 更多 >