如何使用多准则和模糊逻辑匹配数据帧中的条目？

import pandas as pd from fuzzywuzzy import process data1 = [['1001', 'Lutheran Family Hospital', 'Omaha', 'NE'], ['1020', 'Lutheran Family Hospital', 'Dallas', 'TX'], ['1021', 'Lutheran Regional Family Hospital', 'Des Plaines', 'IL'], ['1002', 'Independent Health', 'Fairbanks', 'AK'], ['1003', 'Lucky You Community Clinic', '', ''], ['1004', 'Belmont General Hospital', 'Belmont', 'CA'], ['1005', 'Louisiana Chiro', 'Lafayette', 'LA'], ['1006', 'Steven, Even', 'Chicago', 'IL'], ['1007', 'Kind Kare 4 Kids', 'New Mexico', 'New Mexico'], ['1008', 'Independence Mem', '', ''], ['1009', 'Gerald Griffin Health', 'Missoula', 'Montana'], ['1010', 'INTERNAL MED', 'CHARLESTON', 'SC'], ['1011', 'Belmont Hospital', '', ''], ['1012', 'Belmont Gnrl', 'Belmont', 'CA'], ['1013', 'St Mary Rehab', '', ''], ['1014', 'Saint Mary Med Center', 'Los Angeles', 'California'], ['1025', "St. Mary's Of Lourdes Regional Medical Center", 'Lincoln', 'NE'], ['1015', 'Bryan Bennington, MD', 'Huntsville', 'AL']] data2 = [['1', 'Lutheran General Hospital', 'Fort Wayne', 'IN'], ['2', 'Lutheran Family Hospital', 'Omaha', 'NE'], ['3', 'Independence Memorial Health', 'Fairbanks', 'AK'], ['4', 'Lucky-You Community Clinic', 'New York', 'NY'], ['5', 'Belmont General Hospital', 'Belmont', 'CA'], ['6', 'Lafayette Joints R Us (DBA Louisiana Best Chiropractic)', 'Lafayette', 'LA'], ['7', 'Even Steven, MD', 'Chicago', 'IL'], ['8', 'Kind Kare 4 Kids, LLC Inc (FKA The Kindest Care)', 'Albequerque', 'NM'], ['9', 'The Best Doctor Group', 'Philadelphia', 'PA'], ['10', 'Internal Medical Group, PLLC', 'Charleston', 'SC'], ['11', "Saint Mary's Holy Name Rehabilitation", 'Lexington', 'KY'], ['12', 'St. Mary Regional Medical Center', 'Los Angeles', 'CA'], ['13', 'Advanced Outpatient Surgical Center', 'Seattle', 'WA']] df1 = pd.DataFrame(data1, columns=['ID', 'Org_Name', 'City', 'State']) df2 = pd.DataFrame(data2, columns=['ID', 'Org_Name', 'City', 'State']) i = 0 scorethreshold = 80 df1["fuzzy"] = 0 for x in df1.Org_Name: noun,score,record = process.extractOne(x,df2.Org_Name) if score > scorethreshold: df1.loc[i,'fuzzy'] = noun else: df1.loc[i,'fuzzy'] = None i = i + 1

+----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+ | | ID | Org_Name | City | State | fuzzy | +----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+ | 0 | 1001 | Lutheran Family Hospital | Omaha | NE | Lutheran Family Hospital | | 1 | 1020 | Lutheran Family Hospital | Dallas | TX | Lutheran Family Hospital | | 2 | 1021 | Lutheran Regional Family Hospital | Des Plaines | IL | Lutheran Family Hospital | | 3 | 1002 | Independent Health | Fairbanks | AK | Independence Memorial Health | | 4 | 1003 | Lucky You Community Clinic | | | Lucky-You Community Clinic | | 5 | 1004 | Belmont General Hospital | Belmont | CA | Belmont General Hospital | | 6 | 1005 | Louisiana Chiro | Lafayette | LA | Lafayette Joints R Us (DBA Louisiana Best Chiropractic) | | 7 | 1006 | Steven, Even | Chicago | IL | Even Steven, MD | | 8 | 1007 | Kind Kare 4 Kids | New Mexico | New Mexico | Kind Kare 4 Kids, LLC Inc (FKA The Kindest Care) | | 9 | 1008 | Independence Mem | | | Independence Memorial Health | | 10 | 1009 | Gerald Griffin Health | Missoula | Montana | | | 11 | 1010 | INTERNAL MED | CHARLESTON | SC | Internal Medical Group, PLLC | | 12 | 1011 | Belmont Hospital | | | Lutheran General Hospital | | 13 | 1012 | Belmont Gnrl | Belmont | CA | Belmont General Hospital | | 14 | 1013 | St Mary Rehab | | | Saint Mary's Holy Name Rehabilitation | | 15 | 1014 | Saint Mary Med Center | Los Angeles | California | Saint Mary's Holy Name Rehabilitation | | 16 | 1025 | St. Mary's Of Lourdes Regional Medical Center | Lincoln | NE | St. Mary Regional Medical Center | | 17 | 1015 | Bryan Bennington, MD | Huntsville | AL | | +----+------+-----------------------------------------------+-------------+------------+---------------------------------------------------------+

+---+------+-----------------------------------+-------------+-------+----------+------------+----------+ | | ID | Org_Name | City | State | Fuzzy_ID | Score | Matched? | +---+------+-----------------------------------+-------------+-------+----------+------------+----------+ | 0 | 1001 | Lutheran Family Hospital | Omaha | NE | 2 | 100 | YES | | 1 | 1020 | Lutheran Family Hospital | Dallas | TX | 2 | some_score | NO | | 2 | 1021 | Lutheran Regional Family Hospital | Des Plaines | IL | 2 | some_score | NO | | 3 | 1002 | Independent Health | Fairbanks | AK | 3 | some_score | YES | | 4 | 1003 | Lucky You Community Clinic | | | 4 | some_score | YES | +---+------+-----------------------------------+-------------+-------+----------+------------+----------+

1条回答

网友

1楼 · 发布于 2024-05-18 11:41:11

这项任务相当困难，涉及许多步骤，但至少我试图阐述一些一般原则

从整理状态列开始。如果某个地方有一个州的全名，用州代码替换它

也许你还应该花点时间来澄清df1中的“无状态”案例，作为清理数据的另一个步骤

然后，对于df1中的每个行，尝试在df2中找到最佳匹配行。为此，请使用以下步骤：

使用process.extract，在df2中找到一个最佳匹配池，按名称，对于当前行，假设限制和得分截止的某些值。如果行包含状态，则只签入df2此状态的行。将找到的每个匹配项的匹配比率另存为name\u ratio
对于上述池中的每个项目，计算城市列中的WRatio，另存为城市比率
使用一些聚合公式，为每个匹配计算总\u比率 从姓名比率和城市比率。我也不知道这个公式应该是什么
以最大的总比值进行匹配，但是如果这个（最佳）比值低于某个总比值\u截止值，假设当前行有不匹配

当然，你还是要去试验一些特殊的价值观参数并查看其值的更改如何影响最终结果

相关问题更多 >

编程相关推荐

热门问题

热门文章