Error using str.contains inside a custom function in pandas

3 answers

User

#1 · edited 2024-06-01 06:22:16

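The x that your specialty function receives is the string itself. There is no x.str because x is already a plain str, so you can check for a substring with the "in" operator instead, as in the code below. I modified some of the data so the results are visible. Tip: you should use a dictionary or a list rather than an elif chain.

Code: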
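To make the point about x.str concrete before the full code, here is a minimal sketch of what apply passes to the function. The two-row Series is made up for illustration and is not the asker's data.

```python
import pandas as pd

# Hypothetical two-row sample, not the asker's data.
s = pd.Series(["Urology clinic", "Nurse"])

# Series.apply hands each element to the function one at a time,
# so the function sees plain Python strings.
seen_types = s.apply(lambda x: type(x).__name__).tolist()

# The .str accessor lives on the Series, not on the individual strings.
scalar_has_str = hasattr(s.iloc[0], "str")
series_has_str = hasattr(s, "str")

# Element-wise check with `in` versus the vectorized .str.contains call.
elementwise = s.apply(lambda x: "Urolog" in x).tolist()
vectorized = s.str.contains("Urolog", regex=False).tolist()

print(seen_types, scalar_has_str, series_has_str)
print(elementwise, vectorized)
```

Inside apply you work with one string at a time, so "in" is the right tool; outside apply, the whole column can be tested at once with .str.contains.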

import pandas as pd
import numpy as np

def specialty(x):
    # x is a plain Python str here, so test each keyword with `in`.
    # The keyword goes on the left: 'Urolog' in 'Urology' is True.
    if 'Urolog' in x:
        return 'Urology'
    elif 'Nurse' in x:
        return 'Nurse Practioner'
    elif 'Oncology' in x:
        return 'Oncology'
    elif 'Physician' in x:
        return 'Physician Assistant'
    elif 'Family Medicine' in x:
        return 'Family Medicine'
    elif 'Anesthes' in x:
        return 'Anesthesiology'
    else:
        return 'Other'

df = pd.DataFrame(data={'person_id': {39063: 33081476009, 50538: 33033519093, 56075: 33170508793, 36593: 33061707789, 51656: 33047685345, 95512: 33022026049, 40286: 33038034707, 3887: 33076466195, 40161: 33052807819, 52905: 33190526939, 35418: 33008425164, 35934: 33015737122, 3389: 33055125864, 136: 33139641318, 105460: 33113871389, 52568: 33075745388, 24725: 33052090907, 34838: 33205449839, 31908: 33183672635, 36115: 33006692696}, 
'final_desc': {39063: 'None', 50538: 'Urolog', 56075: 'Anesthes', 36593: 'None', 51656: 'Urology', 95512: 'None', 40286: 'Anesthes', 3887: 'Specialist', 40161: 'None', 52905: 'Anesthesiology', 35418: 'Urology', 35934: 'None', 3389: 'Ophthalmology', 136: 'Rheumatology', 105460: 'None', 52568: 'Urology', 24725: 'Family Medicine', 34838: 'None', 31908: 'Nurse', 36115: 'None'}})

df['desc_clean'] = df['final_desc'].apply(specialty)
print(df)

Output:

          person_id       final_desc        desc_clean
39063   33081476009             None             Other
50538   33033519093           Urolog           Urology
56075   33170508793         Anesthes    Anesthesiology
36593   33061707789             None             Other
51656   33047685345          Urology           Urology
95512   33022026049             None             Other
40286   33038034707         Anesthes    Anesthesiology
3887    33076466195       Specialist             Other
40161   33052807819             None             Other
52905   33190526939   Anesthesiology    Anesthesiology
35418   33008425164          Urology           Urology
35934   33015737122             None             Other
3389    33055125864    Ophthalmology             Other
136     33139641318     Rheumatology             Other
105460  33113871389             None             Other
52568   33075745388          Urology           Urology
24725   33052090907  Family Medicine   Family Medicine
34838   33205449839             None             Other
31908   33183672635            Nurse  Nurse Practioner
36115   33006692696             None             Other
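
The tip about replacing the elif chain with a dictionary could look roughly like the sketch below. The names KEYWORDS and specialty_from_table are made up for illustration, and the short sample Series stands in for df['final_desc'].

```python
import pandas as pd

# Hypothetical lookup table: keyword -> cleaned label.
# Insertion order decides which keyword wins if several match.
KEYWORDS = {
    'Urolog': 'Urology',
    'Nurse': 'Nurse Practioner',
    'Oncology': 'Oncology',
    'Physician': 'Physician Assistant',
    'Family Medicine': 'Family Medicine',
    'Anesthes': 'Anesthesiology',
}

def specialty_from_table(x, keywords=KEYWORDS, default='Other'):
    """Return the label of the first keyword found in x, else default."""
    for keyword, label in keywords.items():
        if keyword in x:
            return label
    return default

# Small stand-in for df['final_desc'].
sample = pd.Series(['Urology', 'Anesthes', 'Ophthalmology', 'Nurse', 'None'])
cleaned = sample.apply(specialty_from_table)
print(cleaned.tolist())
```

Adding a new specialty then means adding one dictionary entry instead of another elif branch.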

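User

#2 · edited 2024-06-01 06:22:16

To do this, we can define a mapping from each search key to its replacement, then loop over the mapping and set the matching values in the column, keeping track of which rows have been changed. At the end, every row that never matched is set to 'Other'.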

mapping = {'Urolog': 'Urology',
 'Nurse': 'Nurse Practioner',
 'Oncology': 'Oncology',
 'Physician': 'Physician Assistant',
 'Family Medicine': 'Family Medicine',
 'Anesthes': 'Anesthesiology'}

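def specialty(column):
    column = column.copy()  # leave the caller's Series untouched
    # Tracks which rows have matched at least one key.
    matches = pd.Series(False, index=column.index)
    for k, v in mapping.items():
        # Plain substring test; na=False treats missing values as non-matches.
        match = column.str.contains(k, regex=False, na=False)
        column[match] = v
        matches[match] = True
    # Rows that never matched any key fall back to 'Other'.
    column[~matches] = 'Other'
    return column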


specialty(df['final_desc'])

39063                Other
50538              Urology
56075       Anesthesiology
36593                Other
51656              Urology
95512                Other
40286       Anesthesiology
3887                 Other
40161                Other
52905       Anesthesiology
35418              Urology
35934                Other
3389                 Other
136                  Other
105460               Other
52568              Urology
24725      Family Medicine
34838                Other
31908     Nurse Practioner
36115                Other
Name: final_desc, dtype: object
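
As a possible variation on the same mapping idea, the loop can be replaced by a single regular-expression pass. This is only a sketch: specialty_vectorized is a hypothetical name, the sample Series stands in for df['final_desc'], and it swaps the loop for Series.str.extract followed by Series.map.

```python
import re
import pandas as pd

mapping = {'Urolog': 'Urology',
           'Nurse': 'Nurse Practioner',
           'Oncology': 'Oncology',
           'Physician': 'Physician Assistant',
           'Family Medicine': 'Family Medicine',
           'Anesthes': 'Anesthesiology'}

def specialty_vectorized(column, mapping=mapping, default='Other'):
    # Build one alternation pattern from the keys; re.escape guards
    # against keys that contain regex metacharacters.
    pattern = '(' + '|'.join(re.escape(k) for k in mapping) + ')'
    # With expand=False and one capture group, extract returns a Series
    # holding the matched key, or NaN where nothing matched.
    found = column.str.extract(pattern, expand=False)
    # Translate matched keys to labels and fill the non-matches.
    return found.map(mapping).fillna(default)

# Small stand-in for df['final_desc'].
sample = pd.Series(['Urology', 'Anesthesiology', 'Specialist', 'None'],
                   name='final_desc')
result = specialty_vectorized(sample)
print(result.tolist())
```

One difference from the loop: when a value contains several keys, the regex keeps whichever key appears first in the string, whereas the loop keeps the last key in the mapping that matches.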

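User

#3 · edited 2024-06-01 06:22:16

You can use a library such as fuzzywuzzy for fuzzy string matching. The advantage of this approach is that it is more flexible than a fixed set of rules, as shown below.

This solution scores the text against each candidate category using a partial (substring) match and returns the best-scoring category. If the best score is below the threshold, it returns the default value ("None"):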

from fuzzywuzzy import fuzz

CATEGORIES = [
 'Urology',
 'Nurse Practioner',
 'Oncology',
 'Physician Assistant',
 'Family Medicine',
 'Anesthesiology',
 'Specialist',
]    


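def best_match(
    text,
    categories=CATEGORIES,
    default="None",
    threshold=65
):
    # Score every category against the text (0-100). Scores are the dict keys,
    # so categories that tie on score overwrite one another.
    matches = {fuzz.partial_ratio(cat, text): cat for cat in categories}
    best_score = max(matches)
    # Fall back to the default when even the best candidate scores too low.
    if best_score >= threshold:
        return matches[best_score]
    else:
        return default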


df["final_desc"] = df.desc.apply(best_match)

Result:

         person_id           final_desc                     desc
52568  33075745388              Urology                urologist
36593  33061707789     Nurse Practioner         nruse practition
136    33139641318           Specialist      oncology specialist
50538  33033519093  Physician Assistant    physicians assistant
3389   33055125864      Family Medicine            fam. medicine
51656  33047685345       Anesthesiology           anesthesiology
35418  33008425164       Anesthesiology         anesthesiologist
52905  33190526939     Nurse Practioner      Nurses practitioner
36115  33006692696           Specialist  Occupational specialist
31908  33183672635             Oncology               Oncologist
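
If installing fuzzywuzzy is not an option, a similar effect can be approximated with the standard library. The sketch below uses difflib.get_close_matches instead of fuzz.partial_ratio, so it compares whole strings rather than substrings and its scores are not the same as in the table above. The name closest_category and the cutoff of 0.6 are assumptions for illustration.

```python
import difflib

CATEGORIES = [
    'Urology',
    'Nurse Practioner',
    'Oncology',
    'Physician Assistant',
    'Family Medicine',
    'Anesthesiology',
    'Specialist',
]

def closest_category(text, categories=CATEGORIES, default="None", cutoff=0.6):
    # difflib is case-sensitive, so compare lowercased strings and map
    # the winner back to its original spelling.
    lowered = {cat.lower(): cat for cat in categories}
    # n=1 keeps only the best candidate; cutoff is a 0..1 similarity ratio.
    hits = difflib.get_close_matches(text.lower(), list(lowered),
                                     n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else default

print(closest_category("urologist"))
print(closest_category("anesthesiologist"))
print(closest_category("zzz"))
```

Because the comparison covers the full string, long descriptions that merely contain a category name will score lower here than with partial_ratio, so the cutoff may need tuning on real data.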
