How to group the categories in a variable using numpy and a dictionary

Posted 2024-06-26 14:42:49


I want to use np.where together with a dictionary.

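At the moment I am doing this with np.where alone, and the code grows a lot when there are many categories. I would like to define the mapping in a dictionary and then use that mapping with np.where.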

Sample DataFrame:

dataF = pd.DataFrame({'TITLE':['CEO','CHIEF EXECUTIVE','EXECUTIVE OFFICER','FOUNDER',
                 'CHIEF OP','TECH OFFICER','CHIEF TECH','VICE PRES','PRESIDENT','PRESIDANTE','OWNER','CO OWNER',
                 'DIRECTOR','MANAGER',np.nan]})
dataF
    TITLE
0   CEO
1   CHIEF EXECUTIVE
2   EXECUTIVE OFFICER
3   FOUNDER
4   CHIEF OP
5   TECH OFFICER
6   CHIEF TECH
7   VICE PRES
8   PRESIDENT
9   PRESIDANTE
10  OWNER
11  CO OWNER
12  DIRECTOR
13  MANAGER
14  NaN

NumPy operation

dataF['TITLE_GRP'] = np.where(dataF['TITLE'].isna(),'NOTAVAILABLE',
                     np.where(dataF['TITLE'].str.contains('CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN'),'CEO_FOUNDER',
                     np.where(dataF['TITLE'].str.contains('CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$'),'OTHER_OFFICERS',
                     np.where(dataF['TITLE'].str.contains('VICE|VP'),'VP',
                     np.where(dataF['TITLE'].str.contains('PRESIDENT|PRES'),'PRESIDENT',
                     np.where(dataF['TITLE'].str.contains('OWNER'),'OWNER_CO_OWN',
                     np.where(dataF['TITLE'].str.contains('MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'),'DIR_MGR_HEAD'
                     ,dataF['TITLE'])))))))

Transformed data

                TITLE       TITLE_GRP
0                 CEO     CEO_FOUNDER
1     CHIEF EXECUTIVE     CEO_FOUNDER
2   EXECUTIVE OFFICER     CEO_FOUNDER
3             FOUNDER     CEO_FOUNDER
4            CHIEF OP  OTHER_OFFICERS
5        TECH OFFICER  OTHER_OFFICERS
6          CHIEF TECH  OTHER_OFFICERS
7           VICE PRES              VP
8           PRESIDENT       PRESIDENT
9          PRESIDANTE       PRESIDENT
10              OWNER    OWNER_CO_OWN
11           CO OWNER    OWNER_CO_OWN
12           DIRECTOR    DIR_MGR_HEAD
13            MANAGER    DIR_MGR_HEAD
14                NaN    NOTAVAILABLE

What I want to do is create a mapping like the following:

TITLE_REPLACE = {'CEO_FOUNDER':'CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN',
                'OTHER_OFFICERS':'CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$',
                'VP':'VICE|VP',
                'PRESIDENT':'PRESIDENT|PRES',
                'OWNER_CO_OWN':'OWNER',
                'DIR_MGR_HEAD':'MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'}

and then pass it to a function that applies the numpy operations step by step and produces the same result as above.

I am doing this because I have to parameterize my code, so that all parameters for the data manipulation are supplied from a JSON file.

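I tried the pandas replace method, since it accepts a dictionary, but it does not preserve the hierarchy the way nested np.where does. It also cannot replace the entire title, because it only replaces the part of the string where it finds a match.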
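To illustrate the substring behaviour described above, here is a minimal sketch. It assumes the attempt was Series.replace with regex=True, and the two-row Series is made up for the demonstration:

```python
import pandas as pd

# Two titles taken from the sample data, used only for illustration.
titles = pd.Series(['CHIEF EXECUTIVE', 'CO OWNER'])

# With regex=True, replace rewrites only the matched substring;
# the rest of each title is left in place.
replaced = titles.replace({'CHIEF': 'OTHER_OFFICERS', 'OWNER': 'OWNER_CO_OWN'}, regex=True)
print(replaced.tolist())
# ['OTHER_OFFICERS EXECUTIVE', 'CO OWNER_CO_OWN']
```

Neither value ends up as a clean group label, which is why a "first matching pattern wins, replace the whole value" approach is needed instead.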

If you can provide a solution to the problem above, I would also like to know how to handle the following two other scenarios:

  1. This scenario uses .isin instead of a regex:
dataF['INDUSTRY'] = np.where(dataF['INDUSTRY'].isin(['AEROSPACE','AGRICULTURE/MINING','EDUCATION','ENERGY']),'AER_AGR_MIN_EDU_ENER',
                    np.where(dataF['INDUSTRY'].isin(['TRAVEL','INSURANCE','GOVERNMENT','FINANCIAL SERVICES','AUTO','PHARMACEUTICALS']),'TRA_INS_GOVT_FIN_AUT_PHAR',
                    np.where(dataF['INDUSTRY'].isin(['BUSINESS GOODS/SERVICES','CHEMICALS ','TELECOM','TRANSPORTATION']),'BS_CHEM_TELE_TRANSP',
                    np.where(dataF['INDUSTRY'].isin(['CONSUMER GOODS','ENTERTAINMENT','FOOD AND BEVERAGE','HEALTHCARE','INDUSTRIAL/MANUFACTURING','TECHNOLOGY']),'CG_ENTER_FB_HLTH_IND_TECH',
                    np.where(dataF['INDUSTRY'].isin(['ADVERTISING','ASSOCIATION','CONSULTING/ACCOUNTING','PUBLISHING/MEDIA','TECHNOLOGY']),'ADV_ASS_CONS_ACC_PUBL_MED_TECH',
                    np.where(dataF['INDUSTRY'].isin(['RESTAURANT','SOFTWARE']),'REST_SOFT',
                                            'NOTAVAILABLE'))))))
  2. This scenario uses .between:
dataF['annual_revn'] = np.where(dataF['annual_revn'].between(1000000,10000000),'1_10_MILLION',
                       np.where(dataF['annual_revn'].between(10000000,15000000),'10_15_MILLION',
                       np.where(dataF['annual_revn'].between(15000000,20000000),'15_20_MILLION',
                       np.where(dataF['annual_revn'].between(20000000,50000000),'20_50_MILLION',
                       np.where(dataF['annual_revn'].between(50000000,1000000000),'50_1000_MILLION',
                                           'NOTAVAILABLE_OUTLIER')))))

1 answer
User
#1 · Posted 2024-06-26 14:42:49

The following approach works, but it is not particularly elegant and probably not very fast.

import pandas as pd
import numpy as np
import re

dataF = pd.DataFrame({'TITLE':['CEO','CHIEF EXECUTIVE','EXECUTIVE OFFICER','FOUNDER',
                 'CHIEF OP','TECH OFFICER','CHIEF TECH','VICE PRES','PRESIDENT','PRESIDANTE','OWNER','CO OWNER',
                 'DIRECTOR','MANAGER',np.nan]})

TITLE_REPLACE = {'CEO_FOUNDER':'CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN',
                'OTHER_OFFICERS':'CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$',
                'VP':'VICE|VP',
                'PRESIDENT':'PRESIDENT|PRES',
                'OWNER_CO_OWN':'OWNER',
                'DIR_MGR_HEAD':'MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'}

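# Swap the keys and values from the raw data, and split regex by '|'.
# dicts keep insertion order (Python 3.7+), so earlier groups keep priority.
reverse_replace = {}
for key, value in TITLE_REPLACE.items():
    for value_single in value.split('|'):
        reverse_replace[value_single] = key

def mapping_func(x):
    # pd.isna also catches None and float('nan'), not only the np.nan object
    if pd.isna(x):
        return 'NOTAVAILABLE'
    # Return the group of the first pattern found in the title
    for pattern, group in reverse_replace.items():
        if re.search(pattern, x):
            return group
    # No pattern matched: keep the original title, as the nested np.where does
    return x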

dataF['TITLE_GRP'] = dataF['TITLE'].apply(mapping_func)


                TITLE       TITLE_GRP
0                 CEO     CEO_FOUNDER
1     CHIEF EXECUTIVE     CEO_FOUNDER
2   EXECUTIVE OFFICER     CEO_FOUNDER
3             FOUNDER     CEO_FOUNDER
4            CHIEF OP  OTHER_OFFICERS
5        TECH OFFICER  OTHER_OFFICERS
6          CHIEF TECH  OTHER_OFFICERS
7           VICE PRES              VP
8           PRESIDENT       PRESIDENT
9          PRESIDANTE       PRESIDENT
10              OWNER    OWNER_CO_OWN
11           CO OWNER    OWNER_CO_OWN
12           DIRECTOR    DIR_MGR_HEAD
13            MANAGER    DIR_MGR_HEAD
14                NaN    NOTAVAILABLE
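If the row-by-row apply turns out to be slow, one possible vectorized alternative is np.select, which takes the first condition that is true for each row and so keeps the priority order of the dictionary. This is only a sketch: the helper name group_by_patterns and its missing parameter are made up for illustration, and it assumes the dictionary is ordered from highest to lowest priority.

```python
import numpy as np
import pandas as pd

TITLE_REPLACE = {'CEO_FOUNDER': 'CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN',
                 'OTHER_OFFICERS': 'CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$',
                 'VP': 'VICE|VP',
                 'PRESIDENT': 'PRESIDENT|PRES',
                 'OWNER_CO_OWN': 'OWNER',
                 'DIR_MGR_HEAD': 'MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'}

def group_by_patterns(series, mapping, missing='NOTAVAILABLE'):
    """Hypothetical helper: label each value with the first group whose regex matches."""
    # First condition handles missing values; na=False keeps the regex masks boolean.
    conditions = [series.isna().to_numpy()] + [
        series.str.contains(pattern, regex=True, na=False).to_numpy()
        for pattern in mapping.values()
    ]
    choices = [missing] + list(mapping.keys())
    # np.select uses the first True condition per row, like the nested np.where chain.
    labels = pd.Series(np.select(conditions, choices, default=''), index=series.index)
    # Rows that matched nothing keep their original value.
    matched = np.logical_or.reduce(conditions)
    return labels.where(matched, series)

dataF = pd.DataFrame({'TITLE': ['CEO', 'CHIEF TECH', 'VICE PRES', 'ANALYST', np.nan]})
dataF['TITLE_GRP'] = group_by_patterns(dataF['TITLE'], TITLE_REPLACE)
print(dataF)
```

Because the mapping is a plain dictionary of strings, it could be loaded from a JSON file without changing the function.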

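For your other scenarios, it may make sense to build a DataFrame from the industry mapping data and then use df.merge to determine the group from the industry.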
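A rough sketch of that idea follows. The two dictionaries are shortened, and their {group: [members]} and {group: [low, high]} shapes are an assumption chosen because they serialize to JSON easily. The helper names are hypothetical, and using np.select for the .between scenario is a suggestion that goes beyond the merge idea above.

```python
import numpy as np
import pandas as pd

# Shortened example mappings; the real ones would come from the JSON file.
INDUSTRY_GROUPS = {
    'AER_AGR_MIN_EDU_ENER': ['AEROSPACE', 'AGRICULTURE/MINING', 'EDUCATION', 'ENERGY'],
    'REST_SOFT': ['RESTAURANT', 'SOFTWARE'],
}
REVENUE_GROUPS = {
    '1_10_MILLION': [1_000_000, 10_000_000],
    '10_15_MILLION': [10_000_000, 15_000_000],
}

def group_by_membership(series, groups, default='NOTAVAILABLE'):
    """Hypothetical helper: left-merge the values against a member -> group table."""
    lookup = pd.DataFrame(
        [(member, group) for group, members in groups.items() for member in members],
        columns=['member', 'group'],
    ).drop_duplicates('member')  # a member listed twice keeps its first group
    merged = series.to_frame('member').merge(lookup, on='member', how='left')
    # A left merge keeps the row order, so the original index can be restored.
    return pd.Series(merged['group'].fillna(default).to_numpy(), index=series.index)

def group_by_range(series, groups, default='NOTAVAILABLE_OUTLIER'):
    """Hypothetical helper: the first inclusive [low, high] range containing the value wins."""
    conditions = [series.between(low, high).to_numpy() for low, high in groups.values()]
    labels = np.select(conditions, list(groups.keys()), default=default)
    return pd.Series(labels, index=series.index)

industry = pd.Series(['ENERGY', 'SOFTWARE', 'UNKNOWN', None])
revenue = pd.Series([5_000_000, 10_000_000, 12_000_000, 500])
print(group_by_membership(industry, INDUSTRY_GROUPS).tolist())
print(group_by_range(revenue, REVENUE_GROUPS).tolist())
```

Like the nested np.where in the question, a value on a shared boundary such as 10000000 falls into the first range listed, and an industry that appears in two lists is assigned to the first one.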
