如何根据另一个数据框中的值在数据框中创建分类字段

2024-06-28 19:47:18 发布

您现在位置:Python中文网/ 问答频道 /正文

我需要根据另一个数据帧中的值创建一个分类变量。考虑表1,其中有医院就诊和患者ID。请注意,患者可以多次就诊:

+----------+------------+
| visit_id | patient_id |
+----------+------------+
|       10 |          1 |
|       20 |          1 |
|       50 |          2 |
|      100 |          3 |
|      110 |          3 |
+----------+------------+

我需要添加一个带有1或0的新字段,指示患者在医院就诊期间是否服用阿司匹林,如表2所示:

+----------+------------+---------------+
| visit_id | patient_id |  medication   |
+----------+------------+---------------+
|       10 |          1 | aspirin       |
|       10 |          1 | ibuprofin     |
|       20 |          1 | codine        |
|       50 |          2 | aspirin       |
|      100 |          3 | ibuprofin     |
|      110 |          3 | acetaminophin |
|      110 |          3 | vicodin       |
+----------+------------+---------------+

你可以再次看到多层次-你可以从医生那里得到不止一种药物,对吗?当然,这只是一个例子

我尝试合并表(内部连接),这很有效

tab1 = pd.merge(tab1, tab2, on=['visit_id','patient_id'])

tab1['aspirin_index'] = np.where(tab1['medication'].str.contains('aspirin', 
       flags=re.IGNORECASE, regex=True, na=False),1,0)

……但后来我给1号病人买了两份,他们同时服用阿司匹林和布洛芬。我只想知道他们是否至少吃过一次阿司匹林

+----------+------------+---------------+
| visit_id | patient_id | aspirin_index |
+----------+------------+---------------+
|       10 |          1 |             1 |
|       10 |          1 |             0 |
+----------+------------+---------------+

我需要到这里…和表1一样的形状,但只是有新的索引

+----------+------------+---------------+
| visit_id | patient_id | aspirin_index |
+----------+------------+---------------+
|       10 |          1 |             1 |
|       20 |          1 |             0 |
|       50 |          2 |             1 |
|      100 |          3 |             0 |
|      110 |          3 |             0 |
+----------+------------+---------------+

Tags: 数据患者idindex分类visittab1medication
1条回答
网友
1楼 · 发布于 2024-06-28 19:47:18

首先,让我们设置示例数据

# setup tab1 & tab2
tab1 = pd.DataFrame([[10, 1], [20, 1], [50, 2], [100, 3], [110, 3]], columns=["visit_id","patient_id"])
tab2 = pd.DataFrame([[10, 1, "aspirin"], [10, 1, "ibuprofin"], [20, 1, "codine"], [50, 2, "aspirin"], [100, 3, "ibuprofin"], [110, 3, "acetominophin"], [110, 3, "vicodin"]], columns=["visit_id","patient_id", "medication"])

有很多方法可以做到这一点。一种方法可能是将tab2过滤为仅阿司匹林,使用“left”连接将其连接到tab1,然后用0填充空值

# filter tab2 to aspirin only
# change column name
# change to 1/0 instead of text since it now only refers to aspirin
aspirin = tab2.loc[tab2.medication=="aspirin"].copy()
aspirin.columns = ["visit_id", "patient_id", "aspirin_index"]
aspirin["aspirin_index"] = 1

# left-outer merge and fill nulls
tab1 = pd.merge(tab1, aspirin, how="left", on=["visit_id","patient_id"])
tab1.aspirin_index.fillna(0, inplace=True)
tab1["aspirin_index"] = tab1.aspirin_index.astype("int")

#  visit_id  patient_id  aspirin_index
#        10           1              1
#        20           1              0
#        50           2              1
#       100           3              0
#       110           3              0

这会给你一列“阿司匹林指数”。这样就完成了你的目标

但是一次用所有的药物做同样的练习怎么样。。。包括阿司匹林?sklearn有一些预处理函数,使这很容易

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
lb = preprocessing.LabelBinarizer()

# convert each drug into a column of 1's and 0's
all_drugs = pd.DataFrame(lb.fit_transform(le.fit_transform(tab2.medication)), columns=le.classes_)

# concat with source data, aggregate, and clean up
tab2 = pd.concat((tab2.loc[:,["visit_id", "patient_id"]].copy(), all_drugs), axis=1)
tab2 = tab2.groupby(["visit_id", "patient_id"]).agg(np.sum)
tab2.reset_index(inplace=True)

#  visit_id  patient_id  acetominophin  aspirin  codine  ibuprofin  vicodin
#        10           1              0        1       0          1        0
#        20           1              0        0       1          0        0
#        50           2              0        1       0          0        0
#       100           3              0        0       0          1        0
#       110           3              1        0       0          0        1

这是将分类数据作为二进制特征列获取的一种非常常见的方法。但它占用了大量的空间

如果坚持用一个列,把每次就诊的药物都列在一张单子上呢?这样你就可以进行文本搜索,而不必为稀有药物设置密集的0列

# create tab1 with ALL meds taken on each visit
tab2 = tab2.groupby(["visit_id", "patient_id"]).agg({"medication": list})
tab1 = pd.merge(tab1, tab2, how="left", on=["visit_id","patient_id"])

#  visit_id  patient_id                medication
#        10           1      [aspirin, ibuprofin]
#        20           1                  [codine]
#        50           2                 [aspirin]
#       100           3               [ibuprofin]
#       110           3  [acetominophin, vicodin]

# helper function to extract records for ANY drug
def drug_finder(drug):
  idx = tab1.medication.apply(lambda drugs: drug in drugs)
  return tab1.loc[idx].copy()

# find aspirin
drug_finder("aspirin")

#  visit_id  patient_id            medication
#        10           1  [aspirin, ibuprofin]
#        50           2             [aspirin]

# find ibuprofin
drug_finder("ibuprofin")

#  visit_id  patient_id            medication
#        10           1  [aspirin, ibuprofin]
#       100           3           [ibuprofin]

相关问题 更多 >