I have the following DataFrame (other columns omitted):
| customer_id | department |
| ----------- | ----------------------------- |
| 11 | ['nail', 'men_skincare'] |
| 23 | ['nail', 'fragrance'] |
| 25 | [] |
| 45 | ['skincare', 'men_fragrance'] |
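I am preprocessing the data to make it suitable for a model. I want to convert the department variable into dummy variables, one per unique department category (however many unique departments there are, not only the ones shown here),
to get this result: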
| customer_id | department | nail | men_skincare | fragrance | skincare | men_fragrance |
| ----------- | ---------- | ---- | ------------ | --------- | -------- | ------------- |
| 11 | ['nail', 'men_skincare'] | 1 | 1 | 0 | 0 | 0 |
| 23 | ['nail', 'fragrance'] | 1 | 0 | 1 | 0 | 0 |
| 25 | [] | 0 | 0 | 0 | 0 | 0 |
| 45 | ['skincare', 'men_fragrance'] | 0 | 0 | 0 | 1 | 1 |
I tried this link, but when I index into the column it is treated as a string, so I only get one column per character of the string. I used:
df['1st'] = df['department'].str[0]
df['2nd'] = df['department'].str[1]
df['3rd'] = df['department'].str[2]
df['4th'] = df['department'].str[3]
df['5th'] = df['department'].str[4]
df['6th'] = df['department'].str[5]
df['7th'] = df['department'].str[6]
df['8th'] = df['department'].str[7]
df['9th'] = df['department'].str[8]
df['10th'] = df['department'].str[9]
Then I tried splitting the string to convert it into a list, using:
df['new_column'] = df['department'].apply(lambda x: x.split(","))
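I then tried again, but it still only created one column per character.
Any suggestions?
Edit: I found the answer using the link that anky sent, specifically this answer: https://stackoverflow.com/a/29036042
What worked for me was: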
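import pandas as pd
# Strip quotes, brackets, and spaces so only comma-separated names remain
df['department'] = df['department'].str.replace("'", '', regex=False).str.replace("]", '', regex=False).str.replace("[", '', regex=False).str.replace(' ', '', regex=False)
df['department'] = df['department'].apply(lambda x: x.split(","))
s = df['department']
# One dummy column per department; groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0
df1 = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
df = pd.merge(df, df1, right_index=True, left_index=True, how='left')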
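A variant of the same idea, as a minimal sketch: it parses the list-like strings with `ast.literal_eval` instead of stripping characters, and uses `explode()` so that customers with an empty list get all-zero dummy columns. The sample frame and the names `lists`, `dummies`, and `result` are assumptions made for illustration; they are not from the original post.

```python
import ast

import pandas as pd

# Sample data mirroring the question; department holds list-like *strings*.
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        "['nail', 'men_skincare']",
        "['nail', 'fragrance']",
        "[]",
        "['skincare', 'men_fragrance']",
    ],
})

# Parse each string into a real Python list.
lists = df["department"].apply(ast.literal_eval)

# explode() gives one row per list element (an empty list becomes NaN);
# get_dummies() encodes NaN as an all-zero row, and the groupby collapses
# the rows back to one per original index label.
dummies = pd.get_dummies(lists.explode(), dtype=int).groupby(level=0).sum()

# Attach the dummy columns to the original frame by index.
result = df.join(dummies)
print(result)
```

Because the dummy columns are built from whatever values appear in the data, this should work for any number of unique departments.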
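Here is a fast binarizer method based on anky's link, using one of sklearn's binarizers.
Note: this assumes the `department` column of the actual data contains real Python lists rather than list-like strings. If the values are actually strings (i.e. `type(df.department[0])` outputs `str`), they need to be converted to lists first.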
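The answer's own code did not survive on this page, so the following is only a sketch of how a binarizer approach could look. It assumes the sklearn class meant is `MultiLabelBinarizer` and that `ast.literal_eval` is an acceptable way to do the string-to-list conversion. The sample frame and the names `mlb`, `encoded`, and `result` are likewise illustrative.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data mirroring the question; department holds list-like *strings*.
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        "['nail', 'men_skincare']",
        "['nail', 'fragrance']",
        "[]",
        "['skincare', 'men_fragrance']",
    ],
})

# Convert only if the column holds strings rather than real lists.
if isinstance(df["department"].iloc[0], str):
    df["department"] = df["department"].apply(ast.literal_eval)

# MultiLabelBinarizer learns the set of unique labels and returns a 0/1
# matrix with one column per label (assumed here to be the intended class).
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df["department"]),
    columns=mlb.classes_,
    index=df.index,
)

# Attach the indicator columns to the original frame by index.
result = df.join(encoded)
print(result)
```

An empty list simply produces a row of zeros, so no special handling is needed for customers without departments.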
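You can do this with the `explode()`, `value_counts()`, and `fillna()` methods. Next, use the `crosstab()` method. Since the `concat()` method will give you an error here, use the `merge()` and `drop()` methods instead. Now, if you print `data`, you will get the desired output.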
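The code for these steps is missing from this page. The sketch below is a reconstruction under assumptions, not the answerer's exact code: it uses `explode()`, `crosstab()`, `merge()`, and `fillna()`, and it assumes `department` already holds real Python lists. The sample frame and the names `exploded`, `counts`, `dummy_cols`, and `result` are illustrative.

```python
import pandas as pd

# Sample data mirroring the question; department holds real Python lists.
data = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        ["nail", "men_skincare"],
        ["nail", "fragrance"],
        [],
        ["skincare", "men_fragrance"],
    ],
})

# One row per (customer, department) pair; an empty list becomes NaN.
exploded = data[["customer_id", "department"]].explode("department")

# Count occurrences of each department per customer. Rows whose department
# is NaN are dropped, so customers with empty lists are absent here.
counts = pd.crosstab(exploded["customer_id"], exploded["department"])

# Left-merge the counts back on customer_id, then fill the missing rows
# (customers with empty lists) with zeros.
result = data.merge(counts, left_on="customer_id", right_index=True, how="left")
dummy_cols = list(counts.columns)
result[dummy_cols] = result[dummy_cols].fillna(0).astype(int)
print(result)
```

If a department could appear more than once in the same list, the counts would exceed 1; clipping them to 1 would restore strict 0/1 indicators.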