Optimizing the performance of converting values to 0 and 1

Published 2024-09-25 12:34:25


Suppose we have a list of employees along with some other data:

  Employee   Location   Title
0        1  Location1  Title1
1        2  Location2  Title1
2        3  Location3  Title2
3        4  Location1  Title3
4        5  Location1  Title2

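I convert this into features and labels with (1, 0) values. It works, but it takes a long time on a database of 6k records. The logic: take the distinct values from Location and make each one a column; if the employee's location matches the column, put 1, otherwise put 0. The same is done for Title.

My question: is it possible to optimize the performance somehow? I lack the terminology, so it is hard to search for a better solution, but I believe there should be something.

The final output looks like this: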

 Employee  Location1  Location2  Location3  Title1  Title2  Title3
0        1          1          0          0       1       0       0
1        2          0          1          0       1       0       0
2        3          0          0          1       0       1       0
3        4          1          0          0       0       0       1
4        5          1          0          0       0       1       0

Working code that takes a long time to finish:

import pandas as pd
df = pd.DataFrame.from_dict({'Employee': ['1','2','3','4','5'], 
      'Location': ['Location1', 'Location2','Location3','Location1','Location1'],
      'Title': ['Title1','Title1','Title2','Title3','Title2']
     })
df_tr = df['Employee'] #temporary employee ids

# transposing the data, which takes ages:

df_newcols = {}
for column in list(df)[1:]:
    newcols = df[column].unique()
    for key in newcols:
        temp_ar = []
        for value in df[column]:
            if key == value:
                temp_ar.append(1)
            else:
                temp_ar.append(0)
        df_newcols[key] = temp_ar
print (df_newcols)

# adding transposed to the temp df

df_temp = pd.DataFrame.from_dict(df_newcols)

# merging with df with employee ids

new_df = pd.concat([df_tr,df_temp],axis=1)

3 answers

Another solution, using pd.get_dummies:

print( pd.concat([df['Employee'],
                  pd.get_dummies(df['Location']),
                  pd.get_dummies(df['Title'])], axis=1) )

Prints (on pandas 2.0 and later, pass dtype=int to pd.get_dummies to get 0/1 instead of True/False):

  Employee  Location1  Location2  Location3  Title1  Title2  Title3
0        1          1          0          0       1       0       0
1        2          0          1          0       1       0       0
2        3          0          0          1       0       1       0
3        4          1          0          0       0       0       1
4        5          1          0          0       0       1       0
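
As a variation on the answer above, here is a minimal sketch that encodes both columns in a single pd.get_dummies call instead of concatenating separate results. The variable name encoded is just an illustrative choice, and the empty prefix and prefix_sep are there only to keep the bare category names (Location1, Title1, ...) as column names, which assumes the two columns never share a value.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["1", "2", "3", "4", "5"],
    "Location": ["Location1", "Location2", "Location3", "Location1", "Location1"],
    "Title": ["Title1", "Title1", "Title2", "Title3", "Title2"],
})

# Encode both categorical columns in one call; Employee is passed through untouched.
# prefix="" and prefix_sep="" keep the bare category names as column names.
# dtype=int forces 0/1 values (pandas 2.0+ would otherwise return booleans).
encoded = pd.get_dummies(
    df, columns=["Location", "Title"], prefix="", prefix_sep="", dtype=int
)
print(encoded)
```

If Location and Title could contain the same string, dropping the prefix would produce duplicate column names, so in that case it is probably safer to keep the default prefixes.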

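You should rely more on apply and on pandas' own methods. Writing explicit for loops over the rows of a DataFrame is very slow and will ruin your performance.

One possible solution is the following: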

import pandas as pd


# read the file
emp=pd.read_csv("employee_huge.txt", sep=" ")


# generate unique lists containing LocationX and TitleX
lnewcols_location=set(emp["Location"].to_list())
lnewcols_title=set(emp["Title"].to_list())


# a function to compare a cell (like "Location1") to a string that is the name of the column
# like "Location2".  If they match return 1, otherwise 0
def same_as_col(acell, col):
    return 1 if acell == col else 0


# generate all the LocationN columns with 1 or 0 if there is a match
for i in lnewcols_location:
  emp[i]=emp["Location"].apply(same_as_col, col=i)

# generate all the TitleN columns with 1 or 0 if there is a match
for i in lnewcols_title:
  emp[i]=emp["Title"].apply(same_as_col, col=i)

# removing Location and Title columns
emp=emp.drop(["Location", "Title"], axis=1)
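
A possible refinement of the code above: apply still calls a Python function once per cell, whereas comparing the whole column to the value at once lets pandas do the work in a single vectorized step. The sketch below assumes a small in-memory stand-in for the file (the real one is much larger) and sorts the unique values only to make the column order deterministic.

```python
import pandas as pd

# Small stand-in for the file contents (an assumption; the real file is much larger).
emp = pd.DataFrame({
    "Employee": [0, 1, 2, 3, 4],
    "Location": ["Location4", "Location1", "Location1", "Location1", "Location4"],
    "Title": ["Title1", "Title3", "Title2", "Title4", "Title1"],
})

for source in ["Location", "Title"]:
    # sorted() gives a deterministic column order, unlike iterating over a set
    for value in sorted(emp[source].unique()):
        # Comparing the whole Series at once yields booleans; astype(int) maps them to 1/0
        emp[value] = (emp[source] == value).astype(int)

# Remove the original categorical columns
emp = emp.drop(columns=["Location", "Title"])
print(emp)
```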

For reference, I generated a file named employee_huge.txt. Its contents are formatted as follows:

Employee Location Title
0 Location4 Title1
1 Location1 Title3
2 Location1 Title2
3 Location1 Title4
4 Location4 Title1

This should do it:

df["_dummy"]=1
df2=pd.concat([
    df.pivot_table(index="Employee", columns="Location", values="_dummy", aggfunc="max"),
    df.pivot_table(index="Employee", columns="Title", values="_dummy", aggfunc="max")
], axis=1).fillna(0).astype(int).reset_index(drop=False)

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
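
To try the pivot approach on the five-row sample from the question, a self-contained sketch might look like the following. The list name parts is only illustrative, and the aggregation is passed as the string "max", which recent pandas versions prefer over the builtin function. Note that pivot_table sorts its result by the index, so rows come back ordered by Employee rather than in their original order.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["1", "2", "3", "4", "5"],
    "Location": ["Location1", "Location2", "Location3", "Location1", "Location1"],
    "Title": ["Title1", "Title1", "Title2", "Title3", "Title2"],
})

# Helper column of ones; pivoting it leaves 1 where a category occurs and NaN elsewhere
df["_dummy"] = 1

# One pivot per categorical column, each indexed by Employee
parts = [
    df.pivot_table(index="Employee", columns=col, values="_dummy", aggfunc="max")
    for col in ["Location", "Title"]
]

# Align the pivots on Employee, turn NaN into 0, cast to int, then restore Employee as a column
df2 = pd.concat(parts, axis=1).fillna(0).astype(int).reset_index()
print(df2)
```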
