Subtracting a value from every value in a DataFrame

Posted 2024-05-05 06:00:01


I have a DataFrame that looks like this:

userId   movie1   movie2   movie3   movie4   score
0        4.1      2.1      1.0      NaN      2
1        3.1      1.1      3.4      1.4      1
2        2.8      NaN      1.7      NaN      3
3        NaN      5.0      NaN      2.3      4
4        NaN      NaN      NaN      NaN      1
5        2.3      NaN      2.0      4.0      1

I want to subtract the value in the score column from each movie rating in the same row, so the output looks like this:

userId   movie1   movie2   movie3   movie4   score
0        2.1      0.1     -1.0      NaN      2
1        2.1      0.1      2.4      0.4      1
2       -0.2      NaN     -1.3      NaN      3
3        NaN      1.0      NaN     -1.7      4
4        NaN      NaN      NaN      NaN      1
5        1.3      NaN      1.0      3.0      1

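The real DataFrame has thousands of movies, and they are all referenced by name, so I am looking for a solution that can handle that.

I should also mention that the movie columns are not named in sequence like ["movie1", "movie2", "movie3"]; they are named by title, e.g. ["Star Wars", "Harry Potter", "Lord of the Rings"]. The dataset can change, so I do not know which movie will be last in the list.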


3 Answers

You can use NumPy broadcasting to do the subtraction here:

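# Movie ratings as a 2-D array of shape (n_users, n_movies)
v = df.loc[:, 'movie1':'movie4'].to_numpy()
# Scores as a 1-D array of shape (n_users,)
s = df['score'].to_numpy()
# s[:, None] has shape (n_users, 1), so it broadcasts across the movie columns
out = v - s[:, None]
df.loc[:, 'movie1':'movie4'] = out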

df
   userId  movie1  movie2  movie3  movie4  score
0       0     2.1     0.1    -1.0     NaN      2
1       1     2.1     0.1     2.4     0.4      1
2       2    -0.2     NaN    -1.3     NaN      3
3       3     NaN     1.0     NaN    -1.7      4
4       4     NaN     NaN     NaN     NaN      5
5       5    -3.7     NaN    -4.0    -2.0      6
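
To make the broadcasting step easier to see, here is a minimal sketch on toy arrays (the numbers are made up for illustration and are not the data from the question). It only shows how `s[:, None]` changes the shape so that each row's score is subtracted from every column of that row.

```python
import numpy as np

# Toy data (an assumption for illustration): 3 users x 2 movies.
v = np.array([[4.0, 2.0],
              [3.0, np.nan],
              [1.0, 5.0]])
s = np.array([2, 1, 3])   # one score per user, shape (3,)

col = s[:, None]          # add a trailing axis: shape (3,) becomes (3, 1)
out = v - col             # (3, 2) - (3, 1): the score is repeated across columns

print(col.shape)  # (3, 1)
print(out)        # NaN entries stay NaN
```

Without the extra axis, `v - s` would not work for these shapes, because NumPy would try to match the 3 scores against the 2 movie columns.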

If you do not know the column names in advance, use `Index.difference` here:

cols = df.columns.difference(['userId', 'score'])
# Every column name is extracted except for 'userId' and 'score'
cols
# Index(['movie1', 'movie2', 'movie3', 'movie4'], dtype='object')

Now replace 'movie1':'movie4' with cols:

v = df.loc[:, cols].to_numpy()
s = df['score'].to_numpy()
out = v - s[:, None]
df.loc[:, cols] = out
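
Since the question says the real columns are movie titles, here is a small sketch of the same approach on a hypothetical frame with title-named columns (the ratings are invented). One thing worth knowing: by default `Index.difference` returns the labels sorted, so `cols` may not be in the original column order. That should not affect the result, because both the selection and the assignment go through the same labels.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with title-named columns, as described in the question.
df = pd.DataFrame({
    "userId": [0, 1, 2],
    "Star Wars": [4.0, 3.0, np.nan],
    "Harry Potter": [2.0, np.nan, 5.0],
    "score": [2, 1, 3],
})

# All column labels except the two non-movie columns.
cols = df.columns.difference(["userId", "score"])
print(list(cols))  # ['Harry Potter', 'Star Wars'] (sorted, not original order)

# Select and assign with the same labels, so the ordering of cols is harmless.
v = df.loc[:, cols].to_numpy()
s = df["score"].to_numpy()
df.loc[:, cols] = v - s[:, None]
print(df)
```

If the original order matters for some other reason, `df.columns.difference([...], sort=False)` is an option.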

A possible solution:

import numpy  as np
import pandas as pd

df = pd.DataFrame()
df['userId'] = [0     , 1  , 2     , 3     , 4     , 5     ]
df['movie1'] = [4.1   , 3.1, 2.8   , np.nan, np.nan, 2.3   ]
df['movie2'] = [2.1   , 1.1, np.nan, 5.0   , np.nan, np.nan]
df['movie3'] = [1.0   , 3.4, 1.7   , np.nan, np.nan, 2.0   ]
df['movie4'] = [np.nan, 1.4, np.nan, 2.3   , np.nan, 4.0   ]
df['score'] = [2, 1, 3, 4, 5, 6]

print('before = ', df)
# Assumes userId is the first column and score is the last one
df.iloc[:, 1:-1] = df.iloc[:, 1:-1].sub(df.iloc[:, -1].values, axis='index')

print('after = ', df)

It should return:

   userId  movie1  movie2  movie3  movie4  score
0       0     2.1     0.1    -1.0     NaN      2
1       1     2.1     0.1     2.4     0.4      1
2       2    -0.2     NaN    -1.3     NaN      3
3       3     NaN     1.0     NaN    -1.7      4
4       4     NaN     NaN     NaN     NaN      5
5       5    -3.7     NaN    -4.0    -2.0      6
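
The positional slicing above depends on the column layout. As a rough alternative sketch (my own variation, not part of the answer), the movie columns can be selected by dropping the known non-movie columns, and `DataFrame.sub` can take the score Series directly with `axis=0`, which aligns it on the row index. The small frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (values are assumptions, not the full data above).
df = pd.DataFrame({
    "userId": [0, 1, 2],
    "movie1": [4.1, 3.1, np.nan],
    "movie2": [2.1, np.nan, 5.0],
    "score": [2, 1, 4],
})

# Everything except the id and score columns is treated as a movie column.
movies = df.drop(columns=["userId", "score"])

# axis=0 matches the score Series against the rows, so each row's score
# is subtracted from every movie rating in that row.
df[movies.columns] = movies.sub(df["score"], axis=0)
print(df)
```

This version does not care where `userId` and `score` sit in the frame, only what they are called.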

Use `DataFrame.filter` to identify the movie columns, then subtract the score array from those columns:

In [35]: x = df.filter(like='movie', axis=1).columns.tolist()

In [36]: df[x] = df.filter(like='movie', axis=1) - df.score.values[:, None]

In [37]: df
Out[37]: 
   userId  movie1  movie2  movie3  movie4  score
0       0     2.1     0.1    -1.0     NaN      2
1       1     2.1     0.1     2.4     0.4      1
2       2    -0.2     NaN    -1.3     NaN      3
3       3     NaN     1.0     NaN    -1.7      4
4       4     NaN     NaN     NaN     NaN      5
5       5    -3.7     NaN    -4.0    -2.0      6

Edit: when the movie column names are arbitrary, select every column except 'userId' and 'score':

x = df.columns[~df.columns.isin(['userId', 'score'])]
df[x] = df[x] - df.score.values[:, None]
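
If the dataset changes often, it may be convenient to wrap this in a function that returns a new frame instead of modifying `df` in place. The following is only a sketch: `subtract_score` and its parameters are hypothetical names I made up, and the sample titles and ratings are invented.

```python
import numpy as np
import pandas as pd

def subtract_score(df, score_col="score", keep=("userId",)):
    """Hypothetical helper: return a copy of df with score_col subtracted
    from every column other than score_col and the columns in keep."""
    excluded = [score_col, *keep]
    # Boolean mask on the column index; the original column order is kept.
    cols = df.columns[~df.columns.isin(excluded)]
    out = df.copy()
    out[cols] = df[cols] - df[score_col].to_numpy()[:, None]
    return out

# Made-up example with title-named columns.
ratings = pd.DataFrame({
    "userId": [0, 1],
    "Star Wars": [4.0, np.nan],
    "Lord of the Rings": [3.0, 5.0],
    "score": [2, 1],
})
result = subtract_score(ratings)
print(result)
```

The input frame is left untouched, which makes it easier to compare before and after.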
