Pandas groupby + scikit-learn PowerTransformer

Posted 2024-09-27 17:56:47

As a follow-up to my question "Multiple distribution normality testing and transformation in pandas dataframe", I found the PowerTransformer class in scikit-learn.

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_map_data_to_normal.html

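Consider the sales of a large retail network (hundreds of products and thousands of stores), simplified as follows:
- Store 1, Store 2
- Product A, Product B, Product C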

I want to detect anomalies in sales levels by running parametric tests, which requires all the distributions to be normal.

I tried to get PowerTransformer to work through a groupby so that all the distributions are normalized as efficiently as possible, but without success.

The data does contain some negative values, so I decided to use the Yeo-Johnson method, which allows negative values.

I tried the following:

from sklearn.preprocessing import PowerTransformer
yj = PowerTransformer(method='yeo-johnson') 

df['ScaledSales'] = df.groupby(['Store', 'Product'])['Sales'].transform(lambda x: yj.fit(x))

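This returned an error: "Expected 2D array, got 1D array instead. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."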
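For context on this error: scikit-learn estimators expect input of shape (n_samples, n_features), so a 1D pandas Series is rejected. The following is a minimal sketch of that shape requirement. The sales values are made up for illustration and are not from the question's data.

```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Made-up sales values, including negatives (Yeo-Johnson accepts them)
sales = pd.Series([120.0, -35.0, 410.0, 88.0, -5.0, 260.0], name="Sales")
yj = PowerTransformer(method="yeo-johnson")

try:
    yj.fit(sales)      # 1D input of shape (6,)
    raised = False
except ValueError:
    raised = True      # scikit-learn rejects 1D input

# Either form produces the 2D shape (n_samples, 1)
as_frame = sales.to_frame()
as_array = sales.to_numpy().reshape(-1, 1)

scaled = yj.fit_transform(as_frame)  # ndarray of shape (6, 1)
```

Note that `fit` returns the fitted estimator, not the transformed values, so even with 2D input a lambda that returns `yj.fit(x)` would not yield a scaled column.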

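I also tried declaring a function that uses pandas to_frame() to convert the Series of sales values into a DataFrame that can be treated as a 2D dataset, but it returned the same error: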

def scale (x):
    x.to_frame()
    yj.fit(x)
    yj.transform(x)

df['ScaledSales'] = df.groupby(['Store','Product'])['Sales'].transform(scale)

Ideally, I would like to add a ScaledSales column to the dataframe containing the values scaled by PowerTransformer per Store + Product group, so that the sales distribution of each store + product combination is normalized.

As far as I understand PowerTransformer, this should be possible, right?

Thanks for your help.

1 answer

Answer 1 · Posted 2024-09-27 17:56:47

Suppose your df looks like this:

import pandas as pd
import numpy as np
np.random.seed(1)

df = pd.DataFrame({ 
    'Store': ['Store 1', 'Store 2'] * 50,
    'Product': ['Product A', 'Product B', 'Product C', 'Product D'] * 25,
    'Sales': [int(x) for x in np.random.randn(100)*10000]
    })

df

      Store    Product  Sales
0   Store 1  Product A  16243
1   Store 2  Product B  -6117
2   Store 1  Product C  -5281
3   Store 2  Product D -10729
4   Store 1  Product A   8654
..      ...        ...    ...
95  Store 2  Product D    773
96  Store 1  Product A  -3438
97  Store 2  Product B    435
98  Store 1  Product C  -6200
99  Store 2  Product D   6980

[100 rows x 3 columns]

Create the grouped data frame:

df_groupby = df.groupby(['Store', 'Product']).agg(Sales_sum=('Sales', 'sum')).reset_index()
df_groupby

     Store    Product  Sales_sum
0  Store 1  Product A       8696
1  Store 1  Product C      60152
2  Store 2  Product B     -24319
3  Store 2  Product D      16054
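The grouped frame has only four rows because the sample data cycles Store with period 2 and Product with period 4, so only four of the eight possible Store/Product combinations occur. A quick sketch to confirm the group sizes, reusing the sample data defined above:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({
    'Store': ['Store 1', 'Store 2'] * 50,
    'Product': ['Product A', 'Product B', 'Product C', 'Product D'] * 25,
    'Sales': [int(x) for x in np.random.randn(100) * 10000],
})

# Number of rows in each Store/Product combination that actually occurs
group_sizes = df.groupby(['Store', 'Product']).size()
print(group_sizes)
```

Each of the four groups holds 25 rows, and the aggregation collapses each group to a single sum.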

Then reshape the data and normalize it:

from sklearn.preprocessing import PowerTransformer
yj = PowerTransformer(method='yeo-johnson') 

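# scikit-learn expects a 2D array of shape (n_samples, n_features)
data = np.array(df_groupby['Sales_sum'])
reshaped_data = data.reshape(-1, 1)
print(yj.fit(reshaped_data))        # estimate lambda from the four group sums
print(yj.lambdas_)                  # fitted Yeo-Johnson lambda
print(yj.transform(reshaped_data))  # transformed, standardized values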

PowerTransformer(copy=True, method='yeo-johnson', standardize=True)
[0.99939608]
[[-0.21109932]
 [ 1.49251682]
 [-1.31406424]
 [ 0.03264674]]
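Note that the code above transforms the four aggregated sums, not the sales values inside each group. If the goal is the per-group ScaledSales column described in the question, one possible approach is to fit a separate PowerTransformer inside groupby().transform(). The sketch below is an assumption about what was intended and is not part of the original answer. `scale_group` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

np.random.seed(1)
df = pd.DataFrame({
    'Store': ['Store 1', 'Store 2'] * 50,
    'Product': ['Product A', 'Product B', 'Product C', 'Product D'] * 25,
    'Sales': [int(x) for x in np.random.randn(100) * 10000],
})

def scale_group(sales):
    """Fit a fresh Yeo-Johnson transform on one group's sales (hypothetical helper)."""
    yj = PowerTransformer(method='yeo-johnson')
    # to_frame() gives the 2D shape (n, 1); ravel() flattens the result back to 1D
    return yj.fit_transform(sales.to_frame()).ravel()

# transform() calls scale_group once per Store/Product group and realigns
# the returned values with the original row index
df['ScaledSales'] = df.groupby(['Store', 'Product'])['Sales'].transform(scale_group)
```

A new transformer is created per group so that each Store/Product combination gets its own lambda. With the default standardize=True, each group's transformed values also end up with zero mean and unit variance. If the fitted lambdas are needed later, for example for inverse_transform, the transformers would have to be stored per group instead of discarded.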
