Python pandas DataFrame: how to get all combinations of column values?

Posted 2024-09-28 21:36:00


I have a pandas DataFrame that looks like this:

        colour   points
0         red         1
1      yellow        10
2       black        -3

Then I tried the following algorithm (the code block is not preserved in this copy of the page):

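I want to get all the combinations together with the sum of their points:

  • If a colour is the main one, I want to add its value
  • If a colour is used as the secondary one, I want to add the negated value

Example: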

red_yellow =  1 + (-10) = -9
red_black  =  1 + ( +3) =  4
black_red  = -3 + ( -1) = -4

The output I currently get:

         colours  points
0        red_red       2
1     red_yellow      11
2      red_black      -2
3     yellow_red      11
4  yellow_yellow      20
5   yellow_black       7
6      black_red      -2
7   black_yellow       7
8    black_black      -6

The output I am looking for:

red_yellow       -9
red_black         4
yellow_red        9
yellow_black     13
black_red        -4
black_yellow    -13

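I don't know how to apply my logic to this code, and I bet there is a simpler way to get all the combinations without running two loops, but for now this is the only thing I have come up with.

I would like to:

  • get the desired output
  • improve performance for when we have 20 input colours
  • remove duplicates such as red_red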

3 Answers

Here is a timeit comparison of several alternatives.

| method             | ms per loop |
|--------------------+-------------|
| alt2               |        2.36 |
| using_concat       |        3.26 |
| using_double_merge |        22.4 |
| orig               |        22.6 |
| alt                |        45.8 |

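The results were generated with timeit (the timing commands themselves are not preserved in this copy of the page), using the following setup:
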
import itertools

import numpy as np
import pandas as pd


def alt(df):
    df = df.assign(const=1)  # work on a copy so the caller's frame is not modified
    result = pd.merge(df, df, on='const', how='outer')
    result = result.loc[(result['colour_x'] != result['colour_y'])]

    result['color'] = result['colour_x'] + '_' + result['colour_y']
    result['points'] = result['points_x'] - result['points_y']
    result = result[['color', 'points']]
    return result

def alt2(df):
    # Use a plain ndarray: recent pandas does not support ufunc.outer on a Series.
    pts = df['points'].to_numpy()
    points = np.add.outer(pts, -pts)
    color = pd.MultiIndex.from_product([df['colour'], df['colour']])
    # `labels` was renamed to `codes` in pandas 0.24.
    mask = color.codes[0] != color.codes[1]
    color = color.map('_'.join)
    result = pd.DataFrame({'points':points.ravel(), 'color':color})
    result = result.loc[mask]
    return result

def orig(df):
    combos = []
    points = []

    for i1 in range(len(df)):
        for i2 in range(len(df)):
            colour_main = df['colour'].iloc[i1]
            colour_secondary = df['colour'].iloc[i2]
            if colour_main != colour_secondary:
                combo = colour_main + "_" + colour_secondary

                point1 = df['points'].values[i1]
                point2 = df['points'].values[i2]
                new_points = point1 - point2

                combos.append(combo)
                points.append(new_points)

    return pd.DataFrame({'color':combos, 'points':points})

def using_concat(df):
    """https://stackoverflow.com/a/51641085/190597 (RafaelC)"""
    d = df.set_index('colour').to_dict()['points']
    s = pd.Series(list(itertools.combinations(df.colour, 2)))
    s = pd.concat([s, s.transform(lambda k: k[::-1])])
    v = s.map(lambda k: d[k[0]] - d[k[1]])
    df2 = pd.DataFrame({'comb': s.str.get(0)+'_' + s.str.get(1), 'values': v})
    return df2

def using_double_merge(df):
    """https://stackoverflow.com/a/51641007/190597 (sacul)"""
    new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
           .reset_index()
           .drop(columns=['colour', 'points'])
           .merge(df.set_index('colour'), left_on='level_0', right_index=True)
           .merge(df.set_index('colour'), left_on='level_1', right_index=True))

    # points_x is the main colour, points_y the secondary one.
    new['points_y'] *= -1
    new['sum'] = new['points_x'] + new['points_y']
    new = new[new.level_0 != new.level_1].drop(columns=['points_x', 'points_y'])
    new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x), axis=1)
    return new[['colours', 'sum']]

def make_df(N):
    df = pd.DataFrame({'colour': np.arange(N), 
                       'points': np.random.randint(10, size=N)})
    df['colour'] = df['colour'].astype(str)
    return df
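
Since the exact timing commands are not preserved, here is a rough sketch of how timings of this kind could be reproduced with the standard-library timeit module. The helper outer_difference is a hypothetical stand-in written for this example (it follows the idea of alt2), and N = 20 is an assumption taken from the "20 input colours" in the question. Absolute numbers will depend on the machine and on the library versions.

```python
import timeit

import numpy as np
import pandas as pd


def make_df(n):
    """Build a test frame with n distinct colour labels and random points."""
    return pd.DataFrame({'colour': np.arange(n).astype(str),
                         'points': np.random.randint(10, size=n)})


def outer_difference(df):
    """Hypothetical stand-in for the function under test (alt2-style)."""
    pts = df['points'].to_numpy()
    names = df['colour'].to_numpy()
    diff = np.subtract.outer(pts, pts)        # diff[i, j] = pts[i] - pts[j]
    keep = ~np.eye(len(df), dtype=bool)       # drop the i == j pairs
    main, secondary = np.nonzero(keep)        # row-major, same order as diff[keep]
    labels = [f'{names[i]}_{names[j]}' for i, j in zip(main, secondary)]
    return pd.DataFrame({'color': labels, 'points': diff[keep]})


df = make_df(20)
loops = 100
seconds = timeit.timeit(lambda: outer_difference(df), number=loops)
ms_per_loop = 1000 * seconds / loops
print(f'outer_difference: {ms_per_loop:.3f} ms per loop')
```

Swapping another function in for outer_difference in the lambda times that function on the same input.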

The main idea in alt2 is to use np.add.outer to build an addition table out of df['points']:

In [149]: points = np.add.outer(df['points'].to_numpy(), -df['points'].to_numpy())
In [151]: points
Out[151]: 
array([[  0,  -9,   4],
       [  9,   0,  13],
       [ -4, -13,   0]])
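
As a quick check of the table above: entry [i, j] of np.add.outer(p, -p) is p[i] - p[j], so the same array can also be obtained with np.subtract.outer or with broadcasting. The sketch below uses a plain NumPy array holding the three example values. As far as I know, recent pandas versions reject a Series passed to a ufunc's outer method, which is why an array is used here.

```python
import numpy as np

p = np.array([1, 10, -3])           # red, yellow, black

table = np.add.outer(p, -p)         # table[i, j] = p[i] + (-p[j])
same_1 = np.subtract.outer(p, p)    # the same differences, computed directly
same_2 = p[:, None] - p[None, :]    # broadcasting form of the same table

print(table)
print((table == same_1).all(), (table == same_2).all())
```

Row i is the main colour and column j the secondary one, so the diagonal holds the colour_colour pairs that are filtered out later.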

ravel is used to make the array one-dimensional:

In [152]: points.ravel()
Out[152]: array([  0,  -9,   4,   9,   0,  13,  -4, -13,   0])

pd.MultiIndex.from_product generates the colour combinations:

In [153]: color = pd.MultiIndex.from_product([df['colour'], df['colour']])
In [155]: color = color.map('_'.join)
In [156]: color
Out[156]: 
Index(['red_red', 'red_yellow', 'red_black', 'yellow_red', 'yellow_yellow',
       'yellow_black', 'black_red', 'black_yellow', 'black_black'],
      dtype='object')
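
The flattened points and the generated labels line up because ravel walks the table row by row while from_product varies its last level fastest. A small sketch of that correspondence, using the three example values (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

colours = ['red', 'yellow', 'black']
p = np.array([1, 10, -3])
n = len(p)

flat = np.add.outer(p, -p).ravel()                      # row-major flattening
pairs = pd.MultiIndex.from_product([colours, colours])  # last level varies fastest

# Position k in both sequences refers to the pair (k // n, k % n).
aligned = all(
    pairs[k] == (colours[k // n], colours[k % n])
    and flat[k] == p[k // n] - p[k % n]
    for k in range(n * n)
)
print(aligned)
```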

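Generate a mask to remove the duplicates. It has to be computed from the MultiIndex, before the levels are joined into strings (codes was called labels before pandas 0.24):

mask = color.codes[0] != color.codes[1]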

Then build result from these pieces:

 result = pd.DataFrame({'points':points.ravel(), 'color':color})
 result = result.loc[mask]
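
Putting the pieces together on the three-colour frame from the question gives a short end-to-end check. This is a sketch that assumes a pandas version where MultiIndex exposes codes, and it converts the Series to a NumPy array before calling np.add.outer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colour': ['red', 'yellow', 'black'],
                   'points': [1, 10, -3]})

pts = df['points'].to_numpy()
points = np.add.outer(pts, -pts)                  # main minus secondary
color = pd.MultiIndex.from_product([df['colour'], df['colour']])
mask = color.codes[0] != color.codes[1]           # True where the two colours differ

result = pd.DataFrame({'color': color.map('_'.join),
                       'points': points.ravel()})
result = result.loc[mask].reset_index(drop=True)
print(result)
```

The six remaining rows carry the values listed in the question's desired output.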

The idea behind alt is explained in my original answer.
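
alt builds every pairing by merging the frame with itself on a constant column. On pandas 1.2 or later, as far as I know, the same cross join can be requested directly with how='cross'. The following is a sketch of that variant, not the code that was benchmarked above.

```python
import pandas as pd

df = pd.DataFrame({'colour': ['red', 'yellow', 'black'],
                   'points': [1, 10, -3]})

# Cross join: every row paired with every row (columns get _x / _y suffixes).
pairs = df.merge(df, how='cross')
pairs = pairs[pairs['colour_x'] != pairs['colour_y']]   # drop red_red and friends

result = pd.DataFrame({
    'color': pairs['colour_x'] + '_' + pairs['colour_y'],
    'points': pairs['points_x'] - pairs['points_y'],    # main minus secondary
}).reset_index(drop=True)
print(result)
```

This avoids the helper column entirely, at the cost of requiring a newer pandas.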

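import itertools

# Map each colour to its points.
d = df.set_index('colour').to_dict()['points']
# All unordered pairs, followed by the same pairs reversed.
s = pd.Series(list(itertools.combinations(df.colour, 2)))
s = pd.concat([s, s.transform(lambda k: k[::-1])])
# Main colour's points minus the secondary colour's points.
v = s.map(lambda k: d[k[0]] - d[k[1]])
df2 = pd.DataFrame({'comb': s.str.get(0)+'_' + s.str.get(1), 'values': v})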

    comb             values
0   red_yellow       -9
1   red_black        4
2   yellow_black     13
0   yellow_red       9
1   black_red        -4
2   black_yellow    -13
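
The combinations-plus-reversal step above produces every ordered pair of distinct colours. itertools.permutations(..., 2) yields those ordered pairs directly, which allows a slightly shorter variant. This is a sketch of an alternative rather than the answer's own code, and it assumes the colour names are unique.

```python
import itertools

import pandas as pd

df = pd.DataFrame({'colour': ['red', 'yellow', 'black'],
                   'points': [1, 10, -3]})

lookup = dict(zip(df['colour'], df['points']))    # colour -> points

# One row per ordered pair (main, secondary) of distinct colours.
rows = [(f'{main}_{secondary}', lookup[main] - lookup[secondary])
        for main, secondary in itertools.permutations(df['colour'], 2)]
df2 = pd.DataFrame(rows, columns=['comb', 'values'])
print(df2)
```

The rows come out grouped by main colour, and the index is a plain 0..5 range instead of the repeated 0, 1, 2 produced by the concat.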

This is a bit verbose, but it produces the desired output:

new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
       .reset_index()
       .drop(columns=['colour', 'points'])
       .merge(df.set_index('colour'), left_on='level_0', right_index=True)
       .merge(df.set_index('colour'), left_on='level_1', right_index=True))

# points_x belongs to the main colour (level_0) and points_y to the
# secondary colour (level_1), so it is the secondary value that is negated.
new['points_y'] *= -1

new['sum'] = new['points_x'] + new['points_y']

new = new[new.level_0 != new.level_1].drop(columns=['points_x', 'points_y'])

new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x), axis=1)


>>> new
  level_0 level_1  sum       colours
3  yellow     red    9    yellow_red
6   black     red   -4     black_red
1     red  yellow   -9    red_yellow
7   black  yellow  -13  black_yellow
2     red   black    4     red_black
5  yellow   black   13  yellow_black
