Find the last column that matches a pattern in each row

3 Answers

User

#1 · edited 2024-10-05 10:41:55

You can do this with pandas.melt and groupby:

In [123]: molten = pd.melt(df, id_vars='name', var_name='last_referred')

In [124]: molten
Out[124]:
     name last_referred       value
0    bill      action_1    referred
1     bob      action_1  introduced
2    mary      action_1  introduced
3    june      action_1  introduced
4    dale      action_1    referred
5   donna      action_1  introduced
6    bill      action_2    referred
7     bob      action_2    referred
8    mary      action_2         NaN
9    june      action_2    referred
10   dale      action_2         NaN
11  donna      action_2         NaN
12   bill      action_3         NaN
13    bob      action_3    referred
14   mary      action_3         NaN
15   june      action_3         NaN
16   dale      action_3         NaN
17  donna      action_3         NaN

In [125]: gb = molten.groupby('name')

In [126]: col = gb.apply(lambda x: x[x.value == 'referred'].tail(1)).last_referred

In [127]: col.index = col.index.droplevel(1)

In [128]: col
Out[128]:
name
bill    action_2
bob     action_3
dale    action_1
june    action_2
Name: last_referred, dtype: object

In [129]: newdf = df.join(col, on='name')

In [130]: newdf
Out[130]:
    name    action_1  action_2  action_3 last_referred
0   bill    referred  referred       NaN      action_2
1    bob  introduced  referred  referred      action_3
2   mary  introduced       NaN       NaN           NaN
3   june  introduced  referred       NaN      action_2
4   dale    referred       NaN       NaN      action_1
5  donna  introduced       NaN       NaN           NaN
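
For readers who want to run this end to end, here is a minimal self-contained sketch of the same melt-and-group idea. The frame is rebuilt from the outputs shown above, and the filter-then-`last()` step is a simplification swapped in for the `apply`/`tail(1)`/`droplevel` sequence; it is not the answer's exact code.

```python
import numpy as np
import pandas as pd

# Example frame reconstructed from the outputs shown in this thread.
df = pd.DataFrame({
    "name": ["bill", "bob", "mary", "june", "dale", "donna"],
    "action_1": ["referred", "introduced", "introduced",
                 "introduced", "referred", "introduced"],
    "action_2": ["referred", "referred", np.nan, "referred", np.nan, np.nan],
    "action_3": [np.nan, "referred", np.nan, np.nan, np.nan, np.nan],
})

# Long format: one row per (name, action column) pair, stacked in column order.
molten = pd.melt(df, id_vars="name", var_name="last_referred")

# Keep only the matching rows, then take the last column label per name.
last = (
    molten[molten["value"] == "referred"]
    .groupby("name")["last_referred"]
    .last()
)

# Names without any match are absent from `last` and come out as NaN.
newdf = df.join(last, on="name")
print(newdf)
```

This relies on melt stacking the action columns in their original order, so the last matching row within each name corresponds to the rightmost matching column.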

User

#2 · edited 2024-10-05 10:41:55

A vectorized approach: use arange to find the last index, then max, then concatenation:

df['last_referred'] = np.r_[[np.nan], df.columns][
        ((df == 'referred') * (np.arange(df.shape[1]) + 1)).max(axis=1).values]

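Explanation:

We want to find, in each row, the rightmost cell whose value is 'referred', that is, the last True in each row of the boolean frame df == 'referred'.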
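
To make the goal concrete, the following sketch rebuilds the example frame (values taken from the outputs shown elsewhere in this thread) and prints the boolean frame in question. The variable name `mask` is only illustrative.

```python
import numpy as np
import pandas as pd

# Example frame reconstructed from the outputs shown in this thread.
df = pd.DataFrame({
    "name": ["bill", "bob", "mary", "june", "dale", "donna"],
    "action_1": ["referred", "introduced", "introduced",
                 "introduced", "referred", "introduced"],
    "action_2": ["referred", "referred", np.nan, "referred", np.nan, np.nan],
    "action_3": [np.nan, "referred", np.nan, np.nan, np.nan, np.nan],
})

# True wherever a cell equals the pattern; NaN cells compare as False.
mask = df == "referred"
print(mask)

# Rows with no True at all (mary, donna) have no match to report.
print(mask.any(axis=1))
```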

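One option is an argmax-style lookup (such as idxmax along axis=1), but that returns the first (i.e. leftmost) occurrence. Suppose instead that we could replace the True values with their column indices; then a plain max would do. Since True is 1 and False is 0, we can achieve this by multiplying by the integer range [0, 1, 2, ...], broadcast down the rows: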

>>> np.arange(df.shape[1])
array([0, 1, 2, 3])
>>> (df == 'referred') * np.arange(df.shape[1])
   name  action_1  action_2  action_3
0     0         1         2         0
1     0         0         2         3
2     0         0         0         0
3     0         0         2         0
4     0         1         0         0
5     0         0         0         0
>>> ((df == 'referred') * np.arange(df.shape[1])).max(axis=1)
0    2
1    3
2    0
3    2
4    1
5    0
dtype: int32

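There is one problem, though: we cannot tell a 'referred' in the "name" column (index 0) apart from no occurrence at all. That is easy to fix; just start the integer range at 1: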

>>> ((df == 'referred') * (np.arange(df.shape[1]) + 1)).max(axis=1)
0    3
1    4
2    0
3    3
4    2
5    0
dtype: int32

Now just index the column names with this array:

>>> df.columns[((df == 'referred') * (np.arange(df.shape[1]) + 1)).max(axis=1).values]
IndexError: index 4 is out of bounds for size 4

Oops! We need 0 to come out as NaN and the rest of the columns to shift along by one. We can do that with np.r_, which concatenates arrays:

>>> np.r_[[np.nan], df.columns]
array([nan, 'name', 'action_1', 'action_2', 'action_3'], dtype=object)
>>> np.r_[[np.nan], df.columns][
        ((df == 'referred') * (np.arange(df.shape[1]) + 1)).max(axis=1).values]
array(['action_2', 'action_3', nan, 'action_2', 'action_1', nan], dtype=object)

And there it is.
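
The whole walkthrough can be condensed into a short runnable sketch. It assumes the example frame reconstructed from the outputs in this thread, and it builds the shifted label array with a plain object array instead of `np.r_`, which is a stand-in chosen here for explicitness rather than the answer's exact expression.

```python
import numpy as np
import pandas as pd

# Example frame reconstructed from the outputs shown in this thread.
df = pd.DataFrame({
    "name": ["bill", "bob", "mary", "june", "dale", "donna"],
    "action_1": ["referred", "introduced", "introduced",
                 "introduced", "referred", "introduced"],
    "action_2": ["referred", "referred", np.nan, "referred", np.nan, np.nan],
    "action_3": [np.nan, "referred", np.nan, np.nan, np.nan, np.nan],
})

# Boolean matrix of matches, one row per record.
mask = (df == "referred").to_numpy(dtype=bool)

# Multiply by 1-based column positions; the row maximum is the position of
# the rightmost match, and 0 means the row has no match.
pos = (mask * (np.arange(mask.shape[1]) + 1)).max(axis=1)

# Slot 0 holds NaN so that "no match" maps to NaN; labels are shifted by one.
labels = np.array([np.nan, *df.columns], dtype=object)

df["last_referred"] = labels[pos]
print(df)
```

Computing `mask`, `pos`, and `labels` before assigning the new column keeps the column count stable while the positions are being calculated.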

User

#3 · edited 2024-10-05 10:41:55

Just use apply along axis=1 and pass pattern to the function as an extra argument.

In [3]: def func(row, pattern):
            referrer = np.nan
            for key in row.index:
                if row[key] == pattern:
                    referrer = key
            return referrer
        df['last_referred'] = df.apply(func, pattern='referred', axis=1)
        df
Out[3]:     name    action_1  action_2  action_3 last_referred
        0   bill    referred  referred      None      action_2
        1    bob  introduced  referred  referred      action_3
        2   mary  introduced                               NaN
        3   june  introduced  referred                action_2
        4   dale    referred                          action_1
        5  donna  introduced                               NaN
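
A self-contained variant of the row-wise approach might look like the sketch below. The helper name `last_match` and the boolean-indexing shortcut are illustrative choices, not the answer's exact code, and the frame is rebuilt from the outputs shown in this thread.

```python
import numpy as np
import pandas as pd

# Example frame reconstructed from the outputs shown in this thread.
df = pd.DataFrame({
    "name": ["bill", "bob", "mary", "june", "dale", "donna"],
    "action_1": ["referred", "introduced", "introduced",
                 "introduced", "referred", "introduced"],
    "action_2": ["referred", "referred", np.nan, "referred", np.nan, np.nan],
    "action_3": [np.nan, "referred", np.nan, np.nan, np.nan, np.nan],
})


def last_match(row, pattern):
    """Return the label of the rightmost cell equal to `pattern`, else NaN."""
    # Labels of all matching cells, in left-to-right column order.
    hits = row.index[(row == pattern).to_numpy(dtype=bool)]
    return hits[-1] if len(hits) else np.nan


# Extra keyword arguments to apply are forwarded to the function.
df["last_referred"] = df.apply(last_match, pattern="referred", axis=1)
print(df)
```

Row-wise apply is usually slower than the vectorized approach on large frames, but it is easy to adapt when the matching rule is more involved than a simple equality test.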
