Count string occurrences across multiple columns for each row

Posted 2024-05-17 11:14:50


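I want to count how often certain strings occur across multiple columns and return the totals in new columns.

I know I can use value_counts to count the total occurrences of each value in a given column: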

data['col'].value_counts(dropna=False)

Result:

[["win" TKO technical knockout]     336
[["win" UD unanimous decision]      307
[["win" KO knockout]                225
[["loss" UD unanimous decision]      97
[["loss" TKO technical knockout]     64
[["win" nan null]                    53
[["draw" MD majority decision]       43
[["loss" KO knockout]                41
[["loss" MD majority decision]       35
[["loss" nan null]                   32
[["loss" SD split decision]          29
[["unknown" nan null]                29
[["win" SD split decision]           27
[["draw" PTS null]                   18
[["win" RTD corner retirement]       17
[["draw" SD split decision]          12
[["loss" RTD corner retirement]      11
[["win" MD majority decision]         9
[["loss" DQ disqualification]         6
[["win" PTS null]                     6
[["unknown" NC null]                  3

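The problem is that I want to count the occurrences of values such as [["win" KO knockout] across all of the relevant columns (col1 through col20), row by row.

Here is a sample of my data: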

{'col1': {0: ['["win" UD unanimous decision'],
  1: ['["win" UD unanimous decision'],
  2: ['["win" TKO technical knockout'],
  3: ['["win" UD unanimous decision'],
  4: ['["win" UD unanimous decision']},
 'col2': {0: ['["win" TKO technical knockout'],
  1: ['["win" TKO technical knockout'],
  2: ['["win" TKO technical knockout'],
  3: ['["win" UD unanimous decision'],
  4: ['["win" UD unanimous decision']},
 'col3': {0: ['["win" TKO technical knockout'],
  1: ['["win" KO knockout'],
  2: ['["win" TKO technical knockout'],
  3: ['["win" TKO technical knockout'],
  4: ['["win" UD unanimous decision']},
 'col4': {0: ['["win" UD unanimous decision'],
  1: ['["win" UD unanimous decision'],
  2: ['["win" KO knockout'],
  3: ['["win" TKO technical knockout'],
  4: ['["win" UD unanimous decision']}}

For this sample, the desired output is:

      win UD   win TKO   win KO 
0       2         2         0
1       2         1         1
2       0         3         1
3       2         2         0
4       4         0         0

Update:

I also tried using groupby with size:

#list of column names
col_outcome = ['col'+str(i) for i in range(1,11)]
data.groupby(col_outcome).size()

However, this raises the following error:

TypeError: unhashable type: 'list'
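
For context, this error is what Python raises when a list is used as a hash key: each cell in the sample data is a one-element list, and lists are unhashable, so they cannot serve as groupby keys. The sketch below uses a small made-up two-column frame (the name `flat` and the shortened data are assumptions for illustration) to show that pulling the string out of each list removes the error. Note that grouping on all columns counts identical rows, not per-row occurrences, so it would not produce the desired table even once the error is gone.

```python
import pandas as pd

# Hypothetical, shortened frame: every cell is a one-element list, as in the question.
data = pd.DataFrame({
    'col1': [['["win" UD unanimous decision'], ['["win" KO knockout']],
    'col2': [['["win" UD unanimous decision'], ['["win" TKO technical knockout']],
})

# A list cannot be hashed, which is what groupby needs for its keys.
try:
    hash(data.loc[0, 'col1'])
    hashable = True
except TypeError:
    hashable = False

# Replace each one-element list with the string it contains.
flat = data.apply(lambda col: col.str[0])

# Grouping now works, but it counts combinations of column values (identical rows).
sizes = flat.groupby(list(flat.columns)).size()
print(hashable)
print(sizes)
```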


1 Answer
Anonymous user
#1 · Posted 2024-05-17 11:14:50

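IIUC, use stack to reshape the "wide" DataFrame to "long", clean up the strings with str.replace, pull out the outcome label with a regex via str.extract, then groupby and apply value_counts, and finally unstack to reshape the result: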

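import pandas as pd

# regex=True is required on recent pandas, where str.replace defaults to literal matching
df.stack().str[0].str.replace(r'\[|"', '', regex=True)\
  .str.extract(r'(\w+\s\w+)')\
  .groupby(level=0)[0].apply(pd.Series.value_counts).unstack(fill_value=0)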

Output:

   win KO  win TKO  win UD
0       0        2       2
1       1        1       2
2       1        3       0
3       0        2       2
4       0        0       4
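
As a self-contained check, the following sketch rebuilds the question's sample data and applies the same stack, clean, extract, count, unstack steps. The helper names (`UD`, `TKO`, `KO`, `labels`, `counts`) and the `outcome` series name are introduced here for illustration only, and `groupby(...).value_counts()` is swapped in for `apply(pd.Series.value_counts)` as an equivalent way to count per row.

```python
import pandas as pd

# Sample data from the question: every cell is a one-element list.
UD = ['["win" UD unanimous decision']
TKO = ['["win" TKO technical knockout']
KO = ['["win" KO knockout']
df = pd.DataFrame({
    'col1': [UD, UD, TKO, UD, UD],
    'col2': [TKO, TKO, TKO, UD, UD],
    'col3': [TKO, KO, TKO, TKO, UD],
    'col4': [UD, UD, KO, TKO, UD],
})

# Wide -> long, then take the string out of each list.
long_form = df.stack().str[0]

# Strip '[' and '"', then keep the first two words, e.g. 'win UD'.
labels = (long_form.str.replace(r'\[|"', '', regex=True)
                   .str.extract(r'(\w+\s\w+)')[0]
                   .rename('outcome'))

# Count labels per original row (index level 0), then pivot labels into columns.
counts = labels.groupby(level=0).value_counts().unstack(fill_value=0)
print(counts)
```

Outcomes that never appear in a given row are filled with 0 by `unstack(fill_value=0)`, which is why every row has a value in every column.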
