Creating new DataFrame columns from a single column with mixed values and types

Posted 2024-10-02 12:36:16


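I am trying to create new columns named after each fish species, with the integer counts as values, while keeping the index so that I can join the DataFrames afterwards.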

import pandas as pd
df = pd.read_csv("fishCounts.csv",index_col=0)
countsdf = df[["Fish Count"]].copy()
countsdf.head()
    
Fish Count
0   38 Sand Bass, 16 Sculpin, 10 Blacksmith
1   138 Sculpin, 28 Sand Bass
2   150 Sculpin Released, 102 Sculpin, 40 Sanddab
3   156 Sculpin, 29 Sand Bass, 5 Black Croaker, 3 ...
4   161 Sculpin

countsdf.columns = ["fish"]
countsdf.fish = countsdf.fish.str.split(", ", expand=False)
countsdf.head()

fish
0   [38 Sand Bass, 16 Sculpin, 10 Blacksmith]
1   [138 Sculpin, 28 Sand Bass]
2   [150 Sculpin Released, 102 Sculpin, 40 Sanddab]
3   [156 Sculpin, 29 Sand Bass, 5 Black Croaker, 3...
4   [161 Sculpin]

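This is where I don't know where to go next. Iterate over the DataFrame rows? Build a list of dicts? Could I import the data differently to make this simpler?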

Edit: this is the result I am aiming for

  Sand Bass   Sculpin   Blacksmith   Sculpin Released  Sanddab  Black Croaker
0        38        16           10
1        28        138
2                  102                            150       40
3        29        156                                                      5
4                  161

3 Answers

Something similar to @Manakin's answer.

Split Fish Count into a list of entries

df['Fish Count']=df['Fish Count'].str.split(',')

Explode so that each fish entry gets its own row, keeping its id

df2=df.explode('Fish Count')

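Create a dictionary. Here I use a dict comprehension: each Fish Count value is split on the whitespace that follows the number, which gives the key and the value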

{i:j for i,j in df2['Fish Count'].str.split(r'(?<=\d)\s')}

Result

{'38': 'Sand Bass',
 ' 16': 'Sculpin',
 ' 10': 'Blacksmith',
 '138': 'Sculpin',
 ' 28': 'Sand Bass',
 '150': 'Sculpin Released',
 ' 102': 'Sculpin',
 ' 40': 'Sanddab',
 '156': 'Sculpin',
 ' 29': 'Sand Bass',
 ' 5': 'Black Croaker',
 '161': 'Sculpin'}

If you want, you can print it as a DataFrame

print(pd.DataFrame.from_dict({i:j for i,j in df2['Fish Count'].str.split(r'(?<=\d)\s')}, orient='index'))

                     0
38           Sand Bass
 16            Sculpin
 10         Blacksmith
138            Sculpin
 28          Sand Bass
150   Sculpin Released
 102           Sculpin
 40            Sanddab
156            Sculpin
 29          Sand Bass
 5       Black Croaker
161            Sculpin
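
One caveat with a dict keyed by count: two entries with the same count would overwrite each other, and you can no longer tell which row an entry came from. Below is a minimal sketch of an alternative that keeps one row per entry and repeats the original row label. It assumes every entry has the form "<count> <species>"; the sample frame is a shortened copy of the question's data, and the names `entries` and `parts` are only illustrative.

```python
import pandas as pd

# Shortened sample of the question's data (assumption: "<count> <species>" entries)
df = pd.DataFrame({"Fish Count": [
    "38 Sand Bass, 16 Sculpin, 10 Blacksmith",
    "138 Sculpin, 28 Sand Bass",
    "161 Sculpin",
]})

# One row per entry; explode() repeats the original row label for each entry
entries = df["Fish Count"].str.split(", ").explode()

# Split only on the first space so multi-word species names stay intact
parts = entries.str.split(" ", n=1, expand=True)
parts.columns = ["count", "species"]
parts["count"] = parts["count"].astype(int)

print(parts)
```

Because the row labels survive the explode, the result can still be matched back to the original rows later.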

First you need to explode the lists you created; then you can use extract with a regex twice, once to match the number and once to match the text.

With the data

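import pandas as pd

data = '38 Sand Bass, 16 Sculpin, 10 Blacksmith\n138 Sculpin, 28 Sand Bass\n150 Sculpin Released, 102 Sculpin, 40 Sanddab\n156 Sculpin, 29 Sand Bass, 5 Black Croaker\n161 Sculpin'
df = pd.DataFrame(data.split('\n'), columns=['Fish Count'])

# Split each row into a list of "<count> <species>" entries
countsdf = df['Fish Count'].str.split(', ')
# Series.explode takes no column name; ignore_index=True renumbers the rows 0..n-1
countsdf = countsdf.explode(ignore_index=True).rename('fish').to_frame()
# expand=False returns a Series, so each result fits a single column
countsdf['count'] = countsdf.fish.str.extract(r'([0-9]+)', expand=False)
countsdf['species'] = countsdf.fish.str.extract(r'([a-zA-Z]+[ a-zA-Z]*)', expand=False)
countsdf.drop('fish', axis=1, inplace=True)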

Output

   count           species
0     38         Sand Bass
1     16           Sculpin
2     10        Blacksmith
3    138           Sculpin
4     28         Sand Bass
5    150  Sculpin Released
6    102           Sculpin
7     40           Sanddab
8    156           Sculpin
9     29         Sand Bass
10     5     Black Croaker
11   161           Sculpin
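
The long table above still has to be turned into the one-column-per-species layout the question asks for. Here is a rough sketch of one way to do that, assuming the original row labels are kept (plain `explode()` without renumbering) and that repeated species within a row should be summed. The sample data leaves out the truncated row, and the names `long` and `wide` are made up for the example.

```python
import pandas as pd

# Sample rows from the question (the truncated row is left out)
df = pd.DataFrame({"Fish Count": [
    "38 Sand Bass, 16 Sculpin, 10 Blacksmith",
    "138 Sculpin, 28 Sand Bass",
    "150 Sculpin Released, 102 Sculpin, 40 Sanddab",
    "161 Sculpin",
]})

# Long format: one row per entry, original row label repeated
long = df["Fish Count"].str.split(", ").explode().rename("fish").to_frame()
long["count"] = long["fish"].str.extract(r"(\d+)", expand=False).astype(int)
long["species"] = long["fish"].str.extract(r"([A-Za-z][A-Za-z ]*)", expand=False).str.strip()

# Wide format: one column per species, indexed by the original row label
wide = (
    long.reset_index()  # the unnamed index becomes a column called "index"
        .pivot_table(index="index", columns="species", values="count", aggfunc="sum")
        .rename_axis(index=None, columns=None)
        .astype("Int64")  # nullable integers, so missing species show as <NA>
)

print(wide)
```

Since `wide` shares its index with `df`, something like `df.join(wide)` should line the counts up with the original rows.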

IIUC, we can use str.split, str.extract and stack

s = df['Fish Count'].str.split(',', expand=True).stack()
s.str.extract(r'(\d+)(\D+)')

which yields:

       0                  1
0 0   38          Sand Bass
  1   16            Sculpin
  2   10         Blacksmith
1 0  138            Sculpin
  1   28          Sand Bass
2 0  150   Sculpin Released
  1  102            Sculpin
  2   40            Sanddab
3 0  156            Sculpin
  1   29          Sand Bass
  2    5      Black Croaker
  3    3                ...
4 0  161            Sculpin

From there, the format you want or need is up to you

s.str.extract(r'(\d+)(\D+)').groupby(level=[1]).agg(list)

                          0                                                  1
0  [38, 138, 150, 156, 161]  [ Sand Bass,  Sculpin,  Sculpin Released,  Scu...
1         [16, 28, 102, 29]       [ Sculpin,  Sand Bass,  Sculpin,  Sand Bass]
2               [10, 40, 5]            [ Blacksmith,  Sanddab,  Black Croaker]
3                       [3]                                             [ ...]

s.str.extract(r'(\d+)(\D+)').unstack(1)

     0                                 1                                  
     0    1    2    3                  0           1               2     3
0   38   16   10  NaN          Sand Bass     Sculpin      Blacksmith   NaN
1  138   28  NaN  NaN            Sculpin   Sand Bass             NaN   NaN
2  150  102   40  NaN   Sculpin Released     Sculpin         Sanddab   NaN
3  156   29    5    3            Sculpin   Sand Bass   Black Croaker   ...
4  161  NaN  NaN  NaN            Sculpin         NaN             NaN   NaN

s.str.extract(r'(\d+)(\D+)').values


array([['38', ' Sand Bass'],
       ['16', ' Sculpin'],
       ['10', ' Blacksmith'],
       ['138', ' Sculpin'],
       ['28', ' Sand Bass'],
       ['150', ' Sculpin Released'],
       ['102', ' Sculpin'],
       ['40', ' Sanddab'],
       ['156', ' Sculpin'],
       ['29', ' Sand Bass'],
       ['5', ' Black Croaker'],
       ['3', ' ...'],
       ['161', ' Sculpin']], dtype=object)

You can turn it into a dict

# Actually I'd use fish: num as the mapping (sorry, I closed my IDE).
# Keys in a dict must be unique, so repeated keys overwrite earlier ones.
{num: fish for num, fish in s.str.extract(r'(\d+)(\D+)').values}

{'38': ' Sand Bass',
 '16': ' Sculpin',
 '10': ' Blacksmith',
 '138': ' Sculpin',
 '28': ' Sand Bass',
 '150': ' Sculpin Released',
 '102': ' Sculpin',
 '40': ' Sanddab',
 '156': ' Sculpin',
 '29': ' Sand Bass',
 '5': ' Black Croaker',
 '3': ' ...',
 '161': ' Sculpin'}
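
Because of that uniqueness limit, a single dict for the whole column cannot hold every entry. One way around it, picking up the "list of dicts" idea from the question, is to build one dict per row keyed by species and hand the list to the DataFrame constructor. This is only a sketch: the helper `parse_counts` is hypothetical, it assumes every entry looks like "<count> <species>", and the non-default index labels are there only to show that the index is preserved.

```python
import pandas as pd

# Shortened sample of the question's data with made-up index labels
df = pd.DataFrame({"Fish Count": [
    "38 Sand Bass, 16 Sculpin, 10 Blacksmith",
    "138 Sculpin, 28 Sand Bass",
    "161 Sculpin",
]}, index=[10, 11, 12])


def parse_counts(text):
    """Turn '38 Sand Bass, 16 Sculpin' into {'Sand Bass': 38, 'Sculpin': 16}."""
    counts = {}
    for entry in text.split(","):
        number, species = entry.strip().split(" ", 1)
        # Sum repeated species within a row instead of overwriting them
        counts[species] = counts.get(species, 0) + int(number)
    return counts


# One dict per row; the constructor aligns the keys into columns
rows = [parse_counts(text) for text in df["Fish Count"]]
wide = pd.DataFrame(rows, index=df.index).astype("Int64")

print(wide)
```

Keying each row's dict by species sidesteps the duplicate-count problem, and passing `index=df.index` keeps the result ready for a later join.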
