How to split a DataFrame column's values into multiple columns

2 answers

User

#1 · edited 2024-09-30 16:38:44

Starting from this DataFrame:

>>> import pandas as pd

>>> df = pd.DataFrame({'PLUGS\nDESIGN\nGEAR': ['700\nDaewoo 8000  Gearless', '300\nHyundai 4400  Gearless', '600\nSTX 2600  Gearless', '200\nB170 \nGeared', '362 Wenchong 1700 Mk II \nGeared', '252\nRichMax 1550  Gearless'], }, 
...                   index = [0, 1, 2, 3, 4, 5]) 
>>> df
    PLUGS\nDESIGN\nGEAR
0   700\nDaewoo 8000 Gearless
1   300\nHyundai 4400 Gearless
2   600\nSTX 2600 Gearless
3   200\nB170 \nGeared
4   362 Wenchong 1700 Mk II \nGeared
5   252\nRichMax 1550 Gearless

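The str.split method can split on several delimiters at once. Here the pattern '\n| ' is a regular expression meaning "a newline or a space"; the empty strings in the lists below come from two delimiters in a row:

>>> df = pd.DataFrame(df['PLUGS\nDESIGN\nGEAR'].str.split('\n| '))
>>> df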
    PLUGS\nDESIGN\nGEAR
0   [700, Daewoo, 8000, , Gearless]
1   [300, Hyundai, 4400, , Gearless]
2   [600, STX, 2600, , Gearless]
3   [200, B170, , Geared]
4   [362, Wenchong, 1700, Mk, II, , Geared]
5   [252, RichMax, 1550, , Gearless]
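As a small aside on the split step, the sketch below shows where the empty strings come from and one way to avoid them. It uses a two-row sample copied from the data above; the explicit `regex=True` argument is assumed to be available (it was added to `str.split` in pandas 1.4), and the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(['700\nDaewoo 8000  Gearless', '200\nB170 \nGeared'])

# One delimiter per split: two delimiters in a row leave an empty string behind.
parts = s.str.split(r'[\n ]', regex=True)
print(parts.tolist())

# Adding "+" treats a run of delimiters as a single separator.
collapsed = s.str.split(r'[\n ]+', regex=True)
print(collapsed.tolist())
```

With the second pattern, the DESIGN slice no longer ends in an empty element, so no stripping is needed after joining.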

The first and last elements can then be assigned to the PLUGS and GEAR columns, and the remaining elements to the DESIGN column:

>>> df['PLUGS'] = df['PLUGS\nDESIGN\nGEAR'].str[0]
>>> df['DESIGN'] = df['PLUGS\nDESIGN\nGEAR'].str[1:-1]
>>> df['GEAR'] = df['PLUGS\nDESIGN\nGEAR'].str[-1]
>>> df
    PLUGS\nDESIGN\nGEAR                         PLUGS   DESIGN                      GEAR
0   [700, Daewoo, 8000, , Gearless]             700     [Daewoo, 8000, ]            Gearless
1   [300, Hyundai, 4400, , Gearless]            300     [Hyundai, 4400, ]           Gearless
2   [600, STX, 2600, , Gearless]                600     [STX, 2600, ]               Gearless
3   [200, B170, , Geared]                       200     [B170, ]                    Geared
4   [362, Wenchong, 1700, Mk, II, , Geared]     362     [Wenchong, 1700, Mk, II, ]  Geared
5   [252, RichMax, 1550, , Gearless]            252     [RichMax, 1550, ]           Gearless

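The last step is to tidy up the DESIGN column by joining each list into a single string (stripping the space left by the trailing empty element), and then to drop the original PLUGS\nDESIGN\nGEAR column:

>>> df['DESIGN'] = df['DESIGN'].apply(lambda x: ' '.join(map(str, x))).str.strip()
>>> df = df.drop(columns=['PLUGS\nDESIGN\nGEAR'])  # drop returns a new DataFrame, so assign it
>>> df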
    PLUGS   DESIGN               GEAR
0   700     Daewoo 8000          Gearless
1   300     Hyundai 4400         Gearless
2   600     STX 2600             Gearless
3   200     B170                 Geared
4   362     Wenchong 1700 Mk II  Geared
5   252     RichMax 1550         Gearless
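The steps of this answer can be collected into one function. The following is only a sketch: the helper name `split_three_ways` is hypothetical, and it assumes every value has at least a leading PLUGS token and a trailing GEAR token separated by newlines or spaces, as in the sample data.

```python
import pandas as pd

def split_three_ways(col: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: first token -> PLUGS, last token -> GEAR, the rest -> DESIGN."""
    # Split on runs of newlines/spaces so that no empty tokens are produced.
    tokens = col.str.strip().str.split(r'[\n ]+', regex=True)
    return pd.DataFrame({
        'PLUGS': tokens.str[0],
        'DESIGN': tokens.str[1:-1].str.join(' '),  # join the middle tokens back together
        'GEAR': tokens.str[-1],
    })

raw = pd.Series(['700\nDaewoo 8000  Gearless', '362 Wenchong 1700 Mk II \nGeared'])
out = split_three_ways(raw)
print(out)
```

Note that PLUGS stays a string column here; converting it with `astype(int)` would be a separate step.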

User

#2 · edited 2024-09-30 16:38:44

As suggested in the comments, a regular expression should work well here.

Example DataFrame:

>>> df
                   PLUGS\nDESIGN\nGEAR
0        700\nDaewoo 8000  Gearless
1       300\nHyundai 4400  Gearless
2           600\nSTX 2600  Gearless
3                200\nB170 \nGeared
4  362 Wenchong 1700 Mk II \nGeared
5       252\nRichMax 1550  Gearless
6        220\nCV 1100 Plus \nGeared
7      232\nOrskov Mk VII  Gearless
8         119\nKouan 1000  Gearless
9            100\nHanjin 700  Gearless

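First, remove the newline marker from the column name so that it is easier to read and use. Note that the pattern r"\\n" matches a literal backslash followed by n; if the column name contains a real newline character, use "\n" instead:

>>> df.columns = df.columns.str.replace(r"\\n", " ", regex=True)

Now the column name no longer contains any special characters: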

>>> df
                     PLUGS DESIGN GEAR
0        700\nDaewoo 8000  Gearless
1       300\nHyundai 4400  Gearless
2           600\nSTX 2600  Gearless
3                200\nB170 \nGeared
4  362 Wenchong 1700 Mk II \nGeared
5       252\nRichMax 1550  Gearless
6        220\nCV 1100 Plus \nGeared
7      232\nOrskov Mk VII  Gearless
8         119\nKouan 1000  Gearless
9            100\nHanjin 700  Gearless
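Whether the column rename has any effect depends on what the name actually contains. The sketch below contrasts the two cases using two illustrative one-element indexes (not taken from the original post): one with real newline characters and one with the two-character sequence backslash plus n.

```python
import pandas as pd

real = pd.Index(['PLUGS\nDESIGN\nGEAR'])      # real newline characters
literal = pd.Index([r'PLUGS\nDESIGN\nGEAR'])  # a backslash followed by the letter n

# A plain (non-regex) replacement handles real newlines.
a = real.str.replace('\n', ' ', regex=False).tolist()

# The regex r'\\n' matches the literal backslash-n sequence ...
b = literal.str.replace(r'\\n', ' ', regex=True).tolist()

# ... but leaves a name with real newlines unchanged.
c = real.str.replace(r'\\n', ' ', regex=True).tolist()

print(a, b, c)
```

Checking `repr(df.columns[0])` first is a quick way to see which of the two cases applies.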

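Now we can use pandas.Series.str.extract. Each capture group () in the regular expression becomes a column in the result; named groups would supply the column names directly.

Because the groups here are unnamed, the columns get the default names 0, 1, 2, so we rename them to the names we want: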

>>> df = df['PLUGS DESIGN GEAR'].str.extract(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)").rename(columns={0: 'PLUGS', 1: 'DESIGN', 2: 'GEAR'})

Result:

>>> print(df)
  PLUGS                DESIGN      GEAR
0   700          Daewoo 8000   Gearless
1   300         Hyundai 4400   Gearless
2   600             STX 2600   Gearless
3   200                 B170     Geared
4   362  Wenchong 1700 Mk II     Geared
5   252         RichMax 1550   Gearless
6   220         CV 1100 Plus     Geared
7   232        Orskov Mk VII   Gearless
8   119           Kouan 1000   Gearless
9   100           Hanjin 700   Gearless
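The rename step can be skipped by naming the groups in the pattern itself. The sketch below is a simplified variant, not the pattern used in this answer: it assumes the separators are real whitespace (including real newlines) and that GEAR is always either Gearless or Geared, as in the sample rows.

```python
import pandas as pd

df = pd.DataFrame({'PLUGS DESIGN GEAR': [
    '700\nDaewoo 8000  Gearless',
    '200\nB170 \nGeared',
    '362 Wenchong 1700 Mk II \nGeared',
]})

# (?P<name>...) names a group; str.extract uses the names as column labels.
pattern = r'^(?P<PLUGS>\d+)\s+(?P<DESIGN>.+?)\s+(?P<GEAR>Gearless|Geared)$'
result = df['PLUGS DESIGN GEAR'].str.extract(pattern)
print(result)
```

Because DESIGN is matched lazily and followed by `\s+`, it comes out without trailing whitespace. Rows that do not match the pattern would yield NaN in all three columns.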

Regular expression explained:

You can try it out on regex101.com:

(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\|^Gear][a-z]+)

(The code above additionally anchors the pattern with ^ and includes \\n in the last character class.)

First capture group (\d+)

    \d matches a digit (equivalent to [0-9])
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)

Separator [\\n\s]+

    Match a single character present in the list below [\\n\s]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    n matches the character n literally (case sensitive)
    \s matches any whitespace character (equivalent to [\r\n\t\f\v ])

Second capture group ([^\\]+)

    Match a single character not present in the list below [^\\]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)

Separator [\\n\s]+

    Match a single character present in the list below [\\n\s]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    n matches the character n literally (case sensitive)
    \s matches any whitespace character (equivalent to [\r\n\t\f\v ])

Third capture group ([\|^Gear][a-z]+)

    Match a single character present in the list below [\|^Gear]
    \| matches the character | literally (case sensitive)
    ^Gear matches a single character in the list ^Gear (case sensitive); inside a character class this means one of the characters ^, G, e, a, r, not the word "Gear"
    Match a single character present in the list below [a-z]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)

Global pattern flags (regex101 settings; the str.extract call above passes no flags)

    g modifier: global. All matches (don't return after first match)
    m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
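To make the behaviour of the third group concrete, the sketch below tests it in isolation with Python's re module. The test words other than Gearless and Geared are made up for illustration, and the stricter pattern at the end is a suggested alternative, not part of the original answer.

```python
import re

# The third group as explained above: one character from the set {|, ^, G, e, a, r},
# followed by one or more lowercase letters.
third_group = re.compile(r'[\|^Gear][a-z]+')

gearless_ok = bool(third_group.fullmatch('Gearless'))  # 'G' is in the set
geared_ok = bool(third_group.fullmatch('Geared'))
eagle_ok = bool(third_group.fullmatch('eagle'))        # also matches: 'e' is in the set
motor_ok = bool(third_group.fullmatch('Motor'))        # 'M' is not in the set

# A stricter alternative that only accepts the two values seen in the data.
strict = re.compile(r'Gear(?:less|ed)')
strict_eagle = bool(strict.fullmatch('eagle'))

print(gearless_ok, geared_ok, eagle_ok, motor_ok, strict_eagle)
```

The character class works on the sample data because every GEAR value happens to start with G, but it would also accept unrelated lowercase words, which is worth keeping in mind for messier input.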
