Separating dots and text from numbers in a Python DataFrame

Posted 2024-06-01 11:59:51


I want to separate the dots and text from the numbers in a DataFrame column.

The DataFrame looks like this:

Net.Liq.37584957
Haircut48216354
Deficit10631397
             NaN
Haircutperassetclass
Equity31349682
Commodity12461964
FixedIncome663451
Currency3741257

I tried df.col.str.extract("([a-zA-Z]+)([^a-zA-Z]+)", expand=True), but the first row contains both dots and text, so the result looks like this:

             0         1
0          Net         .
1      Haircut  48216354
2      Deficit  10631397
3          NaN       NaN
4          NaN       NaN
5       Equity  31349682
6    Commodity  12461964
7  FixedIncome    663451
8     Currency   3741257
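
The result for row 0 can be reproduced outside pandas with the standard re module. This is a minimal sketch that assumes str.extract keeps the first match of the pattern, which agrees with the table above; the variable names are illustrative only.

```python
import re

# The pattern from the attempt above, applied to the first sample row.
pattern = re.compile(r"([a-zA-Z]+)([^a-zA-Z]+)")

match = pattern.search("Net.Liq.37584957")
# Group 1 stops at the first non-letter. Group 2 then takes only the "."
# because the next character, "L", is a letter again.
first_row = match.groups()
print(first_row)

# A row without a dot splits as intended.
plain_row = pattern.search("Haircut48216354").groups()
print(plain_row)
```

The letter class has to allow dots, or the dot has to be handled separately, for the first row to split at the digits.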

How can I fix this?


3 answers

It looks like you need the pattern ([a-zA-Z.]+)(\d+)?

Ex:

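import pandas as pd

# Sample data from the question ('NaN' here is a literal string)
df = pd.DataFrame({"Col": ['Net.Liq.37584957', 'Haircut48216354', 'Deficit10631397', 'NaN', 'Haircutperassetclass', 'Equity31349682', 'Commodity12461964', 'FixedIncome663451', 'Currency3741257']})
# Group 1: letters and dots; group 2: optional trailing digits
df[['A', "B"]] = df['Col'].str.extract(r"([a-zA-Z.]+)(\d+)?", expand=True)
print(df)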

Output:

                    Col                     A         B
0      Net.Liq.37584957              Net.Liq.  37584957
1       Haircut48216354               Haircut  48216354
2       Deficit10631397               Deficit  10631397
3                   NaN                   NaN       NaN
4  Haircutperassetclass  Haircutperassetclass       NaN
5        Equity31349682                Equity  31349682
6     Commodity12461964             Commodity  12461964
7     FixedIncome663451           FixedIncome    663451
8       Currency3741257              Currency   3741257

Assuming the column of interest in the source DataFrame is named Txt, run:

df.Txt.str.extract(r'(?P<Letters>[a-z.]*)(?P<Digits>\d*)', flags=re.I)

(import re is required.)

The result for your data sample is:

                Letters    Digits
0              Net.Liq.  37584957
1               Haircut  48216354
2               Deficit  10631397
3                   NaN       NaN
4  Haircutperassetclass          
5                Equity  31349682
6             Commodity  12461964
7           FixedIncome    663451
8              Currency   3741257

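Note: the first column is named Letters, but you wrote that you want to separate dots and text (that is, letters) from digits, so this column actually contains both letters and dots.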
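
If the trailing dot is not wanted in the text part, or the empty string in Digits should be a missing value, a small follow-up step can handle both. The sketch below is an assumption about what a reader might want and is not part of the answer above; the name parts and the shortened sample are illustrative only.

```python
import re
import pandas as pd

# Shortened sample from the question; the column name "Txt" follows the answer above.
df = pd.DataFrame({"Txt": ["Net.Liq.37584957", "Haircut48216354", "Haircutperassetclass"]})

parts = df["Txt"].str.extract(r"(?P<Letters>[a-z.]*)(?P<Digits>\d*)", flags=re.I)

# Drop a trailing dot from the text part, so "Net.Liq." becomes "Net.Liq".
parts["Letters"] = parts["Letters"].str.rstrip(".")

# \d* may match nothing, which yields an empty string; mask it as missing.
parts["Digits"] = parts["Digits"].mask(parts["Digits"] == "")
print(parts)
```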

You can use

^(.*?)(?:\.?(\d+))?$


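Details

  • ^ - start of the string
  • (.*?) - Group 1: any zero or more characters other than line breaks, as few as possible
  • (?:\.?(\d+))? - an optional sequence of:
    • \.? - an optional dot
    • (\d+) - Group 2: one or more digits
  • $ - end of the string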
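
The group behavior described in the list above can be checked with the standard re module before applying the pattern to a DataFrame. This is a minimal sketch on three of the sample strings; the names samples and results are illustrative only.

```python
import re

# Lazy group 1, then an optional dot-plus-digits tail anchored at the end.
pattern = re.compile(r"^(.*?)(?:\.?(\d+))?$")

samples = ["Net.Liq.37584957", "Haircut48216354", "Haircutperassetclass"]
results = {s: pattern.search(s).groups() for s in samples}

for s, groups in results.items():
    print(s, "->", groups)
```

Because the optional dot sits outside Group 1, the text part of the first row comes out as Net.Liq, without the trailing dot, and a row with no digits leaves Group 2 unmatched (None).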

In code:

df[['A', 'B']] = df['Col'].str.extract(r'(.*?)(?:\.?(\d+))?$', expand=True)

Output:

>>> df
                    Col                     A         B
0      Net.Liq.37584957               Net.Liq  37584957
1       Haircut48216354               Haircut  48216354
2       Deficit10631397               Deficit  10631397
3                   NaN                   NaN       NaN
4  Haircutperassetclass  Haircutperassetclass       NaN
5        Equity31349682                Equity  31349682
6     Commodity12461964             Commodity  12461964
7     FixedIncome663451           FixedIncome    663451
8       Currency3741257              Currency   3741257
