Getting the substrings between "@" and ";" as well as before "@"

Posted 2024-09-30 03:24:16


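I have a pandas column Amort in which every row holds a string such as 3,312.50 @ Mar 31, 2020; 3,312.50 @ Jun 30, 2020; 3,312.50 @ Sep 30, 2020; 3,312.50 @ Dec 31, 2020; 3,312.50 @ Mar 31, 2021. I want to create one column per year containing the sum of the float values associated with that year. For the string above, the new column Amort_2020 would hold 3312.50*4. However, I have realized that Amort also contains values like 0.64 @ Mar 31, 2020; 0.64 @ Jun 30, 2020; 0.64 @ Sep 30, 2020; 0.63 @ Dec 31, 2020; 0.64 @ Mar 31, 2021; 238.75 @ Jul 31, 2021, so my initial code below (which I was hoping to build on) does not work. Is there a better way to do this? I considered using re but could not come up with a good approach.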

for i in range(0, df.shape[0]):
    if df['Amort'].iloc[i] is not None:
        l = []
        no_periods = (str(df['Amort'].iloc[i])).count('2020') ##for summation
        temp = (df['Amort'].iloc[i]).replace("@", "") 
        temp = temp.replace(",", "") ###so that I can convert to float
        for k in range(no_periods):
            l.append(float(temp[:8]))
        df['Amort_2020'].iloc[i] = sum(l)
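
The failure can be reproduced in isolation. The following is only a minimal sketch: the helper name first_eight and the two shortened sample strings are illustrative and not part of the original code. It applies the same cleanup and the same fixed 8-character slice to one string with a large amount and one with a small amount.

```python
good = "3,312.50 @ Mar 31, 2020; 3,312.50 @ Jun 30, 2020"
bad = "0.64 @ Mar 31, 2020; 0.64 @ Jun 30, 2020"


def first_eight(text):
    # Same cleanup as the loop above: drop '@' and ',', then keep 8 characters.
    return text.replace("@", "").replace(",", "")[:8]


print(repr(first_eight(good)))  # the amount followed by one space
print(repr(first_eight(bad)))   # the amount plus the start of the month name

good_value = float(first_eight(good))  # float() tolerates surrounding spaces

try:
    float(first_eight(bad))
    bad_parsed = True
except ValueError:
    # The slice ran past the amount into "Ma", so the conversion fails.
    bad_parsed = False

print(good_value, bad_parsed)
```

The slice only works when the amount happens to fill the first 7 or 8 characters, so the amount has to be located by its delimiter rather than by position.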

Edit:

Adding a sample of the df['Amort'] column:

0    3,312.50 @ Mar 31, 2020; 3,312.50 @ Jun 30, 20...
1    1,137.50 @ Jun 17, 2020; 1,137.50 @ Sep 17, 20...
2    394.51 @ Jun 07, 2020; 394.50 @ Sep 07, 2020; ...
3    395.72 @ Jun 07, 2020; 395.73 @ Sep 07, 2020; ...
4    448.86 @ Jun 07, 2020; 448.87 @ Sep 07, 2020; ...
Name: Amort, dtype: object

Expected output for 2020, df['Amort_2020']:

0    13250
1    3412.5
2    1183.53

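The same goes for every other year. Row 0 contains 3,312.50 @ Mar 31, 2020; 3,312.50 @ Jun 30, 2020; 3,312.50 @ Sep 30, 2020; 3,312.50 @ Dec 31, 2020; 3,312.50 @ Mar 31, 2021. Since I want to sum the float values associated with each year, and there are 4 such 3312.5 values for 2020, the result is 3312.5*4=13250. The float value in the first row (row 0) is multiplied by 4, and the float values in rows 1 and 2 are multiplied by 3, because 2020 occurs only 3 times in those rows.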
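
The arithmetic described above can be written out for a single string using only string methods. This is a minimal sketch, not the asker's code: the function name sum_by_year is hypothetical, and it assumes every entry has the form amount @ Mon DD, YYYY with a 4-digit year at the end.

```python
from collections import defaultdict


def sum_by_year(amort):
    """Sum the amounts in one 'amount @ Mon DD, YYYY; ...' string per year."""
    totals = defaultdict(float)
    for entry in amort.split(";"):
        entry = entry.strip()
        if not entry:  # skip the empty piece left by a trailing ';'
            continue
        amount, date = entry.split("@")
        year = date.strip()[-4:]  # assumption: the date ends in a 4-digit year
        totals[year] += float(amount.replace(",", ""))
    return dict(totals)


row0 = ("3,312.50 @ Mar 31, 2020; 3,312.50 @ Jun 30, 2020; "
        "3,312.50 @ Sep 30, 2020; 3,312.50 @ Dec 31, 2020; "
        "3,312.50 @ Mar 31, 2021")
result = sum_by_year(row0)
print(result)  # {'2020': 13250.0, '2021': 3312.5}
```

Splitting on ";" first and on "@" second means the width of the amount no longer matters, which is the case the fixed-width slice could not handle.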


2 Answers

Here is my take:

  1. Initialize df
>>> import pandas as pd
>>> df = pd.DataFrame({'Amort': {0: '3,312.50 @ Mar 31, 2020; 3,312.50 @ Jun 30, 2020; 3,312.50 @ Sep 30, 2020; 3,312.50 @ Dec 31, 2020; 3,312.50 @ Mar 31, 2021',
  1: '0.64 @ Mar 31, 2020; 0.64 @ Jun 30, 2020; 0.64 @ Sep 30, 2020; 0.63 @ Dec 31, 2020; 0.64 @ Mar 31, 2021; 238.75 @ Jul 31, 2021',
  2: '394.51 @ Jun 07, 2020; 394.50 @ Sep 07, 2020;'}})

>>> print(df)
                                               Amort
0  3,312.50 @ Mar 31, 2020; 3,312.50 @ Jun 30, 20...
1  0.64 @ Mar 31, 2020; 0.64 @ Jun 30, 2020; 0.64...
2      394.51 @ Jun 07, 2020; 394.50 @ Sep 07, 2020;
  2. Define how to parse one row:
import re 
from collections import defaultdict 
def parse_amort(amort): 
    records = defaultdict(list) 
    for record in amort.split(";"): 
        if record.strip(): 
            amount, _, year = [s.strip() for s in re.split(r"@|, ", record)] 
            records[year].append(float(amount.replace(",", "")))  
    return records 
  3. Aggregate:
>>> df.Amort.apply(parse_amort)  \
            .apply(pd.Series)  \
            .fillna(0)  \
            .applymap(lambda l: sum(l) if isinstance(l, list) else 0)  \
            .add_prefix("Amort_")

   Amort_2020  Amort_2021
0    13250.00     3312.50
1        2.55      239.39
2      789.01        0.00

IIUC, you can use extractall:

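# Raw string avoids invalid-escape warnings; (?:;|$) also matches a last entry that has no trailing ';'
s = df.Amort.str.extractall(r'(?P<Amort>[\d,.]+) @ (?P<date>[\w ,]+)(?:;|$)')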

s['date'] = pd.to_datetime(s['date'])
s['Amort'] = s['Amort'].str.replace(',','').astype(float)
s = s.reset_index('match',drop=True).set_index(s['date'].dt.year.rename('year'), append=True)

s.groupby(level=(0,1)).Amort.sum()

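Output (these sums equal the first two amounts of each truncated sample row shown in the question, not the full strings):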

   year
0  2020    6625.00
1  2020    2275.00
2  2020     789.01
3  2020     791.45
4  2020     897.73
Name: Amort, dtype: float64
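
Whether the pattern insists on a closing ";" decides whether the last entry of a string is picked up, because the sample strings do not end with ";". The following is a minimal sketch using re.findall from the standard library rather than pandas. The names strict and lenient and the shortened sample string are illustrative assumptions.

```python
import re

text = ("3,312.50 @ Mar 31, 2020; 3,312.50 @ Jun 30, 2020; "
        "3,312.50 @ Mar 31, 2021")

# Requires a ';' after every date, so the final entry is skipped.
strict = re.findall(r"([\d,.]+) @ ([\w ,]+);", text)

# Accepts either ';' or the end of the string after the date.
lenient = re.findall(r"([\d,.]+) @ ([\w ,]+)(?:;|$)", text)

print(len(strict), len(lenient))
print(lenient[-1])
```

Each match is an (amount, date) tuple. The amount still contains its thousands separator, so the comma has to be removed before converting to float, as in the answer above.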
