Pandas: extract sub-levels from column values and fill the sub-level values into other columns

Published 2024-09-22 20:30:09


Consider this DataFrame:

import numpy as np
import pandas as pd

vals = [['1.00.00.00', 'Total Assets', 1000], ['1.01.00.00', 'Cash', 200], ['1.02.00.00', 'Inventory', 800], ['1.02.01.00', 'Goods in process', 300], ['1.02.02.00', 'Goods for sale', 500], ['2.00.00.00', 'Liabilities', 750], ['2.01.00.00', 'Commercial Liabilities', 700], ['2.02.00.00', 'Other liabilities', 50], ['3.00.00.00', 'Net equity', 250]]

cols = ['account', 'name', 'balance']

df = pd.DataFrame(vals, columns=cols)

df


So the balance of each higher-level account is the sum of the accounts one level below it.

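I want to create columns holding the name of the account group at each level, a column with the level of the account, and a column indicating whether the account is the final level of its group. For clarity, this is the output I am after: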

[image: table of the desired output]

I tried the following, but I got stuck at the point where I need to somehow read values from previous rows. One of the steps also emits a FutureWarning (FutureWarning: Columnar iteration over characters will be deprecated in future releases.):

df['sub_account_1'], df['sub_account_2'], df['sub_account_3'], df['sub_account_4']\
= df['account'].str.split('.').str

df['clasif_1'], df['clasif_2'], df['clasif_3'], df['clasif_4'] = ("", "", "", "")

def account_level(df):
    if df['sub_account_2'] == '00':
        return '2'
    elif df['sub_account_3'] == '00':
        return '3'
    elif df['sub_account_4'] == '00':
        return '4'
    else:
        return np.nan

df['nivel_cuenta'] = df.apply(account_level, axis = 1)

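I stopped there. I believe there must be a smarter way to solve this, perhaps a well-known one, since this situation (accounts and sub-account levels) is not unusual.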

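The real-world case has up to 8 sub-levels, but the account codes follow the same structure. I will use the new columns to check the subtotals given in the original dataset and then run some analysis.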

Can someone point me in the right direction? (I would rather not go back to Excel to solve this...)
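The FutureWarning quoted in the question comes from iterating over the `.str` accessor during tuple unpacking. A minimal sketch of one way around it, assuming the codes always have the same number of dot-separated segments and that `00` segments only appear at the end of a code: `str.split` with `expand=True` returns one column per segment, and the level can be counted without `apply`. The three-row frame is a shortened copy of the question's data, used only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [["1.00.00.00", "Total Assets", 1000],
     ["1.02.00.00", "Inventory", 800],
     ["1.02.01.00", "Goods in process", 300]],
    columns=["account", "name", "balance"],
)

# expand=True yields one column per code segment, so the .str accessor
# never has to be unpacked (which is what triggers the FutureWarning).
parts = df["account"].str.split(".", expand=True)
parts.columns = [f"sub_account_{i + 1}" for i in range(parts.shape[1])]

# Level = number of segments that are not "00"
# (assumption: "00" segments only trail, as in the question's codes).
level = parts.ne("00").sum(axis=1)

df = df.join(parts).assign(level=level)
print(df)
```

Because the column names are generated from `parts.shape[1]`, the same lines should carry over to codes with 8 segments without changes.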


2 Answers

You can loop over the DataFrame rows and use the condition sub_account_x == '00' to fill in the required columns.

# initialize the sub_account columns and clasif columns
sub_account_cols = ['sub_account_1', 'sub_account_2', 'sub_account_3', 'sub_account_4']
clasif_cols = ['clasif_1', 'clasif_2', 'clasif_3', 'clasif_4']

# assign default values for new columns 
df['sub_account_1'] = df['sub_account_2'] = df['sub_account_3'] = df['sub_account_4'] = ''
# None gives object-dtype columns, so account names can be assigned to them later
df['clasif_1'] = df['clasif_2'] = df['clasif_3'] = df['clasif_4'] = None
df['level'] = 0


# iterate over the dataframe rows
for idx, row in df.iterrows():
    # check which sub_account part is 00 and assign that as the level and the clasif
    sub_accounts = row['account'].split('.')
    # the sub_account_x columns are only needed for the groupby below, which
    # fills the parent clasif columns; skip this assignment if you don't want that
    df.loc[idx, sub_account_cols] = sub_accounts
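    for i, sub_account in enumerate(sub_accounts):
        if sub_account=='00':
            df.loc[idx, 'level'] = i
            df.loc[idx, 'clasif_' + str(i)] = row['name']
            break
    else:
        # no '00' segment at all: the account sits at the deepest level
        df.loc[idx, 'level'] = len(sub_accounts)
        df.loc[idx, 'clasif_' + str(len(sub_accounts))] = row['name']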

# fill clasif for parent rows
for i, clasif_col in enumerate(clasif_cols):
    # GroupBy.ffill replaces the deprecated fillna(method='ffill') and keeps the original index
    df[clasif_col] = df.groupby('sub_account_'+str(i+1))[clasif_col].ffill().fillna('')

# Fill last_level = True if the level of current row is greater than level of next row
df['last_level'] = np.where(df['level']>=df['level'].shift(-1), True, False)
# The last row is always the last level
df.loc[len(df)-1, 'last_level'] = True

# drop unwanted columns
df.drop(sub_account_cols, axis=1, inplace=True)
print(df)

Result:

      account                    name  balance      clasif_1                 clasif_2          clasif_3 clasif_4  level  last_level  
0  1.00.00.00            Total Assets     1000  Total Assets                                                          1       False  
1  1.01.00.00                    Cash      200  Total Assets                     Cash                                 2        True  
2  1.02.00.00               Inventory      800  Total Assets                Inventory                                 2       False  
3  1.02.01.00        Goods in process      300  Total Assets                Inventory  Goods in process               3        True  
4  1.02.02.00          Goods for sale      500  Total Assets                Inventory    Goods for sale               3        True  
5  2.00.00.00             Liabilities      750   Liabilities                                                          1       False  
6  2.01.00.00  Commercial Liabilities      700   Liabilities   Commercial Liabilities                                 2        True  
7  2.02.00.00       Other liabilities       50   Liabilities        Other liabilities                                 2        True  
8  3.00.00.00              Net equity      250    Net equity                                                          1        True  
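The `level` and `last_level` columns in the table above can also be computed without a row loop, and the result can then be used for the subtotal check the question mentions. The following is a sketch under the same assumptions as the question's data (rows sorted by account code, `00` segments only at the end); the `root` column and the `check` series are hypothetical helper names, not part of the answer above.

```python
import pandas as pd

vals = [['1.00.00.00', 'Total Assets', 1000], ['1.01.00.00', 'Cash', 200],
        ['1.02.00.00', 'Inventory', 800], ['1.02.01.00', 'Goods in process', 300],
        ['1.02.02.00', 'Goods for sale', 500], ['2.00.00.00', 'Liabilities', 750],
        ['2.01.00.00', 'Commercial Liabilities', 700],
        ['2.02.00.00', 'Other liabilities', 50], ['3.00.00.00', 'Net equity', 250]]
df = pd.DataFrame(vals, columns=['account', 'name', 'balance'])

parts = df['account'].str.split('.', expand=True)
df['level'] = parts.ne('00').sum(axis=1)

# A row is a leaf when the next row is not deeper than it.
# The final row has no successor, so its "next level" is filled with 0.
df['last_level'] = df['level'] >= df['level'].shift(-1, fill_value=0)

# Subtotal check: leaf balances under each top-level account
# should add up to that account's own balance.
df['root'] = parts[0]
leaf_sum = df.loc[df['last_level']].groupby('root')['balance'].sum()
top_balance = df.loc[df['level'] == 1].set_index('root')['balance']
check = leaf_sum.eq(top_balance)
print(check)
```

Every entry of `check` is True for the question's sample data, since 200 + 300 + 500 = 1000 and 700 + 50 = 750.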

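I was finally able to do what I needed with the code below. It mostly relies on plain Python loops, and I still feel there must be a more "pandas" way to do this. I tried several approaches based on other Stack Overflow questions, but could not get them to work for this case. I am leaving the code here; feedback and further ideas are very welcome.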

vals = [['1.00.00.00', 'Total Assets', 1800], ['1.01.00.00', 'Cash', 1000], ['1.02.00.00', 'Inventory', 900], ['1.02.01.00', 'Goods in process', 300], ['1.02.02.00', 'Goods for sale', 500], ['1.02.03.00', 'Goods other', 100], ['2.00.00.00', 'Liabilities', 900], ['2.01.00.00', 'Commercial Liabilities', 850], ['2.02.00.00', 'Other liabilities', 50], ['2.03.00.00', 'More liabilities', 20], ['2.03.01.00', 'One liability', 20], ['2.03.01.01', 'One-One liability', 20], ['2.03.01.02', 'One-Two liability', 20], ['2.03.02.00', 'Two liability', 20], ['2.04.00.00', 'One More liability', 20], ['3.00.00.00', 'Net equity', 250]]

cols = ['account', 'name', 'balance']

df = pd.DataFrame(vals, columns=cols)

level_cols = ['level_1', 'level_2', 'level_3', 'level_4',]
df[level_cols] = df['account'].str.split('.',expand=True)

df['account_level'] = ""
df['imputable'] = ""

clasif_cols = ['clasif_1', 'clasif_2', 'clasif_3', 'clasif_4']

for col in clasif_cols:
    df[col] = ""

for i in range(len(df)):
    
    row = df.loc[i]
    account_level = 0
    
    if i == 0:
        account_level += 1
        df.loc[i, 'clasif_1'] = row['name']
    
    else:
        for level in range(4):
            
            level_col = "level_"+str(level+1)
            clasif_col = "clasif_"+str(level+1)

            if row[level_col] == '00':
                break
            else:
                account_level += 1
                prev_row = df.loc[i-1]
                
                # If this row's code at this level matches the previous row's, repeat the previous classification.
                # Otherwise, this row starts a new group, so use its own name as the classification.
                if row[level_col] == prev_row[level_col]: 
                    df.loc[i, clasif_col] = prev_row[clasif_col]
                else: 
                    df.loc[i, clasif_col] = row['name']
                
    df.loc[i, 'account_level'] = account_level

for i in range(len(df)):
    row = df.loc[i]
    # If this is the last row, it is an imputable account for sure. Also needed it to avoid an error at last row.

    if i == (len(df)-1):
        df.loc[i, 'imputable'] = "YES"
    # else, if the level of this row is equal or higher than the level of the next row, this is an imputable account
    else:
        next_row = df.loc[i+1]
        if row['account_level'] >= next_row['account_level']:
            df.loc[i, 'imputable'] = "YES"

df.drop(level_cols, axis=1, inplace=True)

df

Output:

[image: table of the resulting DataFrame]
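As one possible "more pandas" variant of the loops above, the classification columns can be filled by looking up each row's ancestor code. This is a sketch, not the author's method, and it assumes that account codes are unique and that every parent account (for example `2.03.00.00` for `2.03.01.01`) is present in the data. The names `name_by_code`, `prefix` and `ancestor` are illustrative, the frame is a subset of the data above, and `imputable` is a boolean here rather than "YES"/"".

```python
import pandas as pd

vals = [['2.00.00.00', 'Liabilities', 900], ['2.03.00.00', 'More liabilities', 20],
        ['2.03.01.00', 'One liability', 20], ['2.03.01.01', 'One-One liability', 20],
        ['2.03.01.02', 'One-Two liability', 20], ['2.03.02.00', 'Two liability', 20],
        ['3.00.00.00', 'Net equity', 250]]
df = pd.DataFrame(vals, columns=['account', 'name', 'balance'])

parts = df['account'].str.split('.', expand=True)
n_levels = parts.shape[1]
level = parts.ne('00').sum(axis=1)

# Lookup table from account code to account name (codes assumed unique).
name_by_code = df.set_index('account')['name']

for k in range(1, n_levels + 1):
    # Ancestor code at depth k: first k segments, remaining ones padded with "00".
    prefix = parts.iloc[:, :k].agg('.'.join, axis=1)
    ancestor = prefix + '.00' * (n_levels - k)
    # Rows shallower than depth k have no group at that depth: leave them blank.
    df[f'clasif_{k}'] = ancestor.map(name_by_code).where(level >= k, '')

df['account_level'] = level
# Imputable (leaf) account: the next row is not deeper; the last row always qualifies.
df['imputable'] = level >= level.shift(-1, fill_value=0)
print(df)
```

Unlike the row-by-row comparison with the previous row, the lookup does not depend on the rows being sorted for the `clasif_` columns; only the `imputable` flag still relies on the row order.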
