Aggregating quantities with pandas based on a string and an additional column

Posted 2024-10-02 06:26:10


I have a data set that looks like this:

Month, Year, Quantity Sold, Product Name
11, 2017, 13, "Creatine Powder Supplement - 500g"
11, 2017, 10, "Gummies 1 bag"
11, 2017, 12, "Creatine Powder Supplement - 1000g"
11, 2017, 15, "Creatine Powder Supplement - 1500g"
11, 2017, 11, "Glucosamine - 500g"
11, 2017, 23, "Glucosamine - 1500g"
12, 2017, 17, "Creatine Powder Supplement - 1000g"
12, 2017, 24, "Glucosamine - 500g"
12, 2017, 13, "Glucosamine - 1500g"
1, 2018, 16, "Creatine Powder Supplement - 500g"
1, 2018, 13, "Creatine Powder Supplement - 1000g"
1, 2018, 10, "Gummies 1 bag"
1, 2018, 11, "Glucosamine - 500g"
1, 2018, 21, "Glucosamine - 1500g"

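I want to calculate the total weight of product sold, broken down by month and year. This means extracting each product's weight from the "Product Name" column, multiplying it by the "Quantity Sold" column, and then summing the results per product.

Expected output (I have only worked out the total for the first row: 13 × 500 + 12 × 1000 + 15 × 1500 = 41000):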

Matched data set:

Month, Year, Product Name, Total Weight Sold
11, 2017, Creatine Powder Supplement, 41000
11, 2017, Glucosamine, <total>
12, 2017, Creatine Powder Supplement, <total>
12, 2017, Glucosamine, <total>
1, 2018, Creatine Powder Supplement, <total>
1, 2018, Glucosamine, <total>

In addition, any product whose name does not end with the pattern - <number>g should be written to a separate data set so that it can be reviewed.

Unmatched data set:

Month, Year, Quantity Sold, Product Name
11, 2017, 10, "Gummies 1 bag"
1, 2018, 10, "Gummies 1 bag"

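I was thinking of using str.extract, but I'm not sure how to do the arithmetic and then add the computed totals to those of the other rows for the same product, whether in a new DataFrame or some other structure.

Thanks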


2 Answers

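Here is a plain Python solution. It writes the unmatched rows to an output file and prints the totals for the matched rows to the terminal.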

from collections import defaultdict
import re

# Running total weight per "month,year,name" key
d = defaultdict(int)

with open('f0.txt', 'r') as f, open('err.txt', 'w') as fout:
    fout.write(f.readline())  # copy the header line to err.txt

    for row in f:
        row = row.rstrip()
        if re.search(r'- \d+g"$', row):
            # maxsplit=3 keeps a product name containing ', ' in one piece
            month, yr, qty, product = row.split(', ', 3)
            # Drop the quotes and only the trailing 'g', so a 'g' inside the name survives
            name, grams = product.strip('"')[:-1].rsplit(' - ', 1)
            key = ','.join([month, yr, name])
            d[key] += int(qty) * int(grams)
        else:
            # No "<name> - <number>g" pattern: send the row to err.txt for review
            fout.write(row + '\n')

print(','.join(['Month', 'Year', 'Product Name', 'Total Sold']))

for key, total in d.items():
    print(f'{key},{total}')

Printed to the terminal:

Month,Year,Product Name,Total Sold
11,2017,Creatine Powder Supplement,41000
11,2017,Glucosamine,40000
12,2017,Creatine Powder Supplement,17000
12,2017,Glucosamine,31500
1,2018,Creatine Powder Supplement,21000
1,2018,Glucosamine,37000

Written to err.txt:

Month, Year, Quantity Sold, Product Name
11, 2017, 10, "Gummies 1 bag"
1, 2018, 10, "Gummies 1 bag"

The most straightforward solution I can think of is:

import pandas as pd

# df is assumed to already hold the question's data with the column names shown above.
# The pattern keeps the full product name and tolerates surrounding double quotes.
product_data = df['Product Name'].str.extract(r'"?(?P<name>[^"]+) - (?P<weight>\d+)g"?$')
invalid_rows = df[product_data['weight'].isnull()]
product_data.drop(labels=invalid_rows.index, inplace=True)
df.drop(labels=invalid_rows.index, inplace=True)
df['Product Name'] = product_data['name']
df['Total'] = product_data['weight'].astype(int) * df['Quantity Sold']
print(df.groupby(['Month', 'Year', 'Product Name'])['Total'].sum().reset_index())
print()
print(invalid_rows)

which outputs

   Month  Year                Product Name  Total
0      1  2018  Creatine Powder Supplement  21000
1      1  2018                 Glucosamine  37000
2     11  2017  Creatine Powder Supplement  41000
3     11  2017                 Glucosamine  40000
4     12  2017  Creatine Powder Supplement  17000
5     12  2017                 Glucosamine  31500

    Month  Year  Quantity Sold     Product Name
1      11  2017             10  "Gummies 1 bag"
11      1  2018             10  "Gummies 1 bag"
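
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same str.extract idea. It is only one possible arrangement, not a definitive implementation: the names RAW and split_and_total are made up for this illustration, the sample rows are copied from the question, and it assumes the file can be read with skipinitialspace=True so that the quotes around the product names are parsed away rather than kept as text.

```python
import io

import pandas as pd

# Sample rows copied from the question; RAW is a hypothetical stand-in for a real file.
RAW = '''Month, Year, Quantity Sold, Product Name
11, 2017, 13, "Creatine Powder Supplement - 500g"
11, 2017, 10, "Gummies 1 bag"
11, 2017, 12, "Creatine Powder Supplement - 1000g"
11, 2017, 15, "Creatine Powder Supplement - 1500g"
11, 2017, 11, "Glucosamine - 500g"
11, 2017, 23, "Glucosamine - 1500g"
12, 2017, 17, "Creatine Powder Supplement - 1000g"
12, 2017, 24, "Glucosamine - 500g"
12, 2017, 13, "Glucosamine - 1500g"
1, 2018, 16, "Creatine Powder Supplement - 500g"
1, 2018, 13, "Creatine Powder Supplement - 1000g"
1, 2018, 10, "Gummies 1 bag"
1, 2018, 11, "Glucosamine - 500g"
1, 2018, 21, "Glucosamine - 1500g"
'''


def split_and_total(df):
    """Return (total weight per month/year/product, rows that did not match)."""
    # Everything before " - <number>g" is the name; the number is the weight in grams.
    parts = df['Product Name'].str.extract(r'^(?P<name>.+) - (?P<weight>\d+)g$')
    matched_mask = parts['weight'].notna()

    # Build the matched frame from copies, leaving the input frame untouched.
    matched = df.loc[matched_mask, ['Month', 'Year']].copy()
    matched['Product Name'] = parts.loc[matched_mask, 'name']
    matched['Total Weight Sold'] = (
        parts.loc[matched_mask, 'weight'].astype(int)
        * df.loc[matched_mask, 'Quantity Sold']
    )

    # sort=False keeps the groups in order of first appearance (11/2017, 12/2017, 1/2018).
    totals = matched.groupby(
        ['Month', 'Year', 'Product Name'], as_index=False, sort=False
    )['Total Weight Sold'].sum()

    unmatched = df.loc[~matched_mask]
    return totals, unmatched


# skipinitialspace=True drops the blank after each comma, which cleans up the
# column names and lets the double quotes be treated as quoting characters.
df = pd.read_csv(io.StringIO(RAW), skipinitialspace=True)
totals, unmatched = split_and_total(df)
print(totals)
print()
print(unmatched)
```

Because the function returns two new frames instead of dropping rows in place, the original df stays available for further checks, and either result can be written out with to_csv if a file is needed.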
