Extracting quantitative information from strings

Posted 2024-05-17 04:04:31


I am analysing the Open Food Facts dataset. The data is very messy, and there is a column named 'quantity' whose entries look like this:

'100 g',
'5 oz (142 g)',
'12 oz',
'200 g',
'12 oz (340 g)',
'10 f oz (296ml)',
'750 ml',
'1 l',
'250 ml',
'8 OZ',
'10.5 oz (750 g)',
'1 gallon (3.78 l)',
'27 oz (1 lb 11 oz) 765 g',
'75 cl'

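As you can see, the measurements and units are all over the place! Sometimes the quantity is even given in two different measures. My goal is to create a new column 'quantity_in_g' in a pandas DataFrame that extracts the information from the string and holds an integer number of grams based on the 'quantity' column. So if the quantity column has '200g' I want the integer 200, and if it says '1kg' I want the integer 1000. I'd also like to convert the other units of measurement to grams: for '2 oz' I want the integer 56, and for 1 litre I want 1000.
Can anyone help me convert this column? I'd really appreciate it!
Thanks in advance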


1 answer
User
#1 · Posted 2024-05-17 04:04:31
raw_data_lst = ['100 g ', '5 oz (142 g)', '12 oz', '200 g ', '12 oz (340 g)', '10 f oz (296ml)', '750 ml', '1 l', '250 ml', '8 OZ']
# '10 f oz (296ml)': I don't know what 'f' stands for
# If there is more data like this, loop over gram_conv_dict.keys() and look for each unit,
# instead of slicing at a fixed position as I have done below

in_grams_colm = []
gram_conv_dict = {
    'g': 1,
    'oz': 28.3495,
    'kg': 1000,
    'l': 1000,  # assumes 1 litre of water, which weighs 1000 g
}
# ml -> g is tricky because density varies

def convert2num(string_num):
    try:
        return int(string_num)
    except ValueError:
        return float(string_num)

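def get_in_grams(unit):
    """Return grams per unit; unknown units print a warning and fall back to 1."""
    try:
        return gram_conv_dict[unit.lower()]
    except KeyError:
        # unknown unit: warn, then treat the number as if it were already in grams
        print('don\'t know how much grams is present in 1', unit + '.')
        return 1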


for data in raw_data_lst:
    i = 0
    quantity_str = ''
    # collect the leading digits (and decimal point) of the entry
    while i < len(data):
        if data[i].isdigit() or data[i] == '.':
            quantity_str += data[i]
        else:
            # data[i] is the space after the number. Most abbreviations are at most
            # 2 characters long, hence data[i+1:i+3] below; you could also pass the
            # whole data[i+1:] and check directly whether a key of gram_conv_dict matches
            break
        i += 1

    # assumes every entry has the format: number, one space, unit abbreviation of length <= 2
    quantity_num = convert2num(quantity_str) * get_in_grams(data[i+1:i+3].strip())
    in_grams_colm.append(quantity_num)  # use int(quantity_num) if you only want an integer

#print(in_grams_colm)

def nice_print():
    for value in in_grams_colm:
        print('{:.2f}'.format(value))

nice_print()
'''
output

don't know how much grams is present in 1 f.
don't know how much grams is present in 1 ml.
don't know how much grams is present in 1 ml.
100.00
141.75
340.19
200.00
340.19
10.00
750.00
1000.00
250.00
226.80'''
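
The loop above only reads the first number and the two characters after it, so '10 f oz (296ml)' and the 'ml' entries fall through to the fallback factor of 1. A regex-based variant might cope better with the mixed entries from the question. This is only a rough sketch, not a tested solution for the whole dataset: the names `GRAMS_PER_UNIT`, `QUANTITY_RE` and `to_grams` are made up for illustration, the volume units assume the density of water (1 g per ml), and it rounds to the nearest gram.

```python
import re

# Grams per unit. The volume units (ml, cl, l) ASSUME the density of water,
# i.e. 1 ml == 1 g; that will be wrong for oil, honey, and so on.
GRAMS_PER_UNIT = {
    "g": 1.0,
    "kg": 1000.0,
    "oz": 28.3495,
    "lb": 453.592,
    "ml": 1.0,
    "cl": 10.0,
    "l": 1000.0,
}

# A number (optionally with a decimal part), optional spaces, then a run of letters.
QUANTITY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)")


def to_grams(text):
    """Return the quantity in whole grams, or None if no known unit is found."""
    if not isinstance(text, str):
        return None  # e.g. missing values in the column
    pairs = [(float(num), unit.lower()) for num, unit in QUANTITY_RE.findall(text)]
    # Prefer a weight that the label already states in g or kg.
    for value, unit in pairs:
        if unit in ("g", "kg"):
            return round(value * GRAMS_PER_UNIT[unit])
    # Otherwise convert the first number whose unit we recognise.
    for value, unit in pairs:
        if unit in GRAMS_PER_UNIT:
            return round(value * GRAMS_PER_UNIT[unit])
    return None
```

With this, '5 oz (142 g)' gives 142 because the stated gram figure wins, '10 f oz (296ml)' gives 296, and '12 oz' gives 340. Note that '2 oz' comes out as 57, not the 56 from the question, because of the rounding and the 28.3495 factor. On a DataFrame called `df` (a hypothetical name) it could be applied with `df['quantity_in_g'] = df['quantity'].apply(to_grams)`; rows where nothing is recognised would then hold missing values.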
