Text operations in a DataFrame: extracting words

Posted 2024-09-28 22:04:35


I would like to extract the words that appear next to numbers. For example, my DataFrame has the following column: Recipe

Halve the clementine and place into the cavity along with the bay leaves. Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.
Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.
2 heaped teaspoons Chinese five-spice 
100 ml Marsala
1 litre organic chicken stock

I would like to get a new column containing the extracted values:

New Column
[1 hour, 20 minutes]
15 minutes
2 heaped
100 ml
1 litre

because I need to compare them against a list of values:

to_compare= ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

and see how many elements each row has in common with that list. Thanks for your help.


2 Answers

We use Series.str.extractall with the pattern "digits, whitespace, word characters". We then check which of the matches appear in to_compare, and finally use GroupBy.sum to count the matches in each row.

matches = df['Col'].str.extractall(r'(\d+\s\w+)')
df['matches'] = matches[0].isin(to_compare).groupby(level=0).sum()

                                                 Col  matches
0  Halve the clementine and place into the cavity...      2.0
1  Add the stock, then bring to the boil and redu...      1.0
2              2 heaped teaspoons Chinese five-spice      0.0
3                                     100 ml Marsala      1.0
4                      1 litre organic chicken stock      0.0
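One thing to watch for: a row that contains no number at all produces no entry in matches, so the assignment above would leave NaN in that row. A minimal sketch of one way to handle it, assuming you want such rows counted as 0 (the sample rows here are shortened and the last one is made up for illustration):

```python
import pandas as pd

# Shortened sample data; the last row is a hypothetical row with no number.
df = pd.DataFrame({"Col": [
    "roast for around 1 hour 20 minutes.",
    "100 ml Marsala",
    "Season to taste",
]})
to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

# One row per match, indexed by (original row, match number).
matches = df["Col"].str.extractall(r"(\d+\s\w+)")

# Number of extracted values per original row that appear in to_compare.
counts = matches[0].isin(to_compare).groupby(level=0).sum()

# Rows without any match are missing from counts, so fill them with 0.
df["matches"] = counts.reindex(df.index, fill_value=0).astype(int)
print(df)
```

The reindex step is only needed when some rows may have no match; with the data from the question every row has at least one.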

For reference, matches returns:

                  0
  match            
0 0          1 hour
  1      20 minutes
1 0      15 minutes
2 0        2 heaped
3 0          100 ml
4 0         1 litre

To collect these values into one list per row, use:

matches.groupby(level=0).agg(list)

                      0
0  [1 hour, 20 minutes]
1          [15 minutes]
2            [2 heaped]
3              [100 ml]
4             [1 litre]
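The result above is a separate DataFrame. If you want the lists as a new column on the original frame, as the question asks, one possible sketch is to select column 0 first so that the aggregation returns a Series aligned on the row index (the column name "New Column" is taken from the question; the sample rows are shortened):

```python
import pandas as pd

df = pd.DataFrame({"Col": [
    "roast for around 1 hour 20 minutes.",
    "1 litre organic chicken stock",
]})

matches = df["Col"].str.extractall(r"(\d+\s\w+)")

# Selecting column 0 gives a Series, so agg(list) yields one list per row.
# Assignment aligns on the index; rows with no match would become NaN.
df["New Column"] = matches[0].groupby(level=0).agg(list)
print(df)
```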

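You can build a regex pattern that extracts a number together with the word that follows it, then apply that function to the whole column of the DataFrame.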

import pandas as pd
import re
df = pd.DataFrame({'text':["Halve the clementine and place into the cavity along with the bay leaves. Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.",
           "Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.",
           "2 heaped teaspoons Chinese five-spice",
           "100 ml Marsala",
           "1 litre organic chicken stock"]})


def extract_qty(txt):
    # Return every "number word" pair found in txt, e.g. ['1 hour', '20 minutes']
    return re.findall(r'\d+ \w+', txt)

df['extracted_qty'] = df['text'].apply(extract_qty)

df    
#   text                                                extracted_qty
#0  Halve the clementine and place into the cavity...   [1 hour, 20 minutes]
#1  Add the stock, then bring to the boil and redu...   [15 minutes]
#2  2 heaped teaspoons Chinese five-spice               [2 heaped]
#3  100 ml Marsala                                      [100 ml]
#4  1 litre organic chicken stock                       [1 litre]
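As an aside, pandas can do the same extraction without a helper function: Series.str.findall applies re.findall to every element. A short sketch of that variant, using shortened sample rows plus one made-up row without a number to show that it yields an empty list:

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "100 ml Marsala",
    "roast for around 1 hour 20 minutes.",
    "Season to taste",  # hypothetical row with no number
]})

# Equivalent to df['text'].apply(lambda t: re.findall(r'\d+ \w+', t))
df["extracted_qty"] = df["text"].str.findall(r"\d+ \w+")
print(df)
```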

Use to_compare and a list comprehension to extract the common values:

to_compare= ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

df['common'] = df['extracted_qty'].apply(lambda x: [el for el in x if el in to_compare])


#   text                        extracted_qty           common
#0  Halve the clementine ...    [1 hour, 20 minutes]    [1 hour, 20 minutes]
#1  Add the stock, then  ...    [15 minutes]            [15 minutes]
#2  2 heaped teaspoons ...      [2 heaped]              []
#3  100 ml Marsala              [100 ml]                [100 ml]
#4  1 litre organic chicken...  [1 litre]               []
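Since the question asks how many elements each row shares with to_compare, the common values can also be counted directly. This is a sketch rather than part of the original answer; the column name n_common and the set named lookup are introduced here only for illustration:

```python
import pandas as pd

to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]
lookup = set(to_compare)  # set membership is faster than scanning a list

# Extracted values as produced by the step above (subset of the rows).
df = pd.DataFrame({"extracted_qty": [
    ["1 hour", "20 minutes"],
    ["2 heaped"],
    ["100 ml"],
]})

# Count how many extracted values of each row appear in to_compare.
df["n_common"] = df["extracted_qty"].apply(
    lambda qty: sum(el in lookup for el in qty)
)
print(df)
```

Note that the bare "2" in to_compare can never match here, because the pattern always extracts a number together with the following word ("2 heaped").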
