Text operations in a DataFrame: extracting words

Halve the clementine and place into the cavity along with the bay leaves. Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.
Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.
2 heaped teaspoons Chinese five-spice
100 ml Marsala
1 litre organic chicken stock

2 answers

User

#1 · edited 2024-09-28 22:04:35

We use Series.str.extractall with the pattern digits, whitespace, word characters. Then we check which of the matches appear in to_compare, and finally use GroupBy.sum to count how many matches each row has.

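# Raw string so that \d, \s and \w are passed to the regex engine unchanged
matches = df['Col'].str.extractall(r'(\d+\s\w+)')
# For each original row (index level 0), count the matches found in to_compare
df['matches'] = matches[0].isin(to_compare).groupby(level=0).sum()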

                                                 Col  matches
0  Halve the clementine and place into the cavity...      2.0
1  Add the stock, then bring to the boil and redu...      1.0
2              2 heaped teaspoons Chinese five-spice      0.0
3                                     100 ml Marsala      1.0
4                      1 litre organic chicken stock      0.0
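The snippet above relies on df and to_compare from the question. Here is a minimal self-contained sketch of the same steps, assuming the sample rows and the to_compare list shown elsewhere in this thread. The extra row "Season to taste" is hypothetical and is only there to show what happens when a row contains no number at all: extractall produces no entry for it, so the count has to be filled in explicitly (here with reindex).

```python
import pandas as pd

# Sample rows taken from this thread; the last row is a hypothetical addition
df = pd.DataFrame({'Col': [
    "Halve the clementine and place into the cavity along with the bay leaves. "
    "Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.",
    "Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
    "1 litre organic chicken stock",
    "Season to taste",  # hypothetical row without any number
]})
to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

# One row per regex match, indexed by (original row, match number)
matches = df['Col'].str.extractall(r'(\d+\s\w+)')

# True/False per match, summed per original row
counts = matches[0].isin(to_compare).groupby(level=0).sum()

# Rows with no regex match are missing from counts; give them an explicit 0
df['matches'] = counts.reindex(df.index, fill_value=0).astype(int)

print(df['matches'].tolist())
```

Without the reindex step, assigning counts directly would leave NaN in rows that had no match, which also turns the column into floats.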

Also, matches returns:

                  0
  match            
0 0          1 hour
  1      20 minutes
1 0      15 minutes
2 0        2 heaped
3 0          100 ml
4 0         1 litre

To get this information as lists, use:

matches.groupby(level=0).agg(list)

                      0
0  [1 hour, 20 minutes]
1          [15 minutes]
2            [2 heaped]
3              [100 ml]
4             [1 litre]
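If the lists should end up as a column of the original frame, one possible way is sketched below. The column name qty and the shortened sample rows are assumptions made for illustration; the third row is hypothetical and has no match, so it is given an empty list.

```python
import pandas as pd

# Shortened, partly hypothetical sample rows
df = pd.DataFrame({'Col': [
    "roast for around 1 hour 20 minutes.",
    "100 ml Marsala",
    "Season to taste",  # hypothetical row without any number
]})

matches = df['Col'].str.extractall(r'(\d+\s\w+)')

# Collect the matches of each original row into a list
as_lists = matches[0].groupby(level=0).agg(list)

# reindex adds the rows that had no match (as NaN); replace those with []
df['qty'] = as_lists.reindex(df.index).apply(
    lambda v: v if isinstance(v, list) else []
)

print(df['qty'].tolist())
```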

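User

#2 · edited 2024-09-28 22:04:35

You can build a regex pattern that extracts a number together with the word that follows it, then apply that function to the whole column of the DataFrame.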

import pandas as pd
import re
df = pd.DataFrame({'text':["Halve the clementine and place into the cavity along with the bay leaves. Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.",
           "Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.",
           "2 heaped teaspoons Chinese five-spice",
           "100 ml Marsala",
           "1 litre organic chicken stock"]})


def extract_qty(txt):
    # Return every "<digits> <word>" pair found in txt, e.g. "20 minutes"
    return re.findall(r'\d+ \w+', txt)

df['extracted_qty'] = df['text'].apply(extract_qty)

df    
#   text                                                extracted_qty
#0  Halve the clementine and place into the cavity...   [1 hour, 20 minutes]
#1  Add the stock, then bring to the boil and redu...   [15 minutes]
#2  2 heaped teaspoons Chinese five-spice               [2 heaped]
#3  100 ml Marsala                                      [100 ml]
#4  1 litre organic chicken stock                       [1 litre]
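As an alternative to apply with a helper function, pandas also offers Series.str.findall, which runs re.findall on every element. A small sketch with shortened sample rows (the shortening is only for brevity):

```python
import pandas as pd

df = pd.DataFrame({'text': [
    "roast for around 1 hour 20 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
]})

# Same pattern as extract_qty, applied to the whole column at once
df['extracted_qty'] = df['text'].str.findall(r'\d+ \w+')

print(df['extracted_qty'].tolist())
```

Both approaches should produce the same lists; str.findall just saves the separate function and the import of re.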

Use to_compare and a list comprehension to extract the common values:

to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

df['common'] = df['extracted_qty'].apply(lambda x: [el for el in x if el in to_compare])


#   text                        extracted_qty           common
#0  Halve the clementine ...    [1 hour, 20 minutes]    [1 hour, 20 minutes]
#1  Add the stock, then  ...    [15 minutes]            [15 minutes]
#2  2 heaped teaspoons ...      [2 heaped]              []
#3  100 ml Marsala              [100 ml]                [100 ml]
#4  1 litre organic chicken...  [1 litre]               []
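If only the number of common values is needed, the lists can be reduced to a count. The sketch below starts from the extracted lists shown above and turns to_compare into a set, which is an optional tweak for faster membership tests on longer lists; the names common and n_common are chosen for illustration.

```python
import pandas as pd

# Set instead of list: same membership semantics, faster lookups
to_compare = {"1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"}

# The extracted quantities from the output above
extracted = pd.Series([
    ["1 hour", "20 minutes"],
    ["15 minutes"],
    ["2 heaped"],
    ["100 ml"],
    ["1 litre"],
])

# Keep only the quantities that also appear in to_compare
common = extracted.apply(lambda qty: [q for q in qty if q in to_compare])

# Number of common values per row
n_common = common.apply(len)

print(n_common.tolist())
```

Note that the entry "2" in to_compare can never match here: the pattern \d+ \w+ always captures a number plus the following word, so row 2 yields "2 heaped" rather than "2".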
