Extracting the year from a string column of movie names

Posted 2024-09-28 03:19:18


I have the following data in a table named train_df, with two columns, "name" and "gross":

gross       name
760507625.0 Avatar (2009)
658672302.0 Titanic (1997)
652270625.0 Jurassic World (2015)
623357910.0 The Avengers (2012)
534858444.0 The Dark Knight (2008)
532177324.0 Rogue One (2016)
474544677.0 Star Wars: Episode I - The Phantom Menace (1999)
459005868.0 Avengers: Age of Ultron (2015)
448139099.0 The Dark Knight Rises (2012)
436471036.0 Shrek 2 (2004)
424668047.0 The Hunger Games: Catching Fire (2013)
423315812.0 Pirates of the Caribbean: Dead Man's Chest (2006)
415004880.0 Toy Story 3 (2010)
409013994.0 Iron Man 3 (2013)
408084349.0 Captain America: Civil War (2016)
408010692.0 The Hunger Games (2012)
403706375.0 Spider-Man (2002)
402453882.0 Jurassic Park (1993)
402111870.0 Transformers: Revenge of the Fallen (2009)
400738009.0 Frozen (2013)
381011219.0 Harry Potter and the Deathly Hallows: Part 2 (2011)
380843261.0 Finding Nemo (2003)
380262555.0 Star Wars: Episode III - Revenge of the Sith (2005)
373585825.0 Spider-Man 2 (2004)
370782930.0 The Passion of the Christ (2004)

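I want to read the "name" column and extract the year from it to create a new column named "year", then use that column to filter the data set by a specific year. The new table would look like this: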

year    gross   name
2009    760507625.0 Avatar (2009)
1997    658672302.0 Titanic (1997)
2015    652270625.0 Jurassic World (2015)
2012    623357910.0 The Avengers (2012)
2008    534858444.0 The Dark Knight (2008)

I tried apply with a lambda, but got no results:

train_df[train_df.apply(lambda row: row['name'].startswith('2014'),axis=1)]

Is there a way to filter strings in Python using something like contains (as in C#) or isin?


3 answers

Try this (note that df here is a plain Python list of titles, not a DataFrame):

df = [
    'Avatar (2009)', 'Titanic (1997)', 'Jurassic World (2015)', 'The Avengers (2012)',
    'The Dark Knight (2008)', 'Rogue One (2016)', 'Star Wars: Episode I - The Phantom Menace (1999)',
    'Avengers: Age of Ultron (2015)', 'The Dark Knight Rises (2012)', 'Shrek 2 (2004)',
    'Boiling Point (1990)', 'Terror Firmer (1999)', "Adam's Apples (2005)", 'I Want You (1998)',
    'Chalet Girl (2011)', 'Love, Honor and Obey (2000)', "Perrier's Bounty (2009)",
    'Into the White (2012)', 'The Decoy Bride (2011)', 'I Spit on Your Grave 2 (2013)',
]

for i in df:
    mov_title = i[:-7]  # everything before the trailing " (YYYY)"
    year = i[-5:-1]     # the four digits inside the parentheses
    print(mov_title)  # do your actual extraction
    print(year)  # do your actual extraction
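The slicing above assumes every title ends in exactly " (YYYY)". As a minimal sketch of a more defensive variant for plain strings, the standard-library re module can split the title from the year and report when no year is found. The helper name split_title_year and the pattern YEAR_AT_END are hypothetical, introduced only for illustration.

```python
import re

# Hypothetical pattern: a title, optional whitespace, then "(YYYY)" at the very end.
YEAR_AT_END = re.compile(r"^(.*?)\s*\((\d{4})\)$")


def split_title_year(name):
    """Return (title, year) for 'Title (YYYY)', or (name, None) if no year is found."""
    match = YEAR_AT_END.match(name)
    if match is None:
        return name, None
    return match.group(1), int(match.group(2))


for raw in ["Shrek 2 (2004)", "Adam's Apples (2005)", "No Year Here"]:
    print(split_title_year(raw))
```

Returning None instead of slicing blindly avoids silently producing garbage for titles that do not follow the expected format.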

If you are sure the year will always be at the end, you can do:

df['year'] = df['name'].str[-5:-1].astype(int)

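This takes the name column, uses the str accessor to treat each row's value as a string, and takes the slice -5:-1 from it. It then converts the result to int and assigns it to the year column. If you have a lot of data, this approach will be much faster than iterating over the rows.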


Alternatively, you can use the extract method of the str accessor with a regex for more flexibility:

df['year'] = df['name'].str.extract(r'\((\d{4})\)').astype(int)

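This extracts the group matched by the expression \((\d{4})\): it captures the digits inside a pair of parentheses that contains exactly four digits, and it works anywhere in the string. To anchor it to the end of the string, add $ at the end of the regex, like this: \((\d{4})\)$. For this data, the regex and the string slice give the same result.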
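To make the difference between the three approaches concrete, here is a small sketch on made-up titles (they are assumptions, not rows from the question's data). It compares the unanchored regex, the regex anchored with $, and the fixed slice when a title has extra text after the year.

```python
import pandas as pd

# Hypothetical titles chosen to show where the approaches diverge.
names = pd.Series([
    "Avatar (2009)",
    "2012 (2009)",
    "Blade Runner (1982) [Final Cut]",
])

# Matches "(YYYY)" anywhere in the string.
anywhere = names.str.extract(r"\((\d{4})\)", expand=False)

# Matches "(YYYY)" only at the very end of the string; no match gives a missing value.
at_end = names.str.extract(r"\((\d{4})\)$", expand=False)

# Fixed slice: correct only when the string ends in "(YYYY)".
sliced = names.str[-5:-1]

print(pd.DataFrame({"name": names, "anywhere": anywhere, "at_end": at_end, "sliced": sliced}))
```

For the third title the unanchored regex still finds 1982, the anchored regex yields a missing value, and the slice returns unrelated characters, so the slice is only safe when the format is guaranteed.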


Now we have the new DataFrame:

          gross                                               name  year
0   760507625.0                                      Avatar (2009)  2009
1   658672302.0                                     Titanic (1997)  1997
2   652270625.0                              Jurassic World (2015)  2015
3   623357910.0                                The Avengers (2012)  2012
4   534858444.0                             The Dark Knight (2008)  2008
5   532177324.0                                   Rogue One (2016)  2016
6   474544677.0   Star Wars: Episode I - The Phantom Menace (1999)  1999
7   459005868.0                     Avengers: Age of Ultron (2015)  2015
8   448139099.0                       The Dark Knight Rises (2012)  2012
9   436471036.0                                     Shrek 2 (2004)  2004
10  424668047.0             The Hunger Games: Catching Fire (2013)  2013
11  423315812.0  Pirates of the Caribbean: Dead Man's Chest (2006)  2006
12  415004880.0                                 Toy Story 3 (2010)  2010
13  409013994.0                                  Iron Man 3 (2013)  2013
14  408084349.0                  Captain America: Civil War (2016)  2016
15  408010692.0                            The Hunger Games (2012)  2012
16  403706375.0                                  Spider-Man (2002)  2002
17  402453882.0                               Jurassic Park (1993)  1993
18  402111870.0         Transformers: Revenge of the Fallen (2009)  2009
19  400738009.0                                      Frozen (2013)  2013
20  381011219.0  Harry Potter and the Deathly Hallows: Part 2 (...  2011
21  380843261.0                                Finding Nemo (2003)  2003
22  380262555.0  Star Wars: Episode III - Revenge of the Sith (...  2005
23  373585825.0                                Spider-Man 2 (2004)  2004
24  370782930.0                   The Passion of the Christ (2004)  2004
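With the year column in place, the filtering the question asks about becomes a plain boolean mask. The sketch below rebuilds a five-row sample of the question's data and shows three options: equality, isin for several years, and str.contains on the raw name. The attempt in the question returns nothing because startswith checks the beginning of each title, whereas the year sits at the end.

```python
import pandas as pd

# First five rows of the data from the question.
train_df = pd.DataFrame({
    "gross": [760507625.0, 658672302.0, 652270625.0, 623357910.0, 534858444.0],
    "name": [
        "Avatar (2009)",
        "Titanic (1997)",
        "Jurassic World (2015)",
        "The Avengers (2012)",
        "The Dark Knight (2008)",
    ],
})
train_df["year"] = train_df["name"].str[-5:-1].astype(int)

# Rows for one specific year.
only_2012 = train_df[train_df["year"] == 2012]

# Rows for any of several years.
some_years = train_df[train_df["year"].isin([2009, 2015])]

# Substring test on the original text; regex=False treats "(2012)" literally.
by_text = train_df[train_df["name"].str.contains("(2012)", regex=False)]

print(only_2012)
print(some_years)
print(by_text)
```

Filtering on the integer column is usually the clearer choice once the year has been extracted; str.contains is handy when you only need a one-off text match.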

You can use a regular expression with pandas.Series.str.extract for this:

import pandas as pd

df["year"] = df["name"].str.extract(r"\((\d{4})\)$", expand=False)
df["year"] = pd.to_numeric(df["year"])

print(df.head())
         gross                    name  year
0  760507625.0           Avatar (2009)  2009
1  658672302.0          Titanic (1997)  1997
2  652270625.0   Jurassic World (2015)  2015
3  623357910.0     The Avengers (2012)  2012
4  534858444.0  The Dark Knight (2008)  2008

The regular expression:

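  • \(: find a literal opening parenthesis
  • (\d{4}): then find 4 digits in a row
    • the parentheses here mean the 4 digits are stored as a capture group (in this case, the group of digits we want to extract from the larger string)
  • \): then find a closing parenthesis
  • $: all of the above must occur at the end of the string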

When all of the above conditions are met, those 4 digits are returned; if there is no match, NaN is returned for that row.
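The no-match case matters in practice, because a missing value in the column prevents a direct conversion to plain integers. As a rough sketch on made-up titles (one of them deliberately has no year), the rows without a match can be inspected and dropped before converting.

```python
import pandas as pd

# Hypothetical sample; "Untitled Project" has no "(YYYY)" suffix.
df = pd.DataFrame({"name": ["Avatar (2009)", "Untitled Project", "Titanic (1997)"]})

year = df["name"].str.extract(r"\((\d{4})\)$", expand=False)
df["year"] = pd.to_numeric(year)  # the unmatched row stays missing

# Inspect rows where no year was found.
unmatched = df[df["year"].isna()]

# Keep only matched rows, then convert to plain integers.
matched = df.dropna(subset=["year"]).copy()
matched["year"] = matched["year"].astype(int)

print(unmatched)
print(matched)
```

Whether to drop, fix, or keep the unmatched rows depends on the data set; the point is to check for them before relying on the year column.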
