How to extract the year (or a datetime) from a DataFrame column that contains text

User

Reply #1 · edited 2024-09-23 08:30:18

How about a simple regular expression:

import re

text = 'Harry Potter (1997)'
re.findall(r'\((\d{4})\)', text)
# ['1997']  Note that this is a list of all the occurrences.

For a DataFrame, you can do this:

import pandas as pd

text = 'Harry Potter (1997)'
df = pd.DataFrame({'Book': text}, index=[1])
pattern = r'\((\d{4})\)'  # raw string, so the backslashes reach the regex engine unchanged
df['year'] = df.Book.str.extract(pattern, expand=False)  # expand=False returns a Series

df
#                   Book  year
# 1  Harry Potter (1997)  1997

Finally, if you really want to separate the title from the year (borrowing Philip's reconstruction of the DataFrame from another answer):

df = pd.DataFrame(columns=['Book'], data=[
    ['Harry Potter (1997)'],
    ['Of Mice and Men (1937)'],
    ['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story'],
])

# Two capture groups: everything before "(", then the four digits inside the parentheses.
sep = df['Book'].str.extract(r'(.*)\((\d{4})\)', expand=False)

sep  # A new DataFrame, separated into title and year
#                        0     1
# 0          Harry Potter   1997
# 1       Of Mice and Men   1937
# 2  Babe Ruth Story, The   1948
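
Since the question also mentions datetimes, the extracted strings can be converted after the split. The following is only a sketch: the group names `title` and `year`, the extra columns `year_int` and `year_dt`, and the third sample row are assumptions made here for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Book': ['Harry Potter (1997)',
                            'Of Mice and Men (1937)',
                            'No year here']})

# Named groups become the column names of the result (names are illustrative).
sep = df['Book'].str.extract(r'(?P<title>.*)\((?P<year>\d{4})\)')
sep['title'] = sep['title'].str.strip()  # drop the space left before "("

# Rows without a match are missing; the nullable Int64 dtype keeps them as <NA>.
sep['year_int'] = pd.to_numeric(sep['year']).astype('Int64')

# Interpret the four digits as a calendar year (January 1 of that year).
sep['year_dt'] = pd.to_datetime(sep['year'], format='%Y')
```

Keeping the year as an integer is usually enough for filtering and sorting; the datetime column is only useful if the year has to be combined with other date arithmetic.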

User

Reply #2 · edited 2024-09-23 08:30:18

For an entire Series, the answer is actually:

books['title'].str.findall(r'\((\d{4})\)').str.get(0)
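
To see how this behaves on titles without a year, here is a small sketch. The `books` frame and its contents are made up for illustration. Rows with no parenthesized four-digit number come back as missing values, the same as with `str.extract`:

```python
import pandas as pd

# Hypothetical data: the second title has no year in parentheses.
books = pd.DataFrame({'title': ['Harry Potter (1997)', 'Untitled draft']})
pattern = r'\((\d{4})\)'

# findall returns a list per row; .str.get(0) takes the first element, or NaN if the list is empty.
via_findall = books['title'].str.findall(pattern).str.get(0)

# extract returns the first match directly.
via_extract = books['title'].str.extract(pattern, expand=False)
```

Both approaches keep the first match only; `str.findall` on its own is the one to use if a title may contain several years and all of them are needed.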

User

Reply #3 · edited 2024-09-23 08:30:18

You can do the following:

import pandas as pd
df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])

df['Year'] = df['Book'].str.extract(r'(?!\()\b(\d+){1}')

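First line: import pandas
Second line: create a DataFrame to make the example easy to follow
Last line: create a new "Year" column by extracting from the strings in the Book column

A regular expression is used to find the digits. I use https://regex101.com/r/Bid0qA/1, which helps a lot in understanding how the regular expression works.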
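
One caveat worth checking: this pattern captures the first run of digits that starts at a word boundary, which is not necessarily the year in parentheses. The sketch below compares it with the stricter pattern from the first reply; the sample title starting with a number is an assumption added here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Book': ['2001: A Space Odyssey (1968)',
                            'Harry Potter (1997)']})

# Pattern from this reply: first run of digits that starts at a word boundary.
first_digits = df['Book'].str.extract(r'(?!\()\b(\d+){1}', expand=False)

# Stricter pattern: exactly four digits enclosed in parentheses.
in_parens = df['Book'].str.extract(r'\((\d{4})\)', expand=False)

print(first_digits.tolist())
print(in_parens.tolist())
```

If titles can contain other numbers, anchoring on the parentheses is the safer choice.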
