How do I extract date and time text from HTML using bs4 in Python?

Published 2024-09-27 21:27:20

I am using Python and bs4 to extract the date and time from this HTML:

[<time class="published-date relative-date" data-published-date="2020-07-21T18:49:14Z" datetime="2020-07-21T18:49:14Z"></time>, <time class="published-date relative-date" data-published-date="2020-07-21T18:48:26Z" datetime="2020-07-21T18:48:26Z"></time>, <time class="published-date relative-date" data-published-date="2020-07-21T18:47:00Z" datetime="2020-07-21T18:47:00Z"></time>, <time class="published-date relative-date" data-published-date="2020-07-21T18:43:21Z" datetime="2020-07-21T18:43:21Z"></time>]

***

How can I strip out everything except the date and time? For example, I would like to take '2020-07-21T18:49:14Z' and display it as '2020-07-21', '18:49:14Z'.

Here is my code so far:

date_and_time = soup.find_all('time', attrs={'class': 'published-date-relative-date'})


3 answers

This script creates a DataFrame with separate date and time columns:

import pandas as pd
from bs4 import BeautifulSoup


html_string = '''
    <time class="published-date relative-date" data-published-date="2020-07-21T18:49:14Z" datetime="2020-07-21T18:49:14Z"></time>
'''

soup = BeautifulSoup(html_string, 'html.parser')

all_data = []
for t in soup.select('time.published-date.relative-date'):
    all_data.append(t.get('data-published-date'))

df = pd.DataFrame(all_data)
df[0] = pd.to_datetime(df[0])

df['date'] = df[0].dt.date
df['time'] = df[0].dt.time

print(df)

Prints:

                          0        date      time
0 2020-07-21 18:49:14+00:00  2020-07-21  18:49:14

You can use dateutil to parse the raw datetime string. Install it with pip: pip install python-dateutil

from bs4 import BeautifulSoup
from dateutil import parser

text = '<time class="published-date relative-date" data-published-date="2020-07-21T18:49:14Z" datetime="2020-07-21T18:49:14Z"></time>'

# Name the parser explicitly so bs4 does not warn about guessing one
soup = BeautifulSoup(text, 'html.parser')
for t in soup.find_all('time', attrs={'class': 'published-date relative-date'}):
    # The timestamp lives in the datetime attribute; the tag itself has no text
    date_time_str = t.get('datetime')
    date_time = parser.parse(date_time_str)
    print(date_time.date())
    print(date_time.time())

Output:

2020-07-21
18:49:14
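If installing dateutil is not an option, the same parse can probably be done with the standard library alone. The following is a minimal sketch, assuming every timestamp has exactly the `YYYY-MM-DDTHH:MM:SSZ` shape shown in the question and that Python 3.7 or later is in use (so that `%z` accepts a literal `Z`). The name `raw` is only illustrative.

```python
from datetime import datetime

# Assumed input: one value taken from a tag's datetime attribute.
raw = '2020-07-21T18:49:14Z'

# %z accepts a literal 'Z' (meaning UTC) on Python 3.7 and later.
parsed = datetime.strptime(raw, '%Y-%m-%dT%H:%M:%S%z')

date_text = parsed.date().isoformat()
time_text = parsed.time().isoformat()

print(date_text)
print(time_text)
```

Note that `time()` drops the timezone, so the trailing `Z` is not part of the printed time.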

You can use

soup.find(id=<ID OF TIME>)

Then you get only that one time element. If you use find_all instead, you get every element that matches the attributes.
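To illustrate the difference between `find` and `find_all`, here is a small sketch. It assumes markup modeled on the question; the `id="first"` attribute is hypothetical and was added only so that `find(id=...)` has something to match. Because the `time` tags are empty, the timestamp is read from the `datetime` attribute rather than from the tag text.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question; the id value "first" is hypothetical.
html = '''
<time id="first" class="published-date relative-date" datetime="2020-07-21T18:49:14Z"></time>
<time class="published-date relative-date" datetime="2020-07-21T18:48:26Z"></time>
'''

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching Tag, or None when nothing matches.
single = soup.find(id='first')
print(single['datetime'])

# find_all() returns every matching Tag; class_ matches a single CSS class.
stamps = [t['datetime'] for t in soup.find_all('time', class_='relative-date')]
print(stamps)
```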

You can also split the text you already have:

date_and_time = '2020-07-21T18:49:14Z'
print(date_and_time.split('T'))

['2020-07-21', '18:49:14Z']
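The same split can be applied to every timestamp at once. This is a minimal sketch, assuming the `datetime` attribute values have already been collected into a list of strings; `split_timestamp` and `timestamps` are hypothetical names used only for illustration.

```python
def split_timestamp(stamp):
    """Split a string such as '2020-07-21T18:49:14Z' at the first 'T'."""
    date_part, _, time_part = stamp.partition('T')
    return date_part, time_part


# Values copied from the datetime attributes shown in the question.
timestamps = [
    '2020-07-21T18:49:14Z',
    '2020-07-21T18:48:26Z',
]

pairs = [split_timestamp(s) for s in timestamps]
for date_part, time_part in pairs:
    print(date_part, time_part)
```

This keeps the trailing `Z` on the time part, which matches the format requested in the question.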
