Python BeautifulSoup: getting titles from a date range

Posted 2024-10-02 14:17:58


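I am trying to get the titles, links, and dates that fall within a date range, for example from FourDaysAgo to today. First I select the current month in the dropdown's select options, then I pick the dates that fall within the range. I am using: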

html_link = 'https://www.ksei.co.id/publications/new-securities-registration?setLocale=en-US'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
now = datetime.datetime.now()
month = now.month
soup.select('(option["{}"])'.format(month))
FourDaysAgo = (datetime.datetime.now() - datetime.timedelta(days = 4))
FourDaysAgo_day = FourDaysAgo.day
now = datetime.date.today()
today = now.day
d = range(FourDaysAgo_day,today)

I want to get the titles and hrefs within that date range, but I don't know how to use the date as the "selection condition". I am using:

dates = soup.findAll('b', text = re.compile('{}').format(d))
titles = soup.find_all("h2", {"class": "h4 no-margin"})
hrefs = soup.find_all("a", {"class": "btn btn--primary"})

Can anyone help?


1 answer
User
#1 · Posted 2024-10-02 14:17:58

This code could be improved, but it should solve your use case. If you have any questions, let me know and I will try to resolve them.

import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta


# Window size in days: 10 as in the original answer; use days=4 for the four-day range in the question.
# ISO 'YYYY-MM-DD' strings sort chronologically, so they can be compared directly in the loop.
start_date = (datetime.now() - timedelta(days=10)).strftime('%Y-%m-%d')
end_date = datetime.now().strftime('%Y-%m-%d')


html_link = 'https://www.ksei.co.id/publications/new-securities-registration?setLocale=en-US'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
for ultag in soup.find_all('ul', {'class': 'list-nostyle'}):
    for litag in ultag.find_all('li'):
        for dates in litag.find_all('small', {'class': 'muted'}):
            clean_date = datetime.strptime(str(dates.text), "%B %d, %Y").strftime('%Y-%m-%d')
            if start_date <= clean_date <= end_date:
                title = litag.find('h2', {'class': 'h4 no-margin'})
                document_link = litag.find('a', href=True)
                print(clean_date)
                print(title.text)
                print(f"https://www.ksei.co.id{document_link['href']}")

# OUTPUT
# 2021-05-11
# KSEI-3629/DIR/0521
# https://www.ksei.co.id/Announcement/Files/127505_ksei_3629_dir_0521_202105140513.pdf
# 2021-05-06
# KSEI-3512/DIR/0521
# https://www.ksei.co.id/Announcement/Files/127181_ksei_3512_dir_0521_202105070825.pdf
# 2021-05-05
# KSEI-3482/DIR/0521
# https://www.ksei.co.id/Announcement/Files/127076_ksei_3482_dir_0521_202105051506.pdf
# truncated...
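
As a side note on the date condition itself: comparing day-of-month numbers, as in range(FourDaysAgo_day, today) from the question, breaks whenever the window crosses a month boundary, while comparing full dates does not. Below is a minimal sketch of that idea using only the standard library. The names parse_listing_date and filter_by_window and the sample rows are hypothetical, made up for illustration. It assumes the rows hold the text already pulled out by a scraping loop like the one above, and that dates are written with English month names such as "May 11, 2021".

```python
from datetime import date, datetime, timedelta


def parse_listing_date(text):
    """Parse text such as 'May 11, 2021' into a date; return None if it does not match."""
    try:
        return datetime.strptime(text.strip(), "%B %d, %Y").date()
    except ValueError:
        return None


def filter_by_window(rows, end, days=4):
    """Keep (date_text, title, href) rows dated from `days` days before `end` up to `end`."""
    start = end - timedelta(days=days)
    kept = []
    for date_text, title, href in rows:
        published = parse_listing_date(date_text)
        # Skip rows whose date text could not be parsed.
        if published is not None and start <= published <= end:
            kept.append((published, title, href))
    return kept


# Hypothetical rows, shaped like the text a scraping loop would extract.
rows = [
    ("May 1, 2021", "Example notice A", "/files/a.pdf"),
    ("April 29, 2021", "Example notice B", "/files/b.pdf"),
    ("April 20, 2021", "Example notice C", "/files/c.pdf"),
    ("not a date", "Malformed row", "/files/d.pdf"),
]

# The window April 28 to May 2 crosses a month boundary.
recent = filter_by_window(rows, end=date(2021, 5, 2), days=4)
for published, title, href in recent:
    print(published.isoformat(), title, href)
```

In real use, end would probably be date.today(). Working with date objects instead of formatted strings also makes it easy to change the window size or add other conditions later.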
