Handling pagination with BeautifulSoup using href

Posted 2024-10-01 00:32:14


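I am trying to scrape all the events from https://www.onthisday.com/events/february/5. I can already get every event on the first page. How can I fetch the remaining events from the second page and merge everything into one list?

Right now I capture the next-page link and parse that page, but I still cannot get the results beyond the first page.

Here is my code: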

from typing import List
import requests as _requests
import bs4 as _bs4

def _generate_url(month: str, day: int) -> str:
    url = f'https://www.onthisday.com/events/{month}/{day}'
    return url

def _get_page(url: str) -> _bs4.BeautifulSoup:
    _page = _requests.get(url)
    soup = _bs4.BeautifulSoup(_page.content, 'html.parser')
    return soup

def events_of_the_day(month: str, day: int) -> List[str]:
    """
    Return the events of a given day
    """
    
    url = _generate_url(month, day)
    page = _get_page(url)
    next_link = page.select_one("a.pag__next")
    raw_events = [event.text for event in page.select("li.event")]
    if next_link:
        next_url = 'https://www.onthisday.com/events'+next_link['href']
        page_next = _get_page(next_url)
        for eve in page_next.select("li.event"):
            print(eve.text)
    
    #print(raw_events)
    

events_of_the_day("february", 5)

Note:

Some pages have a next page and some do not, so I want to handle both cases.


2 Answers

Looking at the page, the "Next" button is just a link:

<a href="/events/february/5?p=2" class="pag__next" rel="next">
  <span>Next</span>
</a>

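Note the link /events/february/5?p=2. All you need to do is iterate over a range of page numbers and make a request for each one. Whenever you hit a 404, exit the loop. I'll leave the loop to you.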
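The href in that link is root-relative (it starts with /events), which is probably why the question's next_url does not work: prefixing it with 'https://www.onthisday.com/events' repeats the /events segment. A minimal sketch of the difference, assuming the page URL from the question as the base and using urljoin from the standard library:

```python
from urllib.parse import urljoin

# Page the link was found on, and the href copied from the snippet above
base = "https://www.onthisday.com/events/february/5"
href = "/events/february/5?p=2"

# String concatenation onto a prefix that already ends in /events doubles the segment
broken = "https://www.onthisday.com/events" + href

# urljoin resolves a root-relative href against the URL of the page it came from
fixed = urljoin(base, href)

print(broken)  # https://www.onthisday.com/events/events/february/5?p=2
print(fixed)   # https://www.onthisday.com/events/february/5?p=2
```

Resolving against the current page URL also keeps working if the site ever switches to relative or absolute hrefs.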

Edit

import requests

seen = set()  # response bodies fetched so far
i = 1
while True:
    res = requests.get(f"https://www.onthisday.com/events/february/5?p={i}")
    # Stop on an error status (e.g. 404) or when the site repeats a page already seen
    if res.status_code != 200 or res.content in seen:
        break
    ...  # parse res.content here
    seen.add(res.content)
    i += 1  # move on to the next page

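After hours of searching and debugging, I finally came up with a solution that handles pagination dynamically, whether or not the day has more than one page. All I had to do was create a while loop and check for next_link. That's it.

Code: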

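def events_of_the_day(month: str, day: int) -> List[str]:
    """Return the events of a given day, following every "next" link."""
    events = []
    url = _generate_url(month, day)
    while True:
        page = _get_page(url)
        for event in page.select('li.event'):
            events.append(event.text)
        # The last (or only) page has no a.pag__next link
        next_link = page.select_one('a.pag__next')
        if not next_link:
            break
        url = 'https://www.onthisday.com' + next_link['href']
    return events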
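The same next-link loop can be tried without touching the network. The sketch below is one possible arrangement, not the poster's exact code: collect_events, fetch_html, max_pages, and FAKE_PAGES are hypothetical names introduced for illustration, and the page contents are made up. Only the li.event and a.pag__next selectors come from the thread.

```python
from typing import Callable, List
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_events(start_url: str,
                   fetch_html: Callable[[str], str],
                   max_pages: int = 50) -> List[str]:
    """Follow a.pag__next links from start_url and merge all li.event texts."""
    events: List[str] = []
    seen = set()
    url = start_url
    # The seen set and max_pages are safety guards (assumptions, not from the thread)
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        soup = BeautifulSoup(fetch_html(url), "html.parser")
        events.extend(li.get_text(strip=True) for li in soup.select("li.event"))
        next_link = soup.select_one("a.pag__next")
        # No next link means this was the last (or only) page
        url = urljoin(url, next_link["href"]) if next_link else None
    return events


# Two canned pages standing in for real responses (made-up content)
FAKE_PAGES = {
    "https://www.onthisday.com/events/february/5":
        '<ul><li class="event">Event A</li></ul>'
        '<a href="/events/february/5?p=2" class="pag__next" rel="next">'
        '<span>Next</span></a>',
    "https://www.onthisday.com/events/february/5?p=2":
        '<ul><li class="event">Event B</li></ul>',
}

print(collect_events("https://www.onthisday.com/events/february/5",
                     FAKE_PAGES.__getitem__))
# ['Event A', 'Event B']
```

For live pages, a fetcher such as `lambda u: requests.get(u, timeout=10).text` could be passed instead of the dictionary lookup.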
