Trouble handling links with different pagination structures

Posted 2024-05-20 08:20:45


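I've written a Python script that scrapes the titles of the different items listed in the right-hand area next to the map on the landing page. The script uses two links: one has pagination and the other does not.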

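When I run the script, it first checks for a pagination link. If it finds one, it passes the link to get_paginated_info() and the results are printed there. If it doesn't find one, it passes the soup object to get_info() and the results are printed there instead. The script currently behaves exactly as described.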

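How can I make the script print the results only inside get_info(), whether or not a link has pagination, while keeping the logic I've already applied? I'd like to remove get_paginated_info() from the script altogether.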

This is my attempt so far:

import requests 
from bs4 import BeautifulSoup
from urllib.parse import urljoin

urls = (
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
)

def get_names(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    items = soup.select_one(".pagination a.next_page")
    if items:
        npagelink = items.find_previous_sibling().get("href").split("/")[-1]
        return [get_paginated_info(link + "/page/{}".format(page)) for page in range(1, int(npagelink) + 1)]

    else:
        return [get_info(soup)]

def get_info(soup):
    print("================links without pagination==============")
    for items in soup.select("td[class='table-row-price']"):
        item = items.select_one("h2 a").text
        print(item)

def get_paginated_info(url):
    r = requests.get(url)
    sauce = BeautifulSoup(r.text,"lxml")
    print("================links with pagination==============")
    for content in sauce.select("td[class='table-row-price']"):
        title = content.select_one("h2 a").text
        print(title)

if __name__ == '__main__':
    for url in urls:
        get_names(url)

Any better design for handling the different kinds of links would be highly appreciated.


1 answer
Anonymous user
#1 · Posted 2024-05-20 08:20:45

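I changed the logic slightly. Now get_names calls get_info in both cases, with and without pagination. In the second case, the for loop simply runs a single iteration.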

import requests
from bs4 import BeautifulSoup

urls = (
    'https://www.mobilehome.net/mobile-home-park-directory/maine/all',
    'https://www.mobilehome.net/mobile-home-park-directory/rhode-island/all',
)

def get_names(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text,"lxml")
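    # The "next page" anchor only exists on listings that are paginated.
    items = soup.select_one(".pagination a.next_page")
    try:
        # The element just before "next" links to the last page; its href ends with the page count.
        npagelink = items.find_previous_sibling().get("href").split("/")[-1]
    except AttributeError:
        # items is None when there is no pagination, so treat the listing as a single page.
        npagelink = 1
    # get_info prints the titles and returns None, so this is a list of None values.
    return [get_info(link + "/page/{}".format(page)) for page in range(1, int(npagelink) + 1)]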


def get_info(url):
    r = requests.get(url)
    sauce = BeautifulSoup(r.text,"lxml")
    for content in sauce.select("td[class='table-row-price']"):
        title = content.select_one("h2 a").text
        print(title)

if __name__ == '__main__':
    for url in urls:
        get_names(url)

Please check the output carefully to make sure everything works as expected.
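
If it helps, the "fall back to one page" idea can be pulled out into small helpers that are easy to check without hitting the site. This is only a sketch: last_page_from_href and build_page_urls are hypothetical names that are not part of the script above, and it assumes the last-page href ends in the page number (as the split("/")[-1] logic implies) and that an unpaginated listing can also be reached at "/page/1".

```python
def last_page_from_href(href):
    """Return the page number at the end of a pagination href, or 1 if there is none."""
    if not href:
        # No pagination link was found: treat the listing as a single page.
        return 1
    tail = href.rstrip("/").split("/")[-1]
    # Assumption: the last path segment of the href is the page number.
    return int(tail) if tail.isdigit() else 1


def build_page_urls(link, href=None):
    """Build one URL per page; a link without pagination yields exactly one URL."""
    last_page = last_page_from_href(href)
    return ["{}/page/{}".format(link, page) for page in range(1, last_page + 1)]
```

In get_names you would pass the href of the last-page anchor (or None when select_one finds nothing) and then call get_info on each URL that comes back, which keeps all the printing inside get_info.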
