Stuck getting all the links with requests because of a View More button

Published 2024-09-21 03:20:17


I've created a script to grab the links to different containers from two similar pages. For the first page of results the script works fine. However, at the bottom there is a View More button with no link attached to it, so I can't fetch the rest of the results using requests. For clarity, the image below shows the first container on the first link.

(screenshot of the first container on the first page)

I've tried:

import requests
from bs4 import BeautifulSoup

base = 'https://hipages.com.au{}'

links = (
    'https://hipages.com.au/find/antenna_services/sa/adelaide',
    'https://hipages.com.au/find/antenna_services/vic/melbourne'
)

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'  
    for link in links:
        r = s.get(link)
        soup = BeautifulSoup(r.text,"html5lib")
        for item in soup.select("[class*='BusinessListingHeaderColumn'] a:has(> h3)[href]"):
            print(base.format(item.get("href")))
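As an aside, `base.format(...)` assumes every href is site-relative; the standard library's `urljoin` handles both relative and already-absolute hrefs. A minimal sketch (using made-up paths in the shape of the expected output):

```python
from urllib.parse import urljoin

base = 'https://hipages.com.au'

# urljoin resolves site-relative hrefs against the base and
# leaves already-absolute hrefs untouched
url = urljoin(base, '/connect/cinemaathome')
print(url)  # https://hipages.com.au/connect/cinemaathome
print(urljoin(base, 'https://hipages.com.au/connect/mrcommunications'))
```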

How can I get the links from all of the containers, making use of the View More button, with requests?

This is the kind of output I'm after:

https://hipages.com.au/connect/cinemaathome
https://hipages.com.au/connect/mrcommunications
https://hipages.com.au/connect/adelaidevideoscreens

1 Answer

requests is probably not the best tool for this kind of job, because you'd have to keep dynamically adding more content to the page.

One workaround is to use the API, since there is one. However, I found a couple of problems with that request. For example:

https://hipages.com.au/api/directory/sites?suburb=adelaide&state=sa&category=145&page=2&perpage=10&code=c237ab1d599590b23f25c822a43c74528d7d55182331509852906e86cf0710b1c1d72087cbbbaa1f4ff8dcb50c9f234e
  1. You have to somehow map the category=145 value to its name
  2. I couldn't figure out where the code part comes from
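For reference, the query string of that API URL breaks down into the following parameters (a stdlib sketch over the exact URL above):

```python
from urllib.parse import urlparse, parse_qs

api_url = (
    'https://hipages.com.au/api/directory/sites'
    '?suburb=adelaide&state=sa&category=145&page=2&perpage=10'
    '&code=c237ab1d599590b23f25c822a43c74528d7d55182331509852906e86cf0710b1c1d72087cbbbaa1f4ff8dcb50c9f234e'
)

# parse_qs maps each query key to a list of its values
params = parse_qs(urlparse(api_url).query)
for key, values in params.items():
    print(key, '=', values[0])
```

The page and perpage parameters are what make this API attractive for pagination; category and code are the two problem values noted above.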

The other workaround is to keep clicking the View More button until there is no such button left, and then scrape all URLs matching the CSS selector from the "final" version of the page.

Repeat for the next URL, and so on.

How to do that? Enter selenium.

Also, to run this you'll need, in addition to the selenium module, the Chrome driver. See this for installation instructions.

The code:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException


base = 'https://hipages.com.au{}'
links = (
    'https://hipages.com.au/find/antenna_services/sa/adelaide',
    'https://hipages.com.au/find/antenna_services/vic/melbourne'
)

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

for link in links:
    print(f"Fetching connect links for {link}...")
    driver.get(link)

    while True:
        try:
            # '.jRFKbg' is the auto-generated class of the View More button
            driver.find_element_by_css_selector('.jRFKbg').click()
            time.sleep(2)  # wait for more content to load
        except NoSuchElementException:
            break
    follow_links = BeautifulSoup(
        driver.page_source,
        "html5lib",
    ).select("[class*='BusinessListingHeaderColumn'] a:has(> h3)[href]")

    for follow_link in follow_links:
        print(base.format(follow_link.get("href")))

driver.quit()

This outputs (for Adelaide):

https://hipages.com.au/connect/cinemaathome
https://hipages.com.au/connect/mrcommunications
https://hipages.com.au/connect/adelaidevideoscreens
https://hipages.com.au/connect/justantennas
https://hipages.com.au/connect/celciustechnicalservices
https://hipages.com.au/connect/ljhelectricalsolutions
https://hipages.com.au/connect/voltechservices
https://hipages.com.au/connect/comtelecom
https://hipages.com.au/connect/homedigitalsystems
https://hipages.com.au/connect/samedaytvantennaservice
https://hipages.com.au/connect/pheds
https://hipages.com.au/connect/parksidedigitaltvservice
https://hipages.com.au/connect/antennatoday
https://hipages.com.au/connect/getfusedelectricalptyltd
https://hipages.com.au/connect/switchedonwiring
https://hipages.com.au/connect/lightningelectricalsolutionssa
https://hipages.com.au/connect/evanselectricalandair
https://hipages.com.au/connect/lynchelec
https://hipages.com.au/connect/asapantennas
https://hipages.com.au/connect/spacetelecommunications
https://hipages.com.au/connect/outpulseelectrical
https://hipages.com.au/connect/markgentle
https://hipages.com.au/connect/matchmastertvreceptionsystems
https://hipages.com.au/connect/empireelectricalsa
https://hipages.com.au/connect/sasecureservices
https://hipages.com.au/connect/powerlux
https://hipages.com.au/connect/ecolightselectrical
https://hipages.com.au/connect/kdiselectricalandairconditioningservices
https://hipages.com.au/connect/jptelecomptyltd
https://hipages.com.au/connect/bhullarelectricalsandsolar
https://hipages.com.au/connect/tkhelectrical
https://hipages.com.au/connect/njstechnologies
https://hipages.com.au/connect/apexelectricalsolarservices
https://hipages.com.au/connect/handymanservice6
https://hipages.com.au/connect/pricelesselectricalptyltd
https://hipages.com.au/connect/zaccelectrical
https://hipages.com.au/connect/adelaidehometheatre
https://hipages.com.au/connect/wescombeelectrical
https://hipages.com.au/connect/batterselectrical
https://hipages.com.au/connect/avanditconnections
https://hipages.com.au/connect/andersonelectric
https://hipages.com.au/connect/tappelectrical
https://hipages.com.au/connect/smartgridelectrical
https://hipages.com.au/connect/scothernselectricaldataservicesptyltd
https://hipages.com.au/connect/nexuselectricalairconditioning
https://hipages.com.au/connect/apcelectrical
https://hipages.com.au/connect/paultompkinselectricalcontracting
https://hipages.com.au/connect/aaronlampreelectricalservices
https://hipages.com.au/connect/sparrowelectricalandconstructionservices
https://hipages.com.au/connect/djairelectrical
https://hipages.com.au/connect/localchoiceelectrical
https://hipages.com.au/connect/knicelectricalservicesptyltd

Edit:

This is based on your own answer, which you shared with me. Basically, you couldn't break out of the loop because the API keeps serving you the last page, even though it's the same page.

So, we need to know when we've already seen a page, or rather its links. Here's my attempt, which boils down to checking whether any potential link from the API is already in the list of all follow links. If so, we've already seen this API page, and it's time to move on to the next URL.
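
The stopping logic can be illustrated in isolation with fake API pages (made-up data; the overlap check is the same one used in the script):

```python
# Simulated API responses: past the last real page, the server
# keeps returning the final page again and again
pages = [
    ['a', 'b'],
    ['c', 'd'],
    ['c', 'd'],  # "page 3" is really page 2 repeated
]

seen = []
for page in pages:
    # stop as soon as any link on this page was already collected
    if any(link in seen for link in page):
        break
    seen.extend(page)

print(seen)  # ['a', 'b', 'c', 'd']
```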

import re
import requests

lead_link = "https://hipages.com.au/connect/"
links = (
    "https://hipages.com.au/find/antenna_services/sa/adelaide",
    "https://hipages.com.au/find/antenna_services/vic/melbourne",
)

all_follow_links = []
for link in links:
    r = requests.get(link)
    print(f"Getting links for {link}...")
    payload = {
        "suburb": link.split("/")[-1],
        "state": link.split("/")[-2],
        "category": re.search(r'category_id":(.*?),', r.text).group(1),
        "page": 1,
        "perpage": 10,
        "code": re.search(r'"code":"(.*?)",', r.text).group(1),
    }
    while True:
        response = requests.get(
            'https://hipages.com.au/api/directory/sites',
            params=payload,
        ).json()
        leads = [f"{lead_link}{item['siteKey']}" for item in response]
        if any(lead in all_follow_links for lead in leads):
            break
        all_follow_links.extend(leads)
        payload["page"] += 1

print(all_follow_links)
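The two re.search calls assume the listing page embeds the category id and the code value in its markup. On a made-up fragment shaped like that (hypothetical HTML, but the exact regexes from the script), they behave as follows:

```python
import re

# hypothetical fragment mimicking the JSON embedded in the listing page
html = '{"category_id":145,"code":"c237ab1d5995","suburb":"adelaide"}'

# same patterns as in the script above
category = re.search(r'category_id":(.*?),', html).group(1)
code = re.search(r'"code":"(.*?)",', html).group(1)
print(category, code)  # 145 c237ab1d5995
```

If hipages ever changes how it embeds that data, these patterns would need updating, so the group(1) calls are the likely point of failure.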
