How do I parse the links on an HTML page?

Published 2024-05-18 14:51:21


I want to parse the list of links from this website.

I am trying to do this with the requests library in Python. However, when I read the HTML with bs4, there are no links there, just an empty ul:

<ul class="ais-Hits-list"></ul>

How do I get these links?

Edit: the code I have tried so far:

import requests
from bs4 import BeautifulSoup

link = "https://www.over-view.com/digital-index/"
r = requests.get(link)
soup = BeautifulSoup(r.content, 'lxml')
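The scrape comes up empty because the raw HTML only ships the empty container; the list items are injected afterwards by JavaScript in the browser. A minimal sketch of the same check against the static markup:

```python
from bs4 import BeautifulSoup

# The HTML served by the site contains only the empty container;
# the <li> entries are rendered later by JavaScript in the browser.
static_html = '<ul class="ais-Hits-list"></ul>'
soup = BeautifulSoup(static_html, "html.parser")
print(soup.select("ul.ais-Hits-list > li > a"))  # [] -- no links in the static page
```

So the answer has to either call the backend API directly or render the page in a real browser, which is what the two answers below do.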

2 Answers

There is also a fancier way (don't judge too harshly, this was my first try at the approach): you can make the same requests to the API that the frontend makes. As a bonus, thanks to asyncio + aiohttp, this code runs asynchronously.

Keep in mind that I iterate over an arbitrary number of pages and do not handle possible errors (you will need to fine-tune that yourself).

The code, without Selenium WebDriver:

import json
import asyncio

import aiohttp

URL = "https://ai7o5ij8d5-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia for JavaScript (3.35.1); Browser (lite); react (16.13.1); react-instantsearch (5.7.0); JS Helper (2.28.0)&x-algolia-application-id=AI7O5IJ8D5&x-algolia-api-key=7f1a509e834f885835edcfd3482b990c"


async def scan_single_digital_index_page(page_num, session):
    body = {
        "requests": [
            {
                "indexName": "overview",
                "params": f"query=&hitsPerPage=30&maxValuesPerFacet=10&page={page_num}&highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&facets=%5B%22_tags.name%22%5D&tagFilters=",
            }
        ]
    }
    async with session.post(URL, json=body) as resp:
        received_data = await resp.json()
        results = received_data.get("results")
        hits = results[0].get("hits")
        links = list()
        for hit in hits:
            for key, value in hit.items():
                if key == "slug":
                    links.append("https://www.over-view.com/overviews/" + value)
        return links


async def scan_all_digital_index_pages(session):
    tasks = list()
    max_pages = 20
    for page_num in range(1, max_pages):
        task = asyncio.create_task(scan_single_digital_index_page(page_num, session))
        tasks.append(task)
    all_lists = await asyncio.gather(*tasks)
    # Unpack all lists with links into a single set of all links.
    all_links = set()
    for l in all_lists:
        all_links.update(l)
    return all_links


async def main():
    async with aiohttp.ClientSession() as session:
        all_links = await scan_all_digital_index_pages(session)
        for link in all_links:
            print(link)


if __name__ == "__main__":
    asyncio.run(main())

Example results for the first page:

https://www.over-view.com/overviews/adelaide-canola-flowers
https://www.over-view.com/overviews/adelaide-rift-complex
https://www.over-view.com/overviews/adriatic-tankers
https://www.over-view.com/overviews/adventuredome
https://www.over-view.com/overviews/agricultural-development
https://www.over-view.com/overviews/agricultural-development
https://www.over-view.com/overviews/agricultural-development
https://www.over-view.com/overviews/agriculture-development
https://www.over-view.com/overviews/akimiski-island
https://www.over-view.com/overviews/al-falah-housing-project
https://www.over-view.com/overviews/alabama-tornadoes
https://www.over-view.com/overviews/alakol-lake
https://www.over-view.com/overviews/albenga
https://www.over-view.com/overviews/albuquerque-baseball-complex
https://www.over-view.com/overviews/alta-wind-energy-center
https://www.over-view.com/overviews/altocumulus-clouds
https://www.over-view.com/overviews/amsterdam
https://www.over-view.com/overviews/anak-krakatoa-eruption-juxtapose
https://www.over-view.com/overviews/ancient-ruins-of-palmyra
https://www.over-view.com/overviews/andean-mountain-vineyards
https://www.over-view.com/overviews/angas-inlet-trees
https://www.over-view.com/overviews/angkor-wat
https://www.over-view.com/overviews/ankara-residential-development
https://www.over-view.com/overviews/antofagasta-chile
https://www.over-view.com/overviews/apple-park
https://www.over-view.com/overviews/aquatica-water-park
https://www.over-view.com/overviews/aral-sea
https://www.over-view.com/overviews/arc-de-triomphe
https://www.over-view.com/overviews/arecibo-observatory
https://www.over-view.com/overviews/arizona-rock-formations

For future changes (since there are many moving parts), you can find the details of their API in your browser's web console 👇

(Screenshot: the web console in the Firefox browser)
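If you only need a page or two and would rather skip asyncio, the same Algolia call can be made synchronously with requests. A sketch, reusing the endpoint and slug logic from the code above (the parameter string is trimmed to the essentials, so treat it as an assumption to verify against the web console):

```python
import requests

# Same Algolia endpoint and credentials as in the async code above.
URL = ("https://ai7o5ij8d5-dsn.algolia.net/1/indexes/*/queries"
       "?x-algolia-application-id=AI7O5IJ8D5"
       "&x-algolia-api-key=7f1a509e834f885835edcfd3482b990c")

def build_body(page_num):
    # Request body for one results page; params trimmed to the essential ones.
    return {
        "requests": [
            {"indexName": "overview",
             "params": f"query=&hitsPerPage=30&page={page_num}"}
        ]
    }

def fetch_page_links(page_num):
    # One blocking POST per page; no concurrency, no error fine-tuning.
    resp = requests.post(URL, json=build_body(page_num), timeout=10)
    resp.raise_for_status()
    hits = resp.json()["results"][0]["hits"]
    return ["https://www.over-view.com/overviews/" + hit["slug"] for hit in hits]
```

For twenty pages the async version above will be noticeably faster, since it fires all requests concurrently instead of one after another.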

Since the information on this site is loaded dynamically, you can use selenium to collect the required information:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--window-size=1920,1080")

path_to_chromedriver = 'chromedriver'
driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=path_to_chromedriver)

driver.get('https://www.over-view.com/digital-index/')

time.sleep(5)

soup = BeautifulSoup(driver.page_source, "lxml")
rows = soup.select("ul.ais-Hits-list > li > a")

for row in rows:
    print(row.get('href'))

Example output:

/overviews/adelaide-canola-flowers
/overviews/adelaide-rift-complex
/overviews/adriatic-tankers
/overviews/adventuredome
/overviews/agricultural-development
/overviews/agricultural-development
/overviews/agricultural-development
/overviews/agriculture-development
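Note that the href attributes come back as relative paths. urllib.parse.urljoin from the standard library turns them into absolute URLs against the page you loaded:

```python
from urllib.parse import urljoin

# Resolve root-relative hrefs against the page that was scraped.
base = "https://www.over-view.com/digital-index/"
paths = ["/overviews/adelaide-canola-flowers", "/overviews/adriatic-tankers"]
links = [urljoin(base, p) for p in paths]
print(links[0])  # https://www.over-view.com/overviews/adelaide-canola-flowers
```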
