Scraping data from URLs: how to retrieve all URL pages when some ids are missing and the final page id is unknown

import urllib2
from bs4 import BeautifulSoup

web_page = "http://www.signalpeptide.de/index.php?sess=&m=listspdb_mammalia&s=details&id=" + id_name + "&listname="
page = urllib2.urlopen(web_page)
soup = BeautifulSoup(page, 'html.parser')

1 Answer

User

#1 · Posted 2024-06-25 23:27:22

To get the possible pages, you can do something like the following (my example is Python 3):

import re
from urllib.request import urlopen
from lxml import html

ITEMS_PER_PAGE = 50

base_url = 'http://www.signalpeptide.de/index.php'
url_params = '?sess=&m=listspdb_mammalia&start={}&orderby=id&sortdir=asc'


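def get_pages(total):
    # Offsets for the `start` parameter: 50, 100, ... up to (not including) total.
    # Note that the first page (presumably start=0) is not part of this list.
    pages = list(range(ITEMS_PER_PAGE, total, ITEMS_PER_PAGE))
    # Add `total` as the final offset. Appending unconditionally also avoids
    # an IndexError on pages[-1] when total <= ITEMS_PER_PAGE.
    pages.append(total)
    return pages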

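def generate_links():
    # Fetch one listing page just to read the total number of records.
    start_url = base_url + url_params.format(ITEMS_PER_PAGE)
    page = urlopen(start_url).read()
    dom = html.fromstring(page)
    # The pagination text sits in the third cell of the first table row.
    xpath = '//div[@class="content"]/table[1]//tr[1]/td[3]/text()'
    pagination_text = dom.xpath(xpath)[0]
    # Take the token that follows "of " as the record count.
    total = int(re.findall(r'of\s(\w+)', pagination_text)[0])
    print(f'Number of records to scrape: {total}')
    pages = get_pages(total)
    # Lazily build one listing URL per offset.
    links = (base_url + url_params.format(i) for i in pages)
    return links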

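Basically, it fetches one listing page and reads the total number of records from it. Assuming each page holds 50 records, the get_pages() function computes the offsets to pass to the start parameter, and generate_links() builds all the pagination URLs from them. You then need to fetch all of those pages, iterate over the table of proteins on each one, and follow each entry to its details page to extract the information you need with BeautifulSoup or with lxml and XPath. I tried fetching all of these pages concurrently with asyncio, but the server timed out :). Hope my function helps!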
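
The "iterate over the table, then go to the details page" step could look roughly like the sketch below. It only uses the standard library, and it rests on an assumption I have not checked against the live site: that each listing page links to its records with the same query string as in the question (s=details&id=...). The names DetailLinkParser, extract_detail_ids and detail_url are made up for this example. Reading the ids from the listing pages means you never have to guess which ids are missing or where they end.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

BASE_URL = 'http://www.signalpeptide.de/index.php'
# Same query string as in the question, with a placeholder for the id.
DETAILS_PARAMS = '?sess=&m=listspdb_mammalia&s=details&id={}&listname='


class DetailLinkParser(HTMLParser):
    """Collect the `id` value of every link that points at a details page."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        query = parse_qs(urlparse(href).query)
        # Assumption: detail links carry s=details and a non-empty id.
        if query.get('s') == ['details'] and query.get('id'):
            self.ids.append(query['id'][0])


def extract_detail_ids(listing_html):
    """Return the record ids linked from one listing page, in page order."""
    parser = DetailLinkParser()
    parser.feed(listing_html)
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(parser.ids))


def detail_url(record_id):
    """Build the details-page URL for one record id."""
    return BASE_URL + DETAILS_PARAMS.format(record_id)
```

You would call extract_detail_ids() on the decoded HTML of every URL from generate_links() and then request detail_url(i) for each id, one at a time and with a short pause between requests, since firing everything at once is what made the server time out for me.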
