Problem scraping a specific website

import requests, sys

debug = {'verbose': sys.stderr}
user_agent = {'User-agent': 'Mozilla/5.0', 'Connection':'keep-alive'}

url = 'http://www.normattiva.it/atto/caricaArticolo?art.progressivo=0&art.idArticolo=1&art.versione=1&art.codiceRedazionale=047U0001&art.dataPubblicazioneGazzetta=1947-12-27&atto.tipoProvvedimento=COSTITUZIONE&art.idGruppo=1&art.idSottoArticolo1=10&art.idSottoArticolo=1&art.flagTipoArticolo=0#art'

r = requests.session()
s = r.get(url, headers=user_agent)
#print(s.text)
print(s.url)
print(s.headers)
print(s.request.headers)

2 answers

User

Answer 1 · edited 2024-10-01 07:15:48

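After clicking any link on the page with the developer tools open, look under the Doc tab of the Network panel:

You will see three links: the first is the one we clicked, the second returns the HTML that lets you jump to a specific article, and the last contains the article text.

In the source returned by the first link, you can see two iframe tags: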

<div id="alberoTesto">
        <iframe  
            src="/atto/caricaAlberoArticoli?atto.dataPubblicazioneGazzetta=2016-08-31&atto.codiceRedazionale=16G00182&atto.tipoProvvedimento=DECRETO LEGISLATIVO" 
            name="leftFrame" scrolling="auto" id="leftFrame" title="leftFrame" height="100%" style="width: 285px; float:left;" frameborder="0">
        </iframe>

        <iframe 
            src="/atto/caricaArticoloDefault?atto.dataPubblicazioneGazzetta=2016-08-31&atto.codiceRedazionale=16G00182&atto.tipoProvvedimento=DECRETO LEGISLATIVO" 
            name="mainFrame" id="mainFrame" title="mainFrame" height="100%" style="width: 800px; float:left;" scrolling="auto" frameborder="0">
        </iframe>

The first iframe holds the tree of articles; the second, with /atto/caricaArticoloDefault in its src and the id mainFrame, is the one we want.
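As a rough sketch of that step, the snippet below pulls the `src` of the `mainFrame` iframe out of the returned HTML and resolves it against the site root. It uses only the standard library (`html.parser` and `urllib.parse`) instead of bs4 so that it runs without extra packages. `IframeSrcFinder`, `main_frame_url` and `SAMPLE_HTML` are illustrative names, and the sample markup is a shortened copy of the excerpt above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.normattiva.it"

# Shortened copy of the iframe markup from the first response.
SAMPLE_HTML = """
<div id="alberoTesto">
  <iframe src="/atto/caricaAlberoArticoli?atto.codiceRedazionale=16G00182"
          name="leftFrame" id="leftFrame"></iframe>
  <iframe src="/atto/caricaArticoloDefault?atto.codiceRedazionale=16G00182"
          name="mainFrame" id="mainFrame"></iframe>
</div>
"""


class IframeSrcFinder(HTMLParser):
    """Collect the src attribute of every iframe, keyed by its id."""

    def __init__(self):
        super().__init__()
        self.sources = {}

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            attributes = dict(attrs)
            if "id" in attributes and "src" in attributes:
                self.sources[attributes["id"]] = attributes["src"]


def main_frame_url(html, base_url=BASE_URL):
    """Return the absolute URL of the iframe with id="mainFrame", or None."""
    finder = IframeSrcFinder()
    finder.feed(html)
    src = finder.sources.get("mainFrame")
    return urljoin(base_url, src) if src else None


print(main_frame_url(SAMPLE_HTML))
```

Returning `None` when the iframe is absent lets the caller skip pages that do not follow this layout instead of failing on a missing key.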

You need the cookies from the initial request, so use a Session object, and parse the pages with bs4:

[The answer's code block is missing from this copy; the reply below reposts it with the imports tidied.]
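The cookie handling can be sketched on its own, as follows. This is not the answer's original code: `fetch_article_html` and `main` are hypothetical helpers, the article URL is the `mainFrame` src from the excerpt above, and whether the site accepts exactly this two-request sequence has not been verified here. The helper only assumes that `session` behaves like `requests.Session`, which stores cookies from one response and sends them with later requests.

```python
def fetch_article_html(session, landing_url, article_url):
    """Fetch article_url after first visiting landing_url with the same session.

    `session` is expected to behave like requests.Session: cookies set by the
    first response are stored on it and sent with the second request.
    """
    session.get(landing_url)               # lets the site set its cookies
    response = session.get(article_url)    # reuses those cookies
    response.raise_for_status()            # fail loudly on 4xx/5xx
    return response.text


def main():
    # requests is third-party; imported here so the helper above has no dependencies.
    import requests

    article_url = (
        "http://www.normattiva.it/atto/caricaArticoloDefault"
        "?atto.dataPubblicazioneGazzetta=2016-08-31"
        "&atto.codiceRedazionale=16G00182"
        "&atto.tipoProvvedimento=DECRETO LEGISLATIVO"
    )
    with requests.Session() as session:
        session.headers.update({"User-Agent": "Mozilla/5.0"})
        html = fetch_article_html(session, "http://www.normattiva.it/", article_url)
        print(len(html))
```

Calling `main()` runs the sketch against the live site. Passing the session in as a parameter keeps the helper easy to exercise with a stand-in object when no network is available.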

A snippet of the first text file:

[The snippet is missing from this copy.]

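User

Answer 2 · edited 2024-10-01 07:15:48

Brilliant, brilliant, brilliant. It works. It only needed a small edit to tidy up the imports, but it works great. Thank you very much. I am only just discovering Python's potential, and you have made my journey easier. I could not have solved it on my own.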

import io
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

user_agent = {'User-agent': 'Mozilla/5.0', 'Connection': 'keep-alive'}

url = 'http://www.normattiva.it/atto/caricaArticolo?art.progressivo=0&art.idArticolo=1&art.versione=1&art.codiceRedazionale=047U0001&art.dataPubblicazioneGazzetta=1947-12-27&atto.tipoProvvedimento=COSTITUZIONE&art.idGruppo=1&art.idSottoArticolo1=10&art.idSottoArticolo=1&art.flagTipoArticolo=0#art'

with requests.Session() as s:
    s.headers.update(user_agent)
    r = s.get("http://www.normattiva.it/")
    soup = BeautifulSoup(r.content, "lxml")
    # Get all the links from the initial page.
    for a in soup.select("div.testo p a[href^=http]"):
        # Name the parser explicitly so bs4 does not have to guess one.
        page = BeautifulSoup(s.get(a["href"]).content, "lxml")
        # The link to the text is in an iframe tag returned by the previous get.
        text_src_link = page.select_one("#mainFrame")["src"]

        # The text is in a pre tag inside the div with the wrapper_pre class.
        # Fetch it before opening the file so a failed request leaves no empty file.
        text_page = s.get(urljoin("http://www.normattiva.it", text_src_link))
        text = BeautifulSoup(text_page.content, "html.parser")\
            .select_one("div.wrapper_pre pre").text

        # Pick something to make the names unique.
        with io.open(os.path.basename(text_src_link), "w", encoding="utf-8") as f:
            f.write(text)
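One caveat with `os.path.basename(text_src_link)`: the iframe src carries a query string, so the resulting file name contains `?` and `&`, and `?` is not allowed in Windows file names. A possible alternative is sketched below. `filename_for` is a hypothetical helper, and it assumes that the `atto.codiceRedazionale` and `atto.dataPubblicazioneGazzetta` parameters together identify an act, which the thread does not confirm.

```python
import re
from urllib.parse import parse_qs, urlsplit


def filename_for(src, suffix=".txt"):
    """Build a filesystem-safe file name from an iframe src such as
    /atto/caricaArticoloDefault?atto.codiceRedazionale=16G00182&..."""
    parts = urlsplit(src)
    query = parse_qs(parts.query)
    # Prefer the act's editorial code and publication date (an assumption
    # about what makes an act unique); fall back to the last path segment.
    code = query.get("atto.codiceRedazionale", [""])[0]
    date = query.get("atto.dataPubblicazioneGazzetta", [""])[0]
    stem = "_".join(p for p in (code, date) if p) or parts.path.rsplit("/", 1)[-1]
    # Replace anything outside a conservative character set.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", stem).strip("_") or "article"
    return stem + suffix


src = ("/atto/caricaArticoloDefault?atto.dataPubblicazioneGazzetta=2016-08-31"
       "&atto.codiceRedazionale=16G00182&atto.tipoProvvedimento=DECRETO LEGISLATIVO")
print(filename_for(src))  # 16G00182_2016-08-31.txt
```

In the loop above, `filename_for(text_src_link)` could be passed to `io.open` in place of `os.path.basename(text_src_link)`.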
