Why are some HTML tags invisible when scraping?

Posted 2024-09-30 18:32:38


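I am trying to extract text from here and put it straight into an Excel worksheet instead of copying and pasting. The site uses HTML to encode information about the original typography. Here is an example of how one line of text is encoded on the page: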

<div class="line">
    <span class="milestone_wrap"> </span>
    <a id="tln-2212" href="index.html#tln-2212" class="milestone tln invisible" title="TLN: 2212">2212</a>
    <span class="milestone_wrap">When </span>
    <span class="typeform" data-setting="ſ">s</span>
    <span class="milestone_wrap">uch ill dealing mu</span>
    <span class="ligature" data-precomposed="ſt">
        <span class="typeform" data-setting="ſ">s</span>
        <span class="milestone_wrap">t</span>
    </span>
    <span class="milestone_wrap"> be </span>
    <span class="typeform" data-setting="ſ">s</span>
    <span class="milestone_wrap">eene in thought. </span>
    <span class="sd exit">
        <span class="space" style="padding-right:1em;" xml:space="preserve"></span>
        <i>Exit</i>
        <span class="milestone_wrap">.</span>
    </span>
</div>

I tried using the find_all method:

import requests
from bs4 import BeautifulSoup as bs
url = 'https://internetshakespeare.uvic.ca/doc/R3_F1/scene/3.6/index.html'
page = requests.get(url)
text = bs(page.text, 'html.parser')

# Select every <div class="line"> and print each of its direct children.
divs = text.find_all('div', class_="line")
for div in divs:
    for item in div.contents:
        print(item)

This is what I get back:

When 
<span class="typeform" data-setting="ſ">s</span>
uch ill dealing mu
<span class="ligature" data-precomposed="ſt"><span class="typeform" data-setting="ſ">s</span>t</span>
 be 
<span class="typeform" data-setting="ſ">s</span>
eene in thought. 
<span class="sd exit"><span class="space" style="padding-right:1em;" xml:space="preserve"> </span><i>Exit</i>.</span>

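Everything wrapped in <span class="milestone_wrap"> comes back without its tag. As a result, when I call .find_all for 'span', those strings do not show up and I am left with only scattered letters. Is there a reason that class does not come through?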


2 answers

With a small adjustment to your code (the requests module has to be imported), you should get the content of the site.

from bs4 import BeautifulSoup as bs
import requests

url = 'https://internetshakespeare.uvic.ca/doc/R3_F1/scene/3.6/index.html'
page = requests.get(url)
text = bs(page.text, 'html.parser')

divs = text.find_all('div', class_="line")
for div in divs:
    for item in div.contents:
        print(item)

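The text can be found inside the <span class="milestone_wrap"> tags. You can check this with your browser's inspector. The text is delivered in small fragments, each in its own tag, sometimes only a few words or a single letter at a time. You should be able to extract the text.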
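To illustrate why find_all('span') misses most of the words, here is a minimal sketch that parses a hard-coded snippet shaped like the output shown in the question, rather than fetching the live page. The snippet and the variable names are assumptions made for illustration. In that output the words sit directly inside the div as bare text nodes, so a search for span tags never sees them, while get_text() collects every piece.

```python
from bs4 import BeautifulSoup, NavigableString

# Markup shaped like the output in the question (an assumption, not fetched live).
html = (
    '<div class="line">'
    '<a id="tln-2212" href="index.html#tln-2212" '
    'class="milestone tln invisible" title="TLN: 2212">2212</a>'
    'When <span class="typeform" data-setting="ſ">s</span>'
    'uch ill dealing mu'
    '<span class="ligature" data-precomposed="ſt">'
    '<span class="typeform" data-setting="ſ">s</span>t</span>'
    ' be <span class="typeform" data-setting="ſ">s</span>'
    'eene in thought.'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')
line = soup.find('div', class_='line')

# find_all('span') returns only tags, so unwrapped text is skipped.
span_texts = [span.get_text() for span in line.find_all('span')]

# The direct children are a mix of tags and bare text nodes.
bare_text = [str(child) for child in line.children
             if isinstance(child, NavigableString)]

# get_text() walks every descendant text node, wrapped or not.
full_text = line.get_text()

print(span_texts)
print(bare_text)
print(full_text)
```

The first list holds only the isolated letters, which matches the "random letters" described in the question. The full line, including the line number from the a tag, appears only in full_text.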

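Work at the level of the line class, but decompose the a tag so that the line numbers are removed (unless you really need them, in which case I would add a space between them and the text that follows).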

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://internetshakespeare.uvic.ca/doc/R3_F1/scene/3.6/index.html')
soup = bs(r.content, 'lxml')

for line in soup.select('.line'):
    # Remove the line-number anchor, if this line has one.
    anchor = line.select_one('a')
    if anchor is not None:
        anchor.decompose()
    print(line.text)
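If the line numbers are wanted after all, one option is to keep them in a separate column rather than gluing them to the text. The sketch below is only an illustration under stated assumptions: it parses a hard-coded snippet instead of the live page, the second line without a number is made up to exercise the guard, and line_rows is a hypothetical helper name. It writes the rows as CSV, a format Excel can open.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hard-coded sample (an assumption); the second line is invented for the demo.
html = (
    '<div class="line">'
    '<a id="tln-2212" href="index.html#tln-2212" '
    'class="milestone tln invisible" title="TLN: 2212">2212</a>'
    'When <span class="typeform" data-setting="ſ">s</span>'
    'uch ill dealing mu'
    '<span class="ligature" data-precomposed="ſt">'
    '<span class="typeform" data-setting="ſ">s</span>t</span>'
    ' be <span class="typeform" data-setting="ſ">s</span>'
    'eene in thought.'
    '</div>'
    '<div class="line">A line without a number</div>'
)


def line_rows(markup):
    """Return one (line_number, text) pair per element with class "line"."""
    soup = BeautifulSoup(markup, 'html.parser')
    rows = []
    for line in soup.select('.line'):
        anchor = line.select_one('a.tln')
        number = ''
        if anchor is not None:
            # Read the number first, then remove the tag from the tree.
            number = anchor.get_text(strip=True)
            anchor.decompose()
        rows.append((number, line.get_text().strip()))
    return rows


rows = line_rows(html)

# Write to an in-memory buffer; swap in open('lines.csv', 'w', newline='')
# (a hypothetical file name) to save a file instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Keeping the number in its own column avoids the "2212When" run-together that results from reading the whole line at once.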
