How to get the full link using BeautifulSoup

import urllib.request
from bs4 import BeautifulSoup

sub_site = "https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim"
response = urllib.request.urlopen(sub_site)
data = response.read()
soup = BeautifulSoup(data, 'lxml')
for link in soup.find_all('a'):
    url = link.get("href")
    print(url)

2 answers

User

#1 · edited 2024-06-30 07:47:16

Using select seems to print the links fine:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.fotoregistro.com.br/fotolivros/180-slim?cpmdsc=MOZAO')
soup = bs(r.content, 'lxml')
print([item['href'] for item in soup.select('.warp_lightbox')])

Use

print([item['href'] for item in soup.select('[href]')])

to get all links.

User

#2 · edited 2024-06-30 07:47:16

Let me focus on the specific part of the HTML that causes the problem:

<a class='warp_lightbox' title='Comprar' href='//www.fotoregistro.com.br/
navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'><img src='
//sh.digipix.com.br/subhomes/_lojas_consumer/paginas/fotolivro/img/180slim/vitrine/classic_01_tb.jpg' alt='slim' />
                              </a>

You can get it like this:

for link in soup.find_all('a', {'class':'warp_lightbox'}):
    url = link.get("href")
    break

You will find that url is:

'//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'

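At the beginning of the string you can see two important patterns:

// is a way to keep the current protocol (a protocol-relative URL)
\r is the ASCII carriage return (CR)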

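When you print it in a terminal, the carriage return moves the cursor back to the start of the line, so the rest of the string overwrites the following part and it appears to be lost: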

//www.fotoregistro.com.br/\r

If you need the raw string, you can use repr in the for loop:

print(repr(url))

You will get:

'//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'
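To see the effect in isolation, here is a minimal sketch with a shortened, hypothetical href (the `www.example.com` string is made up for illustration, not taken from the page). It shows that repr makes the carriage return visible, and that the string really consists of two halves which a terminal draws on top of each other.

```python
# Hypothetical short href with an embedded carriage return, mimicking the real one.
href = "//www.example.com/\rnavhome.php?lightbox"

# repr() escapes control characters, so the CR shows up as \r.
visible = repr(href)
print(visible)

# Splitting on the CR gives the part before it and the part after it.
before_cr, after_cr = href.split("\r", 1)
print(before_cr)
print(after_cr)
```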

If you need the path, you can replace the initial part:

base = 'www.fotoregistro.com.br/'

for link in soup.find_all('a', {'class':'warp_lightbox'}):
    url = link.get("href").replace('//www.fotoregistro.com.br/\r',base)
    print(url)

You will get:

www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/preview=true/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
.
.
.
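If a full link including the scheme is wanted, one possible alternative to the hard-coded replace is to strip the control characters and resolve the href against the page URL with the standard library's `urllib.parse.urljoin`. This is only a sketch: the helper name `absolute_link` is hypothetical, and the href below is a shortened version of the real one.

```python
from urllib.parse import urljoin


def absolute_link(page_url, href):
    """Resolve href against page_url after dropping CR, LF and tab characters."""
    cleaned = href.replace("\r", "").replace("\n", "").replace("\t", "").strip()
    return urljoin(page_url, cleaned)


# Page URL from the first answer; the href is shortened for illustration.
page_url = "https://www.fotoregistro.com.br/fotolivros/180-slim?cpmdsc=MOZAO"
href = "//www.fotoregistro.com.br/\rnavhome.php?lightbox&cpmdsc=MOZAO"

full = absolute_link(page_url, href)
print(full)
```

Because the href starts with //, urljoin keeps its host and path and only borrows the scheme (https) from the page URL.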

Without specifying the class:

for link in soup.find_all('a'):
    url = link.get("href")
    print(repr(url))
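When looping over every a tag, `link.get("href")` returns None for anchors that have no href attribute, so it may help to filter those out before cleaning. A small sketch, assuming the hrefs have already been collected; the helper name `clean_hrefs` and the sample values are hypothetical.

```python
def clean_hrefs(hrefs):
    """Drop missing or empty hrefs and remove embedded CR/LF characters."""
    return [h.replace("\r", "").replace("\n", "") for h in hrefs if h]


# Sample values standing in for link.get("href") results.
sample = [None, "//www.example.com/\rnavhome.php", "", "#top"]
cleaned = clean_hrefs(sample)
print(cleaned)
```

With BeautifulSoup this could be called as `clean_hrefs(link.get("href") for link in soup.find_all('a'))`.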
