How do I get data from a span in BeautifulSoup?

Published 2024-09-28 20:17:49


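Here is my code. I want to get the name and link of each location. The variable "lugares" finds several items-container elements, but I only want the first one, [0]. I then run a for loop over it, but I can't find the span class.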

from bs4 import BeautifulSoup
import requests

b=[]
i="https://www.vivanuncios.com.mx"
url = "https://www.vivanuncios.com.mx/s-renta-inmuebles/estado-de-mexico/v1c1098l1014p1"

encabezado = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36",'Accept-Language': 'en-US, en;q=0.5'}

page =requests.get(url,headers=encabezado)

soup = BeautifulSoup(page.content,"html.parser")

lugares = soup.find_all("div",{"class":"items-container"})

lugares=lugares[0]
print(len(lugares))

for lugar in lugares:
    
    locationlink = i + str(lugar.find("span",{"class":"item"}).find("a")["href"])

    location= lugar.find("span",{"class":"item"}).text
    a=[location,locationlink]
    
    b.append(a)

2 Answers

First, you need to get all the spans inside the first container, lugares[0].

Then iterate over each span to get the link and text of each location.

Code:

from bs4 import BeautifulSoup
import requests

b=[]
i="https://www.vivanuncios.com.mx"
url = "https://www.vivanuncios.com.mx/s-renta-inmuebles/estado-de-mexico/v1c1098l1014p1"

encabezado = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36",'Accept-Language': 'en-US, en;q=0.5'}

page =requests.get(url,headers=encabezado)

soup = BeautifulSoup(page.content,"html.parser")

lugares = soup.find_all("div",{"class":"items-container"})

#lugares=lugares[0]
print(len(lugares))

# get all spans
spans = lugares[0].find_all("span",{"class":"item"})

# iterate through each span
for span in spans: 
    # get location text
    location = span.find("a").text

    # build the absolute link from the base URL stored in i (scheme included)
    link = span.find("a")["href"]
    locationlink = f"{i}{link}"

    a = [location,locationlink]
    b.append(a)

print (b[0])

Output:

['Huixquilucan', 'https://www.vivanuncios.com.mx/s-renta-inmuebles/huixquilucan/v1c1098l10689p1']
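A likely reason the loop in the question fails can be sketched on made-up markup (the HTML below is an assumption, not the real page). Iterating over a Tag yields its direct children, and find() only searches below a tag, so calling find("span", ...) on a child that is itself the span returns None:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may be structured differently.
html = """
<div class="items-container">
  <span class="item"><a href="/s-renta-inmuebles/huixquilucan/v1c1098l10689p1">Huixquilucan</a></span>
  <span class="item"><a href="/s-renta-inmuebles/naucalpan/v1c1098l10710p1">Naucalpan</a></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"class": "items-container"})

# Iterating a Tag yields its direct children, whitespace text nodes included.
children = list(container)

# find_all searches the descendants and returns only the matching elements.
spans = container.find_all("span", {"class": "item"})

# find looks below a tag, never at the tag itself, so this is None.
nested = spans[0].find("span", {"class": "item"})

names = [span.find("a").text for span in spans]
print(len(children), len(spans), nested)
print(names)
```

Using find_all on the container avoids both problems, because it skips text nodes and returns the spans themselves.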

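There are several ways to achieve this; the best one depends on what you expect from the information and what you want to do with it afterwards.

First option

If you are only looking for the first location, you can do either of the following: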

lugar = soup.select_one('div.items-container a')   
b = [lugar.text, f'{i}{lugar["href"]}']

lugar = soup.select('div.items-container a')[0]
b = [lugar.text, f'{i}{lugar["href"]}']

Both select the first <a> inside the <div> with class items-container.

Output
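The two forms differ when nothing matches: select_one returns None, while select returns an empty list, so indexing it with [0] raises IndexError. A minimal sketch of a guard, using made-up markup and a hypothetical helper named first_link (neither comes from the original answer):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = '<div class="items-container"><span class="item"><a href="/a">A</a></span></div>'
soup = BeautifulSoup(html, "html.parser")
i = "https://www.vivanuncios.com.mx"


def first_link(soup, selector, base):
    """Return [text, absolute link] for the first match, or None if nothing matches."""
    tag = soup.select_one(selector)
    if tag is None:
        return None
    return [tag.text, f"{base}{tag['href']}"]


found = first_link(soup, "div.items-container a", i)
missing = first_link(soup, "div.no-such-class a", i)  # selector chosen to match nothing
print(found, missing)
```

Checking for None keeps the script from crashing if the site changes its class names or returns a different page.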

['Huixquilucan','https://www.vivanuncios.com.mx/s-renta-inmuebles/huixquilucan/v1c1098l10689p1']

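Alternative

If you want to get all of the information at once, use a list of dicts, so that later you only need to iterate over it with everything ready to use: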

[{'name':x.text, 'link':f'{i}{x["href"]}'} for x in soup.select('div.items-container a')]

Output

[{'name': 'Huixquilucan',
  'link': 'https://www.vivanuncios.com.mx/s-renta-inmuebles/huixquilucan/v1c1098l10689p1'},
 {'name': 'Naucalpan',
  'link': 'https://www.vivanuncios.com.mx/s-renta-inmuebles/naucalpan/v1c1098l10710p1'},
 {'name': 'Atizapán',
  'link': 'https://www.vivanuncios.com.mx/s-renta-inmuebles/atizapan/v1c1098l10662p1'},
 {'name': 'Metepec',
  'link': 'https://www.vivanuncios.com.mx/s-renta-inmuebles/metepec-edomex/v1c1098l10707p1'},...]
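The f-string concatenation above works as long as every href is root-relative. If some hrefs might already be absolute (an assumption, the answer does not say), urllib.parse.urljoin from the standard library handles both cases. A small sketch, where absolute is a hypothetical helper name:

```python
from urllib.parse import urljoin

base = "https://www.vivanuncios.com.mx"


def absolute(href):
    """Resolve href against the site base; absolute URLs pass through unchanged."""
    return urljoin(base, href)


relative = absolute("/s-renta-inmuebles/naucalpan/v1c1098l10710p1")
already_absolute = absolute("https://www.vivanuncios.com.mx/s-renta-inmuebles/naucalpan/v1c1098l10710p1")
print(relative)
print(already_absolute)
```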

Example (showing the results of both)

from bs4 import BeautifulSoup
import requests

i="https://www.vivanuncios.com.mx"
url = "https://www.vivanuncios.com.mx/s-renta-inmuebles/estado-de-mexico/v1c1098l1014p1"

encabezado = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36",'Accept-Language': 'en-US, en;q=0.5'}

page =requests.get(url,headers=encabezado)
soup = BeautifulSoup(page.content,"html.parser")

lugar = soup.select_one('div.items-container a')
b = [lugar.text, f'{i}{lugar["href"]}']
print(f'First lugar:\n {b} \n')

## or alternative option

allLugaros = [{'name':x.text, 'link':f'{i}{x["href"]}'} for x in soup.select('div.items-container a')]

print(f'First lugar from lugaros (list of dict):\n {allLugaros[0]} \n')
print(f'All lugaros as list of dict:\n {allLugaros} \n')
