How to store URLs from BeautifulSoup results into a list and then into a table

Posted 2024-09-30 01:33:29


I am scraping a real-estate web page, trying to get some URLs and then create a table from them. https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html I have spent a lot of time trying to 1) store the results in a list or dict and then 2) create the table, but I am really stuck.

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

source = requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html').text
soup = BeautifulSoup(source, 'lxml')


#Extract URL 
link_text = ''
URL=[]
PlacesDf = pd.DataFrame(columns=['Address', 'Location.lat', 'Location.lon'])
for a in soup.find_all('a', attrs={'href': re.compile("/propiedades/")}):
  link_text = a['href']
  URL='https://www.zonaprop.com.ar'+link_text
  print(URL)

OK, the output looks fine to me:

https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html#map
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html#map
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html#map
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html#map
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html

The thing is, the output consists of real links (you can click them and they take you to the page).

But when I try to store them in a new variable (a list or dict with the column name 'Address', matching the 'Address' column of 'PlacesDf'), convert them to a table, or any other trick, I can't find a way. In fact, when I try to turn it into pandas:

Address = pd.DataFrame(URL)

it only creates a single-row table.

I expect to see something like this:

Adresses = ['https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map',
            'https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html',
            ...]

or a dict, or anything I can work with in pandas.


2 Answers

I don't know where you are getting lat and lon from, so I'm making an assumption about Address. I can see you have a lot of duplicates in your currently returned URLs. I suggest the following css selectors to target only the listing links. These class selectors are also much faster than your current method.

Use the len of the returned list of links to define the row dimension; you already have the columns.

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import re

r = requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html')
soup = bs(r.content, 'lxml')  # or 'html.parser'

# Class selectors target only the listing-title links, avoiding duplicates
links = ['https://www.zonaprop.com.ar' + item['href'] for item in soup.select('.aviso-data-title a')]
# Strip newlines and tabs from the location text
locations = [re.sub(r'\n|\t', '', item.text).strip() for item in soup.select('.aviso-data-location')]

df = pd.DataFrame(index=range(len(links)), columns=['Address', 'Lat', 'Lon', 'Link'])
df.Link = links
df.Address = locations
print(df)
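If the class selectors ever miss, another way to deal with the triplicated URLs from the broader `find_all` approach is to normalise and de-duplicate them in pandas. A minimal sketch (the `links` list below is hard-coded from the question's output rather than scraped live):

```python
import pandas as pd

# Stand-ins for the scraped links: in the question's output each listing
# appears up to three times (once with a "#map" fragment, twice plain).
links = [
    "https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map",
    "https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html",
    "https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html",
    "https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html",
    "https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html",
]

df = pd.DataFrame({"Link": links})
# Drop the "#map" fragment so anchor variants collapse to the same URL,
# then remove exact duplicates.
df["Link"] = df["Link"].str.split("#").str[0]
df = df.drop_duplicates(subset="Link").reset_index(drop=True)
print(df)  # 2 unique listings
```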

You should do the following:

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd


source=requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html').text
soup=BeautifulSoup(source,'lxml')

#Extract URL
all_url = [] 
link_text = ''
PlacesDf = pd.DataFrame(columns=['Address', 'Location.lat', 'Location.lon'])
for a in soup.find_all('a', attrs={'href': re.compile("/propiedades/")}):
  link_text = a['href']
  URL='https://www.zonaprop.com.ar'+link_text
  print(URL)
  all_url.append(URL)

df = pd.DataFrame({"URLs":all_url}) #replace "URLs" with your desired column name
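If you specifically want the links under the 'Address' column of the PlacesDf frame defined in the question, one option is `reindex`, which adds the missing columns as NaN so the lat/lon values can be filled in later. A sketch with hard-coded sample URLs standing in for the scraped `all_url` list:

```python
import pandas as pd

# Sample URLs standing in for the scraped all_url list
all_url = [
    "https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html",
    "https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html",
]

# Build one row per URL, then add the remaining columns filled with NaN
PlacesDf = pd.DataFrame({"Address": all_url}).reindex(
    columns=["Address", "Location.lat", "Location.lon"]
)
print(PlacesDf.shape)  # (2, 3)
```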

Hope this helps.
