I am using requests and BeautifulSoup to scrape data from a real-estate website. The site has several numbered "pages", each showing a few dozen apartments. I wrote a loop that runs over all of these pages and collects the data from the listings, but unfortunately the site uses JavaScript for pagination, so the code only returns the listings from the first page. I also tried Selenium, but ran into the same problem.

Any advice is much appreciated.

Here is the code:
```python
# Imports needed by the snippet below
import time
from random import randint

import requests
from bs4 import BeautifulSoup

# Create empty lists to append data scraped from the URL.
# The number of lists depends on the number of features you want to extract.
lista_preco = []
lista_endereco = []
lista_tamanho = []
lista_quartos = []
lista_banheiros = []
lista_vagas = []
lista_condominio = []
lista_amenidades = []
lista_fotos = []
lista_sites = []

n_pages = 0
for page in range(1, 15):
    n_pages += 1
    url = "https://www.vivareal.com.br/venda/bahia/salvador/apartamento_residencial/" + '?pagina=' + str(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    house_containers = soup.find_all('div', {'class': 'js-card-selector'})
    if house_containers != []:
        for container in house_containers:
            # Price
            price = container.find_all('section', class_='property-card__values')[0].text
            try:
                price = int(price[:price.find('C')].replace('R$', '').replace('.', '').strip())
            except ValueError:
                price = 0
            lista_preco.append(price)
            # Zone
            location = container.find_all('span', class_='property-card__address')[0].text
            lista_endereco.append(location.strip())
            # Size
            size = container.find_all('span', class_='property-card__detail-value js-property-card-value property-card__detail-area js-property-card-detail-area')[0].text
            if '-' not in size:
                size = int(size[:size.find('m')].replace(',', '').strip())
            else:
                size = int(size[:size.find('-')].replace(',', '').strip())
            lista_tamanho.append(size)
            # Rooms
            quartos = container.find_all('li', class_='property-card__detail-item property-card__detail-room js-property-detail-rooms')[0].text
            quartos = quartos[:quartos.find('Q')].strip()
            if '-' in quartos:
                quartos = quartos[:quartos.find('-')].strip()
            lista_quartos.append(int(quartos))
            # Bathrooms
            banheiros = container.find_all('li', class_='property-card__detail-item property-card__detail-bathroom js-property-detail-bathroom')[0].text
            banheiros = banheiros[:banheiros.find('B')].strip()
            if '-' in banheiros:
                banheiros = banheiros[:banheiros.find('-')].strip()
            lista_banheiros.append(int(banheiros))
            # Garage
            vagas = container.find_all('li', class_='property-card__detail-item property-card__detail-garage js-property-detail-garages')[0].text
            vagas = vagas[:vagas.find('V')].strip()
            if '--' in vagas:
                vagas = '0'
            lista_vagas.append(int(vagas))
            # Condomínio (monthly fee)
            condominio = container.find_all('section', class_='property-card__values')[0].text
            try:
                condominio = int(condominio[condominio.rfind('R$'):].replace('R$', '').replace('.', '').strip())
            except ValueError:
                condominio = 0
            lista_condominio.append(condominio)
            # Amenities
            try:
                amenidades = container.find_all('ul', class_='property-card__amenities')[0].text.split()
            except IndexError:
                amenidades = 'Zero'
            lista_amenidades.append(amenidades)
            # Listing URL
            link = 'https://www.vivareal.com.br/' + container.find_all('a')[0].get('href')[1:-1]
            lista_sites.append(link)
            # Image (2x-size thumbnail) - left commented out
            # p = str(container.find_all('img')[0])
            # imgurl = p[p.find('https'):p.rfind('data-src')].replace('"', '').strip()
            # lista_fotos.append(imgurl)
    else:
        break
    time.sleep(randint(1, 2))

print('You scraped {} pages containing {} properties.'.format(n_pages, len(lista_preco)))
```
You do have an alternative. There is no need for Selenium, because you can access the data through the site's API.

Note that the site limits pagination to at most 10,000 listings. The response returns far more data than you probably want, so inspect the JSON response and see whether there are other fields you need to add:
Code:

Output:
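The original code for this answer was not preserved here, so below is a minimal sketch of the approach it describes. The endpoint URL, the parameter names (`from`, `size`), and the JSON shape are all assumptions for illustration; open the browser's network tab on the listings page to find the real request your browser makes.

```python
# NOTE: the endpoint and its query parameters are hypothetical placeholders;
# inspect the browser's network tab to find the real ones.
API_URL = "https://example.com/listings-api"  # hypothetical endpoint
PAGE_SIZE = 36  # assumed listings per page


def page_params(page, page_size=PAGE_SIZE):
    """Build the query parameters for one page of results (assumed scheme)."""
    return {"from": (page - 1) * page_size, "size": page_size}


def extract_listing(item):
    """Pull a few fields out of one JSON listing (assumed structure)."""
    return {
        "price": item.get("price", 0),
        "address": item.get("address", ""),
        "area_m2": item.get("area", 0),
    }


def scrape(http_get, max_listings=10_000):
    """Page through the API up to the server's ~10,000-listing cap.

    `http_get(url, params)` can be e.g. `requests.get`; it is passed in
    so the paging logic can be tested without a network connection.
    """
    results = []
    page = 1
    while (page - 1) * PAGE_SIZE < max_listings:
        resp = http_get(API_URL, params=page_params(page))
        items = resp.json().get("listings", [])
        if not items:  # no more results: stop early
            break
        results.extend(extract_listing(i) for i in items)
        page += 1
    return results
```

To run it for real you would call `scrape(lambda url, params: requests.get(url, params=params))` after confirming the actual endpoint and field names against the JSON you see in the network tab.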
Unfortunately, I believe you have no other choice. The reason is that with modern front-end techniques the HTML is rendered asynchronously: JavaScript needs a "real" environment to run and load the page. With Ajax, for example, you need a real browser (Chrome, Firefox) for it to work. My suggestion is therefore to keep digging into Selenium: simulate click events on each page number (1, 2, 3, ... until the end), wait for the data to load, then read the HTML and extract the data you need.

Regards
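A rough sketch of that click-and-wait approach, assuming Selenium 4. The card selector `div.js-card-selector` is taken from the question's code, but the "next page" button selector is an assumption you must confirm by inspecting the page; the small `parse_int_field` helper mirrors the question's string slicing.

```python
def parse_int_field(text, stop_char):
    """Mirror the question's slicing: digits before `stop_char`,
    treating '--' as 0 and ranges like '2 - 3' as their first number."""
    value = text[:text.find(stop_char)].strip()
    if '--' in value:
        return 0
    if '-' in value:
        value = value[:value.find('-')].strip()
    return int(value)


def scrape_with_clicks(n_pages=14):
    # Selenium is imported here so the helper above stays usable without it.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://www.vivareal.com.br/venda/bahia/salvador/apartamento_residencial/")
    wait = WebDriverWait(driver, 15)
    rows = []
    for _ in range(n_pages):
        # Wait until the cards for the current page have rendered.
        cards = wait.until(EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, "div.js-card-selector")))
        for card in cards:
            quartos = card.find_element(
                By.CSS_SELECTOR, "li.js-property-detail-rooms").text
            rows.append({"quartos": parse_int_field(quartos, 'Q')})
        # Click "next page"; this selector is an assumption -- inspect
        # the pagination widget on the live page to confirm it.
        driver.find_element(
            By.CSS_SELECTOR, "button[title='Próxima página']").click()
    driver.quit()
    return rows
```

The key point is the explicit `WebDriverWait` between clicks: reading the HTML immediately after a click returns the old page, because the new cards load asynchronously.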