How can I improve my code? Web scraping, XPath

Posted 2024-10-02 20:31:32

Hope you are all well.

To get straight to the point: my code tries to scrape results from a website, specifically each restaurant's name, rating, and address.

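The XPath expressions for the restaurant name and the address work perfectly, but the one for the rating returns other values in addition to the rating.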

Restaurant name: //a[@class="arrivalName"]/text()

Address: //span[@class="address"]/text()

Rating: //a[@rel="nofollow"]/text()

For scraping, I combine them into a single expression:

'//a[@class="arrivalName"]/text()|//span[@class="address"]/text()|//a[@rel="nofollow"]/text()'

The rating problem is not actually that bad, because after exporting I can delete the extra rows that are not ratings.
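As an alternative to deleting the extra rows by hand, the rating expression could be narrowed with a text predicate. The sketch below assumes that every real rating contains the word "opiniones", as in the sample output further down; the HTML snippet is made up for illustration and the real page may be structured differently.

```python
from lxml import html

# Made-up markup for illustration; the real page structure may differ.
SAMPLE = """
<div>
  <a rel="nofollow" href="#">999 opiniones</a>
  <a rel="nofollow" href="#">Hacer pedido</a>
  <a rel="nofollow" href="#">33 opiniones</a>
</div>
"""

tree = html.fromstring(SAMPLE)

# Keep only the rel="nofollow" links whose text mentions "opiniones".
ratings = tree.xpath(
    '//a[@rel="nofollow"][contains(text(), "opiniones")]/text()'
)
print(ratings)
```

If the site ever labels ratings differently, the predicate would need to be adjusted accordingly.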

My real problem is how the results are laid out in the list. For example:

169: Farbatto Helados

170: 999 opiniones

171: \nYerbal 2413\n

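I would like the same data, but with the restaurant name (169) in one column, the rating (170) in a second column, and the address (171) in a third: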

Farbatto Helados | 999 opiniones | \nYerbal 2413\n

My code is below. Any help would be greatly appreciated!

Part 1

import pandas as pd
import requests
from lxml import html

Part 2

header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36","X-Requested-With": "XMLHttpRequest"}

url = 'https://www.pedidosya.com.ar/restaurantes/buenos-aires?a=+colpayo+132&lng=-58.44132490000004&lat=-34.6184536&doorNumber=132&page=8'

Part 3

r = requests.get(url, headers=header)

Part 4

tree = html.fromstring(r.content)

title = tree.xpath('//a[@class="arrivalName"]/text()|//span[@class="address"]/text()|//a[@rel="nofollow"]/text()')

df = pd.DataFrame(title)
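For context, a union expression (`|`) like the one above yields a single flat list with the matches interleaved in document order, which is why the DataFrame ends up with one column. The sketch below reproduces this on a made-up HTML fragment; the markup is an assumption for illustration, not the real page.

```python
from lxml import html

# Made-up markup for illustration; the real page structure may differ.
SAMPLE = """
<div>
  <section>
    <a class="arrivalName">Farbatto Helados</a>
    <a rel="nofollow">999 opiniones</a>
    <span class="address">Yerbal 2413</span>
  </section>
  <section>
    <a class="arrivalName">Rios Peruanos</a>
    <a rel="nofollow">33 opiniones</a>
    <span class="address">Yerbal 787</span>
  </section>
</div>
"""

tree = html.fromstring(SAMPLE)

# The union returns one flat list; names, ratings, and addresses are mixed.
flat = tree.xpath(
    '//a[@class="arrivalName"]/text()'
    '|//span[@class="address"]/text()'
    '|//a[@rel="nofollow"]/text()'
)
for item in flat:
    print(item)
```

Slicing such a list into groups of three only works if every restaurant has all three fields, so querying each restaurant's container element separately is usually the safer route.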

1 Answer
User
#1 · Posted 2024-10-02 20:31:32

I quickly put together some code to help you; modify it to suit your needs.

import pandas as pd
import requests
from lxml import html

header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36","X-Requested-With": "XMLHttpRequest"}
url = 'https://www.pedidosya.com.ar/restaurantes/buenos-aires?a=+colpayo+132&lng=-58.44132490000004&lat=-34.6184536&doorNumber=132&page=8'
r = requests.get(url, headers=header)
tree = html.fromstring(r.content)

# Each restaurant sits in its own <section class="restaurantData"> element.
hotel_elements = tree.xpath('//section[@class="restaurantData"]')
hotels = []
for hotel in hotel_elements:
    # Relative XPaths (leading ".") search only inside the current section.
    hotel_name = hotel.xpath('.//a[@class="arrivalName"]')
    hotel_address = hotel.xpath('.//span[@class="address"]')
    hotel_reviews = hotel.xpath('.//a[@rel="nofollow"]')
    # Take the text of the first match; a field with no match stays an empty list.
    if hotel_name:
        hotel_name = hotel_name[0].text_content()
    if hotel_address:
        hotel_address = hotel_address[0].text_content()
    if hotel_reviews:
        hotel_reviews = hotel_reviews[0].text_content()
    hotels.append([hotel_name, hotel_address, hotel_reviews])

# One row per restaurant: name, address, reviews.
df = pd.DataFrame(hotels)

Output

                              0                                       1              2
0                 Double Crêpes           \nBernardo de Irigoyen 1588\n  144 opiniones
1            Empanadas del Chef                         \nRosario 749\n  230 opiniones
2     El Emporio Helado Natural                         \nMurillo 749\n   33 opiniones
3                    Vian-ditas                                      []             []
4                 Rios Peruanos                          \nYerbal 787\n   33 opiniones
5                   Puro Goyena                \nAv. Pedro Goyena 293\n             []
6   Rotisería Welcome Caballito  \nAvenida Dr. Honorio Pueyrredón 784\n  137 opiniones
7             Moreira Caballito               \nJosé María Moreno 735\n   62 opiniones
8               Game of Burgers                         \nSaraza 1110\n   82 opiniones
9                Salimos Fuerte                    \nRamos Mejía 1088\n             []
10        Fullescabio Caballito                    \nCucha Cucha 1420\n   59 opiniones
11                     Donovans                          \nPerón 1596\n             []
12         Don Ricardo Restobar              \nJuan B. Ambrosetti 704\n   37 opiniones
13                Titan Burgers                       \nAranguren 334\n   40 opiniones
14         El Rey de las Arepas       \nDr. Juan Felipe Aranguren 336\n             []
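The output above still carries the surrounding newlines in the addresses and shows `[]` where a field is missing. One possible refinement, sketched here on a made-up HTML fragment, strips the whitespace, stores `None` for missing fields, and names the columns. The helper `first_text` and the column names are hypothetical, and the sample markup only assumes the same `restaurantData` container used in the answer.

```python
import pandas as pd
from lxml import html

# Made-up markup for illustration; the real page structure may differ.
SAMPLE = """
<div>
  <section class="restaurantData">
    <a class="arrivalName">Double Crepes</a>
    <span class="address">
Bernardo de Irigoyen 1588
</span>
    <a rel="nofollow">144 opiniones</a>
  </section>
  <section class="restaurantData">
    <a class="arrivalName">Vian-ditas</a>
  </section>
</div>
"""


def first_text(node, xpath):
    """Return the stripped text of the first match, or None if nothing matches."""
    matches = node.xpath(xpath)
    return matches[0].text_content().strip() if matches else None


tree = html.fromstring(SAMPLE)

# Build one dict per restaurant container so the fields stay aligned.
rows = [
    {
        "name": first_text(section, './/a[@class="arrivalName"]'),
        "address": first_text(section, './/span[@class="address"]'),
        "reviews": first_text(section, './/a[@rel="nofollow"]'),
    }
    for section in tree.xpath('//section[@class="restaurantData"]')
]

df = pd.DataFrame(rows, columns=["name", "address", "reviews"])
print(df)
```

Keeping missing values as `None` rather than `[]` lets pandas treat them as missing data, so calls such as `df.dropna()` or `df["reviews"].isna()` behave as expected.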
