Python: How do I iterate over HTML element objects using LXML/Requests?

Posted 2024-06-02 10:24:20


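I'm trying to build a data table from a website using LXML and Requests. I need both the values held in the input tags and the text inside the div tags. Here is the HTML: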

<div class="houses">
    <input type="hidden" class="houseNumber" value="107">
    <input type="hidden" class="houseState" value="MT">
    <input type="hidden" class="houseStatus" value="Occupied">
<div class="houseInfo">
    <div class="houseCity">Helena</div>
    <div class="houseArea">Helena Valley</div>
</div>
</div>
<div class="houses">
    <input type="hidden" class="houseNumber" value="237">
    <input type="hidden" class="houseState" value="MT">
    <input type="hidden" class="houseStatus" value="Occupied">
<div class="houseInfo">
    <div class="houseCity">East Helena</div>
    <div class="houseArea">Helena Valley</div>
</div>
</div>
<div class="houses">
    <input type="hidden" class="houseNumber" value="104">
    <input type="hidden" class="houseState" value="MT">
    <input type="hidden" class="houseStatus" value="Vacant">
<div class="houseInfo">
    <div class="houseCity">Helena</div>
    <div class="houseArea">Helena Valley</div>
</div>
</div>

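Based on this, I'd like to create a table like the following:

houseNumber  houseState  houseStatus  houseCity    houseArea
107          MT          Occupied     Helena       Helena Valley
237          MT          Occupied     East Helena  Helena Valley
104          MT          Vacant       Helena       Helena Valley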

Using Requests and LXML, I tried iterating over the div class="houses" elements to get what I need, but every time I print the values, I get the following:

['107', '237', '104']
['MT', 'MT', 'MT']
['Occupied', 'Occupied', 'Vacant']
['Helena', 'East Helena', 'Helena']
['Helena Valley', 'Helena Valley', 'Helena Valley']
['107', '237', '104']
['MT', 'MT', 'MT']
['Occupied', 'Occupied', 'Vacant']
['Helena', 'East Helena', 'Helena']
['Helena Valley', 'Helena Valley', 'Helena Valley']
['107', '237', '104']
['MT', 'MT', 'MT']
['Occupied', 'Occupied', 'Vacant']
['Helena', 'East Helena', 'Helena']
['Helena Valley', 'Helena Valley', 'Helena Valley']

Here is part of my code:

link = "example.com"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(link, headers=headers, allow_redirects=False) 
sourceCode = response.content

htmlElem = html.document_fromstring(sourceCode)
houses = htmlElem.find_class('houses')
for house in houses:
    houseNumber = house.xpath('//input[@class="houseNumber"]/@value')
    houseState = house.xpath('//input[@class="houseState"]/@value')
    houseStatus = house.xpath('//input[@class="houseStatus"]/@value')

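How can I capture the data as in the table shown above? Is there a different way to iterate over the houses object?

Update: @efirvida, I've changed my code to the following: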

link = "example.com"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(link, headers=headers, allow_redirects=False) 
sourceCode = response.content

htmlElem = html.document_fromstring(sourceCode)
houses = htmlElem.find_class('houses')
houseNumber = []
houseState = []
houseStatus = []

for house in houses:
    houseNumber.append(house.xpath('//input[@class="houseNumber"]/@value'))
    print(houseNumber)
    houseState.append(house.xpath('//input[@class="houseState"]/@value'))
    houseStatus.append(house.xpath('//input[@class="houseStatus"]/@value'))

data = map(list, zip(*[houseNumber,houseState,houseStatus]))

When I do this, it prints the following:

[['107', '237', '104']]
[['107', '237', '104'], ['107', '237', '104']]
[['107', '237', '104'], ['107', '237', '104'], ['107', '237', '104']]

1 answer
User
#1 · Posted 2024-06-02 10:24:20

Try transposing the result; see this thread to understand my code.

# create one list per field
houseNumber = []
houseState = []
houseStatus = []

# append each house's value to its list; the leading "." keeps the XPath
# relative to the current house (a bare "//" searches the whole document),
# and [0] takes the single match out of the list that xpath() returns
for house in houses:
    houseNumber.append(house.xpath('.//input[@class="houseNumber"]/@value')[0])
    houseState.append(house.xpath('.//input[@class="houseState"]/@value')[0])
    houseStatus.append(house.xpath('.//input[@class="houseStatus"]/@value')[0])


# transpose the lists, and turn the result into a list of lists
data = map(list, zip(*[houseNumber,houseState,houseStatus]))

>>> list(data)
# [['107', 'MT', 'Occupied'], ['237', 'MT', 'Occupied'], ['104', 'MT', 'Vacant']]
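
The repeated lists in the question come from the leading `//`: in lxml, an XPath that starts with `//` is evaluated from the document root even when it is called on a single `house` element, so every iteration returns the values of all houses. Below is a minimal sketch of the relative form (`.//`) that builds one row per house, including the city and area text. It assumes a trimmed two-house copy of the question's markup inlined as `SAMPLE`; the names `SAMPLE`, `INPUT_FIELDS`, `TEXT_FIELDS`, and `house_rows` are illustrative and do not appear in the original post.

```python
from lxml import html

# Trimmed copy of the question's markup (two houses), inlined so the sketch
# runs without a network request.
SAMPLE = """
<div class="houses">
    <input type="hidden" class="houseNumber" value="107">
    <input type="hidden" class="houseState" value="MT">
    <input type="hidden" class="houseStatus" value="Occupied">
    <div class="houseInfo">
        <div class="houseCity">Helena</div>
        <div class="houseArea">Helena Valley</div>
    </div>
</div>
<div class="houses">
    <input type="hidden" class="houseNumber" value="237">
    <input type="hidden" class="houseState" value="MT">
    <input type="hidden" class="houseStatus" value="Occupied">
    <div class="houseInfo">
        <div class="houseCity">East Helena</div>
        <div class="houseArea">Helena Valley</div>
    </div>
</div>
"""

INPUT_FIELDS = ("houseNumber", "houseState", "houseStatus")  # read from @value
TEXT_FIELDS = ("houseCity", "houseArea")                     # read from text()


def house_rows(source):
    """Return one [number, state, status, city, area] list per house."""
    doc = html.document_fromstring(source)
    rows = []
    for house in doc.find_class("houses"):
        # ".//" searches only below the current house element.
        # [0] raises IndexError if a field is missing from a house.
        row = [house.xpath('.//input[@class="%s"]/@value' % name)[0]
               for name in INPUT_FIELDS]
        row += [house.xpath('.//div[@class="%s"]/text()' % name)[0]
                for name in TEXT_FIELDS]
        rows.append(row)
    return rows


rows = house_rows(SAMPLE)
for row in rows:
    print(row)

# Side-by-side comparison on the first house only.
first_house = html.document_fromstring(SAMPLE).find_class("houses")[0]
absolute = first_house.xpath('//input[@class="houseNumber"]/@value')   # whole document
relative = first_house.xpath('.//input[@class="houseNumber"]/@value')  # this house only
```

Building each row inside the loop avoids the transposition step altogether, at the cost of failing loudly when a house lacks one of the fields.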

If tuples are acceptable, simply drop the map:

data = zip(*[houseNumber,houseState,houseStatus])
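
One detail worth checking when transposing this way, assuming Python 3: both `zip` and `map` return one-shot iterators, so the result is empty on a second pass unless it is materialized into a list first. The sketch below uses hard-coded copies of the three value lists (the snake_case names are illustrative, not from the original code) to show the tuple form, the list form, and the single-use behavior.

```python
# Hard-coded copies of the per-field values from the question.
house_number = ["107", "237", "104"]
house_state = ["MT", "MT", "MT"]
house_status = ["Occupied", "Occupied", "Vacant"]

# zip pairs up the i-th item of each list, so each tuple is one house.
as_tuples = list(zip(house_number, house_state, house_status))

# A list comprehension yields lists instead of tuples and produces a real
# list that can be iterated as often as needed.
as_lists = [list(row) for row in zip(house_number, house_state, house_status)]

# A bare map object is consumed by the first pass over it.
lazy = map(list, zip(house_number, house_state, house_status))
first_pass = list(lazy)
second_pass = list(lazy)  # nothing left to yield

print(as_tuples)
print(first_pass)
print(second_pass)
```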
