Missing fields when parsing XML with Python

Posted 2024-09-26 18:08:44


I am trying to use Zillow's API to collect all the data about a house. I get some of the fields, but others come back empty.

Here is my Python code:

from bs4 import BeautifulSoup
import requests
import urllib, urllib2
import csv


url = requests.get("https://raw.github.com/rfarley90/random/master/zillowresults.html")
pageText = url.text
soup = BeautifulSoup(pageText)

useCode = soup.find('useCode')
taxAssessmentYear = soup.find('taxAssessmentYear')
taxAssessment = soup.find('taxAssessment')
yearBuilt = soup.find('yearBuilt')
lotSizeSqFt = soup.find('lotSizeSqFt')
finishedSqFt = soup.find('finishedSqFt')
bathrooms = soup.find('bathrooms')
lastSoldDate = soup.find('lastSoldDate')
lastSoldPrice = soup.find('lastSoldPrice')
zestimate = soup.find('zestimate')
amount = soup.find('amount')
lastupdated = soup.find('last-updated')
valueChangeduration = soup.find('valueChange')
valuationRange = soup.find('valuationRange')
lowcurrency = soup.find('low')
highcurrency = soup.find('high')
percentile = soup.find('percentile')
localRealEstate = soup.find('localRealEstate')
region = soup.find('region')
links = soup.find('links')
overview = soup.find('overview')
forSaleByOwner = soup.find('forSaleByOwner')
forSale = soup.find('forSale')




array = [
            ['useCode ' , useCode],
            ['taxAssessmentYear ' , taxAssessmentYear],
            ['taxAssessment ' , taxAssessment],
            ['yearBuilt ' , yearBuilt],
            ['lotSizeSqFt ' , lotSizeSqFt],
            ['finishedSqFt ' , finishedSqFt],
            ['bathrooms ' , bathrooms],
            ['lastSoldDate ' , lastSoldDate],
            ['lastSoldPrice ' , lastSoldPrice],
            ['zestimate ' , zestimate],
            ['amount ' , amount],
            ['lastupdated ' , lastupdated],
            ['valueChangeduration ' , valueChangeduration],
            ['valuationRange ' , valuationRange],
            ['lowcurrency ' , lowcurrency],
            ['highcurrency ' , highcurrency],
            ['percentile ' , percentile],
            ['localRealEstate ' , localRealEstate],
            ['region ' , region],
            ['links ' , links],
            ['overview ' , overview],
            ['forSaleByOwner ' , forSaleByOwner],
            ['forSale ' , forSale]]


for x in array:
    print x

The results I get have many missing values, as shown below:

^{pr2}$

Any idea what is causing this?


2 answers

BeautifulSoup find queries need lowercase tag names:

>>> url = requests.get("https://raw.github.com/rfarley90/random/master/zillowresults.html")
>>> soup = BeautifulSoup(url.text)
>>> soup.find('usecode')
<usecode>SingleFamily</usecode>
>>> soup.find('usecode').text
u'SingleFamily'

Or:

^{pr2}$

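By default, BeautifulSoup parses the input as HTML and coerces all tag names to lowercase. You can see this in the result data above: the region tag contains forsalebyowner and {} as part of its content, whereas in the original data they are forSaleByOwner and {}.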
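The lowercasing described above can be reproduced in a few lines. This is a minimal sketch, assuming Python 3 and the built-in "html.parser" backend; the sample string is hypothetical and is not the actual Zillow response.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, not the actual Zillow response.
sample = "<result><useCode>SingleFamily</useCode><yearBuilt>1948</yearBuilt></result>"

# Parse as HTML: tag names are lowercased while the tree is built.
soup = BeautifulSoup(sample, "html.parser")

camel = soup.find("useCode")  # camelCase name no longer exists in the tree
lower = soup.find("usecode")  # lowercase name matches

print(camel)       # None
print(lower.text)  # SingleFamily
```

This is why the camelCase lookups in the question return nothing while the single-word lowercase ones such as bathrooms succeed.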

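Fortunately, you can override this behavior by telling BeautifulSoup to parse as XML when you create the object, but before doing so you need to strip some non-XML content from the page: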

url = requests.get("https://raw.github.com/rfarley90/random/master/zillowresults.html")
pageText = url.text.split('\n')
# exclude initial text & end comment
pageXML = ''.join( pageText[1:pageText.index(u'<! ')] )
soup = BeautifulSoup(pageXML, "xml")
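Once the input is parsed as XML, the original camelCase names should work unchanged. The following is a small sketch under stated assumptions: Python 3, lxml installed (BeautifulSoup's "xml" feature depends on it), and a hypothetical sample document standing in for the real response. The `fields` list and `values` dict are illustrative names, not part of the original code.

```python
from bs4 import BeautifulSoup  # the "xml" feature requires lxml to be installed

# Hypothetical sample, not the actual Zillow response.
sample = """<result>
  <useCode>SingleFamily</useCode>
  <yearBuilt>1948</yearBuilt>
  <lastSoldPrice currency="USD">250000</lastSoldPrice>
</result>"""

# Parse as XML: tag names keep their original case.
soup = BeautifulSoup(sample, "xml")

fields = ["useCode", "yearBuilt", "lastSoldPrice", "forSale"]
values = {}
for name in fields:
    tag = soup.find(name)
    # find() returns None when a tag is absent, so guard before reading text.
    values[name] = tag.get_text(strip=True) if tag is not None else None

for name, value in values.items():
    print(name, value)
```

Storing the text in a dict rather than the tag objects also makes it straightforward to tell a field that is genuinely absent (None) from one that was simply looked up under the wrong name.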
