Missing fields when parsing XML with Python

from bs4 import BeautifulSoup
import requests
import urllib, urllib2
import csv

url = requests.get("https://raw.github.com/rfarley90/random/master/zillowresults.html")
pageText = url.text
soup = BeautifulSoup(pageText)

useCode = soup.find('useCode')
taxAssessmentYear = soup.find('taxAssessmentYear')
taxAssessment = soup.find('taxAssessment')
yearBuilt = soup.find('yearBuilt')
lotSizeSqFt = soup.find('lotSizeSqFt')
finishedSqFt = soup.find('finishedSqFt')
bathrooms = soup.find('bathrooms')
lastSoldDate = soup.find('lastSoldDate')
lastSoldPrice = soup.find('lastSoldPrice')
zestimate = soup.find('zestimate')
amount = soup.find('amount')
lastupdated = soup.find('last-updated')
valueChangeduration = soup.find('valueChange')
valuationRange = soup.find('valuationRange')
lowcurrency = soup.find('low')
highcurrency = soup.find('high')
percentile = soup.find('percentile')
localRealEstate = soup.find('localRealEstate')
region = soup.find('region')
links = soup.find('links')
overview = soup.find('overview')
forSaleByOwner = soup.find('forSaleByOwner')
forSale = soup.find('forSale')

array = [
    ['useCode ', useCode],
    ['taxAssessmentYear ', taxAssessmentYear],
    ['taxAssessment ', taxAssessment],
    ['yearBuilt ', yearBuilt],
    ['lotSizeSqFt ', lotSizeSqFt],
    ['finishedSqFt ', finishedSqFt],
    ['bathrooms ', bathrooms],
    ['lastSoldDate ', lastSoldDate],
    ['lastSoldPrice ', lastSoldPrice],
    ['zestimate ', zestimate],
    ['amount ', amount],
    ['lastupdated ', lastupdated],
    ['valueChangeduration ', valueChangeduration],
    ['valuationRange ', valuationRange],
    ['lowcurrency ', lowcurrency],
    ['highcurrency ', highcurrency],
    ['percentile ', percentile],
    ['localRealEstate ', localRealEstate],
    ['region ', region],
    ['links ', links],
    ['overview ', overview],
    ['forSaleByOwner ', forSaleByOwner],
    ['forSale ', forSale],
]

for x in array:
    print x

2 Answers

User

#1 · edited 2024-09-26 18:08:44

BeautifulSoup lowercases tag names when it parses the page as HTML, so find queries have to use the lowercase names:

>>> url = requests.get("https://raw.github.com/rfarley90/random/master/zillowresults.html")
>>> pageText = url.text
>>> soup = BeautifulSoup(pageText)
>>> soup.find('usecode')
<usecode>SingleFamily</usecode>
>>> soup.find('usecode').text
u'SingleFamily'
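
The lowercasing can be reproduced without the network or BeautifulSoup itself. The sketch below (Python 3) uses the standard library's html.parser.HTMLParser, which is one of the parsers BeautifulSoup can delegate to, and records the tag names it reports. The sample string and the TagNameCollector class are made up for illustration and are not the real Zillow response.

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Records each start-tag name exactly as the HTML parser reports it."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over the tag name already converted to lowercase.
        self.names.append(tag)


# Hypothetical sample, not the actual page from the question.
sample = "<result><useCode>SingleFamily</useCode><yearBuilt>1948</yearBuilt></result>"

collector = TagNameCollector()
collector.feed(sample)
collector.close()

print(collector.names)
# The camelCase name is gone; only the lowercase form exists after parsing.
print("useCode" in collector.names, "usecode" in collector.names)
```

This is why a lookup for 'useCode' comes back empty while 'usecode' matches: by the time the tree is built, the mixed-case name no longer exists.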

User

#2 · edited 2024-09-26 18:08:44

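By default, BeautifulSoup coerces all tag names to lowercase. You can see this in the result data above: the region tag contains forsalebyowner (among others) as part of its contents, whereas the original data has forSaleByOwner.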

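Thankfully, you can override this behaviour by telling BeautifulSoup to use XML when you create the object, but before doing so you need to strip some non-XML content from the page: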

import requests
from bs4 import BeautifulSoup

url = requests.get("https://raw.github.com/rfarley90/random/master/zillowresults.html")
pageText = url.text.split('\n')
# exclude initial text & end comment
pageXML = ''.join( pageText[1:pageText.index(u'<! ')] )
soup = BeautifulSoup(pageXML, "xml")
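
Once the text is well-formed XML, any case-sensitive XML parser will keep the original tag names. As a rough sketch of that idea, the example below (Python 3) swaps in the standard library's xml.etree.ElementTree instead of BeautifulSoup's "xml" mode, so it runs without lxml installed. The sample document, the example.com URL, and the text_of helper are assumptions made for illustration, not the real Zillow data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, not the actual page from the question.
sample = """<result>
  <useCode>SingleFamily</useCode>
  <localRealEstate>
    <region>
      <links>
        <forSaleByOwner>http://example.com/fsbo</forSaleByOwner>
      </links>
    </region>
  </localRealEstate>
</result>"""

root = ET.fromstring(sample)


def text_of(element, tag_name):
    """Return the text of the first descendant named tag_name, or None."""
    found = element.find(".//" + tag_name)
    return found.text if found is not None else None


# XML parsing is case-sensitive, so the camelCase names match as written.
print(text_of(root, "useCode"))
print(text_of(root, "forSaleByOwner"))
# The lowercase spelling no longer matches anything.
print(text_of(root, "usecode"))
```

With an XML parser the lookups use the same spelling as the source document, so the field names from the question can be used unchanged.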
