Below is a ScraperWiki scraper written in Python:

    import lxml.html
    import scraperwiki
    from unidecode import unidecode

    # Fetch the rankings page and parse it into an element tree.
    html = scraperwiki.scrape("http://www.timeshighereducation.co.uk/world-university-rankings/2012-13/world-ranking/range/001-200")
    root = lxml.html.fromstring(html)

    # Keep only table rows that carry both a rank cell and a university cell.
    for tr in root.cssselect("table.ranking tr"):
        if len(tr.cssselect("td.rank")) > 0 and len(tr.cssselect("td.uni")) > 0:
            # Transliterate the name to ASCII, trim it, and title-case it.
            university = unidecode(tr.cssselect("td.uni")[0].text_content()).strip().title()
            if 'cole' in university:
                print(university)

It produces the following output:
Ecole Polytechnique Federale De Lausanne
Ecole Normale Superieure
Acole Polytechnique
Ecole Normale Superieure De Lyon
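My question is: what causes the initial character on the third output line to be rendered as "A" instead of "E", and how can I stop this from happening?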
Based on soulseekah's helpful comment above, and the lxml docs here and here, the following solution works:
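
The answer's code is missing from this copy. What follows is only a minimal sketch of the likely cause and one way around it, not the original answer's code. It assumes the page is served as UTF-8 and that the parser read those bytes as Latin-1, so the two bytes of "É" became "Ã" plus a control character, which transliteration then reduces to "A". The helper `to_ascii` is a hypothetical standard-library stand-in for `unidecode`, used so the snippet runs without extra packages.

```python
import unicodedata


def to_ascii(text):
    """Hypothetical stand-in for unidecode: strip accents, drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# "École Polytechnique" as UTF-8 bytes, the way a scraper might receive it.
raw = u"\u00c9cole Polytechnique".encode("utf-8")

# Misreading the bytes as Latin-1 turns "É" into "Ã" plus a control character.
wrong = to_ascii(raw.decode("latin-1"))

# Decoding with the page's real encoding keeps "É", which maps to "E".
right = to_ascii(raw.decode("utf-8"))

print(wrong)  # Acole Polytechnique
print(right)  # Ecole Polytechnique
```

In the scraper above, the equivalent change would be to pass `html.decode("utf-8")` to `lxml.html.fromstring`, or to supply a parser created with `lxml.html.HTMLParser(encoding="utf-8")`. Both rest on the assumption that `scraperwiki.scrape` returns the raw bytes of a UTF-8 page.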