Scraping a table from a web page

Posted 2024-09-26 22:10:37


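I am trying to extract CSU employee salary data from this web page (http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento). I have tried both the urllib2 and requests libraries, but neither returns the actual table from the page. My guess is that the table is generated dynamically by JavaScript. Below is my code using requests.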

from lxml import html
import requests

page = requests.get("http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento")
tree = html.fromstring(page.text)
name = tree.xpath('//table/tbody/tr/td[2]/text()')

Any help or suggestions would be greatly appreciated.


2 Answers

Following up on my comment, here is my attempt. Note that I only extract one row of data. The rest is up to you.

Code:

import requests as rq

url = "http://api.sacbeelabs.com/v1/statepay/employee/search/name=/year=2013/department=CSU%20Sacramento.json"
data = "74XoegZ494trsvrus_As4B4handjZ494-Adl4B4olg494dnnk933pppAmWYXaaAYjh3mnWnakWq3-Ela-B-Oahkgjqaa07tw8tJmaWlYd07tw8tJiWha07tw8uH07tw8tJqaWl07tw8uHtrsu07tw8tJZakWlnhain07tw8uHGT-107tw8trTWYlWhainj4B4labalal494dnnk933mnWYfj-8albgjpAYjh3-Boamnejim3tt_v_rt_3YlWpgeic1nWXgam1bljh1paXkWca4B4nenga494TnWnaDVjlfalDTWgWlqDTaWlYdD1DUdaDTWYlWhainjDFaaBDTWYlWhainjBDGWgebjlieW4B4mYlV49sxzrB4mYlL49srwrB4peiV49sxzrB4peiL49_stB4oW4974Wcain494Oj-CeggW3wArD-I-6ss-MD-1Xoino-MDNeio-AD-Azx2xv-MDl-89tzAr-JDKaYfj3trsrrsrsDJelabj-A3tzAr4B4njoYd49bWgmaB4Zjh4954mnjlWca4B4WiehWneji4B4YWi-8WmtZ4B4paXmjYfan4B4pjlfal4B4WoZej4B4-8eZaj4B4m-8c4B4cajgjY46B4Ymm4954WiehWneji4B4nlWimbjlh468B4omal4974Woi494Koamn488"
headers = {
'Host': 'api.sacbeelabs.com',
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'X-SBAPI-Auth-Token': '0QNWbefXw6fQQcWXqK8vDw',
'X-SBAPI-SID': '3gbRqglHXAVDy1vwdcVVMf',
'X-SBAPI-CID': '2HuWho39ZcDUlTswYSWUd9',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Referer': 'http://www.sacbee.com/statepay/',
'Content-Length': '684',
'Origin': 'http://www.sacbee.com',
'Cookie': 'sbapi-cid=2HuWho39ZcDUlTswYSWUd9; sbapi-sid=3gbRqglHXAVDy1vwdcVVMf',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache'
}

r = rq.post(url, data=data, headers=headers)
json_data = r.json()

base = json_data["result"]["employees"][0] # First employee.

name = base["name"]
first_name = name["first"]
last_name = name["last"]

pay = base["pay"]["total"]

title = base["title"]
dept = base["department"]

print(first_name, last_name, pay, title, dept)
# Your turn here...

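The "your turn" step above could be sketched roughly as follows, assuming Python 3 and assuming every entry in `json_data["result"]["employees"]` has the same fields that the code above reads for the first employee. The helper name `employee_rows` and the sample payload are hypothetical, with made-up values used only to show the shape; they are not real output from the API.

```python
def employee_rows(json_data):
    """Flatten the decoded response into one tuple per employee.

    Assumes each employee has the same keys that the answer's code
    reads for the first employee.
    """
    rows = []
    for emp in json_data["result"]["employees"]:
        name = emp["name"]
        rows.append((
            name["first"],
            name["last"],
            emp["pay"]["total"],
            emp["title"],
            emp["department"],
        ))
    return rows


# Hypothetical payload with invented values, shaped like the fields used above.
sample = {
    "result": {
        "employees": [
            {"name": {"first": "Jane", "last": "Doe"},
             "pay": {"total": 1000.0},
             "title": "Lecturer",
             "department": "CSU Sacramento"},
            {"name": {"first": "John", "last": "Roe"},
             "pay": {"total": 2000.0},
             "title": "Professor",
             "department": "CSU Sacramento"},
        ]
    }
}

for row in employee_rows(sample):
    print(row)
```

With a real response, `employee_rows(r.json())` would take the place of the single-employee lookup, provided the assumption about the field layout holds.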

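I just took a quick look at the site you mentioned. It is indeed because the table is loaded with JavaScript, so it is not actually part of the page your script requests.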
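One way to see this is that everything after the `#` in the question's URL is a fragment, which is handled by the browser and is not sent to the server as part of the HTTP request. The small sketch below (Python 3, standard library only) splits the URL to show which part is actually requested and what the fragment decodes to. Reading the decoded fragment as the search the page's JavaScript performs is an interpretation, not something confirmed by the site.

```python
from urllib.parse import urlsplit, unquote

url = ("http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D"
       "%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento")

parts = urlsplit(url)

# The part of the URL that is sent in the HTTP request (no fragment).
requested = parts.scheme + "://" + parts.netloc + parts.path

# The fragment stays on the client; percent-decode it to make it readable.
fragment = unquote(parts.fragment)

print(requested)
print(fragment)
```

So the request only ever fetches the bare `/statepay/` page, and the search parameters in the fragment are left for the page's scripts to act on.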

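To solve this, you will need to look at the web requests the site makes and find the one that retrieves the table data. It is not hard to do, just tedious. Have a look here; it is a similar question. Hope that helps!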
