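I'm trying to extract CSU employee salary data from this web page (http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento). I've tried both urllib2 and requests, but neither returns the actual table from the page. My guess is that the table is generated dynamically by JavaScript. Below is my code using requests.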
from lxml import html
import requests
page = requests.get("http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento")
tree = html.fromstring(page.text)
name = tree.xpath('//table/tbody/tr/td[2]/text()')
Any help or comments would be greatly appreciated.
Posted on 2014-04-08 22:59:01
Here is my attempt, following up on my comment. Note that I only extract one row of data. The rest is up to you.
Code:
import requests as rq
url = "http://api.sacbeelabs.com/v1/statepay/employee/search/name=/year=2013/department=CSU%20Sacramento.json"
data = "74XoegZ494trsvrus_As4B4handjZ494-Adl4B4olg494dnnk933pppAmWYXaaAYjh3mnWnakWq3-Ela-B-Oahkgjqaa07tw8tJmaWlYd07tw8tJiWha07tw8uH07tw8tJqaWl07tw8uHtrsu07tw8tJZakWlnhain07tw8uHGT-107tw8trTWYlWhainj4B4labalal494dnnk933mnWYfj-8albgjpAYjh3-Boamnejim3tt_v_rt_3YlWpgeic1nWXgam1bljh1paXkWca4B4nenga494TnWnaDVjlfalDTWgWlqDTaWlYdD1DUdaDTWYlWhainjDFaaBDTWYlWhainjBDGWgebjlieW4B4mYlV49sxzrB4mYlL49srwrB4peiV49sxzrB4peiL49_stB4oW4974Wcain494Oj-CeggW3wArD-I-6ss-MD-1Xoino-MDNeio-AD-Azx2xv-MDl-89tzAr-JDKaYfj3trsrrsrsDJelabj-A3tzAr4B4njoYd49bWgmaB4Zjh4954mnjlWca4B4WiehWneji4B4YWi-8WmtZ4B4paXmjYfan4B4pjlfal4B4WoZej4B4-8eZaj4B4m-8c4B4cajgjY46B4Ymm4954WiehWneji4B4nlWimbjlh468B4omal4974Woi494Koamn488"
headers = {
'Host': 'api.sacbeelabs.com',
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'X-SBAPI-Auth-Token': '0QNWbefXw6fQQcWXqK8vDw',
'X-SBAPI-SID': '3gbRqglHXAVDy1vwdcVVMf',
'X-SBAPI-CID': '2HuWho39ZcDUlTswYSWUd9',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Referer': 'http://www.sacbee.com/statepay/',
'Content-Length': '684',
'Origin': 'http://www.sacbee.com',
'Cookie': 'sbapi-cid=2HuWho39ZcDUlTswYSWUd9; sbapi-sid=3gbRqglHXAVDy1vwdcVVMf',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache'
}
r = rq.post(url, data=data, headers=headers)
json_data = r.json()
base = json_data["result"]["employees"][0] # First employee.
name = base["name"]
first_name = name["first"]
last_name = name["last"]
pay = base["pay"]["total"]
title = base["title"]
dept = base["department"]
# String formatting gives the same output on Python 2 and Python 3.
print("%s %s %s %s %s" % (first_name, last_name, pay, title, dept))
# Your turn here...
Result:
Clayton Abajian 9844 Lecturer - Academic Year CSU Sacramento
[Finished in 0.9s]
Posted on 2014-04-08 22:29:53
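I had a quick look at the site you mentioned. It is indeed because the table is loaded with JavaScript, so it isn't actually part of the page your script requests.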
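One way to see this: everything after the # in that URL is a fragment, which stays in the browser for the page's JavaScript and is never sent to the server. A small sketch using only the standard library (Python 3's urllib.parse; the variable names are just for illustration):

```python
from urllib.parse import urldefrag, unquote

url = ("http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D"
       "%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento")

# Split the URL into the part that is actually requested and the fragment.
requested, fragment = urldefrag(url)

print(requested)          # the only part an HTTP client sends to the server
print(unquote(fragment))  # the search query, read client-side by the page's script
```

So requests.get() on that URL only ever receives the empty shell page, whatever the fragment says.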
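To get around this, look at the requests the site itself makes and find the one that retrieves the table data. It isn't hard to do, just a nuisance. Take a look here; it's a similar problem. Hope that helps!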
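Once you have found the request that returns the data, the rest is ordinary JSON handling. A minimal sketch, assuming the response has the shape used in the other answer (result -> employees, each with name, pay, title and department). The helper employee_rows and the sample payload are made up for illustration, and the type of the pay value is a guess:

```python
def employee_rows(json_data):
    """Flatten the decoded JSON into (first, last, total pay, title, department) tuples."""
    rows = []
    for emp in json_data["result"]["employees"]:
        name = emp["name"]
        rows.append((name["first"], name["last"], emp["pay"]["total"],
                     emp["title"], emp["department"]))
    return rows

# Hypothetical payload with the assumed shape; in practice pass r.json() instead.
sample = {"result": {"employees": [
    {"name": {"first": "Clayton", "last": "Abajian"},
     "pay": {"total": 9844},
     "title": "Lecturer - Academic Year",
     "department": "CSU Sacramento"},
]}}

for row in employee_rows(sample):
    print(*row)
```

From there the rows can go straight into csv.writer or whatever you're using to store the table.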
https://stackoverflow.com/questions/22949029