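I am trying to extract dynamically generated content from a web page. The site lists the performance of different universities and lets you select each study area that a university offers. For example, from the page listed in the code below, I would like to extract the university name ("Bond University") and the value for "Overall quality of experience" (91.3%).
However, when I use "view source", curl, or Scrapy, the actual values are not shown. E.g., where I expect to see the uni name, it shows: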
<h1 class="inline-block instiution-name" data-bind="text: Description"></h1>
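But if I inspect the element with Firebug or Chrome, it shows the same element with the university name filled in.
On further inspection, in Firebug's "Net" tab I can see that an AJAX (?) call is made that returns the relevant information, but I have not been able to mimic it in Scrapy or even curl (yes, I did search, and I'm afraid I spent a long time trying).
Request headers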
POST /Websilk/DataServices/SurveyData.asmx/FetchInstitutionStudyAreaData HTTP/1.1
Host: www.qilt.edu.au
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:39.0) Gecko/20100101 Firefox/39.0
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/json; charset=utf-8
X-Requested-With: XMLHttpRequest
Referer: http://www.qilt.edu.au/institutions/institution/bond-university/business-management
Content-Length: 36
Cookie: _ga=GA1.3.69062787.1442441726; ASP.NET_SessionId=lueff4ysg3yvd2csv5ixsc1f; _gat=1
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
POST parameters passed with the request
{"InstitutionId":20,"StudyAreaId":0}
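As a second option, I tried using Selenium inside Scrapy, since I thought it could "see" the real values the way a browser does, but to no avail. My main attempt so far is below: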
import scrapy
import time  # used for the sleep() function
from selenium import webdriver

class QiltSpider(scrapy.Spider):
    name = "qilt"
    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()
        self.driver.get('http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/')
        time.sleep(5)  # tried pausing, in case problem was delayed loading - didn't work

    def parse(self, response):
        # parse the response to find the uni name and show it in the console (using XPath from Firebug).
        # This finds the relevant section, but it shows as empty
        title = response.xpath('//*[@id="bd"]/div[2]/div/div/div[1]/div/div[2]/h1').extract()
        print title
        # dumping the whole response to a file so I can check whether dynamic values were captured
        with open("extract.html", 'wb') as f:
            f.write(response.body)
        self.driver.close()
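Can anyone tell me how I can get this to work?
Many thanks!
EDIT: Thanks for the suggestions so far, but are there any ideas on how to specifically emulate the AJAX call with the InstitutionId and StudyAreaId parameters? My code for testing this is below, but it still seems to hit an error page.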
import scrapy
from scrapy.http import FormRequest

class HeaderTestSpider(scrapy.Spider):
    name = "headerTest"
    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def parse(self, response):
        return [FormRequest(url="http://www.qilt.edu.au/Websilk/DataServices/SurveyData.asmx/FetchInstitutionData",
                            method='POST',
                            formdata={'InstitutionId':'20', 'StudyAreaId': '0'},
                            callback=self.parser2)]
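The QILT page uses AJAX to retrieve its data from the server. The AJAX request is sent by JavaScript code that runs on the document ready event (jQuery) / window.onload (JavaScript). If you are not familiar with JavaScript, this handler fires as soon as the web page has loaded in the browser window. Since you are using software to issue the page request, this event is never fired at all.
For the AJAX request you are trying to emulate, the request body type is application/json. Please add the following header to the request: Content-Type: application/json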
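As a minimal sketch of that advice, the snippet below builds the same POST with a JSON-encoded body and the JSON content type, using only the Python 3 standard library. The URL, payload, and header values are copied from the captured request in the question. The helper names `build_request` and `fetch` are hypothetical, and whether the server also expects the session cookie or the Referer header is not known from the thread.

```python
import json
import urllib.request

# Endpoint taken from the captured request in the question.
URL = ("http://www.qilt.edu.au/Websilk/DataServices/"
       "SurveyData.asmx/FetchInstitutionStudyAreaData")


def build_request(institution_id, study_area_id):
    """Build (but do not send) the POST request with a JSON body.

    Hypothetical helper: the name and signature are illustrative only.
    """
    payload = {"InstitutionId": institution_id, "StudyAreaId": study_area_id}
    # The body must be the JSON text itself, not form-encoded key/value pairs.
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json; charset=utf-8",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "X-Requested-With": "XMLHttpRequest",
    }
    return urllib.request.Request(URL, data=body, headers=headers, method="POST")


def fetch(request, timeout=30):
    """Send the request and return the decoded JSON response.

    Assumes the service answers with JSON, as the Accept header requests.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Values from the POST parameters shown in the question.
req = build_request(20, 0)
```

Calling `fetch(req)` would send the request. In Scrapy, the likely reason `FormRequest` with `formdata` fails here is that it form-encodes the body. The equivalent would be a plain `scrapy.Request` with `method="POST"`, `body=json.dumps(payload)`, and the same `Content-Type` header.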