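I am trying to get table data from this URL, but the data is loaded with a POST request. I tried to implement this in Scrapy, but I am getting a 500 error. If you look at the request in the browser's Network tab, it returns 200, yet in Scrapy I get 500. Please check my code and let me know what I am doing wrong here. Please help. Thanks. One more thing: a user agent has already been set.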
from scrapy import Spider
from scrapy.http import Request, FormRequest
from scrapy.utils.response import open_in_browser


class LarmSpider(Spider):
    name = 'larm'
    allowed_domains = ['larmtjanst.se']
    start_urls = ['https://www.larmtjanst.se/Efterlysta-objekt/Personbil/?s=True']

    def parse(self, response):
        # POST to the AJAX search endpoint that the page itself calls
        yield FormRequest('https://www.larmtjanst.se/StolenItemsHelper/SearchAjax?category=Personbil',
                          formdata={'category': 'Personbil'},
                          callback=self.parse_form)

    def parse_form(self, response):
        open_in_browser(response)
        table = response.xpath('//*[contains(@class, "searchResultTable")]')[1]
        trs = table.xpath('.//tr')
        for tr in trs:
            reg_num = tr.xpath('.//td/a/text()').extract_first()
            yield {
                'Register Number': reg_num
            }
Output:
2020-10-28 16:15:50 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <POST https://www.larmtjanst.se/StolenItemsHelper/SearchAjax?category=Personbil> (failed 3 times): 500 Internal Server Error
2020-10-28 16:15:50 [scrapy.core.engine] DEBUG: Crawled (500) <POST https://www.larmtjanst.se/StolenItemsHelper/SearchAjax?category=Personbil> (referer: https://www.larmtjanst.se/Efterlysta-objekt/Personbil/?s=True)
2020-10-28 16:15:50 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://www.larmtjanst.se/StolenItemsHelper/SearchAjax?category=Personbil>: HTTP status code is not handled or not allowed
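You should not only send the form data but also set some headers to get the desired output. Feel free to use https://michael-shub.github.io/curl2scrapy/ to debug the request in scrapy shell. This worked for me: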
As you can see, you should set __RequestVerificationToken, which can be found in the page body. Some other headers are also required.
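A minimal sketch of the idea described above, not the answer's original snippet. It assumes the token sits in a hidden `<input name="__RequestVerificationToken">` on the search page and is sent back as a form field. The helper names (`extract_token`, `build_request_kwargs`) and the exact header set (`X-Requested-With`, `Referer`) are hypothetical; copy the real headers from the request shown in the browser's Network tab.

```python
from html.parser import HTMLParser

TOKEN_FIELD = "__RequestVerificationToken"


class _TokenParser(HTMLParser):
    """Collects the value of the first <input> named TOKEN_FIELD."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # Only look at <input> tags, and stop after the first match.
        if tag != "input" or self.token is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == TOKEN_FIELD:
            self.token = attr_map.get("value")


def extract_token(html):
    """Return the token found in the page body, or None if it is absent."""
    parser = _TokenParser()
    parser.feed(html)
    return parser.token


def build_request_kwargs(token, referer):
    """Build formdata and headers for the POST.

    Assumption: the header set below is illustrative, not confirmed
    for this site.
    """
    return {
        "formdata": {"category": "Personbil", TOKEN_FIELD: token},
        "headers": {
            "X-Requested-With": "XMLHttpRequest",
            "Referer": referer,
        },
    }
```

Inside the spider's `parse` method this could be used as `token = extract_token(response.text)`, followed by `yield FormRequest(url, callback=self.parse_form, **build_request_kwargs(token, response.url))`. Within Scrapy, the token can also be read with `response.xpath('//input[@name="__RequestVerificationToken"]/@value').get()`, again assuming it is rendered as a hidden input.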