I am scraping with Scrapy. The XPath is correct, but it does not return the content of the tag.

2018-06-12 13:46:01 [oauth2client.client] INFO: Refreshing access_token
2018-06-12 13:46:01 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): spreadsheets.google.com
2018-06-12 13:46:02 [urllib3.connectionpool] DEBUG: https://spreadsheets.google.com:443 "GET /feeds/spreadsheets/private/full HTTP/1.1" 200 None
2018-06-12 13:46:03 [urllib3.connectionpool] DEBUG: https://spreadsheets.google.com:443 "GET /feeds/worksheets/1oxVjCH2otn_OcS5PlogjsPd8fkDXEnI_4dWptACS4eU/private/full HTTP/1.1" 200 None
2018-06-12 13:46:03 [urllib3.connectionpool] DEBUG: https://spreadsheets.google.com:443 "GET /feeds/cells/1oxVjCH2otn_OcS5PlogjsPd8fkDXEnI_4dWptACS4eU/od6/private/full HTTP/1.1" 200 None
['7185297', 'http://macys.com/shop/product/polo-ralph-lauren-baseline-hat?ID=2606206', '24.99', '35', 'New', '19 Sep 16']
2018-06-12 13:46:07 [scrapy.core.engine] DEBUG: Crawled (301) <GET http://macys.com/shop/product/polo-ralph-lauren-baseline-hat?ID=2606206> (referer: None)
None

import scrapy
import gspread
from oauth2client.service_account import ServiceAccountCredentials


class MacysSpider(scrapy.Spider):
    name = 'macys'
    allowed_domains = ['macys.com']

    def start_requests(self):
        scope = "https://spreadsheets.google.com/feeds"
        credentials = ServiceAccountCredentials.from_json_keyfile_name("/home/prakher/cred.json", scope)
        gs = gspread.authorize(credentials)
        gsheet = gs.open("test")
        # worksheet=gsheet.sheet_by_index(0)
        wsheet = gsheet.worksheet("Sheet1")
        all_rows = wsheet.get_all_values()
        all_urls = all_rows[1:]
        #all_urls=['http://goodreads.com/quotes']
        for url in all_urls:
            print(url)
            yield scrapy.Request(url=url[1], meta={
                'dont_redirect': True,
                'handle_httpstatus_list': [302, 200, 301]
            }, callback=self.parse)

    def parse(self, response):
        print("Hi")
        res = response.xpath('.//div[@class="columns medium-10 medium-centered product-not-available-message"]/p/text()').extract_first()
        print(res)

1 answer

User

#1 · Posted 2024-10-03 02:47:06

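The problem is that the Scrapy shell follows the redirect, whereas your spider does not. You explicitly set dont_redirect to True and include the 3xx codes in handle_httpstatus_list, so the 301 response shown in your log (Crawled (301)) is passed straight to parse, and the XPath finds nothing in it: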

yield scrapy.Request(url=url[1], meta={
    'dont_redirect': True,
    'handle_httpstatus_list': [302, 200, 301]
}, callback=self.parse)
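
Either of the two meta keys is enough on its own to keep the redirect from being followed. The sketch below is a simplified, standard-library-only model of that decision, written for illustration and not taken from Scrapy's source. The names `would_follow_redirect`, `REDIRECT_STATUSES`, and `question_meta` are hypothetical and are not part of Scrapy.

```python
# Simplified model (an assumption for illustration, not Scrapy's code):
# a 3xx response is followed only when the request's meta does not
# opt out of redirect handling.

REDIRECT_STATUSES = {301, 302, 303, 307, 308}  # hypothetical constant


def would_follow_redirect(status, meta):
    """Return True if a response with this status would be redirected."""
    if meta.get("dont_redirect", False):
        return False  # redirects are switched off for this request
    if status in meta.get("handle_httpstatus_list", []):
        return False  # the callback asked to receive this status itself
    return status in REDIRECT_STATUSES


# The meta dictionary used in the question.
question_meta = {"dont_redirect": True, "handle_httpstatus_list": [302, 200, 301]}

print(would_follow_redirect(301, question_meta))  # False
print(would_follow_redirect(301, {}))             # True
```

With the meta from the question the 301 is handed to the callback unchanged, while a request without those keys is redirected.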

Try removing both meta keys so that Scrapy follows the redirect, as it does by default:

yield scrapy.Request(url=url[1], callback=self.parse)
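
The difference between the two requests can be reproduced outside Scrapy. The following is a minimal sketch using only the Python standard library, not Scrapy itself. The local server, the `/old` and `/new` paths, the page body, and the shortened class name are hypothetical stand-ins for the real site. It fetches the same URL once without following redirects and once with redirects followed, then looks for the message element in each response.

```python
import threading
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page body standing in for the real product page.
PAGE = (
    b"<html><body>"
    b'<div class="product-not-available-message"><p>Not available</p></div>'
    b"</body></html>"
)
XPATH = './/div[@class="product-not-available-message"]/p'


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/old":
            # Permanent redirect with an empty body, like the 301 in the log.
            self.send_response(301)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(PAGE)))
            self.end_headers()
            self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the demo output quiet


class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # refuse to follow, so the 3xx surfaces to the caller


def extract_message(body):
    """Return the message text, or None when the markup is absent."""
    if not body.strip():
        return None
    node = ET.fromstring(body).find(XPATH)
    return None if node is None else node.text


def fetch(url, follow_redirects):
    # An empty ProxyHandler keeps the request on localhost.
    handlers = [urllib.request.ProxyHandler({})]
    if not follow_redirects:
        handlers.append(NoRedirect())
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url) as resp:
            return resp.status, extract_message(resp.read())
    except urllib.error.HTTPError as err:
        # urllib raises an unfollowed 3xx as HTTPError.
        return err.code, extract_message(err.read())


server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/old"

not_followed = fetch(url, follow_redirects=False)
followed = fetch(url, follow_redirects=True)
server.shutdown()
server.server_close()

print(not_followed)  # (301, None)
print(followed)      # (200, 'Not available')
```

The first result mirrors the spider in the question: the 301 response has no body to search, so the lookup returns None. The second mirrors the corrected request, where the redirect target is fetched and the element is found.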
