Scrapy Python spider unable to find links using LinkExtractor or a manual Request()

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http.request import Request
from scrapy.contrib.linkextractors import LinkExtractor
from genesisSpider.items import GenesisJob


class genesis_crawl_spider(CrawlSpider):
    name = "genesis"
    #allowed_domains = ['http://www.ghcjobs.apply2jobs.com']
    start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1']

    #allow &CurrentPage= up to 1000, currently ~ 512
    # NOTE: [1-1000] is a character class that matches a single '0' or '1',
    # not the numbers 1 to 1000; \d+ would match any page number.
    rules = [Rule(LinkExtractor(allow=("^https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm\?fuseaction=mExternal.returnToResults&CurrentPage=[1-1000]$")), 'parse_inner_page')]

    def parse_inner_page(self, response):
        self.log('===========Entered Inner Page============')
        self.log(response.url)
        item = GenesisJob()
        item['url'] = response.url
        yield item

1 Answer

User

Answer 1 · posted 2024-09-29 23:27:19

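Does Scrapy really see the web page the way you do? More and more websites are built with JavaScript and Ajax, and that dynamic content may need a fully functional browser before it is populated. Neither Nutch nor Scrapy can handle this out of the box.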

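First of all, you need to make sure that the web content you are interested in can actually be retrieved by Scrapy. There are several ways to check this. I usually use urllib2 and beautifulsoup4 for a quick try. Your start page did not pass my test.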

$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> url = "https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1"

>>> html = urllib2.urlopen(url).read()
>>> soup = BeautifulSoup(html)
>>> table = soup.find('div', {'id':'VESearchResults'})
>>> table.text
u'\n\n\n\r\n\t\t\tJob Title\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tArea of Interest\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tLocation\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tState\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tCity\xa0\r\n\t\t\t\r\n\t\t\n\n\n\r\n\t\t\t\t\tNo results matching your criteria.\r\n\t\t\t\t\n\n\n'
>>>

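As you can see, the response says "No results matching your criteria." You will need to figure out why the content is not populated. Cookies? POST instead of GET? The User-Agent header? Something else?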
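One way to narrow down those possibilities is to repeat the request with a cookie jar and a browser-like User-Agent and see whether the "no results" marker disappears. The following is only a minimal sketch in Python 3 (the session above is Python 2). The function names, the default User-Agent string, and the optional warmup_url parameter are assumptions made for illustration; nothing here is confirmed to be what this particular site needs.

```python
import http.cookiejar
import urllib.request

# Text the server returned in the interpreter session above.
NO_RESULTS_MARKER = "No results matching your criteria."


def build_opener(user_agent):
    """Return (opener, cookie_jar): keeps cookies and sends a custom User-Agent."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def has_results(html):
    """True when the page does not contain the 'no results' marker."""
    return NO_RESULTS_MARKER not in html


def probe(url, user_agent="Mozilla/5.0 (compatible; probe)", warmup_url=None):
    """Fetch url and report (has_results, number_of_cookies_received).

    warmup_url is hypothetical: a page to visit first so the site can set
    session cookies before the results page is requested.
    """
    opener, jar = build_opener(user_agent)
    if warmup_url:
        opener.open(warmup_url, timeout=30).read()
    html = opener.open(url, timeout=30).read().decode("utf-8", errors="replace")
    return has_results(html), len(jar)
```

Calling probe() on the start URL with and without a warmup_url, or with different User-Agent strings, shows which of these factors changes the outcome, if any.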

You can also use the scrapy parse command to help you debug; I use it quite often.

There are a few other Scrapy commands as well, and Selenium may be helpful later on.

Here I am running scrapy shell in IPython to inspect your start URL. The first record I can see in my browser contains Englewood, but it does not exist in the HTML that Scrapy grabbed.

Update:

What you are doing is a really trivial scraping job, and you don't need Scrapy for it; it's a bit of overkill. Here are my suggestions:

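Take a look at Selenium (I assume you write Python), and eventually run Selenium headless when you move it to a server.
You can do this with PhantomJS, a lightweight JavaScript executor that is enough for this job. Here is another Stack Overflow question that might help.
There are a few other resources you can explore as well.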
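The headless-browser suggestion can be sketched roughly as follows. This assumes Selenium 4 with a local Chrome and matching driver, and it swaps headless Chrome in for PhantomJS, since current Selenium releases no longer support PhantomJS. The function names are made up for illustration, and whether the rendered page actually contains the job listings still has to be checked.

```python
from urllib.parse import urlencode

# Base URL taken from the question's start_urls.
BASE_URL = "https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm"


def results_page_url(page):
    """Build the results URL for a given page number."""
    query = urlencode({
        "fuseaction": "mExternal.returnToResults",
        "CurrentPage": page,
    })
    return BASE_URL + "?" + query


def fetch_rendered_html(url):
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported lazily so results_page_url() works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Content loaded asynchronously may need an explicit wait before this.
        return driver.page_source
    finally:
        driver.quit()
```

The returned HTML can then be passed to BeautifulSoup exactly as in the interpreter session above, for example fetch_rendered_html(results_page_url(1)).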
