I'm seeing some strange behaviour from my CrawlSpider that I can't explain; any suggestions welcome, thanks!
My CrawlSpider script (sdcrawler.py) is below. If I call it from the command line (e.g. "python sdcrawler.py 'myEGurl.com' 'http://www.myEGurl.com/testdomain' './outputfolder/' 'testdomain/'"), the LinkExtractor follows the links on the page and enters the parse_item callback to process any links it finds. However, if I try to invoke exactly the same command from a Python script with os.system(), then for some pages (not all of them) the CrawlSpider doesn't follow any links and never enters the parse_item callback. I can't get any output or error messages that would explain why parse_item isn't being called for those pages in this case. print statements I've added confirm that __init__ is definitely called, but then the spider just closes. What I don't understand is why, if I paste the exact "python sdcrawler.py ..." command that os.system() uses into a terminal and run it, parse_item is called for exactly the same arguments?
CrawlSpider code:
    import os
    import re
    import sys
    import datetime

    from scrapy import signals
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor
    from scrapy.settings import Settings
    from scrapy.crawler import Crawler
    from twisted.internet import reactor


    class SDSpider(CrawlSpider):
        name = "sdcrawler"

        # requires 'domain', 'start_page', 'folderpath' and 'sub_domain' to be
        # passed as string arguments IN THIS PARTICULAR ORDER!!!
        def __init__(self):
            self.allowed_domains = [sys.argv[1]]
            self.start_urls = [sys.argv[2]]
            self.folder = sys.argv[3]
            try:
                os.stat(self.folder)
            except OSError:
                os.makedirs(self.folder)
            sub_domain = sys.argv[4]
            self.rules = [Rule(LinkExtractor(allow=sub_domain), callback='parse_item', follow=True)]
            print settings['CLOSESPIDER_PAGECOUNT']
            super(SDSpider, self).__init__()

        def parse_item(self, response):
            # check for a correctly formatted HTML page; ignores junk pages and PDFs
            print "entered parse_item\n"
            if re.search(r"<!\s*doctype\s*(.*?)>", response.body, re.IGNORECASE) or 'HTML' in response.body[0:10]:
                s = 1
            else:
                s = 0
            if response.url[-4:] == '.pdf':
                s = 0
            if s:
                filename = response.url.replace(":", "_c_").replace(".", "_o_").replace("/", "_l_") + '.htm'
                if len(filename) > 255:
                    filename = filename[0:220] + '_filename_too_long_' + str(datetime.datetime.now().microsecond) + '.htm'
                wfilename = self.folder + filename
                with open(wfilename, 'wb') as f:
                    f.write(response.url)
                    f.write('\n')
                    f.write(response.body)
                    print "i'm writing a html!\n"
                    print response.url + "\n"
            else:
                print "s is zero, not scraping\n"


    # callback fired when the spider is closed
    def callback(spider, reason):
        stats = spider.crawler.stats.get_stats()  # collect/log stats?
        # stop the reactor
        reactor.stop()
        print "spider closing\n"


    # instantiate settings and provide a custom configuration
    settings = Settings()
    settings.set('DEPTH_LIMIT', 5)
    settings.set('CLOSESPIDER_PAGECOUNT', 100)
    settings.set('DOWNLOAD_DELAY', 3)
    settings.set('USER_AGENT', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko)')
    # breadth-first crawl (depth-first is the default; comment out the three lines below to run depth-first)
    settings.set('DEPTH_PRIORITY', 1)
    settings.set('SCHEDULER_DISK_QUEUE', 'scrapy.squeue.PickleFifoDiskQueue')
    settings.set('SCHEDULER_MEMORY_QUEUE', 'scrapy.squeue.FifoMemoryQueue')

    # instantiate a crawler passing in settings
    crawler = Crawler(settings)
    # instantiate a spider
    spider = SDSpider()
    # configure signals
    crawler.signals.connect(callback, signal=signals.spider_closed)
    # configure and start the crawler
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    # start the reactor (blocks execution)
    reactor.run()
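Because __init__ reads sys.argv directly, the spider's behaviour depends entirely on the argv the launched process ends up with. A minimal sketch (the class and method names are mine, purely illustrative) of validating the four positional arguments up front, so a malformed invocation fails loudly instead of silently crawling nothing:

```python
class SpiderArgs(object):
    """Illustrative container for the four positional arguments
    (domain, start_page, folderpath, sub_domain); not part of the
    original script."""

    def __init__(self, domain, start_page, folderpath, sub_domain):
        self.allowed_domains = [domain]
        self.start_urls = [start_page]
        self.folder = folderpath
        self.sub_domain = sub_domain

    @classmethod
    def from_argv(cls, argv):
        # argv[0] is the script name; exactly four arguments must follow
        if len(argv) != 5:
            raise ValueError("expected 4 arguments, got %d" % (len(argv) - 1))
        return cls(*argv[1:])
```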
Edit in response to @alecxe's comment:
I'm calling sdcrawler.py with os.system() inside a function called execute_spider(). The arguments are a list of URLs containing the subdomains, read from a .txt file, plus the overall domain URL that the spider should stay within while exploring a subdomain.
execute_spider() code:
^{pr2}$
I print cmd just before os.system(cmd), and if I simply copy that print output and run it in a separate terminal, the CrawlSpider behaves as I expect, visiting links and parsing them with the parse_item callback.
The output of printing sys.argv is:
['sdcrawler.py', 'example.com', 'http://example.com/testdomain/', './outputfolder/', 'testdomain/']
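To double-check how the shell tokenises the command string that os.system() hands to /bin/sh, shlex.split from the standard library applies the same POSIX quoting rules, so it shows the argv the spider process should actually receive (the URLs here mirror the example above):

```python
import shlex

# shlex.split applies POSIX-shell quoting rules, so the result is the argv
# the child process receives: the single quotes are consumed by the shell
# and never reach sys.argv inside the spider.
cmd = ("python sdcrawler.py 'example.com' 'http://example.com/testdomain/' "
       "'./outputfolder/' 'testdomain/'")
argv = shlex.split(cmd)
print(argv[1:])
# → ['sdcrawler.py', 'example.com', 'http://example.com/testdomain/',
#    './outputfolder/', 'testdomain/']
```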