I'm trying to scrape data from Reddit with a Scrapy spider. I want the spider to iterate over every URL in my URL list (stored in a file named reddit.txt) and collect data, but I get an error where the entire list of URLs is treated as a single start URL. Here is my code:
import scrapy
import time

class RedditSpider(scrapy.Spider):
    name = 'reddit'
    allowed_domains = ['www.reddit.com']
    custom_settings = {'FEED_URI': "reddit_comments.csv", 'FEED_FORMAT': 'csv'}

    with open('reddit.txt') as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        for URL in response.css('html'):
            data = {}
            data['body'] = URL.css(r"div[style='--commentswrapper-gradient-color:#FFFFFF;max-height:unset'] p::text").extract()
            data['name'] = URL.css(r"div[style='--commentswrapper-gradient-color:#FFFFFF;max-height:unset'] a::text").extract()
            time.sleep(5)
            yield data
Here is my output:
scrapy.exceptions.NotSupported: Unsupported URL scheme '': no handler available for that scheme
2020-07-26 00:51:34 [scrapy.core.scraper] ERROR: Error downloading <GET ['http://www.reddit.com/r/electricvehicles/comments/lb6a3/im_meeting_with_some_people_helping_to_bring_evs/',%20'http://www.reddit.com/r/electricvehicles/comments/1b4a3b/prospective_buyer_question_what_is_a_home/',%20'http://www.reddit.com/r/electricvehicles/comments/1f5dmm/any_rav4_ev_drivers_on_reddit/' ...
Part of my list: ['http://www.reddit.com/r/electricvehicles/comments/lb6a3/im_meeting_with_some_people_helping_to_bring_evs/', 'http://www.reddit.com/r/electricvehicles/comments/1b4a3b/prospective_buyer_question_what_is_a_home/', 'http://www.reddit.com/r/electricvehicles/comments/1f5dmm/any_rav4_ev_drivers_on_reddit/', 'http://www.reddit.com/r/electricvehicles/comments/1fap6p/any_good_subreddits_for_ev_conversions/', 'http://www.reddit.com/r/electricvehicles/comments/1h9o9t/buying_a_motor_for_an_ev/', 'http://www.reddit.com/r/electricvehicles/comments/1iwbp7/is_there_any_law_governing_whether_a_parking/', 'http://www.reddit.com/r/electricvehicles/comments/1j0bkv/electric_engine_regenerative_braking/',...]
I'd appreciate any help with this. Thanks, everyone!
You can open the URL file in a start_requests method instead, yielding a request with a callback to the parse method for each URL. Also make sure the input file is formatted correctly, with one URL per line.
Without seeing the full traceback or your reddit.txt file I can't be certain, but I believe the problem is in the txt file itself. Try printing the file's contents in a separate script (or add a print() inside the spider). If I'm right, the output will be all of the URLs in one single string rather than separated onto individual lines.
Make sure each URL in the txt file is on its own line.
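A standalone, stdlib-only script to check for this (and repair the file if it turns out to hold one big Python-style list literal, as your error message suggests). The helper name fix_url_file is made up for this sketch, and it assumes reddit.txt is in the working directory and the list literal is complete (not truncated with "..."):

```python
import ast

def fix_url_file(path):
    """If `path` contains a single Python list literal of URLs, rewrite it
    with one URL per line. Returns the resulting list of URLs either way."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    if len(lines) == 1 and lines[0].startswith('['):
        # The whole list landed on one line, e.g. "['http://...', 'http://...']".
        urls = [u.strip() for u in ast.literal_eval(lines[0])]
        with open(path, 'w') as f:
            f.write('\n'.join(urls) + '\n')
        return urls
    return lines

if __name__ == '__main__':
    urls = fix_url_file('reddit.txt')
    print(f'reddit.txt now has {len(urls)} URL(s), one per line')
```

After running this once, the class-body start_urls read in your original spider should produce a proper list of individual URLs.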