所以当我试图从epinions.com网站,如果主评论文本太长,它会有一个指向另一个页面的“阅读更多”链接。 {我先看看你的评论是什么意思。在
我在想:在for循环的每个迭代中是否可以有一个小蜘蛛来获取url并从新链接中刮出评论?我有下面的代码,但它不适用于小“蜘蛛”。在
这是我的代码:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from epinions_test.items import EpinionsTestItem
from scrapy.http import Response, HtmlResponse
class MySpider(BaseSpider):
name = "epinions"
allow_domains = ["epinions.com"]
start_urls = ['http://www.epinions.com/reviews/samsung-galaxy-note-16-gb-cell-phone/pa_~1']
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//div[@class="review_info"]')
items = []
for sites in sites:
item = EpinionsTestItem()
item["title"] = sites.select('h2/a/text()').extract()
item["star"] = sites.select('span/a/span/@title').extract()
item["date"] = sites.select('span/span/span/@title').extract()
item["review"] = sites.select('p/span/text()').extract()
# Everything works fine and i do have those four columns beautifully printed out, until....
url2 = sites.select('p/span/a/@href').extract()
url = str("http://www.epinions.com%s" %str(url2)[3:-2])
# This url is a string. when i print it out, it's like "http://www.epinions.com/review/samsung-galaxy-note-16-gb-cell-phone/content_624031731332", which looks legit.
response2 = HtmlResponse(url)
# I tried in a scrapy shell, it shows that this is a htmlresponse...
hxs2 = HtmlXPathSelector(response2)
fullReview = hxs2.select('//div[@class = "user_review_full"]')
item["url"] = fullReview.select('p/text()').extract()
# The three lines above works in an independent spider, where start_url is changed to the url just generated and everything.
# However, i got nothing from item["url"] in this code.
items.append(item)
return items
为什么item[“url”]什么也不返回?在
谢谢!在
您应该在回调中实例化一个新的
Request
,并在meta
dict中传递您的item
:{另请参见^ a1。在
希望有帮助。在
相关问题 更多 >
编程相关推荐