(Python 3): Scrapy MongoDB pipeline not working

Posted 2024-10-04 01:34:05


I'm trying to connect to MongoDB through PyMongo using a Scrapy pipeline, so I can create a new database and populate it with the content I've just scraped, but I've run into a strange problem. I followed the basic tutorial and set up two command-line windows, one to run scrapy in and the other to run mongod. Unfortunately, when I run the scrapy code after starting mongod, mongod doesn't seem to pick up the Scrapy pipeline I'm trying to set up; it just keeps sitting on the "waiting for connections on port 27017" notice.

In command line 1 (scrapy), I have the directory set to Documents/PyProjects/twitterBot/krugman.

In command line 2 (mongod), I have it set to Documents/PyProjects/twitterBot.
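As far as I understand it, mongod's "waiting for connections" line only means the server is up and listening; it doesn't print anything more until a client actually connects. A minimal standalone check like the one below (not part of the project code, and assuming the default localhost:27017) can be used to confirm that mongod is reachable from Python:

# Standalone sanity check: is the mongod started in command line 2 reachable?
from pymongo import MongoClient

client = MongoClient('localhost', 27017, serverSelectionTimeoutMS=2000)
client.admin.command('ping')  # raises ServerSelectionTimeoutError if mongod is unreachable
print(client.server_info()['version'])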

The scripts I am using are below. krugman/krugman/spiders/krugspider.py (pulls in Paul Krugman's blog entries):

from scrapy import http
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
import scrapy
import pymongo
import json
from krugman.items import BlogPost


class krugSpider(CrawlSpider):
    name = 'krugbot'
    start_url = ['https://krugman.blogs.nytimes.com']

    def __init__(self):
        self.url = 'https://krugman.blogs.nytimes.com/more_posts_jsons/page/{0}/?homepage=1&apagenum={0}'

    def start_requests(self):
        yield http.Request(self.url.format('1'), callback = self.parse_page)

    def parse_page(self, response):
        data = json.loads(response.body)
        for block in range(len(data['posts'])):
            for article in self.parse_block(data['posts'][block]):
                yield article


        page = data['args']['paged'] + 1
        url = self.url.format(str(page))
        yield http.Request(url, callback = self.parse_page)


    def parse_block(self, content):
        article = BlogPost(author = 'Paul Krugman', source = 'Blog')                
        paragraphs = Selector(text = str(content['html']))

        article['paragraphs']= paragraphs.css('p.story-body-text::text').extract()
        article['links'] = paragraphs.css('p.story-body-text a::attr(href)').extract()
        article['datetime'] = content['post_date']
        article['post_id'] = content['post_id']
        article['url'] = content['permalink']
        article['title'] = content['headline']

        yield article
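The BlogPost item imported at the top comes from krugman/items.py, which is not shown in the post; judging from the fields the spider assigns, its declaration would be roughly the following (a sketch, not the exact file):

import scrapy

class BlogPost(scrapy.Item):
    # Fields inferred from the assignments in parse_block above
    author = scrapy.Field()
    source = scrapy.Field()
    paragraphs = scrapy.Field()
    links = scrapy.Field()
    datetime = scrapy.Field()
    post_id = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()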

krugman/krugman/settings.py:

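The settings block did not survive the copy of this post; based on the keys the pipeline reads below, it would look roughly like the following (the database and collection names here are placeholders, and the pipeline also has to be registered in ITEM_PIPELINES for Scrapy to call it):

MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'krugman'   # placeholder database name
MONGODB_BLOG = 'posts'   # placeholder collection name

ITEM_PIPELINES = {
    'krugman.pipelines.KrugmanPipeline': 300,
}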

krugman/krugman/pipelines.py:

from pymongo import MongoClient
from scrapy.conf import settings
from scrapy import log

class KrugmanPipeline(object):

    def __init(self):
        connection = MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_BLOG']]

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        log.msg("Test this out")
        return item

I'm not getting any error messages, so I'm having a hard time troubleshooting this. The pipeline just doesn't seem to fire at all. Any idea what my problem is?
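For reference, a quick standalone way to check whether anything has actually been inserted (reusing the placeholder database and collection names from the settings sketch above):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['krugman']                     # placeholder names from the sketch above
print(db['posts'].count_documents({}))     # 0 means the pipeline never inserted anything
for doc in db['posts'].find().limit(3):
    print(doc.get('title'), doc.get('url'))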

