Answers to: Poor performance improvement and memory consumption


1 answer

Anonymous, 1 day ago

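As I said, you should use <a href="https://wiki.python.org/moin/Generators" rel="nofollow noreferrer">generators</a> to avoid building a list of objects in memory (see <a href="https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python">what-does-the-yield-keyword-do-in-python</a>). With a generator, objects are created lazily, so the full list of objects never has to exist in memory at once:

<pre><code>import csv

import scrapy


def get_urls_from_csv():
    with open('data.csv', newline='') as csv_file:
        data = csv.reader(csv_file, delimiter=',')
        for row in data:
            yield "http://" + row[2]  # yield each url lazily


class rssitem(scrapy.Item):
    sourceurl = scrapy.Field()
    rssurl = scrapy.Field()


class RssparserSpider(scrapy.Spider):
    name = "rssspider"
    allowed_domains = ["*"]
    start_urls = ()

    def start_requests(self):
        # return a generator expression, so requests are built one at a time
        return (scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv())

    def parse(self, response):
        res = response.xpath('//link[@type="application/rss+xml"]/@href')
        for sel in res:
            item = rssitem()
            item['sourceurl'] = response.url
            item['rssurl'] = sel.extract()
            yield item
</code></pre>

As for performance, the documentation on <a href="https://doc.scrapy.org/en/latest/topics/broad-crawls.html" rel="nofollow noreferrer">Broad Crawls</a> suggests trying to <a href="http://doc.scrapy.org/en/latest/topics/broad-crawls.html#increase-concurrency" rel="nofollow noreferrer">increase concurrency</a>:

<blockquote>Concurrency is the number of requests that are processed in parallel. There is a global limit and a per-domain limit. The default global concurrency limit in Scrapy is not suitable for crawling many different domains in parallel, so you will want to increase it. How much to increase it will depend on how much CPU your crawler will have available. A good starting point is 100, but the best way to find out is by doing some trials and identifying at what concurrency your Scrapy process gets CPU bounded. For optimum performance, you should pick a concurrency where CPU usage is at 80-90%. To increase the global concurrency use:</blockquote>

<pre><code>CONCURRENT_REQUESTS = 100
</code></pre>

There is also <a href="http://doc.scrapy.org/en/latest/topics/broad-crawls.html#increase-twisted-io-thread-pool-maximum-size" rel="nofollow noreferrer">Increase Twisted IO thread pool maximum size</a>:

<blockquote>Currently Scrapy does DNS resolution in a blocking way, using a thread pool. With higher concurrency levels the crawling could be slow or even fail by hitting DNS resolver timeouts. A possible solution is to increase the number of threads handling DNS queries. The DNS queue will then be processed faster, speeding up connection establishment and crawling overall. To increase the maximum thread pool size use:</blockquote>

<pre><code>REACTOR_THREADPOOL_MAXSIZE = 20
</code></pre>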
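To see the lazy behaviour of the generator approach in isolation, here is a minimal standard-library sketch that does not need Scrapy. The names `SAMPLE_CSV` and `iter_urls` and the sample rows are assumptions made up for illustration; the in-memory `io.StringIO` stands in for `data.csv`, and the URL is taken from the third column as in the answer's code.

```python
import csv
import io
import types

# Hypothetical stand-in for data.csv: three rows, host name in the third column.
SAMPLE_CSV = "1,Example,example.com\n2,Python,python.org\n3,Scrapy,scrapy.org\n"


def iter_urls(csv_file):
    """Yield one URL per CSV row instead of building a list of all of them."""
    for row in csv.reader(csv_file, delimiter=","):
        yield "http://" + row[2]


urls = iter_urls(io.StringIO(SAMPLE_CSV))

# Calling a generator function returns a generator object; no row is parsed yet.
print(isinstance(urls, types.GeneratorType))  # True

# Each next() pulls one more row from the reader and produces one URL.
first = next(urls)
print(first)  # http://example.com

# The remaining rows are only consumed when something iterates over them.
rest = list(urls)
print(rest)  # ['http://python.org', 'http://scrapy.org']
```

In the spider, Scrapy itself plays the role of the consumer: it iterates over what `start_requests` returns, so a generator there means request objects are created as they are pulled rather than all at once.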
