<p>I suggest using <a href="https://scrapy.readthedocs.org" rel="nofollow">Scrapy</a>. It is a powerful, easy-to-use web-scraping framework. Here is why it is worth a try:</p>
<ol>
<li><p>Speed, performance, and efficiency</p>
<blockquote>
<p>Scrapy is written with Twisted, a popular event-driven networking
framework for Python. Thus, it’s implemented using a non-blocking (aka
asynchronous) code for concurrency.</p>
</blockquote></li>
<li><p>Database pipelining</p>
<p>Scrapy has an <code>Item Pipelines</code> feature:</p>
<blockquote>
<p>After an item has been scraped by a spider, it is sent to the Item
Pipeline which processes it through several components that are executed
sequentially.</p>
</blockquote>
<p>So each page can be written to the database as soon as it has been downloaded.</p></li>
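<li><p>Code organization</p>
<p>Scrapy gives you a clean, clear project structure in which settings, spiders, items, pipelines, and so on are logically separated. That alone makes your code clearer and easier to maintain and understand.</p></li>
<li><p>Time to code</p>
<p>Scrapy does a lot of work for you behind the scenes. That lets you focus on the actual code and logic rather than on the "metal" part: creating processes, threads, and so on.</p></li>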
</ol>
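<p>To make the database pipelining point concrete, here is a minimal sketch of an item pipeline. It is only an illustration under assumptions: the class name <code>SQLitePipeline</code>, the <code>pages</code> table, and the <code>url</code>/<code>title</code> item fields are all hypothetical, and SQLite stands in for whatever database you actually use. The method names <code>open_spider</code>, <code>process_item</code>, and <code>close_spider</code> are the hooks Scrapy calls on a pipeline component.</p>

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline: stores each scraped item in a SQLite table."""

    def __init__(self, db_path="pages.db"):
        # db_path is an assumed default; point it wherever you like.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection
        # and make sure the (assumed) table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields, so each page is
        # written as soon as it has been scraped.
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        self.conn.commit()
        # Return the item so later pipeline components can process it too.
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

<p>To enable such a pipeline you would list it in the <code>ITEM_PIPELINES</code> setting of your project, for example <code>ITEM_PIPELINES = {"myproject.pipelines.SQLitePipeline": 300}</code>, where <code>myproject</code> is a placeholder for your own project name.</p>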
<p>Yes, you guessed it: I love it.</p>
<p>To get started:</p>
<ul>
<li><a href="https://scrapy.readthedocs.org/en/latest/intro/tutorial.html" rel="nofollow">official tutorial</a></li>
<li><a href="http://newcoder.io/scrape/" rel="nofollow">newcoder.io tutorial</a></li>
</ul>
<p>Hope that helps.</p>