Using Python Scrapy to get href URLs from page source and export them to a JSON file

Published 2024-10-02 18:16:06


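I am new to Python and Scrapy. I searched online but could not find many Scrapy examples. As practice and a challenge, I am trying to use Scrapy to get href links from page source and put them into a JSON file. I also found a useful GitHub project that uses Scrapy and Python to generate movie URLs from page source. Unfortunately, that GitHub source is outdated and does not fully work. In the file movie_spider.py I made a one-line change to the source code to replace the URL with a recent working one. That is, I changed this line: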

name, start_urls = 'ip_spider', ['http://iranproud.com/movies']


Then I ran it with this command:

scrapy crawl ip_spider -o movies_list.csv -t csv

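Currently movies.json has 237 movies, but it is 3 years old and does not include all the recent movies. Can someone help me with what changes to make, or how I should update the GitHub project https://github.com/xldrx/kodi-persian-contents to make it work?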

Here is part of the log:

2017-09-09 21:45:05 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://iranproud.net/site.aspx?aspxerrorpath=/iran-1-movies/tv&cinema/yek-damaghe-naghabel> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2017-09-09 21:45:05 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://iranproud.net/iran-1-movies/tv&cinema/az-ma-behtaroon> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2017-09-09 21:45:05 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://iranproud.net/site.aspx?aspxerrorpath=/iran-1-movies/tv&cinema/inja-aseman-hamishe-baranist> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2017-09-09 21:45:10 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://iranproud.net/iran-1-movies/tv&cinema/behtarin-hamsayeh-donya> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2017-09-09 21:45:10 [scrapy.extensions.logstats] INFO: Crawled 137 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
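
The log above shows requests timing out and being retried while no items are scraped. One thing that might be worth experimenting with is the project's timeout, retry, and throttling settings. The fragment below is only a sketch for `settings.py`: the setting names are standard Scrapy settings, but every value is an illustrative assumption rather than something taken from the original project, and none of them will help if the site itself is unreachable.

```python
# settings.py fragment (illustrative values, not taken from the original project)

# Give up on a slow request sooner than Scrapy's default 180-second timeout.
DOWNLOAD_TIMEOUT = 30

# Retry a failed request at most this many times (Scrapy's default is 2).
RETRY_TIMES = 2

# Be gentle with the server: wait between requests and limit parallelism.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Log at INFO level so the retry DEBUG lines do not drown out the crawl stats.
LOG_LEVEL = "INFO"
```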

Here is the movies.json file (the result), which does not include all the recent movie URLs:

{"video_url": "http://63.237.48.3/ipnx/media/movies/KhastehNabashiHQ.mp4", "title": ["Khasteh Nabashid"]},
{"video_url": "http://63.237.48.3/ipnx/media/movies/Khaneh_Neshin_HQ.mp4", "title": ["Khane Neshin"]},
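
In the records above, each `title` is a one-element list rather than a plain string. A small sketch of reading records of this shape with the standard library and unwrapping the title could look like this. The two records are the ones quoted above, wrapped in `[` and `]` so that they form a valid JSON array; the variable names are hypothetical.

```python
import json

# The two records quoted above, wrapped in [] so they form a valid JSON array.
raw = """[
{"video_url": "http://63.237.48.3/ipnx/media/movies/KhastehNabashiHQ.mp4", "title": ["Khasteh Nabashid"]},
{"video_url": "http://63.237.48.3/ipnx/media/movies/Khaneh_Neshin_HQ.mp4", "title": ["Khane Neshin"]}
]"""

movies = json.loads(raw)

# Each title is stored as a one-element list; unwrap it into a plain string.
flat = [{"title": m["title"][0], "video_url": m["video_url"]} for m in movies]

for movie in flat:
    print(movie["title"], "->", movie["video_url"])
```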

Thanks.

