requests-crawler (Python package, PyPI)

A web crawler based on requests-html, mainly aimed at URL validation testing.

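Features

Based on requests-html, with full JavaScript support.
Supports request rate limiting, e.g. RPS/RPM.
Supports crawling with custom headers and cookies.
Include and exclude mechanism for URLs.
Groups visited URLs by HTTP status code.
Displays each URL's referrers and hyperlinks.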
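
To make the "group visited URLs by HTTP status code" feature concrete, here is a minimal sketch of that kind of grouping. It is not taken from the requests-crawler source: the function name `group_by_status` and the sample URLs are hypothetical, and the real tool may store and report its results differently.

```python
from collections import defaultdict


def group_by_status(results):
    """Group (url, status_code) pairs into {status_code: [urls]}."""
    grouped = defaultdict(list)
    for url, status in results:
        grouped[status].append(url)
    return dict(grouped)


# Hypothetical crawl results, invented for illustration only.
results = [
    ("http://example.com/", 200),
    ("http://example.com/about", 200),
    ("http://example.com/missing", 404),
]
report = group_by_status(results)

for status in sorted(report):
    print(status, report[status])
```

Grouping this way makes broken links easy to spot, since every non-200 bucket lists the URLs that need attention.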

Installation / Upgrade

$ pip install requests-crawler

Only Python 3.6 is supported.

To verify that the installation or upgrade succeeded, run requests_crawler -V and check that it prints the expected version number.

$ requests_crawler -V
0.5.3

Usage

$ requests_crawler -h
usage: requests_crawler [-h] [-V] [--log-level LOG_LEVEL]
                        [--seed SEED]
                        [--headers [HEADERS [HEADERS ...]]]
                        [--cookies [COOKIES [COOKIES ...]]]
                        [--requests-limit REQUESTS_LIMIT]
                        [--interval-limit INTERVAL_LIMIT]
                        [--include [INCLUDE [INCLUDE ...]]]
                        [--exclude [EXCLUDE [EXCLUDE ...]]]
                        [--workers WORKERS]

A web crawler based on requests-html, mainly targets for url validation test.

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version
  --log-level LOG_LEVEL
                        Specify logging level, default is INFO.
  --seed SEED           Specify crawl seed url
  --headers [HEADERS [HEADERS ...]]
                        Specify headers, e.g. 'User-Agent:iOS/10.3'
  --cookies [COOKIES [COOKIES ...]]
                        Specify cookies, e.g. 'lang=en country:us'
  --requests-limit REQUESTS_LIMIT
                        Specify requests limit for crawler, default rps.
  --interval-limit INTERVAL_LIMIT
                        Specify limit interval, default 1 second.
  --include [INCLUDE [INCLUDE ...]]
                        Urls include the snippets will be crawled recursively.
  --exclude [EXCLUDE [EXCLUDE ...]]
                        Urls include the snippets will be skipped.
  --workers WORKERS     Specify concurrent workers number.

Examples

Basic usage.

$ requests_crawler --seed http://debugtalk.com

Crawl with headers and cookies.

$ requests_crawler --seed http://debugtalk.com --headers User-Agent:iOS/10.3 --cookies lang:en country:us
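
The `key:value` arguments in the command above have to be turned into mappings before they can be sent with a request. The sketch below shows one plausible way to do that; `parse_pairs` is a hypothetical helper, not a function from requests-crawler, and it assumes the colon-separated form only (how the tool treats the `lang=en` form shown in the help text is not covered here).

```python
def parse_pairs(items):
    """Turn ['User-Agent:iOS/10.3', 'lang:en'] into a dict.

    Splits on the first colon only, so values that themselves
    contain colons are kept intact.
    """
    pairs = {}
    for item in items:
        key, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"expected 'key:value', got {item!r}")
        pairs[key.strip()] = value.strip()
    return pairs


headers = parse_pairs(["User-Agent:iOS/10.3"])
cookies = parse_pairs(["lang:en", "country:us"])
print(headers, cookies)
```

Dictionaries in this shape can be passed as the `headers` and `cookies` keyword arguments of an HTTP client such as requests.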

Crawl at 30 requests per second (RPS).

$ requests_crawler --seed http://debugtalk.com --requests-limit 30

Crawl with a rate limit of 500 requests per minute (RPM).

$ requests_crawler --seed http://debugtalk.com --requests-limit 500 --interval-limit 60
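
The two options combine as "at most --requests-limit requests per --interval-limit seconds", so 500 with an interval of 60 means 500 requests per minute. One common way to enforce such a limit is a sliding window, sketched below. This is an illustration under that assumption, not the limiter requests-crawler actually uses; the class name `SlidingWindowLimiter` is hypothetical, and the clock is passed in explicitly so the behaviour is easy to check.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `requests_limit` requests per `interval_limit` seconds."""

    def __init__(self, requests_limit, interval_limit=1.0):
        self.requests_limit = requests_limit
        self.interval_limit = interval_limit
        self._sent = deque()  # timestamps of requests still inside the window

    def allow(self, now):
        # Drop timestamps that have fallen out of the window.
        while self._sent and now - self._sent[0] >= self.interval_limit:
            self._sent.popleft()
        if len(self._sent) < self.requests_limit:
            self._sent.append(now)
            return True
        return False


# Two requests per second: the third request at t=0.2 is refused,
# and a slot frees up once the request from t=0.0 leaves the window.
limiter = SlidingWindowLimiter(requests_limit=2, interval_limit=1.0)
decisions = [limiter.allow(t) for t in (0.0, 0.1, 0.2, 1.05)]
print(decisions)
```

In a real crawler, `now` would come from `time.monotonic()` and a refused request would wait and retry rather than being dropped.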

Crawl additional hosts, e.g. httprunner.org will also be crawled recursively.

$ requests_crawler --seed http://debugtalk.com --include httprunner.org

Skip excluded URL snippets, e.g. URLs that include httprunner will be skipped.

$ requests_crawler --seed http://debugtalk.com --exclude httprunner
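
The include and exclude options both match on URL snippets (substrings). The sketch below shows how such a decision could be made for each discovered link. It rests on assumptions: the function name `should_crawl_recursively` is hypothetical, exclude is assumed to take precedence over include, and links on the seed's own host are assumed to be crawled by default. The actual rules in requests-crawler may differ.

```python
from urllib.parse import urlparse


def should_crawl_recursively(url, seed_url, include=(), exclude=()):
    """Decide whether links found on `url` should be followed."""
    # Assumption: an exclude snippet always wins.
    if any(snippet in url for snippet in exclude):
        return False
    # An include snippet opts extra hosts into recursive crawling.
    if any(snippet in url for snippet in include):
        return True
    # Otherwise stay on the seed's own host.
    return urlparse(url).netloc == urlparse(seed_url).netloc


seed = "http://debugtalk.com"
print(should_crawl_recursively("http://httprunner.org/docs", seed))
print(should_crawl_recursively("http://httprunner.org/docs", seed,
                               include=["httprunner.org"]))
```

Substring matching keeps the command line simple, at the cost that a short snippet such as httprunner also matches paths on the seed host that happen to contain it.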


requests-crawler 0.5.4
