Parallelizing requests with thread processing in Python

1 answer

User

Reply #1 · posted 2024-05-18 21:23:28

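Go for a solution such as grequests.

The reason is that most of the time you are waiting for an I/O operation to complete (downloading a page). Compared with that I/O time, the pre-processing and post-processing are very likely negligible.

I have seen large speedups from grequests for web scraping.

If your pre-processing and post-processing are also time-consuming, you will have to use the multiprocessing module and run the tasks in multiple processes (if you are CPU-bound, multiple threads will not help).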
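For the CPU-bound case, a minimal sketch of spreading post-processing across processes with the standard multiprocessing module might look like this (Python 3). The function count_words and the sample pages are hypothetical stand-ins for real post-processing and downloaded content.

```python
from multiprocessing import Pool


def count_words(page_text):
    """Hypothetical CPU-bound post-processing step: count whitespace-separated words."""
    return len(page_text.split())


def post_process_all(pages, workers=2):
    """Run count_words over all pages in a pool of worker processes.

    Pool.map returns the results in the same order as the input list.
    """
    with Pool(processes=workers) as pool:
        return pool.map(count_words, pages)


if __name__ == "__main__":
    # Placeholder data; in practice these would be the downloaded page bodies.
    pages = ["first downloaded page", "second page", "third"]
    print(post_process_all(pages))
```

The downloads themselves can still be done concurrently in one process; only the heavy processing needs to be handed to the pool.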

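But first, really do try grequests.

HTTP jobs using grequests

Since some pre-processing and post-processing is needed, we have to organize it somehow.

grequests: the concept of an "unsent request"

grequests has the concept of an "unsent request": a piece of work to be done later. grequests lets you start these jobs, e.g. via grequests.map or grequests.imap.

grequests: callbacks (hooks) on unsent requests

Each unsent request can have hooks attached that process the returned response.

A single unsent request can have more than one hook attached.

We will use this for post-processing.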
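As a rough sketch of how several hooks attach to one unsent request: the hook functions log_status and remember_length are hypothetical, and grequests is imported inside the builder function only so that the hooks themselves can be tried without the library installed or a network connection.

```python
seen_lengths = []


def log_status(response, **kwargs):
    """First hook: report the status code of the finished request."""
    print("{} -> {}".format(response.url, response.status_code))


def remember_length(response, **kwargs):
    """Second hook: record the body size for later inspection."""
    seen_lengths.append(len(response.content))


def build_unsent_request(url):
    """Return an unsent request with both hooks attached; nothing is sent yet."""
    import grequests  # third-party; normally imported at the top of the file

    hooks = {"response": [log_status, remember_length]}
    return grequests.get(url, hooks=hooks)


# Usage (needs grequests and network access):
#   unsent = build_unsent_request("https://httpbin.org/ip")
#   grequests.map([unsent])   # only now is the request sent and the hooks called
```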

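Class HttpJobXxx: binding things per job instance

We want to organize the following job-related things:

instantiation: pass in the parameters that define the work to be done
pre-processing: prepare whatever needs preparing
creation of the unsent request
post-processing of the returned response

We will later use it in the following steps:

Instantiate the job. This includes a call to the pre_process method.
Create the unsent request by asking the job instance for it. The unsent request carries a hook that calls the instance's post_process on the final response.
Add the unsent request to the list of jobs to execute.
Let the jobs run, e.g. via grequests.map or grequests.imap.

The trick is that each job instance can use its own attributes to keep its context information cleanly separated throughout the whole lifetime of the job.

HttpJobMyIp

Here is the real code: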

import grequests


class HttpJobMyIp(object):
    url = "https://httpbin.org/ip"

    def __init__(self, nickname="myip over httpbin.org"):
        self.nickname = nickname
        self.pre_process()

    def pre_process(self):
        """Whatever pre-processing you need."""
        print("Preprocessing {self.nickname} with {self.url}".format(self=self))

    @property
    def unsent_request(self):
        """Create requests for grequests.
        Override by whatever construct you need.
        """
        return grequests.get(self.url, hooks={"response": [self.post_process]})

    def post_process(self, response, **kwargs):
        msg = "Post-processing {self.nickname} with {self.url}"
        print(msg.format(self=self))
        assert "origin" in response.json()
        myip = response.json()["origin"]
        print("My IP is {}".format(myip))

HttpJobDelay

To make the example complete, we can add another HTTP job class. This one calls a URL that delays its response.

[HttpJobDelay code listing not preserved in this copy]

Here we deliberately modify the delay, adding 0.7 seconds, to show that pre-processing has a chance to change something.
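The listing for HttpJobDelay is not available here. A possible reconstruction, inferred from HttpJobMyIp, from the call HttpJobDelay(delay, "Delay "+str(delay)) in main() and from the logged output, could look like the sketch below. The default nickname is an assumption; the message texts and the 0.7 second adjustment follow the log and the description above. grequests is imported inside the property only so that the pre-processing can be tried offline; in mreq.py the module-level import grequests already covers it.

```python
class HttpJobDelay(object):
    url = "https://httpbin.org/delay/{delay}"

    def __init__(self, delay, nickname="delay over httpbin.org"):
        self.delay = delay
        self.nickname = nickname
        self.pre_process()

    def pre_process(self):
        """Log the job, then change the delay to show that pre-processing has an effect."""
        msg = "Preprocessing {self.nickname} with {self.url} and delay {self.delay}"
        print(msg.format(self=self))
        self.delay += 0.7

    @property
    def unsent_request(self):
        """Fill the delay into the URL template and attach the post-processing hook."""
        import grequests  # in mreq.py the top-level import is enough

        url = self.url.format(delay=self.delay)
        return grequests.get(url, hooks={"response": [self.post_process]})

    def post_process(self, response, **kwargs):
        msg = "Post-processing {self.nickname} with {self.url} and expected delay {self.delay}"
        print(msg.format(self=self))
        print("Finally we got (a bit delayed) response")
```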

Let it all run

def exception_handler(request, exception):
    return exception


def main():
    http_jobs = []
    job = HttpJobMyIp("Get my IP")
    http_jobs.append(job.unsent_request)

    for delay in [3, 1, 6, 2]:
        job = HttpJobDelay(delay, "Delay "+str(delay))
        http_jobs.append(job.unsent_request)
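    # Alternative that returns the results in request order:
    # grequests.map(http_jobs, exception_handler=exception_handler)
    # imap returns a generator, so list() is needed to actually run the jobs.
    list(grequests.imap(http_jobs, size=6, exception_handler=exception_handler))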
    print("DONE")


if __name__ == "__main__":
    main()

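The plan is to run one instance of the HttpJobMyIp job and four instances of HttpJobDelay with different requested delays. The delays are intentionally unsorted.

With all of the code above in the file mreq.py, we can run it:

$ python mreq.py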
Preprocessing Get my IP with https://httpbin.org/ip
Preprocessing Delay 3 with https://httpbin.org/delay/{delay} and delay 3
Preprocessing Delay 1 with https://httpbin.org/delay/{delay} and delay 1
Preprocessing Delay 6 with https://httpbin.org/delay/{delay} and delay 6
Preprocessing Delay 2 with https://httpbin.org/delay/{delay} and delay 2
Post-processing Get my IP with https://httpbin.org/ip
My IP is 87.257.712.26
Post-processing Delay 1 with https://httpbin.org/delay/{delay} and expected delay 1.7
Finally we got (a bit delayed) response
Post-processing Delay 2 with https://httpbin.org/delay/{delay} and expected delay 2.7
Finally we got (a bit delayed) response
Post-processing Delay 3 with https://httpbin.org/delay/{delay} and expected delay 3.7
Finally we got (a bit delayed) response
Post-processing Delay 6 with https://httpbin.org/delay/{delay} and expected delay 6.7
Finally we got (a bit delayed) response
DONE

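Lessons learned

Where to find `grequests` documentation

There is no ReadTheDocs documentation.

Instead, use:

the source code (the whole module is 153 lines, including comments)
the test suite (235 lines)

Note that the HttpJobXxx classes are not required; I created them only because I find them convenient.

Callback function arguments: include `**kwargs`

A callback function should accept two arguments:

response (provided by the HTTP call)
**kwargs

Without **kwargs, the code silently does nothing.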
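To see why **kwargs matters, here is a simplified simulation, not the actual library code. It assumes, as the advice above implies, that each hook is called with the response plus extra keyword arguments. The names call_hook, strict_hook and tolerant_hook and the timeout/verify values are made up for illustration.

```python
def call_hook(hook, response):
    """Simulate a library calling a response hook with extra keyword arguments."""
    return hook(response, timeout=None, verify=True)


def strict_hook(response):
    """Hook without **kwargs: cannot accept the extra keyword arguments."""
    return "handled {}".format(response)


def tolerant_hook(response, **kwargs):
    """Hook with **kwargs: extra keyword arguments are accepted and ignored."""
    return "handled {}".format(response)


print(call_hook(tolerant_hook, "response"))

try:
    call_hook(strict_hook, "response")
except TypeError as exc:
    # In this simulation the error is visible; per the lesson above, under
    # grequests it goes unnoticed, so the hook just appears to do nothing.
    print("strict_hook failed:", exc)
```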

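Silent failures: use an exception handler

Without an exception handler, when something goes wrong you will usually have no idea what happened. With exception_handler you get the exception back as a result and can find out what went wrong.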
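A slightly more informative handler than the one in mreq.py could also record what failed. This is a sketch: the (request, exception) signature is the one used in the code above, while the failures list and the assumption that the request object exposes a url attribute are mine (hence the getattr fallback).

```python
failures = []


def exception_handler(request, exception):
    """Record which request failed and why, then hand the exception back as the result."""
    url = getattr(request, "url", "<unknown url>")
    failures.append((url, exception))
    print("Request to {} failed: {!r}".format(url, exception))
    return exception
```

It would be passed the same way as before, e.g. grequests.map(http_jobs, exception_handler=exception_handler), and failures can be inspected once the jobs have finished.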

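Exiting too quickly: `grequests.imap` is a generator

The call grequests.imap(http_jobs, size=6, exception_handler=exception_handler) returns a generator immediately; if nothing consumes its values, the program simply continues and exits.

That is why the call is wrapped in list().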
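The same behaviour can be reproduced with a plain Python generator, without grequests. In this sketch run_jobs is a hypothetical stand-in for grequests.imap: nothing in its body executes until something iterates over it.

```python
executed = []


def run_jobs(jobs):
    """Stand-in for a lazy runner: each job is executed only when its value is requested."""
    for job in jobs:
        executed.append(job)
        yield job.upper()


pending = run_jobs(["a", "b", "c"])
print(executed)          # still empty: creating the generator ran nothing

results = list(pending)  # consuming the generator is what does the work
print(executed, results)
```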

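Requests processed concurrently: it works

As we can see, although the HttpJobDelay instances were created with unsorted delays, the results come back sorted: the shortest delay first, the longer ones later. With grequests.imap, the order of the unsent requests and the order of the results can differ.

With grequests.map, on the other hand, the results are returned in the same order as the requests in the job list.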
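The difference between request-ordered and completion-ordered results can be illustrated with the standard library alone. This is an analogy using threads from concurrent.futures, not grequests itself: pool.map plays the role of grequests.map and as_completed the role of grequests.imap. The fake_download function and the short delays are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(delay):
    """Placeholder for an HTTP call: sleep for `delay` seconds, then return it."""
    time.sleep(delay)
    return delay


def in_request_order(delays):
    """Results come back in the order the jobs were submitted."""
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        return list(pool.map(fake_download, delays))


def in_completion_order(delays):
    """Results come back as each job finishes, shortest delay first."""
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(fake_download, d) for d in delays]
        return [f.result() for f in as_completed(futures)]


if __name__ == "__main__":
    delays = [0.3, 0.1, 0.6, 0.2]
    print(in_request_order(delays))
    print(in_completion_order(delays))
```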

`grequests.imap` parameter `size` defaults to 2

If you do not specify size, the default value of 2 is used. This can affect the order of the results.
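How the pool size can change the order of results is easy to check with a small simulation. This is a simplified model under my own assumptions (jobs start in submission order as soon as a slot is free, and each takes exactly its delay), not a description of grequests internals.

```python
import heapq


def completion_order(delays, size):
    """Simulate a pool with `size` slots (size >= 1); return delays in finishing order."""
    running = []   # min-heap of (finish_time, submit_index, delay)
    finished = []
    now = 0.0
    for index, delay in enumerate(delays):
        if len(running) == size:
            # Pool is full: wait until the earliest running job finishes.
            now, _, done = heapq.heappop(running)
            finished.append(done)
        heapq.heappush(running, (now + delay, index, delay))
    # No more jobs to submit: drain the remaining ones in finishing order.
    while running:
        _, _, done = heapq.heappop(running)
        finished.append(done)
    return finished


print(completion_order([3, 1, 6, 2], size=2))  # [1, 3, 2, 6]
print(completion_order([3, 1, 6, 2], size=6))  # [1, 2, 3, 6]
```

With two slots the job with delay 2 cannot start until the job with delay 3 has finished, so it completes later even though it is shorter.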

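What speedup to expect

grequests runs "green threads" within a single process. Switching between green threads is more efficient because it happens at a moment the code itself chooses, which saves the CPU and the operating system the overhead of context switching.

Since the task is I/O-bound (most of the time we are waiting for data to arrive), we can live happily within a single process.

The highest speedup will be seen as the number of requests grows.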
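A back-of-the-envelope estimate makes the expected speedup concrete. It assumes the ideal case (all requests are in flight at once and processing time is negligible), so real numbers will be lower; estimate_speedup is a hypothetical helper.

```python
def estimate_speedup(delays):
    """Ideal-case estimate: sequential time is the sum, concurrent time is the longest wait."""
    sequential = sum(delays)
    concurrent = max(delays)
    return sequential / concurrent


# The four delay jobs from the example, after pre-processing added 0.7 s to each.
print(round(estimate_speedup([3.7, 1.7, 6.7, 2.7]), 2))

# With many similar requests the ideal speedup approaches the number of requests.
print(round(estimate_speedup([1.0] * 100), 2))
```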