Converting a simple function to multithreading

Published 2024-05-20 15:28:20


I have the following function, which I would like to convert to multithreading:

from __future__ import print_function
import os
import json


def url_searcher(value):
    url_file_path = "C:\\Users\\Link\\Desktop\\"

    for filename in os.listdir(url_file_path):
        if filename == "url_list.json":
            with open(url_file_path + filename) as f:
                for line in f:
                    returnJson = json.loads(line)
                    if value in returnJson["url"]:
                        return returnJson


print(url_searcher("http://zadkay.com/blog/wwp/51065983.jpg"))

Basically, it searches a JSON file for a value. The main problem is that the file is large (over 50 MB) and I need the result as fast as possible (anything over 7 seconds is too slow).

Here is an extract from the file so you can see its structure:

{"dateadded": "2019-11-04 12:33:27", "url_status": "online", "tags": "elf", "url": "http://2.56.8.16/bins/arm7", "reporter": "Gandylyan1", "threat": "malware_download", "id": "251402"}
{"dateadded": "2019-11-04 12:33:25", "url_status": "online", "tags": "elf", "url": "http://2.56.8.16/bins/arm6", "reporter": "Gandylyan1", "threat": "malware_download", "id": "251401"}

Questions:

What would the steps be? Is this a good idea? Will it actually speed up getting the result?

Here you can see a sample of the JSON file with more lines: https://pastebin.com/MXYTg1CV

Thanks


3 Answers

Creating threads in Python is straightforward:

import threading

thread_one = threading.Thread(name='searcher', target=url_searcher, args=(value,))  # args must be a tuple
thread_one.start()

https://docs.python.org/3/library/threading.html
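Note that a bare `threading.Thread` discards `url_searcher`'s return value. If you need the result back in the main thread, `concurrent.futures` is the simplest route; a minimal sketch with a stand-in function (in the real script, the actual `url_searcher` would take its place):

```python
from concurrent.futures import ThreadPoolExecutor


def url_searcher(value):
    # stand-in for the real file-scanning function
    return {"url": value, "found": True}


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(url_searcher, "http://example.com/a.jpg")
    result = future.result()  # blocks until the worker thread finishes

print(result["url"])
```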

But I don't think this will reduce processing time: it only lets you read several files at once, and may even make each individual file take longer to process.

Is the file's structure always the same? Have you tried treating it as a plain text file and searching for the keyword directly?

First, measure, don't guess. Get familiar with SnakeViz: https://jiffyclub.github.io/snakeviz/

pip install snakeviz
python -m cProfile -o program.prof my_program.py
snakeviz program.prof
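If you'd rather profile from inside the script instead of the command line, the standard library alone works; a quick sketch (the profiled function here is only a placeholder, you would profile `url_searcher` instead):

```python
import cProfile
import io
import pstats


def work():
    # placeholder workload; profile url_searcher in the real script
    return sum(i * i for i in range(100000))


pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

# print the 5 most expensive entries by cumulative time
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```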

On my machine, 65% of the time is spent in the json library's decode function.

Let's try a simple improvement:

with open(url_file_path + filename) as f:
    for line in f:
        # returnJson = json.loads(line)
        # if value in returnJson["url"]:
        #     return returnJson
        if value in line:
            returnJson = json.loads(line)
            return returnJson

On my machine this gives roughly a 10x speedup.
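To make the trick concrete, here is the same pre-filter as a standalone function, run against a tiny in-memory sample (the inline data is only for illustration; in the real script the lines come from the file):

```python
import io
import json


def fast_search(lines, value):
    for line in lines:
        # cheap substring test first; pay the json.loads cost only on candidates
        if value in line:
            record = json.loads(line)
            if value in record["url"]:  # confirm the match is in the url field
                return record


sample = io.StringIO(
    '{"url": "http://2.56.8.16/bins/arm7", "id": "251402"}\n'
    '{"url": "http://2.56.8.16/bins/arm6", "id": "251401"}\n'
)
match = fast_search(sample, "arm6")
print(match["id"])  # -> 251401
```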

This looks like an I/O-bound operation:

from __future__ import print_function
from multiprocessing.dummy import Pool  # a thread pool with the Pool interface
from functools import partial
import json
import glob


def url_searcher(filename, value):
    # search a single file; Pool.map hands each worker one path
    with open(filename) as f:
        for line in f:
            returnJson = json.loads(line)
            if value in returnJson["url"]:
                return returnJson


no_of_threads = 4  # can be set manually or allocated automatically by Pool (see docs)

url_file_path = "C:\\Users\\Link\\Desktop\\"
value = "very_important_stuff"  # adjust to your liking!

workers = Pool(no_of_threads)

# recursive=True also descends into subdirectories
pathlist = [f for f in glob.iglob(url_file_path + "**", recursive=True)
            if "url_list.json" in f]

result = workers.map(partial(url_searcher, value=value), pathlist)

workers.close()
workers.join()
