os.walk is very slow. Is there any way to optimize it?

Posted 2024-09-24 20:53:42


I am using os.walk to build a map of a data store (this map is used later in the tool I am building).

This is the code I am currently using:

import os

def find_children(tickstore):
    children = []
    dir_list = os.walk(tickstore)
    for i in dir_list:
        children.append(i[0])  # i is (dirpath, dirnames, filenames); keep dirpath
    return children

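I have done some analysis on it:

dir_list = os.walk(tickstore) runs instantly, and if I do nothing with dir_list the function completes instantly.

It is iterating over dir_list that takes a long time. Even if I don't append anything, just iterating over it takes the time.

Tickstore is a large data store with around 10,000 directories.

Currently this function takes approximately 35 minutes to complete.

Is there any way to speed it up?

I have looked at alternatives to os.walk, but none of them seemed to offer much of an advantage in terms of speed.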
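
This split is expected: os.walk is a generator function, so calling it only creates a generator, and the directory listing happens while it is iterated. A minimal sketch that makes the two phases visible, assuming Python 3 (the helper name `time_walk` is made up for illustration):

```python
import inspect
import os
import time


def time_walk(root):
    """Return (seconds to create the generator, seconds to exhaust it, directory count)."""
    start = time.perf_counter()
    walker = os.walk(root)  # only builds the generator; nothing is listed yet
    created = time.perf_counter() - start

    # Sanity check: what os.walk hands back is a lazy generator object.
    assert inspect.isgenerator(walker)

    start = time.perf_counter()
    count = sum(1 for _ in walker)  # all directory listing happens here
    iterated = time.perf_counter() - start
    return created, iterated, count
```

Calling `time_walk(tickstore)` should show nearly all of the time in the second number, which matches the observation above.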


3 Answers

One way to optimize this in Python 2.7 is to use scandir.walk() instead of os.walk(); the parameters are exactly the same.

import scandir

directory = "/tmp"
res = scandir.walk(directory)
for item in res:
    print(item)  # each item is (dirpath, dirnames, filenames)

PS: As @reconp mentioned in the comments, scandir needs to be installed before it can be used in Python 2.7.
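
If the same code has to run on both old and new interpreters, a fallback import is one option. This is a sketch, assuming the third-party scandir package is installed wherever the standard library lacks os.scandir; `find_children` reuses the name from the question:

```python
try:
    # Python 3.5+: scandir and the faster walk are in the standard library.
    from os import scandir, walk
except ImportError:
    # Older interpreters: requires the third-party package (pip install scandir).
    from scandir import scandir, walk


def find_children(tickstore):
    """Return the path of every directory under tickstore, including tickstore itself."""
    return [dirpath for dirpath, _dirnames, _filenames in walk(tickstore)]
```

The try branch is attempted first so that the standard library version wins whenever it is available.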

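os.walk is currently quite slow because it first lists the directory and then does a stat on each entry to see whether it is a directory or a file.

An improvement is proposed in PEP 471 and should be coming soon in Python 3.5. In the meantime, you can use the scandir package to get the same benefits in Python 2.7.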
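
Since only directory paths are needed here, the scandir API can also be used directly. The following is a rough sketch, assuming Python 3.6+ (where os.scandir works as a context manager). The name `find_dirs` is made up for illustration, and the order of the results differs from os.walk:

```python
import os


def find_dirs(root):
    """Collect root and every directory below it without building file lists."""
    found = [root]
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    # is_dir() normally uses the type information returned by
                    # the directory listing itself, so in most cases no extra
                    # stat() call is needed per entry.
                    if entry.is_dir(follow_symlinks=False):
                        found.append(entry.path)
                        stack.append(entry.path)
        except OSError:
            # Unreadable or vanished directory: skip it, much like os.walk
            # does by default.
            continue
    return found
```

Symbolic links to directories are not followed, which matches the default behavior of os.walk. Whether this beats os.walk on a given data store depends on the filesystem, so it is worth timing both.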

Yes: use Python 3.5 (it is still currently an RC, but should be out momentarily). In Python 3.5, os.walk was rewritten to be more efficient.

This work was done as part of PEP 471.

Extracted from the PEP:

Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling os.listdir() on each directory -- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.

But the underlying system calls -- FindFirstFile / FindNextFile on Windows and readdir on POSIX systems -- already tell you whether the files returned are directories or not, so no further system calls are needed. Further, the Windows system calls return all the information for a stat_result object on the directory entry, such as file size and last modification time.

In short, you can reduce the number of system calls required for a tree function like os.walk() from approximately 2N to N, where N is the total number of files and directories in the tree. (And because directory trees are usually wider than they are deep, it's often much better than this.)

In practice, removing all those extra system calls makes os.walk() about 8-9 times as fast on Windows, and about 2-3 times as fast on POSIX systems. So we're not talking about micro-optimizations. See more benchmarks here.
