The most efficient way to iterate over a large file (10GB+) in Python

for i, logLine in enumerate(logHandle):  # start matching UUID entries in log file to UUID from rulebase
    if logFunc.progress(lineCount, logSize):  # check progress
        print logFunc.progress(lineCount, logSize)  # print progress in 10% intervals
    for uid in uidHits:
        if logLine.count(uid) == 1:  # for each UUID, check the current line of the log for a match in the UUID list
            uidHits[uid] += 1  # if matched, increment the relevant value in the uidHits list
            break  # as we've already found the match, don't process the rest
    lineCount += 1

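3 Answers

User

#1 · Edited 2024-09-25 06:34:55

Think functionally!

Write a function that takes a line of the log file and returns the UUID. Call it uuid, say.
Apply this function to every line of the log file. If you are using Python 3, you can use the built-in function map; otherwise, you need to use itertools.imap.

Pass this iterator to collections.Counter.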

collections.Counter(map(uuid, open("log.txt")))

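This will be very efficient: map is lazy in Python 3, so the file is read one line at a time and never loaded into memory as a whole.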
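The steps above can be sketched as follows. This is only a minimal illustration: it assumes each UUID appears in the canonical 8-4-4-4-12 hexadecimal form, and the names `UUID_RE`, `extract_uuid`, and `count_uuids` are hypothetical helpers, not part of the original question.

```python
import collections
import re

# Assumption: UUIDs appear in the canonical 8-4-4-4-12 hexadecimal form.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def extract_uuid(line):
    """Return the first UUID found in the line, or None if there is none."""
    match = UUID_RE.search(line)
    return match.group(0) if match else None


def count_uuids(lines):
    """Count UUID occurrences over any iterable of lines (e.g. an open file)."""
    counts = collections.Counter(map(extract_uuid, lines))
    counts.pop(None, None)  # discard the tally for lines that had no UUID
    return counts
```

A typical call would be `with open("log.txt") as log: counts = count_uuids(log)`. Because the file object is consumed lazily, memory use depends on the number of distinct UUIDs rather than on the size of the file.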

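A couple of comments:

- This completely ignores the list of UUIDs and only counts the UUIDs that appear in the log file. If that is not what you want, you will need to modify the program somewhat.
- Your code is slow because you are using the wrong data structure: for every line it scans the entire UUID list. A dict is what you want here.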
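If only the UUIDs from the rulebase should be counted, one possible adaptation is to extract the UUIDs from each line and look them up in a dict, which costs roughly constant time per line instead of one substring scan per known UUID. The sketch below assumes canonical 8-4-4-4-12 UUIDs whose letter case matches the rulebase, and it counts at most one hit per line, like the `break` in the question; `count_known_uuids` is a hypothetical name.

```python
import re

# Assumption: UUIDs appear in the canonical 8-4-4-4-12 hexadecimal form.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def count_known_uuids(lines, known_uuids):
    """Count, per known UUID, the number of lines in which it is the first known match."""
    hits = dict.fromkeys(known_uuids, 0)
    for line in lines:
        for match in UUID_RE.finditer(line):
            uid = match.group(0)
            if uid in hits:  # average O(1) dict lookup, no scan over all UUIDs
                hits[uid] += 1
                break  # one hit per line, mirroring the original loop
    return hits
```

Unlike the original `logLine.count(uid) == 1` test, this version also counts a line in which the same UUID appears more than once, so the results can differ on such lines.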

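User

#2 · Edited 2024-09-25 06:34:55

As mentioned above, with a 10GB file you will probably hit the limits of your disk fairly quickly. For code-only improvements, the generator suggestion is a very good one. In Python 2.x it would look something like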

uuid_generator = (line.split(SPLIT_CHAR)[UUID_FIELD] for line in file)
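For Python 3, the same idea might be written as below. This is a sketch under stated assumptions: the values of `SPLIT_CHAR` and `UUID_FIELD` are placeholders chosen for illustration (the answer does not give them), and `iter_uuids` and `count_uuid_field` are hypothetical helper names.

```python
import collections

SPLIT_CHAR = ","  # assumption: the log's field delimiter
UUID_FIELD = 2    # assumption: zero-based index of the UUID column


def iter_uuids(lines, split_char=SPLIT_CHAR, uuid_field=UUID_FIELD):
    """Lazily yield the UUID field of each line, skipping lines that are too short."""
    for line in lines:
        fields = line.rstrip("\n").split(split_char)
        if len(fields) > uuid_field:
            yield fields[uuid_field]


def count_uuid_field(path):
    """Stream the file at `path` once and tally the UUID field."""
    with open(path, encoding="utf-8", errors="replace") as log_file:
        return collections.Counter(iter_uuids(log_file))
```

Only one line is held in memory at a time, so the approach scales with the number of distinct UUIDs, not with the file size.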

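It sounds like this does not necessarily have to be a Python problem. If you are not doing anything more complex than counting UUIDs, standard Unix tools may solve your problem faster than Python can.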

cut -d${SPLIT_CHAR} -f${UUID_FIELD} log_file.txt | sort | uniq -c

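User

#3 · Edited 2024-09-25 06:34:55

This is not a 5-line answer to your question, but there was an excellent tutorial at PyCon'08 called Generator Tricks for Systems Programmers. There is also a follow-up tutorial called A Curious Course on Coroutines and Concurrency.

The generator tutorial specifically uses the processing of large log files as its example.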
