How to process only new (unprocessed) files in Linux

#find all gz files
for f in $(find $rawdatapath -name '*.gz'); do
    filename=`basename $f`
    #check whether the filename is already contained in the process list
    onlist=`grep $filename $processed_files`
    if [[ -z $onlist ]]
    then
        echo "processing, new: $filename"
        #unzip file and import into mongodb
        #write filename into processed list
        echo $filename #>> $processed_files
    fi
done

3 Answers

User

#1 · edited 2024-10-01 00:16:09

Just use a set:

import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path,"processed_files.txt")
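# Load the names that were already processed into a set for fast lookups.
with open(processed_files_file) as f:
    processed_files = set(line.strip() for line in f)

with open(processed_files_file, "a") as pff:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".gz") and file not in processed_files:
                # Process the new file here, then record it as processed.
                pff.write("%s\n" % file)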

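User

#2 · edited 2024-10-01 00:16:09

An alternative using standard command-line utilities:

Simply diff a file containing the list of all files against a file containing the list of processed files.

It is easy to try and should be fairly fast.

If you include full timestamps in the lists, you can also pick up "changed" files this way.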
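One way to sketch the list comparison described above is with comm rather than diff, since comm -23 prints only the lines unique to the first list. The function name list_new_files, its two arguments, and the temporary file names are assumptions made for illustration. The sketch also assumes that file names contain no newlines and that basenames are unique across subdirectories.

```bash
#!/usr/bin/env bash
# Print the names of .gz files under $1 that are not listed in the file $2.
list_new_files() {
    rawdatapath=$1
    processed_files=$2
    tmpdir=$(mktemp -d) || return 1
    # List 1: the basename of every .gz file currently on disk, sorted.
    find "$rawdatapath" -name '*.gz' -exec basename {} \; \
        | LC_ALL=C sort -u > "$tmpdir/all"
    # List 2: every name already recorded as processed, sorted the same way.
    LC_ALL=C sort -u "$processed_files" > "$tmpdir/seen"
    # comm -23 keeps only the lines that appear in the first list alone.
    LC_ALL=C comm -23 "$tmpdir/all" "$tmpdir/seen"
    rm -rf "$tmpdir"
}
```

Called as list_new_files "$rawdatapath" "$processed_files", it prints one unprocessed name per line, which can then be fed to a while read loop that processes each file and appends its name to the processed list.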

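User

#3 · edited 2024-10-01 00:16:09

If the files are not modified after processing, one option is to remember the most recently processed file and then use find's -newer option to retrieve the files that have not been processed yet.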

find "$rawdatapath" -name '*.gz' -newer "$(<latest_file)" -exec process.sh {} \;

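where process.sh looks like

#!/usr/bin/env bash
echo "processing, new: $1"
#unzip file and import into mongodb
echo "$1" > latest_file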

This is untested. Watch out for unwanted side effects before considering this strategy.
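One such side effect is that find does not visit files in modification-time order, so the last file handed to process.sh is not necessarily the newest one. A variant that sidesteps this is to compare against a marker file stamped at the start of each run. The following is only a sketch under stated assumptions: the function name process_new_files and the marker path are hypothetical, file names are assumed to contain no newlines, and a file that arrives while a run is in progress may be processed twice.

```bash
#!/usr/bin/env bash
# Process .gz files under $1 that were modified after the marker file $2.
process_new_files() {
    rawdatapath=$1
    marker=$2                      # hypothetical, e.g. "$rawdatapath/.last_run"
    next_marker="$marker.next"
    # Stamp the start time first, so files arriving mid-run are seen next time.
    touch "$next_marker"
    if [ -e "$marker" ]; then
        find "$rawdatapath" -name '*.gz' -newer "$marker" -print
    else
        # First run: there is no marker yet, so every file counts as new.
        find "$rawdatapath" -name '*.gz' -print
    fi | while IFS= read -r f; do
        echo "processing, new: $f"
        # unzip file and import into mongodb
    done
    # Promote the start-of-run stamp to be the marker for the next run.
    mv "$next_marker" "$marker"
}
```

A cron job could call process_new_files "$rawdatapath" "$rawdatapath/.last_run" on each run. Like the original suggestion, this relies on files not being modified after they have been processed.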

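If a hacky, quick-and-dirty solution is acceptable, an interesting alternative is to encode the state (processed or unprocessed) in the file permissions, for example in the group read permission bit. Assuming your umask is 022, so that any newly created file has permissions 644, change the permissions to 600 after processing a file and use find's -perm option to retrieve the files that have not been processed yet.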

find "$rawdatapath" -name '*.gz' -perm 644 -exec process.sh {} \;

where process.sh looks like

#!/usr/bin/env bash
echo "processing, new: $1"
#unzip file and import into mongodb
#mark the file as processed by clearing the group and other read bits
chmod 600 "$1"

This is also untested. Watch out for unwanted side effects before considering this strategy.
