Concatenate multiple files into one file, but skip content that has already been appended

Posted 2024-10-03 00:25:37


I have files with the following content (one line per file):

<189>162: CSR-1000V: *Sep 27 06:17:02: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loopback317, changed state to up
<189>165: CSR-1000V: *Sep 27 06:17:07: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loopback320, changed state to up
<189>164: CSR-1000V: *Sep 27 06:17:06: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loopback319, changed state to up
<189>161: CSR-1000V: *Sep 27 06:16:59: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loopback316, changed state to up
<189>163: CSR-1000V: *Sep 27 06:17:04: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loop

I want to write a Python script that appends them all to a single file (output.txt), but I am stuck: because I use a for loop, the script keeps re-appending lines that are already there.

Any ideas?

Thanks a lot


3 Answers

There is more than one way to handle this, depending on your environment:

First: read the files in the directory and append their data to the output file. Keep the names of the files you have already read in a dictionary and persist it to disk with pickle or json. The next time the code is called, load that list and skip every file that is already in it. (PS: do the file handling in Python; that is exactly what it is good at.)

Second: pass the newly created files as arguments, if that fits your setup (I know nothing about Apache NiFi).

Third: compare the lines against the lines already in the output file, but that costs performance and can be quite unreliable.

Fourth: move the files that have already been read into a subdirectory (a minimal sketch of this option follows below).

I would go with the first approach, since it is simple and straightforward.
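For completeness, here is a minimal sketch of the fourth option. The source directory and the "processed" subdirectory name are only illustrative, and the output path matches the example further down:

import os
import shutil

directory = "/home/adrian/from_hdfs/"            # source directory (illustrative)
done_dir = os.path.join(directory, "processed")  # already-merged files end up here
if not os.path.isdir(done_dir):
    os.mkdir(done_dir)

with open("finalfile.txt", "a") as outfile:
    for filename in os.listdir(directory):
        src = os.path.join(directory, filename)
        if not os.path.isfile(src):
            continue  # skip the "processed" subdirectory itself

        with open(src, "r") as src_file:
            outfile.write(src_file.read())       # append data to the output file

        # move the file away so it is never re-read on the next run
        shutil.move(src, os.path.join(done_dir, filename))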

Edit: I put together some code (untested); if it does not work out of the box, it should at least make clear what needs to be done.

import json
import os

directory = "/home/adrian/from_hdfs/"
state_file = "result.json"  # the same file is loaded and written, so the state survives between runs

# load the set of already-processed filenames (empty on the very first run)
parsed = {}
if os.path.exists(state_file):
    with open(state_file) as json_file:
        parsed = json.load(json_file)


# open output file in append mode
with open("finalfile.txt", "a") as outfile:

    # loop through the source directory
    for filename in os.listdir(directory):
        if filename in parsed:
            continue  # skip file if already read

        file_abs = os.path.join(directory, filename)

        #print("Reading file: "+file_abs)
        with open(file_abs, "r") as src_file:
            outfile.write(src_file.read())  # append data from src to dest
            parsed[filename] = 1


# persist the updated state for the next run
with open(state_file, "w") as fp:
    json.dump(parsed, fp)
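If it helps, a quick way to confirm that the deduplication state survives between runs is to print what was recorded (this is only a hypothetical check, assuming the result.json path used above):

import json

# inspect the state file written by the script above
with open("result.json") as fp:
    state = json.load(fp)

print("%d files merged so far" % len(state))
print(sorted(state))  # the filenames that will be skipped on the next run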

Flows (attached screenshot): as you can see in the attachment, there is a data pipeline in Apache NiFi with an "ExecuteScript" processor, in which I run the Python code above. The problem I described is that the lines already present in the file keep getting appended again.

#CODE:

#!/usr/bin/python

import subprocess
import json
import os


# =====> I am using this to generate "data.txt" from your example
subprocess.call('cd /home/adrian/from_hdfs; for f in *; do (cat "${f}"; echo) >> notfinal.txt; done', shell=True)

directory = "/home/adrian/from_hdfs/"

parsed = {}
with open('/home/adrian/from_hdfs/notfinal.txt') as json_file:
    parsed = json.load(json_file)


#open output file
with open("finalfile.txt", "a") as outfile:

    #loop through src directory
    for filename in os.listdir(directory):
        if filename in parsed: 
            continue # skip file if already read

        file_abs = os.path.join(directory, filename)

        #print("Reading file: "+file_abs)
        with open(file_abs, "r") as src_file:
            myfile.write(src_file.read()) #append data from src to dest
            parsed[filename] = 1



with open('result.json', 'w') as fp:
    json.dump(parsed, fp)



Traceback (most recent call last):
  File "./script.py", line 14, in <module>
    parsed = json.load(json_file)
  File "/usr/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
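For what it is worth, this traceback is simply json.load being pointed at notfinal.txt, which contains the concatenated syslog lines rather than JSON; the state file the parser expects is the result.json written at the end of the script. A tiny sketch that reproduces the error, using a shortened sample line from the question:

import json

# passing plain syslog text to the JSON parser raises exactly this ValueError
try:
    json.loads("<189>162: CSR-1000V: *Sep 27 06:17:02: %LINEPROTO-5-UPDOWN: ...")
except ValueError as exc:
    print("not JSON: %s" % exc)  # "No JSON object could be decoded" on Python 2.7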
