In Python, how do I merge two text files when the differences between them can occur anywhere in the text?

Posted 2024-06-26 18:10:25


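I have a list of files, and I want to combine the files that share the same target serial number. Each file contains thousands of lines, and every line has the format: date, count, readings

For example, the first file: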

"2019-12-23 00:00:00",1123,211685,34650.75,33225.69,...(hundreds of similar numbers)
 ...(hundreds of similar lines)
"2020-02-23 06:00:00",1372,211685,34651.22,33224.6,...
"2020-02-23 12:00:00",1373,211685,34650.34,33224.74,...

The second file:

"2019-12-17 12:00:00",1101,211685,34649.3,33225.8...
 ...
"2020-02-15 00:00:00",1339,211685,34651.66,33225.32,...
"2020-02-15 06:00:00",1340,211685,34651.63,33225.19...

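The problem is that the missing lines can be at the beginning or the end of a file. One file may be missing the first 100 readings, while another may be missing the most recent 50. What is the best way to merge them? I thought of using a set, but I am not sure whether it can add missing lines in the middle of a file.

Example of the desired merged lines: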

"2019-12-17 12:00:00",1101,211685,34649.3,33225.8...
 ...
"2019-12-23 00:00:00",1123,211685,34650.75,33225.69,...
 ...
"2020-02-15 00:00:00",1339,211685,34651.66,33225.32,...
"2020-02-15 06:00:00",1340,211685,34651.63,33225.19...
 ...
"2020-02-23 06:00:00",1372,211685,34651.22,33224.6,...
"2020-02-23 12:00:00",1373,211685,34650.34,33224.74,...

2 answers

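A set does not preserve order, but you can sort it afterwards to get the output file you want. Sorting also puts lines that were missing from the middle of one file into their proper place. When a timestamp is written as zero-padded year-month-day hour:minute:second (ISO 8601 style), as in your files, it sorts correctly as plain text, ascending or descending, without any date conversion. If it were written American-style, such as "6/30/2019 12:30 PM MST", you would need to convert it into something sortable first.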

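def merge_files(filenames, outfilename):
    rows = set()  # a set drops lines that appear in more than one file
    for filename in filenames:
        with open(filename) as fp:
            # Normalize line endings so a final line without "\n" still dedupes.
            rows.update(line.rstrip("\n") for line in fp if line.strip())
    with open(outfilename, "w") as fp:
        # The leading ISO-style timestamp sorts chronologically as plain text.
        fp.writelines(row + "\n" for row in sorted(rows))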
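
To illustrate the point about sortable timestamps, here is a small sketch. The timestamps are taken from the question; the American-style strings and the `us_key` helper are made up for illustration and are not part of the original files.

```python
from datetime import datetime

# ISO-style timestamps: plain text order equals chronological order.
iso_stamps = ["2020-02-15 06:00:00", "2019-12-23 00:00:00", "2019-12-17 12:00:00"]
by_text = sorted(iso_stamps)
by_time = sorted(iso_stamps, key=lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"))
print(by_text == by_time)

# Hypothetical American-style timestamps: text order is NOT chronological,
# so each string is parsed into a datetime before sorting.
us_stamps = ["02/15/2020 06:00 AM", "12/23/2019 12:00 AM", "12/17/2019 12:00 PM"]


def us_key(stamp):
    """Convert a month/day/year 12-hour timestamp into a sortable datetime."""
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M %p")


us_sorted = sorted(us_stamps, key=us_key)
print(us_sorted)
```

Sorting `us_stamps` as plain text would put the February 2020 entry first, which is why the conversion step is needed for that format.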

You could try:

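from datetime import datetime
from pprint import pprint

files = ["merge_01.txt", "merge_02.txt"]
all_lines = []
for file in files:
    with open(file) as f:
        # Strip newlines and skip blank lines, which would break the date parsing below.
        all_lines += [x.strip() for x in f if x.strip()]

# set() removes lines that appear in both files (exact matches only).
all_lines = list(set(all_lines))
# Characters 1-19 hold the timestamp, just inside the opening quote.
all_lines.sort(key=lambda line: datetime.strptime(line[1:20], "%Y-%m-%d %H:%M:%S"))
pprint(all_lines)

with open("merge_all.txt", "w") as f:
    for line in all_lines:
        f.write(line + "\n")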

['"2019-12-17 12:00:00",1101,211685,34649.3,33225.8',
 '"2019-12-23 00:00:00",1123,211685,34650.75,33225.69',
 '"2020-02-15 00:00:00",1339,211685,34651.66,33225.32',
 '"2020-02-15 06:00:00",1340,211685,34651.63,33225.19',
 '"2020-02-23 06:00:00",1372,211685,34651.22,33224.6',
 '"2020-02-23 12:00:00",1373,211685,34650.34,33224.74']
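
One caveat with a set is that it only removes lines that match exactly. If two files could hold the same timestamp with slightly different readings, a dictionary keyed by the timestamp keeps one line per timestamp. The following is a rough sketch of that idea, not something from the question: `merge_by_timestamp` is a hypothetical helper, the "first file wins" rule is an assumption, and the conflicting reading in `second` is invented to show the effect.

```python
def merge_by_timestamp(*line_groups):
    """Merge groups of CSV lines, keeping one line per leading timestamp."""
    merged = {}
    for lines in line_groups:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            timestamp = line.split(",", 1)[0]  # e.g. '"2019-12-17 12:00:00"'
            merged.setdefault(timestamp, line)  # assumption: first occurrence wins
    # ISO-style timestamps sort chronologically as plain text.
    return [merged[key] for key in sorted(merged)]


first = [
    '"2019-12-23 00:00:00",1123,211685,34650.75,33225.69',
    '"2020-02-15 00:00:00",1339,211685,34651.66,33225.32',
]
second = [
    '"2019-12-17 12:00:00",1101,211685,34649.3,33225.8',
    '"2020-02-15 00:00:00",1339,211685,34651.66,33225.30',  # hypothetical conflicting reading
]
merged = merge_by_timestamp(first, second)
for line in merged:
    print(line)
```

With real files, each argument could be an open file object, since the function only iterates over lines.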



pandas solution:

import pandas as pd

files = ["merge_01.txt", "merge_02.txt"]
all_lines = []
for file in files:
    with open(file) as f:
        # Drop the quotes around the timestamp and skip blank lines.
        all_lines += [x.strip().replace('"', "") for x in f if x.strip()]

rows = [line.split(",") for line in all_lines]
# Name the columns from the actual field count rather than hard-coding five.
columns = ["date"] + [f"field{i}" for i in range(1, len(rows[0]))]
df = pd.DataFrame(rows, columns=columns)
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(by="date").drop_duplicates()
df.to_csv("merged.csv", index=False)
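
If pandas is not available, and assuming each input file is already sorted by timestamp (as the samples in the question appear to be), the standard library's `heapq.merge` can combine the files lazily without loading everything into memory. This is only a sketch under that assumption; `merge_sorted` is a hypothetical helper name.

```python
import heapq


def merge_sorted(*line_iters):
    """Lazily merge already-sorted line iterables, skipping exact duplicates."""
    stripped = (map(str.strip, lines) for lines in line_iters)
    previous = None
    # heapq.merge assumes every input is sorted; identical lines come out adjacent.
    for line in heapq.merge(*stripped):
        if line and line != previous:
            yield line
        previous = line


file_a = [
    '"2019-12-23 00:00:00",1123,211685,34650.75,33225.69\n',
    '"2020-02-15 00:00:00",1339,211685,34651.66,33225.32\n',
]
file_b = [
    '"2019-12-17 12:00:00",1101,211685,34649.3,33225.8\n',
    '"2020-02-15 00:00:00",1339,211685,34651.66,33225.32\n',
]
combined = list(merge_sorted(file_a, file_b))
for line in combined:
    print(line)
```

If the inputs are not sorted, the output will not be either, so the set-and-sort approaches above are the safer choice when in doubt.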

