Python csv合并具有不同列的多个文件

2024-10-03 21:24:44 发布

您现在位置:Python中文网/ 问答频道 /正文

我希望有人能帮我解决这个问题

我有大约20个csv文件(每个文件都有其标题),每个文件都有数百列。 我的问题与合并这些文件有关,因为其中有两个文件有额外的列。 我想知道是否有一个选项可以将所有这些文件合并到一个文件中,添加所有具有相关数据的新列,而不会损坏其他文件

到目前为止,我使用了awk终端命令:

awk '(NR == 1) || (FNR > 1)' *.csv > file.csv

要合并,请删除除第一个文件之外的所有文件的标题。 我从上一个问题中得到了这个 Merge multiple csv files into one

但这并不能解决额外列的问题

编辑:

下面是一些带有标题的纯文本csv文件

文件1

"@timestamp","@version","_id","_index","_type","ad.(fydibohf23spdlt)/cn","ad.</o","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","ad.totalComputers","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationUserPrivileges","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-27 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

文件2

"@timestamp","@version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

文件3

"@timestamp","@version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

文件4

"@timestamp","@version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

这是20个文件中的4个,我包括了所有的标题,但没有行,因为它们包含敏感数据

当我在这些文件上运行脚本时,我可以看到它写入了时间戳值。但是,当我对原始文件(包含大量数据)运行它时,它所做的一切就是写入标题,就这样。如果您需要更多信息,请告诉我

一旦我在原始文件上运行脚本。这就是我得到的

enter image description here

共有20行(每个文件一行),但它不会写入每个文件的内容。这可能与嗅探第一行有关?因为我认为这只会检查文件的第一行,并像脚本一样向前移动。那么,在一个小文件中,它是如何复制和合并内容的呢


Tags: 文件csvid标题indexversiontypelabel
2条回答

你的问题不清楚,idk,如果你真的想用awk或python或两者中的任何一种来解决问题,而且它没有任何我们可以测试的示例输入/输出,所以这只是一个猜测,但这就是你想要做的(在每个Unix框上的任何shell中使用任何awk)

$ head file{1..2}.csv
==> file1.csv <==
1,2
a,b
c,d

==> file2.csv <==
1,2,3
x,y,z

$ cat tst.awk
BEGIN {
    FS = OFS = ","
    for (i=1; i<ARGC; i++) {
        if ( (getline < ARGV[i]) > 0 ) {
            if ( NF > maxNF ) {
                maxNF = NF
                hdr   = $0
            }
        }
    }
}
NR == 1 { print hdr }
FNR > 1 { NF=maxNF; print }

$ awk -f tst.awk file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z

有关何时/如何使用getline及其相关注意事项的详细信息,请参见http://awk.freeshell.org/AllAboutGetline

或者由GNU head-q提供协助:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
    if ( NF > maxNF ) {
        maxNF = NF
        hdr   = $0
    }
    next
}
!doneHdr++ { print hdr }
FNR > 1 { NF=maxNF; print }

$ head -q -n 1 file{1..2}.csv | awk -f tst.awk - file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z

正如您在原始问题中已经解释的,如果您知道需要多少列,您可以轻松地扩展Awk中的列

awk -F ',' -v cols=5 'BEGIN { OFS=FS }
FNR == 1 && NR > 1 { next }
NF<cols { for (i=NF+1; i<=cols; ++i) $i = "" }
1' *.csv >file.csv

我稍微重构了它,用next跳过不需要的行,而不是相反;这稍微简化了脚本的其余部分。我还添加了缺少的逗号分隔符

您可以轻松打印每个文件中的列数,只需注意最大列数:

awk -F , 'FNR==1 { print NF, FILENAME }' *.csv

如果您不知道文件中还有多少字段是您还没有的,或者如果您需要处理带引号的复杂CSV字段,那么可以切换到Python。在Awk中进行字段号嗅探并不难,但处理引用是很棘手的

import csv
import sys

# Sniff just the first line from every file
fields = 0
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for row in csv.reader(raw):
            # If the line is longer than current max, update
            if len(row) > fields:
                fields = len(row)
                titles = row
            # Break after first line, skip to next file
            break

# Now do the proper reading
writer = csv.writer(sys.stdout)
writer.writerow(titles)

for filename in sys.argv[1:]:
    with open(filename) as raw:
        for idx, row in enumerate(csv.reader(raw)):
            if idx == 0:
                next
        row.extend([''] * (fields - len(row)))
        writer.writerow(row)

这只是假设附加字段位于末尾。如果文件可能在其他列之间有额外的列,或者列的顺序不同,则需要一个更复杂的解决方案(虽然不是很复杂;Python CSVDictReader子类可以完成大部分繁重的工作)

演示:https://ideone.com/S998l4

如果要在Awk中执行相同类型的嗅探,基本上必须两次指定输入文件的名称,或者在启动主脚本之前在BEGIN块中执行一些非平凡的处理以读取所有文件

相关问题 更多 >