将两个CSV文件合并到JSON中

2024-09-27 07:30:41 发布

您现在位置:Python中文网/ 问答频道 /正文

我想合并两个CSV文件并将其转换为JSON。困难的是,两个文件中的同一列在行中不是唯一的,如下所示:

gpo_full.csv:
Date           hearing_sub_type   Old_ID   witnesses
January,1997   Oversight          22       Babbitt
June,2000      General            21       Blitzer
January,1997   General            23       Alldridge
April,2001     Oversight          25       Michael 
June,2000      Oversight          24       Bruce

CAP_cols.csv:
majortopic   id     Chamber   subtopic   Date
21           79846  1         2103       January,1997
4            79847  2         705        June,2000
13           79848  1         1802       May,2001
7            79849  2         201        June,2000
21           79850  1         401        January,1997

我想要的输出应该是:

 [{
"Date": "January,1997",
"GPO": [{
    "hearing_sub_type": "Oversight",
    "Old_ID": "22",
    "witnesses": "Babbitt"
}, {
    "hearing_sub_type": "General",
    "Old_ID": "23",
    "witnesses": "Alldridge"
}]
"CAP": [{
    "majortopic": "21",
    "id": "79846",
    "Chamber": "1"
    "subtopic": "2103"
}, {
    "majortopic": "21",
    "id": "79850",
    "Chamber": "1"
    "subtopic": "401"
}]
},
and similar for others]

如果G.csv中没有2001年4月和C.csv中的2001年5月这样的匹配数据,则格式将相同,但值不是这样的:

[{
"Date": "April,2001",
"GPO": [{
    "hearing_sub_type": None,
    "Old_ID": None,
    "witnesses": None
}]
"CAP": [{
    "majortopic": None,
    "id": None,
    "Chamber": None, 
    "subtopic": None
}]
},
and similar for others]

我写的代码是

GPO = open("gpo_full.csv", "r")
CAP = open("CAP_cols.csv", "r")

jsonfile = open('datamerge.json', 'w')

gporeader = csv.DictReader(GPO)
capreader = csv.DictReader(CAP)

output = []
for gporow in gporeader:
    output.append(gporow["Date"])
    gporow["CAP"] = []
    gporow["GPO"] = []
    if gporow["Date"] in output:
        gporow["GPO"].append(gporow)
    CAP.seek(0)
    for caprow in capreader:
        if (gporow["Date"] == caprow["Date"]):
            gporow["CAP"].append(caprow)
print(output)
json.dump(output, jsonfile, sort_keys=True)

它不起作用。更具体地说,我不知道如何提取日期作为密钥,并在GPO和CAP中删除它们
我真的很感激马蒂诺为我画了一幅完整的画


Tags: csvnoneidoutputdatetypeoldcap
1条回答
网友
1楼 · 发布于 2024-09-27 07:30:41

通过检查您的代码和所需的输出,猜猜您所说的“它不起作用”是什么意思,我想我已经能够编写一些起作用的代码了

首先,这里是我使用的数据,其中»右guillemet字符用于表示tab'\t'字符-换句话说,您有以tab分隔的CSV文件,在使用csv.DictReader读取它们时必须指定这些文件,因为默认情况下,值之间要有逗号

输入文件

gpo_full.csv

Date»hearing_sub_type»Old_ID»witnesses
January,1997»Oversight»22»Babbitt
June,2000»General»21»Blitzer
January,1997»General»23»Alldridge
April,2001»Oversight»25»Michael
June,2000»Oversight»24»Bruce

CAP_cols.csv

majortopic»id»Chamber»subtopic»Date
21»79846»1»2103»January,1997
4»79847»2»705»June,2000
13»79848»1»1802»May,2001
7»79849»2»201»June,2000
21»79850»1»401»January,1997

Code:

最重要的变化是使output成为一个字典子类,其键是日期(参见下面关于日期的注意事项),每个都与一个嵌套的字典关联,该字典包含自己的一对"CAP""GPO"列表。这样做可以很容易地检测到以前是否遇到过日期,并为其初始化嵌套字典-这会在另一个CSV文件中搜索匹配的日期(这就是为什么定义并使用字典子类

我认为值得一提的是,通过file.seek(0)来“倒带”CSV文件以重新读取其中的数据行是行不通的。相反,必须关闭、重新打开它们,并重新创建它们的csv.DictReader。这至少在一定程度上是因为在本例中,文件的开头有一个特殊的标题行,用于定义其字段的名称

一个警告我想提到的是,由于日期字段是“月,年”格式,不能用于按时间顺序对它们进行排序,因为它们将按字典顺序作为字符串进行比较,而不是按数字进行比较,即June,2021将被视为在December,2001之前出现-这是我禁用排序的主要原因。这是可以解决的,但这不是你问题的主要主题

import csv
import json
from pprint import pprint


class OutputDict(dict):
    """ Dictionary subclass to create initial entry for each date. """
    def __init__(self, cap_filename):
        super().__init__()  # Base class initialization.
        self.cap_filename = cap_filename

    def __missing__(self, date):
        """ Initialize entry for a date when it's first encountered. """
        self[date] = {'Date': date, 'GPO': [], 'CAP': []}

        # Scan cap csv file for matching dates - occurs once per unique date added.
        with open(self.cap_filename) as cap_file:  # Add matches from other CSV file.
            cap_reader = csv.DictReader(cap_file, delimiter='\t')
            for cap_row in cap_reader:
                if date == cap_row['Date']:
                    del cap_row['Date']
                    self[date]['CAP'].append(cap_row)

        # If there weren't any matching dates, create and add an empty row.
        if not self[date]['CAP']:  # No matching CAP dates?
            cap_row = {field: None for field in cap_reader.fieldnames if field != 'Date'}
            self[date]['CAP'].append(cap_row)

        return self[date]


# Add all the dates in the gpo_full.csv file to an output dictionary.
output = OutputDict('CAP_cols.csv')  # Automatically add matching dates in CAP_cols file.
with open('gpo_full.csv') as gpo_file:
    gpo_reader = csv.DictReader(gpo_file, delimiter='\t')
    for gpo_row in gpo_reader:
        date = gpo_row.pop('Date')
        output[date]['GPO'].append(gpo_row)

# Add any dates in CAP_cols.csv that weren't in gpo_full.csv file.
with open('CAP_cols.csv') as cap_file:
    cap_reader = csv.DictReader(cap_file, delimiter='\t')
    for cap_row in cap_reader:
        date = cap_row.pop('Date')
        if date not in output:  # Create entry for it with a null gpo row.
            gpo_row = {field: None for field in gpo_reader.fieldnames if field != 'Date'}
            output[date]['GPO'].append(gpo_row)  # Automatically adds cap_row.

output = list(output.values())  # Only want a list of the values.
#pprint(output, sort_dicts=0)  # Show what's going into JSON file.

with open('datamerge.json', 'w') as json_file:
    json.dump(output, json_file, indent=4, sort_keys=False)

print('-Fini-')

Results

下面是它放入datamerge.json文件的内容:

[
    {
        "Date": "January,1997",
        "GPO": [
            {
                "hearing_sub_type": "Oversight",
                "Old_ID": "22",
                "witnesses": "Babbitt"
            },
            {
                "hearing_sub_type": "General",
                "Old_ID": "23",
                "witnesses": "Alldridge"
            }
        ],
        "CAP": [
            {
                "majortopic": "21",
                "id": "79846",
                "Chamber": "1",
                "subtopic": "2103"
            },
            {
                "majortopic": "21",
                "id": "79850",
                "Chamber": "1",
                "subtopic": "401"
            }
        ]
    },
    {
        "Date": "June,2000",
        "GPO": [
            {
                "hearing_sub_type": "General",
                "Old_ID": "21",
                "witnesses": "Blitzer"
            },
            {
                "hearing_sub_type": "Oversight",
                "Old_ID": "24",
                "witnesses": "Bruce"
            }
        ],
        "CAP": [
            {
                "majortopic": "4",
                "id": "79847",
                "Chamber": "2",
                "subtopic": "705"
            },
            {
                "majortopic": "7",
                "id": "79849",
                "Chamber": "2",
                "subtopic": "201"
            }
        ]
    },
    {
        "Date": "April,2001",
        "GPO": [
            {
                "hearing_sub_type": "Oversight",
                "Old_ID": "25",
                "witnesses": "Michael"
            }
        ],
        "CAP": [
            {
                "majortopic": null,
                "id": null,
                "Chamber": null,
                "subtopic": null
            }
        ]
    },
    {
        "Date": "May,2001",
        "GPO": [
            {
                "hearing_sub_type": null,
                "Old_ID": null,
                "witnesses": null
            }
        ],
        "CAP": [
            {
                "majortopic": "13",
                "id": "79848",
                "Chamber": "1",
                "subtopic": "1802"
            }
        ]
    }
]

相关问题 更多 >

    热门问题