Merging two CSV files based on multiple criteria

1 answer

User

#1 · posted 2024-09-27 07:27:27

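There are a few problems with the topic matching, so you will need to extend the match_topic() function I used. I also added some logic so you can see what is left unmatched at the end.

The results variable holds a list of dicts, which you can easily save as a JSON file.

See the inline comments for the reasoning behind the logic.

Side note:

If I were you, I would restructure the JSON slightly. To me it makes more sense to put topic as a key/value pair under the GPO and CAP keys than to have a Topic key with separate GPO and CAP key/value pairs.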

import csv
from pprint import pprint
import json


# load gpo_full.csv into a list of dicts using csv.DictReader
# (newline="" is what the csv module docs recommend when opening files)
with open("path/to/file/gpo_full.csv", newline="") as infile:
    gpo_full = list(csv.DictReader(infile))


# do the same for CAP_cols.csv
with open("path/to/file/CAP_cols.csv", newline="") as infile:
    cap_cols = list(csv.DictReader(infile))


def match_topic(gpo_topic: str, cap_topic: str) -> bool:
    """We need a function as some of the mapping is not simple

    Args:
        gpo_topic (str): gpo topic
        cap_topic (str): CAP topic

    Returns:
        bool: True if topics match
    """
    # this one is simple
    if gpo_topic in cap_topic:
        return True
    # you need to repeat the below conditional check
    # for each custom topic matching
    elif gpo_topic == "weather" and cap_topic == "rain & cloudy":
        return True
    # example secondary topic matching
    elif gpo_topic == "foo" and cap_topic == "bar":
        return True
    # finally return false for no matches
    return False


# we need this later
gpo_length = len(gpo_full)
results = []
cap_left_over = []
# do the actual mapping
# this could've been done above, but I separated it intentionally
for cap in cap_cols:
    found = False
    # first find the corresponding gpo
    for index, gpo in enumerate(gpo_full):
        if (
            gpo["Specific_Date"] == cap["Specific_Date"] # check by date
            and match_topic(gpo["topic"], cap["topic"]) # check if topics match
        ):
            results.append({
                "Date": gpo["Date"],
                "Specific_Date": gpo["Specific_Date"],
                "Topic": {
                    "GPO": gpo["topic"],
                    "CAP": cap["topic"]
                },
                "GPO": {
                    "hearing_sub_type": gpo["hearing_sub_type"]
                },
                "CAP": {
                    "majortopic": cap["majortopic"],
                    "id": cap["id"],
                    "Chamber": cap["Chamber"]
                }
            })
            # pop & break to remove the gpo item
            # this is so you're left over with a list of
            # gpo items that didn't match
            # it also speeds up further matches
            gpo_full.pop(index)
            found = True
            break
    # this is to check if there's stuff left over
    if not found:
        cap_left_over.append(cap)


with open('path/to/file/combined_json.json', 'w') as outfile:
    json.dump(results, outfile, indent=4)


pprint(results)
# report how many rows were matched and how many rows each input had
print(f'\nLength:\n  Results: {len(results)}\n  CAP: {len(cap_cols)}\n  GPO: {gpo_length}')
print('\nLeftover GPO:')
pprint(gpo_full)
print('\nLeftover CAP:')
pprint(cap_left_over)

Output
I have removed pprint(results) from the output; see the JSON below.

Length:
  Results: 5
  CAP: 7
  GPO: 7

Leftover GPO:
[{'Date': 'April,2001',
  'Specific_Date': 'NaN ',
  'hearing_sub_type': 'Oversight',
  'topic': 'people'},
 {'Date': 'June,2000',
  'Specific_Date': 'June 6,2000',
  'hearing_sub_type': 'Oversight',
  'topic': 'depressed'}]

Leftover CAP:
[{'Chamber': '2',
  'Date': 'June,2000',
  'Specific_Date': 'June 6,2000',
  'id': '79847',
  'majortopic': '4',
  'topic': 'emotion'},
 {'Chamber': '1',
  'Date': 'May,2001',
  'Specific_Date': 'NaN',
  'id': '79848',
  'majortopic': '13',
  'topic': 'NaN'}]
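
The leftovers above are the rows that match_topic() does not cover yet. One way to extend it without adding another elif per case might be a lookup table of custom pairs. This is only a sketch: the name CUSTOM_MATCHES is made up for illustration, and the ("depressed", "emotion") pair is an assumption about how you may want the leftover June 6,2000 rows handled, not something taken from your data rules.

```python
# Hypothetical table of custom (gpo_topic, cap_topic) pairs.
# The first two pairs come from the code above; the third is an
# assumed example for the leftover rows and may not fit your data.
CUSTOM_MATCHES = {
    ("weather", "rain & cloudy"),
    ("foo", "bar"),
    ("depressed", "emotion"),
}


def match_topic(gpo_topic: str, cap_topic: str) -> bool:
    """Return True if the GPO topic matches the CAP topic."""
    # simple case: the GPO topic is contained in the CAP topic
    if gpo_topic in cap_topic:
        return True
    # custom case: the pair is listed in the lookup table
    return (gpo_topic, cap_topic) in CUSTOM_MATCHES


print(match_topic("forest", "many forest"))     # True
print(match_topic("weather", "rain & cloudy"))  # True
print(match_topic("people", "NaN"))             # False
```

Adding a new custom match is then a one-line change to the table rather than a new branch in the function.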

path/to/file/gpo_full.csv

Date,hearing_sub_type,topic,Specific_Date
"January,1997",Oversight,weather,"January 12,1997"
"June,2000",General,life,"June 5,2000"
"January,1997",General,forest,"January 1,1997"
"April,2001",Oversight,people,NaN 
"June,2000",Oversight,depressed,"June 6,2000"
"January,1997",General,weather,"January 1,1997"
"June,2000",Oversight,depressed,"June 5,2000"
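
Note that the Specific_Date of the "people" row is 'NaN ' with a trailing space, so an exact comparison against 'NaN' would fail. If that matters for your data, stripping whitespace while loading might help. A minimal sketch, assuming every row has all columns; load_rows is a hypothetical helper name, and the sample is two rows copied from the file above:

```python
import csv
import io

# Header plus two rows from gpo_full.csv; the first data row ends in "NaN ".
sample = (
    'Date,hearing_sub_type,topic,Specific_Date\n'
    '"April,2001",Oversight,people,NaN \n'
    '"June,2000",Oversight,depressed,"June 6,2000"\n'
)


def load_rows(infile):
    """Read CSV rows as dicts with surrounding whitespace stripped from values."""
    return [
        {key: value.strip() for key, value in row.items()}
        for row in csv.DictReader(infile)
    ]


rows = load_rows(io.StringIO(sample))
print(repr(rows[0]["Specific_Date"]))  # 'NaN'
```

With a real file you would pass the object returned by open() instead of io.StringIO.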

path/to/file/CAP_cols.csv

majortopic,id,Chamber,topic,Date,Specific_Date
21,79846,1,many forest,"January,1997","January 1,1997"
4,79847,2,emotion,"June,2000","June 6,2000"
13,79848,1,NaN,"May,2001","NaN"
7,79849,2,good life,"June,2000","June 5,2000"
21,79850,1,good weather,"January,1997","January 1,1997"
25,79851,1,rain & cloudy,"January,1997","January 12,1997"
6,79852,2,sad & depressed,"June,2000","June 5,2000"

path/to/file/combined_json.json

[
    {
        "Date": "January,1997",
        "Specific_Date": "January 1,1997",
        "Topic": {
            "GPO": "forest",
            "CAP": "many forest"
        },
        "GPO": {
            "hearing_sub_type": "General"
        },
        "CAP": {
            "majortopic": "21",
            "id": "79846",
            "Chamber": "1"
        }
    },
    {
        "Date": "June,2000",
        "Specific_Date": "June 5,2000",
        "Topic": {
            "GPO": "life",
            "CAP": "good life"
        },
        "GPO": {
            "hearing_sub_type": "General"
        },
        "CAP": {
            "majortopic": "7",
            "id": "79849",
            "Chamber": "2"
        }
    },
    {
        "Date": "January,1997",
        "Specific_Date": "January 1,1997",
        "Topic": {
            "GPO": "weather",
            "CAP": "good weather"
        },
        "GPO": {
            "hearing_sub_type": "General"
        },
        "CAP": {
            "majortopic": "21",
            "id": "79850",
            "Chamber": "1"
        }
    },
    {
        "Date": "January,1997",
        "Specific_Date": "January 12,1997",
        "Topic": {
            "GPO": "weather",
            "CAP": "rain & cloudy"
        },
        "GPO": {
            "hearing_sub_type": "Oversight"
        },
        "CAP": {
            "majortopic": "25",
            "id": "79851",
            "Chamber": "1"
        }
    },
    {
        "Date": "June,2000",
        "Specific_Date": "June 5,2000",
        "Topic": {
            "GPO": "depressed",
            "CAP": "sad & depressed"
        },
        "GPO": {
            "hearing_sub_type": "Oversight"
        },
        "CAP": {
            "majortopic": "6",
            "id": "79852",
            "Chamber": "2"
        }
    }
]
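
For the side note about restructuring the JSON, this is roughly what I had in mind: move each topic under its own GPO or CAP key and drop the separate Topic key. A sketch only, using the first record above; restructure is a hypothetical helper name.

```python
import json


def restructure(record: dict) -> dict:
    """Move the topics from the Topic key into the GPO and CAP sub-dicts."""
    return {
        "Date": record["Date"],
        "Specific_Date": record["Specific_Date"],
        "GPO": {"topic": record["Topic"]["GPO"], **record["GPO"]},
        "CAP": {"topic": record["Topic"]["CAP"], **record["CAP"]},
    }


# first record from combined_json.json
record = {
    "Date": "January,1997",
    "Specific_Date": "January 1,1997",
    "Topic": {"GPO": "forest", "CAP": "many forest"},
    "GPO": {"hearing_sub_type": "General"},
    "CAP": {"majortopic": "21", "id": "79846", "Chamber": "1"},
}

print(json.dumps(restructure(record), indent=4))
```

You could apply it to the whole list with [restructure(r) for r in results] before calling json.dump.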
