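I have two CSV files. They have some columns in common, but the values in any one of those columns are not unique across rows, as shown below: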
gpo_full.csv:
Date          hearing_sub_type  topic      Specific_Date
January,1997  Oversight         weather    January 12,1997
June,2000     General           life       June 5,2000
January,1997  General           forest     January 1,1997
April,2001    Oversight         people     NaN
June,2000     Oversight         depressed  June 6,2000
January,1997  General           weather    January 1,1997
June,2000     Oversight         depressed  June 5,2000
CAP_cols.csv:
majortopic  id     Chamber  topic            Date          Specific_Date
21          79846  1        many forest      January,1997  January 1,1997
4           79847  2        emotion          June,2000     June 6,2000
13          79848  1        NaN              May,2001      NaN
7           79849  2        good life        June,2000     June 5,2000
21          79850  1        good weather     January,1997  January 1,1997
25          79851  1        rain & cloudy    January,1997  January 12,1997
6           79852  2        sad & depressed  June,2000     June 5,2000
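I want to match these data using three criteria: the specific date, the date, and the topic.
First, I want to group the data by the "Date" column. Next, I want to narrow the candidates down with the "Specific_Date" column, keeping in mind that some values in this column are missing. Finally, I want to compare the "topic" columns by word similarity (for example, with word embeddings) to determine which rows in gpo_full correspond to a unique row in CAP_cols.
I tried grouping the data by the "Date" column and merging the groups into a JSON file. However, I am stuck on the next steps: narrowing down by specific date and by topic.
My idea for the output is as follows: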
{
  "Date": "January,1997",
  "Specific_Date": "January 12,1997",
  "Topic": {"GPO": "weather", "CAP": "rain & cloudy"},
  "GPO": {
    "hearing_sub_type": "Oversight",
    and other columns
  },
  "CAP": {
    "majortopic": "25",
    "id": "79851",
    "Chamber": "1"
  }
},
{
  "Date": "January,1997",
  "Specific_Date": "January 1,1997",
  "Topic": {"GPO": "forest", "CAP": "many forest"},
  "GPO": {
    "hearing_sub_type": "General",
    and other columns
  },
  "CAP": {
    "majortopic": "21",
    "id": "79846",
    "Chamber": "1"
  }
}
and similar for the others
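I have been thinking about this for three days and have no idea how to proceed. Any ideas on how to achieve this would be very helpful! Thank you very much.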
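Topic matching has a few issues, so you will need to extend the match_topic() method I used, but I added some logic to show what is left unmatched at the end. The results variable contains a list of dicts, which you can easily save as a JSON file. Check the inline comments to understand the reasoning behind the logic I used.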
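Side note:
If I were you, I would restructure the JSON slightly. To me, it makes more sense to put topic as a key/value pair under the GPO and CAP keys than to have a Topic key with separate GPO and CAP key/value pairs.
Output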
I have removed the output of pprint(results); see the JSON below.
path/to/file/gpo_full.csv
path/to/file/CAP_cols.csv
path/to/file/combined_json.json
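To illustrate the restructuring suggested in the side note, here is a small sketch that moves each topic under its own GPO or CAP key. The helper name restructure() is hypothetical, and the input record simply follows the shape proposed in the question.

```python
import json


def restructure(record):
    """Move Topic["GPO"] / Topic["CAP"] under the GPO and CAP keys."""
    topic = record.get("Topic", {})
    return {
        "Date": record["Date"],
        "Specific_Date": record["Specific_Date"],
        "GPO": {"topic": topic.get("GPO"), **record["GPO"]},
        "CAP": {"topic": topic.get("CAP"), **record["CAP"]},
    }


# Record in the shape proposed in the question.
original = {
    "Date": "January,1997",
    "Specific_Date": "January 12,1997",
    "Topic": {"GPO": "weather", "CAP": "rain & cloudy"},
    "GPO": {"hearing_sub_type": "Oversight"},
    "CAP": {"majortopic": "25", "id": "79851", "Chamber": "1"},
}

nested = restructure(original)
print(json.dumps(nested, indent=2))
```

With this layout every source file's fields, including its topic, sit together under one key, so a record can be read without cross-referencing a separate Topic entry.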