I'm working on a data analysis project and have run into a problem.
Basically, if I have a sample CSV "A":
id | item_num
A123 | 1
A123 | 2
B456 | 1
And I have a sample CSV "B":
id | description
A123 | Mary had a...
A123 | ...little lamb.
B456 | ...Its fleece...
If I perform a merge using Pandas, the result looks like this:
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | Mary had a...
A123 | 1 | ...little lamb.
A123 | 2 | ...little lamb.
B456 | 1 | Its fleece...
How can I get it to look like this instead:
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb...
B456 | 1 | Its fleece...
Here is my code:
import pandas as pd
# Import CSVs
first = pd.read_csv("../PATH_TO_CSV/A.csv")
print("Imported first CSV: " + str(first.shape))
second = pd.read_csv("../PATH_TO_CSV/B.csv")
print("Imported second CSV: " + str(second.shape))
# Merge the two DataFrames on their shared column(s).
# (DataFrame.append was removed in pandas 2.0, so the merge result is used directly.)
result = pd.merge(first, second)
print("Merged CSVs... resulting DataFrame is: " + str(result.shape))
# Let's do a "dedupe" to deal with an issue in how Pandas handles datetime merges.
# I read about an issue where, if datetime is involved, duplicate entries will be created.
result = result.drop_duplicates()
print("Deduping... resulting DataFrame is: " + str(result.shape))
# Save to another CSV
result.to_csv("EXPORT.csv", index=False)
print("Saved to file.")
I would really appreciate any help - I'm very stuck! I'm dealing with 20,000+ rows.
Thanks.
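EDIT: My post was flagged as a possible duplicate. It isn't one, because I'm not necessarily trying to add a column - I just want to stop description from being multiplied by the number of item_num values that belong to a particular id.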
UPDATE, June 21:
How could I do the merge if the two DFs looked like this?
id | item_num | other_col
A123 | 1 | lorem ipsum
A123 | 2 | dolor sit
A123 | 3 | amet, consectetur
B456 | 1 | lorem ipsum
And I have a sample "B":
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb.
B456 | 1 | ...Its fleece...
So that I end up with:
id | item_num | other_col | description
A123 | 1 | lorem ipsum | Mary had a...
A123 | 2 | dolor sit | ...little lamb.
B456 | 1 | lorem ipsum | ...Its fleece...
That is, the row with item_num 3, which has "amet, consectetur" in other_col, is ignored.
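For the updated layout, where both tables carry id and item_num, one option that seems to fit is an inner merge on both columns. This is only a sketch: the DataFrame names A and B and the inline sample data are stand-ins for the two CSVs, and it assumes each (id, item_num) pair appears at most once in each table.

```python
import pandas as pd

# Sample data copied from the update above; A and B are illustrative names.
A = pd.DataFrame({
    "id": ["A123", "A123", "A123", "B456"],
    "item_num": [1, 2, 3, 1],
    "other_col": ["lorem ipsum", "dolor sit", "amet, consectetur", "lorem ipsum"],
})
B = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "item_num": [1, 2, 1],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# An inner merge keeps only the (id, item_num) pairs present in both tables,
# so the A123 / item_num 3 row, which has no match in B, is dropped.
result = pd.merge(A, B, on=["id", "item_num"], how="inner")
print(result)
```

Using how="left" instead would keep the item_num 3 row, with a missing description.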
I think you need concat.
Did you get it?
I would do it this way:
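A minimal sketch of how this could look, not necessarily the exact code intended here. It assumes the two sample tables from the question are loaded as DataFrames named A and B (illustrative names), that B's rows are already in item order within each id, and that groupby().cumcount() is an acceptable way to number them.

```python
import pandas as pd

# Sample data copied from the question; A and B are illustrative names.
A = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
B = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# Number B's rows within each id (1, 2, ...) so they line up with item_num.
# Assumption: B's row order within each id matches the item order in A.
B_numbered = B.assign(item_num=B.groupby("id").cumcount() + 1)

# Merge on both keys, so each description matches exactly one item
# instead of being repeated for every item_num of the same id.
result = A.merge(B_numbered, on=["id", "item_num"], how="left")
print(result)
```

If the row order in B cannot be trusted, this numbering will pair descriptions with the wrong items, so it is worth checking a few ids by hand.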
Explanation: we can create a "virtual" item_num column in the B DF to join on:
Try indexing your DFs and then removing the duplicates: