Preventing duplicate rows when performing a merge

Published 2024-09-30 08:20:05


I'm working on a data analysis project and have run into trouble.

Basically, suppose I have a sample CSV "A":

id   | item_num
A123 |     1
A123 |     2
B456 |     1

and a sample CSV "B":

id   | description
A123 | Mary had a...
A123 | ...little lamb.
B456 | ...Its fleece...

If I perform a merge with Pandas, the result is:

id   | item_num | description
A123 |     1    | Mary had a...
A123 |     2    | Mary had a...
A123 |     1    | ...little lamb.
A123 |     2    | ...little lamb.
B456 |     1    | Its fleece...
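
The duplication shown above is what a many-to-many join produces: each of the two A123 rows in A is paired with each of the two A123 rows in B. A minimal sketch that reproduces it with in-memory frames follows; the names `a` and `b` are placeholders standing in for the two CSV files, not names from the original code.

```python
import pandas as pd

# Placeholder frames standing in for A.csv and B.csv from the question.
a = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
b = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# "id" is the only column the frames share, so it becomes the join key.
# Each of the two A123 rows in `a` matches both A123 rows in `b`.
merged = pd.merge(a, b)

print(merged.shape)
print(merged.groupby("id").size())
```

Counting rows per id this way is a quick check for whether a join key is repeated on both sides.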

How can I get this instead:

id   | item_num | description
A123 |     1    | Mary had a...
A123 |     2    | ...little lamb...
B456 |     1    | Its fleece...

Here is my code:

import pandas as pd

# Import CSVs
first = pd.read_csv("../PATH_TO_CSV/A.csv")
print("Imported first CSV: " + str(first.shape))
second = pd.read_csv("../PATH_TO_CSV/B.csv")
print("Imported second CSV: " + str(second.shape))


# Merge the two frames on their shared column(s). The original built an empty
# DataFrame and called DataFrame.append, which was removed in pandas 2.0.
result = pd.merge(first, second)
print("Merged CSVs... resulting DataFrame is: " + str(result.shape))

# Let's do a "dedupe" to deal with an issue in how Pandas handles datetime merges.
# I read about an issue where, if datetime is involved, duplicate entries will be created.
result = result.drop_duplicates()
print("Deduping... resulting DataFrame is: " + str(result.shape))

# Save to another CSV
result.to_csv("EXPORT.csv", index=False)
print("Saved to file.")

I'd really appreciate any help. I'm stuck, and I have more than 20,000 rows to process.

Thanks.

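Edit: My post was flagged as a possible duplicate. It isn't, because I'm not necessarily trying to add a column. I just want to stop description from being multiplied by the number of item_num values that belong to a given id.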


Update, June 21:

How can I merge the two DataFrames if they look like this?

id   | item_num | other_col
A123 |     1    | lorem ipsum
A123 |     2    | dolor sit
A123 |     3    | amet, consectetur
B456 |     1    | lorem ipsum

and a sample "B":

id   | item_num | description
A123 |     1    | Mary had a...
A123 |     2    | ...little lamb.
B456 |     1    | ...Its fleece...

so that I end up with:

id   | item_num |  other_col  | description
A123 |     1    | lorem ipsum | Mary Had a...
A123 |     2    | dolor sit   | ...little lamb.
B456 |     1    | lorem ipsum | ...Its fleece...

That is, the row with item_num 3, which has "amet, consectetur" in other_col, is dropped.
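
For this updated layout, where both frames already carry item_num, a plain inner merge on both key columns appears to give the result described. The sketch below assumes the two tables are loaded as frames named `a` and `b` (placeholder names) and that dropping unmatched rows is intended.

```python
import pandas as pd

# Placeholder frames mirroring the two tables in the update.
a = pd.DataFrame({
    "id": ["A123", "A123", "A123", "B456"],
    "item_num": [1, 2, 3, 1],
    "other_col": ["lorem ipsum", "dolor sit", "amet, consectetur", "lorem ipsum"],
})
b = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "item_num": [1, 2, 1],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# Inner join on both key columns: the (A123, 3) row has no partner in `b`,
# so it does not appear in the result.
result = a.merge(b, on=["id", "item_num"], how="inner")

print(result)
```

Passing `how="left"` instead would keep the item_num 3 row with a missing description, if that row should be retained.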


Tags: csv id description result item num its mary
3 answers

I think you need concat:

result = pd.concat([df1.set_index('id'), df2.set_index('id')],axis = 1).reset_index()

You get:

    id      item_num    description
0   A123    1           Mary had a...
1   A123    2           ...little lamb
2   B456    1           ...Its fleece...

I would do it this way:

In [135]: result = A.merge(B.assign(item_num=B.groupby('id').cumcount()+1))

In [136]: result
Out[136]:
     id  item_num       description
0  A123         1     Mary had a...
1  A123         2   ...little lamb.
2  B456         1  ...Its fleece...

Explanation: we can create a "virtual" item_num column in the B DataFrame to join on:

In [137]: B.assign(item_num=B.groupby('id').cumcount()+1)
Out[137]:
     id       description  item_num
0  A123     Mary had a...         1
1  A123   ...little lamb.         2
2  B456  ...Its fleece...         1
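
For anyone who wants to run this outside an IPython session, here is a self-contained sketch of the same idea. It assumes the descriptions in B appear in the same order as the item numbers in A within each id; if that ordering does not hold, the pairing will be wrong. `B_numbered` is an illustrative name, not part of the answer above.

```python
import pandas as pd

# Sample frames matching the question's CSVs "A" and "B".
A = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
B = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# cumcount() numbers the rows inside each id group starting at 0;
# adding 1 makes the numbering line up with item_num in A.
B_numbered = B.assign(item_num=B.groupby("id").cumcount() + 1)

# Both frames now share id and item_num, so each row of A
# matches at most one row of B.
result = A.merge(B_numbered, on=["id", "item_num"])

print(result)
```

Naming the keys explicitly with `on=` is optional here, since merge joins on all shared columns by default, but it makes the intent easier to read.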

Try setting an index on your df and then dropping duplicates:

df = df.set_index(['id', 'item_num']).drop_duplicates()
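
One thing worth checking before relying on this: `drop_duplicates` compares column values only and does not look at the index. The sketch below applies the line to the merged sample data from the question, assuming `df` is the merged frame; `a`, `b`, and `deduped` are placeholder names.

```python
import pandas as pd

# Sample frames matching the question's CSVs "A" and "B".
a = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
b = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# The many-to-many merge from the question (5 rows).
df = a.merge(b)

# After set_index, "description" is the only remaining column, so
# duplicates are judged on the description text alone.
deduped = df.set_index(["id", "item_num"]).drop_duplicates()

print(deduped)
```

On this sample the result has three rows, but they are chosen by first occurrence of each description rather than by pairing, so the item_num values may not match the output the question asks for.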
