In Python, how do I merge rows with duplicate values in one column and keep the max value from another column?

Posted 2024-09-30 01:33:07


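I want to find the duplicate values in the "reference" column and remove the duplicates by keeping, for each reference, only the row with the largest value in the "amount" column.

Current: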

+-----------+--------+---------+---------+
| reference | amount | column3 | column4 |
+-----------+--------+---------+---------+
| test1     |      9 |      45 | ye      |
| test1     |    200 |      45 | agag    |
| test1     |      1 |      45 | aaa     |
| test2     |     99 |      45 | bbab    |
| test1     |     11 |      45 | value   |
+-----------+--------+---------+---------+

Expected:

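+-----------+--------+---------+---------+
| reference | amount | column3 | column4 |
+-----------+--------+---------+---------+
| test1     |    200 |      45 | agag    |
| test2     |     99 |      45 | bbab    |
+-----------+--------+---------+---------+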

Please share any pointers for this situation.


3 answers

pandas is a very good Python module for working with tabular data. It is much like R and gives you a kind of in-memory database. For your example it is simple:

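import pandas as pd

df = pd.read_csv('test.csv')
# Largest amount for each reference, with 'reference' kept as a regular column.
a = df.groupby('reference', as_index=False)['amount'].max()
# Match on both columns so that an amount shared by two different
# references cannot pull in an extra row.
answer = df.merge(a, on=['reference', 'amount'])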

And save the result back to CSV:

^{pr2}$
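
A minimal sketch of that save step, assuming the result is written with pandas' `DataFrame.to_csv`. The output file name `result.csv` is hypothetical, and the small `answer` frame below only stands in for the merged result so that the snippet runs on its own:

```python
import pandas as pd

# Stand-in for the `answer` DataFrame (the two max-amount rows).
answer = pd.DataFrame({
    'reference': ['test1', 'test2'],
    'amount': [200, 99],
    'column3': [45, 45],
    'column4': ['agag', 'bbab'],
})

# 'result.csv' is a hypothetical file name. index=False leaves out the
# DataFrame's integer index, so the file has the same columns as the input.
answer.to_csv('result.csv', index=False)

# Read the file back to confirm what was written.
print(pd.read_csv('result.csv'))
```

Writing to a different file than the input keeps the original data intact while you check the result.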

This assumes test.csv, your data file, looks like this:

reference,amount,column3,column4
test1,9,45,ye
test1,200,45,agag
test1,1,45,aaa
test2,99,45,bbab
test1,11,45,value
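
As an alternative to group-then-merge, the same rows can probably be selected with `idxmax`. This is a different technique from the one in the answer, shown only as a sketch. The sample data is embedded through `io.StringIO` so the snippet runs without a file, and the variable name `best` is just illustrative:

```python
import io
import pandas as pd

# Sample data copied from the question (stands in for test.csv).
csv_text = """reference,amount,column3,column4
test1,9,45,ye
test1,200,45,agag
test1,1,45,aaa
test2,99,45,bbab
test1,11,45,value
"""
df = pd.read_csv(io.StringIO(csv_text))

# For each reference, idxmax gives the index label of the row holding
# the largest amount; .loc then pulls out those complete rows.
best = df.loc[df.groupby('reference')['amount'].idxmax()]
print(best.to_string(index=False))
```

One difference worth knowing: when two rows of the same reference tie for the largest amount, `idxmax` keeps only the first of them, whereas a merge keeps every tied row.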

Something like the following would be a good start:

import collections
import csv

with open("mydata.csv", newline="") as f_input:
    csv_input = csv.reader(f_input)
    # Assuming the first row contains the heading names; otherwise remove this line.
    headings = next(csv_input)
    # Maps each reference to the row with the largest amount seen so far.
    d_max_rows = collections.OrderedDict()

    for cols in csv_input:
        reference = cols[0]
        if reference in d_max_rows:
            cur_max = d_max_rows[reference]
            # Compare as integers; on a tie the later row replaces the earlier one.
            if int(cols[1]) >= int(cur_max[1]):
                d_max_rows[reference] = cols
        else:
            d_max_rows[reference] = cols

lrows = [headings] + list(d_max_rows.values())

for reference, amount, col3, col4 in lrows:
    print("%-15s %-10s %-10s %-10s" % (reference, amount, col3, col4))

This gives you the following output:

reference       amount     column3    column4
test1           200        45         agag
test2           99         45         bbab

Here is some code that does what you need:

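from collections import namedtuple
import csv

Record = namedtuple('Record', 'reference amount column3 column4')

# Maps each reference to the record with the largest amount seen so far.
# The header row is stored under the key 'reference'; since that key never
# repeats, int() is never applied to the header text, and because dicts keep
# insertion order (Python 3.7+) the header is written out first.
no_dups = {}
with open('references.csv', 'r', newline='') as csvfile:
    for rec in map(Record._make, csv.reader(csvfile)):
        if (rec.reference not in no_dups or
            int(no_dups[rec.reference].amount) < int(rec.amount)):
            no_dups[rec.reference] = rec

with open('references_out.csv', 'w', newline='') as csvfile:
    csv.writer(csvfile).writerows(no_dups.values())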
