Select one row per category based on a number

Posted 2024-09-28 22:25:32


I have a list with 5 columns. Column 5 holds numbers and column 1 is a group identifier. There are 500 rows in total, but only 24 groups.

What I want is to select just one row for each group identifier: the row with the smallest number in column 5.

For example:

sheet= """ 
cmn1\tcmn2\tcmn3\tcmn4\tcmn5
rob\t45\tfoo\tbar\t0.0001
Steve\t32\tfoo\tspam\t0.01
rob\t45\tbar\tfoo\t0.0000001
Steve\t32\tfoo\tbar\t0.1"""

This is the desired result:

cmn1\tcmn2\tcmn3\tcmn4\tcmn5
Steve\t32\tfoo\tspam\t.01
rob\t45\tbar\tfoo\t0.0000001

I have the fields of each row in a list, but I am stuck on how to select the row with the smallest number in parts[4].

# split the sheet into lines once, instead of once per character
lines = sheet.strip().split("\n")

for part in lines:
    # split each line into its tab-separated fields
    parts = part.split("\t")
    print(parts[0], parts[1], parts[2], parts[3], parts[4])

3 answers

You can use itertools.groupby to group the split rows by their first item, then use the min function with an appropriate key to select the desired row:

>>> from itertools import groupby
>>> from operator import itemgetter
>>> s = sorted((line.split() for line in sheet.strip().split('\n')[1:]), key=itemgetter(0))
>>> [' '.join(min(g, key=lambda x: float(x[4]))) for _, g in groupby(s, itemgetter(0))]
['Steve 32 foo spam 0.01', 'rob 45 bar foo 0.0000001']
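Note that groupby only merges adjacent items with equal keys, which is why the rows are sorted by the first column before grouping. As a minimal sketch, the same idea can be wrapped in a function that splits on tabs and returns the header separately. The function name `min_row_per_group` and its parameters are hypothetical, chosen here for illustration only:

```python
from itertools import groupby
from operator import itemgetter


def min_row_per_group(text, key_col=0, value_col=4, sep="\t"):
    """Return (header, rows): for each key_col value, the row with the smallest value_col."""
    lines = text.strip().split("\n")
    header, body = lines[0], lines[1:]
    # groupby only groups adjacent rows, so sort by the group key first
    rows = sorted((line.split(sep) for line in body), key=itemgetter(key_col))
    best = [
        min(group, key=lambda row: float(row[value_col]))
        for _, group in groupby(rows, key=itemgetter(key_col))
    ]
    return header, best


# sample data from the question, built line by line
sheet = "\n".join([
    "cmn1\tcmn2\tcmn3\tcmn4\tcmn5",
    "rob\t45\tfoo\tbar\t0.0001",
    "Steve\t32\tfoo\tspam\t0.01",
    "rob\t45\tbar\tfoo\t0.0000001",
    "Steve\t32\tfoo\tbar\t0.1",
])

header, best = min_row_per_group(sheet)
print(header)
for row in best:
    print("\t".join(row))
```

Because of the sort, the groups come out in key order ("Steve" sorts before "rob", since uppercase letters sort first), not in the order they appear in the input.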

sheet = """cmn1 cmn2 cmn3 cmn4 cmn5
rob  45   foo  bar  0.0001
Steve 32  foo  spam 0.01
rob   45  bar  foo  0.0000001
Steve 32  foo  bar  0.1"""

from collections import defaultdict

d = defaultdict(list)
spl = sheet.splitlines()
header = spl[0]
# iterate over all lines except header
for line in spl[1:]:
    # split once on whitespace using name as the key 
    name = line.split(None,1)[0]
    # append each line to our list of values
    d[name].append(line)

# get min of each line in our values based on the last float value
for v in d.values():
    print(min(v,key=lambda x: float(x.split()[-1])))

Steve 32  foo  spam 0.01
rob   45  bar  foo  0.0000001

If order matters, you can use an OrderedDict and compare the values as you go:

from collections import OrderedDict

d = OrderedDict()
spl = sheet.splitlines()
header = spl[0]
for line in spl[1:]:
    # unpack five elements after splitting
    # using name as key and f to cast to float and compare
    name, _, _, _, f = line.split()
    # if key exists compare float value to current float value
    # keeping or replacing the values based on the outcome
    if name in d and float(d[name].split()[-1]) > float(f):
        d[name] = line
    # else if first time seeing name just add it
    elif name not in d:
        d[name] = line

print(header)
for v in d.values():
    print(v)

cmn1 cmn2 cmn3 cmn4 cmn5
rob   45  bar  foo  0.0000001
Steve 32  foo  spam 0.01

With your edited lines, you can see that the output is unchanged. Each line is exactly the same as in the original:

for v in d.values():
    print(repr(v))

'rob\t45\tbar\tfoo\t0.0000001'
'Steve\t32\tfoo\tspam\t0.01'
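On Python 3.7 and later, a plain dict also preserves insertion order, so the single-pass comparison above can be sketched without OrderedDict. This is a minimal sketch under that assumption. The name `best` and the choice to store a `(value, line)` pair are illustrative, not part of the answer:

```python
sheet = """cmn1 cmn2 cmn3 cmn4 cmn5
rob   45  foo  bar  0.0001
Steve 32  foo  spam 0.01
rob   45  bar  foo  0.0000001
Steve 32  foo  bar  0.1"""

header, *rows = sheet.splitlines()

# name -> (smallest value seen so far, original line)
best = {}
for line in rows:
    fields = line.split()
    name, value = fields[0], float(fields[-1])
    # keep the line if the name is new or its value is smaller than the stored one
    if name not in best or value < best[name][0]:
        best[name] = (value, line)

result = [line for _, line in best.values()]
print(header)
print("\n".join(result))
```

Storing the parsed float next to the line avoids re-splitting the stored line on every comparison, and overwriting an existing key keeps its original position, so groups stay in first-seen order.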

You can use a dictionary to store all the rows for each unique value in column 1:

sheet= """cmn1\tcmn2\tcmn3\tcmn4\tcmn5
rob\t45\tfoo\tbar\t0.0001
Steve\t32\tfoo\tspam\t0.01
rob\t45\tbar\tfoo\t0.0000001
Steve\t32\tfoo\tbar\t0.1"""

grouped = {}
for line in sheet.split('\n')[1:]:
  parts = line.split('\t')
  print (line)
  # Parse the numbers into numerical types
  typed = (parts[0], int(parts[1]), parts[2], parts[3], float(parts[4]))
  #Add the typed list of values into a list stored in our dict
  if parts[0] in grouped.keys():
    grouped[parts[0]].append(typed) 
  else:
    grouped[parts[0]] = [typed]

#Now you can go through all the keys in the dict and select the smallest  
smallest_per_group = []
for key in grouped:
  lines = grouped[key]
  # using the 'key' parameter tells Python to give us the line with the smallest 5th column
  smallest = min(lines, key=lambda x:x[4])
  smallest_per_group.append(smallest)
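The code above leaves the winning rows in `smallest_per_group` as typed tuples. If the rows need to be written back out unchanged, one thing to watch is that converting column 5 to float and back alters the text: `str(float("0.0000001"))` gives `1e-07`. A minimal sketch of the same grouping that keeps the fields as strings and converts only inside the `key` function (the use of `dict.setdefault` is a substitution made here, not part of the answer):

```python
# sample data from the question, built line by line
sheet = "\n".join([
    "cmn1\tcmn2\tcmn3\tcmn4\tcmn5",
    "rob\t45\tfoo\tbar\t0.0001",
    "Steve\t32\tfoo\tspam\t0.01",
    "rob\t45\tbar\tfoo\t0.0000001",
    "Steve\t32\tfoo\tbar\t0.1",
])

grouped = {}
for line in sheet.split("\n")[1:]:
    parts = line.split("\t")
    # setdefault creates the empty list the first time a name is seen
    grouped.setdefault(parts[0], []).append(parts)

# compare as floats, but keep the original string fields in the result
smallest_per_group = [
    min(rows, key=lambda row: float(row[4])) for rows in grouped.values()
]

for row in smallest_per_group:
    print("\t".join(row))
```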
