Ranking CSV numbers in ascending and descending order in Python

Posted 2024-09-29 21:42:18


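I'm surprised I couldn't find anything about ranking numbers in Python...

Basically, I need two scripts that do the same task: one ranking in ascending order and one in descending order.

row[2] is the number to rank, and row[4] is the cell the rank goes into.

row[0] + row[1] defines each dataset/group.

In the first example, the larger the number, the higher the rank (the largest number gets rank 1).

CSV example 1 (ranked down)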

uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,3,data
uniquedata2,uniquecell2,41,data,2,data
uniquedata2,uniquecell2,39,data,3,data
uniquedata2,uniquecell2,45,data,1,data
uniquedata2,uniquecell2,22,data,4,data
uniquedata1,uniquecell2,36,data,3,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data

In the second example, the larger the number, the lower the rank (the smallest number gets rank 1).

CSV example 2 (ranked up)

^{pr2}$

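The third example is ranked up and includes empty cells, which should be ranked last (if a group has two empty cells, they should get the same rank).

CSV example 3 (including empty cells)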

uniquedata1,uniquecell1,42,data,2,data
uniquedata1,uniquecell1,,data,3,data
uniquedata1,uniquecell1,13,data,1,data
uniquedata2,uniquecell2,41,data,2,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,22,data,1,data
uniquedata1,uniquecell2,36,data,1,data
uniquedata1,uniquecell2,66,data,3,data
uniquedata1,uniquecell2,40,data,2,data

Does anyone know how I can achieve the desired result?


3 Answers
import sys

#Read the input file
with open("input.txt") as f:
    input_data = [line.rstrip().split(",") for line in f]

#Put the value and index of each line into a dict,
#categorizing by the dataset/group name. 
#Each different dataset/group is a key of the dict,
#and each key's value is a list.
group_dict = {}
for index, line in enumerate(input_data):
    group_key = line[0]+","+line[1]
    if group_key not in group_dict:
        group_dict[group_key] = []
    group_dict[group_key].append([index, line[2], None])

#Sort each list of the dict by the numbers.
#Treat a blank as a very large number so it sorts last.
for key in group_dict.keys():
    group_dict[key] = sorted(group_dict[key], key=lambda x: sys.maxsize if x[1]=="" else int(x[1]))
    #####group_dict[key] = group_dict[key][::-1]
    ##### Uncomment the above line to sort in descending order

    #Check whether multiple items have the same number.
    #If so, give them the same rank.
    group_dict[key][0][2] = 1
    for i in range(1, len(group_dict[key])):
        group_dict[key][i][2] = (group_dict[key][i-1][2] if group_dict[key][i][1] == group_dict[key][i-1][1] else i+1)

#In order to keep the same line order with the input file, 
#get all the lists together into a new list, 
#and sort them by the line index (recorded when put them into the dict).
rank_list = []
for rank in group_dict.values():
    rank_list += rank
rank_list = sorted(rank_list, key=lambda x: x[0])
for rank in rank_list:
    input_data[rank[0]][4] = str(rank[2])

#Output the final list.
for line in input_data:
    print(",".join(line))

Test:

Input:

^{pr2}$

Output:

uniquedata1,uniquecell1,123,data,2,data
uniquedata1,uniquecell1,,data,3,data
uniquedata1,uniquecell1,111,data,1,data
uniquedata2,uniquecell2,456,data,1,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,,data,3,data
uniquedata2,uniquecell2,789,data,2,data
uniquedata1,uniquecell2,386,data,1,data
uniquedata1,uniquecell2,512,data,3,data
uniquedata1,uniquecell2,486,data,2,data  
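
The tie rule used above (equal values share a rank, and the next distinct value takes its 1-based position) can be isolated into a small function. The following is only a sketch: `competition_ranks` is a hypothetical helper, not part of the answer, and it assumes every cell is either an integer string or blank. It negates the sort key for descending order rather than reversing the sorted list, so blanks stay at the bottom in both directions.

```python
def competition_ranks(cells, descending=False):
    """Return 1-based ranks for a list of CSV cells (strings).

    Equal numbers share a rank and the next distinct number skips ahead
    (1, 1, 3). Blank cells always rank last and tie with each other.
    """
    def sort_key(i):
        cell = cells[i]
        if cell == "":
            return (1, 0)                  # blanks sort after every number
        value = int(cell)
        return (0, -value if descending else value)

    order = sorted(range(len(cells)), key=sort_key)
    ranks = [0] * len(cells)
    for position, i in enumerate(order):
        prev = order[position - 1] if position else None
        if prev is not None and sort_key(prev) == sort_key(i):
            ranks[i] = ranks[prev]         # tie: reuse the previous rank
        else:
            ranks[i] = position + 1        # otherwise rank = 1-based position
    return ranks


ascending = competition_ranks(["41", "", "", "22"])
descending = competition_ranks(["41", "", "", "22"], descending=True)
```

The returned list is in the same order as the input cells, so each rank can be written straight back into row[4] of the matching row.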

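If the only difference between the tasks is whether the ranking is ascending or descending, you don't really need two scripts; just make the direction an argument to a function, as shown. The StrCount class is trivial enough that it may not be worth having (but I left it in anyway).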

import csv
from itertools import count, groupby
import sys

_MIN_INT, _MAX_INT = -sys.maxsize-1, sys.maxsize
RANK_DOWN, RANK_UP = False, True # larger numbers to get higher or lower rank

class StrCount(count):
    """ Like itertools.count iterator but supplies string values. """
    def __next__(self):
        return str(super().__next__())

def rerank(filename, direction):
    # newline='' is what the csv module expects for files opened in text mode
    with open(filename, newline='') as inf:
        reader = csv.reader(inf)
        subst = _MIN_INT if direction else _MAX_INT  # subst value for empty cells
        # groupby only merges adjacent rows, so each dataset must be contiguous
        for dataset, rows in groupby(reader, key=lambda row: row[:2]):
            ranking = StrCount(1)
            prev = last_rank = None
            for row in sorted(rows,
                              key=lambda row: int(row[2]) if row[2] else subst,
                              reverse=direction):
                # consecutive empty cells reuse the rank given to the first one
                row[4] = (next(ranking) if row[2] or not row[2] and prev != ''
                                        else last_rank)
                print(','.join(row))
                prev, last_rank  = row[2], row[4]

if __name__ == '__main__':
    print('CSV example_1.csv (ranked down):')
    rerank('example_1.csv', RANK_DOWN)
    print('\nCSV example_2.csv (ranked up):')
    rerank('example_2.csv', RANK_UP)
    print('\nCSV example_3.csv (ranked up):')
    rerank('example_3.csv', RANK_UP)

Output:

^{pr2}$
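
One thing to keep in mind with itertools.groupby is that it starts a new group every time the key changes, so rows of the same dataset must be adjacent in the file. A minimal sketch of the difference, using made-up rows and a hypothetical `group_key` helper, with a dict-based alternative for files where a dataset's rows may be scattered:

```python
from collections import defaultdict
from itertools import groupby

# Illustrative rows only: the first and third rows belong to the same dataset
# but are not adjacent.
rows = [
    ["uniquedata1", "uniquecell1", "42"],
    ["uniquedata2", "uniquecell2", "41"],
    ["uniquedata1", "uniquecell1", "13"],
]

def group_key(row):
    """Dataset/group identity: the first two cells of the row."""
    return (row[0], row[1])

# groupby starts a new group whenever the key changes, so the first dataset is split.
adjacent_groups = [(key, len(list(group))) for key, group in groupby(rows, key=group_key)]

# Collecting rows in a dict keeps each dataset together regardless of row order.
collected = defaultdict(list)
for row in rows:
    collected[group_key(row)].append(row)
```

Here `adjacent_groups` has three entries while `collected` has two keys. Sorting the rows by `group_key` before calling groupby would also work, at the cost of changing the output order.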

If you use pandas, this is easy.

import io

import pandas as pd

def sorted_df(df, ascending=False):
    grouped = df.groupby([0,1])
    data = []
    for _, group in grouped:
        d = group.copy()  # work on a copy so the caller's frame is untouched
        d[4] = d[2].rank(ascending=ascending)
        d = d.sort_values(4)
        data.append(d)
    return pd.concat(data)

# load our dataframe from a csv string
f = io.StringIO("""uniquedata1,uniquecell1,42,data,1,data
uniquedata1,uniquecell1,32,data,2,data
uniquedata1,uniquecell1,13,data,3,data
uniquedata2,uniquecell2,41,data,2,data
uniquedata2,uniquecell2,39,data,3,data
uniquedata2,uniquecell2,45,data,1,data
uniquedata2,uniquecell2,22,data,4,data
uniquedata1,uniquecell2,36,data,3,data
uniquedata1,uniquecell2,66,data,1,data
uniquedata1,uniquecell2,40,data,2,data""")

df = pd.read_csv(f, header=None)
# sort descending
sorted_df(df)
=>           0            1   2     3  4     5
0  uniquedata1  uniquecell1  42  data  1  data
1  uniquedata1  uniquecell1  32  data  2  data
2  uniquedata1  uniquecell1  13  data  3  data
8  uniquedata1  uniquecell2  66  data  1  data
9  uniquedata1  uniquecell2  40  data  2  data
7  uniquedata1  uniquecell2  36  data  3  data
5  uniquedata2  uniquecell2  45  data  1  data
3  uniquedata2  uniquecell2  41  data  2  data
4  uniquedata2  uniquecell2  39  data  3  data
6  uniquedata2  uniquecell2  22  data  4  data
# sort ascending
sorted_df(df, ascending=True)
=>           0            1   2     3  4     5
2  uniquedata1  uniquecell1  13  data  1  data
1  uniquedata1  uniquecell1  32  data  2  data
0  uniquedata1  uniquecell1  42  data  3  data
7  uniquedata1  uniquecell2  36  data  1  data
9  uniquedata1  uniquecell2  40  data  2  data
8  uniquedata1  uniquecell2  66  data  3  data
6  uniquedata2  uniquecell2  22  data  1  data
4  uniquedata2  uniquecell2  39  data  2  data
3  uniquedata2  uniquecell2  41  data  3  data
5  uniquedata2  uniquecell2  45  data  4  data
# add some NA values
from numpy import nan
df[2] = df[2].astype(float)  # NaN needs a float column
df.loc[1,2] = nan
df.loc[4,2] = nan
df.loc[5,2] = nan
# sort ascending
sorted_df(df, ascending=True)
=>           0            1   2     3   4     5
2  uniquedata1  uniquecell1  13  data   1  data
0  uniquedata1  uniquecell1  42  data   2  data
1  uniquedata1  uniquecell1 NaN  data NaN  data
7  uniquedata1  uniquecell2  36  data   1  data
9  uniquedata1  uniquecell2  40  data   2  data
8  uniquedata1  uniquecell2  66  data   3  data
6  uniquedata2  uniquecell2  22  data   1  data
3  uniquedata2  uniquecell2  41  data   2  data
4  uniquedata2  uniquecell2 NaN  data NaN  data
5  uniquedata2  uniquecell2 NaN  data NaN  data

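I think the behavior I show here for handling NA values (ranking them as NA) is probably more appropriate than what you show in your hypothetical example, but you can use fillna to fill in the NA values within each group.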
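
One way to follow the fillna suggestion, sketched under assumptions: each blank is filled with its group's maximum plus one, so blanks rank last and tie with each other. The helper name `rank_with_blanks_last`, the `method="min"` tie rule, and the sample data (the question's third example with the rank column zeroed out) are illustrative choices rather than part of the answer. The sketch also assumes every group contains at least one number.

```python
import io

import pandas as pd

# Illustrative data: two groups from the question's third example, ranks set to 0.
CSV_TEXT = """uniquedata1,uniquecell1,42,data,0,data
uniquedata1,uniquecell1,,data,0,data
uniquedata1,uniquecell1,13,data,0,data
uniquedata2,uniquecell2,41,data,0,data
uniquedata2,uniquecell2,,data,0,data
uniquedata2,uniquecell2,,data,0,data
uniquedata2,uniquecell2,22,data,0,data
"""

def rank_with_blanks_last(df):
    """Fill blanks with (group max + 1), then rank ascending within each group."""
    keys = [df[0], df[1]]
    group_max = df[2].groupby(keys).transform("max")   # max ignores NaN
    filled = df[2].fillna(group_max + 1)               # blanks become the largest value
    ranks = filled.groupby(keys).rank(method="min")    # ties share the lowest rank
    out = df.copy()
    out[4] = ranks.astype(int)
    return out

df = pd.read_csv(io.StringIO(CSV_TEXT), header=None)
ranked = rank_with_blanks_last(df)
```

Because the ranks are assigned by index rather than by sorting, `ranked` keeps the original row order of the input.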
