How can I create groups with similar total sizes using pandas, numpy, or itertools?

Posted 2024-07-04 07:27:16


Given a dataframe with a column of file names and their file sizes, what is the best way to create n groups whose total file sizes are similar? From searching around, this sounds similar to the knapsack problem, except without a hard cutoff. Any quick solution that produces groups whose totals land close to the average (whether a bit below or above it) would be a huge improvement.

My first attempt (t1) builds the groups by counting off sequentially in a loop. The next attempt (t2) sorts the dataframe by size first, hoping to keep any single group from ending up with a cluster of large files, but it is otherwise the same approach as t1. There are typically around 300 files in total, so I'm not sure whether evaluating every possible combination is feasible, or whether there is a better way.

from itertools import repeat, chain
from math import ceil
import pandas as pd

source_dict = {'file_name_': {0: 'file_0', 1: 'file_1', 2: 'file_2', 3: 'file_3', 4: 'file_4', 5: 'file_5', 6: 'file_6', 7: 'file_7', 8: 'file_8', 9: 'file_9'
                              , 10: 'file_10', 11: 'file_11', 12: 'file_12', 13: 'file_13', 14: 'file_14', 15: 'file_15', 16: 'file_16', 17: 'file_17', 18: 'file_18'
                              , 19: 'file_19', 20: 'file_20', 21: 'file_21', 22: 'file_22', 23: 'file_23', 24: 'file_24', 25: 'file_25', 26: 'file_26', 27: 'file_27'
                              , 28: 'file_28', 29: 'file_29', 30: 'file_30', 31: 'file_31', 32: 'file_32', 33: 'file_33', 34: 'file_34', 35: 'file_35', 36: 'file_36'
                              , 37: 'file_37', 38: 'file_38', 39: 'file_39', 40: 'file_40', 41: 'file_41', 42: 'file_42', 44: 'file_44', 45: 'file_45', 46: 'file_46'
                              , 47: 'file_47', 48: 'file_48', 49: 'file_49', 50: 'file_50'}
               , 'file_size': {0: 3407245, 1: 3973920, 2: 7408640, 3: 4086426, 4: 12795600, 5: 2155039, 6: 9514856, 7: 13190235, 8: 32043703, 9: 4936240, 10: 9591964
                               , 11: 70153435, 12: 5106282, 13: 212414, 14: 24998146, 15: 11605646, 16: 2427516, 17: 23634036, 18: 169983, 19: 7011305, 20: 2106828
                               , 21: 3420304, 22: 11254, 23: 1271220, 24: 1164562, 25: 83613105, 26: 1030701, 27: 366948, 28: 7014895, 29: 8274642, 30: 2731629
                               , 31: 1596299, 32: 524, 33: 302, 34: 42332100, 35: 5441036, 36: 40633457, 37: 34680208, 38: 123505, 39: 15905009, 40: 52071678
                               , 41: 10624966, 42: 15425993, 44: 27673986, 45: 144988223, 46: 62619919, 47: 21562386, 48: 10620299, 49: 254661, 50: 232406680}}


sampleSizesDF = pd.DataFrame(source_dict)

desired_groups = 4 # multiprocessing.cpu_count()

group_size = ceil(sampleSizesDF.file_name_.count() / desired_groups) 

max_length = sampleSizesDF.file_name_.count() # upper bound for list

# trial 1, count off and group: repeat the labels 0..desired_groups-1 round-robin
my_groups = list(chain(*repeat(list(range(0, desired_groups)), group_size)))[:max_length]

sampleSizesDF['pGroup_t1'] = my_groups

# trial 2, sort by size first, then count off the same way as trial 1
sampleSizesDF.sort_values('file_size', inplace=True)

sampleSizesDF['pGroup_t2'] = my_groups


# per-group totals for each trial
pGroupDistDF = pd.concat([
                              sampleSizesDF.groupby('pGroup_t1').agg({'file_size': 'sum'})
                            , sampleSizesDF.groupby('pGroup_t2').agg({'file_size': 'sum'})
                         ]
                         , axis=1)

pGroupDistDF.columns = ['t1', 't2']

# each trial's share of the grand total, next to the raw sums
pGroupDistDF['t1_dist'] = pGroupDistDF['t1'] / pGroupDistDF['t1'].sum()
pGroupDistDF['t2_dist'] = pGroupDistDF['t2'] / pGroupDistDF['t2'].sum()

presentation_order = ['t1', 't1_dist', 't2', 't2_dist']

pGroupDistDF[presentation_order]


    t1          t1_dist     t2          t2_dist
0   304015174   0.281916    291719748   0.270514
1   470551775   0.436347    396619142   0.367788
2   134901157   0.125095    183490246   0.170152
3   168921844   0.156643    206560814   0.191546
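
For reference, the kind of quick heuristic I'm imagining is sketched below: sort the files largest first and always hand the next file to the group with the smallest running total (a longest-processing-time style greedy). The names greedy_groups and pGroup_t3 are just illustrative, not part of the trials above.

from heapq import heapify, heappop, heappush

def greedy_groups(df, n_groups, size_col='file_size'):
    """Assign each row to whichever group currently has the smallest total."""
    heap = [(0, g) for g in range(n_groups)]  # (running_total, group_id)
    heapify(heap)
    assignment = {}
    # largest files first so the small ones can even things out at the end
    for idx, size in df[size_col].sort_values(ascending=False).items():
        total, group = heappop(heap)
        assignment[idx] = group
        heappush(heap, (total + size, group))
    return pd.Series(assignment)  # aligned to df's index on assignment

sampleSizesDF['pGroup_t3'] = greedy_groups(sampleSizesDF, desired_groups)
sampleSizesDF.groupby('pGroup_t3')['file_size'].sum()

This should spread the few very large files across different groups, which is what t2 was trying (and failing) to achieve with a single sort, but I don't know if there is a more standard pandas/numpy way to do it.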

