给定属于多个组的行,将每一行分配给一个组,其中每个组都有一个限制大小,最大化分组行

2024-10-01 19:25:48 发布

您现在位置:Python中文网/ 问答频道 /正文

我正在寻找这个问题的通用psuedo代码numpypandassql解决方案

问题陈述: 您有一个具有组列的唯一项表,该列可以属于一个或多个组。将每行仅分配给一个组,同时尝试最大化所有组的总大小。但每组不能超过规定的尺寸。表中的总行数可能大于所有组大小限制的总和,因此需要丢弃某些行

Example Input:
ID   Groups
(4, [s1, s2])
(5, [s1, s2])
(6, [s1])
(15, [s1])
(7, [s2])
(8, [s3])
(10, [s3])
(12, [s3])
(13, [s3])

Desired Output Example
Group 1 (size_limit=3): 5, 6, 15
Group 2 (size_limit=2): 4, 7
Group 3 (size_limit=2): 10, 8


Bad Output
Group 1 (size_limit=3): 4, 5, 6
Group 2 (size_limit=2): 7
Group 3 (size_limit=2): 10, 8

Tags: 代码numpypandasoutputsqlsizes3example
2条回答

我的想法是将每一行放在每个组中,找到最好的组来制作混凝土,从其他组中移除其最好的成员,移除超出限制的多余成员,然后重复

我认为最好的具体化组是那些需要达到限制的模糊行最少的组,即可能属于其他组的行

在伪代码中:

While there are ambiguous rows:
    Of groups with ambiguous rows, pik group with least ambiguous rows under limit
    Sort rows in group by number of groups they belong to
    Divide list into keepers and excess, as determined by limit
    Remove groups from keeper rows and keeper rows from groups until all keepers are non-ambiguous.
    For excess set, remove current group from row. Remove excess set from group.
Chop off any remaining excess non-ambiguous rows

在Python中:

import pandas as pd
from typing import List

class Group:
    def __init__(self, id: str, limit: int):
        self.id = id
        self.limit = limit
        self.rows = set()

    def add_row(self, row):
        self.rows.add(row)

    def ambiguous_count(self):
        return len([row for row in self.rows if row.is_ambiguous()])

    def excess_count(self):
        return len(self.rows) - self.limit

    def ambiguous_under_limit_count(self):
        rows_list = sorted(self.rows, key=lambda row: row.groups_count())
        keepers = rows_list[:self.limit]
        return(len([k for k in keepers if k.is_ambiguous]))

    def trim(self):
        rows_list = sorted(self.rows, key=lambda row: row.groups_count())
        keepers = rows_list[:self.limit]
        for row in keepers:
            if row.is_ambiguous():
                row.make_non_ambiguous(self)
        excess = rows_list[self.limit:]
        for row in excess:
            self.remove_row(row)
            row.remove_group(self)
        self.rows = set(keepers)

    def chop(self):
        self.rows = set(list(self.rows)[:self.limit])

    def remove_row(self, row):
        self.rows.remove(row)


class GroupCollection:
    def __init__(self, groups, data: pd.DataFrame):
        self.groups = list(groups)
        self.groups_index = {group.id: group for group in groups}
        self.rows = set([Row(self, *row) for row in data.to_records(index=False)])

    def has_ambiguous_rows(self):
        for row in self.rows:
            if row.is_ambiguous():
                return True
        return False

    def group_to_trim(self):
        groups = [group for group in self.groups if group.ambiguous_count() > 0]
        if groups:
            return min(groups, key=lambda g: g.ambiguous_under_limit_count())
        else:
            return None

    def chop_all(self):
        for group in self.groups:
            group.chop()

    def optimize(self):
        while self.has_ambiguous_rows():
            if (trim_group := self.group_to_trim()):
                trim_group.trim()
        self.chop_all()

    def __str__(self):
        string = ""
        for group in self.groups:
            string = string + f"Group {group.id} (size_limit={group.limit}) " + ", ".join(str(row.id) for row in group.rows) + "\n"
        return string

class Row:
    def __init__(self, collection: GroupCollection, id: int, group_strings: List[str]):
        self.id = id
        self.collection = collection
        self.belongs_to = {collection.groups_index[key[1:]] for key in group_strings}
        for group in self.belongs_to:
            group.add_row(self)

    def is_ambiguous(self):
        return len(self.belongs_to) > 1

    def groups_count(self):
        return len(self.belongs_to)

    def make_non_ambiguous(self, group):
        remove_from = self.belongs_to - {group}
        for removal_group in remove_from:
            removal_group.remove_row(self)
        self.belongs_to = {group}

    def remove_group(self, group):
        self.belongs_to.remove(group)


data = pd.DataFrame({'ID': [4, 5, 6, 15, 7, 8, 10, 12, 13], 'Groups': [['s1', 's2'], ['s1', 's2'], ['s1'], ['s1'], ['s2'], ['s3'], ['s3'], ['s3'], ['s3']]})
all_groups = [Group('1', 3), Group('2', 2), Group('3', 2)]

gc = GroupCollection(all_groups, data)
gc.optimize()
print(gc)

data = pd.DataFrame({'ID': [1, 2], 'Groups': [['s1', 's2'], ['s1', 's2']]})
all_groups = [Group('1', 2), Group('2', 2)]

gc = GroupCollection(all_groups, data)
gc.optimize()
print(gc)

输出:

Group 1 (size_limit=3) 4, 15, 6
Group 2 (size_limit=2) 5, 7
Group 3 (size_limit=2) 8, 12

Group 1 (size_limit=2) 1, 2
Group 2 (size_limit=2) 

这可以更好地优化,可能有一些我没有想到的边缘情况

以下是我对解决方案的想法,但不确定它是否涵盖所有情况

SOLUTION

FOR EACH group: sort by samples belonging to the most groups. 
WHILE len(samples for group) > limit): remove group from groups
WHILE samples for group contains samples with more than 1 group: Remove other groups from sample

EXAMPLE INPUT 1:
Group 1 (size_limit=3): 4, 5, 6, 15
Group 2 (size_limit=2): 4, 5, 7
Group 3 (size_limit=2): 8, 10, 12, 13

or

ID  Groups
(4,  [s1, s2])
(5,  [s1, s2])
(6,  [s1])
(15, [s1])
(7,  [s2])
(8,  [s3])
(10, [s3])
(12, [s3])
(13, [s3])

Bad Output Example
Group 1 (size_limit=3): 4, 5, 6
Group 2 (size_limit=2): 7
Group 3 (size_limit=2): 8, 10

Possible Desired Output
Group 1 (size_limit=3): 5, 6, 15
Group 2 (size_limit=2): 4, 7
Group 3 (size_limit=2): 8, 10



For Group 1 (s1)
Group 1 (size_limit=3): 5, 6, 15
Group 2 (size_limit=2): 4, 5, 7
Group 3 (size_limit=2): 8, 10, 12, 13

or

ID  Groups
(4,  [s2])
(5,  [s1, s2])
(6,  [s1])
(15, [s1])
(7,  [s2])
(8,  [s3])
(10, [s3])
(12, [s3])
(13, [s3])

Then remove other groups for group (s1)
Group 1 (size_limit=3): 5, 6, 15
Group 2 (size_limit=2): 4, 7
Group 3 (size_limit=2): 8, 10, 12, 13

or

ID  Groups
(4,  [s2])
(5,  [s1])
(6,  [s1])
(15, [s1])
(7,  [s2])
(8,  [s3])
(10, [s3])
(12, [s3])
(13, [s3])

For group (s2), skip already at limit

For group (s3)
Group 1 (size_limit=3): 5, 6, 15
Group 2 (size_limit=2): 4, 7
Group 3 (size_limit=2): 8, 10

or

ID  Groups
(4,  [s2])
(5,  [s1])
(6,  [s1])
(15, [s1])
(7,  [s2])
(8,  [s3])
(10, [s3])











EXAMPLE 2:
Group 1 (size_limit=3): 4, 5, 6, 7, 8, 9
Group 2 (size_limit=2): 4, 5, 10, 11
Group 3 (size_limit=2): 4, 5, 6

or

ID  Groups
(4,  [s1, s2, s3])
(5,  [s1, s2, s3])
(6,  [s1, s3])
(7,  [s1])
(8,  [s1])
(9,  [s1])
(10, [s2])
(11, [s2])

Group = Group 1 (s1)
Group 1 (size_limit=3): 7, 8, 9
Group 2 (size_limit=2): 4, 5, 10, 11
Group 3 (size_limit=2): 4, 5, 6

or

(4,  [s2, s3])
(5,  [s2, s3])
(6,  [s3])
(7,  [s1])
(8,  [s1])
(9,  [s1])
(10, [s2])
(11, [s2])

Group = Group 2 (s2)
Group 1 (size_limit=3): 7, 8, 9
Group 2 (size_limit=2): 10, 11
Group 3 (size_limit=2): 4, 5, 6

or

(4,  [s3])
(5,  [s3])
(6,  [s3])
(7,  [s1])
(8,  [s1])
(9,  [s1])
(10, [s2])
(11, [s2])

Group = Group 3 (s3)
Group 1 (size_limit=3): 7, 8, 9
Group 2 (size_limit=2): 10, 11
Group 3 (size_limit=2): 5, 6

or

(5,  [s3])
(6,  [s3])
(7,  [s1])
(8,  [s1])
(9,  [s1])
(10, [s2])
(11, [s2])


EXAMPLE 3:
Group = Group 1 (s1)
Group 1 (size_limit=3): 5, 6, 7, 8
Group 2 (size_limit=2): 4, 5, 10, 11
Group 3 (size_limit=2): 4, 5, 6

or

(4,  [s2, s3])
(5,  [s1, s2, s3])
(6,  [s1, s3])
(7,  [s1])
(8,  [s1])
(10, [s2])
(11, [s2])

FOR EACH GROUP


Group = Group 1 (s1)
Group 1 (size_limit=3): 6, 7, 8
Group 2 (size_limit=2): 4, 5, 10, 11
Group 3 (size_limit=2): 4, 5

or

(4,  [s2, s3])
(5,  [s2, s3])
(6,  [s1])
(7,  [s1])
(8,  [s1])
(10, [s2])
(11, [s2])

Group = Group 2 (s2)
Group 1 (size_limit=3): 6, 7, 8
Group 2 (size_limit=2): 10, 11
Group 3 (size_limit=2): 4, 5

or

(4,  [s2])
(5,  [s2])
(6,  [s1])
(7,  [s1])
(8,  [s1])
(10, [s2])
(11, [s2])

相关问题 更多 >

    热门问题