<p>My idea is to put each row into every group it can belong to, find the best group to make concrete, remove that group's chosen members from the other groups, remove any members beyond the limit, and repeat.</p>
<p>I think the best groups to make concrete are the ones that need the fewest ambiguous rows to reach their limit, i.e. the fewest rows that could also belong to other groups.</p>
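<p>That selection heuristic can be sketched on plain dicts (hypothetical, simplified data, with group labels used directly as keys): sort each group's rows by how many groups they belong to, keep the first <code>limit</code> of them, and count how many of those keepers are still ambiguous; the group with the lowest count is made concrete first.</p>

```python
# Hypothetical membership data: row id -> set of group labels it belongs to.
membership = {
    4: {'s1', 's2'}, 5: {'s1', 's2'},   # ambiguous rows
    6: {'s1'}, 15: {'s1'},              # already unambiguous
}
limits = {'s1': 3, 's2': 2}             # per-group size limits

def ambiguous_under_limit(group):
    # Rows in this group, preferring rows that belong to the fewest groups.
    rows = [r for r, gs in membership.items() if group in gs]
    rows.sort(key=lambda r: len(membership[r]))
    keepers = rows[:limits[group]]
    # How many of the rows we would keep are still ambiguous?
    return sum(len(membership[r]) > 1 for r in keepers)

print({g: ambiguous_under_limit(g) for g in limits})  # s1 scores 1, s2 scores 2
best = min(limits, key=ambiguous_under_limit)
print(best)  # 's1' needs the fewest ambiguous rows to reach its limit
```

<p>Here 's1' wins because two of its slots are filled by rows that belong only to it, so making it concrete steals the fewest rows from other groups.</p>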
<p>In pseudocode:</p>
<pre><code>While there are ambiguous rows:
    Of the groups with ambiguous rows, pick the group with the fewest ambiguous rows under its limit
    Sort the rows in that group by the number of groups they belong to
    Divide the list into keepers and excess, as determined by the limit
    Remove groups from keeper rows and keeper rows from groups until all keepers are non-ambiguous
    For the excess set, remove the current group from each row, and remove the excess set from the group
Chop off any remaining excess non-ambiguous rows
</code></pre>
<p>In Python:</p>
<pre><code>import pandas as pd
from typing import List


class Group:
    def __init__(self, id: str, limit: int):
        self.id = id
        self.limit = limit
        self.rows = set()

    def add_row(self, row):
        self.rows.add(row)

    def ambiguous_count(self):
        return len([row for row in self.rows if row.is_ambiguous()])

    def excess_count(self):
        return len(self.rows) - self.limit

    def ambiguous_under_limit_count(self):
        # How many of the rows we would keep (the first `limit`, preferring
        # rows that belong to the fewest groups) are still ambiguous.
        rows_list = sorted(self.rows, key=lambda row: row.groups_count())
        keepers = rows_list[:self.limit]
        return len([k for k in keepers if k.is_ambiguous()])

    def trim(self):
        rows_list = sorted(self.rows, key=lambda row: row.groups_count())
        keepers = rows_list[:self.limit]
        for row in keepers:
            if row.is_ambiguous():
                row.make_non_ambiguous(self)
        excess = rows_list[self.limit:]
        for row in excess:
            self.remove_row(row)
            row.remove_group(self)
        self.rows = set(keepers)

    def chop(self):
        self.rows = set(list(self.rows)[:self.limit])

    def remove_row(self, row):
        self.rows.remove(row)


class GroupCollection:
    def __init__(self, groups, data: pd.DataFrame):
        self.groups = list(groups)
        self.groups_index = {group.id: group for group in groups}
        self.rows = {Row(self, *row) for row in data.to_records(index=False)}

    def has_ambiguous_rows(self):
        return any(row.is_ambiguous() for row in self.rows)

    def group_to_trim(self):
        groups = [group for group in self.groups if group.ambiguous_count() > 0]
        if groups:
            return min(groups, key=lambda g: g.ambiguous_under_limit_count())
        return None

    def chop_all(self):
        for group in self.groups:
            group.chop()

    def optimize(self):
        while self.has_ambiguous_rows():
            if (trim_group := self.group_to_trim()):
                trim_group.trim()
            else:
                break  # safety valve: no trimmable group left
        self.chop_all()

    def __str__(self):
        string = ""
        for group in self.groups:
            string += (f"Group {group.id} (size_limit={group.limit}) "
                       + ", ".join(str(row.id) for row in group.rows) + "\n")
        return string


class Row:
    def __init__(self, collection: GroupCollection, id: int, group_strings: List[str]):
        self.id = id
        self.collection = collection
        # Strip the 's' prefix: label 's1' maps to the Group with id '1'.
        self.belongs_to = {collection.groups_index[key[1:]] for key in group_strings}
        for group in self.belongs_to:
            group.add_row(self)

    def is_ambiguous(self):
        return len(self.belongs_to) > 1

    def groups_count(self):
        return len(self.belongs_to)

    def make_non_ambiguous(self, group):
        # Keep only `group`; detach this row from every other group.
        remove_from = self.belongs_to - {group}
        for removal_group in remove_from:
            removal_group.remove_row(self)
        self.belongs_to = {group}

    def remove_group(self, group):
        self.belongs_to.remove(group)


data = pd.DataFrame({'ID': [4, 5, 6, 15, 7, 8, 10, 12, 13],
                     'Groups': [['s1', 's2'], ['s1', 's2'], ['s1'], ['s1'],
                                ['s2'], ['s3'], ['s3'], ['s3'], ['s3']]})
all_groups = [Group('1', 3), Group('2', 2), Group('3', 2)]
gc = GroupCollection(all_groups, data)
gc.optimize()
print(gc)

data = pd.DataFrame({'ID': [1, 2], 'Groups': [['s1', 's2'], ['s1', 's2']]})
all_groups = [Group('1', 2), Group('2', 2)]
gc = GroupCollection(all_groups, data)
gc.optimize()
print(gc)
</code></pre>
<p>Output:</p>
<pre><code>Group 1 (size_limit=3) 4, 15, 6
Group 2 (size_limit=2) 5, 7
Group 3 (size_limit=2) 8, 12
Group 1 (size_limit=2) 1, 2
Group 2 (size_limit=2)
</code></pre>
<p>This could no doubt be optimized further, and there are probably edge cases I haven't thought of.</p>