Duplicating DataFrame rows based on the max count

dfaugment = dftrain.sort_values('text', ascending=False).groupby('Category')
countdict = dict(dfaugment['Category'].count())
countdictmax = max(countdict.values())
shortdict = {}
for key, value in countdict.items():
    if value <= countdictmax:
        shortdict[key] = countdictmax - value
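For reference, the counting step above can probably be written more compactly with value_counts. This is only a sketch: the dftrain below is a hypothetical stand-in built from the sample values used in the answers, and the name shortfall is mine, not the asker's.

```python
import pandas as pd

# Hypothetical stand-in for the asker's dftrain (values borrowed from the answers)
dftrain = pd.DataFrame({
    'Category': ['Shoes'] * 5 + ['Sticks'] * 2,
    'text': ['aasdb', 'frrrd', 'ertbt', 'erbete', 'ervsss', '14345', '33445'],
})

# Rows per category
counts = dftrain['Category'].value_counts()

# How many rows each category is short of the largest one.
# Like the loop in the question, the largest category is kept with a shortfall of 0.
shortfall = (counts.max() - counts).to_dict()
print(shortfall)
```

Here Sticks has 2 rows against 5 for Shoes, so it is 3 rows short.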

2 answers

User

#1 · edited 2024-10-03 17:19:52

You can use itertools.cycle and zip to get a repeating fill.

import pandas as pd
from itertools import cycle, zip_longest

df = pd.DataFrame(
    [('Shoes', "aasdb"),
     ('Shoes', "frrrd"),
     ('Shoes', "ertbt"),
     ('Shoes', "erbete"),
     ('Shoes', "ervsss"),
     ('Sticks', "14345"),
     ('Sticks', "33445")],
    columns=['Category', 'text']
)

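First we find the maximum group size, then build a list of tuples and pass it to the DataFrame constructor. zip stops at the shorter input, so each category gets exactly max_size rows while cycle keeps repeating that category's text values.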

max_size = df.groupby('Category').size().max()
pd.DataFrame(
    [(a, b) 
     for k in df.Category.drop_duplicates()
     for a, b in zip([k]*max_size, cycle(df.text[df.Category==k]))]
    , columns = df.columns
)

This outputs:

  Category    text
0    Shoes   aasdb
1    Shoes   frrrd
2    Shoes   ertbt
3    Shoes  erbete
4    Shoes  ervsss
5   Sticks   14345
6   Sticks   33445
7   Sticks   14345
8   Sticks   33445
9   Sticks   14345
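If you want this as a reusable function, here is a minimal sketch. The helper name pad_by_cycling and its parameters are made up for illustration, and it uses itertools.islice instead of zipping against a repeated key, which should give the same rows.

```python
from itertools import cycle, islice

import pandas as pd


def pad_by_cycling(frame, key, value):
    """Hypothetical helper: repeat each group's values until every group
    has as many rows as the largest group."""
    target = frame.groupby(key).size().max()
    rows = [
        (k, v)
        for k in frame[key].drop_duplicates()
        # islice takes exactly `target` items from the endless cycle
        for v in islice(cycle(frame.loc[frame[key] == k, value]), target)
    ]
    return pd.DataFrame(rows, columns=[key, value])


df = pd.DataFrame(
    [('Shoes', 'aasdb'), ('Shoes', 'frrrd'), ('Shoes', 'ertbt'),
     ('Shoes', 'erbete'), ('Shoes', 'ervsss'),
     ('Sticks', '14345'), ('Sticks', '33445')],
    columns=['Category', 'text'],
)
padded = pad_by_cycling(df, 'Category', 'text')
print(padded)
```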

Variant 1:

I'm thinking forwardfill is enough

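To forward fill, use itertools.zip_longest with the repeated Category values, pass text without cycle, and then call ffill. zip_longest pads the shorter text sequence with None, and ffill replaces each None with the last value seen.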

pd.DataFrame(
    [(a, b) 
     for k in df.Category.drop_duplicates()
     for a, b in zip_longest([k]*max_size, df.text[df.Category==k])]
    , columns = df.columns).ffill()

This outputs:

  Category    text
0    Shoes   aasdb
1    Shoes   frrrd
2    Shoes   ertbt
3    Shoes  erbete
4    Shoes  ervsss
5   Sticks   14345
6   Sticks   33445
7   Sticks   33445
8   Sticks   33445
9   Sticks   33445
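To see why the forward fill works, it may help to look at what zip_longest produces for a single short group. A small sketch using only the Sticks values from the example (the variable names are mine):

```python
from itertools import zip_longest

import pandas as pd

# Five category labels zipped against only two text values:
# zip_longest pads the missing text entries with None.
pairs = list(zip_longest(['Sticks'] * 5, ['14345', '33445']))
print(pairs)

# ffill then copies the last real value down into the padded rows.
filled = pd.DataFrame(pairs, columns=['Category', 'text']).ffill()
print(filled)
```

Note that ffill runs over the whole frame, so this relies on every group having at least one real text value in its first row.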

Variant 2:

randomise the sample selected for duplication

I'm not sure exactly what is meant here, but here is one way to get a random fill.

It starts out the same way as the forward fill:

df2 = pd.DataFrame(
    [(a, b) 
     for k in df.Category.drop_duplicates()
     for a, b in zip_longest([k]*max_size, df.text[df.Category==k])]
    , columns = df.columns
)

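Next, draw a sample of text of size max_size (with replacement) for each group and stack the samples. Then merge them in with combine_first, which only fills the positions where df2.text is missing.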

fill = pd.concat(
    [df.text[df.Category==k].sample(max_size, replace=True)
     for k in df.Category.drop_duplicates()]
).reset_index(drop=True)
df2.text = df2.text.combine_first(fill)

Sample df2 output (yours may differ, since I did not set a seed for the sample):

  Category    text
0    Shoes   aasdb
1    Shoes   frrrd
2    Shoes   ertbt
3    Shoes  erbete
4    Shoes  ervsss
5   Sticks   14345
6   Sticks   33445
7   Sticks   14345
8   Sticks   14345
9   Sticks   33445
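If you need the random fill to be reproducible, one option is to pass random_state to sample. The sketch below is a variation rather than the exact approach above: it samples only the missing rows per group and appends them, and the seed value 0 is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame(
    [('Shoes', 'aasdb'), ('Shoes', 'frrrd'), ('Shoes', 'ertbt'),
     ('Shoes', 'erbete'), ('Shoes', 'ervsss'),
     ('Sticks', '14345'), ('Sticks', '33445')],
    columns=['Category', 'text'],
)
max_size = df.groupby('Category').size().max()

parts = []
for _, grp in df.groupby('Category', sort=False):
    parts.append(grp)
    shortfall = max_size - len(grp)
    if shortfall > 0:
        # Draw the missing rows at random from the same group;
        # random_state=0 is an arbitrary seed so reruns give the same rows.
        parts.append(grp.sample(n=shortfall, replace=True, random_state=0))

balanced = pd.concat(parts, ignore_index=True)
print(balanced)
```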

User

#2 · edited 2024-10-03 17:19:52

You can try duplicating each group's rows until the group reaches the size of the largest group.

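import numpy as np

def DuplicateRows(x, group_max):
    # Number of copies of the group needed to reach at least group_max rows
    Count = int(np.ceil((group_max - len(x)) / len(x))) + 1
    # Stack the copies and trim to exactly group_max rows
    return pd.concat([x] * Count)[:group_max]

group_max = df.groupby('Category').size().max()
df.groupby('Category', group_keys=False).apply(lambda x: DuplicateRows(x, group_max))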

Output:

    Category    text
0   Shoes   "aasdb"
1   Shoes   "frrrd"
2   Shoes   "ertbt"
3   Shoes   "erbete"
4   Shoes   "ervsss"
5   Sticks  "14345"
6   Sticks  "33445"
5   Sticks  "14345"
6   Sticks  "33445"
5   Sticks  "14345"
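The repeated index labels (5, 6, 5, 6, 5) show that the rows are tiled in order. As a rough alternative sketch, the same tiling can be done with positional indexing and a modulo, without going through groupby.apply. The helper name tile_rows is hypothetical.

```python
import numpy as np
import pandas as pd


def tile_rows(group, target):
    """Hypothetical helper: repeat a group's rows in order until it has `target` rows."""
    # Positions 0, 1, ..., len-1, 0, 1, ... up to `target` entries
    positions = np.arange(target) % len(group)
    return group.iloc[positions]


df = pd.DataFrame(
    [('Shoes', 'aasdb'), ('Shoes', 'frrrd'), ('Shoes', 'ertbt'),
     ('Shoes', 'erbete'), ('Shoes', 'ervsss'),
     ('Sticks', '14345'), ('Sticks', '33445')],
    columns=['Category', 'text'],
)
group_max = df.groupby('Category').size().max()

# Iterating over the groupby yields (key, sub-frame) pairs
balanced = pd.concat(
    [tile_rows(g, group_max) for _, g in df.groupby('Category', sort=False)]
)
print(balanced)
```

Add ignore_index=True to the concat call if you would rather have a fresh 0..n-1 index.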
