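How do I select the row with the longest sentence in a given column for each group in Python, and combine those rows into a new DataFrame?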

Posted 2024-09-29 17:19:45

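The dataset I am working with looks like the one below. It is a video captioning dataset, with the captions stored in the "Description" column.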

Video_ID       Description
mv89psg6zh4    A bird is bathing in a sink.
mv89psg6zh4    A faucet is running while a bird stands and is taking bath under it.
mv89psg6zh4    A bird gets washed.
mv89psg6zh4    A parakeet is taking a shower in a sink.
mv89psg6zh4    The bird is taking a bath under the faucet.
mv89psg6zh4    A bird is standing in a sink drinking water.
l7x8uIdg2XU    A woman is pouring ingredients into a bowl and then eating it.
l7x8uIdg2XU    A woman is adding milk to some pasta.
l7x8uIdg2XU    A person adds ingredients to pasta. 
l7x8uIdg2XU    the girls are doing the cooking.

However, the number of captions differs from video to video.

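For each unique Video_ID, I want to extract the row with the longest "Description" (that is, the highest word count) and combine those rows into a new DataFrame.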

The DataFrame I want should look like this:

Video_ID       Description
mv89psg6zh4    A faucet is running while a bird stands and is taking bath under it.
l7x8uIdg2XU    A woman is pouring ingredients into a bowl and then eating it.

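So, essentially, rows are taken out of the existing DataFrame to form a new one that holds the longest sentence for each video in the original dataset.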

I tried the following code:

s = df.index.to_series().groupby(df['Video_ID']).apply(lambda x: len(x['Description']).max())

But this does not seem to work. Can you suggest the correct approach?


1 answer

Answer 1 · Posted 2024-09-29 17:19:45

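Use Series.str.len to get the string lengths, then use idxmax on the lengths grouped by Video_ID to get the index label of the maximum in each group, and finally select those rows with DataFrame.loc: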

df1 = df.loc[df['Description'].str.len().groupby(df['Video_ID'], sort=False).idxmax()]
print (df1)
      Video_ID                                        Description
1  mv89psg6zh4  A faucet is running while a bird stands and is...
6  l7x8uIdg2XU  A woman is pouring ingredients into a bowl and...
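
For a version that can be run end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same idxmax approach. The names lengths and longest_idx and the final reset_index call are illustrative additions, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Video_ID": ["mv89psg6zh4"] * 6 + ["l7x8uIdg2XU"] * 4,
    "Description": [
        "A bird is bathing in a sink.",
        "A faucet is running while a bird stands and is taking bath under it.",
        "A bird gets washed.",
        "A parakeet is taking a shower in a sink.",
        "The bird is taking a bath under the faucet.",
        "A bird is standing in a sink drinking water.",
        "A woman is pouring ingredients into a bowl and then eating it.",
        "A woman is adding milk to some pasta.",
        "A person adds ingredients to pasta.",
        "the girls are doing the cooking.",
    ],
})

# Character length of every caption.
lengths = df["Description"].str.len()

# Index label of the longest caption within each video
# (sort=False keeps the videos in order of first appearance).
longest_idx = lengths.groupby(df["Video_ID"], sort=False).idxmax()

# Pull those rows out and renumber them from 0 (assumed to be the desired shape).
df1 = df.loc[longest_idx].reset_index(drop=True)
print(df1)
```

Dropping reset_index(drop=True) keeps the original row labels, which is what the later isin-based filtering relies on.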

Details:

print (df['Description'].str.len())
0    28
1    68
2    19
3    40
4    43
5    44
6    62
7    37
8    35
9    32
Name: Description, dtype: int64

print (df['Description'].str.len().groupby(df['Video_ID'], sort=False).idxmax())
Video_ID
mv89psg6zh4    1
l7x8uIdg2XU    6
Name: Description, dtype: int64
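
The question defines "longest" as the highest word count, while str.len counts characters. The two agree on the sample data, but if word count is what matters, one possible variation is to split on whitespace first. The sketch below uses four of the question's captions; word_counts and longest_by_words are illustrative names, and splitting on whitespace is an assumption about what counts as a word.

```python
import pandas as pd

# A subset of the captions from the question, two per video.
df = pd.DataFrame({
    "Video_ID": ["mv89psg6zh4", "mv89psg6zh4", "l7x8uIdg2XU", "l7x8uIdg2XU"],
    "Description": [
        "A bird is bathing in a sink.",
        "A faucet is running while a bird stands and is taking bath under it.",
        "A woman is pouring ingredients into a bowl and then eating it.",
        "A woman is adding milk to some pasta.",
    ],
})

# str.split() breaks each caption on whitespace into a list of words;
# the following str.len() then returns the length of each list.
word_counts = df["Description"].str.split().str.len()

# Same idxmax selection as before, driven by word counts instead of characters.
longest_by_words = df.loc[word_counts.groupby(df["Video_ID"], sort=False).idxmax()]
print(longest_by_words)
```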

To get the rows that were not selected, use Index.isin with an inverted mask (~) in boolean indexing:

df2 = df[~df.index.isin(df1.index)]
print (df2)
      Video_ID                                   Description
0  mv89psg6zh4                  A bird is bathing in a sink.
2  mv89psg6zh4                           A bird gets washed.
3  mv89psg6zh4      A parakeet is taking a shower in a sink.
4  mv89psg6zh4   The bird is taking a bath under the faucet.
5  mv89psg6zh4  A bird is standing in a sink drinking water.
7  l7x8uIdg2XU         A woman is adding milk to some pasta.
8  l7x8uIdg2XU           A person adds ingredients to pasta.
9  l7x8uIdg2XU              the girls are doing the cooking.

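Edit: The solution above returns only one row per group, even when several rows share the maximum length. (It gives the same result here because each group in the sample data has exactly one row of maximum length.)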

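If you need every row that reaches the per-group maximum, compare each length with the group maximum obtained from GroupBy.transform: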

s = df['Description'].str.len()
mask = s.groupby(df['Video_ID'], sort=False).transform('max').eq(s)
df1 = df[mask]
print (df1)
      Video_ID                                        Description
1  mv89psg6zh4  A faucet is running while a bird stands and is...
6  l7x8uIdg2XU  A woman is pouring ingredients into a bowl and...

df2 = df[~mask]
print (df2)
      Video_ID                                   Description
0  mv89psg6zh4                  A bird is bathing in a sink.
2  mv89psg6zh4                           A bird gets washed.
3  mv89psg6zh4      A parakeet is taking a shower in a sink.
4  mv89psg6zh4   The bird is taking a bath under the faucet.
5  mv89psg6zh4  A bird is standing in a sink drinking water.
7  l7x8uIdg2XU         A woman is adding milk to some pasta.
8  l7x8uIdg2XU           A person adds ingredients to pasta.
9  l7x8uIdg2XU              the girls are doing the cooking.

Details:

print (s.groupby(df['Video_ID'], sort=False).transform('max'))
0    68
1    68
2    68
3    68
4    68
5    68
6    62
7    62
8    62
9    62
Name: Description, dtype: int64
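
Because the sample data has no ties, the two approaches look identical on it. The following sketch uses a small made-up frame (the IDs "vid_a" and "vid_b" and their captions are hypothetical, chosen only to force a tie) to show where they differ.

```python
import pandas as pd

# Hypothetical data: "vid_a" has two captions tied at 3 characters.
ties = pd.DataFrame({
    "Video_ID": ["vid_a", "vid_a", "vid_a", "vid_b", "vid_b"],
    "Description": ["abc", "xyz", "a", "hello", "hi"],
})

s = ties["Description"].str.len()
groups = ties["Video_ID"]

# idxmax keeps only the first row that reaches the maximum in each group.
first_only = ties.loc[s.groupby(groups, sort=False).idxmax()]

# transform('max') broadcasts the group maximum to every row, so the
# comparison keeps all rows tied for the longest caption.
all_ties = ties[s.groupby(groups, sort=False).transform("max").eq(s)]

print(first_only)
print(all_ties)
```

Here first_only contains one row per video, while all_ties contains both tied captions of "vid_a" plus the single longest caption of "vid_b".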
