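How can I select, for each group, the row with the longest sentence in a specific column and combine those rows into a new DataFrame in Python?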

Video_ID     Description
mv89psg6zh4  A bird is bathing in a sink.
mv89psg6zh4  A faucet is running while a bird stands and is taking bath under it.
mv89psg6zh4  A bird gets washed.
mv89psg6zh4  A parakeet is taking a shower in a sink.
mv89psg6zh4  The bird is taking a bath under the faucet.
mv89psg6zh4  A bird is standing in a sink drinking water.
l7x8uIdg2XU  A woman is pouring ingredients into a bowl and then eating it.
l7x8uIdg2XU  A woman is adding milk to some pasta.
l7x8uIdg2XU  A person adds ingredients to pasta.
l7x8uIdg2XU  the girls are doing the cooking.
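For anyone who wants to reproduce the question's data, here is a minimal sketch that builds it as a DataFrame. The variable name df is an assumption chosen to match the name used in the answer; the values are copied from the sample above.

```python
import pandas as pd

# Sample data copied from the question; `df` is the name the answer uses.
df = pd.DataFrame({
    "Video_ID": ["mv89psg6zh4"] * 6 + ["l7x8uIdg2XU"] * 4,
    "Description": [
        "A bird is bathing in a sink.",
        "A faucet is running while a bird stands and is taking bath under it.",
        "A bird gets washed.",
        "A parakeet is taking a shower in a sink.",
        "The bird is taking a bath under the faucet.",
        "A bird is standing in a sink drinking water.",
        "A woman is pouring ingredients into a bowl and then eating it.",
        "A woman is adding milk to some pasta.",
        "A person adds ingredients to pasta.",
        "the girls are doing the cooking.",
    ],
})

# Ten rows, two columns, default integer index 0..9
print(df.shape)
```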

1 Answer

User

#1 · Posted 2024-09-29 17:19:45

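Use Series.str.len to get the length of each description, then GroupBy.idxmax to get the index label of the maximum in each group, and finally select those rows with DataFrame.loc: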

# index label of the longest Description per Video_ID, then select those rows
df1 = df.loc[df['Description'].str.len().groupby(df['Video_ID'], sort=False).idxmax()]
print(df1)
      Video_ID                                        Description
1  mv89psg6zh4  A faucet is running while a bird stands and is...
6  l7x8uIdg2XU  A woman is pouring ingredients into a bowl and...
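The same steps can be wrapped in a small reusable function. This is only a sketch: the function name longest_per_group, its parameters, and the demo data are hypothetical, and it assumes a unique index and no missing values in the text column.

```python
import pandas as pd

def longest_per_group(df, group_col, text_col):
    """Return one row per group: the row whose text_col string is longest.

    Hypothetical helper; assumes a unique index and no NaN in text_col.
    """
    lengths = df[text_col].str.len()
    # idxmax yields the index label of the first maximum within each group
    idx = lengths.groupby(df[group_col], sort=False).idxmax()
    return df.loc[idx]

# Made-up demo data, not taken from the question
demo = pd.DataFrame({
    "Video_ID": ["a", "a", "b", "b"],
    "Description": ["short", "a much longer sentence", "tiny", "medium one"],
})
result = longest_per_group(demo, "Video_ID", "Description")
print(result)
```

Passing sort=False keeps the groups in their order of first appearance, as in the one-liner above.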

Details:

print(df['Description'].str.len())
0    28
1    68
2    19
3    40
4    43
5    44
6    62
7    37
8    35
9    32
Name: Description, dtype: int64

print(df['Description'].str.len().groupby(df['Video_ID'], sort=False).idxmax())
Video_ID
mv89psg6zh4    1
l7x8uIdg2XU    6
Name: Description, dtype: int64

To get the rows that were not selected, build a mask with Index.isin, invert it with ~, and filter by boolean indexing:

df2 = df[~df.index.isin(df1.index)]
print(df2)
      Video_ID                                   Description
0  mv89psg6zh4                  A bird is bathing in a sink.
2  mv89psg6zh4                           A bird gets washed.
3  mv89psg6zh4      A parakeet is taking a shower in a sink.
4  mv89psg6zh4   The bird is taking a bath under the faucet.
5  mv89psg6zh4  A bird is standing in a sink drinking water.
7  l7x8uIdg2XU         A woman is adding milk to some pasta.
8  l7x8uIdg2XU           A person adds ingredients to pasta.
9  l7x8uIdg2XU              the girls are doing the cooking.
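As an aside, when the index is unique the same remainder can also be obtained with DataFrame.drop. The sketch below uses made-up data and the hypothetical names df2_mask and df2_drop to compare the two variants; it is an alternative, not the answer's own method.

```python
import pandas as pd

# Made-up demo data, not taken from the question
df = pd.DataFrame({
    "Video_ID": ["a", "a", "b", "b"],
    "Description": ["short", "a much longer sentence", "tiny", "medium one"],
})
df1 = df.loc[df["Description"].str.len().groupby(df["Video_ID"], sort=False).idxmax()]

# Variant from the answer: inverted boolean mask built with Index.isin
df2_mask = df[~df.index.isin(df1.index)]
# Alternative: drop the selected index labels directly
df2_drop = df.drop(df1.index)

print(df2_drop)
print(df2_mask.equals(df2_drop))
```

With duplicate index labels the two variants would both remove every row sharing a selected label, so a unique index is the safer assumption in either case.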

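Edit: The solution above returns only one row per group, even when several rows share the maximum length. (Both approaches give the same result here, because in the sample data each group has exactly one row with the maximum length.)

If you need every row that reaches the maximum in its group, use GroupBy.transform with 'max' to get the maximal length per group and compare it with each row's length: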

s = df['Description'].str.len()
mask = s.groupby(df['Video_ID'], sort=False).transform('max').eq(s)
df1 = df[mask]
print(df1)
      Video_ID                                        Description
1  mv89psg6zh4  A faucet is running while a bird stands and is...
6  l7x8uIdg2XU  A woman is pouring ingredients into a bowl and...

df2 = df[~mask]
print(df2)
      Video_ID                                   Description
0  mv89psg6zh4                  A bird is bathing in a sink.
2  mv89psg6zh4                           A bird gets washed.
3  mv89psg6zh4      A parakeet is taking a shower in a sink.
4  mv89psg6zh4   The bird is taking a bath under the faucet.
5  mv89psg6zh4  A bird is standing in a sink drinking water.
7  l7x8uIdg2XU         A woman is adding milk to some pasta.
8  l7x8uIdg2XU           A person adds ingredients to pasta.
9  l7x8uIdg2XU              the girls are doing the cooking.

Details:

print(s.groupby(df['Video_ID'], sort=False).transform('max'))
0    68
1    68
2    68
3    68
4    68
5    68
6    62
7    62
8    62
9    62
Name: Description, dtype: int64
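The difference between the two approaches only shows up when there are ties. Here is a small sketch with made-up data in which group a contains two descriptions of equal maximal length; the names first_only and all_ties are hypothetical.

```python
import pandas as pd

# Made-up data: rows 0 and 1 tie for the longest Description in group "a"
df = pd.DataFrame({
    "Video_ID": ["a", "a", "a", "b"],
    "Description": ["12345", "abcde", "xy", "hello"],
})
s = df["Description"].str.len()

# idxmax keeps only the first row that reaches the maximum in each group
first_only = df.loc[s.groupby(df["Video_ID"], sort=False).idxmax()]

# transform('max') broadcasts the group maximum to every row,
# so comparing with s keeps all rows that reach it
mask = s.groupby(df["Video_ID"], sort=False).transform("max").eq(s)
all_ties = df[mask]

print(first_only.index.tolist())  # [0, 3]
print(all_ties.index.tolist())    # [0, 1, 3]
```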
