How to groupby and get the most frequent words and bigrams for each group in Pandas

words:                             other:  category:
hello, jim, you, you , jim         val1    movie
it, seems, bye, limb, pat, paddy   val2    movie
how, are, you, are , kim           val1    television
......                             ......

1 Answer

User

#1 · Posted 2024-09-27 19:35:33

Sample DataFrame:

                                   words other    category
0             hello, jim, you, you , jim  val1       movie
1  it, seems, bye, limb, pat, hello, jim  val2       movie
2               how, are, you, are , kim  val1  television
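To run the snippets in this answer you need the frame itself, not just its printed form. A minimal sketch that rebuilds it, assuming the strings are exactly as printed above (including the stray space before some commas):

```python
import pandas as pd

# Rebuild the sample frame shown above; the strings are copied verbatim,
# including the irregular spacing around some commas.
df = pd.DataFrame({
    "words": [
        "hello, jim, you, you , jim",
        "it, seems, bye, limb, pat, hello, jim",
        "how, are, you, are , kim",
    ],
    "other": ["val1", "val2", "val1"],
    "category": ["movie", "movie", "television"],
})

print(df)
```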

Here is one way to compute the bigrams using Pandas and .iterrows():

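A minimal sketch of what an .iterrows() version could look like. This is a reconstruction that mirrors the .apply version further down, not necessarily the code originally posted; the sample frame is rebuilt inline so the block runs on its own.

```python
import pandas as pd

# Sample frame, copied from the printed output above.
df = pd.DataFrame({
    "words": [
        "hello, jim, you, you , jim",
        "it, seems, bye, limb, pat, hello, jim",
        "how, are, you, are , kim",
    ],
    "other": ["val1", "val2", "val1"],
    "category": ["movie", "movie", "television"],
})

bigrams = []
for _, row in df.iterrows():
    # Split on commas and strip the uneven whitespace around each word.
    words = [w.strip() for w in row["words"].split(",")]
    # Pair every word with the word that follows it.
    bigrams.append(list(zip(words, words[1:])))

print(bigrams)
```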

[[('hello', 'jim'), ('jim', 'you'), ('you', 'you'), ('you', 'jim')], 
[('it', 'seems'), ('seems', 'bye'), ('bye', 'limb'), ('limb', 'pat'), ('pat', 'hello'), ('hello', 'jim')], 
[('how', 'are'), ('are', 'you'), ('you', 'are'), ('are', 'kim')]]

Here is a more efficient approach using Pandas and .apply:

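def bigram(row):
    # Split the comma-separated string and strip stray whitespace from each word
    lst = [w.strip() for w in row['words'].split(',')]
    # Pair each word with its successor
    return list(zip(lst, lst[1:]))

# Pass the function directly; wrapping it in a lambda is unnecessary
bigrams = df.apply(bigram, axis=1)

print(bigrams.tolist())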

[[('hello', 'jim'), ('jim', 'you'), ('you', 'you'), ('you', 'jim')], 
[('it', 'seems'), ('seems', 'bye'), ('bye', 'limb'), ('limb', 'pat'), ('pat', 'hello'), ('hello', 'jim')], 
[('how', 'are'), ('are', 'you'), ('you', 'are'), ('are', 'kim')]]

You can then group the data by category and find the 10 most common bigrams. Here is an example of finding the most frequent bigrams per category:

from collections import Counter

# Attach the per-row bigram lists, then concatenate them within each category
df['bigrams'] = bigrams
df2 = df.groupby('category').agg({'bigrams': 'sum'})

# Count how often each bigram occurs in each category
df3 = df2.bigrams.apply(Counter).to_frame()

Bigram frequency dictionaries by category:

print(df3)

                                                      bigrams
category                                                     
movie       {('hello', 'jim'): 2, ('jim', 'you'): 1, ('you...
television  {('how', 'are'): 1, ('are', 'you'): 1, ('you',...

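# Keep just the top 3 most frequent bigrams (or 10 if you have enough data).
# most_common() sorts by count; slicing list(row) would only follow insertion order.
df3.bigrams.apply(lambda row: [bg for bg, _ in row.most_common(3)])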

category
movie         [(hello, jim), (jim, you), (you, you)]
television      [(how, are), (are, you), (you, are)]
Name: bigrams, dtype: object
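The question also asks for the most frequent single words, which the snippets above do not cover. One possible way to get both per category is sketched below with collections.Counter. The helper names `tokens` and `top_n` are made up for illustration, and the sample frame is rebuilt inline so the block runs on its own.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "words": [
        "hello, jim, you, you , jim",
        "it, seems, bye, limb, pat, hello, jim",
        "how, are, you, are , kim",
    ],
    "other": ["val1", "val2", "val1"],
    "category": ["movie", "movie", "television"],
})


def tokens(text):
    """Split a comma-separated string into stripped words (hypothetical helper)."""
    return [w.strip() for w in text.split(",")]


def top_n(lists, n):
    """Count items across all lists and return the n most common (hypothetical helper)."""
    counts = Counter(item for lst in lists for item in lst)
    return counts.most_common(n)


df["tokens"] = df["words"].apply(tokens)
# Bigrams are built per row, so no pair spans two different rows.
df["bigrams"] = df["tokens"].apply(lambda t: list(zip(t, t[1:])))

# One entry per category: a list of (item, count) pairs, most frequent first.
top_words = {cat: top_n(g["tokens"], 2) for cat, g in df.groupby("category")}
top_bigrams = {cat: top_n(g["bigrams"], 2) for cat, g in df.groupby("category")}

print(top_words)
print(top_bigrams)
```

Raise the `n` argument to 10 once there is enough data; with only three rows most counts are 1, so the ranking below the first entry is mostly a matter of tie order.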
