Computing the majority vote by day and by hour

Posted 2024-05-19 09:47:52


I have a DataFrame with the columns tweet_text, date, time and sentiments, holding the following values:

tweet_text    date         time       sentiments
tweet1        2021-08-16   11:53:37   positive  
tweet2        2021-08-16   02:44:04   neutral
tweet3        2021-08-16   02:44:02   neutral
tweet4        2021-08-16   02:47:02   neutral
tweet5        2021-08-16   02:50:00   negative
tweet6        2021-08-17   05:20:46   positive
tweet7        2021-08-17   06:01:00   positive
tweet8        2021-08-17   06:20:00   positive
tweet9        2021-08-17   07:05:00   negative
tweet10       2021-08-17   07:20:21   negative

It can be created with:

import pandas as pd

df = pd.DataFrame({'tweet_text': ['tweet1', 'tweet2', 'tweet3', 'tweet4', 'tweet5', 'tweet6', 'tweet7', 'tweet8', 'tweet9', 'tweet10'],
                   'date': ['2021-08-16', '2021-08-16', '2021-08-16', '2021-08-16', '2021-08-16', '2021-08-17', '2021-08-17', '2021-08-17', '2021-08-17', '2021-08-17'],
                   'time': ['11:53:37', '02:44:04', '02:44:02', '02:47:02', '02:50:00', '05:20:46', '06:01:00', '06:20:00', '07:05:00', '07:20:21'],
                   'sentiments': ['positive', 'neutral', 'neutral', 'neutral', 'negative', 'positive', 'positive', 'positive', 'negative', 'negative']})

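I need to compute the sentiment by majority vote, once per day and once per hour, and I need two separate DataFrames as output. The first holds the majority vote per day, like this: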

Date         Majority_Sentiment
2021-08-16   neutral
2021-08-17   positive

The second DataFrame, with the majority vote per hour, could look like this:

Date         Hour    Majority_Sentiment
2021-08-16   11:00   positive   
2021-08-16   02:00   neutral
2021-08-17   05:00   positive
2021-08-17   06:00   positive
2021-08-17   07:00   negative

I know df.mode() can be used to compute this, but how do I apply it in my scenario? Thanks.


2 answers

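Use GroupBy.apply with a lambda function that calls Series.mode. Because mode can return several values when there is a tie, only the first one is selected with Series.iat: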

f = lambda x: x.mode().iat[0]
df1 = df.groupby('date')['sentiments'].apply(f).reset_index(name='Majority_Sentiment')
print (df1)
         date Majority_Sentiment
0  2021-08-16            neutral
1  2021-08-17           positive
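
To illustrate why the first element is picked explicitly, here is a minimal sketch of a tie. The Series s is made-up illustration data, not taken from the question. When two labels share the highest count, mode() returns both of them in sorted order, and .iat[0] keeps only the first:

```python
import pandas as pd

# Hypothetical data: 'positive' and 'negative' both occur twice
s = pd.Series(['positive', 'negative', 'negative', 'positive', 'neutral'])

modes = s.mode()       # every value tied for the highest count, sorted
first = modes.iat[0]   # the single label kept by the lambda above

print(modes.tolist())
print(first)
```

So with this approach a tie is resolved alphabetically rather than by any property of the tweets. If that matters, a different tie-breaking rule has to be chosen deliberately.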

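For the second output, convert the time column to datetimes, format it as HH:00 with Series.dt.strftime, and group by the date column together with the resulting times Series: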

times = pd.to_datetime(df['time'], format='%H:%M:%S').dt.strftime('%H:00')
df2 = (df.groupby(['date', times], sort=False)['sentiments'].apply(f)
         .reset_index(name='Majority_Sentiment'))
print (df2)

A similar alternative is to assign the formatted times back to the time column:

df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S').dt.strftime('%H:00')
df2 = (df.groupby(['date', 'time'], sort=False)['sentiments'].apply(f)
         .reset_index(name='Majority_Sentiment'))
print (df2)


         date   time Majority_Sentiment
0  2021-08-16  11:00           positive
1  2021-08-16  02:00            neutral
2  2021-08-17  05:00           positive
3  2021-08-17  06:00           positive
4  2021-08-17  07:00           negative
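
For reference, the hourly grouping can also be sketched end to end by first combining date and time into a single timestamp. This is only one possible variant under stated assumptions: the names ts and hourly are hypothetical, and the explicit format string assumes the date and time layout shown in the question.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({
    'date': ['2021-08-16'] * 5 + ['2021-08-17'] * 5,
    'time': ['11:53:37', '02:44:04', '02:44:02', '02:47:02', '02:50:00',
             '05:20:46', '06:01:00', '06:20:00', '07:05:00', '07:20:21'],
    'sentiments': ['positive', 'neutral', 'neutral', 'neutral', 'negative',
                   'positive', 'positive', 'positive', 'negative', 'negative'],
})

# Assumption: date is YYYY-MM-DD and time is HH:MM:SS, as in the question
ts = pd.to_datetime(df['date'] + ' ' + df['time'], format='%Y-%m-%d %H:%M:%S')

hourly = (
    df.assign(Date=ts.dt.strftime('%Y-%m-%d'), Hour=ts.dt.strftime('%H:00'))
      .groupby(['Date', 'Hour'], sort=False)['sentiments']  # keep first-seen order
      .agg(lambda x: x.mode().iat[0])                       # first mode wins on ties
      .reset_index(name='Majority_Sentiment')
)
print(hourly)
```

Passing sort=False keeps the groups in the order they first appear in the data, which matches the layout requested in the question.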

  1. Generate a small dataset
import numpy as np
import pandas as pd
import datetime
import random


n = 1000
rawdata = pd.DataFrame({'tweet_text': [f'tweet{i}' for i in range(n)],
                        'date': random.choices(
                            [(datetime.date.today() + datetime.timedelta(days=days)).strftime('%Y-%m-%d') for days in
                             range(10)], k=n),
                        'time': [(datetime.datetime.now() + datetime.timedelta(minutes=i)).strftime('%H:%M:%S') for i in
                                 random.choices(range(40000), k=n)],
                        'sentiments': random.choices(population=['positive', 'neutral', 'foo', 'bar'], k=n)
                        })
rawdata

  2. Compute the result
def cal_majority(x):
    """Return the most frequent value in x."""
    x = np.array(x)
    (index, value) = np.unique(x, return_counts=True)
    majority_str = index[np.argmax(value)]
    return majority_str


def cal_second_value(x):
    """Return the second most frequent value in x (the only value if x has one distinct value)."""
    x = np.array(x)
    (index, value) = np.unique(x, return_counts=True)
    # argsort is ascending, so position -2 holds the second-highest count
    order = np.argsort(value)
    second_str = index[order[-2]] if len(order) > 1 else index[order[-1]]
    return second_str


rawdata.groupby(['date']).agg(
    majority_sentiment=('sentiments', cal_majority),
    second_sentiment = ('sentiments', cal_second_value)
)

Result: [image]
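
As a hedged alternative to the NumPy helpers in this answer, the same idea can be sketched with Series.value_counts, which orders labels by descending count. The helper name top_two is hypothetical, and the order among labels with equal counts should not be relied on.

```python
import pandas as pd


def top_two(values):
    """Return (most frequent, second most frequent) label; second is None if absent."""
    counts = pd.Series(values).value_counts()  # sorted by count, highest first
    first = counts.index[0]
    second = counts.index[1] if len(counts) > 1 else None
    return first, second


# Made-up sample without ties: 'a' x3, 'b' x2, 'c' x1
print(top_two(['a', 'b', 'a', 'c', 'b', 'a']))
```

Returning None when only one distinct label exists is a design choice of this sketch. The helpers in the answer return a label in that case instead.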

  3. Going further, compute the same per hour:

data2 = rawdata.copy()
data2['time'] = data2['time'].apply(lambda x: datetime.datetime.strptime(x, '%H:%M:%S').strftime("%H:00"))
data2.groupby(['date', 'time']).agg(
    majority_sentiment=('sentiments', cal_majority),
    second_sentiment = ('sentiments', cal_second_value)
).reset_index()

