Seaborn: plotting frequencies

Posted 2024-09-21 00:22:39


I have a DataFrame in which start_time is a proper datetime column and start_station_name is a string column, as shown below:

    start_time                      start_station_name
2019-03-20 11:04:16     San Francisco Caltrain (Townsend St at 4th St)
2019-04-06 14:19:06     Folsom St at 9th St
2019-05-24 17:21:11     Golden Gate Ave at Hyde St
2019-03-27 18:53:27     4th St at Mission Bay Blvd S
2019-04-16 08:45:16     Esprit Park

Now I simply want to plot how often each name occurs over the year, by month. To group the data accordingly, I used the following:

data = df_clean.groupby(df_clean['start_time'].dt.strftime('%B'))['start_station_name'].value_counts()

What I get back is not a DataFrame but a Series with dtype int64:

start_time  start_station_name                                       
April       San Francisco Caltrain Station 2  (Townsend St at 4th St)    4866
            Market St at 10th St                                         4609
            San Francisco Ferry Building (Harry Bridges Plaza)           4270
            Berry St at 4th St                                           3994
            Montgomery St BART Station (Market St at 2nd St)             3550
                                                                         ... 
September   Mission Bay Kids Park                                        1026
            11th St at Natoma St                                         1023
            Victoria Manalo Draves Park                                  1018
            Davis St at Jackson St                                       1015
            San Francisco Caltrain Station (King St at 4th St)           1014

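Now I want to plot this as a clustered bar chart with Seaborn's countplot(), restricted to absolute frequencies above 1000, with the month on the x-axis, the name as the hue, and the count on the y-axis: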

sns.countplot(data = data[data > 1000], x = 'start_time', hue = 'start_station_name')

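I then get the error message Could not interpret input 'start_time', probably because the data is not a proper DataFrame. How do I group/aggregate the data in the first place so that the visualization works?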


1 answer
User
#1 · Posted 2024-09-21 00:22:39

Try:

data = df.groupby([df['start_time'].dt.strftime('%B'), 'start_station_name']) \
        .count() \
        .rename(columns={"start_time": "count"}) \
        .reset_index()
# data.count is the DataFrame.count method, so select the column with data["count"]
ax = sns.countplot(x="start_time", hue="start_station_name", data=data[data["count"] > 1000])

Explanation

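  • I changed the groupby keys by adding the start_station_name column
  • Use count() to get the number of rows in each group
  • Use rename() to rename the counted start_time column to count
  • Use reset_index() to turn the groupby index back into columns
  • Subset the dataset
  • Use countplot() to plot the result (following the second example in the documentation)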
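
One caveat about the last step: countplot() tallies rows, so on a table that already holds one row per (month, station) pair, every bar has height 1. A minimal sketch of an alternative, assuming the goal is to show the aggregated count column on the y-axis, is to pass that column to barplot(). The sample data, the `min_count` threshold, and the `frequent` name below are hypothetical stand-ins; the question itself uses a threshold of 1000.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the real trip data
df = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-03-20 11:04:16", "2019-03-21 09:15:00", "2019-03-27 18:53:27",
        "2019-03-27 19:02:41", "2019-04-06 14:19:06", "2019-04-16 08:45:16",
        "2019-04-17 08:50:03",
    ]),
    "start_station_name": [
        "San Francisco Caltrain (Townsend St at 4th St)",
        "San Francisco Caltrain (Townsend St at 4th St)",
        "San Francisco Caltrain (Townsend St at 4th St)",
        "Esprit Park",
        "Folsom St at 9th St",
        "Esprit Park",
        "Esprit Park",
    ],
})

# One row per (month, station) pair, with the number of trips in "count"
data = (
    df.groupby([df["start_time"].dt.strftime("%B"), "start_station_name"])
      .size()
      .reset_index(name="count")
)

# Hypothetical threshold for the toy data; the question uses 1000
min_count = 1
frequent = data[data["count"] > min_count]

# barplot draws the precomputed counts instead of re-counting rows
ax = sns.barplot(data=frequent, x="start_time", y="count", hue="start_station_name")
ax.set_xlabel("month")
```

For interactive use, drop the `matplotlib.use("Agg")` line and finish with `plt.show()`.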

Full code

import matplotlib.pyplot as plt
import seaborn as sns

print(df)
#            start_time                              start_station_name
# 0 2019-03-20 11:04:16  San Francisco Caltrain (Townsend St at 4th St)
# 1 2019-04-06 14:19:06                             Folsom St at 9th St
# 2 2019-05-24 17:21:11                      Golden Gate Ave at Hyde St
# 3 2019-03-27 18:53:27                    4th St at Mission Bay Blvd S
# 4 2019-04-16 08:45:16                                     Esprit Park

data = df.groupby([df['start_time'].dt.strftime('%B'), 'start_station_name']) \
        .count() \
        .rename(columns={"start_time": "count"}) \
        .reset_index()
print(data)
#   start_time                              start_station_name  count
# 0      April                                     Esprit Park      1
# 1      April                             Folsom St at 9th St      1
# 2      March                    4th St at Mission Bay Blvd S      1
# 3      March  San Francisco Caltrain (Townsend St at 4th St)      1
# 4        May                      Golden Gate Ave at Hyde St      1

# Filter as desired; data.count is a DataFrame method, so select the column with data["count"]
# data = data[data["count"] > 1000]

# Plot
ax = sns.countplot(x="start_time", hue="start_station_name", data=data)
plt.show()

Output

[Figure: the resulting bar plot]
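
As for why the original attempt fails: groupby(...).value_counts() returns a Series whose month and station name live in the index, so Seaborn finds no column called start_time. A minimal sketch of one way to keep the question's own grouping and still get a plottable table is to call reset_index(name="count") on that Series. The sample data, the threshold of 1, and the names `table`, `frequent`, and `month_order` are hypothetical; the chronological month ordering is an optional extra, since month names otherwise sort alphabetically.

```python
import calendar

import pandas as pd

# Hypothetical stand-in for df_clean from the question
df_clean = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-03-20 11:04:16", "2019-03-21 09:15:00", "2019-03-27 18:53:27",
        "2019-03-27 19:02:41", "2019-04-06 14:19:06", "2019-04-16 08:45:16",
        "2019-04-17 08:50:03",
    ]),
    "start_station_name": [
        "Caltrain", "Caltrain", "Caltrain", "Esprit Park",
        "Folsom St at 9th St", "Esprit Park", "Esprit Park",
    ],
})

# Same grouping as in the question: the result is a Series with a
# two-level index (start_time, start_station_name)
counts = (
    df_clean.groupby(df_clean["start_time"].dt.strftime("%B"))["start_station_name"]
            .value_counts()
)

# Move both index levels into columns and name the values "count"
table = counts.reset_index(name="count")

# Keep only the frequent stations (the question uses > 1000)
frequent = table[table["count"] > 1].copy()

# Optional: order the months chronologically instead of alphabetically
month_order = list(calendar.month_name)[1:]
frequent["start_time"] = pd.Categorical(
    frequent["start_time"], categories=month_order, ordered=True
)
frequent = frequent.sort_values("start_time")
print(frequent)
```

The resulting `frequent` table has start_time, start_station_name, and count as ordinary columns, so it can be passed as `data=` to a Seaborn plotting function.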
