Python pandas: match flight arrival and departure times that sit on separate rows, join them onto the same row, then plot a Gantt chart

1 answer

User

Reply #1 · posted 2024-09-28 10:09:02

Question 1

One quick change to the setup: I did not set ID as index_col, because I want to use its values directly in groupby().shift. So, starting from the modified read_csv:

import pandas as pd

df = pd.read_csv(dfstr, sep=";")  # dfstr: the input from the question's setup
cols = df.columns.values.tolist()

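A big part of the solution is making sure df is sorted by Car, FltReg, and {} (the first two are the unique identifiers and the last is the main sort value).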

^{pr2}$
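The code for this step did not survive in the text above. As a rough sketch of the sort being described, assuming the elided third key is the scheduled-time column `STADDtTm` (a guess based on the columns used later) and using made-up rows with the hypothetical names `demo` and `demo_sort_cols`:

```python
import pandas as pd

# Made-up rows, deliberately out of order (not the question's data).
demo = pd.DataFrame({
    "ID": [3, 1, 2, 4],
    "Car": ["VZ", "EK", "VZ", "VZ"],
    "FltReg": ["HSVKA", "A6ECI", "HSVKA", "HSVKB"],
    "STADDtTm": ["01/05/2017 16:35", "03/05/2017 12:50",
                 "01/05/2017 15:25", "01/05/2017 09:20"],
})

# ASSUMPTION: the elided third sort key is the scheduled-time column.
demo_sort_cols = ["Car", "FltReg", "STADDtTm"]

# Sort on parsed timestamps: dd/mm/yyyy strings do not order
# chronologically once the data spans more than one month.
parsed = pd.to_datetime(demo["STADDtTm"], format="%d/%m/%Y %H:%M")
demo = (demo.assign(_when=parsed)
            .sort_values(by=demo_sort_cols[:2] + ["_when"])
            .drop(columns="_when")
            .reset_index(drop=True))
print(demo)
```

The answer's own code may well have sorted the raw column directly; parsing first is just the safer choice if the data ever crosses a month boundary.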

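Now we are at the main part of the logic. I'm going to split df into arrivals and departures and join the two on a shifted ID. That is, within any (Car, FltReg) partition, I know to pair a given 'A' row with the 'D' row that immediately follows it. For this to work we need the data to be complete.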

Let's generate the shifted ID:

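# sort_cols[:2] is `Car` and `FltReg` together.
# shift(1) gives each row the ID of the row just before it in its group,
# so a 'D' row's NextID is the ID of the 'A' row it follows.
df['NextID'] = df.groupby(sort_cols[:2])['ID'].shift(1)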

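Now, using an 'A'-filtered df and a 'D'-filtered df, I'll join the two together. Arrivals (the left dataset) are keyed by the original ID, and departures (the right dataset) are keyed by the NextID we just generated.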

df_display = df[df['ArrDep'] == 'A'] \
                 .merge(df[df['ArrDep'] == 'D'],
                       how='outer',
                       left_on='ID',
                       right_on='NextID',
                       suffixes=('1', '2'))

Note that the columns will now be suffixed with 1 (left) and 2 (right).
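To see the pairing in isolation, here is a small sketch on made-up rows for a single aircraft (the frame `toy` and its IDs are invented for illustration). It includes a departure with no earlier arrival and an arrival with no later departure, so the outer join's unmatched rows show up too:

```python
import pandas as pd

# Made-up rows for one aircraft, already sorted by time.
toy = pd.DataFrame({
    "ID": [11, 12, 13, 14],
    "Car": ["VZ"] * 4,
    "FltReg": ["HSVKB"] * 4,
    "ArrDep": ["D", "A", "D", "A"],
})

# Each row receives the ID of the previous row in its (Car, FltReg) group.
toy["NextID"] = toy.groupby(["Car", "FltReg"])["ID"].shift(1)

# Arrivals keyed by their own ID, departures keyed by the shifted ID.
pairs = toy[toy["ArrDep"] == "A"].merge(
    toy[toy["ArrDep"] == "D"],
    how="outer", left_on="ID", right_on="NextID", suffixes=("1", "2"))

print(pairs[["ID1", "ArrDep1", "ID2", "ArrDep2"]])
```

Arrival 12 pairs with departure 13, while departure 11 and arrival 14 each end up on a row of their own with NaN on the other side.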

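At this point the new dataframe df_display has all the rows it needs, but they are not in a good order for the final display. To fix that, you need the sort_cols list again, this time with a coalesced version of each column that combines its left and right versions. For example, Car1 and {} must be coalesced so that you can sort all rows by the combined version.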

pandas' combine_first works like a coalesce.

# purely for sorting the final display
for c in sort_cols: 
    df_display['sort_' + c] = df_display[c + '1'] \
                                  .combine_first(df_display[c + '2'])
    # for example, Car1 and Car2 have now been coalesced into sort_Car

df_display.sort_values(by=['sort_{}'.format(c) for c in sort_cols], inplace=True)
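For anyone who hasn't met combine_first before, a tiny standalone sketch (the two Series are made up) shows the coalescing behaviour used above:

```python
import pandas as pd

left = pd.Series(["EK", None, "VZ"])
right = pd.Series(["EK", "VZ", None])

# Take the value from `left` where it is present, else fall back to `right`.
coalesced = left.combine_first(right)
print(coalesced.tolist())  # ['EK', 'VZ', 'VZ']
```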

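We're almost done. df_display now has extraneous columns we don't need. We can select just the columns we want, which is essentially two copies of the original column list cols.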

df_display = df_display[['{}1'.format(c) for c in cols] + ['{}2'.format(c) for c in cols]]
df_display.to_csv('output.csv', index=False)

I checked (in the CSV export, so that the whole wide dataset is visible) that this matches your example.

Question 2

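OK, so if you play with the code at https://matplotlib.org/examples/pylab_examples/broken_barh.html, you can see how broken_barh works. This matters because we have to fit our data into that structure in order to use it. The first argument to broken_barh is a list of tuples to plot, and each tuple is (start time, duration).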
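As a minimal sketch of that call shape, with plain numbers instead of dates (the values are arbitrary and the Agg backend is only there so it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Each tuple is (start, width) along x; the second argument is
# (bottom, height) along y, shared by every bar in this row.
bars = ax.broken_barh([(1, 2), (5, 1.5)], (10, 9))

ax.set_xlim(0, 8)
ax.set_ylim(5, 25)
fig.canvas.draw()
```

Add fig.savefig(...) or plt.show() to actually look at it. The Gantt chart later on is this same call repeated once per aircraft, with dates on the x axis.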

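For matplotlib, the start time must be in its special numeric date format; the code further down does that conversion with mdates.date2num. Finally, the duration appears to be in day units.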

So, if HSVKA arrived at 2017-05-01 15:25:00 and stayed on the ground for 70 minutes, then broken_barh needs to plot the tuple (mdates.date2num(Timestamp('2017-05-01 15:25:00')), 70 minutes in day units or 0.04861).
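The day-unit arithmetic can be checked with the standard library alone. This sketch uses the HSVKA arrival at 15:25 and its departure at 16:35 on 1 May 2017 from the table below:

```python
from datetime import datetime, timedelta

arrival = datetime(2017, 5, 1, 15, 25)
departure = datetime(2017, 5, 1, 16, 35)

# Dividing one timedelta by another gives a plain float ratio.
duration_days = (departure - arrival) / timedelta(days=1)
print(round(duration_days, 5))  # 0.04861
```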

So the first step is to get {} from Question 1 into the following format. For now we only need to focus on four columns: 'Car1', 'FltReg1', 'STADDtTm1', 'STADDtTm2'.

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn # optional ... I like the look

print(df_display[['Car1', 'FltReg1', 'STADDtTm1', 'STADDtTm2']])

which looks like

   Car1 FltReg1         STADDtTm1         STADDtTm2
0    EK   A6ECI  03/05/2017 12:50  03/05/2017 15:40
1    EK   A6EDL  02/05/2017 12:15  02/05/2017 14:00
2    EK   A6EDL  02/05/2017 23:45  03/05/2017 01:15
10  NaN     NaN               NaN  01/05/2017 11:15
3    VZ   HSVKA  01/05/2017 15:25  01/05/2017 16:35
4    VZ   HSVKA  01/05/2017 19:45  02/05/2017 11:15
5    VZ   HSVKA  02/05/2017 14:25  03/05/2017 07:05
6    VZ   HSVKA  03/05/2017 15:45  03/05/2017 18:20
7    VZ   HSVKA  03/05/2017 21:50               NaN
11  NaN     NaN               NaN  01/05/2017 06:10
8    VZ   HSVKB  01/05/2017 09:20  01/05/2017 09:50
9    VZ   HSVKB  01/05/2017 13:00               NaN

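There are NaNs wherever an arrival or a departure is missing. Imputing these is fairly simple. I noticed in your post that you want a one-hour buffer wherever either one is missing. All of it is straightforward wrangling: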

df_gantt = df_display.copy()

# Convert to pandas timestamps for date arithmetic
df_gantt['STADDtTm1'] = pd.to_datetime(df_gantt['STADDtTm1'],
                                       format='%d/%m/%Y %H:%M')
df_gantt['STADDtTm2'] = pd.to_datetime(df_gantt['STADDtTm2'],
                                       format='%d/%m/%Y %H:%M')

# Impute identifiers
df_gantt['Car'] = df_gantt['Car1'].combine_first(df_gantt['Car2'])
df_gantt['FltReg'] = df_gantt['FltReg1'].combine_first(df_gantt['FltReg2'])

# Also just gonna combine Car and FltReg
# into a single column for simplicity
df_gantt['Car_FltReg'] = df_gantt['Car'] + ': ' +  df_gantt['FltReg']

# Impute hour gaps
df_gantt['STADDtTm1'] = df_gantt['STADDtTm1'] \
                            .fillna(df_gantt['STADDtTm2'] - pd.Timedelta('1 hour'))
df_gantt['STADDtTm2'] = df_gantt['STADDtTm2'] \
                            .fillna(df_gantt['STADDtTm1'] + pd.Timedelta('1 hour'))

# Date diff in day units (total_seconds also counts whole days)
df_gantt['DayDiff'] = (df_gantt['STADDtTm2'] - df_gantt['STADDtTm1']).dt.total_seconds() \
                          / 60 / 60 / 24

# matplotlib numeric date format
df_gantt['STADDtTm1'] = df_gantt['STADDtTm1'].apply(mdates.date2num)
df_gantt['STADDtTm2'] = df_gantt['STADDtTm2'].apply(mdates.date2num)

df_gantt = df_gantt[['Car_FltReg', 'STADDtTm1', 'STADDtTm2', 'DayDiff']]
print(df_gantt)
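One thing to watch when turning a timedelta column into day units: the choice between `.dt.seconds` and `.dt.total_seconds()` matters as soon as an aircraft stays on the ground for more than 24 hours. A small sketch with made-up gaps:

```python
import pandas as pd

gaps = pd.Series([pd.Timedelta(minutes=70), pd.Timedelta(days=1, hours=2)])

# .dt.seconds is only the seconds component (0..86399); whole days are dropped.
component_days = gaps.dt.seconds / 86400
# .dt.total_seconds() covers the entire interval.
total_days = gaps.dt.total_seconds() / 86400

print(component_days.round(4).tolist())
print(total_days.round(4).tolist())
```

For the 26-hour gap the first version reports only the 2-hour remainder, while the second reports the full 26 hours. None of the sample rows here exceed a day, so both give the same numbers on this data.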

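which now looks like this (these numeric dates come from matplotlib's older date epoch; since matplotlib 3.3 the default epoch is 1970-01-01, so newer versions print smaller start values, while DayDiff is unchanged)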

   Car_FltReg      STADDtTm1      STADDtTm2   DayDiff
0   EK: A6ECI  736452.534722  736452.652778  0.118056
1   EK: A6EDL  736451.510417  736451.583333  0.072917
2   EK: A6EDL  736451.989583  736452.052083  0.062500
10  VZ: HSVKA  736450.427083  736450.468750  0.041667
3   VZ: HSVKA  736450.642361  736450.690972  0.048611
4   VZ: HSVKA  736450.822917  736451.468750  0.645833
5   VZ: HSVKA  736451.600694  736452.295139  0.694444
6   VZ: HSVKA  736452.656250  736452.763889  0.107639
7   VZ: HSVKA  736452.909722  736452.951389  0.041667
11  VZ: HSVKB  736450.215278  736450.256944  0.041667
8   VZ: HSVKB  736450.388889  736450.409722  0.020833
9   VZ: HSVKB  736450.541667  736450.583333  0.041667

Now build a dict in which each key is a unique Car_FltReg and each value is a list of tuples (as described earlier) that can be fed into broken_barh.

# select the two columns with a list; tuple-style selection fails on newer pandas
dict_gantt = df_gantt.groupby('Car_FltReg')[['STADDtTm1', 'DayDiff']] \
                 .apply(lambda x: list(zip(x['STADDtTm1'].tolist(),
                                           x['DayDiff'].tolist()))) \
                 .to_dict()

So dict_gantt looks like

{'EK: A6ECI': [(736452.5347222222, 0.11805555555555557)],
 'EK: A6EDL': [(736451.5104166666, 0.07291666666666667),
               (736451.9895833334, 0.0625)],
 'VZ: HSVKA': [(736450.4270833334, 0.041666666666666664),
               (736450.6423611111, 0.04861111111111111),
               (736450.8229166666, 0.6458333333333334),
               (736451.6006944445, 0.6944444444444445),
               (736452.65625, 0.1076388888888889),
               (736452.9097222222, 0.041666666666666664)],
 'VZ: HSVKB': [(736450.2152777778, 0.041666666666666664),
               (736450.3888888889, 0.020833333333333332),
               (736450.5416666666, 0.041666666666666664)]}

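Perfect for broken_barh. From here on it's all matplotlib craziness. Once the core logic of preparing the broken_barh input is done, everything else is just painstaking tick formatting and the like. If you've customized anything in matplotlib before, this should look familiar, so I won't explain much.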

FltReg_list = sorted(dict_gantt, reverse=True)

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

start_datetime = df_gantt['STADDtTm1'].min()
end_datetime = df_gantt['STADDtTm2'].max()

# parameters for yticks, etc.
# you might have to play around
# with the different parts to modify
n = len(FltReg_list)
bar_size = 9

for i, bar in enumerate(FltReg_list):
    ax.broken_barh(dict_gantt[bar],          # data
                   (10 * (i + 1), bar_size), # (y position, bar size)
                   alpha=0.75,
                   edgecolor='k',
                   linewidth=1.2)

# I got date formatting ideas from
# https://matplotlib.org/examples/pylab_examples/finance_demo.html
ax.set_xlim(start_datetime, end_datetime)
ax.xaxis.set_major_locator(mdates.HourLocator(byhour=range(0, 24, 6)))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%m-%d %H:%M'))
ax.xaxis.set_minor_locator(mdates.HourLocator(byhour=range(0, 24, 1)))
# omitting minor labels ...

plt.grid(True, which='minor', color='w', linestyle='dotted')

ax.set_yticks([5 + 10 * k for k in range(1, n + 1)])
ax.set_ylim(5, 5 + 10 * (n + 1))
ax.set_yticklabels(FltReg_list)

ax.set_title('Time on Ground')
ax.set_ylabel('Carrier: Registration')

plt.setp(plt.gca().get_xticklabels(), rotation=30, horizontalalignment='right')

plt.tight_layout()
fig.savefig('gantt.png', dpi=200)

Here is the final output.
