Avoiding iteration when counting occurrences in a large pandas DataFrame

Posted 2024-06-17 10:59:53


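I have two DataFrames. One, df_stops, holds a list of bus stop numbers; the other, df_arrivals, holds bus arrivals with the columns StopNumber and OnTimeStatus (OnTimeStatus is -1, 0, or 1, meaning the bus was early, on time, or late, respectively).

I would like to add 3 new columns to the df_stops DataFrame: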

  1. PercentEarly
  2. PercentOnTime
  3. PercentLate

I am having trouble figuring out how to do this without iterating in a loop. If I did it iteratively, I would do something along these lines:

for idx, row in df_stops.iterrows():
    # boolean mask selecting the arrivals recorded at this stop
    at_stop = df_arrivals['StopNumber'] == row['StopNumber']

    # number of early arrivals / total number of arrivals @ that stop
    df_stops.at[idx, 'PercentEarly'] = (
        (at_stop & (df_arrivals['OnTimeStatus'] < 0)).sum() / at_stop.sum()
    )

    # same idea for on time and late arrivals

I am fairly new to pandas and to data science in general, so any help is much appreciated.

How can I do this without iterating over every row in df_stops?

Edit:

df_arrivals

       RouteNumber  ScheduledUnix  StopNumber OnTimeStatus
0               44     1511977533       40888            0
1               44     1511979273       40888            0
2               44     1511979273       40888            0
3               44     1511980353       40888            0
4               44     1511979273       40888            0
5               44     1511980353       40888            1
...            ...            ...         ...          ...
67538           85     1512005100       40900            0
67539           85     1512008700       40900            0
67540           85     1512008700       40900           -1
67541           85     1512008700       40900            0
67542           85     1512012300       40900            0

df_stops

     StopNumber
0         40877
1         40874
2         40876
3         40725
4         40875
5         40776
6         40730
7         40723
8         40721
9         40729
10        40722

The desired output would look something like:

     StopNumber    EarlyPercent    OnTimePercent    LatePercent
0         40877            0.14             0.80           0.06
...

3 answers

To answer your question about counting occurrences:

Here is what I would do:

# Counts of early (-1), on-time (0), and late (1) arrivals for a single stop.
# To get this for every stop at once you need to groupby first.
# Define a specific stop number first and store it as stop_num.
counts = df_arrivals.loc[df_arrivals.StopNumber == stop_num, 'OnTimeStatus'].value_counts()
# .get() returns 0 when a status never occurs at this stop
early, ontime, late = counts.get(-1, 0), counts.get(0, 0), counts.get(1, 0)

# divide by the number of arrivals at this stop, not by the number of rows in df_stops
total_arrivals = counts.sum()
EarlyPercent = early / total_arrivals
OntimePercent = ontime / total_arrivals
LatePercent = late / total_arrivals
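
As a side note, the per-stop fractions can also be read directly from value_counts(normalize=True), which skips the manual division. The following is only a minimal sketch: the sample rows and the chosen stop_num are made up for illustration, and the column names are taken from the question.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df_arrivals = pd.DataFrame({
    "StopNumber":   [40888, 40888, 40888, 40888, 40900],
    "OnTimeStatus": [0, 0, -1, 1, 0],
})
stop_num = 40888  # assumed stop of interest

# normalize=True returns fractions of the total instead of raw counts.
shares = (
    df_arrivals.loc[df_arrivals["StopNumber"] == stop_num, "OnTimeStatus"]
    .value_counts(normalize=True)
    .reindex([-1, 0, 1], fill_value=0.0)  # keep all three statuses, even if absent
)

early_pct = shares.loc[-1]
ontime_pct = shares.loc[0]
late_pct = shares.loc[1]
```

The reindex step is there so that a status that never occurs at the stop shows up as 0.0 instead of raising a KeyError.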

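Keep in mind that this only covers a single stop number. Honestly, I don't think there is a way to avoid iteration here without overly complicated code (chaining and so on).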

# start with NaN so the new columns are numeric rather than strings
df_stops['PercentEarly'] = float('nan')
df_stops['PercentOnTime'] = float('nan')
df_stops['PercentLate'] = float('nan')

# visit each stop once instead of once per arrival row
for stop_num in df_arrivals.StopNumber.unique():
    counts = df_arrivals.loc[df_arrivals.StopNumber == stop_num, 'OnTimeStatus'].value_counts()
    total_arrivals = counts.sum()
    EarlyPercent = counts.get(-1, 0) / total_arrivals
    OntimePercent = counts.get(0, 0) / total_arrivals
    LatePercent = counts.get(1, 0) / total_arrivals
    df_stops.loc[df_stops.StopNumber == stop_num, 'PercentEarly'] = EarlyPercent
    df_stops.loc[df_stops.StopNumber == stop_num, 'PercentOnTime'] = OntimePercent
    df_stops.loc[df_stops.StopNumber == stop_num, 'PercentLate'] = LatePercent

You can use groupby:

for stop_num, group in df_arrivals.groupby('StopNumber'):
    # number of arrivals per OnTimeStatus value at this stop
    print(stop_num, group.groupby('OnTimeStatus').size().to_dict())

Does it work as expected now?
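
The groupby idea can be taken one step further so that no Python-level loop is needed at all. The sketch below is one possible way to do it, not necessarily what the answerer had in mind: it combines groupby with value_counts(normalize=True), pivots the result with unstack, and merges it into df_stops. The sample rows are invented for illustration; the column names follow the question.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df_arrivals = pd.DataFrame({
    "StopNumber":   [40877, 40877, 40877, 40877, 40874, 40874],
    "OnTimeStatus": [-1, 0, 0, 1, 0, 0],
})
df_stops = pd.DataFrame({"StopNumber": [40877, 40874, 40876]})

pct = (
    df_arrivals.groupby("StopNumber")["OnTimeStatus"]
    .value_counts(normalize=True)                  # fraction of each status per stop
    .unstack(fill_value=0.0)                       # one column per status value
    .reindex(columns=[-1, 0, 1], fill_value=0.0)   # make sure all three columns exist
    .rename(columns={-1: "PercentEarly", 0: "PercentOnTime", 1: "PercentLate"})
    .reset_index()
)

# A left merge keeps every stop; stops with no recorded arrivals get NaN.
result = df_stops.merge(pct, on="StopNumber", how="left")
print(result)
```

A left merge is used here so that stops without any arrivals stay in the table with NaN percentages; an inner merge would drop them instead.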

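I never did figure out how to do it without iterating. I also decided to store the number of early/on-time/late arrivals rather than the percentages. Here is my solution, which seems to be reasonably fast even with tens of thousands of entries: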

import pandas as pd

# find the number of arrivals per stop and merge it with the stops DataFrame
df_stop_counts = df_arrivals['StopNumber'].value_counts().reset_index()
df_stop_counts.columns = ['StopNumber', 'NumArrivals']
df_stops = pd.merge(df_stops, df_stop_counts, on='StopNumber')

# iterate over all the stops and find the number of early/on-time/late arrivals
for index, row in df_stops.iterrows():
    # match on the stop number in this row, not on the row index
    at_stop = df_arrivals['StopNumber'] == row['StopNumber']
    df_stops.at[index, 'NumEarly'] = (at_stop & (df_arrivals['OnTimeStatus'] == -1)).sum()
    df_stops.at[index, 'NumOnTime'] = (at_stop & (df_arrivals['OnTimeStatus'] == 0)).sum()
    df_stops.at[index, 'NumLate'] = (at_stop & (df_arrivals['OnTimeStatus'] == 1)).sum()
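
For comparison, the same counts can probably be produced without the iterrows loop by building a contingency table with pd.crosstab and merging it in. This is a sketch under the assumption that the columns are named as in the question; the sample rows are made up.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df_arrivals = pd.DataFrame({
    "StopNumber":   [40877, 40877, 40877, 40877, 40874, 40874],
    "OnTimeStatus": [-1, 0, 0, 1, 0, 0],
})
df_stops = pd.DataFrame({"StopNumber": [40877, 40874, 40876]})

# Rows are stops, columns are status values, cells are arrival counts.
counts = (
    pd.crosstab(df_arrivals["StopNumber"], df_arrivals["OnTimeStatus"])
    .reindex(columns=[-1, 0, 1], fill_value=0)   # make sure all three columns exist
    .rename(columns={-1: "NumEarly", 0: "NumOnTime", 1: "NumLate"})
)
counts["NumArrivals"] = counts.sum(axis=1)

# A left merge keeps stops without arrivals; their counts come out as NaN.
result = df_stops.merge(counts.reset_index(), on="StopNumber", how="left")
print(result)
```

Dividing the three count columns by NumArrivals afterwards would give the percentages asked for in the question.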
