Pandas: new rows based on existing rows

Posted 2024-10-01 02:36:07


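I have a dataframe df with the following structure: dataId, nodeId, tickDatetime.

This dataset records the time (tickDatetime) at which an element (dataId) passes through a node (nodeId).

Here is an example: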

           dataId    nodeId     tickDatetime
0          data-0   node-01             3000
1          data-0   node-02             5000    
2          data-1   node-02             4000    
3          data-1   node-01             6000    
4          data-0   node-01             8000    
5          data-0   node-00            10000    
...           ...       ...  

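From this dataframe, I want to create a new dataframe, routes, that contains the sequence of nodes and the travel times for each dataId.

To do so, I did the following: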

import numpy as np
import pandas as pd

# Sort by time first so each list is in travel order; dataId becomes the index
routes = df.sort_values('tickDatetime').groupby('dataId').agg({'nodeId': list, 'tickDatetime': list})

def datetimes_to_travel_times(datetimes):
    # Time elapsed since the previous node; the first node of a sequence gets 0
    traveltimes = np.empty(len(datetimes))
    old_value = datetimes[0]
    traveltimes[0] = 0

    for i in range(1, len(datetimes)):
        traveltimes[i] = datetimes[i] - old_value
        old_value = datetimes[i]

    return traveltimes

routes['traveltimes'] = routes['tickDatetime'].apply(datetimes_to_travel_times)

This gives me the expected output (though it may not be the best way to do it?):

           dataId                              nodeId                tickDatetime           traveltimes
0          data-0   [node-01,node-02,node-01,node-00]      [3000,5000,8000,10000]    [0,2000,3000,2000]
1          data-1                   [node-02,node-01]                 [4000,6000]              [0,2000]
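
The travel times can also be computed on the flat frame before aggregating, which avoids the explicit loop. This is only a minimal sketch on the sample data above; the names `flat` and `traveltime` are illustrative and do not come from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'dataId': ['data-0', 'data-0', 'data-1', 'data-1', 'data-0', 'data-0'],
    'nodeId': ['node-01', 'node-02', 'node-02', 'node-01', 'node-01', 'node-00'],
    'tickDatetime': [3000, 5000, 4000, 6000, 8000, 10000],
})

# Sort by time so that the rows of each dataId are in travel order
flat = df.sort_values('tickDatetime').reset_index(drop=True)

# Difference to the previous tick of the same dataId; the first tick of
# each dataId has no predecessor (NaN), which is replaced by 0
flat['traveltime'] = (
    flat.groupby('dataId')['tickDatetime'].diff().fillna(0).astype(int)
)

# Collect the per-row values into one list per dataId
routes = flat.groupby('dataId').agg(
    {'nodeId': list, 'tickDatetime': list, 'traveltime': list}
)
print(routes)
```

Here `groupby(...).diff()` does the same job as the hand-written loop, one dataId at a time.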

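Now I would like my routes to be split whenever a travel time reaches or exceeds a certain threshold.

For example, with a threshold of 3000, I would like my routes dataframe to look like this: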

           dataId   routeId              nodeId    tickDatetime    traveltimes
0          data-0         0   [node-01,node-02]      [3000,5000]      [0,2000]
1          data-0         1   [node-01,node-00]     [8000,10000]      [0,2000]
2          data-1         0   [node-02,node-01]      [4000,6000]      [0,2000]

How can I do this with pandas?


Edit:

I managed to solve my problem:

def split_routes(row, threshold=3000):
    nodes = row['nodeId']
    datetimes = row['tickDatetime']
    traveltimes = row['traveltimes']

    rows = []
    route_id = 0
    route_nodes, route_datetimes, route_traveltimes = [], [], []
    for i in range(len(traveltimes)):
        if traveltimes[i] >= threshold:
            # Route route_id completed: store it and start a new one with fresh lists
            # (row.name is the dataId, because routes is indexed by dataId)
            rows.append({'dataId': row.name, 'routeId': route_id, 'nodeId': route_nodes,
                         'tickDatetime': route_datetimes, 'traveltimes': route_traveltimes})
            route_id += 1
            route_nodes, route_datetimes, route_traveltimes = [], [], []
            traveltime = 0  # the first node of a new route has no travel time
        else:
            traveltime = traveltimes[i]
        route_nodes.append(nodes[i])
        route_datetimes.append(datetimes[i])
        route_traveltimes.append(traveltime)

    # Store the last route
    rows.append({'dataId': row.name, 'routeId': route_id, 'nodeId': route_nodes,
                 'tickDatetime': route_datetimes, 'traveltimes': route_traveltimes})

    return pd.DataFrame(rows)

splitted_routes_array = []
for index, row in routes.iterrows():
    splitted_routes_array.append(split_routes(row))

splitted_routes = pd.concat(splitted_routes_array, ignore_index=True)

1 answer
User
Answer #1 · posted 2024-10-01 02:36:07
import pandas as pd

df = pd.DataFrame({
    'dataId': ['data-0', 'data-0', 'data-1', 'data-1', 'data-0', 'data-0'],
    'nodeId': ['node-01', 'node-02', 'node-02', 'node-01', 'node-01', 'node-00'],
    'tickDatetime': [3000, 5000, 4000, 6000, 8000, 10000]})

append_ = lambda x: list(x)

# Collect the values of each dataId into lists. Rows keep their original order,
# so sort by tickDatetime first if the input is not already ordered per dataId.
df_2 = pd.DataFrame()
df_2['nodeId'] = df.groupby('dataId')['nodeId'].apply(append_)
df_2['tickDatetime'] = df.groupby('dataId')['tickDatetime'].apply(append_)
print(df_2)

Output:

                                  nodeId               tickDatetime
dataId                                                                 
data-0  [node-01, node-02, node-01, node-00]  [3000, 5000, 8000, 10000]
data-1                    [node-02, node-01]               [4000, 6000]
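
The answer above only covers the aggregation step. For the split itself, one possible approach is to number the routes on the flat frame with a per-dataId cumulative sum and aggregate afterwards. The following is a rough sketch under the question's sample data and threshold; `THRESHOLD`, `flat`, `is_new_route` and `traveltime` are illustrative names, and a gap equal to the threshold is assumed to start a new route, as in the question's expected output.

```python
import pandas as pd

df = pd.DataFrame({
    'dataId': ['data-0', 'data-0', 'data-1', 'data-1', 'data-0', 'data-0'],
    'nodeId': ['node-01', 'node-02', 'node-02', 'node-01', 'node-01', 'node-00'],
    'tickDatetime': [3000, 5000, 4000, 6000, 8000, 10000],
})

THRESHOLD = 3000  # assumed: a gap of this size or more starts a new route

# Sort by time, then compute the gap to the previous tick of the same dataId
flat = df.sort_values('tickDatetime').reset_index(drop=True)
flat['traveltime'] = (
    flat.groupby('dataId')['tickDatetime'].diff().fillna(0).astype(int)
)

# Each large gap opens a new route; counting the gaps seen so far
# within a dataId yields a route number starting at 0
is_new_route = flat['traveltime'] >= THRESHOLD
flat['routeId'] = is_new_route.astype(int).groupby(flat['dataId']).cumsum()

# The first node of a new route has no travel time within that route
flat.loc[is_new_route, 'traveltime'] = 0

# One row per (dataId, routeId) with the values collected into lists
splitted_routes = (
    flat.groupby(['dataId', 'routeId'])
        .agg({'nodeId': list, 'tickDatetime': list, 'traveltime': list})
        .reset_index()
)
print(splitted_routes)
```

Working on the flat frame keeps every step a column operation, so no row-by-row iteration over the aggregated lists is needed.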
