Pandas: cascading partitioning on multiple keys

nodename  ip            <otherfields>
amelia    192.168.23.8  <...>
boris     10.8.45.3     <...>
boris     192.168.67.4  <...>
clyde     192.168.45.3  <...>
darwin    192.168.67.4  <...>
ellen     192.168.23.8  <...>

        nodename  ip            <otherfields>
clump1: amelia    192.168.23.8  <...>
        ellen     192.168.23.8  <...>
clump2: boris     10.8.45.3     <...>
        boris     192.168.67.4  <...>
        darwin    192.168.67.4  <...>
clump3: clyde     192.168.45.3  <...>

1 answer

User

#1 · Posted 2024-09-29 21:54:28

For posterity:

I ended up using the following algorithm, which loops over the dataset:

1. Select first unmatched row currently in the dataset and use that to initialise sets of search keys.

2. Iteratively select all rows matching the keys, and rebuild the sets of search keys.

3. When the sets of search keys stabilise, write a marker field so the records are not selected again at step 1 of the algorithm.

4. Repeat from step 1.
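The four steps above can be sketched end to end as follows. This is only a minimal illustration, not the answerer's actual code: the function name `assign_groups`, the `-1` "unassigned" marker, and the sample rows (modelled on the clumps in the question) are assumptions, and the DataFrame index is assumed to be unique. The guard for a seed row whose keys are all missing is also an addition; it gives such a row a group of its own so the outer loop always makes progress.

```python
import pandas as pd


def assign_groups(df, keys=("nodename", "ip"), marker="groupID"):
    """Label rows linked, directly or transitively, by any shared key value."""
    df = df.copy()
    df[marker] = -1  # -1 means "not yet assigned" (hypothetical marker value)
    group_id = 0
    while (df[marker] == -1).any():
        # Step 1: seed the key sets from the first unmatched row.
        seed = df.loc[df[marker] == -1].iloc[0]
        key_sets = {k: ({seed[k]} if pd.notna(seed[k]) else set()) for k in keys}
        if not any(key_sets.values()):
            # All keys are missing: put the row in a group of its own.
            df.loc[seed.name, marker] = group_id
            group_id += 1
            continue
        # Step 2: select matching rows and rebuild the key sets until stable.
        while True:
            mask = pd.Series(False, index=df.index)
            for k in keys:
                mask |= df[k].isin(key_sets[k])
            new_sets = {k: set(df.loc[mask, k].dropna()) for k in keys}
            if new_sets == key_sets:
                break
            key_sets = new_sets
        # Step 3: write the marker so these rows are skipped at step 1.
        df.loc[mask, marker] = group_id
        group_id += 1  # Step 4: repeat from step 1 with the next group ID.
    return df


sample = pd.DataFrame({
    "nodename": ["amelia", "boris", "boris", "clyde", "darwin", "ellen"],
    "ip": ["192.168.23.8", "10.8.45.3", "192.168.67.4",
           "192.168.45.3", "192.168.67.4", "192.168.23.8"],
})
result = assign_groups(sample)
print(result)
```

With this sample, amelia and ellen end up in one group, both boris rows and darwin in a second, and clyde alone in a third.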

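Below is the code snippet for step 2. It goes into an infinite loop if all the sets of keys are empty, and it is nowhere near as fast as a vectorised/map-reduce/multi-threaded solution would be, but my computer can get through a 20k dataset in 15 minutes, which is "good enough" at this stage.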

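# `all` is the DataFrame (note that the name shadows Python's built-in all()).
# The key sets are initialised from the first unmatched row in step 1 (not shown).
i = 0        # i is the groupID counter
itcount = 0  # number of passes made by the loop
while not (nodename_set == nodename_set_2 and ip_set == ip_set_2):
    nodename_set = nodename_set_2
    ip_set = ip_set_2
    # select every row that matches any key collected so far
    built = all.loc[all['nodename'].isin(nodename_set) | all['ip'].isin(ip_set)]
    # build the comparator keys
    # the lambda filter purges NaNs (NaN is never equal to itself)
    nodename_set_2 = set(filter(lambda v: v == v, built['nodename']))
    ip_set_2 = set(filter(lambda v: v == v, built['ip']))
    itcount = itcount + 1
else:
    # the key sets have stabilised: mark the matching rows with the group ID
    all.loc[all['nodename'].isin(nodename_set) | all['ip'].isin(ip_set), 'groupID'] = i
    i = i + 1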
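On the speed remark: the same grouping can be read as finding connected components, where every distinct nodename and ip is a node and every row is an edge between them. The sketch below swaps in a union-find (disjoint-set) structure, which is a different technique from the loop above and needs only one pass to merge and one pass to label. The names `find` and `label_components` and the sample rows are assumptions made for illustration, and rows whose keys are all missing are labelled `-1` here rather than grouped.

```python
import pandas as pd


def find(parent, x):
    """Return the root of x, shortening the path as it goes."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def label_components(df, keys=("nodename", "ip")):
    """Return a Series of group IDs, one per row, via union-find."""
    parent = {}
    row_nodes = []
    for row in df[list(keys)].itertuples(index=False):
        # Tag each value with its column so a nodename never collides with an ip.
        nodes = [(k, v) for k, v in zip(keys, row) if pd.notna(v)]
        row_nodes.append(nodes)
        for n in nodes:
            parent.setdefault(n, n)
        # Merge every key of this row into the component of its first key.
        for n in nodes[1:]:
            parent[find(parent, n)] = find(parent, nodes[0])
    ids = {}      # root -> group ID, numbered in order of first appearance
    labels = []
    for nodes in row_nodes:
        if not nodes:
            labels.append(-1)  # no usable key on this row
            continue
        root = find(parent, nodes[0])
        labels.append(ids.setdefault(root, len(ids)))
    return pd.Series(labels, index=df.index, name="groupID")


sample = pd.DataFrame({
    "nodename": ["amelia", "boris", "boris", "clyde", "darwin", "ellen"],
    "ip": ["192.168.23.8", "10.8.45.3", "192.168.67.4",
           "192.168.45.3", "192.168.67.4", "192.168.23.8"],
})
labels = label_components(sample)
print(sample.assign(groupID=labels))
```

The row loop is still plain Python, so this is not vectorised, but it avoids rescanning the whole DataFrame on every pass of the key-set loop.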
