如何最好地通过一个大的时间戳事件列表来解析并发条目?

2024-09-23 16:28:17 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个更大(150000)的条目列表,这些条目在开始和结束时都有时间戳。我试图找出最大并发事件数的时间。我喜欢的语言和例子是python

dictionary EVENTS保存数据,tag是事件的id,Start和End是datetime对象,分别是每个实例的开始时间和结束时间,因此:

EVENTS[tag][end][start] = [list of occurrences at that start/end time stamp]
endkey = EVENTS[tag].keys()
endkey.sort()
peak = 0
for end in endkey:
    endentrykey = EVENTS[tag].keys()
    endentrykey.sort()
    for endtime in endentrykey:
        if endtime < end:   #  We can disregard entries that ended before the event
            break
        startentrykey = EVENTS[tag][item].keys()
        startentrykey.sort()
        for starttime in startentrykey:
            if starttime > end: # ignore events that started after the event ended
               break  
            peak = len(EVENT[tag][endtime][starttime])

我没有尝试过多线程,但我强烈怀疑我是CPU限制

有人能提出一个更好的算法来实现这一点吗


Tags: inforthattag时间条目keysevents
1条回答
网友
1楼 · 发布于 2024-09-23 16:28:17

您可以在python评测中查看这个问题的答案question,以帮助您识别瓶颈。有一点排序正在进行,这可能是次优的

如果您的事件数据是按开始时间排序的,那么您应该能够按顺序遍历事件,并向前和向后查看以确定重叠事件的计数

这里有一个例子,我的意思是,我生成150000个随机持续时间的连续开始事件。然后以上面描述的方式遍历事件,并打印最具并发性的前十个事件

>python py_counter.py
(7548: start 1546268748.0, duration: 89.0, 47)
(7547: start 1546268747.0, duration: 8.0, 46)
(7546: start 1546268746.0, duration: 86.0, 45)
(7545: start 1546268745.0, duration: 36.0, 44)
(7544: start 1546268744.0, duration: 21.0, 43)
(59089: start 1546320289.0, duration: 25.0, 43)
(7543: start 1546268743.0, duration: 75.0, 42)
(59088: start 1546320288.0, duration: 96.0, 42)
(104116: start 1546365316.0, duration: 70.0, 42)
(7542: start 1546268742.0, duration: 94.0, 41)
Finished 150,000 events in 5.1 seconds

代码如下:

from datetime import datetime, timedelta
from dataclasses import dataclass
from collections import Counter
from random import randint
import time


@dataclass(eq=True, frozen=True)  # make hashable
class Event:
    """Track event durations"""
    tag: int
    start: datetime
    end: datetime

    def __repr__(self):
        return f"{self.tag}: start {self.start.timestamp()}, duration: {self.end.timestamp() - self.start.timestamp()}"

start_time = time.time()
# create some sequentially starting events with random durations
events = []
for i in range(150000):
    start = datetime(2019, 1, 1) + timedelta(seconds=i)
    end = start + timedelta(seconds=randint(1, 100))
    events.append(Event(i, start, end))

concurrencies = Counter()
#events is a list sorted by start-time
for i, event in enumerate(events):
    # look back
    j = i - 1
    while j > 0 and event.start < events[j].end:
        j -= 1
        concurrencies[event] += 1
    # look forward
    j = i + 1
    max_index = len(events) - 1
    while j < max_index and event.end < events[i].start:
        j += 1
        concurrencies[event] += 1
# print events with highest number of concurrencies
for event in concurrencies.most_common(10):
    print(event)
print(f"Finished {len(events):,d} events in {time.time() - start_time:.1f} seconds")

如果有需要澄清的地方,请告诉我

相关问题 更多 >