NetworkX never finishes computing betweenness centrality for 2 million nodes

%matplotlib inline
import pymongo
import networkx as nx
import time
import itertools

from multiprocessing import Pool
from pymongo import MongoClient

from sweepy.get_config import get_config

config = get_config()

MONGO_URL = config.get('MONGO_URL')
MONGO_PORT = config.get('MONGO_PORT')
MONGO_USERNAME = config.get('MONGO_USERNAME')
MONGO_PASSWORD = config.get('MONGO_PASSWORD')

client = MongoClient(MONGO_URL, int(MONGO_PORT))

db = client.tweets
db.authenticate(MONGO_USERNAME, MONGO_PASSWORD)

users = db.users
graph = nx.DiGraph()

for user in users.find():
    graph.add_node(user['id_str'])
    for friend_id in user['friends_ids']:
        if friend_id not in graph:
            graph.add_node(friend_id)
        graph.add_edge(user['id_str'], friend_id)

"""
Example of parallel implementation of betweenness centrality using the
multiprocessing module from Python Standard Library.

The function betweenness centrality accepts a bunch of nodes and computes
the contribution of those nodes to the betweenness centrality of the whole
network. Here we divide the network in chunks of nodes and we compute their
contribution to the betweenness centrality of the whole network.
"""


def chunks(l, n):
    """Divide a list of nodes `l` in `n` chunks"""
    l_c = iter(l)
    while 1:
        x = tuple(itertools.islice(l_c, n))
        if not x:
            return
        yield x


def _betmap(G_normalized_weight_sources_tuple):
    """Pool for multiprocess only accepts functions with one argument.
    This function uses a tuple as its only argument. We use a named tuple for
    python 3 compatibility, and then unpack it when we send it to
    `betweenness_centrality_source`
    """
    return nx.betweenness_centrality_source(*G_normalized_weight_sources_tuple)


def betweenness_centrality_parallel(G, processes=None):
    """Parallel betweenness centrality function"""
    p = Pool(processes=processes)
    node_divisor = len(p._pool)*4
    node_chunks = list(chunks(G.nodes(), int(G.order()/node_divisor)))
    num_chunks = len(node_chunks)
    bt_sc = p.map(_betmap,
                  zip([G]*num_chunks,
                      [True]*num_chunks,
                      [None]*num_chunks,
                      node_chunks))

    # Reduce the partial solutions
    bt_c = bt_sc[0]
    for bt in bt_sc[1:]:
        for n in bt:
            bt_c[n] += bt[n]
    return bt_c


print("Computing betweenness centrality for:")
print(nx.info(graph))
start = time.time()
bt = betweenness_centrality_parallel(graph, 2)
print("\t\tTime: %.4F" % (time.time()-start))
print("\t\tBetweenness centrality for node 0: %.5f" % (bt[0]))

1 answer

User

#1 · Posted 2024-09-27 07:32:36

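TL/DR: Betweenness centrality is a very slow calculation, so you probably want to use an approximate measure that considers only a subset of myk nodes, where myk is some number much smaller than the number of nodes in the network but still large enough to be statistically meaningful. NetworkX has an option for this: betweenness_centrality(G, k=myk).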

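I'm not at all surprised that it takes a long time. Betweenness centrality is a slow calculation. The algorithm NetworkX uses is O(VE), where V is the number of vertices and E is the number of edges. In your case VE = 10^13. I expect importing the graph to take O(V+E) time, so if that already takes long enough for you to notice it isn't instantaneous, then O(VE) is going to be painful.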

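If a reduced network with 1% of the nodes and 1% of the edges (that is, 20,000 nodes and 50,000 edges) takes time X, then the calculation you want takes 10,000X, because V and E each grow by a factor of 100. If X is 1 second, the full calculation takes close to 3 hours, which I think is incredibly optimistic (see my test below). So before you decide something is wrong with your code, run it on some smaller networks and use those timings to estimate the run time for your network.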
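The extrapolation above can be sketched in a few lines. This is only back-of-the-envelope arithmetic under the assumption that run time scales exactly as O(VE); the function name `estimate_full_runtime` is made up for illustration.

```python
def estimate_full_runtime(sample_seconds, node_scale, edge_scale):
    """Extrapolate an O(V*E) run time from a scaled-down sample.

    node_scale and edge_scale say how many times larger the full graph is
    than the sample (100 each for a 1% sample).
    """
    return sample_seconds * node_scale * edge_scale


# A 1% sample that takes 1 second:
seconds = estimate_full_runtime(1.0, node_scale=100, edge_scale=100)
hours = seconds / 3600
print("estimated: %.0f seconds (about %.1f hours)" % (seconds, hours))
```

Timing a few samples of different sizes first shows whether the O(VE) assumption holds on your machine before you trust the estimate.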

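A good alternative is to use an approximate measure. The standard betweenness measure considers every pair of nodes and the paths between them. NetworkX offers an alternative that takes a random sample of k nodes and then finds the shortest paths between those k nodes and all other nodes in the network. I believe this should bring the run time down to O(kE).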

So what you want is

betweenness_centrality(G, k=k)
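To make the sampling idea concrete, here is a minimal pure-Python sketch for unweighted graphs: it runs a Brandes-style breadth-first accumulation from k randomly chosen source nodes only, then scales the totals by n/k. The function `approx_betweenness` and the adjacency-dict input are assumptions made for illustration; this is not NetworkX's actual implementation, and the scores are unnormalized counts over ordered (source, target) pairs.

```python
import random
from collections import deque


def approx_betweenness(adj, k, seed=None):
    """Estimate betweenness from k sampled sources.

    adj maps each node to an iterable of its neighbours (for an undirected
    graph, list every edge in both directions).
    """
    rng = random.Random(seed)
    nodes = list(adj)
    sources = rng.sample(nodes, k)
    score = dict.fromkeys(nodes, 0.0)

    for s in sources:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Walk back from the farthest nodes, accumulating dependencies.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]

    # Extrapolate from k sources to all n sources.
    scale = len(nodes) / k
    return {v: value * scale for v, value in score.items()}


if __name__ == "__main__":
    path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(approx_betweenness(path, k=3, seed=0))
```

Each source costs one breadth-first search, so the total work is proportional to k(V+E) rather than V(V+E), which is where the speed-up comes from.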

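If you want some bound on how accurate your result is, you can make several calls with a smallish value of k, check that the results are reasonably close to each other, and then take the average.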
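One way to do that check is sketched below, assuming a NetworkX version whose `betweenness_centrality` accepts the `seed` argument. The helper name `averaged_betweenness` and the choice of reporting the largest per-node gap between runs are illustrative assumptions, not a standard recipe.

```python
import networkx as nx


def averaged_betweenness(G, k, runs=5):
    """Average several k-sample estimates and report how much they disagree."""
    # A different seed per run draws a different random sample of k nodes.
    estimates = [nx.betweenness_centrality(G, k=k, seed=run)
                 for run in range(runs)]
    mean = {}
    max_spread = 0.0
    for v in G:
        values = [est[v] for est in estimates]
        mean[v] = sum(values) / runs
        # Largest gap between any two runs for this node.
        max_spread = max(max_spread, max(values) - min(values))
    return mean, max_spread


if __name__ == "__main__":
    G = nx.fast_gnp_random_graph(200, 5.0 / 200, seed=1)
    mean, spread = averaged_betweenness(G, k=20, runs=5)
    print("largest disagreement between runs:", spread)
```

If the reported spread is large compared with the scores you care about, increase k or the number of runs.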

Here are some quick run-time tests I did, on random graphs with (V, E) = (20, 50), (200, 500), and (2000, 5000):

import time
import networkx as nx

for n in [20, 200, 2000]:
    # G(n, p) with p = 5/n has about 2.5*n edges on average
    G = nx.fast_gnp_random_graph(n, 5./n)
    current_time = time.time()
    a = nx.betweenness_centrality(G)
    print(time.time() - current_time)

>0.00247192382812
>0.133368968964
>15.5196769238

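So on my computer it takes 15 seconds to handle a network that is 0.1% the size of yours. A network as large as yours has 1000 times the nodes and 1000 times the edges, so VE is 10^6 times larger and the calculation would take about 15 million seconds. That is 1.5*10^7 seconds, a little under half of π*10^7 seconds. Since π*10^7 seconds is an incredibly good approximation to the number of seconds in a year, this would take my computer about 6 months.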
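As a sanity check on that estimate, the short sketch below compares the growth between the three measured timings with the 100x per step that O(VE) predicts, then extrapolates the largest timing to the full graph. The timings are the ones printed above; the variable names are just for illustration.

```python
# Timings in seconds for (V, E) = (20, 50), (200, 500), (2000, 5000)
timings = [0.00247192382812, 0.133368968964, 15.5196769238]

# Each step multiplies V and E by 10, so O(VE) predicts a 100x slowdown.
ratios = [later / earlier for earlier, later in zip(timings, timings[1:])]
print("measured growth per step:", ratios)

# The full graph has 1000x the nodes and 1000x the edges of the largest test.
full_seconds = timings[-1] * 1000 * 1000
full_days = full_seconds / 86400
print("extrapolated: %.3g seconds (about %.0f days)" % (full_seconds, full_days))
```

The first step grows by less than 100x, presumably because fixed overhead dominates on tiny graphs, while the second is already slightly above it, so the six-month figure is, if anything, not pessimistic.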

So you'll want to use an approximate algorithm.
