Optimizing how Python reads a large file

{'url': 'http://address.com/document/42/1998', 'referrer': 'http://address.com/search?&q=query1', 'session': '1', 'rank': 2, 'time': 1338447254}
{'url': 'http://address.com/document/55/17', 'referrer': 'http://address.com/search&q=query2', 'session': '1', 'rank': 2, 'time': 13384462462}

def mine(id, tmp_sessions, chunk_file, work_q, result_q, init_qsize):
    #f_chunk = map(eval, codecs.open(chunk_file, "r", encoding="utf-8").readlines())
    f_chunk = codecs.open(chunk_file, "r", encoding="utf-8").readlines()
    while True:
        try:
            k = work_q.get()
            if k == 'STOP':
                work_q.task_done()
                break  # reached end of queue
        except Queue.Empty:
            break
        #with codecs.open(chunk_file, "r", encoding="utf-8") as f_chunk:
        for line in f_chunk:
            #try:
            jlog_nest = dict()
            jlog_nest = eval(line)
            #jlog_nest = json.loads(line)
            #jlog_nest = line
            #jlog_nest = defaultdict(line)
            if jlog_nest["session"] == k:  # If session is the same
                query_nest = prepare_test_cases_lib.extract_query(jlog_nest["referrer"])
                for q in tmp_sessions[k]:
                    if q[0] == query_nest:
                        url = jlog_nest["url"]
                        rank = jlog_nest["rank"]
                        doc_id = prepare_test_cases_lib.extract_document_id(url)
                        # Increase number of hits on that document, and save its rank
                        if doc_id in q[1]:
                            q[1][doc_id][0] += 1
                            q[1][doc_id][1].append(rank)
                        else:
                            q[1][doc_id] = [1, [rank]]
            #except:
            #    print ("error",k)
        result_q.put((k, tmp_sessions[k]))
        work_q.task_done()

1284892 function calls in 76.810 seconds

Ordered by: cumulative time

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       8    0.000    0.000   77.985    9.748 {built-in method exec}
       8    1.607    0.201   77.978    9.747 prepare_hard_test_cases.py:29(mine)
 1254384   75.051    0.000   76.220    0.000 {built-in method eval}
     562    0.008    0.000    0.050    0.000 queues.py:99(put)
       8    0.000    0.000    0.029    0.004 codecs.py:685(readlines)

50205868 function calls (37662028 primitive calls) in 121.494 seconds

Ordered by: cumulative time

          ncalls  tottime  percall  cumtime  percall filename:lineno(function)
               8    0.001    0.000  121.494   15.187 {built-in method exec}
               8    0.008    0.001  121.493   15.187 <string>:1(<module>)
               8    4.935    0.617  121.485   15.186 prepare_hard_test_cases.py:29(mine)
         1254384    5.088    0.000  116.425    0.000 ast.py:39(literal_eval)
         1254384    1.098    0.000   71.432    0.000 ast.py:31(parse)
         1254384   70.333    0.000   70.333    0.000 {built-in method compile}
13798224/1254384   22.996    0.000   39.336    0.000 ast.py:51(_convert)
         7526304    8.539    0.000   23.042    0.000 ast.py:63(<genexpr>)
        25087680    8.371    0.000    8.371    0.000 {built-in method isinstance}
               8    0.001    0.000    0.047    0.006 codecs.py:685(readlines)

51460252 function calls in 45.207 seconds

Ordered by: cumulative time

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       8    0.001    0.000   45.207    5.651 {built-in method exec}
       8    0.003    0.000   45.207    5.651 <string>:1(<module>)
       8    1.701    0.213   45.203    5.650 prepare_hard_test_cases.py:68(mine)
 1254384    5.725    0.000   43.391    0.000 prepare_hard_test_cases.py:36(extractDict)
 6271920   23.433    0.000   37.665    0.000 prepare_hard_test_cases.py:20(extractKeyValue)
18819074   11.308    0.000   11.308    0.000 {method 'find' of 'str' objects}
25092651    2.927    0.000    2.927    0.000 {built-in method len}

30091 function calls in 5.285 seconds

Ordered by: cumulative time

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       8    0.000    0.000    5.285    0.661 {built-in method exec}
       8    0.003    0.000    5.285    0.661 <string>:1(<module>)
       8    0.173    0.022    5.281    0.660 prepare_hard_test_cases.py:68(mine)
     570    0.001    0.000    5.057    0.009 queues.py:113(get)
    2281    3.925    0.002    3.925    0.002 {method 'acquire' of '_multiprocessing.SemLock' objects}
     570    1.133    0.002    1.133    0.002 {method 'recv' of '_multiprocessing.PipeConnection' objects}
       8    0.029    0.004    0.029    0.004 {built-in method load}

1 answer

User

#1 · Posted on 2024-10-02 10:24:31

You should try a parser that is designed for this job instead of eval(); it will probably be faster as well.

eval() is slow, unsafe, and generally a bad idea. If you think you need it, look around first; I promise that 99.99% of the time you don't.
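As one possible illustration of a safer alternative to eval(), the standard library's ast.literal_eval() parses Python literals without executing code. This is a sketch under the assumption that each log line is a Python dict literal like the sample records; the shortened `line` and the helper `is_safe_literal` are hypothetical. Note that the profiles in the question report about 121 s for the literal_eval run versus about 77 s for the eval run, so the dependable gain here is safety rather than speed.

```python
import ast

# A shortened line in the same shape as the sample log records (assumed data).
line = "{'url': 'http://address.com/document/42/1998', 'session': '1', 'rank': 2}"

# literal_eval accepts only Python literals (dicts, lists, strings, numbers, ...),
# so code embedded in the input is rejected instead of being executed.
record = ast.literal_eval(line)


def is_safe_literal(text):
    """Return True if text parses as a plain Python literal (hypothetical helper)."""
    try:
        ast.literal_eval(text)
        return True
    except (ValueError, SyntaxError):
        return False
```

With eval(), a line such as `__import__('os').getcwd()` would be executed; here `is_safe_literal` simply reports it as invalid.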

One more note:

f_chunk = codecs.open(chunk_file, "r", encoding="utf-8").readlines()
...

should be:

with codecs.open(chunk_file, "r", encoding="utf-8") as f_chunk:
    ...

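Files are iterators, so calling readlines() only makes your program less memory-efficient. Using with ensures the file is closed properly once you are done. (Also, since you are on 3.x, you can use open() instead of codecs.open(), as it has been updated to support the latter's extra features.)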
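A minimal sketch of that pattern, assuming Python 3: the function name `count_matching_lines`, the throwaway file, and its contents are all hypothetical and only serve to show lazy iteration inside a with block.

```python
import os
import tempfile


def count_matching_lines(path, needle):
    """Count lines containing needle, reading one line at a time."""
    count = 0
    # In Python 3, open() takes encoding= directly; the with block closes the
    # file even if an exception is raised inside the loop.
    with open(path, "r", encoding="utf-8") as f_chunk:
        for line in f_chunk:  # the file object yields lines lazily
            if needle in line:
                count += 1
    return count


# Small demonstration on a throwaway file (hypothetical data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".log", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("session 1\nsession 2\nsession 1\n")
    demo_path = tmp.name

demo_count = count_matching_lines(demo_path, "session 1")
os.remove(demo_path)
```

A file object is exhausted after one pass. Code that scans the same file once per session key, as the mine function in the question does, would have to reopen the file for each pass or keep the lines in memory deliberately.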

Beyond that, as far as I can tell, every line of your data should be valid JSON, so the json module should work too.
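A short sketch of the json route, with one caveat: json.loads() requires double-quoted keys and strings, so it would reject lines written with single quotes, as the sample records in the question are shown. The example assumes the log is written as strict JSON (for instance with json.dumps() on the producing side); `json_line` and `parse_json_lines` are hypothetical names.

```python
import json

# Hypothetical record in strict JSON form (double-quoted keys and strings).
json_line = '{"url": "http://address.com/document/42/1998", "session": "1", "rank": 2}'
record = json.loads(json_line)


def parse_json_lines(lines):
    """Yield one dict per non-empty line; each line must hold one JSON object."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(parse_json_lines([json_line, "", json_line]))
```

Because `parse_json_lines` is a generator, it can be fed an open file object directly and will parse records one at a time.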
