Python crawler: need help with my algorithm

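**Added a summary of the problem at the end of the post**

I wrote a crawler that fetches and parses URLs.

In the first version, to get to the next valid page I incremented the URL ID and saved the invalid IDs to a file; valid URLs were passed to a parser that extracted the content I needed. After a while I noticed that most valid IDs are separated from the previous one by a recurring difference.

I did some statistics and ended up with this list of differences, [8,18,7,17,6,16,5,15], ordered from most to least frequent.

So I <a href="https://stackoverflow.com/questions/6809402/python-maximum-recursion-depth-exceeded-while-calling-a-python-object">changed my code</a> to:

<pre><code>def checkNextID(ID):
    numOfRuns = 0
    global curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(7) # sleep every 10 iterations
                numOfRuns = 0
            if isValid(ID + 8):
                parseHTML(curRes, ID)
                ID = ID + 8
            elif isValid(ID + 18):
                parseHTML(curRes, ID)
                ID = ID + 18
            elif isValid(ID + 7):
                parseHTML(curRes, ID)
                ID = ID + 7
            elif isValid(ID + 17):
                parseHTML(curRes, ID)
                ID = ID + 17
            elif isValid(ID+6):
                parseHTML(curRes, ID)
                ID = ID + 6
            elif isValid(ID + 16):
                parseHTML(curRes, ID)
                ID = ID + 16
            elif isValid(ID + 5):
                parseHTML(curRes, ID)
                ID = ID + 5
            elif isValid(ID + 15):
                parseHTML(curRes, ID)
                ID = ID + 15
            else:
                if isValid(ID + 1):
                    parseHTML(curRes, ID)
                ID = ID + 1
        except Exception, e:
            print "something went wrong: " + str(e)
</code></pre>

isValid() is a function that takes an ID plus one of the differences. It returns True if the URL contains what I need and saves the URL's soup object to a global variable named "curRes"; it returns False if the URL does not contain the data I need and saves the ID to "baddfile".

parseHTML is a function that takes the soup object (curRes), parses the data I need, saves it to a CSV file, and then returns True.

In a perfect world this code would be all I need to run over every valid ID (there are about 400K of them in a range of 5M), and it gave me better results in less time (about 50x faster).

BUT, when it reaches a range that contains no valid URLs, my code is very inefficient: on every iteration I crawl the same URLs over and over. That happens because I increase the ID by one to keep moving forward until I find the next valid URL, and then check ID+8, then 18, 17 and so on, which sometimes gives me the same URLs I already checked in the previous iteration.

So I changed the code to keep a set of invalid URLs that I will not check again, but I could not get it to work. I have been breaking my head over it for hours and it still does not work properly.

This is my new function: [code block not preserved]

I save every invalid ID to a set, and before each call to isValid() I check whether I have already tried that ID. If not, I call isValid(); otherwise the ID is increased by one.

This is what the bad ID file looks like:

<pre><code>513025328 513025317 513025327 513025316 513025326 513025312 513025320 513025330 513025319 513025329 513025318 513025328 513025317 513025327 513025313 513025321 513025331 513025320 513025330 513025319 513025329 513025318 513025328 513025314 513025322 513025332 513025321 513025331 513025320 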
513025330 513025319 513025329 513025315 513025323 513025333 513025322 513025332 513025321 513025331 513025320 513025330 513025316 513025324 513025334 513025323 513025333 513025322 513025332 513025321 513025331 513025317 513025325 513025335 513025324 513025334 513025323 513025333 513025322 513025332 513025318 513025326 513025336 513025325 513025335 513025324 513025334 513025323 513025333 513025319 513025327 513025337 513025326 513025336 513025325 513025335 513025324 513025334 513025320 513025328 513025338 513025327 513025337 513025326 513025336 513025325 513025335 513025321 513025329 513025339 513025328 513025338 513025327 513025337 513025326 513025336 513025322 513025330 513025340 513025329 513025339 513025328 513025338 513025327 513025337 513025323 513025331 513025341 513025330 513025340 513025329 513025339 513025328 513025338 513025324 513025332 513025342 513025331 513025341 513025330 513025340 513025329 513025339 513025325 513025333 513025343 513025332 513025342 513025331 513025341 513025330 513025340 513025326 513025334 513025344 513025333 513025343 513025332 513025342 513025331 513025341 513025327 513025335 513025345 513025334 513025344 513025333 513025343 513025332 513025342 513025328 513025336 513025346 513025335 513025345 513025334 513025344 513025333 513025343 513025329 513025337 513025347 513025336 513025346 513025335 513025345 513025334 513025344 513025330 513025338 513025348 513025337 513025347 513025336 513025346 513025335 513025345 513025331 513025339 513025349 513025338 513025348 513025337 513025347 513025336 513025346 513025332 513025340 513025350 513025339 513025349 513025338 513025348 513025337 513025347 513025333 513025341 513025351 513025340 513025350 513025339 513025349 513025338 513025348 513025334 513025342 513025352 513025341 513025351 513025340 513025350 513025339 513025349 513025335 513025343 513025353 513025342 513025352 513025341 513025351 513025340 513025350 513025336 513025344 513025354 513025343 513025353 513025342 513025352 513025341 
513025351 513025337 513025345 513025355 513025344 513025354 513025343 513025353 513025342 513025352 513025338 513025346 513025356 513025345 513025355 513025344 513025354 513025343 513025353 513025339 513025347 513025357 513025346 513025356 513025345 513025355 513025344 513025354 513025340 513025348 513025358 513025347 513025357 513025346 513025356 513025345 513025355 513025341 513025349 513025359 513025348 513025358 513025347 513025357 513025346 513025356 513025342 513025350 513025360 513025349 513025359 513025348 513025358 513025347 513025357 513025343 513025351 513025361 513025350 513025360 513025349 513025359 513025348 513025358 513025344 513025352 513025362 513025351 513025361 513025350 513025360 513025349 513025359 513025345 513025353 513025363 513025352 513025362 513025351 513025361 513025350 513025360 513025346 513025354 513025364 513025353 513025363 513025352 513025362 513025351 513025361 513025347 513025355 513025365 513025354 513025364 513025353 513025363 513025352 513025362 513025348 513025356 513025366 513025355 513025365 513025354 513025364 513025353 513025363 513025349 513025357 513025367 513025356 513025366 513025355 513025365 513025354 513025364 513025350 513025358 513025368 513025357 513025367 513025356 513025366 513025355 513025365 513025351 513025359 513025369 513025358 513025368 513025357 513025367 513025356 513025366 513025352 513025360 513025370 513025359 513025369 513025358 513025368 513025357 513025367 513025353 513025361 513025371 513025360 513025370 513025359 513025369 513025358 513025368 513025354 513025362 513025372 513025361 513025371 513025360 513025370 513025359 513025369 513025355 513025363 513025373 513025362 513025372 513025361 513025371 513025360 513025370 513025356 513025364 513025374 513025363 513025373 513025362 513025372 513025361 513025371 513025357 513025365 513025375 513025364 513025374 513025363 513025373 513025362 513025372 513025358 513025366 513025376 513025365 513025375 513025364 513025374 513025363 513025373 513025359 
513025367 513025377 513025366 513025376 513025365 513025375 513025364 513025374 513025360 513025368 513025378 513025367 513025377 513025366 513025376 513025365 513025375 513025361 513025369 513025379 513025368 513025378 513025367 513025377 513025366 513025376 513025362 513025370 513025380 513025369 513025379 513025368 513025378 513025367 513025377 513025363 513025371 513025381 513025370 513025380 513025369 513025379 513025368 513025378 513025364 513025372 513025382 513025371 513025381 513025370 513025380 513025369 513025379 513025365 513025373 513025383 513025372 513025382 513025371 513025381 513025370 513025380 513025366 513025374 513025384 513025373 513025383 513025372 513025382 513025371 513025381 513025367 513025375 513025385 513025374 513025384 513025373 513025383 513025372 513025382 513025368 513025376 513025386 513025375 513025385 513025374 513025384 513025373 513025383 513025369 513025377 513025387 513025376 513025386 513025375 513025385 513025374 513025384 513025370 513025378 513025388 513025377 513025387 513025376 513025386 513025375 513025385 513025371 513025379 513025389 513025378 513025388 513025377 513025387 513025376 513025386 513025372 513025380 513025390 513025379 513025389 513025378 513025388 513025377 513025387 513025373 513025381 513025391 513025380 513025390 513025379 </code></pre>

As you can see, it is not working. I know there is a flaw in the whole design, but I can't find it, and I would really appreciate your help.

Summary of the problem:

I have a diff list, [8,18,7,17,6,16,5,15]. The program starts from a given id, and each time the next id I need to check is id + diff[i] (starting at i = 0).

If (id + diff[i]) is not a valid id, I check the next candidate, (id + diff[i+1]).

If none of the candidates (id + diff[i..n]) is valid in that iteration, I increase id by 1 and check whether id + 1 is a valid id. If it is not, I check id + diff[i..n] again, and so on until I find the next valid id.

In every iteration I check IDs that were already checked in the previous iteration, which takes a lot of time and is inefficient. I need to avoid checking IDs that were already checked while I keep increasing the ID until I find the next valid one.

For example, take id = 1 (a valid id) and diff = [8,18,7,17,6,16,5,15]. The iterations look like this (the ids in bold are the ones I could avoid checking):

first, id = 1: 9, 19, 8, 18, 7, 17, 6, 16, 2

second, id = 2: 10, 20, **9**, **19**, **8**, **18**, **7**, **17**, 3

third, id = 3: 11, 21, **10**, **20**, **9**, **19**, **8**, **18**, 4

fourth, id = 4: 12, 22. BINGO, the next valid id is 22!

This cost me 29 requests instead of 17. It is a small example; in practice I have ranges of 300-600 ids after the last valid id.

I can't get my code to avoid checking previously checked IDs in a smart and efficient way.

Thanks!
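The request counts in the summary can be reproduced with a small simulation. This is only a sketch in Python 3 (the snippets in this thread are Python 2), and it is not the asker's missing function: `count_requests`, `valid_ids`, `last_id` and `skip_checked` are hypothetical names, and a set lookup stands in for the real HTTP request made by isValid().

```python
DIFFS = [8, 18, 7, 17, 6, 16, 5, 15]


def count_requests(start, valid_ids, last_id, diffs=DIFFS, skip_checked=False):
    """Search forward from a valid `start` for the next valid id.

    Returns (next_valid_id, requests); next_valid_id is None if nothing
    valid is found before `current` passes `last_id`.
    """
    checked = set()   # ids already requested (only consulted when skip_checked)
    requests = 0
    current = start
    while current <= last_id:
        # One iteration: current + each diff in order, then current + 1 as fallback.
        for candidate in [current + d for d in diffs] + [current + 1]:
            if skip_checked and candidate in checked:
                continue              # already requested in an earlier iteration
            checked.add(candidate)
            requests += 1             # stands in for one HTTP request
            if candidate in valid_ids:
                return candidate, requests
        current += 1
    return None, requests


print(count_requests(1, {1, 22}, last_id=100))                     # (22, 29)
print(count_requests(1, {1, 22}, last_id=100, skip_checked=True))  # (22, 17)
```

With the set enabled, each iteration after the first only requests the ids that are new to it, which is where the drop from 29 to 17 requests comes from.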

1 Answer

An identifier only needs to be declared <code>global</code> when it lives outside the function and you want the function to modify (rebind) it.

Therefore, declaring lastResult and curRes as <code>global</code> in checkNextID() is an aberration:

<ul>
<li>the first, lastResult, because it is a constant in the complete code. The best approach is to give checkNextID() a parameter lastResult whose default argument is lastResult.</li>
<li>the second, curRes, because checkNextID() never modifies that identifier.</li>
</ul>

Defining curRes as <code>global</code> in isValid() is also bad practice: 1) the new value of curRes is sent from inside isValid() to the outside; 2) then the program searches outside checkNextID() for the value of curRes. That is a weird and useless detour. You could simply let curRes be a free variable in checkNextID() (see the <a href="http://docs.python.org/reference/executionmodel.html" rel="nofollow">doc</a>): the function automatically looks outside to resolve the name and get its value.

Personally, I prefer to reorganize the general algorithm. In the code below, curRes is a local object that takes its value directly from the return value of isValid(). This requires redefining isValid(): in my code, isValid() returns either the soup object or False.

I hope I have understood what you need. Tell me what is wrong with my approach.

<pre><code>import time

def checkNextID(ID, lastResult = lastResult,
                diff = [0,1,5,6,7,8,15,16,17,18]):
    runs = 0
    maxdiff = max(diff)
    # append the remaining offsets so every ID in the window gets tried once
    diff.extend(x for x in xrange(maxdiff) if x not in diff)
    while True:
        for i in diff:
            if ID+i==lastResult:
                break
            runs += 1
            if runs % 10 == 0:
                time.sleep(6)
            curRes = isValid(ID+i)
            if curRes:
                parseHTML(curRes, ID+i)
                ID = ID + i
                break
        else:
            runs += 1
            ID += maxdiff + 1
        if ID==lastResult:
            break

def isValid(ID, urlhead = urlPath):
    # this function returns either False OR a BeautifulSoup instance
    address = urlhead + str(ID)
    try:
        page = getPAGE(address)
        if page == False:
            return False
    except Exception, e:
        print "An error occurred in the first Exception block of isValid : " + str(e) +' address: ' + address
        return False
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError, e:
            print "An error occurred in the second Exception block of isValid : " + str(e) +' address: ' + address
            return False
        try:
            companyID = soup.find('span',id='lblCompanyNumber').string
            if (companyID == None): #if lblCompanyNumber is None we can assume that we don't have the content we want, save in the bad log file
                saveToCsv(ID, isEmpty = True)
                return False
            else:
                return soup #we have the data we need, return the soup obj
        except Exception, e:
            print "Error while parsing this page, third exception block: " + str(e) + ' id: ' + address
            return False
</code></pre>

Moreover, to speed up your program:

<ul>
<li>you should use the regex tool (module re) instead of BeautifulSoup, which is roughly 10 times slower than using a regex</li>
<li>you should not define and use all these functions in checkNextID (saveToCSV, parseHTML, isValid): each function call costs extra time compared to inline code</li>
</ul>

<h2>Final edit</h2>

To conclude this long study of your problem, I ran a benchmark. The code and results below show that my intuition was right: my code #2 takes at least 20% less time to run than your code #1.

Your code #1: [code block not preserved]

My code #2:

<pre><code>from time import clock

lastResult = 200

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    maxdiff = max(diff)
    others = [x for x in xrange(1,maxdiff) if x not in diff]
    lastothers = others[-1]
    li = []
    while True:
        if ID>lastResult:
            break
        else:
            curRes = isValid(ID)
            if curRes:
                li.append(ID)
                while True:
                    for i in diff:
                        curRes = isValid(ID+i)
                        if curRes:
                            li.append(ID+i)
                            ID += i
                            break
                    else:
                        for j in others:
                            if ID+j>lastResult:
                                ID += j
                                break
                            curRes = isValid(ID+j)
                            if curRes:
                                li.append(ID+j)
                                ID += j
                                break
                        if j==lastothers:
                            ID += maxdiff + 1
                            break
                        elif ID>lastResult:
                            break
            else:
                ID += 1
    return li

def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98,114,129,137,154,166,175,180,184,200)):
    return ID in valid_ones

te = clock()
for i in xrange(10000):
    checkNextID(0)
print clock()-te,'seconds'
print checkNextID(0)
</code></pre>

Results:

<pre><code>your code
0.398804596674 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]

my code
0.268061164198 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]
</code></pre>

0.268061164198 / 0.398804596674 = 67.2%

I also tried lastResult = 100 and got 72%. With lastResult = 480 I got 80%.
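The two ideas in this thread, having isValid() return its result instead of writing to a global and never requesting the same ID twice, can be combined in one short sketch. It is written for Python 3 and is not the answerer's code: `scan_ids`, `probe` and `first_valid` are hypothetical names, and the `is_valid` callback is assumed to return something truthy (for example a parsed page) for a valid ID.

```python
DIFFS = (8, 18, 7, 17, 6, 16, 5, 15)


def scan_ids(start, last_id, is_valid, diffs=DIFFS):
    """Collect valid ids in [start, last_id], requesting each id at most once.

    `is_valid(candidate)` is a caller-supplied check (hypothetical here);
    any truthy return value, such as a parsed page, counts as valid.
    Returns (found_ids, number_of_requests).
    """
    checked = set()

    def probe(candidate):
        # Out-of-range or already requested ids cost no request.
        if candidate > last_id or candidate in checked:
            return False
        checked.add(candidate)
        return bool(is_valid(candidate))

    def first_valid(low):
        # Linear fallback: first valid, not-yet-requested id at or above `low`.
        return next((c for c in range(low, last_id + 1) if probe(c)), None)

    found = []
    current = first_valid(start)
    while current is not None:
        found.append(current)
        # Try the frequent differences first, in priority order.
        hit = next((current + d for d in diffs if probe(current + d)), None)
        current = hit if hit is not None else first_valid(current + 1)
    return found, len(checked)


VALID = {1, 9, 17, 25, 30, 50, 52, 60, 83, 97, 98, 114, 129, 137,
         154, 166, 175, 180, 184, 200}   # valid_ones from the benchmark above
found, requests = scan_ids(0, 200, lambda i: i in VALID)
print(found)
print(requests, "requests for", len(found), "valid ids")
```

Like the benchmarked code, this jumps straight to the first matching difference, so a valid ID lying between the current ID and the hit (97, 175 and 180 in this data) is never visited. Whether that trade-off is acceptable depends on how reliable the difference list is for the real site.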
