Python crawler: I need help with my algorithm

Posted 2024-09-27 04:24:42


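**A summary of the question has been added at the end of the post.**

I wrote a crawler that fetches and parses URLs.

In the first version, to get the next valid page, I incremented the URL ID by one, saved invalid IDs to a file, and passed valid URLs to a parser that extracts the content I need. After a while, I noticed that most valid IDs differ from the previous valid ID by one of a few recurring values.

I gathered some statistics and ended up with this list of differences, sorted from most to least frequent: [8,18,7,17,6,16,5,15].

So I changed my code to: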

def checkNextID(ID):
    numOfRuns = 0
    global curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(7) # sleep every 10 iterations
                numOfRuns = 0
            if isValid(ID + 8): 
                parseHTML(curRes, ID)
                ID = ID + 8
            elif isValid(ID + 18):
                parseHTML(curRes, ID)
                ID = ID + 18
            elif isValid(ID + 7):
                parseHTML(curRes, ID)
                ID = ID + 7
            elif isValid(ID + 17):
                parseHTML(curRes, ID)
                ID = ID + 17
            elif isValid(ID+6):
                parseHTML(curRes, ID)
                ID = ID + 6
            elif isValid(ID + 16):
                parseHTML(curRes, ID)
                ID = ID + 16
            elif isValid(ID + 5):
                parseHTML(curRes, ID)
                ID = ID + 5
            elif isValid(ID + 15):
                parseHTML(curRes, ID)
                ID = ID + 15
            else:
                if isValid(ID + 1):
                    parseHTML(curRes, ID)
                ID = ID + 1
        except Exception, e:
            print "something went wrong: " + str(e) 

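isValid() is a function that takes an ID plus one of the differences. If the URL contains the content I need, it returns True and saves the URL's soup object to a global variable named "curRes"; if the URL does not contain the data I need, it returns False and saves the ID to the bad-ID file ("baddfile").

parseHTML is a function that takes the soup object (curRes), parses the data I need, saves it to a CSV file, and then returns True.

In a perfect world this code would be all I need to run over all the valid IDs (there are about 400K of them in a range of 5M), and it does give me better results in less time (50x faster).

However, when it reaches a range that contains no valid URLs, my code becomes very inefficient: on every iteration I crawl the same URLs over and over. That happens because I increment the ID by one to keep moving forward until I find the next valid URL, and then I check ID+8, then 18, 17 and so on, which sometimes gives me the same URLs I already checked in the previous iteration.

So I tried to change the code so that it keeps a set of invalid URLs that I avoid checking again. I couldn't get it to work; I racked my brain for hours and it still doesn't work properly.

This is my new function:

[The code block was not preserved in the source.]

I save every invalid ID to a set, and before each call to isValid() I check whether I have already tried that ID. If not, I call isValid(); otherwise, the ID is increased by one.

This is what the bad-ID file looks like: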

513025328
513025317
513025327
513025316
513025326
513025312
513025320
513025330
513025319
513025329
513025318
513025328
513025317
513025327
513025313
513025321
513025331
513025320
513025330
513025319
513025329
513025318
513025328
513025314
513025322
513025332
513025321
513025331
513025320
513025330
513025319
513025329
513025315
513025323
513025333
513025322
513025332
513025321
513025331
513025320
513025330
513025316
513025324
513025334
513025323
513025333
513025322
513025332
513025321
513025331
513025317
513025325
513025335
513025324
513025334
513025323
513025333
513025322
513025332
513025318
513025326
513025336
513025325
513025335
513025324
513025334
513025323
513025333
513025319
513025327
513025337
513025326
513025336
513025325
513025335
513025324
513025334
513025320
513025328
513025338
513025327
513025337
513025326
513025336
513025325
513025335
513025321
513025329
513025339
513025328
513025338
513025327
513025337
513025326
513025336
513025322
513025330
513025340
513025329
513025339
513025328
513025338
513025327
513025337
513025323
513025331
513025341
513025330
513025340
513025329
513025339
513025328
513025338
513025324
513025332
513025342
513025331
513025341
513025330
513025340
513025329
513025339
513025325
513025333
513025343
513025332
513025342
513025331
513025341
513025330
513025340
513025326
513025334
513025344
513025333
513025343
513025332
513025342
513025331
513025341
513025327
513025335
513025345
513025334
513025344
513025333
513025343
513025332
513025342
513025328
513025336
513025346
513025335
513025345
513025334
513025344
513025333
513025343
513025329
513025337
513025347
513025336
513025346
513025335
513025345
513025334
513025344
513025330
513025338
513025348
513025337
513025347
513025336
513025346
513025335
513025345
513025331
513025339
513025349
513025338
513025348
513025337
513025347
513025336
513025346
513025332
513025340
513025350
513025339
513025349
513025338
513025348
513025337
513025347
513025333
513025341
513025351
513025340
513025350
513025339
513025349
513025338
513025348
513025334
513025342
513025352
513025341
513025351
513025340
513025350
513025339
513025349
513025335
513025343
513025353
513025342
513025352
513025341
513025351
513025340
513025350
513025336
513025344
513025354
513025343
513025353
513025342
513025352
513025341
513025351
513025337
513025345
513025355
513025344
513025354
513025343
513025353
513025342
513025352
513025338
513025346
513025356
513025345
513025355
513025344
513025354
513025343
513025353
513025339
513025347
513025357
513025346
513025356
513025345
513025355
513025344
513025354
513025340
513025348
513025358
513025347
513025357
513025346
513025356
513025345
513025355
513025341
513025349
513025359
513025348
513025358
513025347
513025357
513025346
513025356
513025342
513025350
513025360
513025349
513025359
513025348
513025358
513025347
513025357
513025343
513025351
513025361
513025350
513025360
513025349
513025359
513025348
513025358
513025344
513025352
513025362
513025351
513025361
513025350
513025360
513025349
513025359
513025345
513025353
513025363
513025352
513025362
513025351
513025361
513025350
513025360
513025346
513025354
513025364
513025353
513025363
513025352
513025362
513025351
513025361
513025347
513025355
513025365
513025354
513025364
513025353
513025363
513025352
513025362
513025348
513025356
513025366
513025355
513025365
513025354
513025364
513025353
513025363
513025349
513025357
513025367
513025356
513025366
513025355
513025365
513025354
513025364
513025350
513025358
513025368
513025357
513025367
513025356
513025366
513025355
513025365
513025351
513025359
513025369
513025358
513025368
513025357
513025367
513025356
513025366
513025352
513025360
513025370
513025359
513025369
513025358
513025368
513025357
513025367
513025353
513025361
513025371
513025360
513025370
513025359
513025369
513025358
513025368
513025354
513025362
513025372
513025361
513025371
513025360
513025370
513025359
513025369
513025355
513025363
513025373
513025362
513025372
513025361
513025371
513025360
513025370
513025356
513025364
513025374
513025363
513025373
513025362
513025372
513025361
513025371
513025357
513025365
513025375
513025364
513025374
513025363
513025373
513025362
513025372
513025358
513025366
513025376
513025365
513025375
513025364
513025374
513025363
513025373
513025359
513025367
513025377
513025366
513025376
513025365
513025375
513025364
513025374
513025360
513025368
513025378
513025367
513025377
513025366
513025376
513025365
513025375
513025361
513025369
513025379
513025368
513025378
513025367
513025377
513025366
513025376
513025362
513025370
513025380
513025369
513025379
513025368
513025378
513025367
513025377
513025363
513025371
513025381
513025370
513025380
513025369
513025379
513025368
513025378
513025364
513025372
513025382
513025371
513025381
513025370
513025380
513025369
513025379
513025365
513025373
513025383
513025372
513025382
513025371
513025381
513025370
513025380
513025366
513025374
513025384
513025373
513025383
513025372
513025382
513025371
513025381
513025367
513025375
513025385
513025374
513025384
513025373
513025383
513025372
513025382
513025368
513025376
513025386
513025375
513025385
513025374
513025384
513025373
513025383
513025369
513025377
513025387
513025376
513025386
513025375
513025385
513025374
513025384
513025370
513025378
513025388
513025377
513025387
513025376
513025386
513025375
513025385
513025371
513025379
513025389
513025378
513025388
513025377
513025387
513025376
513025386
513025372
513025380
513025390
513025379
513025389
513025378
513025388
513025377
513025387
513025373
513025381
513025391
513025380
513025390
513025379

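As you can see, it doesn't work. I know there is a flaw in my overall design, but I can't find it, and I would really appreciate your help.

Summary of the question:

I have a diff list [8,18,7,17,6,16,5,15]. The program starts from a given id, and each time the next id I need to check is id+diff[i] (starting with i=0). If (id+diff[i]) is not a valid id, I check the next candidate, (id+diff[i+1]).

If none of the candidates (id+diff[i..n]) in that iteration is valid, I increase id by 1 and check whether id+1 is a valid id; if it is not, I go through id+diff[i..n] again, until I find the next valid id.

In each iteration I end up checking IDs that were already checked in a previous iteration (which takes a lot of time and is inefficient). I need to avoid re-checking IDs and keep incrementing the ID until I find the next valid one.

For example, suppose id=1 (and it is a valid id) and diff=[8,18,7,17,6,16,5,15].

The iterations look like this (IDs in bold are the ones I could avoid checking). First, id=1:

9,19,8,18,7,17,6,16,2

Second, id=2:

10,20,**9**,**19**,**8**,**18**,**7**,**17**,3

Third, id=3:

11,21,**10**,**20**,**9**,**19**,**8**,**18**,4

Fourth, id=4:

12,22: bingo, the next valid ID is 22!

This cost me 29 requests instead of 17. This is a small example; in practice the gap between the last valid id and the next one can be 300-600 ids.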
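
The request counts in this example can be reproduced with a small simulation. This is only an illustrative sketch: `count_requests`, `valid_ids` and `upper_bound` are hypothetical names, a set lookup stands in for the real HTTP request, and it is written in Python 3, whereas the code in this post is Python 2.

```python
DIFFS = [8, 18, 7, 17, 6, 16, 5, 15]


def count_requests(start, valid_ids, diffs=DIFFS, use_cache=False, upper_bound=10000):
    """Return (next_valid_id, number_of_simulated_requests)."""
    requests = 0
    known_invalid = set()
    current = start

    def check(candidate):
        nonlocal requests
        if use_cache and candidate in known_invalid:
            return False          # already requested once, skip it
        requests += 1             # stands in for one HTTP request
        if candidate in valid_ids:
            return True
        known_invalid.add(candidate)
        return False

    while current <= upper_bound:
        # Same order as in the question: every diff first, then id + 1.
        for step in list(diffs) + [1]:
            if check(current + step):
                return current + step, requests
        current += 1
    return None, requests


print(count_requests(1, {1, 22}))                  # (22, 29)
print(count_requests(1, {1, 22}, use_cache=True))  # (22, 17)
```

Remembering the invalid IDs removes the 12 repeated requests in this example; how much it saves on the real site depends on how long the gaps between valid IDs are.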

I can't get my code to avoid re-checking previously checked IDs in a smart and efficient way.

Thanks!


3 answers

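First, you should split the work into two processes: one to determine the valid ids, and another to retrieve the data.

The process that determines valid ids only needs to issue HTTP HEAD requests, so it can run faster than the one that retrieves pages.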
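
A minimal Python 3 sketch of such a HEAD check using only the standard library follows. The helper names `build_head_request` and `page_exists` are hypothetical. It also rests on an assumption the question does not confirm: that the server answers invalid ids with a non-200 status. In the question, validity is decided from the page content, in which case a HEAD request alone cannot tell valid from invalid pages.

```python
import urllib.error
import urllib.request


def build_head_request(url):
    """Build a request that asks only for the headers of `url`."""
    return urllib.request.Request(url, method="HEAD")


def page_exists(url, timeout=10):
    """Return True if the server answers the HEAD request with status 200."""
    try:
        with urllib.request.urlopen(build_head_request(url), timeout=timeout) as response:
            return response.status == 200
    except urllib.error.URLError:
        # Covers HTTP errors (404, 500, ...) as well as connection failures.
        return False
```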

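For checking pages, after you have checked the id increments in diff, add 18 to the id that caused you to start using diff. You could even record the ranges that were only partially checked using diff, then come back at the end of the process and check all of those.

If you cannot skip any ids, then keep a cache of the last n ids checked, where n equals len(diff). Use a ring buffer, something like this: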

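ringbuff = [None] * len(diff)  # the most recently checked ids
nextelem = 0
...
# check before retrieving
if id not in ringbuff:
    # retrieve the page for this id, then remember the id
    ringbuff[nextelem] = id
    nextelem += 1
    if nextelem >= len(ringbuff):  # wrap around to the oldest slot
        nextelem = 0

...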
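
In Python 3 the same ring-buffer idea can be sketched with `collections.deque`, whose `maxlen` argument discards the oldest entry automatically. The class name `RecentIds` is hypothetical, and the buffer size is left as a parameter that needs tuning for the real crawl.

```python
from collections import deque


class RecentIds:
    """Remember only the most recently checked ids."""

    def __init__(self, size):
        self._ids = deque(maxlen=size)  # oldest id is dropped when full

    def add(self, page_id):
        self._ids.append(page_id)

    def __contains__(self, page_id):
        return page_id in self._ids

    def __len__(self):
        return len(self._ids)


recent = RecentIds(size=3)
for page_id in (1, 2, 3, 4):
    if page_id not in recent:
        recent.add(page_id)      # the real crawler would request the page here

print(1 in recent, 4 in recent, len(recent))  # False True 3
```

Membership tests on a deque are linear in its length, which is acceptable for a buffer of a few dozen ids.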

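On the face of it, a simple loop like this should check all the ids:

[The code block was not preserved in the source.]

That would check every possible page. But when you get a hit, you want to read ahead and, if I understand correctly, also partially backtrack. In any case, what you are basically doing is changing the range of ids away from the simple sequence returned by xrange, so I think you need to write a generator and do this instead: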

for id in myrange(1000000):
    checkpage(id)

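You may still want to use a ring buffer, depending on what you do within the range of 18 possible additional hits. If you need to check all the possibilities in diff and then go back to a value lower than the maximum element of diff, a ring buffer in checkpage could be useful.

But the trick is to write myrange().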

def myrange(maxnum):
    global hitfound
    global nextnum
    global diff
    curnum = 0
    while curnum < maxnum:
        yield curnum
        if hitfound:
            nextnum = curnum
            hitnum = curnum
            for e in diff:
                yield hitnum + e
            curnum = nextnum - 1
        curnum += 1

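The three globals allow you to influence the range of ids. If you set hitfound = True inside checkpage(), you make myrange start applying the increments in diff. You can then set nextnum to control where it resumes incrementing after it has applied the diff increments. For example, while checking the diff increments you might decide to set it to 1 greater than the first (or last) hit. Or you could leave it alone and use the ring buffer to make sure you never request any of the diff-increment pages again.

I suggest that you extract the id-incrementing logic and test it separately, as in the code above. Tweak the generator myrange() until it produces the right sequence, and then drop it into your web-scraping program.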
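
One way to test the id logic separately is to pass the validity check in as a function and drive the generator with a fake one. The Python 3 sketch below is a variant that avoids globals, not the myrange() generator itself; `find_valid_ids`, `probe` and `fake_is_valid` are hypothetical names, and the set of valid ids is borrowed from the test data used in another answer in this thread.

```python
def find_valid_ids(start, stop, is_valid, diffs):
    """Yield valid ids in [start, stop), calling is_valid at most once per id."""
    checked = set()

    def probe(candidate):
        if candidate >= stop or candidate in checked:
            return False                 # out of range, or already known invalid
        checked.add(candidate)
        return is_valid(candidate)

    current = start
    while current < stop:
        if not probe(current):
            current += 1                 # plain linear scan between hits
            continue
        yield current
        found = True
        while found:                     # after a hit, jump along the diffs
            found = False
            for step in diffs:
                if probe(current + step):
                    current += step
                    yield current
                    found = True
                    break
        current += 1


requested = []
valid = {1, 9, 17, 25, 50, 52, 60, 83, 97, 98}


def fake_is_valid(page_id):
    requested.append(page_id)            # record every simulated request
    return page_id in valid


found_ids = list(find_valid_ids(0, 101, fake_is_valid, [8, 18, 7, 17, 6, 16, 5, 15]))
print(found_ids)  # [1, 9, 17, 25, 50, 52, 60, 83, 98]
```

No id is requested twice, but 97 is never reported: the jump from 83 to 98 skips over it, which is inherent in jumping by the diff values.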

I think I've got it.

First, the code based on your idea:

import time

lastResult = 100

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    runs = 0
    SEEN = set()

    while True:
        if ID>lastResult:
            print ('\n=========================='
                   '\nID==%s'
                   '\n   ID>lastResult is %s : program STOPS')\
                  % (ID,ID>lastResult,)
            break
        runs += 1
        if runs % 10 == 0:  time.sleep(0.5)
        if ID in SEEN:
            print '        -\nID=='+str(ID)+'  already seen, not examined'
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                print '             \nID=='+str(ID)+'  vaaaalid'
                while True:
                    for i in diff:
                        runs += 1
                        if runs % 10 == 0:  time.sleep(0.5)
                        curRes = isValid(ID+i)
                        SEEN.add(ID+i)
                        if curRes:
                            print '   i==%2s   ID+i==%s   valid' % (i,ID+i)
                            ID += i
                            print '             \nID==%s' % str(ID)
                            break
                        else:
                            print '   i==%2s   ID+i==%s   not valid' % (i,ID+i)
                    else:
                        ID += 1
                        break
            else:
                print '             \nID==%s  not valid' % ID
                ID += 1


def isValid(ID, valid_ones = (1,9,17,25,50,52,60,83,97,98)):
    return ID in valid_ones


checkNextID(0)

Result

[The output was not preserved in the source.]

Here is the code based on my idea:

import time

lastResult = 100

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    runs = 0
    maxdiff = max(diff)
    others = [x for x in xrange(1,maxdiff) if x not in diff]
    lastothers = others[-1]
    SEEN = set()

    while True:
        if ID>lastResult:
            print ('\n=========================='
                   '\nID==%s'
                   '\n   ID>lastResult is %s : program STOPS')\
                  % (ID,ID>lastResult,)
            break
        runs += 1
        if runs % 10 == 0:  time.sleep(0.5)
        if ID in SEEN:
            print '        -\nID=='+str(ID)+'  already seen, not examined'
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                print '                  \nID=='+str(ID)+'  vaaaalid'
                while True:
                    for i in diff:
                        runs += 1
                        if runs % 10 == 0:  time.sleep(0.5)
                        curRes = isValid(ID+i)
                        SEEN.add(ID+i)
                        if curRes:
                            print '   i==%2s   ID+i==%s   valid' % (i,ID+i)
                            ID += i
                            print '             \nID==%s' % str(ID)
                            break
                        else:
                            print '   i==%2s   ID+i==%s   not valid' % (i,ID+i)
                    else:
                        for j in others:
                            if ID+j>lastResult:
                                print '\n   j==%2s   %s+%s==%s>lastResult==%s is %s' \
                                      % (j,ID,j,ID+j,lastResult,ID+j>lastResult)
                                ID += j
                                print '\n             \nnow ID==',ID
                                break
                            runs += 1
                            if runs % 10 == 0:  time.sleep(0.5)
                            curRes = isValid(ID+j)
                            SEEN.add(ID+j)
                            if curRes:
                                print '   j==%2s   ID+j==%s   valid' % (j,ID+j)
                                ID += j
                                print '             \nID=='+str(ID)
                                break
                            else:
                                print '   j==%2s   ID+j==%s   not valid' % (j,ID+j)

                        if j==lastothers:
                            ID += maxdiff + 1
                            print '   ID += %s + 1 ==> ID==%s' % (maxdiff,ID)
                            break
                        elif ID>lastResult:
                            print '   ID>lastResult==%s>%s is %s ==> WILL STOP' % (ID,lastResult,ID>lastResult)
                            break

            else:
                print '            -\nID=='+str(ID)+'  not valid'
                ID += 1




def isValid(ID, valid_ones = (1,9,17,25,50,52,60,83,97,98)):
    return ID in valid_ones


checkNextID(0)

Result

            -
ID==0  not valid
                  
ID==1  vaaaalid
   i== 8   ID+i==9   valid
             
ID==9
   i== 8   ID+i==17   valid
             
ID==17
   i== 8   ID+i==25   valid
             
ID==25
   i== 8   ID+i==33   not valid
   i==18   ID+i==43   not valid
   i== 7   ID+i==32   not valid
   i==17   ID+i==42   not valid
   i== 6   ID+i==31   not valid
   i==16   ID+i==41   not valid
   i== 5   ID+i==30   not valid
   i==15   ID+i==40   not valid
   j== 1   ID+j==26   not valid
   j== 2   ID+j==27   not valid
   j== 3   ID+j==28   not valid
   j== 4   ID+j==29   not valid
   j== 9   ID+j==34   not valid
   j==10   ID+j==35   not valid
   j==11   ID+j==36   not valid
   j==12   ID+j==37   not valid
   j==13   ID+j==38   not valid
   j==14   ID+j==39   not valid
   ID += 18 + 1 ==> ID==44
            -
ID==44  not valid
            -
ID==45  not valid
            -
ID==46  not valid
            -
ID==47  not valid
            -
ID==48  not valid
            -
ID==49  not valid
                  
ID==50  vaaaalid
   i== 8   ID+i==58   not valid
   i==18   ID+i==68   not valid
   i== 7   ID+i==57   not valid
   i==17   ID+i==67   not valid
   i== 6   ID+i==56   not valid
   i==16   ID+i==66   not valid
   i== 5   ID+i==55   not valid
   i==15   ID+i==65   not valid
   j== 1   ID+j==51   not valid
   j== 2   ID+j==52   valid
             
ID==52
   i== 8   ID+i==60   valid
             
ID==60
   i== 8   ID+i==68   not valid
   i==18   ID+i==78   not valid
   i== 7   ID+i==67   not valid
   i==17   ID+i==77   not valid
   i== 6   ID+i==66   not valid
   i==16   ID+i==76   not valid
   i== 5   ID+i==65   not valid
   i==15   ID+i==75   not valid
   j== 1   ID+j==61   not valid
   j== 2   ID+j==62   not valid
   j== 3   ID+j==63   not valid
   j== 4   ID+j==64   not valid
   j== 9   ID+j==69   not valid
   j==10   ID+j==70   not valid
   j==11   ID+j==71   not valid
   j==12   ID+j==72   not valid
   j==13   ID+j==73   not valid
   j==14   ID+j==74   not valid
   ID += 18 + 1 ==> ID==79
            -
ID==79  not valid
            -
ID==80  not valid
            -
ID==81  not valid
            -
ID==82  not valid
                  
ID==83  vaaaalid
   i== 8   ID+i==91   not valid
   i==18   ID+i==101   not valid
   i== 7   ID+i==90   not valid
   i==17   ID+i==100   not valid
   i== 6   ID+i==89   not valid
   i==16   ID+i==99   not valid
   i== 5   ID+i==88   not valid
   i==15   ID+i==98   valid
             
ID==98
   i== 8   ID+i==106   not valid
   i==18   ID+i==116   not valid
   i== 7   ID+i==105   not valid
   i==17   ID+i==115   not valid
   i== 6   ID+i==104   not valid
   i==16   ID+i==114   not valid
   i== 5   ID+i==103   not valid
   i==15   ID+i==113   not valid
   j== 1   ID+j==99   not valid
   j== 2   ID+j==100   not valid

   j== 3   98+3==101>lastResult==100 is True

             
now ID== 101
   ID>lastResult==101>100 is True ==> WILL STOP

==========================
ID==101
   ID>lastResult is True : program STOPS

    if ID in SEEN:
        print '        -\nID=='+str(ID)+'  already seen, not examined'
        ID += 1

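The three lines shown just above are present in this code, but the message "already seen" is never printed during execution, and yet the valid IDs detected are the same. That means the set SEEN is not needed in my code.

EDIT 1

Code #1 with instructions that periodically empty SEEN: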

import time

lastResult = 100

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    runs = 0
    SEEN = set()
    while True:
        if ID>lastResult:
            print ('\n=========================='
                   '\nID==%s'
                   '\n   ID>lastResult is %s : program STOPS')\
                  % (ID,ID>lastResult,)
            break
        runs += 1
        if runs % 10 == 0:  time.sleep(0.5)
        if ID in SEEN:
            print '        -\n%s\nID==%s  already seen, not examined' % (SEEN,ID)
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                print '             \n%s\nID==%s  vaaaalid'  % (SEEN,ID)
                while True:
                    for i in diff:
                        runs += 1
                        if runs % 10 == 0:  time.sleep(0.5)
                        curRes = isValid(ID+i)
                        print '   '+str(SEEN)
                        if i==diff[0]:
                            SEEN = set([ID+i])
                        else:
                            SEEN.add(ID+i)
                        if curRes:
                            print '   i==%2s   ID+i==%s   valid' % (i,ID+i)
                            ID += i
                            print '             \nID==%s' % str(ID)
                            break
                        else:
                            print '   i==%2s   ID+i==%s   not valid' % (i,ID+i)
                    else:
                        ID += 1
                        break
            else:
                print '             \n%s\nID==%s  not vaaaaalid' % (SEEN,ID)
                ID += 1


def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98)):
    return ID in valid_ones


checkNextID(0)

Result

             
set([])
ID==0  not vaaaaalid
             
set([])
ID==1  vaaaalid
   set([])
   i== 8   ID+i==9   valid
             
ID==9
   set([9])
   i== 8   ID+i==17   valid
             
ID==17
   set([17])
   i== 8   ID+i==25   valid
             
ID==25
   set([25])
   i== 8   ID+i==33   not valid
   set([33])
   i==18   ID+i==43   not valid
   set([33, 43])
   i== 7   ID+i==32   not valid
   set([32, 33, 43])
   i==17   ID+i==42   not valid
   set([32, 33, 42, 43])
   i== 6   ID+i==31   not valid
   set([32, 33, 42, 43, 31])
   i==16   ID+i==41   not valid
   set([32, 33, 41, 42, 43, 31])
   i== 5   ID+i==30   valid
             
ID==30
   set([32, 33, 41, 42, 43, 30, 31])
   i== 8   ID+i==38   not valid
   set([38])
   i==18   ID+i==48   not valid
   set([48, 38])
   i== 7   ID+i==37   not valid
   set([48, 37, 38])
   i==17   ID+i==47   not valid
   set([48, 37, 38, 47])
   i== 6   ID+i==36   not valid
   set([48, 36, 37, 38, 47])
   i==16   ID+i==46   not valid
   set([36, 37, 38, 46, 47, 48])
   i== 5   ID+i==35   not valid
   set([35, 36, 37, 38, 46, 47, 48])
   i==15   ID+i==45   not valid
             
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==31  not vaaaaalid
             
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==32  not vaaaaalid
             
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==33  not vaaaaalid
             
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==34  not vaaaaalid
        -
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==35  already seen, not examined
        -
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==36  already seen, not examined
        -
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==37  already seen, not examined
        -
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==38  already seen, not examined
             
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==39  not vaaaaalid
             
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==40  not vaaaaalid
             
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==41  not vaaaaalid
             
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==42  not vaaaaalid
             
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==43  not vaaaaalid
             
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==44  not vaaaaalid
        -
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==45  already seen, not examined
        -
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==46  already seen, not examined
        -
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==47  already seen, not examined
        -
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==48  already seen, not examined
             
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==49  not vaaaaalid
             
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==50  vaaaalid
   set([35, 36, 37, 38, 45, 46, 47, 48])
   i== 8   ID+i==58   not valid
   set([58])
   i==18   ID+i==68   not valid
   set([58, 68])
   i== 7   ID+i==57   not valid
   set([57, 58, 68])
   i==17   ID+i==67   not valid
   set([57, 58, 67, 68])
   i== 6   ID+i==56   not valid
   set([56, 57, 58, 67, 68])
   i==16   ID+i==66   not valid
   set([66, 67, 68, 56, 57, 58])
   i== 5   ID+i==55   not valid
   set([66, 67, 68, 55, 56, 57, 58])
   i==15   ID+i==65   not valid
             
set([65, 66, 67, 68, 55, 56, 57, 58])
ID==51  not vaaaaalid
             
set([65, 66, 67, 68, 55, 56, 57, 58])
ID==52  vaaaalid
   set([65, 66, 67, 68, 55, 56, 57, 58])
   i== 8   ID+i==60   valid
             
ID==60
   set([60])
   i== 8   ID+i==68   not valid
   set([68])
   i==18   ID+i==78   not valid
   set([68, 78])
   i== 7   ID+i==67   not valid
   set([67, 68, 78])
   i==17   ID+i==77   not valid
   set([67, 68, 77, 78])
   i== 6   ID+i==66   not valid
   set([66, 67, 68, 77, 78])
   i==16   ID+i==76   not valid
   set([66, 67, 68, 76, 77, 78])
   i== 5   ID+i==65   not valid
   set([65, 66, 67, 68, 76, 77, 78])
   i==15   ID+i==75   not valid
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==61  not vaaaaalid
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==62  not vaaaaalid
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==63  not vaaaaalid
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==64  not vaaaaalid
        -
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==65  already seen, not examined
        -
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==66  already seen, not examined
        -
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==67  already seen, not examined
        -
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==68  already seen, not examined
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==69  not vaaaaalid
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==70  not vaaaaalid
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==71  not vaaaaalid
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==72  not vaaaaalid
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==73  not vaaaaalid
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==74  not vaaaaalid
        -
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==75  already seen, not examined
        -
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==76  already seen, not examined
        -
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==77  already seen, not examined
        -
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==78  already seen, not examined
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==79  not vaaaaalid
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==80  not vaaaaalid
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==81  not vaaaaalid
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==82  not vaaaaalid
             
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==83  vaaaalid
   set([65, 66, 67, 68, 75, 76, 77, 78])
   i== 8   ID+i==91   not valid
   set([91])
   i==18   ID+i==101   not valid
   set([91, 101])
   i== 7   ID+i==90   not valid
   set([90, 91, 101])
   i==17   ID+i==100   not valid
   set([90, 91, 100, 101])
   i== 6   ID+i==89   not valid
   set([89, 90, 91, 100, 101])
   i==16   ID+i==99   not valid
   set([99, 100, 101, 89, 90, 91])
   i== 5   ID+i==88   not valid
   set([99, 100, 101, 88, 89, 90, 91])
   i==15   ID+i==98   valid
             
ID==98
   set([98, 99, 100, 101, 88, 89, 90, 91])
   i== 8   ID+i==106   not valid
   set([106])
   i==18   ID+i==116   not valid
   set([106, 116])
   i== 7   ID+i==105   not valid
   set([105, 106, 116])
   i==17   ID+i==115   not valid
   set([105, 106, 115, 116])
   i== 6   ID+i==104   not valid
   set([104, 105, 106, 115, 116])
   i==16   ID+i==114   not valid
   set([104, 105, 106, 114, 115, 116])
   i== 5   ID+i==103   not valid
   set([103, 104, 105, 106, 114, 115, 116])
   i==15   ID+i==113   not valid
             
set([103, 104, 105, 106, 113, 114, 115, 116])
ID==99  not vaaaaalid
             
set([103, 104, 105, 106, 113, 114, 115, 116])
ID==100  not vaaaaalid

==========================
ID==101
   ID>lastResult is True : program STOPS

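The code above shows how the ID values already seen are recorded and how SEEN is emptied. It is a sound version, because the algorithm empties SEEN periodically, and that is necessary given the number of IDs to be tested.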
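
An alternative to resetting SEEN at a fixed point is to drop only the entries the scan has already passed. The Python 3 sketch below assumes IDs are examined in increasing order, as in the loop above where ID never decreases; `prune_seen` is a hypothetical helper, not part of the code in this answer.

```python
def prune_seen(seen, current_id):
    """Return a copy of `seen` without the ids the scan has already passed."""
    return {page_id for page_id in seen if page_id >= current_id}


seen = {33, 43, 32, 42, 31, 41, 40}
seen = prune_seen(seen, current_id=35)
print(sorted(seen))  # [40, 41, 42, 43]
```

Calling it once per batch of IDs rather than on every step keeps the pruning itself cheap.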

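But from the start, my view has been that in this algorithm the instructions that record and test SEEN are executed at every step of the program, which weighs heavily on performance.

That is why I thought there should be another algorithm without this drawback. I finally managed to write such an alternative, so we now have two codes implementing two different algorithms.

Regarding your question, "are you sure there is no need to use the SEEN logic in the second one?", my answer is "yes, I think I can be sure". The purpose of running my code #2 with the instructions that manage SEEN was precisely to make sure, after I had verified the algorithm conceptually. If you want to be sure too, you should:

  • study the algorithm conceptually and precisely

  • run both codes as many times as you need and compare their results, changing the values of lastResult, valid_ones and diff

For me, the point is settled as long as no concrete case contradicts my conclusion.

I continue in another answer because the number of characters in an answer is limited.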

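An identifier is declared global inside a function when that function needs to modify it and the identifier lives outside the function.

So declaring lastResult and curRes as global in checkNextID() is an aberration:

  • first, lastResult, because it is a constant throughout the code. The best approach is to give checkNextID() a parameter lastResult whose default value is lastResult

  • second, curRes, because checkNextID() never modifies this identifier

Declaring curRes as global inside isValid() is also poor practice: 1) the new value of curRes is sent from inside isValid() to the outside; 2) the program then looks up the value of curRes outside checkNextID(). This is an odd and useless detour. You could simply let curRes be a free variable in checkNextID() (see the docs); the function would automatically look outside to resolve the name and get its value.

Personally, I prefer to reorganize the overall algorithm. In the following code, curRes is a local object that takes its value directly from the return of isValid(). This requires redefining isValid(): in my code, isValid() returns either the soup object or False.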
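
The difference between passing a result through a global and returning it can be shown in isolation. This is a minimal Python 3 sketch with hypothetical names (`is_valid`, `check_ids`, `fake_pages`), where a dictionary stands in for the website and `None` plays the role of the False return value.

```python
def is_valid(page_id, pages):
    """Return the page content for a valid id, or None otherwise."""
    content = pages.get(page_id)
    if not content:
        return None
    return content


def check_ids(ids, pages):
    parsed = []
    for page_id in ids:
        result = is_valid(page_id, pages)   # local variable, no global needed
        if result is not None:
            parsed.append((page_id, result))
    return parsed


fake_pages = {1: "company 1", 9: "company 9", 10: ""}
print(check_ids([1, 2, 9, 10], fake_pages))  # [(1, 'company 1'), (9, 'company 9')]
```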

I hope I have understood what you need. Please tell me if something is wrong with my approach.

def checkNextID(ID, lastResult = lastResult, diff = [0,1,5,6,7,8,15,16,17,18]):
    runs = 0
    maxdiff = max(diff)
    diff.extend(x for x in xrange(maxdiff) if x not in diff)
    while True:
        for i in diff:
            if ID+i==lastResult:  break
            runs += 1
            if runs % 10 == 0:  time.sleep(6)
            curRes = isValid(ID+i)
            if curRes:
                parseHTML(curRes, ID+i)
                ID = ID + i
                break
        else:
            runs += 1
            ID += maxdiff + 1
            if ID==lastResult:  break






def isValid(ID, urlhead = urlPath):
    # this function returns either False OR a BeautifulSoup instance
    address = urlhead + str(ID)  # used in the error messages below
    try:
        page = getPAGE(address)
        if page == False:  return False
    except Exception, e:
        print "An error occurred in the first Exception block of isValid : " + str(e) +' address: ' + address
        return False
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError, e:
            print "An error occurred in the second Exception block of isValid : " + str(e) +' address: ' + address
            return False
        try:
            companyID = soup.find('span',id='lblCompanyNumber').string
            if (companyID == None): #if lblCompanyNumber is None we can assume that we don't have the content we want, save in the bad log file
                saveToCsv(ID, isEmpty = True)
                return False
            else:
                return soup #we have the data we need, return the soup obj to the caller
        except Exception, e:
            print "Error while parsing this page, third exception block: " + str(e) + ' id: ' + address
            return False


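In addition, to speed up your program:

  • you should use the regex tools (module re) instead of BeautifulSoup, which is roughly 10 times slower than a regex

  • you should not define and use all these functions (saveToCSV, parseHTML, isValid) in checkNextID: each function call costs extra time compared with inline code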
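
As an illustration of the first point, the company number that isValid() looks for could be pulled out with a regular expression. This Python 3 sketch assumes the markup is a plain `<span id="lblCompanyNumber">…</span>` with no nested tags; the pattern and the name `extract_company_number` are hypothetical and would need checking against the real pages. The speed ratio quoted above is this answer's estimate and is not measured here.

```python
import re

# Matches <span ... id="lblCompanyNumber" ...>TEXT</span> and captures TEXT.
COMPANY_NUMBER = re.compile(
    r"""<span[^>]*\bid=["']lblCompanyNumber["'][^>]*>(.*?)</span>""",
    re.IGNORECASE | re.DOTALL,
)


def extract_company_number(html):
    """Return the company number found in `html`, or None if it is absent or empty."""
    match = COMPANY_NUMBER.search(html)
    if match is None:
        return None
    text = match.group(1).strip()
    return text or None


print(extract_company_number('<div><span id="lblCompanyNumber">513025328</span></div>'))
# 513025328
```

Regular expressions are brittle against markup changes, so a parser remains the safer choice whenever the page structure is not under your control.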

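FINAL EDIT

To close this long study of your question, I ran a benchmark. The code and results below show that my intuition was right: my code #2 takes at least 20% less time to run than your code #1.

Your code #1:

[The code block was not preserved in the source.]

My code #2: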

from time import clock

lastResult = 200

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    maxdiff = max(diff)
    others = [x for x in xrange(1,maxdiff) if x not in diff]
    lastothers = others[-1]

    li = []

    while True:
        if ID>lastResult:
            break
        else:
            curRes = isValid(ID)
            if curRes:
                li.append(ID)
                while True:
                    for i in diff:
                        curRes = isValid(ID+i)
                        if curRes:
                            li.append(ID+i)
                            ID += i
                            break
                    else:
                        for j in others:
                            if ID+j>lastResult:
                                ID += j
                                break
                            curRes = isValid(ID+j)
                            if curRes:
                                li.append(ID+j)
                                ID += j
                                break

                        if j==lastothers:
                            ID += maxdiff + 1
                            break
                        elif ID>lastResult:
                            break

            else:
                ID += 1

    return li




def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98,114,129,137,154,166,175,180,184,200)):
    return ID in valid_ones

te = clock()
for i in xrange(10000):
    checkNextID(0)
print clock()-te,'seconds'
print checkNextID(0)

Result:

your code
0.398804596674 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]

my code
0.268061164198 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]

0.268061164198/0.398804596674 ≈ 67.2%

I also tried lastResult=100 and got 72%.
With lastResult=480, I got 80%.
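
For anyone repeating this comparison on Python 3: `time.clock`, used in the benchmark above, was removed in Python 3.8. A comparable measurement can be sketched with the standard `timeit` module. The helpers `best_time` and `relative_time` are hypothetical names, and the figures in this answer were not produced with them.

```python
import timeit


def best_time(func, *args, repeat=5, number=1000):
    """Return the fastest total time (in seconds) for `number` calls of func(*args)."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number))


def relative_time(candidate, baseline, *args, **kwargs):
    """Return the candidate's time as a fraction of the baseline's time."""
    return best_time(candidate, *args, **kwargs) / best_time(baseline, *args, **kwargs)


# Example with two built-ins standing in for the two checkNextID versions:
data = list(range(1000))
print(relative_time(sum, sorted, data, repeat=3, number=100))
```

Taking the minimum over several repeats reduces the influence of other processes running on the machine.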
