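<p>An identifier needs to be declared <code>global</code> only when a function has to rebind a name that is defined outside that function.</p>
<p>Declaring <strong>lastResult</strong> and <strong>curRes</strong> as <code>global</code> is therefore an aberration:</p>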
<ul>
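<li><p>the first, <strong>lastResult</strong>, because it is a constant throughout the code. The best approach is to give the function <strong>checkNextID()</strong> a parameter <strong>lastResult</strong> whose default argument is <strong>lastResult</strong>.</p></li>
<li><p>the second, <strong>curRes</strong>, because <strong>checkNextID()</strong> never modifies that identifier.</p></li>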
</ul>
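<p>A minimal sketch of the two points above. The names <code>LIMIT</code>, <code>counter</code>, <code>reached_limit</code> and <code>bump</code> are made up for illustration and are not part of your program. It shows a constant bound once as a default argument, and the one case where <code>global</code> is really needed: rebinding an outer name.</p>

```python
LIMIT = 200      # hypothetical module-level constant, stands in for lastResult
counter = 0      # hypothetical module-level name that one function rebinds


def reached_limit(ID, limit=LIMIT):
    # 'limit' is bound once, when the def statement runs; no global needed.
    return ID >= limit


def bump():
    # Rebinding an outer name is the one case that needs 'global'.
    global counter
    counter += 1


bump()
bump()
print(counter)              # 2
print(reached_limit(150))   # False
print(reached_limit(200))   # True
```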
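<p>Declaring <strong>curRes</strong> as <code>global</code> in the function <strong>isValid()</strong> is also bad practice: 1) it sends the new value of <strong>curRes</strong> from inside <strong>isValid()</strong> to the outside; 2) the program then has to look outside the function <strong>checkNextID()</strong> for the value of <strong>curRes</strong>. That is a strange and useless detour. It works only because <strong>curRes</strong> is a <em>free variable</em> in <strong>checkNextID()</strong> (see the <a href="http://docs.python.org/reference/executionmodel.html" rel="nofollow">doc</a>): the function automatically goes outside to resolve the name and obtain its value.</p>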
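<p>To illustrate what "free variable" means here, a small sketch with hypothetical names (<code>status</code> and <code>report</code> do not come from your code): a name that a function only reads is looked up outside the function each time the function runs, without any declaration.</p>

```python
def report():
    # 'status' is never assigned inside this function, so it is a free
    # variable: Python resolves it in the enclosing (here: module) scope
    # every time the function is called.
    return "status is " + status


status = "pending"
first = report()

status = "done"
second = report()

print(first)    # status is pending
print(second)   # status is done
```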
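<p>Personally, I prefer to reorganize the overall algorithm. In the following code, <strong>curRes</strong> is defined as a local object that takes its value directly from the return of the function <strong>isValid()</strong>. This requires redefining <strong>isValid()</strong>: in my code, <strong>isValid()</strong> returns either the object <strong>soup</strong> or <strong>False</strong>.</p>
<p>I hope I have understood your need correctly. Tell me if anything is wrong with my approach.</p>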
<pre><code>import time

# lastResult, urlPath, getPAGE, parseHTML, saveToCsv and BeautifulSoup
# are assumed to be defined or imported elsewhere in your program.

def checkNextID(ID, lastResult = lastResult, diff = [0,1,5,6,7,8,15,16,17,18]):
    runs = 0
    maxdiff = max(diff)
    # build a new list so that the default argument is not mutated
    diff = diff + [x for x in range(maxdiff) if x not in diff]
    while True:
        for i in diff:
            if ID+i==lastResult: break
            runs += 1
            if runs % 10 == 0: time.sleep(6)
            curRes = isValid(ID+i)  # local object: False or a soup
            if curRes:
                parseHTML(curRes, ID+i)
                ID = ID + i
                break
        else:
            # no valid ID found among the candidates: jump ahead
            runs += 1
            ID += maxdiff + 1
        if ID==lastResult: break

def isValid(ID, urlhead = urlPath):
    # this function returns either False OR a BeautifulSoup instance
    address = urlhead + str(ID)
    try:
        page = getPAGE(address)
        if page == False: return False
    except Exception as e:
        print("An error occurred in the first exception block of isValid: " + str(e) + ' address: ' + address)
        return False
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError as e:
            print("An error occurred in the second exception block of isValid: " + str(e) + ' address: ' + address)
            return False
        try:
            companyID = soup.find('span',id='lblCompanyNumber').string
            if companyID is None:
                # no lblCompanyNumber: not the content we want, save the ID in the bad log file
                saveToCsv(ID, isEmpty = True)
                return False
            else:
                return soup  # we have the data we need: hand the soup object back to the caller
        except Exception as e:
            print("Error while parsing this page, third exception block: " + str(e) + ' id: ' + address)
            return False
</code></pre>
<p>In addition, to speed up your program:</p>
<ul>
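<li><p>you should use the <em>regex tools</em> (module <strong>re</strong>) instead of BeautifulSoup, which is roughly 10 times slower than a regex</p></li>
<li><p>you should not define and call all these functions (saveToCSV, parseHTML, isValid) from <strong>checkNextID</strong>: each function call costs extra time compared with inline code</p></li>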
</ul>
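<p>As an illustration of the first point, here is a rough sketch of how the <code>lblCompanyNumber</code> span could be read with <strong>re</strong>. The span id is taken from the code above; the pattern, the sample HTML and the helper name <code>company_number</code> are assumptions, since the real markup of your pages is not shown. A pattern like this is tied to one specific page layout, so it should be checked against real pages before it replaces the parser.</p>

```python
import re

# Assumed markup: only the span id is known from the code above; the
# attribute order and quoting on the real pages may differ.
COMPANY_RE = re.compile(
    r'<span[^>]*\bid=["\']lblCompanyNumber["\'][^>]*>(.*?)</span>',
    re.IGNORECASE | re.DOTALL,
)


def company_number(page):
    """Return the text of the lblCompanyNumber span, or None if absent or empty."""
    match = COMPANY_RE.search(page)
    if match is None:
        return None
    text = match.group(1).strip()
    return text or None


# Hypothetical sample pages, for demonstration only.
sample = '<div><span id="lblCompanyNumber" class="x">12345678</span></div>'
print(company_number(sample))                        # 12345678
print(company_number('<span id="other">1</span>'))   # None
```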
<h2>Final edit</h2>
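<p>To close this long study of your problem, I ran a benchmark. The code and results below confirm my intuition: my code #2 takes at least 20% less time to run than your code #1.</p>
<p>Your code #1:</p>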
^{pr2}$
<p>My code #2:</p>
<pre><code>from timeit import default_timer as clock  # time.clock() no longer exists in Python 3.8+

lastResult = 200

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    maxdiff = max(diff)
    # offsets below maxdiff that are not in diff, tried only as a fallback
    others = [x for x in range(1,maxdiff) if x not in diff]
    lastothers = others[-1]
    li = []
    while True:
        if ID>lastResult:
            break
        else:
            curRes = isValid(ID)
            if curRes:
                li.append(ID)
                while True:
                    # try the most likely offsets first
                    for i in diff:
                        curRes = isValid(ID+i)
                        if curRes:
                            li.append(ID+i)
                            ID += i
                            break
                    else:
                        # none of them matched: try the remaining offsets
                        for j in others:
                            if ID+j>lastResult:
                                ID += j
                                break
                            curRes = isValid(ID+j)
                            if curRes:
                                li.append(ID+j)
                                ID += j
                                break
                        if j==lastothers:
                            # jump ahead and go back to the one-by-one scan
                            ID += maxdiff + 1
                            break
                        elif ID>lastResult:
                            break
            else:
                ID += 1
    return li

def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98,114,129,137,154,166,175,180,184,200)):
    # stand-in for the real page check: an ID is valid if it is in the tuple
    return ID in valid_ones

te = clock()
for i in range(10000):
    checkNextID(0)
print('%s seconds' % (clock()-te))
print(checkNextID(0))
</code></pre>
<p>Results:</p>
<pre><code>your code
0.398804596674 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]
my code
0.268061164198 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]
</code></pre>
<p>0.268061164198 / 0.398804596674 = 67.2%, so code #2 needs about two thirds of the time of code #1.</p>
<p>I also tried lastResult = 100 and got 72%.<br/>
With lastResult = 480, I got 80%.</p>
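<p>For reference, the percentage for lastResult = 200 can be recomputed from the two timings printed in the results. This is only the arithmetic, not a new measurement, and the variable names are illustrative.</p>

```python
# Timings copied from the results above (seconds for 10000 calls).
time_code1 = 0.398804596674
time_code2 = 0.268061164198

ratio = time_code2 / time_code1   # share of code #1's time that code #2 needs
saving = 1.0 - ratio              # share of the time that is saved

print("code #2 takes %.1f%% of the time of code #1" % (ratio * 100))
print("time saved: %.1f%%" % (saving * 100))
```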