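<p>I am trying to write a Python script with BeautifulSoup that crawls the page <a href="http://tbc-python.fossee.in/completed-books/" rel="nofollow">http://tbc-python.fossee.in/completed-books/</a> and collects data from it. It has to record every error that appears in the chapters of all the books (<code>page loading errors, SyntaxErrors, NameErrors, AttributeErrors, etc</code>) in a text file, <code>errors.txt</code>. There are about 273 books. The script below does the job correctly and my connection has plenty of bandwidth, but it takes a long time to get through all the books. Please help me optimize the script with whatever changes are needed, perhaps by restructuring it into functions. Thanks.</p>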
<pre><code>import urllib2
from bs4 import BeautifulSoup

website = "http://tbc-python.fossee.in/completed-books/"
soup = BeautifulSoup(urllib2.urlopen(website))
errors = open('errors.txt', 'w')

# Completed books webpage has data stored in table format
BookTable = soup.find('table', {'class': 'table table-bordered table-hover'})
for BookCount, BookRow in enumerate(BookTable.find_all('tr'), start=1):
    # Grab book names
    BookCol = BookRow.find_all('td')
    BookName = BookCol[1].a.string.strip()
    print "%d: %s" % (BookCount, BookName)
    # Open each book
    BookSrc = BeautifulSoup(urllib2.urlopen('http://tbc-python.fossee.in%s' % (BookCol[1].a.get("href"))))
    ChapTable = BookSrc.find('table', {'class': 'table table-bordered table-hover'})
    # Check if each chapter page opens, if not store book & chapter name in errors.txt
    for ChapRow in ChapTable.find_all('tr'):
        ChapCol = ChapRow.find_all('td')
        # Drop non-ASCII characters to avoid: 'ascii' codec can't encode character u'\xef'
        ChapName = (ChapCol[0].a.string.strip()).encode('ascii', 'ignore')
        ChapLink = 'http://tbc-python.fossee.in%s' % (ChapCol[0].a.get("href"))
        try:
            ChapSrc = BeautifulSoup(urllib2.urlopen(ChapLink))
        except Exception:
            print '\t%s\n\tPage error' % (ChapName)
            errors.write("Page;%s;%s;%s;%s\n" % (BookCount, BookName, ChapName, ChapLink))
            continue
        # Check for errors in chapters and store the errors in errors.txt
        EgError = ChapSrc.find_all('div', {'class': 'output_subarea output_text output_error'})
        if EgError:
            for e, i in enumerate(EgError, start=1):
                ErrText = i.pre.get_text()
                # Record the example only if the output looks like a traceback
                if 'ipython-input' in ErrText or 'Error' in ErrText:
                    errors.write("Example;%s;%s;%s;%s\n" % (BookCount, BookName, ChapName, ChapLink))
            print '\t%s\n\tExample errors: %d' % (ChapName, e)
errors.close()
</code></pre>
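<p>Since the script downloads one chapter page at a time, most of the run time is probably spent waiting on the network rather than parsing. One possible direction, shown here only as a minimal sketch and not a tested replacement for the script, is to download the chapter pages of a book concurrently with a thread pool. The names <code>check_pages</code>, <code>fetch</code> and <code>workers</code> are hypothetical and do not appear in the script above; <code>fetch</code> stands in for whatever performs the download, for example a small wrapper around <code>urllib2.urlopen(link).read()</code>.</p>

```python
from multiprocessing.pool import ThreadPool


def check_pages(links, fetch, workers=8):
    """Download every link concurrently and split the results.

    `fetch` is a hypothetical callable: it takes a URL, returns the page
    body, and raises an exception when the page cannot be loaded.
    Returns (ok, failed): ok holds (link, body) pairs, failed holds
    (link, exception) pairs, both in the original link order.
    """
    def safe_fetch(link):
        # Runs in a worker thread; capture the failure instead of raising
        # so that one bad page does not abort the whole batch.
        try:
            return link, fetch(link), None
        except Exception as exc:
            return link, None, exc

    pool = ThreadPool(workers)
    try:
        # imap keeps the input order while the downloads overlap in time.
        results = list(pool.imap(safe_fetch, links))
    finally:
        pool.close()
        pool.join()

    ok = [(link, body) for link, body, exc in results if exc is None]
    failed = [(link, exc) for link, body, exc in results if exc is not None]
    return ok, failed
```

<p>Under these assumptions, the inner loop could first collect every <code>ChapLink</code> of a book, call <code>check_pages</code> once, write a <code>Page</code> line for each entry in <code>failed</code>, and run BeautifulSoup over each body in <code>ok</code>. Parsing and writing to <code>errors.txt</code> stay in the main thread, so the file needs no locking. The worker count of 8 is an arbitrary starting point and should be kept modest to avoid overloading the server.</p>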