<p>Look at <a href="http://docs.python.org/library/urllib2.html" rel="nofollow">urllib2</a> for fetching the HTML from a URL, and at <a href="http://www.crummy.com/software/BeautifulSoup/" rel="nofollow">BeautifulSoup</a>/<a href="http://docs.python.org/library/htmlparser.html" rel="nofollow">HTMLParser</a>/etc. for parsing the HTML. You can then use something like the following as a starting point for your script:</p>
<pre><code>import sys
import time
import urllib2
import BeautifulSoup
import HTMLParser


def getSource(url, postdata):
    """POST postdata to url and return the response body ("" on failure)."""
    source = ""
    sock = None
    req = urllib2.Request(url, postdata)
    try:
        sock = urllib2.urlopen(req)
    except urllib2.URLError as exc:
        # handle the error..
        pass
    else:
        source = sock.read()
    finally:
        # sock is still None if urlopen() failed
        if sock is not None:
            sock.close()
    return source


def parseSource(source):
    # parse source with BeautifulSoup/HTMLParser, or here...
    pass


def main():
    last_run = 0
    while True:
        t1 = time.time()
        # check if 1 hour has passed since last_run
        if t1 - last_run >= 3600:
            # urllib2 needs a full URL, including the scheme
            source = getSource("http://someurl.com", "user=me&blah=foo")
            last_run = time.time()
            parseSource(source)
        else:
            # sleep for 60 seconds and check time again.
            time.sleep(60)
    return 0


if __name__ == "__main__":
    sys.exit(main())
</code></pre>
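<p>The script leaves the parsing step as a stub. As a rough illustration of what it might contain, here is a minimal sketch that collects the <code>href</code> of every <code>&lt;a&gt;</code> tag. It assumes Python 3, where the <code>HTMLParser</code> module lives at <code>html.parser</code>; the names <code>LinkCollector</code> and <code>parse_source</code> are hypothetical, and extracting links is only an example of what you might want from the page.</p>

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen in the input."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def parse_source(source):
    """Return the list of link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(source)
    collector.close()
    return collector.links


if __name__ == "__main__":
    sample = '<p><a href="/a">A</a> and <a href="http://example.com/b">B</a></p>'
    print(parse_source(sample))
```

<p>BeautifulSoup would likely be more forgiving of badly broken markup; the standard-library parser is used here only because it needs no extra install.</p>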
<p>Here is a good article on <a href="http://unethicalblogger.com/2008/05/03/parsing-html-with-python.html" rel="nofollow">parsing HTML with Python</a>.</p>