I want to scrape all the *.wordpress URLs mentioned on a social bookmarking website. The URLs in the page use the following format:
<span class="domain">somedomain.com </span>
Here is what I have come up with so far:
import os
import urllib2
import re
from os.path import basename
from urlparse import urlsplit
import time

baseurl = 'https://targetwebsite/pages/'
print baseurl
spage = int(raw_input("Start page?"))
epage = int(raw_input("End page?"))
for p in range(spage, epage):
    url = baseurl + str(p)
    print url
    urlContent = urllib2.urlopen(url).read()
    #WHAT REGEXP HERE?
    domainUrls = re.findall('span .*.wordpress.com (.*?) ', urlContent)
    try:
        for dUrl in domainUrls:
            print dUrl
    except:
        print "an error occurred"
        pass
I have tried different regexps, but none of them work. Thanks for your help.
Even a rough answer would be fine.
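One possible approach, sketched under the assumption that every domain really does appear as `<span class="domain">somedomain.com </span>` as shown in the question: capture the text inside each such span first, then keep only the hosts that end in `.wordpress.com`. The names `DOMAIN_SPAN`, `extract_wordpress_domains` and `sample` are made up for this illustration, and the sample HTML is invented rather than taken from the real site.

```python
import re

# Assumed markup (from the question): <span class="domain">somedomain.com </span>
# Capture the non-whitespace text inside the span, ignoring surrounding spaces.
DOMAIN_SPAN = re.compile(r'<span\s+class="domain">\s*([^<\s]+)\s*</span>')


def extract_wordpress_domains(html):
    """Return the domains found in <span class="domain"> that end in .wordpress.com."""
    domains = DOMAIN_SPAN.findall(html)
    return [d for d in domains if d.lower().endswith('.wordpress.com')]


# Invented sample input, only to show the call.
sample = (
    '<span class="domain">somedomain.com </span>'
    '<span class="domain">myblog.wordpress.com </span>'
)
print(extract_wordpress_domains(sample))
```

In the loop from the question, this would be called as `extract_wordpress_domains(urlContent)`. Splitting the work into "find the span" and "filter the host" keeps the pattern short. If the real markup has extra attributes or nested tags, the pattern would need adjusting, and an HTML parser is likely to be more robust than a regexp.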