Checking all links within links from a source HTML page (Python)

url = args.url[0]
url_list = [url]
checkedURLs = []
AmountVisited = 0
while (url_list and AmountVisited<maxhits):
    url = url_list.pop()
    s = readwebpage(url)
    print("testing url: http",url) #Print the url being tested, this code is here only for testing..
    AmountVisited = AmountVisited + 1
    if s == None:
        print("* bad reference to http", url)
    else:
        urls_list = re.findall(r'href="http([\s:]?[^\'" >]+)', s) #Creates a list of all links in HTML code starting with...
        while urls_list: #... http or https
            insert = urls_list.pop()
            while(insert in checkedURLs and urls_list):
                insert = urls_list.pop()
            url_list.append(insert)
            checkedURLs = insert

3 Answers

User

#1 · edited 2024-10-01 02:24:44

This isn't Python, but since you mentioned you aren't strictly bound to regex, you might find wget useful for this.

wget --spider -o C:\wget.log -e robots=off -w 1 -r -l 10 http://www.stackoverflow.com

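Breakdown:

--spider: when invoked with this option, Wget behaves as a web spider, meaning it does not download the pages, it only checks that they exist.
-o C:\wget.log: log all messages to C:\wget.log.
-e robots=off: ignore robots.txt.
-w 1: wait 1 second between requests.
-r: turn on recursive retrieval.
-l 10: set the recursion depth to 10, meaning wget will only go 10 levels deep; you may need to change this depending on your maximum number of requests.
http://www.stackoverflow.com: the URL to start from.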

Once it finishes, you can go through the entries in wget.log and search for HTTP status codes such as 404 to find out which links are broken.
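Since the rest of the thread is in Python, here is a minimal sketch of that last step: scanning the log for failed requests. The log layout is an assumption (a timestamped request line followed by an "awaiting response..." status line, as wget usually writes); the answer does not show a log excerpt, and `find_broken_links` and `sample_log` are hypothetical names used only for illustration.

```python
import re

# Assumed log layout (not shown in the answer): a request line such as
#   --2024-10-01 02:24:44--  http://host/page
# followed later by
#   HTTP request sent, awaiting response... 404 Not Found
REQUEST_LINE = re.compile(r'^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)')
STATUS_LINE = re.compile(r'awaiting response\.\.\. (\d{3})')

def find_broken_links(log_text):
    """Return (url, status) pairs for every response with a 4xx or 5xx status."""
    broken = []
    current_url = None
    for line in log_text.splitlines():
        request = REQUEST_LINE.match(line)
        if request:
            # Remember which URL the following status line belongs to.
            current_url = request.group(1)
            continue
        status = STATUS_LINE.search(line)
        if status and current_url is not None:
            code = int(status.group(1))
            if code >= 400:
                broken.append((current_url, code))
    return broken

# Made-up sample in the assumed layout, not real wget output.
sample_log = """\
--2024-10-01 02:24:44--  http://www.example.com/
HTTP request sent, awaiting response... 200 OK
--2024-10-01 02:24:45--  http://www.example.com/missing
HTTP request sent, awaiting response... 404 Not Found
"""
print(find_broken_links(sample_log))
```

For a real run you would pass the contents of the log file instead, for example `find_broken_links(open(r'C:\wget.log').read())`, and adjust the two patterns if your wget version formats its log differently.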

User

#2 · edited 2024-10-01 02:24:44

Here is the code you want. However, please stop parsing HTML with regex. BeautifulSoup is the way to go.

import re
from urllib.request import urlopen

def readwebpage(url):
    # Return the page source as text, or None if the URL cannot be fetched.
    print("testing", url)
    try:
        return urlopen(url).read().decode('utf-8', errors='replace')
    except (OSError, ValueError):
        return None

url = 'http://xrisk.esy.es' #put starting url here

yet_to_visit = [url]
visited_urls = []

AmountVisited = 0
maxhits = 10

while yet_to_visit and AmountVisited < maxhits:
    print(yet_to_visit)
    current = yet_to_visit.pop()
    AmountVisited = AmountVisited + 1
    html = readwebpage(current)

    if html is None:
        print("* bad reference to", current)
    else:
        r = re.compile('(?<=href=").*?(?=")')
        links = re.findall(r, html) #Creates a list of all links in the HTML code
        for u in links:
            # Skip pages already visited; queue only absolute http(s) links.
            if u in visited_urls:
                continue
            elif u.startswith('http'):
                yet_to_visit.append(u)
        print(links)
    visited_urls.append(current)
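The advice above to stop using regex on HTML can be illustrated with a real parser. The sketch below deliberately swaps BeautifulSoup for the standard library's `html.parser`, so that it runs without installing anything; it is not the answer's own code, and `LinkCollector`, `extract_links`, and `sample` are hypothetical names.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that are absolute http(s) URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name == 'href' and value and value.startswith(('http://', 'https://')):
                self.links.append(value)

def extract_links(html):
    """Return the absolute links found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

sample = ('<p><a href="http://example.com/a">A</a> '
          "<a class=x href='https://example.com/b'>B</a> "
          '<a href="/rel">C</a></p>')
print(extract_links(sample))  # ['http://example.com/a', 'https://example.com/b']
```

Unlike the lookbehind pattern, the parser handles single-quoted attributes and extra attributes before `href` without special cases. In the crawler loop, `links = extract_links(html)` could stand in for the `re.findall` call.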

User

#3 · edited 2024-10-01 02:24:44

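I suspect your regex is part of your problem. Right now you have http outside your capture group, and [\s:] matches "some kind of whitespace (i.e. \s) or :".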
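A quick way to check this is to run the question's pattern on a small made-up snippet (the `html` string below is an assumption, not taken from the question) and look at what the capture group returns:

```python
import re

# Pattern copied from the question; "http" sits outside the capture group.
pattern = r'href="http([\s:]?[^\'" >]+)'

html = '<a href="http://example.com/a">A</a> <a href="https://example.com/b">B</a>'

captured = re.findall(pattern, html)
print(captured)  # ['://example.com/a', 's://example.com/b']

# The scheme prefix has to be glued back on to get a usable URL.
rebuilt = ['http' + rest for rest in captured]
print(rebuilt)   # ['http://example.com/a', 'https://example.com/b']
```

Every captured string is missing its leading "http", which is why the question's code prints "http" separately in front of each URL.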

I would change the regex to: urls_list = re.findall(r'href="(.*)"',s). In other words, "match anything in quotes after href=". If you really need to ensure http[s]://, use r'href="(https?://.*)"' (s? means one or zero s).

EDIT: and with a regex that actually works, using a non-greedy glom: r'href=(?P<q>[\'"])(https?://.*?)(?P=q)'
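The difference between the greedy and the non-greedy pattern shows up as soon as a line contains more than one quoted attribute. A small sketch on a made-up snippet (`html`, `greedy`, and `lazy` are illustrative names):

```python
import re

html = ('<a href="http://example.com/a" class="x">A</a> '
        "<a href='https://example.com/b'>B</a>")

# Greedy: .* runs on to the LAST double quote on the line.
greedy = re.findall(r'href="(.*)"', html)
print(greedy)  # ['http://example.com/a" class="x']

# Non-greedy, with a backreference so the closing quote matches the opening one.
link_re = re.compile(r'href=(?P<q>[\'"])(https?://.*?)(?P=q)')
lazy = [m.group(2) for m in link_re.finditer(html)]
print(lazy)    # ['http://example.com/a', 'https://example.com/b']
```

Note that this pattern has two groups, so `re.findall` would return (quote, url) tuples; iterating with `finditer` and taking group 2 yields just the URLs.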

(Also, uh, while it's technically unnecessary in your case because re caches, I think it's good practice to get into the habit of using ^{}.)

I think it's great that all of your URLs are full URLs. Do you have to deal with relative URLs?
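If relative URLs do need handling, one common approach is to resolve each href against the URL of the page it was found on, using `urllib.parse.urljoin` from the standard library. A minimal sketch, with a made-up `base` page and `hrefs` list:

```python
from urllib.parse import urljoin

# URL of the page the links were found on (hypothetical).
base = 'http://www.example.com/docs/index.html'

hrefs = ['http://other.example.org/x', '/about', 'page2.html', '../img/logo.png']

# urljoin leaves absolute URLs alone and resolves relative ones against base.
absolute = [urljoin(base, h) for h in hrefs]
for url in absolute:
    print(url)
# http://other.example.org/x
# http://www.example.com/about
# http://www.example.com/docs/page2.html
# http://www.example.com/img/logo.png
```

With this in place the crawler no longer has to restrict its regex to hrefs that start with http.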
