Finding the hyperlinks of a page without BeautifulSoup in Python

Published 2024-09-27 00:19:52


What I am trying to do is find all the hyperlinks of a web page. Here is what I have so far, but it does not work:

from urllib.request import urlopen

def findHyperLinks(webpage):
    link = "Not found"
    encoding = "utf-8"
    for webpagesline in webpage:
        webpagesline = str(webpagesline, encoding)
        if "<a href>" in webpagesline:
            indexstart = webpagesline.find("<a href>")
            indexend = webpagesline.find("</a>")
            link = webpagesline[indexstart+7:indexend]
            return link
    return link

def main():
    address = input("Please enter the address of the webpage to find the hyperlinks: ")
    try:
        webpage = urlopen(address)
        link =  findHyperLinks(webpage)
        print("The hyperlinks are", link)

        webpage.close()
    except Exception as exceptObj:
        print("Error:" , str(exceptObj))

main()

2 Answers

There are multiple problems in your code. One of them is that you are searching for links whose `href` attribute is present, empty, and the tag's only attribute, i.e. literally `<a href>`, so a real tag like `<a href="/about.html">` is never matched.
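A minimal sketch of the string-searching approach from the question, fixed to look for `<a href="` and extract the attribute value up to the closing quote (the helper name is mine, and it assumes each anchor tag fits on one line):

```python
def find_links_by_string_search(html_line):
    """Collect every href value in one line of HTML,
    assuming anchors are written as <a href="...">."""
    links = []
    marker = '<a href="'
    start = html_line.find(marker)
    while start != -1:
        # the value runs from just after the marker to the next quote
        end = html_line.find('"', start + len(marker))
        if end == -1:
            break
        links.append(html_line[start + len(marker):end])
        start = html_line.find(marker, end)
    return links

print(find_links_by_string_search('<a href="/a.html">A</a> <a href="/b.html">B</a>'))
```

This still breaks on anchors with extra attributes before `href` or single-quoted values, which is exactly why the answers below reach for a parser.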

In any case, things become much simpler and more reliable if you use an HTML parser to (well) parse the HTML. An example using BeautifulSoup:

from bs4 import BeautifulSoup
from urllib.request import urlopen

soup = BeautifulSoup(urlopen(address), "html.parser")
for link in soup.find_all("a", href=True):
    print(link["href"], link.get_text())
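Since the question title asks for a solution without BeautifulSoup, it is worth noting the standard library ships its own event-driven parser in `html.parser`; a sketch of collecting hrefs with it (class name is mine):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag the parser encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for this start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<a href="/about.html">About</a> <a name="anchor">no href</a>')
print(parser.links)
```

Unlike the string-search attempt in the question, this handles tags with multiple attributes, single quotes, and tags split across lines.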

Without BeautifulSoup, you can use a regular expression and a simple function.

from urllib.request import urlopen
import re

def find_link(url):
    response = urlopen(url)
    res = response.read().decode("utf-8")
    links = re.findall('(?<=<a href=")[^"]*', res)

    for x in links:
        # skip in-page bookmarks, like #about
        if x[0] == '#':
            continue

        # turn a site-absolute URL, like /about.html, into a full one;
        # be careful with redirects and add more flexible
        # processing if needed
        if x[0] == '/':
            x = url + x

        print(x)

find_link('http://cnn.com')
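The manual leading-`/` check above can be replaced with `urllib.parse.urljoin`, which also resolves relative paths like `story.html` and protocol-relative `//host/...` URLs against the page's own URL; a sketch (the sample base URL and hrefs are made up):

```python
from urllib.parse import urljoin

base = 'http://cnn.com/world/index.html'
for href in ['#top', '/about.html', 'story.html', '//media.cnn.com/a.jpg']:
    if href.startswith('#'):
        continue  # skip in-page bookmarks, as above
    # urljoin resolves href relative to base per RFC 3986
    print(urljoin(base, href))
```

This avoids the subtle bug in the `url + x` concatenation: if the entered URL ends with a path like `/world/index.html`, naive concatenation produces a wrong address, while `urljoin` resolves it correctly.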
