Want to get all the links in a webpage using urllib.request

Posted 2024-10-01 15:34:41


When I test it, it just keeps printing (None, 0), even though the URL I pass it contains several <a href= tags.

import urllib.request as ur
def getNextlink(url): 
    sourceFile = ur.urlopen(url)
    sourceText = sourceFile.read()
    page = str(sourceText)

    startLink = page.find('<a href=')
    if startLink == -1:
        return None, 0
    startQu = page.find('"', startLink)
    endQu = page.find('"', startQu+1)
    url = page[startQu +1:endQu]
    return url, endQu
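
For reference, this is presumably how the function is being called (the URL is a placeholder of mine, not from the question):

link, end = getNextlink('https://example.com')
print(link, end)  # the question reports this printing None 0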

Tags: import, none, url, return, page, find, urllib
3 Answers

You should use Beautiful Soup instead; it handles this kind of task smoothly. Here's an example:

from bs4 import BeautifulSoup
import requests

def links(url):
    html = requests.get(url).content
    bsObj = BeautifulSoup(html, 'lxml')

    finalLinks = set()
    for link in bsObj.find_all('a'):
        href = link.get('href')  # some <a> tags carry no href at all
        if href:
            finalLinks.add(href)
    return finalLinks
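
A quick usage sketch for the function above (the URL is a placeholder of mine):

for href in sorted(links('https://example.com')):
    print(href)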

Try this:

import urllib.request
import re

url = ''  # pass any URL here

# match the href value of every <a ...> tag
html = urllib.request.urlopen(url).read().decode("utf-8")
urllist = re.findall(r"""<\s*a\s+[^>]*href=["']([^"']+)["']""", html)

print(urllist)
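
One caveat worth adding (my note, not part of the original answer): href values are often relative, and the standard library's urllib.parse.urljoin can resolve them against the page URL:

from urllib.parse import urljoin

base = 'https://example.com/docs/'  # assumed page URL
print(urljoin(base, '../about.html'))            # https://example.com/about.html
print(urljoin(base, 'https://other.example/x'))  # absolute hrefs pass through unchanged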

Here is another solution:

from urllib.request import urlopen

url = ''  # define your URL here
html = urlopen(url).read().decode('utf-8')

# scan for the start of each "<a " tag, then print everything
# up to and including the closing </a>
for i in range(len(html) - 2):
    if html[i] == '<' and html[i+1] == 'a' and html[i+2] == ' ':
        pos = html[i:].find('</a>')
        print(html[i: i+pos+4])
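
If only the URL is wanted rather than the whole element, the quote-scanning idea from the question can be applied to each match; a sketch of mine, assuming a double-quoted href attribute:

anchor = '<a href="https://example.com/page">Example</a>'  # sample match
start = anchor.find('"')
end = anchor.find('"', start + 1)
print(anchor[start + 1:end])  # https://example.com/page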

Define your URL in the snippet above. Hope this helps, and don't forget to upvote.
