Here is my code for retrieving the links on a page.
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import re

def getExternalLinks(includeURL):
    html = urlopen(includeURL)
    bsObj = soup(html, "html.parser")
    externalLinks = []
    links = bsObj.findAll("a",
                          href=re.compile("^(http://www.homedepot.com/b)"))
    for link in links:
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    print(externalLinks)

getExternalLinks("http://www.homedepot.com/")
The links get stored in the array below.

Now I am trying to iterate over those links, visit each page, and pull information from it. When I run the next piece of code, I get some errors.
def getInternalLinks(includeLinks):
    internalHTML = urlopen(includeLinks)
    Inner_bsObj = soup(internalHTML, "html.parser")
    internalLinks = []
    inner_links = Inner_bsObj.findAll("a", "href")
    for inner_link in inner_links:
        if inner_link.attrs['href'] is not None:
            if inner_link.attrs['href'] not in internalLinks:
                internalLinks.append(inner_link.attrs['href'])
    print(internalLinks)

getInternalLinks(getExternalLinks("http://www.homedepot.com"))
File "C:/Users/anag/Documents/Python
Scripts/Webscrapers/BeautifulSoup/HomeDepot/HomeDepotScraper.py", line 20,
in getInternalLinks
internalHTML = urlopen(includeLinks)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 517, in open
req.timeout = timeout
AttributeError: 'NoneType' object has no attribute 'timeout'
How should I go about extracting information from each of the webpages stored in the externalLinks array?
It's a list, not an array. In Python, "array" usually refers to a NumPy array, which is quite different from a list.
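For instance (assuming NumPy is installed, purely to illustrate the distinction):

import numpy as np

links = ["http://a.example", "http://b.example"]  # a plain Python list
links.append("http://c.example")                  # lists grow one item at a time

nums = np.array([1, 2, 3])  # a NumPy array: homogeneous dtype
print(nums * 2)             # element-wise arithmetic -> [2 4 6]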
The problem with the code is that getExternalLinks() returns None, and that None is then passed as the argument to getInternalLinks(), which expects a single URL. The first function needs to return the list (or set) of URLs rather than just printing them, and you then need to loop over the return value and feed each URL to the second function.

The two functions also contain almost identical code: despite the different names, only the arguments to the findAll() method differ. I would refactor them into one common function.
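Here is a minimal sketch of that refactor. The getLinks name and the **find_kwargs parameter are mine, and I'm assuming href=True is what the second call was meant to pass, since a bare "href" as the second positional argument to findAll() actually filters on the class attribute:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import re

def getLinks(url, **find_kwargs):
    # Fetch the page and collect the unique href values of every
    # <a> tag matching the keyword filters passed in.
    html = urlopen(url)
    bsObj = soup(html, "html.parser")
    links = []
    for link in bsObj.findAll("a", **find_kwargs):
        href = link.attrs.get('href')
        if href is not None and href not in links:
            links.append(href)
    return links  # return the list instead of printing it

# First pass: collect the category links from the front page.
externalLinks = getLinks(
    "http://www.homedepot.com/",
    href=re.compile("^(http://www.homedepot.com/b)"))

# Second pass: loop over the returned URLs, feeding each one
# back into the same function.
for url in externalLinks:
    # href=True keeps only <a> tags that actually have an href attribute.
    internalLinks = getLinks(url, href=True)
    print(internalLinks)

Because the one function returns its result, the second pass can consume the first pass's output directly, which is exactly what the original getInternalLinks(getExternalLinks(...)) call was trying and failing to do.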