I have a function that should receive a str value, but when I run it I get an error saying the value is bytes:
Traceback (most recent call last):
  File "C:\Users\sdand\Documents\Python\Engine\engine.py", line 4, in <module>
    print (find.crawl_web('https://google.com',4))
  File "C:\Users\sdand\Documents\Python\Engine\finder.py", line 68, in crawl_web
    links = self.get_all_links(content)
  File "C:\Users\sdand\Documents\Python\Engine\finder.py", line 20, in get_all_links
    url, endpos = self.get_next_target(page)
  File "C:\Users\sdand\Documents\Python\Engine\finder.py", line 7, in get_next_target
    start_link = s.find('<a href=')
TypeError: a bytes-like object is required, not 'str'
This is the function where I call get_all_links:
def crawl_web(self, seed, max_depth):
    tocrawl = [seed]
    crawled = []
    next_depth = []
    depth = 0
    index = []
    while tocrawl and depth <= max_depth:
        page = tocrawl.pop()
        if page not in crawled:
            # here content is str
            content = self.get_page(page)
            self.add_page_to_index(index, page, content)
            links = self.get_all_links(content)
            self.union(next_depth, links)
            crawled.append(page)
        if not tocrawl:
            tocrawl, next_depth = next_depth, []
            depth = depth + 1
    return index
This is get_page:
def get_page(self, url):
    try:
        import urllib.request
        return urllib.request.urlopen(url).read()
    except:
        return ""
This is get_all_links:
def get_all_links(self, page):
    # but here it is bytes, I don't know why
    links = []
    while True:
        url, endpos = self.get_next_target(page)
        print(url)
        if url is not None:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links
I don't understand why my str variable content is turned into bytes inside get_all_links. Can someone explain why this happens and how I can fix it?
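You may not have realized that .read() returns a bytes object, not a str, so content is already bytes when get_page returns; nothing converts it later. The TypeError is raised because find is called on that bytes object with a str argument. Although working with bytes objects is often recommended for web scraping, the simplest fix is to convert the result to a str by decoding it.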
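The decoding fix could look roughly like the sketch below. It is a sketch under assumptions, not the asker's actual code: decode_body is a hypothetical helper, get_page is written as a standalone function rather than a method, and falling back to UTF-8 when the server declares no charset is a guess that will not suit every site.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Turn the bytes returned by .read() into a str (hypothetical helper)."""
    # Assumption: fall back to UTF-8 when no charset is declared.
    # errors="replace" keeps malformed bytes from raising UnicodeDecodeError.
    return raw.decode(charset or "utf-8", errors="replace")


def get_page(url):
    """Fetch a URL and return its body as str, or "" on failure."""
    try:
        with urllib.request.urlopen(url) as response:
            # Charset from the Content-Type header, or None if absent.
            charset = response.headers.get_content_charset()
            return decode_body(response.read(), charset)
    except (OSError, ValueError, LookupError):
        # OSError covers network errors, ValueError a malformed URL,
        # LookupError an unknown charset name.
        return ""


# Offline illustration using a literal bytes value instead of a real request.
sample = b'<html><a href="https://example.com">x</a></html>'
text = decode_body(sample)
position = text.find('<a href=')  # works now: str searched with a str pattern
```

If you would rather keep the page as bytes, the alternative is to leave get_page unchanged and search with bytes literals instead, for example s.find(b'<a href='); in that case the except branch should return b"" so the type stays consistent.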