I am trying to build a web scraper for a Hungarian e-commerce site, https://www.arukereso.hu
from bs4 import BeautifulSoup as soup
import requests

#The starting values
#url = input("Illeszd ide egy Árukeresős keresésnek a linkjét: ")
url = 'https://www.arukereso.hu/notebook-c3100/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
page_num = 1
allproducts = []

#Defining functions for better readability
def nextpage():
    further_pages = usefulsoup.find("div", class_="pagination hidden-xs")
    nextpage_num = page_num + 1
    try:
        next_page = further_pages.find("a", string=str(nextpage_num))
        next_page = next_page['href']
        return next_page
    except:
        return None

while True:
    if url == None:
        break
    r = requests.get(url, headers=headers)
    page_html = r.content
    r.close()
    soup = soup(page_html, "html.parser")
    #print(soup)
    usefulsoup = soup.find("div", id="product-list")
    #print(usefulsoup)
    products = usefulsoup.find_all("div", class_="product-box-container clearfix")
    print(products)
    for product in products:
        allproducts.append(product)
    url = nextpage()

print(allproducts)
print(allproducts)
The problem is that the first time nextpage() is called, it returns a valid link (https://www.arukereso.hu/notebook-c3100/?start=25) and the content of the request is valid HTML, but BeautifulSoup produces an empty result from it, so the program ends with an error. I would be grateful if someone could explain why this happens and how to fix it.
The problem in the code is the following: the loop works the first time through because the name soup has not yet been overwritten. On the next iteration, the BeautifulSoup class that was imported under the name soup has been replaced by the parsed document (the result of soup(page_html, "html.parser")), and that is what causes the error. Rename the variable and it should work; I have tested it.

I am not sure why this happens in your case, but Scrapy may be a good tool for this kind of problem: https://scrapy.org/
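A minimal sketch of the fix described above, with the network request replaced by a hypothetical stand-in for the page HTML so it runs offline (the markup below only mimics the structure the scraper looks for, it is not the real arukereso.hu page): the parsed document is bound to its own name, page_soup, instead of rebinding the imported BeautifulSoup alias.

```python
from bs4 import BeautifulSoup  # keep the class under a name we never rebind

# Hypothetical stand-in for a downloaded page (assumption: not the real
# arukereso.hu HTML, just the structure the scraper expects).
page_html = """
<div id="product-list">
  <div class="product-box-container clearfix">Laptop A</div>
  <div class="product-box-container clearfix">Laptop B</div>
</div>
"""

allproducts = []
for _ in range(2):  # simulate scraping two pages of the same markup
    # Bind the parsed document to its own name (page_soup) instead of
    # overwriting the imported class, so the second iteration can still
    # call BeautifulSoup.
    page_soup = BeautifulSoup(page_html, "html.parser")
    usefulsoup = page_soup.find("div", id="product-list")
    products = usefulsoup.find_all("div", class_="product-box-container clearfix")
    allproducts.extend(products)

print(len(allproducts))  # 4
```

Because BeautifulSoup is never shadowed, the loop keeps working on every iteration; in the original code the same second call would raise TypeError, since the first iteration had rebound soup to a parsed document, which is not callable.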