<p>One reason your code fails is that you are not using cookies. The site appears to require them to allow paging.</p>
<p>A concise way to extract the data you are interested in looks like this:</p>
<pre><code>import requests
from bs4 import BeautifulSoup
# the site actually uses this url under the hood for paging - check out Google Dev Tools
paging_url = "https://mtgsingles.gr/search?ajax=products-listing&lang=en&page={}&q=dragon"
return_list = []
# the page-scroll will only work when we support cookies
# so we fetch the page in a session
session = requests.Session()
session.get("https://mtgsingles.gr/")
</code></pre>
<p>Every page except the last has a "Next" button, so we use that knowledge to loop until the Next button disappears. When the last page is reached, the button is replaced by an <code>li</code> tag with the class "next hidden", which only exists on the last page.</p>
<p>Now we can start looping:</p>
<pre><code>page = 1 # set count for start page
keep_paging = True # use flag to end loop when last page is reached
while keep_paging:
print("[*] Extracting data for page {}".format(page))
r = session.get(paging_url.format(page))
soup = BeautifulSoup(r.text, "html.parser")
items = soup.select('.iso-item.item-row-view.clearfix')
for item in items:
name = item.find('div', class_='col-md-10').get_text().strip().split('\xa0')[0]
toughness_element = item.find('div', class_='card-power-toughness')
try:
toughness = toughness_element.get_text().strip()
except:
toughness = None
cardtype = item.find('div', class_='cardtype').get_text()
card_dict = {
"name": name,
"toughness": toughness,
"cardtype": cardtype
}
return_list.append(card_dict)
if soup.select('li.next.hidden'): # this element only exists if the last page is reached
keep_paging = False
print("[*] Scraper is done. Quitting...")
else:
page += 1
# do stuff with your list of dicts - e.g. load it into pandas and save it to a spreadsheet
</code></pre>
<p>This will keep paging until no more pages exist, no matter how many subpages the site ends up with.</p>
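<p>As the final code comment suggests, the collected list of dicts drops straight into pandas. Here is a minimal sketch of that step; the sample rows stand in for whatever your scraper actually collected, and <code>dragons.csv</code> is just an example filename:</p>
<pre><code>import pandas as pd

# stand-in sample of the scraped data - in the real script this is return_list
return_list = [
    {"name": "Shivan Dragon", "toughness": "5/5", "cardtype": "Creature - Dragon"},
    {"name": "Dragon Egg", "toughness": "0/2", "cardtype": "Creature - Dragon Egg"},
]

# one column per dict key, one row per card
df = pd.DataFrame(return_list)
df.to_csv("dragons.csv", index=False)
</code></pre>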
<p>My point in the comment above was simply that if an exception is raised in your code, the page count will never be incremented. That is probably not what you want, which is why I suggested reading up on how the whole try/except/else/finally construct behaves.</p>
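<p>To make that behavior concrete, here is a small sketch (the function name and input are made up for illustration) showing which branch runs when an exception is raised versus when it is not:</p>
<pre><code>def parse_toughness(raw):
    """Illustrates try/except/else control flow."""
    try:
        value = raw.strip()
        if not value:
            raise ValueError("empty toughness field")
    except ValueError:
        return None      # runs only when an exception was raised
    else:
        return value     # runs only when no exception occurred

print(parse_toughness(" 5/5 "))  # the else branch returns the stripped value
print(parse_toughness("   "))    # the except branch returns None
</code></pre>
<p>In your scraper the same logic applies: if the statement that increments the page count lives inside the try block after a line that can raise, it will be skipped whenever that line fails.</p>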