<p>One reason your code fails is that you are not using cookies. The site appears to require them to allow paging.</p>
<p>A concise way to extract the data you are interested in looks like this:</p>
<pre><code>import requests
from bs4 import BeautifulSoup
# the site actually uses this url under the hood for paging - check out Google Dev Tools
paging_url = "https://mtgsingles.gr/search?ajax=products-listing&lang=en&page={}&q=dragon"
return_list = []
# the page-scroll will only work when we support cookies
# so we fetch the page in a session
session = requests.Session()
session.get("https://mtgsingles.gr/")
</code></pre>
<p>Every page except the last has a "Next" button, so we use that knowledge to loop until the Next button disappears. When the last page is reached, the button is replaced by an <code>li</code> tag with the class "next hidden", which only exists on the last page.</p>
<p>Now we can start looping:</p>
<pre><code>page = 1 # set count for start page
keep_paging = True # use flag to end loop when last page is reached
while keep_paging:
print("[*] Extracting data for page {}".format(page))
r = session.get(paging_url.format(page))
soup = BeautifulSoup(r.text, "html.parser")
items = soup.select('.iso-item.item-row-view.clearfix')
for item in items:
name = item.find('div', class_='col-md-10').get_text().strip().split('\xa0')[0]
toughness_element = item.find('div', class_='card-power-toughness')
try:
toughness = toughness_element.get_text().strip()
except:
toughness = None
cardtype = item.find('div', class_='cardtype').get_text()
card_dict = {
"name": name,
"toughness": toughness,
"cardtype": cardtype
}
return_list.append(card_dict)
if soup.select('li.next.hidden'): # this element only exists if the last page is reached
keep_paging = False
print("[*] Scraper is done. Quitting...")
else:
page += 1
# do stuff with your list of dicts - e.g. load it into pandas and save it to a spreadsheet
</code></pre>
<p>This will keep paging until no more pages exist, no matter how many subpages the site ends up with.</p>
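<p>As the final code comment suggests, the collected list of dicts drops straight into pandas. Here is a minimal sketch of that step; the sample rows stand in for whatever your scraper actually collected, and <code>dragons.csv</code> is just an example filename:</p>
<pre><code>import pandas as pd

# stand-in sample of the scraped data - in the real script this is return_list
return_list = [
    {"name": "Shivan Dragon", "toughness": "5/5", "cardtype": "Creature - Dragon"},
    {"name": "Dragon Egg", "toughness": "0/2", "cardtype": "Creature - Dragon Egg"},
]

# one column per dict key, one row per card
df = pd.DataFrame(return_list)
df.to_csv("dragons.csv", index=False)
</code></pre>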
<p>My point in the comment above was simply that if an exception is raised in your code, the page count will never be incremented. That is probably not what you want, which is why I suggested reading up on how the whole try/except/else/finally construct behaves.</p>
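<p>To make that behavior concrete, here is a small sketch (the function name and input are made up for illustration) showing which branch runs when an exception is raised versus when it is not:</p>
<pre><code>def parse_toughness(raw):
    """Illustrates try/except/else control flow."""
    try:
        value = raw.strip()
        if not value:
            raise ValueError("empty toughness field")
    except ValueError:
        return None      # runs only when an exception was raised
    else:
        return value     # runs only when no exception occurred

print(parse_toughness(" 5/5 "))  # the else branch returns the stripped value
print(parse_toughness("   "))    # the except branch returns None
</code></pre>
<p>In your scraper the same logic applies: if the statement that increments the page count lives inside the try block after a line that can raise, it will be skipped whenever that line fails.</p>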