BeautifulSoup cannot find more than 24 elements of a class with find_all

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import re
import lxml

my_url = 'https://www.alza.co.uk/tablets/18852388.htm'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "lxml")

classname = "box browsingitem"
containers = page_soup.find_all("div", {"class": re.compile(classname)})
# len(containers) will be equal to 24

for container in containers:
    title_container = container.find_all("a", {"class": "name browsinglink"})
    product_name = title_container[0].text
    print("product_name: " + product_name)

1 answer

User

Answer #1 · Posted 2024-10-04 03:16:16

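In this case, only 24 items are loaded into the DOM when you visit the page, so that is all BeautifulSoup can see. Two options come to mind: 1) use a headless browser to click the "load more" button and load more items into the DOM; 2) build a simple pagination scheme and loop through the pages.

Here is an example of the second option: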

import requests
from bs4 import BeautifulSoup as soup

for page in range(0, 10):
    print("Trying page # {}".format(page))
    # The first page has no "-pN" suffix; later pages do.
    if page == 0:
        my_url = 'https://www.alza.co.uk/tablets/18852388.html'
    else:
        my_url = 'https://www.alza.co.uk/tablets/18852388-p{}.html'.format(page)

    page_html = requests.get(my_url)
    page_soup = soup(page_html.content, "lxml")
    items = page_soup.find_all('div', {"class": "browsingitem"})
    print("Found a total of {}".format(len(items)))
    for item in items:
        # Search within the current item, not the whole page.
        title = item.find('a', 'browsinglink')

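You can see that the URL has the pagination information built in, so all you need to do is work out how many pages to scrape, and then you can save all of the results. Here is the output: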

Trying page # 0
Found a total of 24
Trying page # 1
Found a total of 24
Trying page # 2
Found a total of 24
Trying page # 3
Found a total of 24
Trying page # 4
Found a total of 24
Trying page # 5
Found a total of 24
Trying page # 6
Found a total of 24
Trying page # 7
Found a total of 24
Trying page # 8
Found a total of 17
Trying page # 9
Found a total of 0
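
The output shows that a page past the end of the listing returns no items, so instead of hard-coding the page count one could stop at the first empty page. The following is a minimal sketch of that idea, not a tested scraper for this site: the helper names `build_url`, `fetch_titles` and `scrape_all`, the `max_pages` safety cap, and the 10-second timeout are illustrative assumptions, and the URL scheme and CSS class names are simply taken from the code above.

```python
from typing import Callable, List

# Base URL taken from the answer above; the "-pN" suffix scheme is assumed
# to behave as shown in that code.
BASE = "https://www.alza.co.uk/tablets/18852388"


def build_url(page: int) -> str:
    """Return the listing URL for a page index (hypothetical helper)."""
    if page == 0:
        return BASE + ".html"
    return "{}-p{}.html".format(BASE, page)


def fetch_titles(url: str) -> List[str]:
    """Download one listing page and return the product names found on it."""
    # Third-party imports live here so the pure helpers work without them.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)
    page_soup = BeautifulSoup(response.content, "lxml")
    titles = []
    for item in page_soup.find_all("div", {"class": "browsingitem"}):
        link = item.find("a", "browsinglink")
        if link is not None:
            titles.append(link.text.strip())
    return titles


def scrape_all(fetch: Callable[[str], List[str]] = fetch_titles,
               max_pages: int = 50) -> List[str]:
    """Collect titles page by page, stopping at the first empty page."""
    collected = []
    for page in range(max_pages):
        titles = fetch(build_url(page))
        if not titles:
            break  # an empty page is treated as the end of the listing
        collected.extend(titles)
    return collected
```

Calling `scrape_all()` with no arguments would hit the live site; passing a different `fetch` callable makes it possible to try the stop condition without any network access.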
