Web scraping the pages of an online forum thread with Python and BeautifulSoup

[http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html, http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches-2.html, http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches-3.html]

# import modules
import requests
from bs4 import BeautifulSoup

# define main-url
url = 'http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html'

# create a list of urls
urls = [url]

# load url
page = requests.get(url)

# parse it using BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')

# search for the url of the next page
nextpage = soup.find("a", ["pag next"]).get('href')

# append the urls of the next page to the list of urls
urls.append(nextpage)

print(urls)

2 Answers

User

#1 · Edited 2024-09-29 23:27:33

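The pagination URL pattern is consistent across this site, so you don't need to make a request per page just to discover the page URLs. Instead, you can parse the text of the button that reads "Page 1 of 10" and build the page URLs yourself once you know the last page number.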

import re

import requests
from bs4 import BeautifulSoup

thread_url = "http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html"
r = requests.get(thread_url)
soup = BeautifulSoup(r.content, 'lxml')
# The pager button reads "Seite 1 von 10" (German for "Page 1 of 10"); capture the last page number.
pattern = re.compile(r'Seite\s\d+\svon\s(\d+)', re.I)
pages = soup.find('a', string=pattern).text.strip()
pages = int(pattern.match(pages).group(1))
# Page 1 is the thread URL itself; later pages insert "-<n>" before ".html".
page_urls = [thread_url] + [f"{thread_url[:-5]}-{p}.html" for p in range(2, pages + 1)]
for url in page_urls:
    print(url)
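
To check the URL-building step without hitting the forum, it can help to separate it from the fetching. The following is only a sketch: `last_page_from_label` and `build_page_urls` are hypothetical helper names, and it assumes that the thread URL ends in `.html`, that page 1 has no numeric suffix (as in the list in the question), and that a thread without a pager label has a single page.

```python
import re

# Matches pager labels such as "Seite 1 von 10" and captures the last page number.
PAGE_LABEL = re.compile(r"Seite\s+\d+\s+von\s+(\d+)", re.I)


def last_page_from_label(label):
    """Return the last page number from a pager label, or 1 if none is found."""
    match = PAGE_LABEL.search(label)
    return int(match.group(1)) if match else 1


def build_page_urls(thread_url, last_page):
    """Build all page URLs, assuming thread_url ends in '.html'."""
    stem = thread_url[: -len(".html")]
    # Page 1 keeps the plain thread URL; pages 2..last_page get a "-<n>" suffix.
    return [thread_url] + [f"{stem}-{p}.html" for p in range(2, last_page + 1)]
```

Passing the text of the pager button to `last_page_from_label` and the result to `build_page_urls` should reproduce the list shown in the question for a three-page thread.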

User

#2 · Edited 2024-09-29 23:27:33

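You are iterating over urls and appending to it at the same time, so the list will keep growing indefinitely.

You are adding each URL in urls back onto urls. Do you see the problem? The list keeps growing, because it already contains every URL you are iterating over. Did you mean to call a function that runs the earlier code on the next URL and then appends the result to the list?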
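
One way to read that suggestion is a loop that fetches a page, looks up its "next" link, and moves on until no such link is left. The sketch below assumes the "next" link carries the `pag next` class used in the question; `collect_thread_urls`, `fetch`, and `max_pages` are hypothetical names, and `fetch` is passed in as a parameter so the loop can be tried against canned HTML.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_thread_urls(start_url, fetch, max_pages=50):
    """Follow 'next' links from start_url and return the visited URLs in order.

    fetch is any callable that takes a URL and returns that page's HTML.
    """
    urls = []
    seen = set()  # guards against a pager that links back to an earlier page
    url = start_url
    while url and url not in seen and len(urls) < max_pages:
        seen.add(url)
        urls.append(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        # Assumption: the next-page link has class="pag next", as in the question.
        link = soup.find("a", class_="pag next")
        href = link.get("href") if link else None
        # urljoin resolves relative hrefs and leaves absolute ones unchanged.
        url = urljoin(url, href) if href else None
    return urls
```

Against the live site, `fetch` could be something like `lambda u: requests.get(u).text`. The list is only appended to inside the loop that walks the links and is never iterated over while it grows, and the `seen` set and `max_pages` cap keep the loop from running forever.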
