<p>As @Sri mentioned in the comments, when you open that URL you get a page where you first have to accept cookies, and that requires interaction.
When you need interaction, consider using something like Selenium (<a href="https://selenium-python.readthedocs.io/" rel="nofollow noreferrer">https://selenium-python.readthedocs.io/</a>).</p>
<p>Here is something that should get you started:</p>
<p>(Edit: before running the code below, you need to run <code>pip install selenium</code>.)</p>
<pre class="lang-py prettyprint-override"><code>from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://ad.nl'

# launch Firefox with the url above
# note that you could swap in some other webdriver (e.g. Chrome)
driver = webdriver.Firefox()
driver.get(url)

# click the "accept cookies" button
btn = driver.find_element(By.NAME, 'action')
btn.click()

# grab the html; this waits until the page has finished loading
html = driver.page_source

# parse the html soup
soup = BeautifulSoup(html.lower(), "html.parser")
articles = soup.find_all("article")

for article in articles:
    # check for article titles in both h2 and h3 elems
    h2_titles = article.find_all('h2', {'class': 'ankeiler__title'})
    h3_titles = article.find_all('h3', {'class': 'ankeiler__title'})
    for t in h2_titles:
        # first I was doing print(t.text), but some titles had leading
        # newlines and things like '22:30' (presumably the hour of the day),
        # so take only the tag's direct text nodes and strip the whitespace
        text = ''.join(t.find_all(string=True, recursive=False)).lstrip()
        print(text)
    for t in h3_titles:
        text = ''.join(t.find_all(string=True, recursive=False)).lstrip()
        print(text)

# close the browser
driver.close()
</code></pre>
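<p>To see why the code joins only the tag's direct text nodes, here is a minimal sketch; the HTML snippet is made up for illustration and mimics a title tag whose nested span carries the hour of day. With <code>recursive=False</code>, child elements like that span are skipped:</p>

```python
from bs4 import BeautifulSoup

# hypothetical markup mimicking the structure described above
html = '<h2 class="ankeiler__title"><span>22:30</span>\ntitle text</h2>'
tag = BeautifulSoup(html, 'html.parser').h2

# t.text would include the nested span's '22:30'
print(tag.text)    # -> '22:30\ntitle text'

# direct text nodes only, then strip the leading whitespace
clean = ''.join(tag.find_all(string=True, recursive=False)).lstrip()
print(clean)       # -> 'title text'
```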
<p>This may not be exactly what you want, but it is an example of how to use Selenium and Beautiful Soup together. Feel free to copy/use/modify it as you see fit.
If you are wondering which selectors to use, read @JL Peyret's comment.</p>
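<p>If CSS selectors turn out to be easier than the separate h2/h3 lookups above, Beautiful Soup's <code>select</code> can cover both with one selector. A sketch on made-up markup (the real page's structure may differ):</p>

```python
from bs4 import BeautifulSoup

# made-up markup reusing the class name from the code above
html = '''
<article><h2 class="ankeiler__title">first title</h2></article>
<article><h3 class="ankeiler__title">second title</h3></article>
'''
soup = BeautifulSoup(html, 'html.parser')

# one CSS selector matches titles inside articles, whatever the heading level
for t in soup.select('article .ankeiler__title'):
    print(t.get_text(strip=True))
```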