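<p>Because the span element is hidden, you won't be able to retrieve it with BeautifulSoup. You may be able to use other attributes to get the link you need instead. If you know the name of the .htm file whose link you want to extract, you can simply find the "a" element by its inner text (the same element carries both the desired link and the hidden span), then extract the "href" from that element, as follows:</p>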
<pre><code>import string

import requests
from bs4 import BeautifulSoup  # the "html5lib" parser needs the html5lib package installed

PRINTABLE = set(string.printable)

def remove_non_ascii(s):
    # Keep only printable ASCII characters; join so a str is returned on Python 3
    return ''.join(ch for ch in s if ch in PRINTABLE)

url = 'https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination'
home_url = 'https://wwwn.cdc.gov'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

page = requests.get(url, headers=headers, allow_redirects=True)
soup = BeautifulSoup(remove_non_ascii(page.text), "html5lib")

# Find the anchor by its visible text (string= was named text= before bs4 4.4.0)
link = soup.find_all('a', string='ARX_F Doc')[0]
# The href is root-relative, so prepend the site root
complete_url = home_url + link.get('href')
print(complete_url)
</code></pre>
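<p>Prepending the site root works as long as the "href" starts with a slash. As a minimal sketch, assuming you would rather not rely on that, the standard library's <code>urljoin</code> can resolve the link against the page URL. The helper name <code>absolute_link</code> and the sample href values below are hypothetical and used only for illustration:</p>

```python
from urllib.parse import urljoin


def absolute_link(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


page_url = 'https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination'

# Hypothetical href values, not taken from the real page
root_relative = absolute_link(page_url, '/Nchs/Nhanes/example.htm')
already_absolute = absolute_link(page_url, 'https://example.org/doc.htm')

print(root_relative)
print(already_absolute)
```

<p>With this approach, <code>link.get('href')</code> could be passed straight to the helper, whether the page uses root-relative, page-relative, or absolute links.</p>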