<p>Everyone will tell you that processing HTML with regex is wrong. Rather than showing you how to do it that way, I want to show you how easy it actually is to parse HTML with a library, for example the frequently recommended <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer">BeautifulSoup 4</a>.</p>
<p>To keep it simple and close to your sample code, I just flatten your input list. Normally you would feed the raw HTML directly to the parser (see, for example, <a href="https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup">here</a>).</p>
<pre><code>from bs4 import BeautifulSoup

links = ['<a class="tablebluelink" href="https://www.samplewebsite.com/xml-data/abcdef/higjkl/Thisisthe-required-document-b4df-16t9g8p93808.pdf" target="_blank"><img alt="Download PDF" border="0" src="../Include/images/pdf.png"/></a>',
         '<a class="tablebluelink" href="https://www.samplewebsite.com/xml-data/abcdef/higjkl/Thisisthe-required-document-link-4ea4-8f1c-dd36a1f55d6f.pdf" target="_blank"><img alt="Download PDF" border="0" src="../Include/images/pdf.png"/></a>']

soup = BeautifulSoup(''.join(links), 'lxml')
for link in soup.find_all('a', href=True):
    if link['href'].lower().endswith(".pdf"):
        print(link['href'])
</code></pre>
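<p>If you cannot install third-party packages, the same filtering can be done with the standard library's <code>html.parser</code> module. This is just a minimal sketch; the <code>PdfLinkParser</code> class name and the example URLs are made up for illustration:</p>

```python
from html.parser import HTMLParser

class PdfLinkParser(HTMLParser):
    """Collects the href of every <a> tag that points to a PDF file."""
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if href.lower().endswith('.pdf'):
                self.pdf_links.append(href)

parser = PdfLinkParser()
# Placeholder URLs, not from your data:
parser.feed('<a href="https://example.com/doc.pdf">PDF</a>'
            '<a href="https://example.com/page.html">HTML</a>')
print(parser.pdf_links)  # ['https://example.com/doc.pdf']
```

<p>The trade-off: <code>HTMLParser</code> is event-driven and more verbose, while BeautifulSoup gives you a searchable tree, so for anything beyond trivial extraction the library is still the better choice.</p>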
<p>Nice and simple, isn't it?</p>