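<p>I would consider using an attribute = value CSS selector with the <code>^</code> (starts-with) operator to require that the <code>href</code> attribute begins with <code>https</code>. That way you only get links with a valid protocol. Also, use a set comprehension to make sure there are no duplicates, and use a <code>Session</code> to re-use the connection.</p>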
<pre><code>from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

final = []

with requests.Session() as s:
    # Fetch the start page and collect its unique https links
    r = s.get('https://pythonprogramming.net/parsememcparseface/')
    soup = bs(r.content, 'lxml')
    httpsLinks = {item['href'] for item in soup.select('[href^=https]')}

    # Visit each collected link and gather the https links found there
    for link in httpsLinks:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        newHttpsLinks = [item['href'] for item in soup.select('[href^=https]')]
        final.append(newHttpsLinks)

# Flatten the nested lists and drop duplicates
tidyList = list({item for sublist in final for item in sublist})
df = pd.DataFrame(tidyList)
print(df)
</code></pre>
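<p>To see what the <code>[href^=https]</code> filter plus a set actually keeps, here is a minimal offline sketch. It does not use BeautifulSoup: it swaps in the standard library's <code>html.parser</code> so it runs without a network connection or extra packages. The sample markup and the <code>HttpsHrefCollector</code> class are hypothetical, made up for illustration only.</p>

```python
from html.parser import HTMLParser

# Hypothetical sample markup, not taken from the page used in the answer
SAMPLE_HTML = """
<a href="https://example.com/a">A</a>
<a href="https://example.com/a">A again</a>
<a href="http://example.com/b">plain http</a>
<a href="/relative/path">relative</a>
<a href="mailto:someone@example.com">mail</a>
<link href="https://example.com/style.css" rel="stylesheet">
"""


class HttpsHrefCollector(HTMLParser):
    """Collects href values that start with 'https' from any tag."""

    def __init__(self):
        super().__init__()
        self.links = set()  # a set drops duplicates automatically

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Same prefix test that the [href^=https] selector performs
            if name == 'href' and value and value.startswith('https'):
                self.links.add(value)


collector = HttpsHrefCollector()
collector.feed(SAMPLE_HTML)
https_links = collector.links
print(sorted(https_links))
```

<p>Note that the selector is not limited to <code>a</code> tags: any element with a matching <code>href</code>, such as <code>link</code>, is included. If you only want anchors, <code>a[href^=https]</code> should narrow it down.</p>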