How do I get all the tags and information from an HTML page (especially all the links in the page)?

import ssl
import urllib.request

from bs4 import BeautifulSoup


def fetch_html(fullurl, contextstring):
    """Download fullurl and return it parsed as a BeautifulSoup tree."""
    print("Opening the file connection for " + fullurl)
    uh = urllib.request.urlopen(fullurl, context=contextstring)
    print("HTTP status", uh.getcode())
    html = uh.read()
    bs = BeautifulSoup(html, 'lxml')
    return bs


# SSL context with certificate checks turned off
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

mainurl = 'https://www.daad.de/deutschland/studienangebote/international-programmes/en/result/?q=&degree%5B%5D=2&lang%5B%5D=2&fos=3&crossFac=&cert=&admReq=&scholarshipLC=&scholarshipSC=&langDeAvailable=&langEnAvailable=&lvlEn%5B%5D=&cit%5B%5D=&tyi%5B%5D=&fee=&bgn%5B%5D=&dur%5B%5D=&sort=4&ins%5B%5D=&subjects%5B%5D=&limit=10&offset=&display=list'
a = fetch_html(mainurl, ctx)

# Raw string keeps the backslashes in the Windows path literal;
# the with-block closes the file once the write is done.
with open(r"F:\Harsh docs\python\courselinks.py", "w", encoding="utf-8") as f:
    f.write(a.prettify())

2 Answers

User

#1 · Edited 2024-09-27 19:20:52

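The page you are scraping appears to be rendered with JavaScript, so the HTML returned by urlopen does not yet contain the results. You could try Selenium with Chrome. Alternatively, you could use the requests-html package (https://html.python-requests.org/) to render the JavaScript before reading the HTML.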
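
As a rough sketch of the Selenium and Chrome route, the function below loads a page in headless Chrome, waits for the scripts to produce some content, and returns the rendered HTML. The function name `fetch_rendered_html`, the `wait_css` selector, and the timeout are illustrative assumptions, not part of the original question; it also assumes the `selenium` package and a local Chrome installation are available.

```python
def fetch_rendered_html(url, wait_css="a", timeout=15):
    """Return the page HTML after the browser has run its JavaScript.

    wait_css and timeout are illustrative defaults (assumptions).
    """
    # Imported here so the dependency is only needed when rendering is used.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Poll until at least one element matching wait_css exists.
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, wait_css)
        )
        return driver.page_source
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()
```

The returned string can be passed to `BeautifulSoup(html, 'lxml')` in place of the bytes read from `urlopen`. A selector more specific than `"a"` would probably be needed in practice, since most pages contain some links before the scripts have finished.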

User

#2 · Edited 2024-09-27 19:20:52

To just get all the links from a page, use the code below (Python 3):

from bs4 import BeautifulSoup
import re
from urllib.request import urlopen

html_page = urlopen("http://www.google.com/")
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html_page, "html.parser")
# Match both http:// and https:// links
for link in soup.find_all('a', attrs={'href': re.compile(r"^https?://")}):
    print(link.get('href'))
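
The filter above keeps only hrefs that already start with a scheme, so relative links such as `/en/detail/1/` are skipped. A minimal sketch of one way to include them is to resolve every href against the page URL with `urllib.parse.urljoin`. The helper name `extract_links`, the sample HTML, and the example.org base URL are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute http(s) links found in html, in page order, no repeats."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips <a> tags that have no href attribute at all.
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])  # resolves relative hrefs
        # Drop mailto:, javascript: and similar non-web schemes.
        if url.startswith(("http://", "https://")) and url not in links:
            links.append(url)
    return links


# Hypothetical sample page, used only to show the behaviour.
sample = (
    '<a href="/en/detail/1/">One</a> '
    '<a href="https://example.com/x">X</a> '
    '<a href="mailto:a@b.c">mail</a> '
    '<a name="top">no href</a>'
)
links = extract_links(sample, "https://www.example.org/list/")
print(links)
```

Here the relative link is resolved against the base URL, while the `mailto:` link and the anchor without an `href` are left out.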
