Get HTML href links whose text matches strings in a list, using Beautiful Soup

Published 2024-09-29 00:22:15


I'm trying to get URLs from a web page that contains a list of URLs. I don't want all of them, only the ones whose link text matches a string in my list. The list of strings is a subset of the link texts on the page, which I built by scraping the page and removing the text I don't want. The strings are stored in a list called filenames.

I'm trying to extract the links whose text is in the list. The following returns an empty list:

import requests
from bs4 import BeautifulSoup

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html5lib')

links = soup.findAll('a', string=filenames[0])
file_links = [link['href'] for link in links if "export" in link['href']]

The markup looks like this:

<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
                            ECZ Mathematics Paper 2 2019.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
                            ECZ Mathematics Paper 1 2019.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
                            ECZ Science Paper 3 2009.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
                            ECZ Civic Education Paper 2 2009.</a></p>

I want to get the href links for the first three, but not the last one, because the string 'ECZ Civic Education Paper 2 2009.' is not in my list of strings. The website link is here.

My list of strings looks like this:


filenames = ['ECZ Mathematics Paper 2 2019.', 'ECZ Mathematics Paper 2 2019.',
             'ECZ Science Paper 3 2009.']

I only want the first three links, because their link text is in my list (filenames). I don't want the fourth link, because the text next to its href ('ECZ Civic Education Paper 2 2009.') is not in my list, since I don't want to download that file.
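A likely reason the string= match above returns an empty list is that the link text in the markup carries leading/trailing whitespace and a newline, so an exact string comparison never succeeds. A minimal sketch of a whitespace-tolerant matcher, using a placeholder example.com URL rather than the real page:

```python
from bs4 import BeautifulSoup

# Placeholder markup mimicking the question's: note the whitespace
# around the link text, which defeats an exact string= match.
html = '''<p><a href="https://example.com/uc?export=download&id=1">
    ECZ Mathematics Paper 2 2019.</a></p>'''
filenames = ['ECZ Mathematics Paper 2 2019.']

soup = BeautifulSoup(html, 'html.parser')
# A callable matcher can strip the whitespace before comparing.
links = soup.find_all('a', string=lambda t: t and t.strip() in filenames)
file_links = [link['href'] for link in links if 'export' in link['href']]
print(file_links)
```

The same callable works against the live page once `html` is replaced by `r.content`.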


3 Answers

You can build a CSS selector and then select all the links in one pass. For example (html is the snippet from the question):

filenames = ['ECZ Mathematics Paper 1 2019.',
             'ECZ Mathematics Paper 2 2019.',
             'ECZ Science Paper 3 2009.']

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

for a in soup.select(','.join('a:contains("{}")'.format(i) for i in filenames)):
    print(a['href'])

Prints:

https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi
https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf
https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp
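One caveat: recent Soup Sieve releases deprecate the non-standard :contains pseudo-class in favour of :-soup-contains, so the selector above may emit a deprecation warning on newer installs. The same approach with the newer spelling, sketched against a placeholder example.com URL:

```python
from bs4 import BeautifulSoup

html = '''<p><a href="https://example.com/uc?export=download&id=1">
    ECZ Science Paper 3 2009.</a></p>'''
filenames = ['ECZ Science Paper 3 2009.']

soup = BeautifulSoup(html, 'html.parser')
# ":-soup-contains" is Soup Sieve's supported name for ":contains";
# it matches on the element's text content, ignoring position.
selector = ','.join('a:-soup-contains("{}")'.format(n) for n in filenames)
hrefs = [a['href'] for a in soup.select(selector)]
print(hrefs)
```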

If the request was received successfully, parse it with bs and use findAll to find the 'a' tags. I don't think it's necessary to pass string=filenames[0] to findAll:

from bs4 import BeautifulSoup as bs
temp = """<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
                            ECZ Mathematics Paper 2 2019.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
                            ECZ Mathematics Paper 1 2019.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
                            ECZ Science Paper 3 2009.</a></p>

<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
                            ECZ Civic Education Paper 2 2009.</a></p>"""

soup = bs(temp, 'html5lib')
links = soup.findAll('a')
file_links = [link['href'] for link in links if "export" in link['href']]

Output:

['https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi',
 'https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf',
 'https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp',
 'https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc']

Try it this way and see if it works:

from bs4 import BeautifulSoup as bs

html = """
<p><a href="https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi">
                            ECZ Mathematics Paper 2 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=1x_9E3PaviCuSsqfJqOsQKOwVlCWZ1jqf">
                            ECZ Mathematics Paper 1 2019.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp">
                            ECZ Science Paper 3 2009.</a></p>
<p><a href="https://drive.google.com/uc?export=download&id=0B0lFc6TrfIg7aENYc1V6akRVVnc">
                            ECZ Civic Education Paper 2 2009.</a></p>
"""

filenames = ['ECZ Mathematics Paper 2 2019.', 'ECZ Mathematics Paper 2 2019.',
             'ECZ Science Paper 3 2009.']

soup = bs(html, 'html5lib')
all_links = soup.findAll('a')

for link in all_links:
    for nam in filenames:
        if link.text.strip() == nam:
            print(link['href'])

Output:

https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi
https://drive.google.com/uc?export=download&id=1wVjbdN9fztrjxhONGRX5U6N1OJDAChOi
https://drive.google.com/uc?export=download&id=1QFOzpPLuQPup8FtKgOoIcvzTnzCaRzUp
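Because filenames contains 'ECZ Mathematics Paper 2 2019.' twice, the nested loop above prints that href twice. Converting the list to a set both removes the duplicate and makes the membership test O(1) per link; a sketch of the same idea with placeholder example.com URLs:

```python
from bs4 import BeautifulSoup

html = '''<p><a href="https://example.com/uc?export=download&id=1">
    ECZ Mathematics Paper 2 2019.</a></p>
<p><a href="https://example.com/uc?export=download&id=2">
    ECZ Science Paper 3 2009.</a></p>'''
filenames = ['ECZ Mathematics Paper 2 2019.', 'ECZ Mathematics Paper 2 2019.',
             'ECZ Science Paper 3 2009.']

soup = BeautifulSoup(html, 'html.parser')
wanted = set(filenames)  # set lookup; also drops the duplicated entry
file_links = [a['href'] for a in soup.find_all('a')
              if a.get_text(strip=True) in wanted]
print(file_links)
```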
