How can I get Beautiful Soup to grab only the content between a set of "[:" and ":]" on a web page?

Posted 2024-10-02 14:20:47


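Good afternoon! How can I get BeautifulSoup to grab only the content between multiple sets of "[:" and ":]"? So far I have the whole page in my soup, but unfortunately it has no tags.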

What it looks like so far

So far I have tried a few approaches:

  • soup.findAll(text="[")
  • keys = soup.find("span", attrs = {"class": "objectBox objectBox-string"})

    import bs4 as bs
    import urllib.request
    
    source = urllib.request.urlopen("https://login.microsoftonline.com/common/discovery/keys").read()
    soup = bs.BeautifulSoup(source,'lxml')
    
    # ---------------------------------------------
    
    #  prior script that I was playing with trying to tackle this issue
    
    import requests
    import urllib.request
    import time
    from bs4 import BeautifulSoup
    
    # Set URL to scrape new certs from
    newcerts = "https://login.microsoftonline.com/common/discovery/keys"
    
    # Connect to the URL
    response = requests.get(newcerts)
    
    # Parse HTML and save to BeautifulSoup Object
    soup = BeautifulSoup(response.text, "html.parser")
    
    keys = soup.find("span", attrs = {"class": "objectBox objectBox-string"})
    

The end goal is to retrieve the public PKI keys from Azure's site at https://login.microsoftonline.com/common/discovery/keys


2 Answers

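The data you get from that URL is already structured as JSON, which maps directly onto a Python dict. I would fetch it with requests and use ast to convert it from a string to a dict.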

Here is an example:

    import ast

    import requests

    # Fetch the response data
    response = requests.get("https://login.microsoftonline.com/common/discovery/keys")

    # Convert the response text from a string to a dict with ast
    my_dict = ast.literal_eval(response.text)

    # Inspect the resulting dict
    print(my_dict)
    # Confirm that it is a dict
    print(type(my_dict))

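From there you can access each value with ordinary Python dict operations. Note that ast.literal_eval only works as long as the payload contains no JSON-only literals such as true, false, or null; a JSON parser is the more robust choice.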
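As a minimal sketch of that dict access, the snippet below walks a parsed payload and indexes its entries by key ID. The sample payload is a shortened, made-up stand-in for the real response: the field names `keys`, `kid`, `kty`, and `x5c` follow the JWK Set layout, but the values are placeholders, and `index_by_kid` is a hypothetical helper name.

```python
# Hypothetical, shortened stand-in for the dict produced above;
# real responses carry more fields and much longer values.
sample = {
    "keys": [
        {"kty": "RSA", "kid": "key-1", "x5c": ["AAAA"]},
        {"kty": "RSA", "kid": "key-2", "x5c": ["BBBB"]},
    ]
}


def index_by_kid(jwks):
    """Map each key ID to its key entry (hypothetical helper)."""
    return {entry["kid"]: entry for entry in jwks["keys"]}


by_kid = index_by_kid(sample)
for kid, entry in by_kid.items():
    # x5c is a list; its first element is the certificate for this key
    print(kid, entry["kty"], entry["x5c"][0])
```

Indexing by `kid` is just one convenient layout; iterating over `sample["keys"]` directly works equally well when every key is needed.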

Not sure whether this is what you are after, but try the following script:

    import json
    import requests

    url = 'https://login.microsoftonline.com/common/discovery/keys'

    res = requests.get(url)
    # Parse the JSON body into a dict
    jsonobject = json.loads(res.content)
    # Each entry under 'keys' carries its certificate chain in 'x5c'
    for item in jsonobject['keys']:
        print(item['x5c'])
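Since the stated goal is to retrieve the public keys, one possible next step is turning an `x5c` entry into a PEM certificate that other tools can read. In a JWK Set each `x5c` entry is a base64-encoded DER certificate, so a sketch only needs to add the PEM header, footer, and 64-character line wrapping. `x5c_to_pem` is a hypothetical helper name, and the sample input is placeholder bytes, not a real certificate.

```python
import base64


def x5c_to_pem(x5c_entry):
    """Wrap one base64 x5c entry in PEM certificate armor (hypothetical helper)."""
    # Raises binascii.Error if the entry is not valid standard base64
    base64.b64decode(x5c_entry, validate=True)
    # PEM bodies are conventionally wrapped at 64 characters per line
    lines = [x5c_entry[i:i + 64] for i in range(0, len(x5c_entry), 64)]
    body = "\n".join(lines)
    return f"-----BEGIN CERTIFICATE-----\n{body}\n-----END CERTIFICATE-----\n"


# Placeholder bytes for illustration only, NOT a real certificate
fake_entry = base64.b64encode(b"\x30\x82" + bytes(100)).decode("ascii")
pem = x5c_to_pem(fake_entry)
print(pem)
```

With the script above, this would be called as `x5c_to_pem(item['x5c'][0])` inside the loop. The helper only checks that the input is valid base64; it does not verify that the bytes form a valid certificate.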
