Unable to extract all available URLs at different depths from some JSON content

Published 2024-10-02 12:30:43


I am trying to parse all the URL values at different depths from some JSON content. I have attached a file containing the URLs at different depths for your reference.

This is how they are structured (truncated):

{'hasSub': True,
 'navigationTitle': 'Products',
 'nodeName': 'products',
 'pages': [{'hasSub': True,
            'navigationTitle': 'Enclosures',
            'nodeName': 'PG0002SCHRANK1',
            'pages': [{'hasSub': True,
                       'navigationTitle': 'Hygienic Design',
                       'nodeName': 'PG0125SCHRANK1',
                       'pages': [{'hasSub': False,
                                  'navigationTitle': 'Hygienic Design Terminal '
                                                     'box HD',
                                  'nodeName': 'PRO0130',
                                  'target': '_self',
                                  'url': '/com-en/products/PG0002SCHRANK1/PG0125SCHRANK1/PRO0130'},
                                 {'hasSub': False,
                                  'navigationTitle': 'Hygienic Design Compact '
                                                     'enclosure HD, '
                                                     'single-door',
                                  'nodeName': 'PRO0131',
                                  'target': '_self',
                                  'url': '/com-en/products/PG0002SCHRANK1/PG0125SCHRANK1/PRO0131'},

Given the content above, my output is:

/com-en/products/PG0002SCHRANK1/PG0125SCHRANK1/PRO0130
/com-en/products/PG0002SCHRANK1/PG0125SCHRANK1/PRO0131

The script I wrote to fetch the JSON content:

import requests
from pprint import pprint

# Endpoint that returns the site's navigation tree as JSON.
url = 'https://www.rittal.com/.rest/nav/menu/tree'
params = {
    'path': 'com',
    'locale': 'en',
    'deep': '10'  # request up to ten levels of nesting
}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['Accept'] = 'application/json, text/plain, */*'
    r = s.get(url, params=params)
    pprint(r.json()['pages'][0])

How can I scrape all the URLs at different depths out of the JSON content?


Tags: com, json, true, url, content, pages, en, products
2 Answers

OK, it seems I found a solution elsewhere that grabs all the available links from arbitrarily nested JSON:

import requests
from pprint import pprint

url = 'https://www.rittal.com/.rest/nav/menu/tree'
params = {
    'path': 'com',
    'locale': 'en',
    'deep': '10'
}

def json_extract(obj, key):
    """Recursively pull every value of `key` out of nested dicts and lists."""
    arr = []

    def extract(obj, arr, key):
        if isinstance(obj, dict):
            for k, v in obj.items():
                if isinstance(v, (dict, list)):
                    extract(v, arr, key)  # descend into nested containers
                elif k == key:
                    arr.append(v)  # found a scalar value under the target key
        elif isinstance(obj, list):
            for item in obj:
                extract(item, arr, key)
        return arr

    values = extract(obj, arr, key)
    return values

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['Accept'] = 'application/json, text/plain, */*'
    r = s.get(url, params=params).json()
    for item in json_extract(r, 'url'):
        print(item)

The script produces roughly 3,500 links.
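For a quick sanity check, here is a minimal, self-contained demo of json_extract against a hand-written dict mirroring the truncated structure shown above (the sample data is made up purely for illustration):

sample = {
    'hasSub': True,
    'nodeName': 'products',
    'pages': [{'hasSub': True,
               'nodeName': 'PG0002SCHRANK1',
               'pages': [{'hasSub': False,
                          'nodeName': 'PRO0130',
                          'url': '/com-en/products/PG0002SCHRANK1/PG0125SCHRANK1/PRO0130'}]}],
}

# json_extract finds the 'url' key no matter how deeply it is nested.
print(json_extract(sample, 'url'))
# ['/com-en/products/PG0002SCHRANK1/PG0125SCHRANK1/PRO0130']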

What you can do is recurse over the JSON. That is the best way to handle URLs at different depths.

The recursion below retrieves the deepest URLs by walking the JSON:

import requests
from pprint import pprint

url = 'https://www.rittal.com/.rest/nav/menu/tree'
params = {
    'path': 'com',
    'locale': 'en',
    'deep': '10'
}

def recurse(data):
    if 'pages' in data:
        # Recursive case: descend into every sub-page of the current level.
        for page in data['pages']:
            recurse(page)
    elif 'url' in data and data['url'].startswith('/com-en/'):
        # Base case: a leaf whose URL matches the expected pattern.
        urls.append(data['url'])

urls = []
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['Accept'] = 'application/json, text/plain, */*'
    r = s.get(url, params=params).json()
    recurse(r)
    pprint(urls)

This is how it works:

  • Recursive case: if the current level has pages, recurse for each page at that level
  • Base case: if a url appears at the current level, append it to the list of URLs

Also, if you switch the elif to an if, it will give you the URLs at every level rather than just the deepest ones, as sketched below.
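A minimal sketch of that variant, assuming intermediate pages can carry both a pages list and their own url (the name recurse_all is hypothetical):

def recurse_all(data):
    # Same traversal, but with `if` instead of `elif`: a node's own URL
    # (when present) is collected *and* its sub-pages are still visited.
    if 'pages' in data:
        for page in data['pages']:
            recurse_all(page)
    if 'url' in data and data['url'].startswith('/com-en/'):
        urls.append(data['url'])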

Update: there seem to be 2 rogue URLs in that JSON. Specifically, one is https://www.eplan-software.com/solutions/eplan-platform/ and the other is empty! So I added the condition data['url'].startswith('/com-en/') to append only the URLs that match the expected pattern.
