Python 3: scraping a set of JavaScript variables from a web page into a Python dict object

Posted 2024-09-29 21:51:17


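I'm using requests and BeautifulSoup4 to download and scrape information from a web page, and I've managed to narrow it down to everything inside the specific <script> tag that holds the data I'm after. To get this part of the code working, I've set aside all the requests and BS4 code and simply put the following string at the top of my code: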

Content = '''// <![CDATA[
devicetype = "computer";
isios = false;
videocdn = "media";
videopath = "updates/na/vid01";
poster = {
    "file": "preview/vidsplash.jpg",
    "st": "1557499029",
    "et": "1557502629",
    "hs": "f3ad16f42fec5224d323915cdfbf43ed"
};
attachname = "some-video-00001234";
videos[0] = {
    "wmv": {
        "file": "wmv/01.wmv",
        "name": "01",
        "duration": 502,
        "size": "195.1MB",
        "wid": 854,
        "hgt": 480,
        "st": "1557499029",
        "et": "1557502629",
        "hs": "a0cfdef3b8b9e3dea576368a5bfbaef9",
        "caps": []
    },
    "h264": {
        "file": "h264/01.mp4",
        "name": "01",
        "duration": 502,
        "size": "73.9MB",
        "wid": 854,
        "hgt": 480,
        "st": "1557499029",
        "et": "1557502629",
        "hs": "32901a1870d0b32458b465ac9c3d6cad",
        "caps": [{
            "file": "001.jpg",
            "fs": {
                "st": "1557499029",
                "et": "1557502629",
                "hs": "5b328642a84fa6406bda527c18e46c27"
            },
            "tn": {
                "st": "1557499029",
                "et": "1557502629",
                "hs": "0a4ad7d0edf1b92538b8127f8e297c41"
            }
        }, {
            "file": "002.jpg",
            "fs": {
                "st": "1557499029",
                "et": "1557502629",
                "hs": "4390c0d9b321b5e86c88cb8ca5e56ede"
            },
            "tn": {
                "st": "1557499029",
                "et": "1557502629",
                "hs": "9cf83158268379df660d6d01750a047c"
            }
        }]
    }
};
// ]]>'''

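Note that this has been beautified. Normally the "poster" and "videos[0]" variables each sit on a single line rather than being indented across multiple lines as shown here. This is not the full data set from the <script> tag; I stripped out the repetitive parts so you can see how the data is structured. Also note that the "videos[0]" structure repeats as "videos[1]" and so on, a varying number of times.

What I want to do is convert this big multi-line string into a proper dictionary that I can work with in my Python code to pull out the bits I need: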

print(NewContent)

Output:

{'devicetype': 'computer', 'isios': False, 'videocdn': 'media'}

and so on.

I've been playing around with js2py to try to get it to do what I need, but the furthest I've got so far is the following code:

import js2py

splitrawlines = Content.splitlines()
rawvars = []
for line in splitrawlines:
    # need to add the videos declaration in case it gets to a line where it expects it to already be declared.
    rawvars.append(js2py.eval_js("videos = [];\n" + line))

print(rawvars)

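The only problem is that it doesn't output a dict; it outputs a list. I could probably still work with that, but it isn't even a list Python can manipulate: technically it is still a js2py.base.JsObjectWrapper object. I can convert that object to a string, but the only way I've found to turn the string into a list is to split it on whitespace and put each piece into its own list entry. I basically have an already-formatted list sitting inside a string.

I may be going about this the wrong way, but this is the closest I've got so far. So I need either a way to turn a string that is essentially already formatted as a complete list into an actual list object or, preferably, a different way to turn all the variables in arbitrary JavaScript code into native Python variables I can work with.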


1 answer
User
#1 · posted 2024-09-29 21:51:17

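The JavaScript data is mostly in JSON format, so you can use Python's json module to convert it into a Python dictionary.

For example, the data after "videos[0] = " is valid JSON once the trailing ; is dropped, so you can create a dictionary with data = json.loads(string) and then read values such as data['wmv']['size']: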

data = '''{
    "wmv": {
        "file": "wmv/01.wmv",
        "name": "01",
        "duration": 502,
        "size": "195.1MB",
        "wid": 854,
        "hgt": 480,
        "st": "1557499029",
        "et": "1557502629",
        "hs": "a0cfdef3b8b9e3dea576368a5bfbaef9",
        "caps": []
    },
    "h264": {
        "file": "h264/01.mp4",
        "name": "01",
        "duration": 502,
        "size": "73.9MB",
        "wid": 854,
        "hgt": 480,
        "st": "1557499029",
        "et": "1557502629",
        "hs": "32901a1870d0b32458b465ac9c3d6cad",
        "caps": [{
            "file": "001.jpg",
            "fs": {
                "st": "1557499029",
                "et": "1557502629",
                "hs": "5b328642a84fa6406bda527c18e46c27"
            },
            "tn": {
                "st": "1557499029",
                "et": "1557502629",
                "hs": "0a4ad7d0edf1b92538b8127f8e297c41"
            }
        }, {
            "file": "002.jpg",
            "fs": {
                "st": "1557499029",
                "et": "1557502629",
                "hs": "4390c0d9b321b5e86c88cb8ca5e56ede"
            },
            "tn": {
                "st": "1557499029",
                "et": "1557502629",
                "hs": "9cf83158268379df660d6d01750a047c"
            }
        }]
    }
}'''

import json

data = json.loads(data)

print(data['wmv']['size'])

# 195.1MB
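
The snippet above pastes the JSON in by hand. As a rough sketch of pulling it straight out of the script text instead, `json.JSONDecoder().raw_decode` can parse one JSON value starting at a given index and stop there, so the trailing `;` and the rest of the script do not need to be trimmed first. The helper name `extract_js_value` and the shortened `script` sample are made up for this illustration.

```python
import json

# Hypothetical sample: a shortened version of the script text from the question.
script = '''// <![CDATA[
attachname = "some-video-00001234";
videos[0] = {
    "wmv": {"file": "wmv/01.wmv", "size": "195.1MB", "caps": []}
};
// ]]>'''


def extract_js_value(script_text, assignment):
    """Decode the JSON value that follows `assignment`, e.g. 'videos[0] = '."""
    # str.index raises ValueError if the assignment is not present.
    start = script_text.index(assignment) + len(assignment)
    # raw_decode parses exactly one JSON value beginning at `start`
    # and ignores whatever comes after it (here, the ';').
    value, _end = json.JSONDecoder().raw_decode(script_text, start)
    return value


video = extract_js_value(script, 'videos[0] = ')
print(video['wmv']['size'])

# 195.1MB
```

This assumes the text right after the assignment is valid JSON; a value written with JavaScript-only syntax (single quotes, unquoted keys) would raise json.JSONDecodeError.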

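If every variable is on its own line, you can use split('\n') to get the lines and then split(' = ', 1) to separate each key from its value.

Then just check whether the value starts with { or [, in which case you can use json. The other values may be plain strings that don't need json; you probably only need to strip the surrounding ".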

Content = '''// <![CDATA[
devicetype = "computer";
isios = false;
videocdn = "media";
videopath = "updates/na/vid01";
poster = {"file": "preview/vidsplash.jpg","st": "1557499029","et": "1557502629","hs": "f3ad16f42fec5224d323915cdfbf43ed"};
attachname = "some-video-00001234";'''

import json

results = {}

for line in Content.split('\n'):
    if ' = ' in line:
        line = line.strip().rstrip(';')  # drop stray whitespace (e.g. '\r') and the trailing `;`

        key, val = line.split(' = ', 1)

        if val.startswith( ('[', '{') ):
            results[key] = json.loads(val)
        elif val.startswith('"'):
            val = val[1:-1] # remove `"`
            results[key] = val
        elif val == 'false':
            results[key] = False
        elif val == 'true':
            results[key] = True

print(results['devicetype'])
print(results['isios'])
print(results['videocdn'])
print(results['poster']['file'])

# computer
# False
# media
# preview/vidsplash.jpg
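
The line-by-line approach relies on every value fitting on one line. A possible variation, sketched below under a few assumptions, scans the whole script with a regular expression and lets `json.JSONDecoder().raw_decode` read each value, so beautified multi-line objects and indexed assignments such as `videos[0]`, `videos[1]` are handled as well. The function name `parse_js_assignments`, the choice to store indexed assignments in a dict keyed by index, and the `videos[1]` sample data are all invented for illustration.

```python
import json
import re

# Matches `name = ` or `name[3] = ` at the start of a line.
ASSIGNMENT = re.compile(r'^(\w+)(?:\[(\d+)\])?\s*=\s*', re.MULTILINE)


def parse_js_assignments(script_text):
    """Collect top-level `name = <JSON value>;` assignments into a dict."""
    decoder = json.JSONDecoder()
    results = {}
    pos = 0
    while True:
        match = ASSIGNMENT.search(script_text, pos)
        if match is None:
            break
        name, index = match.group(1), match.group(2)
        try:
            # Parse one JSON value; `pos` moves to just past that value.
            value, pos = decoder.raw_decode(script_text, match.end())
        except json.JSONDecodeError:
            # Right-hand side is not JSON (e.g. a JS expression): skip it.
            pos = match.end()
            continue
        if index is None:
            results[name] = value
        else:
            # Indexed assignments are gathered as {index: value}; an earlier
            # non-dict value such as `videos = []` is replaced.
            slot = results.get(name)
            if not isinstance(slot, dict):
                slot = results[name] = {}
            slot[int(index)] = value
    return results


# Shortened sample; the videos[1] entry is made-up data for the example.
script = '''// <![CDATA[
devicetype = "computer";
isios = false;
poster = {
    "file": "preview/vidsplash.jpg",
    "st": "1557499029"
};
videos[0] = {
    "wmv": {"file": "wmv/01.wmv", "duration": 502, "caps": []}
};
videos[1] = {
    "wmv": {"file": "wmv/02.wmv", "duration": 617, "caps": []}
};
// ]]>'''

results = parse_js_assignments(script)
print(results['devicetype'])
print(results['isios'])
print(results['poster']['file'])
print(results['videos'][1]['wmv']['duration'])

# computer
# False
# preview/vidsplash.jpg
# 617
```

Because strings, numbers, booleans, arrays and objects are all JSON values, no per-type branching is needed here. The sketch only works while the right-hand sides are valid JSON; anything else is silently skipped, which may or may not be what you want.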
