How can I get information from the HTML script below?

Posted 2024-09-27 23:26:06


I used Beautiful Soup to extract the following script from an HTML page:

<script>
   dataLayer =[{
  "pageTitle": "PRODUCT: Macculloch Parka Print( 9512MP )",
  "pageCategory": "shop-mens-parkas",
  "visitorLoginState": "Guest",
  "EmployeeLoginState": false,
  "customerEmail": "null",
  "customerOrders": "null",
  "customerValue": "0",
  "Country": "CA",
  "State": "ON",
  "ecommerce": {
    "currencyCode": "CAD",
    "detail": {
      "actionField": {
        "list": "Product Category / Search Results"
         },

      "products": [
        {
          "name": "Macculloch Parka Print",
          "id": "9512MP",
          "price": 1295,
          "brand": "Canada Goose",
          "category": "shop-mens-parkas"}]}}}];</script>

I want to extract the product-related information (name, id, price, and brand) as a DataFrame. Is there a way to do this without using regex?


2 Answers

You can use a regex to grab the JSON and then parse it:

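import json
import re

# d is assumed to hold the script text as a string, e.g. str(tag) for the
# <script> tag found with Beautiful Soup.
# The greedy (.*) with re.DOTALL captures everything between "dataLayer ="
# and the last ";" in the string.
data = json.loads(re.search(r"dataLayer =(.*);", d, re.DOTALL).group(1))
products = data[0]["ecommerce"]["detail"]["products"]
product_name = products[0]["name"]
product_id = products[0]["id"]
product_price = products[0]["price"]
product_brand = products[0]["brand"]
product_category = products[0]["category"]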
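The question asks for a DataFrame, which the snippet above stops short of. A minimal sketch of that last step, assuming pandas is installed and that the parsed structure matches the one in the question (the inline `data` literal here is only a stand-in for the result of `json.loads`):

```python
import json

import pandas as pd

# Stand-in for the parsed dataLayer; in practice this would be the list
# returned by json.loads on the extracted script payload.
data = json.loads("""
[{"ecommerce": {"detail": {"products": [
    {"name": "Macculloch Parka Print", "id": "9512MP", "price": 1295,
     "brand": "Canada Goose", "category": "shop-mens-parkas"}]}}}]
""")

products = data[0]["ecommerce"]["detail"]["products"]

# A list of dicts maps directly onto rows; `columns` keeps only the
# fields the question asks for, in that order.
df = pd.DataFrame(products, columns=["name", "id", "price", "brand"])
print(df)
```

Because `products` is a list, the same call should produce one row per product if the page ever lists more than one.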

This is a stopgap solution, pending more information about the data format.

import re
import json

def get_datalayer_json(raw_script_tag: str):
    """Return the parsed dataLayer value, or None if the tag does not match."""
    # Expects the whole tag, including <script> and </script>
    parser_re = r"<script>\s*dataLayer =(.*);\s*</script>"
    parser_result = re.match(parser_re, raw_script_tag.strip(), re.DOTALL)
    if parser_result is None:
        return None
    return json.loads(parser_result.group(1))
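Both answers rely on a regex, while the question asked for a way to avoid one. One possible regex-free sketch uses `str.find` to locate the assignment and `json.JSONDecoder.raw_decode` to parse the value that follows. The function name `extract_datalayer` and its `marker` parameter are hypothetical, and the approach assumes the payload is valid JSON, as it is in the question:

```python
import json


def extract_datalayer(script_text: str, marker: str = "dataLayer"):
    """Return the JSON value assigned to `marker`, or None if not found."""
    start = script_text.find(marker)
    if start == -1:
        return None
    eq = script_text.find("=", start + len(marker))
    if eq == -1:
        return None
    # raw_decode does not skip leading whitespace, so strip it first.
    payload = script_text[eq + 1:].lstrip()
    try:
        # raw_decode parses one JSON value and ignores what follows it,
        # so the trailing ";" and "</script>" need no special handling.
        value, _end = json.JSONDecoder().raw_decode(payload)
    except json.JSONDecodeError:
        return None
    return value


# Shortened version of the script from the question.
script_text = """<script>
   dataLayer =[{
  "EmployeeLoginState": false,
  "ecommerce": {"detail": {"products": [
    {"name": "Macculloch Parka Print", "id": "9512MP",
     "price": 1295, "brand": "Canada Goose"}]}}}];</script>"""

result = extract_datalayer(script_text)
product = result[0]["ecommerce"]["detail"]["products"][0]
print(product["name"], product["id"], product["price"], product["brand"])
```

This would break if the script used JavaScript syntax that is not valid JSON (single quotes, trailing commas, unquoted keys), in which case the function returns None rather than raising.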
