How can I scrape the content of this particular website (Cineatlas)?

Posted 2024-10-02 22:32:13


I am trying to scrape the content of this particular website: https://www.cineatlas.com/

I tried to scrape the date section shown in the screenshot:


I used this basic BeautifulSoup code:

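import requests
from bs4 import BeautifulSoup

# Fetch the page; this step was not shown in the original snippet.
response = requests.get('https://www.cineatlas.com/')
soup = BeautifulSoup(response.text, 'html.parser')
time = soup.find('ul', class_='slidee')  # the <ul> expected to hold the dates
print(time)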

This is what I get, instead of a list of elements:

<ul class="slidee">
<!-- adding dates -->
</ul>

2 answers
lis = time.findChildren()

This returns a list of the child nodes.
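
As a quick check of what this call gives on the markup from the question, here is a minimal sketch that parses the `<ul>` exactly as it was returned. `findChildren()` is an older alias of `find_all()` and returns tags only, so the comment inside the list is not included. The variable name `ul` is used here instead of `time` purely for readability.

```python
from bs4 import BeautifulSoup

# The markup that the request actually returned, copied from the question.
html = """<ul class="slidee">
<!-- adding dates -->
</ul>"""

soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul", class_="slidee")

# findChildren() is an alias of find_all(): it returns Tag objects only,
# so the HTML comment and the whitespace text nodes are skipped.
lis = ul.findChildren()
print(len(lis))  # 0
```

The list is empty because the server sends the `<ul>` without any `<li>` items, so the dates have to be found elsewhere in the page source.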

The site builds these HTML elements dynamically from JavaScript content. You can use re to extract the JS data, for example:

import re
import json
import requests
from ast import literal_eval

url = 'https://www.cineatlas.com/'

html_data = requests.get(url).text
movieData = re.findall(r'movieData = ({.*?}), movieDataByReleaseDate', html_data, flags=re.DOTALL)[0]
movieData = re.sub(r'\s*/\*.*?\*/\s*', '', movieData)   # remove comments
movieData = literal_eval(movieData) # in movieData you have now the information about the current movies

print(json.dumps(movieData, indent=4))  # print data to the screen

Prints:

{
    "2019-08-06": [
        {
            "url": "fast furious hobbs shaw",
            "image-portrait": "https://d10u9ygjms7run.cloudfront.net/dd2qd1xaf4pceqxvb41s1xpzs0/1562603443098_891497ecc8b16b3a662ad8b036820ed1_500x735.jpg",
            "image-landscape": "https://d10u9ygjms7run.cloudfront.net/dd2qd1xaf4pceqxvb41s1xpzs0/1562603421049_7c233477779f25725bf22aeaacba469a_700x259.jpg",
            "title": "FAST &amp; FURIOUS : HOBBS &amp; SHAW",
            "releaseDate": "2019-08-07",
            "endpoint": "ST00000392",
            "duration": "120 mins",
            "rating": "Classification TOUT",
            "director": "",
            "actors": "",
            "times": [
                {
                    "time": "7:00pm",
                    "bookingLink": "https://ticketing.eu.veezi.com/purchase/8388?siteToken&#x3D;b4ehk19v6cqkjfwdsyctqra72m",
                    "attributes": [
                        {
                            "_id": "5d468c20f67cc430833a5a2b",
                            "shortName": "VF",
                            "description": "Version Fran\u00e7aise"
                        },
                        {
                            "_id": "5d468c20f67cc430833a5a2a",
                            "shortName": "3D",
                            "description": "3D"
                        }
                    ]
                },
                {
                    "time": "9:50pm",
                    "bookingLink": "https://ticketing.eu.veezi.com/purchase/8389?siteToken&#x3D;b4ehk19v6cqkjfwdsyctqra72m",

... and so on.
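
To connect the extracted dictionary back to the original question (the dates), here is a minimal sketch that runs the same regex and `literal_eval` steps on a small inline sample, then lists the dates and showtimes. The `sample_html` string is a hand-written assumption that only imitates the `movieData = {...}, movieDataByReleaseDate` pattern targeted by the regex, with values abridged from the output above; the function name `extract_movie_data` is hypothetical. `html.unescape` is used because the titles and booking links contain HTML entities such as `&amp;` and `&#x3D;`.

```python
import re
from ast import literal_eval
from html import unescape

# Hand-written stand-in for the page source (an assumption, not the real page).
sample_html = r'''
<script>
var movieData = {
    /* sample comment */
    "2019-08-06": [
        {
            "title": "FAST &amp; FURIOUS : HOBBS &amp; SHAW",
            "times": [
                {"time": "7:00pm", "bookingLink": "https://ticketing.eu.veezi.com/purchase/8388?siteToken&#x3D;b4ehk19v6cqkjfwdsyctqra72m"},
                {"time": "9:50pm", "bookingLink": "https://ticketing.eu.veezi.com/purchase/8389?siteToken&#x3D;b4ehk19v6cqkjfwdsyctqra72m"}
            ]
        }
    ]
}, movieDataByReleaseDate = {};
</script>
'''


def extract_movie_data(page_source):
    """Pull the movieData object literal out of the page text and parse it."""
    raw = re.findall(r'movieData = ({.*?}), movieDataByReleaseDate',
                     page_source, flags=re.DOTALL)[0]
    raw = re.sub(r'\s*/\*.*?\*/\s*', '', raw)  # strip /* ... */ comments
    return literal_eval(raw)


movie_data = extract_movie_data(sample_html)

# The dictionary keys are the dates the question was after.
dates = sorted(movie_data)
print(dates)

# Walk date -> movies -> showtimes, decoding HTML entities along the way.
for date in dates:
    for movie in movie_data[date]:
        title = unescape(movie["title"])
        for show in movie["times"]:
            print(date, title, show["time"], unescape(show["bookingLink"]))
```

Note that `literal_eval` only works while the JavaScript object is also a valid Python literal; values such as `true`, `false` or `null` would make it fail, in which case a JSON parser or a different cleanup step would be needed.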
