How do I parse data inside a <pre> tag with BeautifulSoup?

2 answers

User

Answer 1 · edited 2024-06-24 13:43:05

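I used js2py because the materials object contains several keys (BVRRRatingSummarySourceID, BVRRSecondaryRatingSummarySourceID and BVRRSourceID), and if you need them, pulling the HTML out of its values with a regex alone would be much harder.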

import re

import js2py
import requests
from bs4 import BeautifulSoup

# Fetch the JavaScript response that embeds the review HTML
r = requests.get('https://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/1061083288/reviews.djs?format=embeddedhtml')

# Match the "var materials = {...}" assignment
pattern = (r'var'
           r'\s+'
           r'materials'
           r'\s*=\s*'
           r'{"BVRRRatingSummarySourceID".*}')

js_materials = re.search(pattern, r.text).group()
# Evaluate the assignment with js2py and convert the result to a dict
obj = js2py.eval_js(js_materials).to_dict()
html = obj['BVRRSourceID']
soup = BeautifulSoup(html, 'lxml')
spans = soup.select('span.BVRRReviewAbbreviatedText')
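
If the object literal assigned to materials happens to be valid JSON (an assumption, not verified against this endpoint), the standard json module can decode it without a JavaScript interpreter. The sketch below runs on a made-up sample string; sample_js and extract_materials are hypothetical names introduced only for illustration.

```python
import json
import re

# Hypothetical stand-in for r.text; the real response is much larger.
sample_js = (
    'var materials = {"BVRRRatingSummarySourceID": "<div>4.5</div>", '
    '"BVRRSourceID": "<span class=\\"BVRRReviewAbbreviatedText\\">Great towels</span>"};'
)


def extract_materials(js_text):
    """Return the object assigned to `materials`, assuming it is valid JSON."""
    match = re.search(r'var\s+materials\s*=\s*', js_text)
    if match is None:
        raise ValueError('no "var materials =" assignment found')
    # raw_decode parses one JSON value and ignores what follows it (e.g. ";")
    obj, _end = json.JSONDecoder().raw_decode(js_text, match.end())
    return obj


materials = extract_materials(sample_js)
print(sorted(materials))
```

If the literal uses JavaScript-only syntax (unquoted keys, single-quoted strings), json will raise an error and an interpreter such as js2py remains the safer choice.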

In the example above I only used the HTML under the BVRRSourceID key, but you can use all of the HTML by joining the values together:

html = ''.join(obj.values())
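
As a rough illustration of what joining the values gives you, the sketch below builds a small dict by hand, joins its values, and reads the text of each matching span. The dict contents are invented sample data, and html.parser is used here only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the dict returned by js2py's to_dict()
obj = {
    'BVRRRatingSummarySourceID': '<div class="summary">4.5 stars</div>',
    'BVRRSourceID': (
        '<span class="BVRRReviewAbbreviatedText">Soft and absorbent.</span>'
        '<span class="BVRRReviewAbbreviatedText">Shrank after washing.</span>'
    ),
}

html = ''.join(obj.values())
# html.parser ships with Python, so this sketch does not need lxml
soup = BeautifulSoup(html, 'html.parser')
reviews = [
    span.get_text(strip=True)
    for span in soup.select('span.BVRRReviewAbbreviatedText')
]
for text in reviews:
    print(text)
```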

Don't forget to install js2py (pip install js2py), and lxml as well if you want to use the lxml parser.

User

Answer 2 · edited 2024-06-24 13:43:05

You can use Selenium WebDriver to get the HTML content you are interested in. For example:

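import time

from selenium import webdriver


def get_html(url):
    # Launch Chrome; a matching driver must be available on this machine
    driver = webdriver.Chrome()
    try:
        driver.maximize_window()
        driver.get(url)
        # Crude fixed wait so the page's scripts can finish rendering
        time.sleep(5)
        return driver.page_source.strip()
    finally:
        # Always close the browser, even if loading the page fails
        driver.quit()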
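
Once you have the page source as a string, it can be handed to BeautifulSoup to reach the <pre> tag the question asks about. This is a minimal sketch on an invented HTML snippet standing in for the return value of get_html; the id and the text inside the tag are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the string returned by get_html(url)
html_content = """
<html><body>
<pre id="log">line one
line two</pre>
</body></html>
"""

soup = BeautifulSoup(html_content, 'html.parser')
pre = soup.find('pre')
# get_text() keeps the line breaks that appear inside <pre>
pre_lines = pre.get_text().splitlines() if pre is not None else []
print(pre_lines)
```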
