Why does BeautifulSoup add <html><body><p> to my result?

<!DOCTYPE html><html lang="it-IT"><head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=Edge"> <head><title>Title here</title></head> <body> <script id="TargetID" type="application/json"><![CDATA[ { "name":"Kate", "age":22, "city":"Boston"} ]]> </script><script id="AnotherID" type="application/json"><![CDATA[{ "name":"John", "age":31, "city":"New York"}]]> </script> </body></html>

2 answers

User

Answer 1 · edited 2024-10-02 02:41:06

It can be simplified a lot:

from bs4 import BeautifulSoup

# Open the file with an explicit encoding and close it once parsing is done.
with open("/Users/me/Page01.htm", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

result = soup.find("script", type="application/json", id="TargetID").text
# Workaround to get the CDATA content (it seems this can't be done with bs4 directly):
result = result.replace("<![CDATA[", "").replace("]]>", "").strip()
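
As a self-contained sketch of the same idea, the example below parses an inline, trimmed copy of the question's page instead of reading Page01.htm, and decodes every application/json script tag. The names HTML, strip_cdata, and load_json_scripts are hypothetical helpers introduced here for illustration; they are not part of bs4.

```python
import json

from bs4 import BeautifulSoup

# Inline, trimmed copy of the question's page (assumption: stands in for Page01.htm).
HTML = """<!DOCTYPE html><html lang="it-IT"><head><title>Title here</title></head>
<body>
<script id="TargetID" type="application/json"><![CDATA[ { "name":"Kate", "age":22, "city":"Boston"} ]]> </script>
<script id="AnotherID" type="application/json"><![CDATA[{ "name":"John", "age":31, "city":"New York"}]]> </script>
</body></html>"""

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def strip_cdata(text):
    """Remove a surrounding CDATA wrapper, if present, and trim whitespace."""
    text = text.strip()
    if text.startswith(CDATA_OPEN) and text.endswith(CDATA_CLOSE):
        text = text[len(CDATA_OPEN):-len(CDATA_CLOSE)]
    return text.strip()


def load_json_scripts(html):
    """Return {script id: parsed JSON} for every application/json script tag."""
    soup = BeautifulSoup(html, "html.parser")
    parsed = {}
    for tag in soup.find_all("script", type="application/json"):
        # .string is the tag's single text child, or None for an empty tag.
        parsed[tag.get("id")] = json.loads(strip_cdata(tag.string or ""))
    return parsed


records = load_json_scripts(HTML)
print(records["TargetID"]["name"])
```

Checking for the wrapper with startswith/endswith, rather than a blanket replace, leaves any "]]>" that happens to occur inside the JSON text untouched.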

User

Answer 2 · edited 2024-10-02 02:41:06

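BeautifulSoup will accept almost anything and try to turn it into a complete HTML page. That is why you get '<html><body> ...' back. Strictly speaking, the wrapper tags come from the underlying parser: lxml and html5lib complete a fragment into a full document, while html.parser leaves it as it is. Usually this is a good thing, because real-world HTML can be badly formed and BeautifulSoup will still handle it.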
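
As a small illustration of where the wrapper tags come from, the sketch below parses the same bare text fragment with two parsers. It assumes lxml may or may not be installed, so the lxml call is guarded; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A bare text fragment, similar to JSON text that was already extracted from the page.
fragment = '{ "name":"Kate" }'

# html.parser keeps a fragment as a fragment: no wrapper tags are added.
plain = str(BeautifulSoup(fragment, "html.parser"))
print(plain)

# lxml is an optional third-party parser that completes the fragment into a document.
try:
    wrapped = str(BeautifulSoup(fragment, "lxml"))
except FeatureNotFound:
    wrapped = None  # lxml is not installed in this environment
print(wrapped)
```

If the goal is only to pull text out of a page, choosing html.parser, or not re-parsing already extracted text at all, should avoid the extra tags.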

In your case, one way to extract the JSON is shown below.

>>> import bs4
>>> page = bs4.BeautifulSoup(open('Page01.htm').read(), 'lxml')
>>> first_script = page.select('#TargetID')[0].text
>>> first_script 
'<![CDATA[\n{ "name":"Kate", "age":22, "city":"Boston"}\n]]>\n'
>>> content = first_script[first_script.find('{'): 1+first_script.rfind('}')]
>>> content
'{ "name":"Kate", "age":22, "city":"Boston"}'

Once you have this, you can turn it into a Python dictionary like this.

>>> import json
>>> json.loads(content)
{'name': 'Kate', 'age': 22, 'city': 'Boston'}
