Getting data from an HTML page into a Python array

Posted 2024-05-17 00:46:41


I am just trying to get some data from a web page like this:

[ . . . ]

<p class="special-large">Lorem Ipsum 01</p>
<p class="special-large">Lorem Ipsum 02</p>
<p class="special-large">Lorem Ipsum 03</p>
<p class="special-large">Lorem Ipsum 04</p>
<p class="special-large">Lorem Ipsum 05</p>

[ . . . ]

I would like to end up with a Python array like the following:

myArrayWebPage = ["Lorem Ipsum 01","Lorem Ipsum 02","Lorem Ipsum 03","Lorem Ipsum 04","Lorem Ipsum 05"]

Here is my Python script:

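import urllib.request

urlAddress = "http:// ... /" # my url address
with urllib.request.urlopen(urlAddress) as getPage:
    # read() returns bytes; decode to str (UTF-8 is an assumption about the page's encoding)
    outputPage = getPage.read().decode("utf-8")
print(outputPage)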

How do I get the array from "outputPage"?


1 answer
User
#1 · Posted 2024-05-17 00:46:41

This seems to be what you want:

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> html = '''<p class="special-large">Lorem Ipsum 01</p>
<p class="special-large">Lorem Ipsum 02</p>
<p class="special-large">Lorem Ipsum 03</p>
<p class="special-large">Lorem Ipsum 04</p>
<p class="special-large">Lorem Ipsum 05</p>'''
>>> import re
>>> re.findall('<p class="special-large">([^<]+)</p>', html)
['Lorem Ipsum 01', 'Lorem Ipsum 02', 'Lorem Ipsum 03', 'Lorem Ipsum 04', 'Lorem Ipsum 05']
>>> 

Note that regular expressions are generally not well suited to parsing HTML. You should use a library such as Beautiful Soup instead.
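
To illustrate the parser-based approach without installing anything, here is a minimal sketch that uses the standard library's html.parser module rather than Beautiful Soup, so it is a substitute for the recommended library and not its API. The names SpecialLargeParser and extract_special_large are hypothetical and chosen only for this example. It assumes the target <p> elements are not nested inside one another.

```python
from html.parser import HTMLParser


class SpecialLargeParser(HTMLParser):
    """Collect the text of every <p class="special-large"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished paragraph texts
        self._buffer = None  # text chunks while inside a target <p>, otherwise None

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated class names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "special-large" in classes:
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a target paragraph.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.items.append("".join(self._buffer).strip())
            self._buffer = None


def extract_special_large(html):
    """Return the texts of all matching paragraphs in an HTML string."""
    parser = SpecialLargeParser()
    parser.feed(html)
    parser.close()
    return parser.items


sample = '''<p class="special-large">Lorem Ipsum 01</p>
<p class="special-large">Lorem Ipsum 02</p>'''
my_array = extract_special_large(sample)
print(my_array)
```

The function expects a str, so the bytes returned by urlopen(...).read() would need to be decoded first. Unlike the regular expression, this version still matches when the <p> tag carries extra classes or attributes, or when the paragraph contains inline tags.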
