How do I extract information from a table column on a website?

Posted 2024-09-27 07:17:31


This is all I have managed so far. I am trying to get the proxy servers'

import urllib.request

page = urllib.request.urlopen("http://www.samair.ru/proxy/ip-address-01.htm")

page('\d+\.\d+\.\d+\.\d+')

1 answer
Anonymous user
#1 · Posted 2024-09-27 07:17:31

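In this case the table is not a real HTML table; it is plain text wrapped in <pre></pre>. You can verify this by viewing the page source. Either way, with BeautifulSoup it is a walk in the park: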

In [1]: from bs4 import BeautifulSoup

In [2]: from urllib.request import urlopen

In [3]: bs = BeautifulSoup(urlopen('http://www.samair.ru/proxy/ip-address-01.htm'), 'html.parser')

In [4]: print(bs.find('pre').text)

IP address               Anonymity level   Checked time        Country
056.249.66.50:8080       transparent       Apr-21, 10:33       Bulgaria
1.63.18.22:8080          transparent       Apr-21, 05:56       China
1.9.75.8:8080            transparent       Apr-21, 12:58       Malaysia
103.247.219.165:8080     transparent       Apr-21, 04:01       Indonesia
103.4.165.190:80         transparent       Apr-21, 11:34       Indonesia
103.9.126.110:8080       transparent       Apr-21, 12:19       Indonesia
109.173.98.64:8080       transparent       Apr-20, 22:39       Russian Federation
109.197.194.142:8080     transparent       Apr-21, 12:07       Russian Federation
109.207.61.141:8090      transparent       Apr-21, 11:14       Poland
109.207.61.145:8090      transparent       Apr-21, 13:04       Poland
109.207.61.149:8090      transparent       Apr-21, 10:21       Poland
109.207.61.165:8090      transparent       Apr-21, 03:57       Poland
109.207.61.170:8090      transparent       Apr-21, 11:02       Poland
109.207.61.208:8090      transparent       Apr-21, 10:45       Poland
109.224.55.46:80         transparent       Apr-20, 21:50       Iraq
109.227.124.105:8080     transparent       Apr-21, 09:57       Ukraine
109.69.6.118:8080        transparent       Apr-21, 11:44       Albania
110.138.248.135:8080     transparent       Apr-21, 09:10       Indonesia
110.139.13.121:8080      transparent       Apr-21, 11:31       Indonesia
110.159.179.108:80       transparent       Apr-20, 20:35       Malaysia

In [5]: [l.split()[0] for l in bs.find('pre').text.split('\n')[1:]][1:]
Out[5]: 
['056.249.66.50:8080',
 '1.63.18.22:8080',
 '1.9.75.8:8080',
 '103.247.219.165:8080',
 '103.4.165.190:80',
 '103.9.126.110:8080',
 '109.173.98.64:8080',
 '109.197.194.142:8080',
 '109.207.61.141:8090',
 '109.207.61.145:8090',
 '109.207.61.149:8090',
 '109.207.61.165:8090',
 '109.207.61.170:8090',
 '109.207.61.208:8090',
 '109.224.55.46:80',
 '109.227.124.105:8080',
 '109.69.6.118:8080',
 '110.138.248.135:8080',
 '110.139.13.121:8080',
 '110.159.179.108:80']
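
Since the question started from a regular expression, here is a minimal sketch of that route as well. It assumes the text has the same layout as the <pre> block printed above. SAMPLE, PROXY_RE, extract_with_regex and extract_first_column are names made up for this illustration, and the sample rows are copied from the output above so the snippet runs without a network connection.

```python
import re

# Hypothetical sample that mirrors the layout of the <pre> block shown above.
SAMPLE = """
IP address               Anonymity level   Checked time        Country
1.63.18.22:8080          transparent       Apr-21, 05:56       China
1.9.75.8:8080            transparent       Apr-21, 12:58       Malaysia
109.224.55.46:80         transparent       Apr-20, 21:50       Iraq
"""

# Four dot-separated groups of 1-3 digits, then a colon and the port number.
# This checks the shape only; it does not validate that each group is <= 255.
PROXY_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}:\d+\b")


def extract_with_regex(text):
    """Return every ip:port string found anywhere in the text."""
    return PROXY_RE.findall(text)


def extract_first_column(text):
    """Return the first column of each row, skipping blanks and the header."""
    rows = [line.split() for line in text.splitlines()]
    # Blank lines give an empty list; the header's first token is not an address.
    return [row[0] for row in rows if row and PROXY_RE.fullmatch(row[0])]


if __name__ == "__main__":
    print(extract_with_regex(SAMPLE))
    print(extract_first_column(SAMPLE))
```

On the live page you would pass bs.find('pre').text instead of SAMPLE. The column-based variant filters out blank lines and the header explicitly, so it does not depend on how many empty lines surround the table.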
