Trying to extract "text" from markup using Python

from lxml import html
import requests
import re

page = requests.get('https://www.proxynova.com/proxy-server-list/country-fr/')
tree = html.fromstring(page.content.decode('utf-8'))
for elem in tree.xpath('//table[@class="table"]//tbody//td[@align="left"]'):
    print(elem.text_content())

2 Answers

User

#1 · Edited 2024-10-01 22:28:32

I admit I would not have gotten this without 泰尔's answer, because I had missed how the IP addresses are encoded in the scripts.

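import re
import requests
from lxml import etree

page = requests.get('https://www.proxynova.com/proxy-server-list/country-fr/').text
parser = etree.HTMLParser()
tree = etree.fromstring(page, parser=parser)
# Each address is written by an inline <script> inside the proxy table.
table = tree.xpath('.//table[@id="tbl_proxy_list"]//script/text()')

for item in table:
    # Note: this pattern assumes the two throwaway leading characters are
    # literally "23"; the two captured groups are the halves of the address.
    m = re.match(r"document\.write\('23([0-9.]+)'[^']+'([0-9.]+)'", item)
    if m:
        print(''.join(m.groups()))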
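
The encoding mentioned above can be illustrated without touching the network. The following is only a sketch: the sample string is a hypothetical example of what one inline script might contain (inferred from the patterns used in the answers here), and `decode_ip` and `PATTERN` are illustrative names, not part of any library. Instead of hard-coding the junk prefix, it reads the offset passed to `substr()`.

```python
import re

# Hypothetical sample of the inline script text; the real page may differ.
SAMPLE = "document.write('23178.33'.substr(2) + '.12.4');"

# Capture the quoted first part, the substr() offset, and the quoted second part.
PATTERN = re.compile(
    r"document\.write\('([^']+)'\.substr\((\d+)\)\s*\+\s*'([^']+)'\)"
)


def decode_ip(script_text):
    """Return the decoded address, or None if the text does not match."""
    m = PATTERN.search(script_text)
    if m is None:
        return None
    first, offset, second = m.group(1), int(m.group(2)), m.group(3)
    # For a non-negative n, JavaScript's s.substr(n) matches Python's s[n:].
    return first[offset:] + second


print(decode_ip(SAMPLE))
```

Reading the offset from the script keeps the pattern working if the number of padding characters changes, at the cost of a slightly longer regular expression.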

User

#2 · Edited 2024-10-01 22:28:32

I suggest using BeautifulSoup, like this:

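import requests
import re
from bs4 import BeautifulSoup

res = requests.get('https://www.proxynova.com/proxy-server-list/country-fr/')
soup = BeautifulSoup(res.content, "lxml")

# Raw string so the backslash escapes reach the regex engine unchanged.
# Group 1 is the first quoted part (with two padding characters in front),
# group 2 is the second quoted part.
REGEX_JS = re.compile(r"^document\.write\('([^']+)'\.substr\(2\) \+ '([^']+)'\);$")

proxy_ip_list = []
for table in soup.find_all("table", id="tbl_proxy_list"):
    for script in table.find_all("script"):
        m = REGEX_JS.search(script.text)
        if m:
            # Mirror the JavaScript substr(2): drop the first two characters.
            proxy_ip_list.append(m.group(1)[2:] + m.group(2))

for ip in proxy_ip_list:
    print(ip)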
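
Because the addresses are rebuilt by string concatenation, a quick sanity check on the results can catch a pattern that has drifted out of sync with the page. This is a minimal sketch using the standard-library `ipaddress` module; `keep_valid_ipv4` is a hypothetical helper name and the sample values are made up for illustration.

```python
import ipaddress


def keep_valid_ipv4(candidates):
    """Return only the strings that parse as IPv4 addresses, in order."""
    valid = []
    for text in candidates:
        cleaned = text.strip()
        try:
            # Raises a ValueError subclass for anything that is not IPv4.
            ipaddress.IPv4Address(cleaned)
        except ValueError:
            continue
        valid.append(cleaned)
    return valid


# Hypothetical sample values, not taken from the real page.
print(keep_valid_ipv4(["178.33.12.4", "999.1.1.1", "not-an-ip", " 10.0.0.1 "]))
```

With either answer above, the extracted list could be passed through such a filter before printing.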
