How do I get the quoted strings from an HTML header?

<dl> <dt><a href="oq-phys.htm"> Physics and Astronomy</a> <dt><a href="oq-math.htm"> Mathematics</a> <dt><a href="oq-life.htm"> Life Sciences</a> <dt><a href="oq-tech.htm"> Technology</a> <dt><a href="oq-geo.htm"> Earth and Environmental Science</a> </dl>

3 Answers

User

Answer 1 · edited 2024-05-19 14:43:38

For the example above, suppose html_string holds the page that contains the snippet above.

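import requests
# lxml.html (rather than lxml.etree) tolerates real-world HTML
# and provides the text_content() method used below.
import lxml.html as LH

html_string = LH.fromstring(requests.get('http://openquestions.com').text)

# Print each link's href value together with its visible text.
for quoted_link in html_string.xpath('//a[@href]'):
    print(quoted_link.attrib['href'], quoted_link.text_content())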
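To try the same idea without a network request, here is a minimal sketch that parses the snippet from the question directly. It assumes lxml is installed; the names `snippet`, `root`, and `links` are illustrative and not part of the original answer.

```python
import lxml.html as LH

# The snippet from the question, pasted in as a string (no HTTP request needed).
snippet = '''<dl> <dt><a href="oq-phys.htm"> Physics and Astronomy</a> <dt><a href="oq-math.htm"> Mathematics</a> <dt><a href="oq-life.htm"> Life Sciences</a> <dt><a href="oq-tech.htm"> Technology</a> <dt><a href="oq-geo.htm"> Earth and Environmental Science</a> </dl>'''

root = LH.fromstring(snippet)

# Collect (href, link text) pairs; strip() removes the leading space in each label.
links = [(a.get('href'), a.text_content().strip()) for a in root.xpath('//a[@href]')]

for href, text in links:
    print(href, text)
```

The `[@href]` predicate skips any anchor that has no href attribute, so the loop does not fail on bare `<a name="...">` anchors.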

User

Answer 2 · edited 2024-05-19 14:43:38

There is more than one way to skin this cat. Here is a requests/lxml solution that does not use an (explicit) for loop:

import requests
from lxml.html import fromstring

req = requests.get('http://www.openquestions.com')
resp = fromstring(req.content)
hrefs = resp.xpath('//dt/a/@href') 
print(hrefs)
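The XPath returns the href values exactly as written in the page, which for this site are relative paths. If absolute URLs are needed, one option is the standard library's `urljoin`. This is a sketch, not part of the original answer: it assumes the base URL is the address requested above, and the `hrefs` list is a hard-coded sample standing in for the XPath result.

```python
from urllib.parse import urljoin

# Assumption: the page was fetched from this address.
base_url = 'http://www.openquestions.com/'

# Sample relative hrefs standing in for the XPath result.
hrefs = ['oq-phys.htm', 'oq-math.htm', 'oqc/oqc-home.htm']

# Resolve each relative path against the base URL.
absolute = [urljoin(base_url, h) for h in hrefs]
print(absolute)
```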

Edit

Why I wrote it this way:

I prefer XPath to CSS selectors.
It is fast.

Benchmark:

import requests,bs4
from lxml.html import fromstring
import timeit

req = requests.get('http://www.openquestions.com').content

def myfunc() :
    resp = fromstring(req)
    hrefs = resp.xpath('//dl/dt/a/@href')

print("Time for lxml: ", timeit.timeit(myfunc, number=100))

##############################################################

resp2 = requests.get('http://www.openquestions.com').content

def func2() :
    soup = bs4.BeautifulSoup(resp2, 'html.parser')
    hrefs = [a['href'] for a in soup.select('dl dt a')]

print("Time for beautiful soup:", timeit.timeit(func2, number=100))

Output:

('Time for lxml: ', 0.09621267095780464)
('Time for beautiful soup:', 0.8594218329542824)

User
Answer 3 · edited 2024-05-19 14:43:38

to find the quoted strings after href=

A short requests + BeautifulSoup solution:

import requests, bs4

soup = bs4.BeautifulSoup(requests.get('http://www.openquestions.com').content, 'html.parser')
hrefs = [a['href'] for a in soup.select('dl dt a')]
print(hrefs)

Output:

['oq-phys.htm', 'oq-math.htm', 'oq-life.htm', 'oq-tech.htm', 'oq-geo.htm', 'oq-map.htm', 'oq-about.htm', 'oq-howto.htm', 'oqc/oqc-home.htm', 'oq-indx.htm', 'oq-news.htm', 'oq-best.htm', 'oq-gloss.htm', 'oq-quote.htm', 'oq-new.htm']
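If installing third-party packages is not an option, the same extraction can be approximated with the standard library's `html.parser`. The sketch below runs on the snippet from the question rather than the live page; the class name `HrefCollector` is hypothetical, and unlike the `dl dt a` selector it collects the href of every `<a>` tag, wherever it appears.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


# The snippet from the question, pasted in as a string.
snippet = '''<dl> <dt><a href="oq-phys.htm"> Physics and Astronomy</a> <dt><a href="oq-math.htm"> Mathematics</a> <dt><a href="oq-life.htm"> Life Sciences</a> <dt><a href="oq-tech.htm"> Technology</a> <dt><a href="oq-geo.htm"> Earth and Environmental Science</a> </dl>'''

collector = HrefCollector()
collector.feed(snippet)
collector.close()

hrefs = collector.hrefs
print(hrefs)
```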
