How do I find all list items between two tags with BeautifulSoup?

Posted 2024-10-01 09:23:47


For example, I only want to pull Child1, Child2, and Child3 out of the list below. That list sits after the first h3 and before the next h3 tag.

<h3>HeaderName1</h3>
<ul class="prodoplist">
  <li>Parent</li>
  <li class="lev1">Child1</li>
  <li class="lev1">Child2</li>
  <li class="lev1">Child3</li>
</ul>
<h3>HeaderName2</h3>
<ul class="prodoplist">
  <li>Parent2</li>
  <li class="lev1">Child4</li>
  <li class="lev1">Child5</li>
  <li class="lev1">Child6</li>
</ul>

3 answers

This should work.

import re
from bs4 import BeautifulSoup  # BeautifulSoup 4; the original answer imported the older BeautifulSoup 3 module

html_doc = '<h3>HeaderName1</h3><ul class="prodoplist"><li>Parent</li><li class="lev1">Child1</li><li class="lev1">Child2</li><li class="lev1">Child3</li></ul>  <h3>HeaderName2</h3><ul class="prodoplist"><li>Parent2</li><li class="lev1">Child4</li><li class="lev1">Child5</li><li class="lev1">Child6</li></ul>'

# Take everything from the first <h3> up to (but not including) the next <h3>.
m = re.search(r'<h3>.*?<h3>', html_doc, re.DOTALL)
s = m.start()
e = m.end() - len('<h3>')
target_html = html_doc[s:e]

# Parse only that slice and print the lev1 items it contains.
new_bs = BeautifulSoup(target_html, 'html.parser')
for ul_ele in new_bs.find_all('ul', attrs={'class': 'prodoplist'}):
    # Search inside the current <ul>, not the whole slice, so items are not repeated.
    for li_ele in ul_ele.find_all('li', attrs={'class': 'lev1'}):
        print(li_ele.text)
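
The same result can probably be reached without a regular expression, by walking the siblings that follow the first h3 and stopping at the next h3. The following is only a sketch under assumptions: it uses the bs4 package with the built-in "html.parser", it assumes the h3 and ul elements are siblings as in the question, and the function name items_after_first_heading is made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<h3>HeaderName1</h3>
<ul class="prodoplist">
  <li>Parent</li>
  <li class="lev1">Child1</li>
  <li class="lev1">Child2</li>
  <li class="lev1">Child3</li>
</ul>
<h3>HeaderName2</h3>
<ul class="prodoplist">
  <li>Parent2</li>
  <li class="lev1">Child4</li>
  <li class="lev1">Child5</li>
  <li class="lev1">Child6</li>
</ul>
"""


def items_after_first_heading(markup):
    """Return the text of li.lev1 items between the first <h3> and the next <h3>.

    Hypothetical helper for illustration; assumes <h3> and <ul> are siblings.
    """
    soup = BeautifulSoup(markup, "html.parser")
    first_h3 = soup.find("h3")
    items = []
    # Only <h3> and <ul> siblings are visited; text nodes are skipped.
    for sibling in first_h3.find_next_siblings(["h3", "ul"]):
        if sibling.name == "h3":
            break  # reached the next heading, so stop collecting
        for li in sibling.find_all("li", class_="lev1"):
            items.append(li.get_text(strip=True))
    return items


print(items_after_first_heading(html_doc))
```

With the markup from the question this should print only Child1, Child2, and Child3. If the real page nests the headings and lists differently, the sibling walk would need adjusting.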

Use findChildren, for example:

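# soup is assumed to be a BeautifulSoup object built from the page's HTML
for ul in soup.find_all('ul'):
    print('ul start')
    # Print only the first three <li> elements of each <ul>.
    for idx, li in enumerate(ul.findChildren('li')):
        if idx in range(3):
            print(li)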


However, in most cases lxml and XPath are a better solution:

from lxml import html

doc = html.parse('input.html')
# For every <ul>, select its first three <li> children.
print([ul.xpath('li[1] | li[2] | li[3]') for ul in doc.xpath('//ul')])
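
To target only the items under the first h3, as the question asks, an XPath expression with the following-sibling axis might be enough. This is a minimal sketch, assuming lxml is installed and that the list to read is the first ul after the first h3. The variable names are illustrative and the markup is the question's, wrapped in html and body tags.

```python
from lxml import html

page = """<html><body>
<h3>HeaderName1</h3>
<ul class="prodoplist">
  <li>Parent</li>
  <li class="lev1">Child1</li>
  <li class="lev1">Child2</li>
  <li class="lev1">Child3</li>
</ul>
<h3>HeaderName2</h3>
<ul class="prodoplist">
  <li>Parent2</li>
  <li class="lev1">Child4</li>
  <li class="lev1">Child5</li>
  <li class="lev1">Child6</li>
</ul>
</body></html>"""

doc = html.fromstring(page)

# (//h3)[1]                  -> the first <h3> in the document
# following-sibling::ul[1]   -> the nearest <ul> sibling after that heading
# li[@class="lev1"]/text()   -> the text of its lev1 items
children = doc.xpath('(//h3)[1]/following-sibling::ul[1]/li[@class="lev1"]/text()')
print(children)
```

The expression reads only the one list that directly follows the first heading, so it would miss items if several ul elements sat between the two h3 tags.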
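
import requests
from bs4 import BeautifulSoup  # BeautifulSoup 4; the original answer imported the older BeautifulSoup 3 module

children = []

url = "http://someurl.html"
r = requests.get(url)
bs = BeautifulSoup(r.text, 'html.parser')
# Collect the text of every li.lev1 inside each ul.prodoplist on the page.
for uls in bs.find_all('ul', 'prodoplist'):
    lis = uls.find_all('li', 'lev1')
    for li in lis:
        children.append(li.text)

print(children)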
