How do I get parent and nested values with BeautifulSoup?

Posted 2024-09-27 00:16:57


I am using BeautifulSoup to extract categories and subcategories from an HTML page. The HTML looks like this:

<a class='menuitem submenuheader' href='#'>Beverages</a><div class='submenu'><ul><li><a href='productlist.aspx?parentid=053&catid=055'>Juice</a></li></ul></div><a class='menuitem submenuheader' href='#'>DIY</a><div class='submenu'><ul><li><a href='productlist.aspx?parentid=007&catid=052'>Miscellaneous</a></li><li><a href='productlist.aspx?parentid=007&catid=047'>Sockets</a></li><li><a href='productlist.aspx?parentid=007&catid=046'>Spanners</a></li><li><a href='productlist.aspx?parentid=007&catid=045'>Tool Boxes</a></li></ul></div><a class='menuitem submenuheader' href='#'>Electronics</a><div class='submenu'><ul><li><a href='productlist.aspx?parentid=003&catid=019'>Audio/Video</a></li><li><a href='productlist.aspx?parentid=003&catid=027'>Cameras</a></li><li><a href='productlist.aspx?parentid=003&catid=023'>Cookers</a></li><li><a href='productlist.aspx?parentid=003&catid=024'>Freezers</a></li><li><a href='productlist.aspx?parentid=003&catid=025'>Kitchen Appliances</a></li><li><a href='productlist.aspx?parentid=003&catid=048'>Measuring Instruments</a></li><li><a href='productlist.aspx?parentid=003&catid=020'>Microwaves</a></li><li><a href='productlist.aspx?parentid=003&catid=050'>Miscellaneous</a></li><li><a href='productlist.aspx?parentid=003&catid=026'>Personal Care</a></li><li><a href='productlist.aspx?parentid=003&catid=021'>Refrigerators</a></li><li><a href='productlist.aspx?parentid=003&catid=018'>TV</a></li><li><a href='productlist.aspx?parentid=003&catid=022'>Washers/Dryers/Vacuum Cleaners</a></li></ul></div>

Here Beverages is a category and Juice is a subcategory.

I extract the categories with the following code:

^{pr2}$

How can I get the subcategories in this format?

[Beverages = Category]
 [Juice = Sub]
[DIY = Category]
 [Miscellaneous = Sub]
 [Spanners = Sub]
 [Sockets = Sub]
[Electronics]
 [Audio = Sub]
 [Cameras]

3 Answers

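Looking at the web page in question, it appears that all the news stories are in h3 tags with the class item-heading. You can use BeautifulSoup to select all the story headlines and then step up one level to reach the a tag they are wrapped in and read its href: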

In [54]: [i.parent.attrs["href"] for i in soup.select('a > h3.item-heading')]
Out[54]:
['/news/us-news/civil-rights-groups-fight-trump-s-refugee-ban-uncertainty-continues-n713811',
 '/news/us-news/protests-erupt-nationwide-second-day-over-trump-s-travel-ban-n713771',
 '/politics/politics-news/some-republicans-criticize-trump-s-immigration-order-n713826',
...  # trimmed for readability
]

I used a list comprehension, but you can break it down into the following steps:

^{pr2}$

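Once you have the list of links, you can iterate over it and check whether the first character is /, so that you match only local links and not external ones.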
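The check on the first character can be sketched as below. The `links` list is hypothetical sample data, not taken from the page, and it is assumed to hold the href values as plain strings.

```python
# Hypothetical sample of href values (strings); not taken from the real page.
links = [
    "/news/us-news/example-story",
    "https://example.com/external-story",
    "/politics/example-story",
    "",
]

# Keep only site-relative links, i.e. those whose first character is "/".
# str.startswith() is safe on empty strings, unlike indexing with link[0].
local_links = [link for link in links if link.startswith("/")]

print(local_links)
```

Note that protocol-relative URLs beginning with // would also pass this check, so an extra test may be needed if the page contains any.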

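Given that your HTML always has those submenu divs, a better approach may be to return one list for the categories and another for the subcategories, arranged so that cats[i] corresponds to subcats[i], or to return a dictionary if you prefer.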

In the Python shell:

>>> from BeautifulSoup import BeautifulSoup
>>> html = '''<a class="menuitem submenuheader" href="#">Beverages</a>
... <div class="submenu">
...  <ul>
...   <li><a href="productlist.aspx?parentid=053&amp;catid=055">Juice</a></li>
...   <li><a href="productlist.aspx?parentid=053&amp;catid=055">Milk</a></li>
...  </ul>
... </div>
... <a class="menuitem submenuheader" href="#">DIY</a>
... <div class="submenu">
...  <ul>
...   <li><a href="productlist.aspx?parentid=053&amp;catid=055">Micellaneous</a></li>
...   <li><a href="productlist.aspx?parentid=053&amp;catid=055">Spanners</a></li>
...   <li><a href="productlist.aspx?parentid=053&amp;catid=055">Sockets</a></li>
...  </ul>
... </div>'''
>>> soup = BeautifulSoup(html)
>>> categories = soup.findAll("a", {"class": 'menuitem submenuheader'})
>>> cats = [cat.text for cat in categories]
>>> sub_menus = soup.findAll("div", {"class": "submenu"})
>>> subcats = []
>>> for menu in sub_menus:
...     subcat = [item.text for item in menu.findAll('li')]
...     subcats.append(subcat)
... 
>>> print cats
[u'Beverages', u'DIY']
>>> print subcats
[[u'Juice', u'Milk'], [u'Micellaneous', u'Spanners', u'Sockets']]
>>> cat_dict = dict(zip(cats,subcats))
>>> print cat_dict
{u'Beverages': [u'Juice', u'Milk'], u'DIY': [u'Micellaneous', u'Spanners', u'Sockets']}
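
To get from these paired lists to the nested listing asked for in the question, one possible sketch in Python 3 follows. The values of `cats` and `subcats` are copied from the session output above, and `format_menu` is a hypothetical helper name introduced only for illustration.

```python
# Parallel lists as built in the session above (values copied from its output).
cats = ["Beverages", "DIY"]
subcats = [["Juice", "Milk"], ["Micellaneous", "Spanners", "Sockets"]]


def format_menu(cats, subcats):
    """Return the nested listing as a list of text lines (hypothetical helper)."""
    lines = []
    # zip() pairs cats[i] with subcats[i], relying on the lists being aligned.
    for cat, subs in zip(cats, subcats):
        lines.append("[{} = Category]".format(cat))
        for sub in subs:
            lines.append(" [{} = Sub]".format(sub))
    return lines


print("\n".join(format_menu(cats, subcats)))
```

This only works while both lists stay aligned, which is why the answer assumes every category is followed by its own submenu div.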

From each category's HTML, you have to find the next element and then find the li elements within it:

print cat.findNext().findAll('li')
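
A minimal sketch of the same idea for Python 3 with the bs4 package, assuming it is installed. The HTML is a shortened copy of the markup in the question, and `find_next_sibling` is used here in place of `findNext` so that only a following div with class submenu is accepted; that substitution is a choice made for this sketch, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the menu markup from the question.
html = """
<a class="menuitem submenuheader" href="#">Beverages</a>
<div class="submenu"><ul>
<li><a href="productlist.aspx?parentid=053&amp;catid=055">Juice</a></li>
</ul></div>
<a class="menuitem submenuheader" href="#">DIY</a>
<div class="submenu"><ul>
<li><a href="productlist.aspx?parentid=007&amp;catid=052">Miscellaneous</a></li>
<li><a href="productlist.aspx?parentid=007&amp;catid=047">Sockets</a></li>
</ul></div>
"""

soup = BeautifulSoup(html, "html.parser")

lines = []
# The compound class selector matches anchors that carry both classes.
for cat in soup.select("a.menuitem.submenuheader"):
    lines.append("[{} = Category]".format(cat.get_text(strip=True)))
    # The submenu is the next <div class="submenu"> sibling of the header link.
    submenu = cat.find_next_sibling("div", class_="submenu")
    if submenu is None:
        continue
    for item in submenu.find_all("li"):
        lines.append(" [{} = Sub]".format(item.get_text(strip=True)))

print("\n".join(lines))
```

If a category could appear without its own submenu, the sibling search would pick up a later category's div, so this sketch relies on every header being followed directly by its submenu.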
