Parsing multiple layers of divs with BeautifulSoup

Posted 2024-09-29 20:16:00


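I have a web page saved as .htm. Basically, there are 6 layers of nested divs that I need to parse to pull out specific data, and I'm confused about how to approach this. I have tried several different methods, but none of them worked.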

The HTM file contains a lot of markup, but one div looks like this:

<div id="fbbuzzresult" class.....>
   <div class="postbuzz"> .... </div>
      <div class="linkbuzz">...</div>
      <div class="descriptionbuzz">...</div>
      <div class="metabuzz">
         <div class="time">...</div>
      </div>
   <div class="postbuzz"> .... </div>
   <div class="postbuzz"> .... </div>
   <div class="postbuzz"> .... </div>
</div>

I am trying to do this with BeautifulSoup. Some more background:

  1. There is only one fbbuzzresult in the whole file.
  2. There are multiple postbuzz divs inside fbbuzzresult.
  3. Each postbuzz contains the divs shown above.

I need to extract and print the contents of each of the divs shown above, for every postbuzz div.

Any help and guidance with some skeleton code would be much appreciated! P.S. Ignore the dashes in the div classes. Thanks!


2 Answers

You should be able to search the element you get back in the same way you search the parent soup:

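# BeautifulSoup 4 import; `from BeautifulSoup import BeautifulSoup` is the old version 3 form
from bs4 import BeautifulSoup as bs

soup = bs(html, "html.parser")  # html: the page contents as a string
div = soup.find("div", {"id": "fbbuzzresult"})  # the single outer container
post_buzz = div.find_all("div", {"class": "postbuzz"})  # every postbuzz inside it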

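I have run into errors doing it that way before, though, so as a second approach you can build a kind of sub_soup from the matched div and search that instead.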
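A minimal sketch of the sub_soup idea, assuming it means re-parsing the matched div's own markup as a separate soup. The sample `html` string is invented for illustration and is not the asker's page:

```python
from bs4 import BeautifulSoup

# Invented sample markup, only for illustration.
html = """
<div id="fbbuzzresult">
  <div class="postbuzz">first</div>
  <div class="postbuzz">second</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "fbbuzzresult"})

# Assumption: "sub_soup" means parsing the div's own markup again,
# so the search starts from a fresh, smaller document.
sub_soup = BeautifulSoup(str(div), "html.parser")
post_buzz = sub_soup.find_all("div", {"class": "postbuzz"})

print(len(post_buzz))
```

Re-parsing costs a little extra work, so it is probably only worth trying if searching the matched div directly misbehaves.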

First, read the BeautifulSoup documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Second, here is a small example to get you going:

from bs4 import BeautifulSoup as bs

soup = bs(your_html_content, "html.parser")  # name the parser explicitly

# for fbbuzzresult
buzz = soup.findAll("div", {"id" : "fbbuzzresult"})[0]

# to get postbuzz
pbuzz = buzz.findAll("div", {"class" : "postbuzz"})

"""pbuzz is now an array with the postbuzz divs
   so now you can iterate through them, get
   the contents, keep traversing the DOM with BS 
   or do whatever you are trying to do

   So say you want the text from an element, you
   would just do: the_element.contents[0]. However
   if I'm remembering correctly you have to traverse 
   down through all of its children to get the text.
"""

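To make the traversal concrete, here is a hedged sketch that loops over each postbuzz div and prints the inner divs named in the question. It assumes linkbuzz, descriptionbuzz, and metabuzz are nested inside each postbuzz (the question's snippet is ambiguous on this point). The sample markup and the helper name `text_of` are made up for illustration. `get_text()` collects the text of an element and all of its descendants, so there is no need to walk the children by hand.

```python
from bs4 import BeautifulSoup

# Invented sample markup; assumes the inner divs sit inside each postbuzz.
html = """
<div id="fbbuzzresult">
  <div class="postbuzz">
    <div class="linkbuzz"><a href="#">Link one</a></div>
    <div class="descriptionbuzz">First description</div>
    <div class="metabuzz"><div class="time">10:00</div></div>
  </div>
  <div class="postbuzz">
    <div class="linkbuzz"><a href="#">Link two</a></div>
    <div class="descriptionbuzz">Second description</div>
    <div class="metabuzz"><div class="time">11:30</div></div>
  </div>
</div>
"""


def text_of(parent, css_class):
    """Hypothetical helper: text of the first matching div, or None if absent."""
    node = parent.find("div", class_=css_class)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(html, "html.parser")
buzz = soup.find("div", id="fbbuzzresult")

# One dict per postbuzz, holding the text of each inner div.
rows = []
for post in buzz.find_all("div", class_="postbuzz"):
    rows.append({
        "link": text_of(post, "linkbuzz"),
        "description": text_of(post, "descriptionbuzz"),
        "time": text_of(post, "time"),
    })

for row in rows:
    print(row)
```

Returning None for a missing div keeps one incomplete post from stopping the whole loop.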