How to get an element's name after parsing it with Beautiful Soup 4

<HTML>
<HEAD>
<TITLE>Eine einfache HTML-Datei</TITLE>
<meta name="description" content="A simple HTML page for BS4">
<meta name="author" content="Uwe Ziegenhagen">
<meta charset="UTF-8">
</HEAD>
<BODY>
<H1>Hallo Welt</H1>
<p>Ein kurzer Absatz mit ein wenig Text, der relativ nichtssagend ist.</p>
<H1>Nochmal Hallo Welt!</H1>
<p>Schon wieder ein kurzer Absatz mit ein wenig Text, der genauso nichtssagend ist wie der Absatz zuvor.</p>
</BODY>
</HTML>

2 Answers

User

#1 · Edited 2024-10-03 11:21:06

Try the following code:

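from bs4 import BeautifulSoup

with open("simple.html", "r", encoding="utf-8") as htmlsource:
    html = htmlsource.read()

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

# Direct children of <body>, including the whitespace text between tags
for item in soup.body:
    print(item)

# Select every element in the HTML page
elems = soup.find_all(True)
for item in elems:
    # Check whether the element carries a specified class; get() returns the
    # default instead of raising KeyError when there is no class attribute
    if "myClass" in item.get("class", []):
        print(item)
    # Check whether the element's tag name equals a specified tag name
    elif item.name == "p":
        print(item)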
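
To answer the question itself, reading the name of each parsed element, here is a minimal sketch. It assumes a shortened inline copy of the page rather than simple.html, so that it runs on its own, and tag_names is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened inline copy of the page from the question (assumption: used
# instead of reading simple.html so the example runs on its own).
HTML = """<html><head><title>Eine einfache HTML-Datei</title></head>
<body><h1>Hallo Welt</h1><p>Ein kurzer Absatz.</p></body></html>"""


def tag_names(soup):
    """Return the name of every tag in document order (hypothetical helper)."""
    # find_all(True) matches every tag and skips strings and comments.
    return [tag.name for tag in soup.find_all(True)]


soup = BeautifulSoup(HTML, "html.parser")
names = tag_names(soup)
print(names)

# Direct children of <body> can include text nodes, so filter on Tag first.
body_children = [
    child.name for child in soup.body.children if isinstance(child, Tag)
]
print(body_children)
```

Each Tag exposes its name as a plain string, so it can be compared, collected, or printed like any other string.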

User

#2 · Edited 2024-10-03 11:21:06

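Beautiful Soup Tag objects have a name attribute that you can check. For example, here is a function that transforms the tree by appending the string "Done with this " plus the appropriate tag name to each node in a post-order walk:

from bs4.element import Tag

def walk(node):
    # Only Tag objects (including the BeautifulSoup root) have children to visit
    if isinstance(node, Tag):
        # Copy the child list, since the tree is modified during the walk
        for child in list(node.children):
            walk(child)
        node.append("Done with this " + node.name)

Note: NavigableString objects, which represent text content, and Comment objects, which represent comments, are not tags and have no attrs or children, so if you walk the whole tree as above you need to check that you really have a tag in hand. The function above does this with isinstance against bs4.element.Tag; a hasattr(node, "name") check is less reliable, because a text node can still expose a name of None.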
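
The distinction between tags, text, and comments can be sketched as follows. The fragment and the describe helper are made up for illustration and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Tiny made-up fragment (assumption) containing a tag, text, and a comment.
FRAGMENT = "<p>Hallo <b>Welt</b><!-- Notiz --></p>"


def describe(node):
    """Label a parse-tree node by kind (hypothetical helper)."""
    # Comment is a subclass of NavigableString, so test for it first.
    if isinstance(node, Comment):
        return "comment"
    if isinstance(node, NavigableString):
        return "text"
    if isinstance(node, Tag):
        return "tag:" + node.name
    return "other"


soup = BeautifulSoup(FRAGMENT, "html.parser")
kinds = [describe(node) for node in soup.descendants]
print(kinds)
```

Only the nodes labelled as tags have a usable name, which is why the type check comes before any access to node.name.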
