Getting data from tags (BeautifulSoup)

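import urllib2
from bs4 import BeautifulSoup

# Download the page and parse it (the URL is left blank in the question).
pge = urllib2.urlopen("").read()
src = BeautifulSoup(pge)
body = src.findAll('body')

# findChildren() returns only tags, never bare text nodes.
el = body[0].findChildren()
for s in el:
    cname = s.get('class')
    # Skip tags without a class attribute, where get() returns None.
    if cname and cname[0] == "work":
        print s.text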

3 answers

Anonymous user

#1 · edited 2024-10-05 12:22:59

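I formatted your HTML with line breaks to help show why the 4 is not printed where you expect it.

You are iterating over the children of body and printing the text of every child that has the class "work". The number 4 does not meet that criterion: it is bare text, not a child element with the class "work".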
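The difference between tags and bare text can be checked on a small sample. The sketch below is written for Python 3 and uses the built-in "html.parser" backend; the markup is a hypothetical stand-in, since the page from the question is not shown here. `find_all(True)` is the current spelling of the `findChildren()` call used in the question.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup standing in for the question's page.
html = ('<body><span class="work">1</span><span class="work">2</span>'
        '<span class="other">3</span>4<span class="work">5</span></body>')

soup = BeautifulSoup(html, "html.parser")
body = soup.body

# find_all(True) yields Tag objects only, so the bare "4" never shows up.
tag_texts = [tag.text for tag in body.find_all(True)]

# Filtering those tags by class therefore cannot recover the "4" either.
work_texts = [tag.text for tag in body.find_all(True)
              if "work" in (tag.get("class") or [])]

# .children also yields the text nodes that sit directly inside <body>.
direct_text = [str(node) for node in body.children
               if isinstance(node, NavigableString)]

print(tag_texts)
print(work_texts)
print(direct_text)
```

On this sample the bare "4" appears only in `direct_text`, which is why a loop over tags alone skips it.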

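I don't think BeautifulSoup will decode this particular HTML the way you expect.

One solution is to parse the HTML yourself, since this is not a typical case. One approach might be to use a regex to find instances of something like the following: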

</span>(not_blank)<span class="{classregex}">(remember)</span>

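Build a dictionary of {remember: not_blank}. As you loop over body.children, check each element's text against the keys of this dictionary. If the text is a key, print the corresponding value first, then print s.text.

Depending on what the actual HTML looks like, this might work...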
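The regex idea described above could be sketched roughly as follows (Python 3, standard library only). The sample markup and the names `stray`, `preceding`, and `spans` are assumptions made for illustration. To keep the sketch dependency-free, the loop over parsed children is imitated with a second regex; with BeautifulSoup you would loop over the parsed tags instead.

```python
import re

# Hypothetical markup; the question's real HTML is not shown here.
html = ('<body><span class="work">1</span><span class="work">2</span>'
        '<span class="other">3</span>4<span class="work">5</span></body>')

# Non-blank bare text between a closing </span> and the next classed <span>,
# together with the text of that following span.
stray = re.compile(
    r'</span>([^<]*[^<\s][^<]*)<span class="[^"]*">([^<]*)</span>')

# Map {text of the following span: bare text that precedes it}.
preceding = {m.group(2): m.group(1).strip() for m in stray.finditer(html)}

# Stand-in for looping over the parsed children: one (class, text) per span.
spans = re.findall(r'<span class="([^"]*)">([^<]*)</span>', html)

output = []
for cls, text in spans:
    if text in preceding:
        output.append(preceding[text])  # emit the stray text first
    if "work" in cls.split():
        output.append(text)

print(output)
```

Because the dictionary is keyed by span text, two spans with identical text would collide, so this only suits pages where that text is unique.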

Anonymous user

#2 · edited 2024-10-05 12:22:59

You could do this:

arr = []
# Get all text elements
for i in body[0].find_all(text=True):
  # append to array if it's 'work' element or has no class
  if not i.parent.has_attr("class") or "work" in i.parent["class"]:
    arr.append(i)

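Of course, this approach only works if the following two rules always hold:

Valid text elements are inside a tag with the class "work"
Valid text elements are inside a tag that has no class attribute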
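To see those two rules in action, here is a self-contained Python 3 run of the same filter against a hypothetical sample (the real page is not shown in the thread). It assumes a bs4 release that accepts `string=True`, the newer name for the `text=True` argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the question's page.
html = ('<body><span class="work">1</span><span class="work">2</span>'
        '<span class="other">3</span>4<span class="work">5</span></body>')

soup = BeautifulSoup(html, "html.parser")

kept = []
# Walk every text node in document order.
for node in soup.body.find_all(string=True):
    parent = node.parent
    # Keep text whose parent has no class, or whose class list has "work".
    if not parent.has_attr("class") or "work" in parent["class"]:
        kept.append(str(node))

print(kept)
```

Note that whitespace between tags also counts as a text node whose parent has no class, so on pretty-printed HTML you would probably want to skip nodes where `node.strip()` is empty.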

Anonymous user

#3 · edited 2024-10-05 12:22:59

Simply:

print soup.find('body').text
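For comparison, a Python 3 sketch of this one-liner on a hypothetical sample. It shows that `.text` concatenates every text node under body, including text inside tags whose class is not "work", so it fits only when no filtering by class is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the question's page.
html = ('<body><span class="work">1</span><span class="work">2</span>'
        '<span class="other">3</span>4<span class="work">5</span></body>')

soup = BeautifulSoup(html, "html.parser")

# .text joins all descendant text nodes, whatever their parent's class.
body_text = soup.find("body").text
print(body_text)
```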
