How can I print all the text from an HTML file parsed with lxml?

<html>
<head></head>
<body>
<dfn>Definition</dfn>sometext / '' (othertext)someothertext / ''
(...)
(...)
<dfn>Definition2</dfn>sometext / '' (othertext)someothertext / ''
blabla bubu
</body>
</html>

tree = etree.parse(filename)
places = []
for dfn in tree.getiterator('dfn'):
    def_text = dfn.text
    def_tail = dfn.tail
    for sibling in dfn.itersiblings():
        sib_text = sibling.text
        sib_tail = sibling.tail
        if def_text not in places:
            places.append(def_text)
        if def_tail is None or sib_text is None or sib_tail is None:
            continue
        places.append(def_tail)
        places.append(sib_text)
        places.append(sib_tail)
return places

2 answers

User

#1 · edited 2024-09-30 02:25:20

Thanks for the hint. I now have this, which gives me what I need:

for p in tree.xpath("//p"):
    # xpath() always returns a list (possibly empty), never None
    dfn = p.xpath('./dfn/text()')
    after_dfn = p.xpath("./dfn/following::text()")
    if dfn:
        print(dfn)
    for x in after_dfn:
        print(x)

The only problem is that it ends up in an infinite loop. How can I get rid of that? Any ideas?
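One possible explanation, offered as a guess rather than a diagnosis: the output may not be a true infinite loop. The XPath `following::` axis selects every node after `<dfn>` in the rest of the document, not only inside the current `<p>`, so each iteration prints most of the remaining document again. The sketch below compares `following::` with `following-sibling::`. The `SAMPLE` markup, including its `<p>` and `<i>` wrappers, is a hypothetical stand-in for the real file.

```python
from lxml import etree

# Hypothetical sample markup; the <p> and <i> wrappers are assumptions,
# not taken from the original file.
SAMPLE = (
    "<html><body>"
    "<p><dfn>Definition</dfn>sometext <i>othertext</i> tail1</p>"
    "<p><dfn>Definition2</dfn>blabla <i>bubu</i> tail2</p>"
    "</body></html>"
)

root = etree.fromstring(SAMPLE)
first_p = root.xpath("//p")[0]

# following:: continues past the end of the current <p>,
# so it also picks up the text of the second paragraph.
everything_after = first_p.xpath("./dfn/following::text()")

# following-sibling:: stays among the siblings of <dfn>,
# i.e. the text nodes directly inside the same <p>.
siblings_only = first_p.xpath("./dfn/following-sibling::text()")

print(len(everything_after), everything_after)
print(len(siblings_only), siblings_only)
```

With many `<p>` elements, the first variant prints a number of lines that grows roughly with the square of the document size, which can look like a loop that never ends.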

User

#2 · edited 2024-09-30 02:25:20

I would try the following:

for p in tree.xpath("//p"):  # iterate over every <p> element
    dfns = p.xpath('./dfn')
    if not dfns:  # skip paragraphs that have no <dfn>
        continue
    dfn = dfns[0]
    # following-sibling:: only selects nodes after <dfn> inside this <p>
    after_dfn = p.xpath("./dfn/following-sibling::node()")
    for x in after_dfn:
        pass  # do whatever you need to do with the stuff after dfn
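For reference, here is a minimal runnable sketch of the same idea. The helper name `texts_after_dfn` and the `SAMPLE` markup are hypothetical and only serve the illustration. With lxml, `following-sibling::node()` returns text nodes as string objects and child elements as element objects, so the loop handles the two cases separately.

```python
from lxml import etree

# Hypothetical sample markup; the real file may be structured differently.
SAMPLE = (
    "<html><body>"
    "<p><dfn>Definition</dfn>sometext <i>othertext</i> tail1</p>"
    "<p><dfn>Definition2</dfn>blabla <i>bubu</i> tail2</p>"
    "<p>no definition here</p>"
    "</body></html>"
)


def texts_after_dfn(root):
    """Return (definition, [text pieces after it]) for each <p> with a <dfn>."""
    results = []
    for p in root.xpath("//p"):
        dfns = p.xpath("./dfn")
        if not dfns:
            continue  # this paragraph has no <dfn>
        dfn = dfns[0]
        pieces = []
        for node in dfn.xpath("following-sibling::node()"):
            if isinstance(node, str):
                # a text node directly inside the <p>
                pieces.append(node)
            else:
                # an element: collect the text inside it (its tail arrives
                # separately as the next text node)
                pieces.append("".join(node.itertext()))
        results.append((dfn.text, pieces))
    return results


root = etree.fromstring(SAMPLE)
for definition, pieces in texts_after_dfn(root):
    print(definition, pieces)
```

Returning a list instead of printing inside the loop keeps the extraction separate from the output, which makes it easier to check on a small sample before running it on the full file.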
