Unable to parse the phone number and address

html_content=''' <div style=""> <strong>Pamela Banchy, Chief Information Officer</strong> <br>Western Reserve Hospital<br> <br>Lyndhurst, OH <br> <a href="mailto:pbanchy@westernreservehospital.org">pbanchy@westernreservehospital.org</a> <br>(330) 971-7456<br> </div> '''

from lxml.html import fromstring

tree = fromstring(html_content)
phone = ' '.join([elem.text_content().strip().split()[-2] for elem in tree.cssselect("div")])
phone1 = ' '.join([elem.text_content().strip().split()[-1] for elem in tree.cssselect("div")])
print(phone+phone1)

3 Answers

User

#1 · Edited 2024-09-26 22:51:19

Another approach is:

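# Collect every text node inside the div, skipping whitespace-only ones
text_nodes = [node for node in tree.cssselect('div')[0].itertext() if node.split()]
# Index 2 is the city/state line, index 4 is the phone number
address, phone = text_nodes[2].strip(), text_nodes[4].strip()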
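
The fixed indices above only work while every contact block has the same number of lines. Here is a minimal sketch of a less position-dependent variant. It assumes US-style numbers such as (330) 971-7456 and assumes the address is the text node just before the e-mail address. The PHONE_RE pattern and the extract_contact helper are illustrative names, not part of lxml.

```python
import re
from lxml.html import fromstring

html_content = ''' <div style=""> <strong>Pamela Banchy, Chief Information Officer</strong> <br>Western Reserve Hospital<br> <br>Lyndhurst, OH <br> <a href="mailto:pbanchy@westernreservehospital.org">pbanchy@westernreservehospital.org</a> <br>(330) 971-7456<br> </div> '''

# Hypothetical pattern: matches US-style numbers such as "(330) 971-7456".
PHONE_RE = re.compile(r'\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}')


def extract_contact(html):
    """Return (address, phone) from a contact block like the one in the question."""
    tree = fromstring(html)
    # Keep only text nodes that contain something other than whitespace.
    nodes = [t.strip() for t in tree.itertext() if t.strip()]
    # The phone is the first node that matches the pattern in full.
    phone = next((t for t in nodes if PHONE_RE.fullmatch(t)), None)
    # Assumption: the address is the text node just before the e-mail address.
    email_idx = next((i for i, t in enumerate(nodes) if '@' in t), None)
    address = nodes[email_idx - 1] if email_idx else None
    return address, phone


address, phone = extract_contact(html_content)
print(address)
print(phone)
```

If the layout differs (no e-mail link, or several address lines), the e-mail anchor assumption needs to be revisited.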

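User

#2 · Edited 2024-09-26 22:51:19

You can split the text on newlines, which lets you get the address and phone number with minimal post-processing. This relies on the HTML source having a line break after each field; if the markup is collapsed onto a single line, the split yields one element and the negative indices raise an IndexError.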

for elem in tree.cssselect('div'):
    # One entry per line of the source markup
    lines = elem.text_content().split('\n')
    address = lines[-4].strip()
    phone = lines[-2].replace(' ', '')
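
When the markup arrives on a single line, the newline split has nothing to split on. One possible workaround, sketched here under the assumption that each <br> marks a field boundary, is to insert a newline at every <br> before calling text_content(). The variable name lines and the choice of indices are illustrative.

```python
from lxml.html import fromstring

html_content = ''' <div style=""> <strong>Pamela Banchy, Chief Information Officer</strong> <br>Western Reserve Hospital<br> <br>Lyndhurst, OH <br> <a href="mailto:pbanchy@westernreservehospital.org">pbanchy@westernreservehospital.org</a> <br>(330) 971-7456<br> </div> '''

tree = fromstring(html_content)

# Turn every <br> into an explicit line break by prefixing its tail text.
for br in tree.iter('br'):
    br.tail = '\n' + (br.tail or '')

# Split on the inserted breaks and drop lines that are only whitespace.
lines = [line.strip() for line in tree.text_content().split('\n') if line.strip()]

# Assumption: the phone is the last line and the address sits two lines before it.
address, phone = lines[-3], lines[-1]
print(address)
print(phone)
```

Dropping the blank lines first keeps the indices stable even when the source contains consecutive <br> tags.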

User

#3 · Edited 2024-09-26 22:51:19

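I think a better approach is to use XPath. The text after each <br> belongs to the enclosing <div>, not to the <br> element, so select the div's own non-empty text nodes:

address, phone = [t.strip() for t in tree.xpath('//div/text()[normalize-space()]')][-2:]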
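
A <br> element has no text children, so an expression ending in br/text() returns an empty list and the tuple unpacking fails. Below is a self-contained sketch of the XPath idea that selects the text nodes of the <div> instead. It assumes the last two non-empty text nodes are the address and the phone number, as in the sample markup; the variable name texts is illustrative.

```python
from lxml.html import fromstring

html_content = ''' <div style=""> <strong>Pamela Banchy, Chief Information Officer</strong> <br>Western Reserve Hospital<br> <br>Lyndhurst, OH <br> <a href="mailto:pbanchy@westernreservehospital.org">pbanchy@westernreservehospital.org</a> <br>(330) 971-7456<br> </div> '''

tree = fromstring(html_content)

# A <br> has no text children, so this expression matches nothing.
assert tree.xpath('./div/br/text()') == []

# text() selects the text nodes that are direct children of the <div>;
# the normalize-space() predicate discards whitespace-only nodes.
texts = [t.strip() for t in tree.xpath('//div/text()[normalize-space()]')]

# Assumption: the last two remaining nodes are the address and the phone.
address, phone = texts[-2:]
print(address)
print(phone)
```

The name and the e-mail address are skipped automatically because they live inside <strong> and <a>, not directly inside the <div>.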
