Unable to extract an address from certain HTML elements

from bs4 import BeautifulSoup
import re

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing")).find_next_siblings()
print(items)
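A likely reason the snippet above prints only `<br/>` tags: called with no arguments, `find_next_siblings()` returns tag siblings and skips the text nodes that hold the address. A minimal sketch of one workaround is to walk `.next_siblings` and keep only the non-empty text nodes. It uses the built-in `html.parser` instead of `lxml` (an assumption made to avoid an extra dependency), and the name `address` is only illustrative.

```python
import re
from bs4 import BeautifulSoup, NavigableString

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the text node that contains the "Mailing" label.
label = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing"))

# .next_siblings yields both <br/> tags and text nodes;
# keep only the text nodes that are not blank after stripping.
address = [
    sibling.strip()
    for sibling in label.next_siblings
    if isinstance(sibling, NavigableString) and sibling.strip()
]
print(address)
# ['1961 MAIN ST #186', 'WATSONVILLE, CA, 95076', 'United States']
```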

3 answers

User

#1 · edited 2024-10-02 20:31:07

from bs4 import BeautifulSoup
import re

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow")

items_list = items.text.split('\n')

results = [ x.strip() for x in items_list if x.strip() != '' ]

print(results)

Output:

['Mailing', '1961 MAIN ST #186', 'WATSONVILLE, CA, 95076', 'United States']
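Splitting `.text` on `'\n'` depends on the markup having a line break around each `<br/>`. If the page were served minified (an assumption; the question only shows the formatted HTML), `.text` would run the pieces together. A sketch that avoids this uses `get_text()` with an explicit separator, which places the separator between text nodes regardless of the whitespace in the source. The `minified` string and the `html.parser` choice are illustrative.

```python
from bs4 import BeautifulSoup

# Same content as the question, but with no newlines between the pieces.
minified = (
    '<div class="ACA_TabRow ACA_FLeft">Mailing<br/>1961 MAIN ST #186<br/>'
    'WATSONVILLE, CA, 95076<br/>United States<br/></div>'
)

soup = BeautifulSoup(minified, "html.parser")
row = soup.find(class_="ACA_TabRow")

# separator="\n" is inserted between text nodes; strip=True trims each
# node and drops the ones that are empty after trimming.
lines = row.get_text(separator="\n", strip=True).split("\n")
print(lines)
# ['Mailing', '1961 MAIN ST #186', 'WATSONVILLE, CA, 95076', 'United States']
```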

User

#2 · edited 2024-10-02 20:31:07

I would check whether the first stripped string in the div starts with Mailing and, if it does, print the strings that follow it:

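from bs4 import BeautifulSoup

# html is the same string as in the question
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow")

for i,item in enumerate(items.stripped_strings):
    # stop immediately if the div is not a Mailing block
    if i==0 and not item.startswith('Mailing'):
        break
    # skip the label itself and print the address lines
    if i!=0:
        print(item)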

Output:

1961 MAIN ST #186
WATSONVILLE, CA, 95076
United States
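The same check can be sketched as a reusable function that returns the address lines instead of printing them. The function name `address_lines`, its `label` parameter, and the use of `html.parser` are assumptions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""

def address_lines(markup, label="Mailing"):
    """Return the stripped strings after the label, or [] if the label is absent."""
    soup = BeautifulSoup(markup, "html.parser")
    row = soup.find(class_="ACA_TabRow")
    if row is None:
        return []
    strings = list(row.stripped_strings)
    # The first stripped string must be the expected label.
    if not strings or not strings[0].startswith(label):
        return []
    return strings[1:]

print(address_lines(html))
# ['1961 MAIN ST #186', 'WATSONVILLE, CA, 95076', 'United States']
```

Returning an empty list when the label does not match mirrors the `break` in the loop above, so the caller can tell a non-matching div from one that holds an address.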

User

#3 · edited 2024-10-02 20:31:07

It looks like I found a better solution:

from bs4 import BeautifulSoup
import re

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing")).find_parent()
find_text = ' '.join([item.strip() for item in items.strings])
print(find_text)

Output:

Mailing 1961 MAIN ST #186 WATSONVILLE, CA, 95076 United States
