BeautifulSoup for a table-like site structure | returning a dictionary

Posted 2024-10-02 14:19:19


I have some HTML that looks rather like a dictionary:

Manufacturer website: …

Headquarters: location, etc. …

Each item sits in its own div (hence findAll on the div class name).

Is there an elegant, simple way to extract this into a dictionary? Or do I have to loop over each div, find the two text items inside it, and assume the first text item is the dictionary key and the second is the value of that same dict entry?

Sample of the site's code:

    car = '''
     <div class="info flexbox">
       <div class="infoEntity">
        <span class="manufacturer website">
         <a class="link" href="http://www.ford.com" rel="nofollow noreferrer" target="_blank">
          www.ford.com
         </a>
        </span>
       </div>
       <div class="infoEntity">
        <label>
         Headquarters
        </label>
        <span class="value">
         Dearbord, MI
        </span>
       </div>
       <div class="infoEntity">
        <label>
         Model
        </label>
        <span class="value">
         Mustang
        </span>
       </div>
    '''

    from bs4 import BeautifulSoup

    car_soup = BeautifulSoup(car, 'lxml')
    print(car_soup.prettify())

    elements = car_soup.findAll('div', class_='infoEntity')
    for x in elements:
        print(x)  # and then we start iterating over x, with BeautifulSoup, to find the value of each element

The desired output would be something like this:

    result = {'manufacturer website': 'ford.com', 'Headquarters': 'Dearborn, Mi', 'Model': 'Mustang'}

Incidentally, at this point I have already done this a few times in a not-so-elegant way; I just want to know whether I am missing something and whether there is a better way to do it. Thanks in advance.
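For reference, a minimal sketch of that per-div approach (hypothetical code, not from the original post) might look like this, assuming each infoEntity div yields either a label/value text pair or a single link text, with the 'manufacturer website' key hard-coded for the link-only case:

    from bs4 import BeautifulSoup

    def info_to_dict(html):
        """Collect the text inside each infoEntity div into a dict (rough sketch)."""
        result = {}
        for div in BeautifulSoup(html, 'lxml').find_all('div', class_='infoEntity'):
            texts = list(div.stripped_strings)
            if len(texts) == 2:
                # a <label>/<span class="value"> pair, e.g. Headquarters / Dearbord, MI
                result[texts[0]] = texts[1]
            elif len(texts) == 1:
                # the link-only entry; this key is assumed, it does not appear in the markup text
                result['manufacturer website'] = texts[0]
        return result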


2 Answers

The HTML structure here is quite generic: it contains multiple infoEntity divs whose child content can be formatted in more than one way. To handle this, you can iterate over the infoEntity divs and handle each child tag according to its type, as shown below:

    from bs4 import BeautifulSoup as soup

    result, label = {}, None
    for i in soup(car, 'html.parser').find_all('div', {'class': 'infoEntity'}):
        for b in i.find_all(['span', 'label']):
            if b.name == 'label':
                # remember the label text; the matching value <span> comes next
                label = b.get_text(strip=True)
            elif b.name == 'span' and label is not None:
                result[label] = b.get_text(strip=True)
                label = None
            else:
                # a <span> with no preceding <label> (the website link): use its class names as the key
                result[' '.join(b['class'])] = b.get_text(strip=True)

Output:

{'manufacturer website': 'www.ford.com', 'Headquarters': 'Dearbord, MI', 'Model': 'Mustang'}
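One detail worth noting in the loop above: BeautifulSoup treats class as a multi-valued attribute and returns it as a list of class names, which is why the fallback branch joins it back into a single string to build the key. A quick illustrative snippet (assuming car holds the HTML from the question):

    from bs4 import BeautifulSoup

    span = BeautifulSoup(car, 'lxml').select_one('span.manufacturer.website')
    print(span['class'])            # ['manufacturer', 'website'], a list, because class is multi-valued
    print(' '.join(span['class']))  # 'manufacturer website', the key used in the result dict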

Alternatively, to keep things more generic and simpler, you can split the handling between the labelled fields and the manufacturer-website link:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(car, 'lxml')

    car_info = soup.select_one('.info')
    # labelled fields: pair each <label> with the text of its next sibling <span class="value">
    data = {
        label.get_text(strip=True): label.find_next_sibling().get_text(strip=True)
        for label in car_info.select('.infoEntity label')
    }
    # the website entry has no <label>, so pull the link text out separately
    data['manufacturer website'] = car_info.select_one('.infoEntity a').get_text(strip=True)

    print(data)

Prints:

{'Headquarters': 'Dearbord, MI', 
 'Model': 'Mustang', 
 'manufacturer website': 'www.ford.com'}
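If the same extraction has to run over many such blocks (say, one div.info per car on a listing page), the two-step approach above can be wrapped in a small helper. Note that parse_info below is a hypothetical name and the one-block-per-car layout is an assumption, not something stated in the question:

    from bs4 import BeautifulSoup

    def parse_info(block):
        """Turn one div.info block into a dict using the label/value + link split above."""
        data = {
            label.get_text(strip=True): label.find_next_sibling().get_text(strip=True)
            for label in block.select('.infoEntity label')
        }
        link = block.select_one('.infoEntity a')
        if link is not None:
            data['manufacturer website'] = link.get_text(strip=True)
        return data

    page = BeautifulSoup(car, 'lxml')   # or the full page HTML
    all_cars = [parse_info(b) for b in page.select('.info')]
    print(all_cars)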
