Answers to: How can I efficiently parse large HTML div class and span data with Python BeautifulSoup?





1 answer

Anonymous, 1 day ago

Skilled in: Python, MySQL, Java

Here are some suggestions for parsing <code>html</code> with BeautifulSoup. They have helped me a lot and may help you too.
<blockquote>
<ol>
<li>Use the 'id' to locate an element instead of the 'class', because class names change more frequently than ids.</li>
<li>Use structural information to locate an element instead of the 'class'; the structure changes less frequently.</li>
<li>Sending request headers that include a User-Agent is always better than sending no headers. In this case, if you do not specify headers, you cannot find the id 'Col1-1-Financials-Proxy', but you can find 'Col1-3-Financials-Proxy', which does not match what the Chrome inspector shows.</li>
</ol>
</blockquote>
Below is runnable code for your requirement that uses structural information to locate the elements. You can certainly use 'class' information to do the same thing. Just remember to check the website's source code whenever your code stops working.
<pre class="lang-py prettyprint-override"><code># import libraries
import requests
from bs4 import BeautifulSoup

# set the URLs you want to scrape
first_page_url = 'https://finance.yahoo.com/quote/AAPL/balance-sheet?p=AAPL'
second_page_url = 'https://finance.yahoo.com/quote/AAPL/financials?p=AAPL'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}

#################
# first page
#################
print('*' * 10, ' FIRST PAGE RESULT ', '*' * 10)
total_assets = {}
total_current_liabilities = {}
operating_income_or_loss = {}
page1_table_keys = []
page2_table_keys = []

# connect to the first page URL
response = requests.get(first_page_url, headers=headers)
# parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
# the nearest id to get the result
sheet = soup.find(id='Col1-1-Financials-Proxy')
sheet_section_divs = sheet.section.find_all('div', recursive=False)
# last child
sheet_data_div = sheet_section_divs[-1]
div_ele_table = sheet_data_div.find('div').find('div').find_all('div', recursive=False)

# table header
div_ele_header = div_ele_table[0].find('div').find_all('div', recursive=False)
# the first element is the label and the remaining elements contain data,
# so use range(1, len())
for i in range(1, len(div_ele_header)):
    page1_table_keys.append(div_ele_header[i].find('span').text)

# table body
div_ele = div_ele_table[-1]
div_eles = div_ele.find_all('div', recursive=False)
tgt_div_ele1 = div_eles[0].find_all('div', recursive=False)[-1]
tgt_div_ele1_row = tgt_div_ele1.find_all('div', recursive=False)[-1]
tgt_div_ele1_row_eles = tgt_div_ele1_row.find('div').find_all('div', recursive=False)
# the first element is the label and the remaining elements contain data,
# so use range(1, len())
for i in range(1, len(tgt_div_ele1_row_eles)):
    total_assets[page1_table_keys[i - 1]] = tgt_div_ele1_row_eles[i].find('span').text

tgt_div_ele2 = div_eles[1].find_all('div', recursive=False)[-1]
tgt_div_ele2 = tgt_div_ele2.find('div').find_all('div', recursive=False)[-1]
tgt_div_ele2 = tgt_div_ele2.find('div').find_all('div', recursive=False)[-1]
tgt_div_ele2_row = tgt_div_ele2.find_all('div', recursive=False)[-1]
tgt_div_ele2_row_eles = tgt_div_ele2_row.find('div').find_all('div', recursive=False)
# the first element is the label and the remaining elements contain data,
# so use range(1, len())
for i in range(1, len(tgt_div_ele2_row_eles)):
    total_current_liabilities[page1_table_keys[i - 1]] = tgt_div_ele2_row_eles[i].find('span').text

print('Total Assets', total_assets)
print('Total Current Liabilities', total_current_liabilities)

#################
# second page, same logic as the first page
#################
print('*' * 10, ' SECOND PAGE RESULT ', '*' * 10)
# connect to the second page URL
response = requests.get(second_page_url, headers=headers)
# parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
# the nearest id to get the result
sheet = soup.find(id='Col1-1-Financials-Proxy')
sheet_section_divs = sheet.section.find_all('div', recursive=False)
# last child
sheet_data_div = sheet_section_divs[-1]
div_ele_table = sheet_data_div.find('div').find('div').find_all('div', recursive=False)

# table header
div_ele_header = div_ele_table[0].find('div').find_all('div', recursive=False)
# the first element is the label and the remaining elements contain data,
# so use range(1, len())
for i in range(1, len(div_ele_header)):
    page2_table_keys.append(div_ele_header[i].find('span').text)

# table body
div_ele = div_ele_table[-1]
div_eles = div_ele.find_all('div', recursive=False)
tgt_div_ele_row = div_eles[4]
tgt_div_ele_row_eles = tgt_div_ele_row.find('div').find_all('div', recursive=False)
for i in range(1, len(tgt_div_ele_row_eles)):
    operating_income_or_loss[page2_table_keys[i - 1]] = tgt_div_ele_row_eles[i].find('span').text

print('Operating Income or Loss', operating_income_or_loss)
</code></pre>
Output with the headers info: ^{pr2}$
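The answer's code repeats one pattern several times: take a row's direct child divs, skip the first (the label), and read the span text from the rest. A minimal sketch of that pattern as a reusable helper follows. The markup in `SAMPLE_HTML`, the helper names `child_divs` and `row_values`, and the values "Year 1", "Year 2", "100" and "90" are all made up for illustration; the real page is nested more deeply and its layout may have changed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched page (placeholder values only).
SAMPLE_HTML = """
<div id="Col1-1-Financials-Proxy">
  <section>
    <div>toolbar</div>
    <div>
      <div>
        <div><span>Breakdown</span></div>
        <div><span>Year 1</span></div>
        <div><span>Year 2</span></div>
      </div>
      <div>
        <div><span>Total Assets</span></div>
        <div><span>100</span></div>
        <div><span>90</span></div>
      </div>
    </div>
  </section>
</div>
"""


def child_divs(tag):
    """Return only the direct <div> children of tag (no deeper descendants)."""
    return tag.find_all('div', recursive=False)


def row_values(row_div):
    """Return the <span> text of each direct child cell of a row."""
    return [cell.find('span').text for cell in child_divs(row_div)]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Tip 1: start from the nearest stable id.
sheet = soup.find(id='Col1-1-Financials-Proxy')

# Tip 2: walk the structure; the last direct child div of <section> is the table.
table = child_divs(sheet.section)[-1]
header_div, total_assets_div = child_divs(table)

# The first cell of each row is a label, the remaining cells hold data.
keys = row_values(header_div)[1:]
label, *values = row_values(total_assets_div)
total_assets = dict(zip(keys, values))

print(label, total_assets)
```

Pairing the header keys with the row values through `zip` avoids the `i - 1` index arithmetic, and `recursive=False` keeps each lookup limited to one level of the tree, which matters when the same tag name is nested many levels deep.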


