How do I scrape hidden table contents with BeautifulSoup in Python?

Published 2024-09-23 22:30:38

I am trying to scrape data from a stock website, but the table contents are hidden. The site is http://www.moneycontrol.com/stocks/histstock.php

1. Select Index
2. Select S&P BSE MIDCAP
3. Filter data from Jan 2019 to Jan 2020 to get to the final page
4. I want to scrape the table contents of this page

This is what I tried with BeautifulSoup:

import requests
from bs4 import BeautifulSoup
link='http://www.moneycontrol.com/stocks/hist_index_result.php?indian_indices=25'
html=requests.get(link)
html.status_code #200
raw=html.content
soup=BeautifulSoup(raw,'html.parser') #have tried with xml and html5lib
soup.find_all('table',{'class':'tblchart'})
#output
[<table border="0" cellpadding="0" cellspacing="0" class="tblchart">
                    </table>]

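I also tried Selenium, but the result was the same.

I am having a hard time getting the information.

Any suggestions, answers, or nudges in the right direction would be much appreciated.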


2 Answers

Here is a solution using only BeautifulSoup. The data is loaded dynamically via Ajax, but you can simulate that request with the requests module:

import requests
from bs4 import BeautifulSoup


data = {
    'mth_frm_mth':'01',
    'mth_frm_yr':'2019',
    'mth_to_mth':'01',
    'mth_to_yr':'2020',
    'hdn':'monthly'
}

url = 'https://www.moneycontrol.com/stocks/hist_index_result.php?indian_indices=26'
soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')

all_data = []
for tr in soup.select('.tblchart tr:has(td)'):
    tds = [td.get_text(strip=True) for td in tr.select('td')]
    all_data.append(tds)

# print on screen
print('{:<15}{:<15}{:<15}{:<15}{:<15}'.format('Date', 'Open', 'High', 'Low', 'Close'))
for row in all_data:
    print('{:<15}{:<15}{:<15}{:<15}{:<15}'.format(*row))

Prints:

Date           Open           High           Low            Close          
Jan 2020       13720.24       14946.21       13686.28       14667.96       
Dec 2019       13584.07       13716.74       13103.54       13699.37       
Nov 2019       13598.71       13729.32       13310.46       13560.57       
Oct 2019       13190.78       13583.13       12669.63       13558.05       
Sep 2019       12536.96       13648.30       12321.25       13170.76       
Aug 2019       12698.94       12755.07       11950.86       12534.70       
July 2019      14275.76       14375.47       12492.30       12692.18       
June 2019      14882.18       15022.09       13803.07       14239.33       
May 2019       14653.64       15039.53       13693.41       14867.04       
Apr 2019       15069.13       15229.85       14585.92       14624.56       
Mar 2019       13719.93       15034.53       13719.80       15027.36       
Feb 2019       13961.93       14064.51       13099.46       13689.84       
Jan 2019       14724.03       14790.99       13652.03       13926.22       
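
If the rows are needed for further processing rather than printing, the same selector can feed a list of dicts with numeric values. The following is only a minimal sketch: `SAMPLE_HTML` is a hypothetical stand-in for the response body (its two data rows are copied from the output above, and the header row using `<th>` cells is an assumption about the page), and `parse_rows` and `COLUMNS` are illustrative names, not part of the answer's code.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML returned by the POST request;
# the two data rows are copied from the printed output above.
SAMPLE_HTML = """
<table class="tblchart">
  <tr><th>Date</th><th>Open</th><th>High</th><th>Low</th><th>Close</th></tr>
  <tr><td>Jan 2020</td><td>13720.24</td><td>14946.21</td><td>13686.28</td><td>14667.96</td></tr>
  <tr><td>Dec 2019</td><td>13584.07</td><td>13716.74</td><td>13103.54</td><td>13699.37</td></tr>
</table>
"""

COLUMNS = ("Date", "Open", "High", "Low", "Close")


def parse_rows(html):
    """Return one dict per data row of the .tblchart table."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # tr:has(td) keeps only rows that contain at least one <td> cell
    for tr in soup.select(".tblchart tr:has(td)"):
        cells = [td.get_text(strip=True) for td in tr.select("td")]
        if len(cells) != len(COLUMNS):
            continue  # skip rows that do not match the expected layout
        record = dict(zip(COLUMNS, cells))
        # Convert the four price columns from strings to floats
        for key in COLUMNS[1:]:
            record[key] = float(record[key].replace(",", ""))
        records.append(record)
    return records


records = parse_rows(SAMPLE_HTML)
print(records[0]["Date"], records[0]["Close"])
```

With the live page, `parse_rows(requests.post(url, data=data).content)` would take the place of the sample string, assuming the response keeps the same five-column layout.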

OK folks, I actually solved this with Selenium. I had to update my Selenium package, and then it worked like a charm.

Here is how I did it:

  import pandas as pd
  from selenium import webdriver
  from selenium.webdriver.common.by import By

  link='http://www.moneycontrol.com/stocks/histstock.php'

  driver=webdriver.Chrome()
  driver.get(link)

  #selecting the index in Step 1
  #(find_element_by_* was removed in Selenium 4.3; find_element(By.XPATH, ...) replaces it)
  driver.find_element(By.XPATH,'//*[@id="wutabs2"]').click()

  #Selecting from the dropdown Index options in step 2
  drop=driver.find_element(By.XPATH,'//*[@id="indian_indices"]')
  drop.click()
  drop.send_keys('S&P BSE MIDCAP')

  #select the month in step 3

  month=driver.find_element(By.XPATH,'/html/body/div[3]/div[3]/div/div[7]/div[2]/div[6]/table/tbody/tr/td[3]/form/div[2]/select[2]')
  month.click()
  month.send_keys('2019')

  #click on search
  driver.find_element(By.XPATH,'/html/body/div[3]/div[3]/div/div[7]/div[2]/div[6]/table/tbody/tr/td[3]/form/div[4]/input[1]').click()

  #getting the contents
  for i in driver.find_elements(By.CSS_SELECTOR,'table.tblchart'):
       a=i.text

  a=a.split('\n')

  #storing it as a data frame
  df=pd.DataFrame(a)

  #removing the first row as it contained the table headers
  df.drop(index=0,inplace=True)

  #splitting the columns using space and storing them separately
  df['Month']=df[0].str.split(' ', expand=True)[0]
  df['Year']=df[0].str.split(' ', expand=True)[1]
  df['Open']=df[0].str.split(' ', expand=True)[2]
  df['High']=df[0].str.split(' ', expand=True)[3]
  df['Low']=df[0].str.split(' ', expand=True)[4]
  df['Close']=df[0].str.split(' ', expand=True)[5]
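
The six assignments above each repeat the same split. As a rough alternative, the split can be done once and the resulting columns named in a single step. This is only a sketch under assumptions: `lines` is a hypothetical sample of what `a.split('\n')` might return (the header text is a guess, and the data rows are copied from the output in the first answer), and `parts` and `price_cols` are illustrative names.

```python
import pandas as pd

# Hypothetical sample of the text lines produced by a.split('\n');
# the header line is assumed, the data rows come from the first answer.
lines = [
    "Date Open High Low Close",
    "Jan 2020 13720.24 14946.21 13686.28 14667.96",
    "Dec 2019 13584.07 13716.74 13103.54 13699.37",
]

# Drop the header line, then split every remaining line once.
parts = pd.Series(lines[1:]).str.split(" ", expand=True)
parts.columns = ["Month", "Year", "Open", "High", "Low", "Close"]

# Convert the price columns from strings to floats.
price_cols = ["Open", "High", "Low", "Close"]
parts[price_cols] = parts[price_cols].astype(float)

print(parts)
```

Splitting once avoids recomputing the same expansion for every column, and the float conversion makes the prices usable for arithmetic or plotting.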
