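I am using Selenium WebDriver and BeautifulSoup to scrape a website that has a variable number of pages. I am doing it
crudely through XPath. The pager shows five page links at a time; once the count reaches five, I press the Next button and reset the XPath
count to get the next five pages. For this I need either the total number of pages on the site, obtained through code, or a better way of navigating to the different pages.
I think the page uses JavaScript for navigation. The code is as follows: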
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.maximize_window()
spg_index = ' '
url = "https://www.bseindia.com/corporates/ann.html"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
html = soup.prettify()
with open('bseann.txt', 'w', encoding='utf-8') as f:
    f.write(html)
time.sleep(1)
i = 1  # index of pages navigated; kept at a maximum of 31 at present
k = 1  # goes up to 5, the maximum number of page links shown at one time
while i < 31:
    next_pg = 9  # XPath index that pinpoints the "next" button
    snext_pg = str(next_pg)
    snext_pg = snext_pg.strip()
    if i > 5:
        next_pg = 10  # after the first set of pages there is an additional option
    if (i == 6) or (i == 11) or (i == 16):  # reset the XPath index for each new set of pages
        k = 2
        path = '/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/ul/li['
        path = path + snext_pg + ']/a'
        next_page_btn_list = driver.find_elements(By.XPATH, path)
        next_page_btn = next_page_btn_list[0]
        next_page_btn.click()  # click "next" to show the next set of pages
        time.sleep(1)
    pg_index = k + 2
    spg_index = str(pg_index)
    spg_index = spg_index.strip()
    path = '/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/ul/li['
    path = path + spg_index + ']/a'
    next_page_btn_list = driver.find_elements(By.XPATH, path)
    next_page_btn = next_page_btn_list[0]
    next_page_btn.click()  # click the specific page number
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    html = soup.prettify()
    i = i + 1
    k = k + 1
    with open('bseann.txt', 'a', encoding='utf-8') as f:
        f.write(html)
There is no need to use Selenium here, because you can get the information directly from the API. A total of 247 announcements were published.
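The general idea of paging through a JSON API instead of driving a browser could be sketched roughly as follows. This is only an illustration under assumptions: the URL `https://example.com/api/announcements`, the `pageno` query parameter, and the assumption that each page comes back as a JSON list are all hypothetical placeholders, not the site's actual API. The real endpoint and parameters would have to be read from the browser's developer tools (Network tab) while the page loads.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace with the request the page really makes.
API_URL = "https://example.com/api/announcements"


def fetch_page(page_no):
    """Fetch one page from the (assumed) JSON endpoint and return its rows."""
    query = urlencode({"pageno": page_no})  # "pageno" is an assumed parameter name
    request = Request(f"{API_URL}?{query}", headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_all(fetch=fetch_page, max_pages=1000):
    """Collect rows page by page until a page comes back empty."""
    rows = []
    for page_no in range(1, max_pages + 1):
        page_rows = fetch(page_no)
        if not page_rows:  # an empty page is taken to mean "no more pages"
            break
        rows.extend(page_rows)
    return rows
```

Because the loop stops at the first empty page, the total number of pages never has to be known in advance. Calling `fetch_all()` with no arguments would hit the placeholder URL, so substitute the real endpoint first; `len(fetch_all())` then gives the number of announcements retrieved.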
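A bit more information about your use case would have helped in answering your question. However, to find the total number of pages on the website, you can open the site and keep clicking the item with the text Next, extracting the data you need from each page as you go.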
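A minimal sketch of that click-Next-until-done loop might look like this. It assumes the pager exposes a link whose visible text is exactly "Next" and that this link is absent or hidden on the last page; neither has been verified against the site in question, and `walk_pages` and `handle_page` are illustrative names, not part of Selenium. The locator string `"link text"` is the value of Selenium's `By.LINK_TEXT`.

```python
import time

LINK_TEXT = "link text"  # same value as selenium.webdriver.common.by.By.LINK_TEXT


def walk_pages(driver, handle_page, next_text="Next", max_pages=500, pause=1.0):
    """Call handle_page(html) on every page, clicking "Next" until it is gone.

    Returns the number of pages visited. Assumes the "Next" link is
    missing or not displayed on the last page.
    """
    pages = 0
    while pages < max_pages:
        handle_page(driver.page_source)
        pages += 1
        next_links = driver.find_elements(LINK_TEXT, next_text)
        if not next_links or not next_links[0].is_displayed():
            break  # no usable "Next" link: this was the last page
        next_links[0].click()
        time.sleep(pause)  # crude fixed wait for the new page to render
    return pages


# Usage with a real browser (requires selenium and a Chrome driver):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://www.bseindia.com/corporates/ann.html")
#   html_pages = []
#   total = walk_pages(driver, html_pages.append)
```

The return value is the total page count, so no XPath index has to be reset every five pages. If the site disables the Next link rather than hiding it, the stop condition would need to check for that instead, and an explicit wait is usually more reliable than a fixed sleep.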