How to access and parse the frame source on a classic ASP website in a Python web-scraping application

Published 2024-10-04 07:25:05


I am trying to build a web crawler, based on this tutorial, to download an assortment of PDFs from SAI: https://www.youtube.com/watch?v=sVNJOiTBi_8

I tried calling get(url) on the page that I believe holds the source I need.

Code (authentication not included):

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # login/authentication steps omitted here

def subscription_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.saiglobal.com/online/Script/listvwstds.asp?TR=' + str(page)
        source_code = session.get(url)  # need to get the frame source!!
        # extra code needed here: find the href for the frame, then get the frame source
        # https://www.saiglobal.com/online/Script/ListVwStds.asp

        plain_text = source_code.text  # printing source_code.text does not show the page we want

        # NOTE: r is the module-level response fetched before this function is called, not source_code
        with open('source_code.text', 'wb') as play_file:
            for chunk in r.iter_content(100000):
                play_file.write(chunk)

        soup = BeautifulSoup(source_code.content, features='html.parser')
        for link in soup.find_all('a', {'class': 'stdLink'}):  # can't find the std link
            href = "https://www.saiglobal.com/online/" + link.get('href')
            # get_pdf(href)
            print(href)
        page += 1

'''
# disabled for now; the file names will probably be poor, but that can be fixed later
# needs: import os, from urllib.parse import urljoin, and folder_location defined
def get_pdf(std_url):
    source_code = session.get(std_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, features='html.parser')
    for link in soup.select("a[href$='.pdf']"):
        # name the pdf files using the last portion of each link, which is unique in this case
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(session.get(urljoin(std_url, link['href'])).content)
'''
r = session.get("https://www.saiglobal.com/online/")
print(r.status_code)

subscription_spider(1)

r = session.get("https://www.saiglobal.com/online/Script/Logout.asp")  # not sure if this logs out or not
print(r.status_code)

The text file that gets created contains:

<frameset rows="*, 1">
<frame SRC="Script/Login.asp?">
<frame src="Script/Check.asp" noresize>
</frameset>

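But when I inspect the element in the browser, what I want is the source of the frame, which I cannot access directly. I think the problem is related to how the classic ASP page is structured, but I am not sure what to do. (Screenshot: html element)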
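A frameset page like the one above holds no content of its own; each frame's src points at a separate document that has to be requested on its own. The following is a minimal sketch of that idea, not a confirmed fix for this site. The names FrameSrcParser, frame_urls and fetch_frames are hypothetical, the base URL https://www.saiglobal.com/online/ is an assumption about where the frameset was served from, and the standard-library HTML parser is used in place of BeautifulSoup only so the sketch has no third-party dependency.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSrcParser(HTMLParser):
    """Collects the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so SRC and src both match.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(html, base_url):
    """Return absolute URLs for the frames declared in html."""
    parser = FrameSrcParser()
    parser.feed(html)
    # Frame src values are relative to the page that declared them.
    return [urljoin(base_url, src) for src in parser.sources]


def fetch_frames(session, page_url):
    """Fetch page_url, then each frame it declares; returns {frame_url: html}.

    session is assumed to be an already authenticated requests.Session
    (or any object whose get(url) returns a response with .url and .text).
    """
    outer = session.get(page_url)
    return {u: session.get(u).text for u in frame_urls(outer.text, outer.url)}


# The frameset from the question, resolved against the assumed base URL.
FRAMESET = """<frameset rows="*, 1">
<frame SRC="Script/Login.asp?">
<frame src="Script/Check.asp" noresize>
</frameset>"""

print(frame_urls(FRAMESET, "https://www.saiglobal.com/online/"))
```

With the question's session, a call such as fetch_frames(session, url) would return the HTML of each frame, which could then be passed to BeautifulSoup to look for the stdLink anchors. The Login.asp frame may also mean the session is not authenticated at that point, in which case the frames would contain the login page rather than the listing.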

The output of the program is:

200
200
Press any key to continue . . .

