I'm trying to build a web crawler application, based on this tutorial, to download a bunch of assorted PDFs from SAI Global: https://www.youtube.com/watch?v=sVNJOiTBi_8
I've tried session.get(url) on the site, and I believe what I want is to get the source code from the frame it returns. The login/captcha handling is not included below.
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

session = requests.Session()

def subscription_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.saiglobal.com/online/Script/listvwstds.asp?TR=' + str(page)
        source_code = session.get(url)  # need to get the frame source!!
        # extra code to find the href for the frame, then get the frame source
        # https://www.saiglobal.com/online/Script/ListVwStds.asp
        plain_text = source_code.text  # source_code.text printed is not the one we want
        with open('source_code.txt', 'wb') as playFile:
            for chunk in source_code.iter_content(100000):
                playFile.write(chunk)
        soup = BeautifulSoup(source_code.content, features='html.parser')
        for link in soup.findAll('a', {'class': 'stdLink'}):  # can't find the stdLink
            href = "https://www.saiglobal.com/online/" + link.get('href')
            '''
            get_pdf(href)
            '''
            print(href)
        page += 1
'''
# function will probably give a bad name to the files, but can fix this later
def get_pdf(std_url):
    source_code = session.get(std_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, features='html.parser')
    for link in soup.select("a[href$='.pdf']"):
        # name the pdf files using the last portion of each link, which is unique in this case
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(session.get(urljoin(std_url, link['href'])).content)
'''
r = session.get("https://www.saiglobal.com/online/")
print(r.status_code)
subscription_spider(1)
r = session.get("https://www.saiglobal.com/online/Script/Logout.asp")  # not sure if this logs out or not
print(r.status_code)
The text file that gets created contains
<frameset rows="*, 1">
<frame SRC="Script/Login.asp?">
<frame src="Script/Check.asp" noresize>
</frameset>
but when I inspect the element in the browser, what I actually want is inside the frame source, which I can't access directly. I think the problem has something to do with how classic ASP pages are structured, but I'm not sure what to do about it.
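One workaround I've been considering (a sketch only, assuming the session is already authenticated and the frame `src` values are relative to the frameset page's URL): parse the frameset itself, pull each `<frame>`'s `src`, and resolve it against the page URL so the same session can request the inner page directly. The `frame_urls` helper name is mine, not from the tutorial:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def frame_urls(frameset_html, page_url):
    """Return the absolute URL of every <frame> in a frameset page."""
    soup = BeautifulSoup(frameset_html, features='html.parser')
    # html.parser lowercases tag and attribute names, so SRC= is matched too
    return [urljoin(page_url, frame['src'])
            for frame in soup.find_all('frame')
            if frame.get('src')]
```

With those URLs in hand, `session.get(frame_url)` should return the HTML that subscription_spider actually needs to search for the stdLink anchors.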
The output of the program is
200
200
Press any key to continue . . .