Only getting data from the page opened in the browser

Posted 2024-06-25 07:12:58


First request body:

eprocTenders:tenderNumber:
eprocTenders:tenderCategory: -1
eprocTenders:tenderName:
eprocTenders:tenderDescription:
eprocTenders:ecvRange: -1
eprocTenders:departmentId:
eprocTenders:status: EVALUATION_COMPLETED
eprocTenders:departmentLocation:
eprocTenders:tenderCreateDateFrom: 01/04/2019
eprocTenders:tenderCreateDateTo: 31/03/2020
eprocTenders:tenderSubmissionDateFrom:
eprocTenders:tenderSubmissionDateTo:
eprocTenders:selectTender: SEARCHTENDERS
eprocTenders:butSearch: Search
eprocTenders_SUBMIT: 1
jsf_sequence: 2
eprocTenders:dataScrollerId:
eprocTenders:_link_hidden_:

Second request body:

eprocTenders:tenderNumber:
eprocTenders:tenderCategory: -1
eprocTenders:tenderName:
eprocTenders:tenderDescription:
eprocTenders:ecvRange: -1
eprocTenders:departmentId:
eprocTenders:status: EVALUATION_COMPLETED
eprocTenders:departmentLocation:
eprocTenders:tenderCreateDateFrom: 01/04/2019
eprocTenders:tenderCreateDateTo: 31/03/2020
eprocTenders:tenderSubmissionDateFrom:
eprocTenders:tenderSubmissionDateTo:
eprocTenders:selectTender: SEARCHTENDERS
eprocTenders:butSearch: Search
eprocTenders_SUBMIT: 1
jsf_sequence: 3
eprocTenders:dataScrollerId: idx2
eprocTenders:_link_hidden_: eprocTenders:dataScrollerIdidx2
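Comparing the two captured bodies, only three fields change between the initial search and the page-2 request: jsf_sequence, eprocTenders:dataScrollerId and eprocTenders:_link_hidden_. A small helper along the lines of the sketch below could build the body for an arbitrary page. Note this is only a sketch: the field names above are back-translated from a machine-translated capture, and the jsf_sequence behavior is extrapolated from just two requests, so verify both against the live form.

def page_payload(page):
    # Fields shared by both captured requests; names are back-translated
    # approximations of the real JSF component ids.
    payload = {
        'eprocTenders:status': 'EVALUATION_COMPLETED',
        'eprocTenders:tenderCreateDateFrom': '01/04/2019',
        'eprocTenders:tenderCreateDateTo': '31/03/2020',
        'eprocTenders:selectTender': 'SEARCHTENDERS',
        'eprocTenders:butSearch': 'Search',
        'eprocTenders_SUBMIT': '1',
        # In the captures jsf_sequence was 2 for the search and 3 for page 2,
        # i.e. it appears to increment by one with every postback.
        'jsf_sequence': str(page + 1),
    }
    if page > 1:
        # Only the paging requests carry the scroller fields; the initial
        # search left them empty.
        payload['eprocTenders:dataScrollerId'] = 'idx' + str(page)
        payload['eprocTenders:_link_hidden_'] = 'eprocTenders:dataScrollerIdidx' + str(page)
    return payload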

I am trying to scrape data from this website: URL

Here is the code I am trying:

import requests
import time
from bs4 import BeautifulSoup
import pandas as pd

mydata = 'https://eproc.karnataka.gov.in/eprocurement/common/eproc_tenders_list.seam'

with requests.Session() as session:
   

    session.headers = {
        'Cookie': 'JSESSIONID=DEBFA1809C30CE2F3F04D0044DFCA784.appp1vm22',
        'Content-Type': 'multipart/form-data; boundary=----WebKitFormBoundaryYxNGT6chlbwn3Ots',
        'Content-Disposition': 'form-data',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    }

    mydata_Text = []

    response = session.post(mydata, data=data, verify=False)
    soup = BeautifulSoup(response.content, 'html.parser')
    for x in range(1, 5):
        data = {
            'eprocTenders:status': 'EVALUATION_COMPLETED',
            'eprocTenders:tenderCreateDateFrom': '01/04/2019',
            'eprocTenders:tenderCreateDateTo': '31/03/2020',
            'eprocTenders:butSearch': 'Search',
            'eprocTenders_SUBMIT': 1,
            'eprocTenders:dataScrollerId': 'idx' + str(x),
            # 'eprocTenders:_link_hidden_': 'eprocTenders:dataScrollerIdidx' + str(x),
            'jsf_sequence': str(x),
            'eprocTenders:selectTender': 'SEARCHTENDERS',
        }
        print(data)
        time.sleep(5)
        mycontent = soup.find('table', attrs={'id':'eprocTenders:browserTableEprocTenders'})
        table_body = mycontent.find('tbody')
        rows = table_body.find_all('tr')
        for row in rows:
            cols = row.find_all('td')
            cols = [me.text.strip() for me in cols]
            mydata_Text.append([me for me in cols if me])
            print(len(mydata_Text))

What am I missing?


1 Answer

Answered on 2024-06-25 07:12:58

You are only getting the first page because you never make another request after that: you keep building the soup object from the same initial response.content. You need to make the request and parse it inside the loop. Try the following:

import requests
import time
from bs4 import BeautifulSoup
import pandas as pd

mydata = 'https://eproc.karnataka.gov.in/eprocurement/common/eproc_tenders_list.seam'

with requests.Session() as session:
   

    session.headers = {
        'Cookie': 'JSESSIONID=DEBFA1809C30CE2F3F04D0044DFCA784.appp1vm22',
        'Content-Type': 'multipart/form-data; boundary=----WebKitFormBoundaryYxNGT6chlbwn3Ots',
        'Content-Disposition': 'form-data',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    }

    mydata_Text = []

    # response = session.post(mydata, data=data, verify=False)  # <- moved inside the loop
    # soup = BeautifulSoup(response.content, 'html.parser')     # <- moved inside the loop
    for x in range(1, 5):
        data = {
            'eprocTenders:status': 'EVALUATION_COMPLETED',
            'eprocTenders:tenderCreateDateFrom': '01/04/2019',
            'eprocTenders:tenderCreateDateTo': '31/03/2020',
            'eprocTenders:butSearch': 'Search',
            'eprocTenders_SUBMIT': 1,
            'eprocTenders:dataScrollerId': 'idx' + str(x),
            # 'eprocTenders:_link_hidden_': 'eprocTenders:dataScrollerIdidx' + str(x),
            'jsf_sequence': str(x),
            'eprocTenders:selectTender': 'SEARCHTENDERS',
        }
        print(data)
        response = session.post(mydata, data=data, verify=False)  # <- HERE
        soup = BeautifulSoup(response.content, 'html.parser')     # <- HERE

        time.sleep(5)
        mycontent = soup.find('table', attrs={'id':'eprocTenders:browserTableEprocTenders'})
        table_body = mycontent.find('tbody')
        rows = table_body.find_all('tr')
        for row in rows:
            cols = row.find_all('td')
            cols = [me.text.strip() for me in cols]
            mydata_Text.append([me for me in cols if me])
            print(len(mydata_Text))
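Two side notes on the snippet above. First, the hard-coded multipart Content-Type header does not match what is actually sent: requests encodes a data= dict as application/x-www-form-urlencoded, so it is usually safer to drop the hand-written Content-Type (and Content-Disposition) headers and let requests set them. Second, pandas is imported but never used; once the loop finishes, the collected rows can be assembled into a DataFrame. A minimal sketch continuing the snippet above (tenders.csv is just an example filename):

# After the loop: one list of cell texts per table row.
df = pd.DataFrame(mydata_Text)
df.to_csv('tenders.csv', index=False)  # persist for later analysis
print(df.shape)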
