Search a website with Python and return the results?

Published 2024-05-04 21:00:41


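I'm trying to build a program that looks up a vehicle's registration expiry date, but when I make my POST request nothing happens and I don't get any data back.

I enter the vehicle registration on this website: https://www.vicroads.vic.gov.au/registration/buy-sell-or-transfer-a-vehicle/check-vehicle-registration/vehicle-registration-enquiry

It then redirects to the results shown at the link below: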

Result Image

Can anyone help me?

My current code is:

import requests
my_url = 'https://www.vicroads.vic.gov.au/registration/buy-sell-or-transfer-a-vehicle/check-vehicle-registration/vehicle-registration-enquiry'

s = requests.session()

s.get(my_url)

data = {'ph_pagebody_0$phthreecolumnmaincontent_1$panel$VehicleSearch$vehicle-type' :  'car/truck' ,
'ph_pagebody_0$phthreecolumnmaincontent_1$panel$VehicleSearch$vehicle-identifier-type' :  'registration number' ,
'ph_pagebody_0$phthreecolumnmaincontent_1$panel$VehicleSearch$RegistrationNumberCar$RegistrationNumber_CtrlHolderDivShown' : 'abc123'        
        }

result = requests.post(my_url, data = data)

print(result)

2 answers

TL;DR: CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.

JS code:

$.ajax({
    type: "POST",
    url: "https://www.vicroads.vic.gov.au/registration/buy-sell-or-transfer-a-vehicle/check-vehicle-registration/vehicle-registration-enquiry",
    data: data,
    success: function( data ) {
        console.log(data);
    }
});

Performing the POST request with JavaScript from within the page returns the expected response, and the page shows the following:

Registration check
Results for abc123 as at 29/03/2020 18:05 AEDT
Registration number:
ABC123
Registration status & expiry date:
Current - 26/03/2021
Vehicle:
2013 SILVER ISUZU DC UTE
VIN/Chassis:
MPATFS85JDT005836
Engine number:
LB8052
Registration serial number:
2276051
Compliance plate date:
07/2013
Sanction(s) applicable:
None
Goods carrying vehicle:
Yes
Transfer in dispute:
No
Download report PDF

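However, performing the request from outside the website fails with the CORS policy error: No 'Access-Control-Allow-Origin'. Performing the request from Python does not necessarily raise an error, but it returns the wrong response, which is what you are getting. In addition, the request data should include all of the following: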

data = {
    '__EVENTTARGET': '',  
    '__EVENTARGUMENT': '',  
    '__VIEWSTATE': {TOO LONG TO BE POSTED},  
    '__VIEWSTATEGENERATOR': '3ECD7CB5',  
    '__VIEWSTATEENCRYPTED': '',  
    'site-search-head': '',  
    'ph_pagebody_0$phheader_0$_FlyoutLogin$PersonalEmail$EmailAddress': '',  
    'ph_pagebody_0$phheader_0$_FlyoutLogin$PersonalPassword$SingleLine_CtrlHolderDivShown': '',   
    'ph_pagebody_0$phheader_0$_FlyoutLogin$OrganisationEmail$EmailAddress': '',  
    'ph_pagebody_0$phheader_0$_FlyoutLogin$OrganisationPassword$SingleLine_CtrlHolderDivShown': '',  
    'ph_pagebody_0$phheader_0$_FlyoutLogin$PartnerEmail$EmailAddress': '',  
    'ph_pagebody_0$phheader_0$_FlyoutLogin$PartnerPassword$SingleLine_CtrlHolderDivShown': '',  
    'ph_pagebody_0$phthreecolumnmaincontent_1$panel$VehicleSearch$vehicle-type': 'car/truck',  
    'ph_pagebody_0$phthreecolumnmaincontent_1$panel$VehicleSearch$vehicle-identifier-type': 'registration number',  
    'ph_pagebody_0$phthreecolumnmaincontent_1$panel$VehicleSearch$RegistrationNumberCar$RegistrationNumber_CtrlHolderDivShown': 'abc123',  
    'honeypot': '',  
    'ph_pagebody_0$phthreecolumnmaincontent_1$panel$btnSearch': 'Search'  
}

Follow the instructions here to get the missing one (__VIEWSTATE):

https://mkyong.com/computer-tips/how-to-view-http-headers-in-google-chrome/
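As an alternative to copying the value by hand from the browser, the hidden form fields can be read out of the page source. The following is a minimal sketch using only the standard library; `HiddenFieldParser`, `hidden_fields`, and `sample_html` are hypothetical names, and the sample markup is an assumed stand-in for the real page, whose __VIEWSTATE is far longer.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            # A hidden input without a value attribute is stored as an empty string
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden input fields found in the HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Assumed stand-in for the page source (not the real markup)
sample_html = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dummy-state" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="3ECD7CB5" />
  <input type="text" name="honeypot" value="" />
</form>
"""

fields = hidden_fields(sample_html)
print(fields)
```

The resulting dict can be merged into the POST data, so the hidden values always come from the same page load as the request that submits them.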

OK, basically you are not sending all the required POST parameters to the host. As you can see in the screenshot, there are multiple parameters with values.
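One way to see the gap is to compare a payload against the list of field names the form submits. This is only a sketch: `PANEL`, `EXPECTED_FIELDS`, and `missing_fields` are hypothetical names, and the field names are the ones listed in this thread, with the empty login fields left out for brevity.

```python
PANEL = 'ph_pagebody_0$phthreecolumnmaincontent_1$panel$'

# Field names copied from the form data listed in this thread
# (the empty login fields are left out to keep the sketch short)
EXPECTED_FIELDS = [
    '__EVENTTARGET',
    '__EVENTARGUMENT',
    '__VIEWSTATE',
    '__VIEWSTATEGENERATOR',
    '__VIEWSTATEENCRYPTED',
    PANEL + 'VehicleSearch$vehicle-type',
    PANEL + 'VehicleSearch$vehicle-identifier-type',
    PANEL + 'VehicleSearch$RegistrationNumberCar$RegistrationNumber_CtrlHolderDivShown',
    'honeypot',
    PANEL + 'btnSearch',
]


def missing_fields(payload, expected=EXPECTED_FIELDS):
    """Return the expected field names that the payload does not contain."""
    return [name for name in expected if name not in payload]


# The three fields sent by the code in the question
question_payload = {
    PANEL + 'VehicleSearch$vehicle-type': 'car/truck',
    PANEL + 'VehicleSearch$vehicle-identifier-type': 'registration number',
    PANEL + 'VehicleSearch$RegistrationNumberCar$RegistrationNumber_CtrlHolderDivShown': 'abc123',
}

missing = missing_fields(question_payload)
print(missing)
```

For the question's payload this reports the ASP.NET state fields, the honeypot field, and the search button as absent.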

Now we will make a GET request to parse the HTML and collect all the required values, and then make the POST request.

import requests
from bs4 import BeautifulSoup

data = {
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    '__VIEWSTATEENCRYPTED': '',
    'site-search-head': '',
    'ph_pagebody_0$phheader_0$_FlyoutLogin$PersonalEmail$EmailAddress': '',
    'ph_pagebody_0$phheader_0$_FlyoutLogin$PersonalPassword$SingleLine_CtrlHolderDivShown': '',
    'ph_pagebody_0$phheader_0$_FlyoutLogin$OrganisationEmail$EmailAddress': '',
    'ph_pagebody_0$phheader_0$_FlyoutLogin$OrganisationPassword$SingleLine_CtrlHolderDivShown': '',
    'ph_pagebody_0$phheader_0$_FlyoutLogin$PartnerEmail$EmailAddress': '',
    'ph_pagebody_0$phheader_0$_FlyoutLogin$PartnerPassword$SingleLine_CtrlHolderDivShown': '',
    'ph_pagebody_0$phthreecolumnmaincontent_1$panel$VehicleSearch$vehicle-type': 'car/truck',
    'ph_pagebody_0$phthreecolumnmaincontent_1$panel$VehicleSearch$vehicle-identifier-type': 'registration number',
    'ph_pagebody_0$phthreecolumnmaincontent_1$panel$VehicleSearch$RegistrationNumberCar$RegistrationNumber_CtrlHolderDivShown': 'abc123',
    'honeypot': '',
    'ph_pagebody_0$phthreecolumnmaincontent_1$panel$btnSearch': 'Search'
}


def main(url):
    with requests.Session() as req:
        # GET the form page first to pick up the per-page hidden fields
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        data['__VIEWSTATE'] = soup.find("input", id="__VIEWSTATE").get("value")
        data['__VIEWSTATEGENERATOR'] = soup.find(
            "input", id="__VIEWSTATEGENERATOR").get("value")
        # POST the completed form within the same session
        r = req.post(url, data=data)
        soup = BeautifulSoup(r.content, 'html.parser')
        print(soup.find_all("div", class_="display"))


main("https://www.vicroads.vic.gov.au/registration/buy-sell-or-transfer-a-vehicle/check-vehicle-registration/vehicle-registration-enquiry")
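A side note on the payload values: form data is encoded when the request is sent, so the dict should hold the raw text (a plain space), not the already-encoded form copied from the browser's network panel. The small sketch below uses the standard library's `urlencode`, which follows the same application/x-www-form-urlencoded rules, with a shortened, hypothetical key in place of the long field name.

```python
from urllib.parse import urlencode

# Shortened, hypothetical key; the real field name is much longer
plain = urlencode({'vehicle-identifier-type': 'registration number'})
pre_encoded = urlencode({'vehicle-identifier-type': 'registration+number'})

print(plain)        # vehicle-identifier-type=registration+number
print(pre_encoded)  # vehicle-identifier-type=registration%2Bnumber
```

A value that already contains `+` is encoded a second time, so the server would receive a literal plus sign instead of a space.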

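Now, if you check the output of the requests script above, you will see that it is empty. There are two reasons for this:

  1. The HTML source contains a value named monsido, which is used together with JS to generate a one-time token that authenticates requests during the session: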
<script type="text/javascript">
    var _monsido = _monsido || [];
    _monsido.push(['_setDomainToken', 'dfWhFzGbaTj5hyKQYZxi0g']);
    _monsido.push(['_withStatistics', 'true']);
</script>
<script src="//cdn.monsido.com/tool/javascripts/monsido.js"></script>
<script>
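  2. The host is protected by CloudFlare, which also requires the __cfduid parameter in the cookies.

To cut a long story short: if you call monsido under requests.Session() with the current cookies/headers, you will get the required token. You would then still need to obtain __cfduid, and I can't help you with that, because it is illegal to bypass a well-known firewall such as CloudFlare, which was designed precisely to prevent this kind of scraping.

Falling back to selenium, you can get the desired output: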

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
import pandas as pd

options = Options()
options.add_argument('--headless')  # run Firefox without a visible window
driver = webdriver.Firefox(options=options)

driver.get("https://www.vicroads.vic.gov.au/registration/buy-sell-or-transfer-a-vehicle/check-vehicle-registration/vehicle-registration-enquiry")
# Fill in the registration number and submit the search form
driver.find_element(
    By.CSS_SELECTOR,
    "input#ph_pagebody_0_phthreecolumnmaincontent_1_panel_VehicleSearch_RegistrationNumberCar_RegistrationNumber_CtrlHolderDivShown").send_keys("abc123")
driver.find_element(
    By.CSS_SELECTOR,
    "input#ph_pagebody_0_phthreecolumnmaincontent_1_panel_btnSearch").click()

# Field labels and the first ten displayed values from the results page
names = [
    item.text for item in driver.find_elements(By.CSS_SELECTOR, "label.label")]
data = [item.text for item in driver.find_elements(
    By.CSS_SELECTOR, "div.display")[:10]]

df = pd.DataFrame([data], columns=names)
df.to_csv("data.csv", index=False)

driver.quit()

Output: view-online

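The selenium script collects labels and values in two separate lists and relies on them lining up one to one. If the page layout changes, it may be safer to check that before writing the file. Below is a small sketch of that check using only the standard library; `to_row` is a hypothetical helper, and the sample values are taken from the result shown earlier in this thread.

```python
import csv
import io


def to_row(names, values):
    """Pair labels with values, refusing silently misaligned data."""
    if len(names) != len(values):
        raise ValueError(f"{len(names)} labels but {len(values)} values")
    return dict(zip(names, values))


# Sample taken from the result shown earlier in the thread
names = ["Registration number:", "Registration status & expiry date:", "Vehicle:"]
values = ["ABC123", "Current - 26/03/2021", "2013 SILVER ISUZU DC UTE"]

row = to_row(names, values)

# Write to an in-memory buffer here; a real script would open "data.csv" instead
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=names)
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

Raising on a length mismatch makes a changed page fail with a clear message about the label and value counts.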
