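So, I am trying to scrape data from the following URL:
I'm fairly good at web scraping myself, but this site uses an unusual kind of pagination, which I guess is done with JavaScript. For the first 5 pages it simply appends page={NO} to the URL, but after the first 5 pages it appends a unique identifier (query) along with the page number to each page's URL. Most of this query is the same for all pages; only a few characters differ from page to page.
The query looks like this:
Page 6: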
cmxXakxKcWNvelMwbko5aFZ3YzdWemtjb0p5MFZ3YmtBRmp2b1RTbXFSOXZuekl3cVBWNnJsV3NuSkR2QnZWMU1KSDVNUU5sQlRWMkFUU3dCSld4TEdMM0JReDRabU52WVBXc3BUeXhWd2JsWkdxOXNGanZwMkl1cHpBYkczTzBuSjlocGxWNnIzMGZWYVd1b3pFaW9JQXlNSkR2Qno1MW9Uazk=
Page 7:
cmxXakxKcWNvelMwbko5aFZ3YzdWemtjb0p5MFZ3YmtBRmp2b1RTbXFSOXZuekl3cVBWNnJsV3NuSkR2QnZWMU1RcDJaVFYxWjJaakx3cDBNSkQyWndaM0FUVjNMd3R2WVBXc3BUeXhWd2JrQUdNOXNGanZwMkl1cHpBYkczTzBuSjlocGxWNnIzMGZWYVd1b3pFaW9JQXlNSkR2Qno1MW9Uazk=
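The two tokens above look Base64-encoded. Decoding the outer layer of the page 6 token gives text beginning `rlWjLJqcozS0`, which itself resembles ROT13-shifted Base64; undoing both layers on that prefix yields `{"paginat`, which suggests a JSON payload. The sketch below peels those layers under that assumption. `peel_query` is a hypothetical helper name, and the layering is inferred from the sample tokens, not documented by the site.

```python
import base64
import codecs


def peel_query(token: str) -> str:
    """Undo the assumed Base64 -> ROT13 -> Base64 layering of a query token."""
    # Outer layer: plain Base64 that decodes to ASCII text.
    outer = base64.b64decode(token).decode("ascii")
    # Middle layer (assumption): letters are shifted by ROT13.
    shifted = codecs.decode(outer, "rot_13")
    # Inner layer: Base64 again; pad to a multiple of 4 in case padding is absent.
    padded = shifted + "=" * (-len(shifted) % 4)
    return base64.b64decode(padded).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # First 16 characters of the page 6 token shown above.
    print(peel_query("cmxXakxKcWNvelMw"))
```

If the full tokens decode the same way, comparing the decoded text for pages 6 and 7 may show which fields change between pages.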
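Python Requests
I inspected the page's code and could not find any such key for the next page. The query for the current page is in a script tag.
Try starting with the code below. This is page seven. You can see the query in the params, which I took from the Network tab.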
import requests
headers = {
'authority': 'www.11880.com',
'cache-control': 'max-age=0',
'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
'cookie': '__cfduid=d3ba308b5d5994136cfb2ffd23797ae371615711538; _gcl_au=1.1.638363378.1615711421; _ga=GA1.2.1798619679.1615711421; _gid=GA1.2.1401311919.1615711421; __gads=ID=7767acb88c5bd4d2:T=1615721793:S=ALNI_MZzcwQtHCl4hWOWahxyceRZJYrgGg; randomSeed=1615724814; referrer=none; __cf_bm=d8225bc8662e1af83b9ae8c3eebbfdb7f0613cb2-1615727481-1800-AdDP/tSscGJQRiVmW/GyJBUUNHXkWvqYbiqv47MgKrvXzBt0InecHvXrwnMtnOKbtYS/YODx2Zh1ewlOlCAgtMpvjD7Vw9FG9J+gvII/EOy2; cf_chl_2=93a30b70f062111; cf_chl_prog=a41; cf_clearance=b85ac9756885d88ad6c979309aeadb222e4f60b9-1615727532-0-250; geoIPData=eyJjb3VudHJ5X2NvZGUiOm51bGwsImNvdW50cnlfY29kZTMiOm51bGwsImNvdW50cnlfbmFtZSI6bnVsbCwicmVnaW9uIjpudWxsLCJjaXR5IjpudWxsLCJwb3N0YWxfY29kZSI6bnVsbCwibGF0aXR1ZGUiOm51bGwsImxvbmdpdHVkZSI6bnVsbCwiYXJlYV9jb2RlIjpudWxsLCJkbWFfY29kZSI6bnVsbCwibWV0cm9fY29kZSI6bnVsbCwiY29udGluZW50X2NvZGUiOm51bGwsImlwIjoiMTE5LjE2MC42Ni4xMjUsIDE3Mi42OS4xMTEuMTM4In0%3D; rlData={%22randomSeed%22:1615724814%2C%22rlUrl%22:%22/suche/Immobilienmakler/deutschland?branchen=3302469%257C3302464%257C3302249%257C3303516%257C3301609%257C3300129&sorte=%257C&modul=direct&page=5%22%2C%22adsTargeting%22:{%22ort%22:[%22deutschland%22]%2C%22suche%22:[%22Immobilienmakler%22]%2C%22url%22:[%22/suche/Immobilienmakler/deutschland%22]%2C%22branche%22:[%223302469%22%2C%223302464%22%2C%223302249%22%2C%223303516%22%2C%223301609%22%2C%223300611%22%2C%223305630%22%2C%223300129%22%2C%223305491%22%2C%223305627%22]}}',
}
# Query-string parameters copied from the browser's Network tab for page 7
params = (
    ('branchen', '3302469|3302464|3302249|3303516|3301609|3300129'),
    ('sorte', '|'),
    ('modul', 'direct'),
    ('page', '7'),
    ('query', 'cmxXakxKcWNvelMwbko5aFZ3YzdWemtjb0p5MFZ3YmtBRmp2b1RTbXFSOXZuekl3cVBWNnJsV3NuSkR2QnZWMU1RcDJaVFYxWjJaakx3cDBNSkQyWndaM0FUVjNMd3R2WVBXc3BUeXhWd2JrQUdNOXNGanZwMkl1cHpBYkczTzBuSjlocGxWNnIzMGZWYVd1b3pFaW9JQXlNSkR2Qno1MW9Uazk='),
)
response = requests.get('https://www.11880.com/suche/Immobilienmakler/deutschland', headers=headers, params=params)
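To reuse the request above for other pages, the parameters can be built per page. The following is a minimal sketch using only the standard library; `build_params` and `page_url` are hypothetical helper names, and the fixed `branchen`, `sorte` and `modul` values are simply the ones from the snippet above. The `query` token is passed in by the caller because the source does not show how the site generates it.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.11880.com/suche/Immobilienmakler/deutschland"


def build_params(page, query=None):
    """Return the query-string parameters for one result page."""
    params = {
        "branchen": "3302469|3302464|3302249|3303516|3301609|3300129",
        "sorte": "|",
        "modul": "direct",
        "page": str(page),
    }
    # The question says pages after the fifth also carry a per-page token.
    if query is not None:
        params["query"] = query
    return params


def page_url(page, query=None):
    """Build the full URL for a page, percent-encoding the parameters."""
    return f"{BASE_URL}?{urlencode(build_params(page, query))}"


if __name__ == "__main__":
    print(page_url(3))
```

The dictionary returned by `build_params` can also be passed directly as `params=` to `requests.get`.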
Python Selenium
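I tried using Selenium, but the site shows a captcha whenever I use it, and I don't know how to get past the captcha.
JavaScript or Node
I haven't tried JavaScript yet, but I expect the captcha would appear there too.
So a solution using any of the techniques above would be highly appreciated. Thanks...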
The solution is quite simple.
Just send a request like this form from each page. This POST request redirects the user to the desired page.
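The form itself is not shown here, so its field names are unknown. As a rough sketch of the idea, the helper below collects the action, method and input fields of every form in a page's HTML, so the pagination form can be replayed as a POST. `extract_forms` and the field names in the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect action, method and input name/value pairs for each form."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            # Inputs without a value attribute are stored as empty strings.
            self._current["fields"][attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def extract_forms(html):
    """Return a list of dicts describing the forms found in the HTML."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.forms


if __name__ == "__main__":
    # Made-up markup; the real form on the site may look different.
    sample = (
        '<form action="/suche" method="post">'
        '<input type="hidden" name="page" value="6">'
        '<input type="hidden" name="query" value="abc">'
        "</form>"
    )
    print(extract_forms(sample))
```

With a matching form found, its `fields` could then be sent with `requests.post(url, data=form["fields"], headers=headers)`, where `url` is the form's `action` resolved against the page URL.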