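I've created a script in Python that uses POST HTTP requests to get search results from a webpage. To populate the results, you have to click the fields in the order shown here. A new page then appears, and this is how the results are populated.
There are ten results on the first page, and the following script parses them flawlessly.
What I want to do now is use those results to reach their inner pages and parse Sole Proprietorship Name (English) from there.
This is what I've tried so far: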
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.businessregistration.moc.gov.kh/cambodia-master/service/create.html?targetAppCode=cambodia-master&targetRegisterAppCode=cambodia-br-soleproprietorships&service=registerItemSearch"

payload = {
    'QueryString': '0',
    'SourceAppCode': 'cambodia-br-soleproprietorships',
    'OriginalVersionIdentifier': '',
    '_CBASYNCUPDATE_': 'true',
    '_CBHTMLFRAG_': 'true',
    '_CBNAME_': 'buttonPush'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
    # Load the search form first; the tokens below are scraped from its HTML.
    res = s.get(url)
    # Build the POST target from the final URL by swapping "view." for "update.".
    target_url = res.url.split("&")[0].replace("view.", "update.")
    node = re.findall(r"nodeW\d.+?-Advanced", res.text)[0].strip()
    payload['_VIKEY_'] = re.findall(r"viewInstanceKey:'(.*?)',", res.text)[0].strip()
    payload['_CBHTMLFRAGID_'] = re.findall(r"guid:(.*?),", res.text)[0].strip()
    payload[node] = 'N'
    payload['_CBNODE_'] = re.findall(r"Callback\('(.*?)','buttonPush", res.text)[2]
    payload['_CBHTMLFRAGNODEID_'] = re.findall(r"AsyncWrapper(W\d.+?)'", res.text)[0].strip()

    # Submit the search and print every matching span after the first three.
    res = s.post(target_url, data=payload)
    soup = BeautifulSoup(res.content, 'html.parser')
    for item in soup.find_all("span", class_="appReceiveFocus")[3:]:
        print(item.text)
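
The token scraping in the script can be tried in isolation, without hitting the site. What follows is only a minimal sketch: `first_match` is a hypothetical helper that is not part of the original script, and `sample_page` is made-up text that merely imitates the shape the two regular expressions expect, so the real page's markup may well differ.

```python
import re


def first_match(pattern, text):
    """Return the first regex match, stripped, or raise a readable error."""
    matches = re.findall(pattern, text)
    if not matches:
        raise ValueError(f"pattern not found: {pattern!r}")
    return matches[0].strip()


# Made-up fragment standing in for the page's inline JavaScript (assumption).
sample_page = "viewInstanceKey:'abc123', guid:42, other:'x'"

# Same patterns the script uses for _VIKEY_ and _CBHTMLFRAGID_.
vikey = first_match(r"viewInstanceKey:'(.*?)',", sample_page)
frag_id = first_match(r"guid:(.*?),", sample_page)
print(vikey, frag_id)
```

Compared with indexing `re.findall(...)[0]` directly, a helper like this fails with a message naming the missing pattern instead of a bare `IndexError`, which may make it easier to notice when the page layout changes.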
How can I parse the Name (English) from each result's inner page using requests?
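This is one way to parse the name from the site's inner page and then the email address from the Addresses tab. I added the get_email() function only because I wanted to show you how to parse content from a different tab. The output looks like this: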
To get only the Name (English), simply replace
print(item.text)
with
print(item.text.split('/')[1].split('(')[0].strip())
which prints AMY GEMS.
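
The string handling in that replacement can be checked on its own. This is a sketch under an assumption: the exact text of each result is not shown here, so `sample` is a hypothetical string in which placeholders stand in for the Khmer name and the registration number, and `english_name` is an illustrative wrapper rather than a function from the original answer.

```python
def english_name(result_text):
    """Return the part after the first '/' and before the following '('."""
    return result_text.split('/')[1].split('(')[0].strip()


# Hypothetical result text; the real spans may be laid out differently.
sample = "KHMER NAME / AMY GEMS (00000000)"
print(english_name(sample))
```

Note that the expression raises an `IndexError` for text that contains no `/`, so it only suits results that follow this layout.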