Scraping a web page with dynamic JavaScript content using Python

from selenium.webdriver import Firefox
from bs4 import BeautifulSoup
import lxml

driver = Firefox()
url = 'https://ec.europa.eu/assets/eac/msca/jobs/import-jobs_en.htm'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')

2 answers

User

#1 · edited 2024-07-02 10:30:49

If you dig one step further, you will find that the real data is served from here: https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml. Below is an example that uses SimplifiedDoc.

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc

# Fetch the XML export directly instead of rendering the page in a browser
html = req.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml')
doc = SimplifiedDoc(html)
# Each <job-opportunity> element describes one posting
jobs = doc.selects('job-opportunity')
for job in jobs:
    print(job.select('job-id>text()'), job.select('job-title>text()'))

Result:

367020 Early-Stage Researcher (ESR) 3-year PhD position - "Efficient intra-cavity and extra-cavity generation of beams with radial and azimuthal polarization in high-power thin-disk lasers" - Project: GREAT
377512 8 Short-term Early Stage Researcher positions available through the EvoCELL ITN (single cell genomics, evo-devo and science outreach)
383978 ESR (early stage researcher) for intelligent quality control cycles in Industry 4.0 process chains enabled by machine learning
......
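
For readers who prefer to avoid a third-party parser, the same idea can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration, not the answerer's method: the sample XML below is hypothetical and merely reuses the element names seen above (job-opportunity, job-id, job-title); the real feed may nest them differently, and extract_jobs is a helper name introduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the element names used in the answer above.
# The real export may have a different root element and many more fields.
SAMPLE_XML = """<job-opportunities>
  <job-opportunity>
    <job-id>367020</job-id>
    <job-title>Early-Stage Researcher (ESR)</job-title>
  </job-opportunity>
  <job-opportunity>
    <job-id>383978</job-id>
    <job-title>ESR for intelligent quality control cycles</job-title>
  </job-opportunity>
</job-opportunities>
"""


def extract_jobs(xml_text):
    """Return a list of (job_id, job_title) tuples, one per posting."""
    root = ET.fromstring(xml_text)
    jobs = []
    for job in root.iter('job-opportunity'):
        # './/' searches descendants, since the exact nesting is an assumption.
        job_id = job.findtext('.//job-id', default='').strip()
        title = job.findtext('.//job-title', default='').strip()
        jobs.append((job_id, title))
    return jobs


for job_id, title in extract_jobs(SAMPLE_XML):
    print(job_id, title)
```

Reading both fields from the same job-opportunity element keeps each ID attached to its own title, even if some posting lacks one of the two tags.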

User

#2 · edited 2024-07-02 10:30:49

In fact, you can get the result you want with requests + BS4. All you need to do is request the API endpoint https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml together with headers.

Code

import requests
from bs4 import BeautifulSoup

headers = {
    'authority': 'euraxess.ec.europa.eu',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'accept': 'application/xml, text/xml, */*; q=0.01',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36',
    'origin': 'https://ec.europa.eu',
    'sec-fetch-site': 'same-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://ec.europa.eu/',
    'accept-language': 'en-US,en;q=0.9',
}

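response = requests.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml', headers=headers)
# print(response.text)  # uncomment to inspect the raw XML

# html.parser lowercases tag names, which is fine for these lowercase tags
soup = BeautifulSoup(response.content, 'html.parser')
job_ids = soup.find_all('job-id')
job_titles = soup.find_all('job-title')
# Pair the two lists by position (assumes every posting has both tags)
for job_id, job_title in zip(job_ids, job_titles):
    print(job_id.text, job_title.text)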

Output

383876 PhD position in the framework of HEalth data LInkage for ClinicAL benefit (Helical) project
433411 PhD Student in Biophysics/Electrophysiology
454880 15 PhD positions in Marie Sklodowska Curie ITN “Active Monitoring of Cancer As An Alternative To Surgery” (CAST)
465392 15 Marie Curie PhD Positions in ''Mobility and Training for Beyond 5G Ecosystems (MOTOR5G)''
480654 Early Stage Research Position in mmWave-based communication systems at National Instruments Dresden GmbH
....
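
The header set in this answer looks like it was copied from a browser session, and the thread does not say which of those headers the server actually requires. As a hedged sketch, here is how a request with a reduced header set could be built using the standard library's urllib.request in place of requests. The choice of User-Agent and Accept as the only headers is an assumption and may not be sufficient; build_request and fetch_xml are names introduced for this illustration.

```python
import urllib.request

FEED_URL = 'https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml'


def build_request(url=FEED_URL):
    """Build a GET request with a reduced header set (an assumption, see above)."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Accept': 'application/xml, text/xml, */*; q=0.01',
    }
    return urllib.request.Request(url, headers=headers)


def fetch_xml(url=FEED_URL, timeout=30):
    """Download the feed and return the raw bytes; raises on HTTP or network errors."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Calling fetch_xml() returns bytes that can be handed to BeautifulSoup exactly like response.content above. If the server rejects the reduced set, headers from the original answer can be added back one at a time.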
