Scraping a web page with dynamic JavaScript content using Python

Posted 2024-07-02 10:30:49

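I am trying to scrape this website with Python: https://ec.europa.eu/research/mariecurieactions/how-to/find-job_en

First, I noticed that the table I am interested in is actually served from the following URL: https://ec.europa.eu/assets/eac/msca/jobs/import-jobs_en.htm

However, requests + BS4 only returns the page's HTML source, without the table data. I assume this is because the content is loaded dynamically.

So I tried Selenium + BS4 to scrape the site, but I still only get the page source.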

from selenium.webdriver import Firefox
from bs4 import BeautifulSoup
import lxml

driver = Firefox()
url = 'https://ec.europa.eu/assets/eac/msca/jobs/import-jobs_en.htm'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')

How can I scrape the website above?


2 Answers

If you dig one step further, you will find the actual data here: https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml. Below is an example using SimplifiedDoc.

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc

# Fetch the XML export that the page loads its table from
html = req.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml')
doc = SimplifiedDoc(html)
# Each <job-opportunity> element holds one job record
jobs = doc.selects('job-opportunity')
for job in jobs:
    print(job.select('job-id>text()'), job.select('job-title>text()'))

Result:

367020 Early-Stage Researcher (ESR) 3-year PhD position - "Efficient intra-cavity and extra-cavity generation of beams with radial and azimuthal polarization in high-power thin-disk lasers" - Project: GREAT
377512 8 Short-term Early Stage Researcher positions available through the EvoCELL ITN (single cell genomics, evo-devo and science outreach)
383978 ESR (early stage researcher) for intelligent quality control cycles in Industry 4.0 process chains enabled by machine learning
......
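
The answer does not say how the XML URL was found. The usual route is the Network tab of the browser's developer tools. As a rough, hypothetical alternative, the sketch below scans a page's HTML or JavaScript source for absolute URLs ending in `.xml` or `.json`. The helper name `find_data_urls` and the regular expression are assumptions made for illustration, and the approach only helps if the feed URL appears literally in the text you pass in.

```python
import re

# Hypothetical pattern: absolute http(s) URLs that end in .xml or .json
DATA_URL_RE = re.compile(r"""https?://[^\s"'<>]+?\.(?:xml|json)\b""")


def find_data_urls(page_source):
    """Return candidate data-feed URLs in first-seen order, without duplicates."""
    found = []
    for match in DATA_URL_RE.finditer(page_source):
        url = match.group(0)
        if url not in found:
            found.append(url)
    return found
```

You could feed it `driver.page_source` or the text of any script file the page references. Every hit is only a candidate and still has to be checked by requesting it.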

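In fact, you can get the desired result with requests + BS4. All you need to do is request the API endpoint https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml and send headers along with the request.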

Code

import requests
from bs4 import BeautifulSoup

headers = {
    'authority': 'euraxess.ec.europa.eu',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'accept': 'application/xml, text/xml, */*; q=0.01',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36',
    'origin': 'https://ec.europa.eu',
    'sec-fetch-site': 'same-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://ec.europa.eu/',
    'accept-language': 'en-US,en;q=0.9',
}

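response = requests.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml', headers=headers)
# print(response.text)

soup = BeautifulSoup(response.content, 'html.parser')
# Collect every <job-id> and <job-title> element in document order
job_ids = soup.find_all('job-id')
job_titles = soup.find_all('job-title')
# Use separate loop names so the two lists are not overwritten
for job_id, job_title in zip(job_ids, job_titles):
    print(job_id.text, job_title.text)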

Output

383876 PhD position in the framework of HEalth data LInkage for ClinicAL benefit (Helical) project
433411 PhD Student in Biophysics/Electrophysiology
454880 15 PhD positions in Marie Sklodowska Curie ITN “Active Monitoring of Cancer As An Alternative To Surgery” (CAST)
465392 15 Marie Curie PhD Positions in ''Mobility and Training for Beyond 5G Ecosystems (MOTOR5G)''
480654 Early Stage Research Position in mmWave-based communication systems at National Instruments Dresden GmbH
....
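
Pairing two separate `find_all` lists with `zip` assumes that every record has both an ID and a title. If one record lacked a title, all later pairs would shift by one. A minimal sketch of a per-record alternative using the standard library's `xml.etree.ElementTree` follows. The helper name `extract_jobs` and the sample XML are made up for illustration, and it assumes `job-id` and `job-title` sit somewhere inside each `job-opportunity` element, as the first answer's selectors suggest.

```python
import xml.etree.ElementTree as ET

# Made-up sample that mimics the element names used in the answers;
# the second record deliberately has no <job-title>.
SAMPLE = """<jobs>
  <job-opportunity><job-id>1</job-id><job-title>First</job-title></job-opportunity>
  <job-opportunity><job-id>2</job-id></job-opportunity>
  <job-opportunity><job-id>3</job-id><job-title>Third</job-title></job-opportunity>
</jobs>"""


def extract_jobs(xml_text):
    """Return one (job_id, job_title) tuple per <job-opportunity> element."""
    root = ET.fromstring(xml_text)
    jobs = []
    # iter() walks the whole tree, so the nesting depth of the records does not matter
    for job in root.iter('job-opportunity'):
        # findtext() falls back to the default when the element is missing
        job_id = job.findtext('.//job-id', default='').strip()
        job_title = job.findtext('.//job-title', default='').strip()
        jobs.append((job_id, job_title))
    return jobs
```

With the code above, `extract_jobs(response.content)` should work on the downloaded feed, since `ET.fromstring` accepts bytes as well as text. On the sample, the record without a title yields an empty string instead of borrowing the next record's title.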
