A library to handle Spark job submission to YARN clusters in different environments
Detailed description of the spark-yarn-submit Python project
A Python library that can submit Spark jobs to a YARN cluster using the REST API.
Note: it currently supports CDH (5.6.1) and
HDP (2.3.2.0-2950, 2.4.0.0-169).
The library is inspired by:
github.com/bernhard-42/spark-yarn-rest-api
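For background, submitting over REST follows the standard two-step YARN ResourceManager flow: first request a new application id, then POST the application submission context. The sketch below shows the endpoints involved; these come from the public YARN REST API (with the usual default port 8088), not from spark-yarn-submit's internal code.

```python
# Sketch of the YARN ResourceManager REST endpoints used for app submission.
# Step 1: request a new application id; step 2: POST the submission context.

def new_application_url(rm_host, port=8088):
    # POSTing here returns JSON with an 'application-id' and resource limits
    return "http://%s:%d/ws/v1/cluster/apps/new-application" % (rm_host, port)

def submit_url(rm_host, port=8088):
    # POST the application submission context (JSON) here to launch the app
    return "http://%s:%d/ws/v1/cluster/apps" % (rm_host, port)

print(new_application_url('rma'))
```

The library's `env_type` parameter exists because CDH and HDP clusters differ in details such as ports and classpaths around this same flow.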
Getting started:
Using the library:
# Import the SparkJobHandler
from spark_job_handler import SparkJobHandler
...
logger = logging.getLogger('TestLocalJobSubmit')

# Create a Spark job
# job_name: name of the Spark job
# jar: location of the jar (local/hdfs)
# run_class: entry class of the application
# hadoop_rm: hadoop resource manager host ip
# hadoop_web_hdfs: hadoop web hdfs ip
# hadoop_nn: hadoop name node ip (normally same as web_hdfs)
# env_type: env type, CDH or HDP
# local_jar: flag marking the jar as local (a local jar gets uploaded to hdfs)
# spark_properties: custom properties that need to be set
sparkJob = SparkJobHandler(logger=logger, job_name="test_local_job_submit",
                           jar="./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar",
                           run_class="IrisApp", hadoop_rm='rma',
                           hadoop_web_hdfs='nn', hadoop_nn='nn',
                           env_type="CDH", local_jar=True,
                           spark_properties=None)
trackingUrl = sparkJob.run()
print "Job Tracking URL: %s" % trackingUrl
The code above launches a Spark application using the local jar
(./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar).
For more examples, see
test_spark_job_handler.py
Build the simple project:
$ cd simple-project
$ sbt package; cd ..
The step above creates the target jar: ./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar
Update the node IPs used in the
test cases:
* rm: ResourceManager
* nn: NameNode
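As a sketch, the aliases above could be mapped to concrete addresses before running the tests; the IPs below are placeholders (documentation-range examples), not real cluster values.

```python
# Placeholder host mapping for the test cases (aliases like 'rma' / 'nn').
# Replace the example addresses with your cluster's actual node IPs.
HADOOP_RM = '192.0.2.10'   # ResourceManager host (example address)
HADOOP_NN = '192.0.2.11'   # NameNode / WebHDFS host (example address)
```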
Download the data and make it available to HDFS:
$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Upload the data to HDFS:
$ python upload_to_hdfs.py <name_node_ip> iris.data /tmp/iris.data
Run the test cases:
To test remote jars, make the simple-project jar available in HDFS:
$ python upload_to_hdfs.py <name_node_ip> simple-project/target/scala-2.10/simple-project_2.10-1.0.jar /tmp/test_data/simple-project_2.10-1.0.jar
Run the tests:
$ python test_spark_job_handler.py
Utilities:
- upload_to_hdfs.py: upload a local file to the HDFS file system
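Such an upload typically uses the WebHDFS two-step CREATE flow. The sketch below is an assumption about the approach (it is not the repo's actual upload_to_hdfs.py); the endpoints and redirect behavior follow the documented WebHDFS REST API, the default NameNode HTTP port 50070 is assumed, and step 2 uses the third-party `requests` package.

```python
# Sketch: upload a local file to HDFS via the WebHDFS REST API.

def webhdfs_create_url(namenode_ip, hdfs_path, port=50070, overwrite=True):
    # Step-1 CREATE URL: no file data goes here; the NameNode answers
    # with a 307 redirect pointing at a DataNode to write to.
    return ("http://%s:%d/webhdfs/v1%s?op=CREATE&overwrite=%s"
            % (namenode_ip, port, hdfs_path, str(overwrite).lower()))

def upload(namenode_ip, local_path, hdfs_path):
    import requests  # assumed available in the environment
    # Step 1: ask the NameNode where to write; do not follow the redirect,
    # we need the Location header to send the data ourselves.
    r = requests.put(webhdfs_create_url(namenode_ip, hdfs_path),
                     allow_redirects=False)
    datanode_url = r.headers['Location']
    # Step 2: stream the file bytes to the DataNode URL.
    with open(local_path, 'rb') as f:
        requests.put(datanode_url, data=f)
```

Usage would mirror the CLI above, e.g. `upload('nn', 'iris.data', '/tmp/iris.data')`.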
Note:
The library is still at an early stage and needs testing, bug fixes and
documentation.
Before running, perform the following steps:
* Update the ResourceManager, NameNode and WebHDFS ports as needed in
settings.py
* Make the Spark assembly jar available in HDFS:
hdfs:/user/spark/share/lib/spark-assembly.jar
For contributions, please create a corresponding issue/PR.