How can I link PyCharm with PySpark?

I'm new to Apache Spark, and apparently I installed apache-spark with Homebrew on my MacBook:

Last login: Fri Jan  8 12:52:04 on console
user@MacBook-Pro-de-User-2:~$ pyspark
Python 2.7.10 (default, Jul 13 2015, 12:05:58)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/08 14:46:44 INFO SparkContext: Running Spark version 1.5.1
16/01/08 14:46:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/08 14:46:47 INFO SecurityManager: Changing view acls to: user
16/01/08 14:46:47 INFO SecurityManager: Changing modify acls to: user
16/01/08 14:46:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
16/01/08 14:46:50 INFO Slf4jLogger: Slf4jLogger started
16/01/08 14:46:50 INFO Remoting: Starting remoting
16/01/08 14:46:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.64:50199]
16/01/08 14:46:51 INFO Utils: Successfully started service 'sparkDriver' on port 50199.
16/01/08 14:46:51 INFO SparkEnv: Registering MapOutputTracker
16/01/08 14:46:51 INFO SparkEnv: Registering BlockManagerMaster
16/01/08 14:46:51 INFO DiskBlockManager: Created local directory at /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/blockmgr-769e6f91-f0e7-49f9-b45d-1b6382637c95
16/01/08 14:46:51 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
16/01/08 14:46:52 INFO HttpFileServer: HTTP File server directory is /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/spark-8e4749ea-9ae7-4137-a0e1-52e410a8e4c5/httpd-1adcd424-c8e9-4e54-a45a-a735ade00393
16/01/08 14:46:52 INFO HttpServer: Starting HTTP Server
16/01/08 14:46:52 INFO Utils: Successfully started service 'HTTP file server' on port 50200.
16/01/08 14:46:52 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/08 14:46:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/08 14:46:52 INFO SparkUI: Started SparkUI at http://192.168.1.64:4040
16/01/08 14:46:53 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/08 14:46:53 INFO Executor: Starting executor ID driver on host localhost
16/01/08 14:46:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50201.
16/01/08 14:46:53 INFO NettyBlockTransferService: Server created on 50201
16/01/08 14:46:53 INFO BlockManagerMaster: Trying to register BlockManager
16/01/08 14:46:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:50201 with 530.0 MB RAM, BlockManagerId(driver, localhost, 50201)
16/01/08 14:46:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.10 (default, Jul 13 2015 12:05:58)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

I want to start playing around in order to learn more about MLlib. However, I use PyCharm to write Python scripts. The problem is: when I go to PyCharm and try to call pyspark, PyCharm can't find the module. I tried adding the path to PyCharm as follows:

[screenshot: can't link PyCharm with Spark]

Then I tried this from a blog:

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="/Users/user/Apps/spark-1.5.2-bin-hadoop2.4"

# append pyspark to the Python path
sys.path.append("/Users/user/Apps/spark-1.5.2-bin-hadoop2.4/python/pyspark")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")

except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)

And I still can't get PySpark working in PyCharm. Is there any way to "link" PyCharm with apache-pyspark?

Update:

Then I searched for the apache-spark and python paths in order to set PyCharm's environment variables:

apache-spark path:

user@MacBook-Pro-User-2:~$ brew info apache-spark
apache-spark: stable 1.6.0, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/1.5.1 (649 files, 302.9M) *
  Poured from bottle
From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb

python path:

user@MacBook-Pro-User-2:~$ brew info python
python: stable 2.7.11 (bottled), HEAD
Interpreted, interactive, object-oriented programming language
https://www.python.org
/usr/local/Cellar/python/2.7.10_2 (4,965 files, 66.9M) *

Then, based on the information above, I tried to set the environment variables as follows:

[screenshot: configuration 1]

Any idea how to correctly link PyCharm with pyspark?

Then, when I run a python script with the above configuration, I get the following exception:

/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/spark_examples/test_1.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/spark_examples/test_1.py", line 1, in <module>
    from pyspark import SparkContext
ImportError: No module named pyspark

Update: Then I tried the configurations proposed by @zero323

Configuration 1:

/usr/local/Cellar/apache-spark/1.5.1/ 

[screenshot: conf 1]

Output:

 user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1$ ls
CHANGES.txt           NOTICE                libexec/
INSTALL_RECEIPT.json  README.md
LICENSE               bin/

Configuration 2:

/usr/local/Cellar/apache-spark/1.5.1/libexec 


Output:

user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1/libexec$ ls
R/        bin/      data/     examples/ python/
RELEASE   conf/     ec2/      lib/      sbin/

3 Answers

Here's how I solved this on Mac OS X:

  1. brew install apache-spark
  2. Add this to ~/.bash_profile

    export SPARK_VERSION=`ls /usr/local/Cellar/apache-spark/ | sort | tail -1`
    export SPARK_HOME="/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec"
    export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
    export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
    
  3. Add pyspark and py4j to the content root (using the correct Spark version):

    /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip
    /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip
    

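After restarting PyCharm so it picks up the new variables, a short smoke test like the following should run from inside the IDE (a minimal sketch, assuming the RDD API of the Spark 1.x line used above):

    # smoke test: confirm pyspark is importable and a local context starts
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName("pycharm-smoke-test")
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(10)).count())  # expect: 10
    sc.stop()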

With the PySpark package (Spark 2.2.0 and later)

With SPARK-1267 merged, you should be able to simplify the process by pip-installing Spark in the environment you use for PyCharm development.

  1. Go to File -> Settings -> Project Interpreter
  2. Click the install button and search for PySpark


  3. Click the Install Package button.
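
Once the package is installed, a quick sanity check in the same interpreter (a minimal sketch, assuming Spark 2.2+ and its SparkSession entry point):

    # sanity check after installing the pyspark package in this environment
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("pip-check").getOrCreate()
    spark.range(5).show()  # prints a one-column DataFrame with ids 0..4
    spark.stop()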

Manually, with a user-provided Spark installation

Create a run configuration:

  1. Go to Run -> Edit Configurations
  2. Add a new Python configuration
  3. Set the Script path so it points to the script you want to execute
  4. Edit the Environment variables field so it contains at least:

    • SPARK_HOME - it should point to the directory with the Spark installation. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.)
    • PYTHONPATH - it should contain $SPARK_HOME/python and optionally $SPARK_HOME/python/lib/py4j-some-version.src.zip if it is not available otherwise. some-version should match the Py4J version used by the given Spark installation (0.8.2.1 - 1.5, 0.9 - 1.6, 0.10.3 - 2.0, 0.10.4 - 2.1, 0.10.4 - 2.2, 0.10.6 - 2.3, 0.10.7 - 2.4). A script-level equivalent of these variables is sketched after this list.


  5. Apply the settings
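
If you prefer not to rely on the run configuration, the same two variables can be set at the top of the script itself, before the first pyspark import (a sketch with a hypothetical install path; adjust SPARK_HOME and the py4j version to your own installation):

    import os
    import sys

    # hypothetical install path - point this at your own Spark directory
    os.environ["SPARK_HOME"] = "/usr/local/Cellar/apache-spark/1.6.1/libexec"

    # mirror the PYTHONPATH entries described above
    sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python"))
    sys.path.insert(0, os.path.join(
        os.environ["SPARK_HOME"], "python", "lib", "py4j-0.9-src.zip"))

    from pyspark import SparkContext  # must come after the path setup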

Add the PySpark library to the interpreter path (required for code completion):

  1. Go to File -> Settings -> Project Interpreter
  2. Open the settings for the interpreter you want to use with Spark
  3. Edit the interpreter paths so they include the path to $SPARK_HOME/python (and to Py4J if required)
  4. Save the settings

Optionally

  1. 安装或添加到路径type annotations匹配已安装的Spark版本以获得更好的完成和静态错误检测(免责声明-我是项目的作者)。

Finally

Use the newly created configuration to run your script.

Here's the setup that works for me (Win7 64-bit, PyCharm 2017.3 CE).

Set up IntelliSense:

  1. Click File -> Settings -> Project: <your project> -> Project Interpreter

  2. Click the gear icon to the right of the Project Interpreter dropdown

  3. Click More... from the context menu

  4. Choose the interpreter, then click the "Show Paths" icon (bottom right)

  5. Click the + icon to add the following paths:

    \python\lib\py4j-0.9-src.zip

    \bin\python\lib\pyspark.zip

  6. Click OK, OK, OK

Go ahead and test your new IntelliSense capabilities.
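
For instance, once the zips are on the interpreter path, a snippet like this should both autocomplete and run (a minimal check, assuming the py4j-0.9-era Spark matching the paths above):

    # completion should now resolve SparkContext and its members
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName("intellisense-check")
    sc = SparkContext(conf=conf)
    print(sc.version)
    sc.stop()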
