Several ways to integrate Jupyter with Spark

Author: Lu Liang, Date: 2018-10-08

Jupyter notebook is a popular and excellent tool for data scientists. They can use it to create, verify, and share models in their favorite programming languages. Spark is great software that supports both batch and real-time data analysis in the big data area; in particular, it provides many kinds of machine learning algorithms through its "MLlib" component. An environment that integrates Jupyter and Spark is an amazing way to work in the machine learning world. Below, we introduce four ways to integrate Jupyter and Spark.

Using toree, which was previously called "Spark Kernel" (jupyter + toree)

If you plan to use toree with your jupyter server, you need to be aware of the following limitations.

  • spark (or the spark client) needs to be installed on this server.
  • spark-submit is also kicked off from this server.
# install spark
https://spark.apache.org/docs/latest/

# install toree
pip install toree

# configure toree
jupyter toree install --spark_home=your-spark-home
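Running `jupyter toree install` registers a kernelspec so that Jupyter knows how to launch Toree. The sketch below parses a kernel.json similar to the one Toree writes; the run.sh path, SPARK_HOME value, and spark options here are illustrative placeholders, not copied from a real installation:

```python
import json

# An illustrative kernel.json, modeled on what "jupyter toree install"
# registers; the paths and env values below are assumptions.
kernel_json = """
{
  "display_name": "Apache Toree - Scala",
  "language": "scala",
  "argv": ["/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh",
           "--profile", "{connection_file}"],
  "env": {"SPARK_HOME": "/opt/spark",
          "__TOREE_SPARK_OPTS__": "--master=local[2]"}
}
"""

spec = json.loads(kernel_json)
print(spec["language"])           # scala
print(spec["env"]["SPARK_HOME"])  # /opt/spark
```

The `argv` entry is what Jupyter executes to start the kernel, which is why Spark must be installed on the same server: `run.sh` ultimately calls spark-submit using the `SPARK_HOME` from `env`.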

Using sparkmagic and livy (jupyter + sparkmagic + livy)

livy is a Spark REST server. sparkmagic provides several kernels (such as pyspark, pyspark3, and sparkr) for jupyter notebook by working together with a livy server. This approach has the following features.

  • no need to install the spark client on the jupyter server side
  • the submitting of spark jobs is shifted from the jupyter server side to the livy side
# install sparkmagic
pip install sparkmagic

# Make sure that ipywidgets is properly installed.
jupyter nbextension enable --py --sys-prefix widgetsnbextension

# show where sparkmagic is installed.
pip show sparkmagic

# (Optional) install wrapper kernels
jupyter-kernelspec install sparkmagic/kernels/sparkkernel
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel
jupyter-kernelspec install sparkmagic/kernels/sparkrkernel

# (Optional) modify sparkmagic configuration file.
~/.sparkmagic/config.json

https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/example_config.json

{
  "kernel_python_credentials" : {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  },

  "kernel_scala_credentials" : {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  },

  "kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998"
  },

  "logging_config": {
    "version": 1,
    "formatters": {
      "magicsFormatter": {
        "format": "%(asctime)s\t%(levelname)s\t%(message)s",
        "datefmt": ""
      }
    },
    "handlers": {
      "magicsHandler": {
        "class": "hdijupyterutils.filehandler.MagicsFileHandler",
        "formatter": "magicsFormatter",
        "home_path": "~/.sparkmagic"
      }
    },
    "loggers": {
      "magicsLogger": {
        "handlers": ["magicsHandler"],
        "level": "DEBUG",
        "propagate": 0
      }
    }
  },

  "wait_for_idle_timeout_seconds": 15,
  "livy_session_startup_timeout_seconds": 60,

  "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",

  "ignore_ssl_errors": false,

  "session_configs": {
    "driverMemory": "1000M",
    "executorCores": 2
  },

  "use_auto_viz": true,
  "coerce_dataframe": true,
  "max_results_sql": 2500,
  "pyspark_dataframe_encoding": "utf-8",

  "heartbeat_refresh_seconds": 30,
  "livy_server_heartbeat_timeout_seconds": 0,
  "heartbeat_retry_seconds": 10,

  "server_extension_default_kernel_name": "pysparkkernel",
  "custom_headers": {},

  "retry_policy": "configurable",
  "retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
  "configurable_retry_policy_max_retries": 8
}

# (Optional) Enable the server extension so that clusters can be programmatically changed:
jupyter serverextension enable --py sparkmagic
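Under the hood, sparkmagic talks to livy over plain HTTP. As a rough sketch of what that traffic looks like, the snippet below builds (but does not send) a session-creation request against livy's `/sessions` endpoint; the host/port and resource settings are assumptions that mirror the config.json above:

```python
import json
import urllib.request

LIVY_URL = "http://localhost:8998"  # assumed livy endpoint, as in config.json

# Payload mirroring the "session_configs" block of config.json.
payload = {
    "kind": "pyspark",              # pyspark | spark | sparkr
    "driverMemory": "1000M",
    "executorCores": 2,
}

req = urllib.request.Request(
    LIVY_URL + "/sessions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Actually sending it would be urllib.request.urlopen(req) -- skipped here,
# since it requires a running livy server.
print(req.full_url, req.get_method())  # http://localhost:8998/sessions POST
```

This is why the jupyter server itself no longer needs a spark client: every session and statement becomes an HTTP call that livy translates into a spark-submit on its own side.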

Notebook magic list in sparkmagic

The integration chart for sparkmagic and livy

Using nb2kg, kernelgateway and toree (jupyter + nb2kg + kernelgateway + toree)

  • spark/spark client needs to be installed on the kernelgateway side.
  • spark-submit is run on the kernelgateway side.
  1. Install nb2kg and run notebook server

    # install nb2kg
    pip install nb2kg

    # register nb2kg
    jupyter serverextension enable --py nb2kg --sys-prefix

    # start jupyter notebook server
    export KG_URL=http://kg-host:port
    jupyter notebook \
    --NotebookApp.session_manager_class=nb2kg.managers.SessionManager \
    --NotebookApp.kernel_manager_class=nb2kg.managers.RemoteKernelManager \
    --NotebookApp.kernel_spec_manager_class=nb2kg.managers.RemoteKernelSpecManager

    # verify that nb2kg is enabled.
    jupyter serverextension list

    nb2kg enabled
    - Validating...
    nb2kg OK

    # uninstall nb2kg
    jupyter serverextension disable --py nb2kg --sys-prefix
    pip uninstall -y nb2kg
  2. Install kernel gateway

# install from pypi
pip install jupyter_kernel_gateway

# show all config options
jupyter kernelgateway --help-all

# run it with default options
jupyter kernelgateway
  3. Install toree on the server where kernelgateway is installed.
    For the detailed steps, please refer to the previous section.
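Once the kernel gateway is running, it exposes the same REST API that a notebook server does, and that API is what nb2kg proxies to. A minimal sketch of composing the gateway endpoints (the host and port are assumptions, matching the KG_URL example above):

```python
# Compose the Kernel Gateway endpoints that nb2kg talks to; kg-host:8888
# is an assumed address (analogous to the KG_URL environment variable).
KG_URL = "http://kg-host:8888"

def kg_endpoint(resource: str) -> str:
    """Return the REST endpoint for a Kernel Gateway resource."""
    return f"{KG_URL}/api/{resource}"

print(kg_endpoint("kernelspecs"))  # list available kernels (e.g. toree)
print(kg_endpoint("kernels"))      # create or list running kernels
```

nb2kg's RemoteKernelSpecManager and RemoteKernelManager forward kernelspec and kernel requests to these endpoints, so notebooks behave as if the kernels were local.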

Using nb2kg, kernelgateway, sparkmagic and livy (jupyter + nb2kg + kernelgateway + sparkmagic + livy)

  • no need to install spark/spark client, since spark is called via livy.
  • the submitting of spark jobs is shifted from the kernelgateway side to the livy side.

For the detailed steps, please refer to the sections above.

A step-by-step example of using toree in jupyter.

  1. Install spark
export SPARK_HOME=/Users/luliang/Tools/spark-2.1.1-bin-hadoop2.7
  2. Install anaconda, which includes jupyter and python.
# install anaconda
Download the installer from the [anaconda website](https://www.anaconda.com/download/) based on your OS and install it.

# List current available kernel
jupyter kernelspec list

# Start notebook
jupyter notebook
  3. List all configuration paths for jupyter.
jupyter --paths
  4. Install toree.
pip install toree
jupyter toree install --spark_home=your-spark-home

# Other examples to install toree kernel with different options:

jupyter toree install --spark_home=/usr/hdp/current/spark2-client --spark_opts='--master=spark://hdpn.xxx.xxx.com:7077' --kernel_name=hdpn_standalone

jupyter toree install --spark_home=/spark/home/dir
jupyter toree install --spark_opts='--master=local[4] --executor-memory=3G'
jupyter toree install --kernel_name=toree_special
jupyter toree install --toree_opts='--nosparkcontext'
jupyter toree install --interpreters=PySpark,SQL
jupyter toree install --python=python
jupyter toree install --help-all
  5. List currently running notebook servers.
jupyter notebook list
  6. Generate a configuration file under ~/.jupyter

    # Writing default config to: /Users/luliang/.jupyter/jupyter_notebook_config.py
    jupyter-notebook --generate-config
  7. (Optional) Generate a password and set the password for the notebook

Generate the password: run ipython in a terminal
In [1]: from IPython.lib import passwd
In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:6402ac25a515:2755b924b8bb5bef2475f7918776197e2f972858'

Configure the parameters:
Edit /root/.jupyter/jupyter_notebook_config.py
c.NotebookApp.ip = '*' # the address the server listens on; set to '*' so other machines on the same network segment can access it
c.NotebookApp.open_browser = False # do not open a browser automatically when starting the notebook
c.NotebookApp.password = 'sha1:6402ac25a515:2755b924b8bb5bef2475f7918776197e2f972858' # the notebook login password
c.NotebookApp.port = 6666 # the access port; if it is taken, the port is incremented by 1 on each new notebook start
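The `sha1:salt:digest` string returned by `passwd()` can be reproduced and verified with the standard library alone. The sketch below follows the salted-SHA-1 scheme that, to my understanding, `IPython.lib.passwd` uses (digest of the passphrase concatenated with the salt); treat the exact format as an assumption:

```python
import hashlib

def passwd(passphrase: str, salt: str, algorithm: str = "sha1") -> str:
    """Hash a passphrase into the algorithm:salt:digest form."""
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return ":".join((algorithm, salt, h.hexdigest()))

def verify(passphrase: str, hashed: str) -> bool:
    """Check a passphrase against a stored algorithm:salt:digest string."""
    algorithm, salt, _ = hashed.split(":")
    return passwd(passphrase, salt, algorithm) == hashed

# "6402ac25a515" reuses the salt from the example hash above.
stored = passwd("secret", "6402ac25a515")
print(verify("secret", stored))  # True
print(verify("wrong", stored))   # False
```

This is handy for checking, outside the notebook, whether the hash pasted into `c.NotebookApp.password` matches the passphrase you intended.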
  8. (Optional) Configure the notebook with HTTPS
# A self-signed certificate can be generated with openssl.
# This certificate is valid for 365 days with both the key and certificate data written to the same file.

$ openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mykey.key -out mycert.pem

# Edit the jupyter_notebook_config.py file in ~/.jupyter to use the certificate.

c.NotebookApp.certfile = u'/absolute/path/to/your/certificate/mycert.pem'
c.NotebookApp.keyfile = u'/absolute/path/to/your/certificate/mykey.key'
  9. Now you can access Jupyter at this URL: https://hdte.xxx.xxx.com:6666/

To avoid confusion, you can refer to this name-change list.


  • Spark Kernel (old) ==> Toree (new)
  • IPython Notebook (old) ==> Jupyter Notebook (new)

References

Two examples of creating a custom kernel for jupyter notebook.