Using Spark For Data Exploration

Spark is actively maintained by the Apache open source community, and it is used in production by many well-known companies.

In this blog, the focus is on productionizing Apache Spark. I will discuss the use cases of Spark and how to enable each of them in a production environment.

Currently, Spark has 2 deployment modes (client and cluster) and 3 supported cluster managers:
1- Standalone
2- YARN
3- Apache Mesos
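
To make the combination of deployment mode and cluster manager concrete, here is a minimal sketch that builds a `spark-submit` command line for each case. The helper `build_spark_submit`, the host names and the application file are hypothetical; only the `--master` and `--deploy-mode` flags and the master URL formats come from Spark itself. Older Spark releases also accepted `yarn-client` and `yarn-cluster` as master values, and not every combination is supported for every language, so treat this as an illustration rather than a recipe.

```python
from typing import List, Optional

# Master URL formats understood by spark-submit for each cluster manager.
# The host names and ports are placeholders, not values from this post.
MASTER_URLS = {
    "standalone": "spark://master-host:7077",
    "yarn": "yarn",
    "mesos": "mesos://mesos-host:5050",
}


def build_spark_submit(app: str, manager: str, deploy_mode: str = "client",
                       app_args: Optional[List[str]] = None) -> List[str]:
    """Return the spark-submit argument list for one manager/mode pair."""
    if manager not in MASTER_URLS:
        raise ValueError("unknown cluster manager: " + manager)
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy mode must be 'client' or 'cluster'")
    command = [
        "spark-submit",
        "--master", MASTER_URLS[manager],
        "--deploy-mode", deploy_mode,
        app,
    ]
    # Anything after the application file is passed to the application itself.
    return command + list(app_args or [])


# Print one command per cluster manager, all in client mode.
for name in MASTER_URLS:
    print(" ".join(build_spark_submit("my_job.py", name)))
```

In client mode the driver runs on the machine that launches the command, while in cluster mode it runs inside the cluster, which is why the same application can be started either way just by changing one flag.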

Many resources already discuss these deployment options in detail, so I will skip the differences and assume you are already familiar with them.

For more information, you can start with the Spark documentation and also check this post discussing Spark on YARN and Standalone in detail.

One of the main advantages of Spark is its ability to bridge the gap between production languages (like C++ and Java) and data exploration tools (like R and Python). That makes Spark a suitable tool to share among different types of people inside the same organization.

“Spark is better at being an operational system than most exploratory systems and better for data exploration than the technologies commonly used in operational systems” (Advanced Analytics with Spark).

In practice, our main challenge is usually to allow these different people (data scientists and engineers) to access Spark and submit jobs to it. Let’s start by defining the possible use cases of Spark.

The main Spark use cases are:

1- Data Exploration Scenarios.
2- Production Data Pipelines (like Cron Jobs, ETL Jobs or Machine Learning Services)

The first question here is: how does Spark enable each use case natively, and what are the alternatives?

1- Data exploration Use Case:

1-1- Spark Shell

The Spark shell is a console tool where you interactively receive results with each Spark action (like .saveAsTextFile() or .count()). Spark starts the context, submits the jobs and returns the result inside your shell.
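
As a small illustration of what an action means in the shell, the sketch below counts matching lines in a list. It assumes it is pasted into the `pyspark` shell, where the SparkContext `sc` already exists; the function name `count_matching_lines` and the sample log lines are made up for this example. The `filter` call only describes the work, and nothing runs on the cluster until the `count` action is called.

```python
def count_matching_lines(sc, lines, keyword):
    """Count the lines containing `keyword`, using an existing SparkContext."""
    # Distribute the local list as an RDD.
    rdd = sc.parallelize(lines)
    # Transformation: lazy, no Spark job is submitted yet.
    matches = rdd.filter(lambda line: keyword in line)
    # Action: Spark submits the job and returns the number to the shell.
    return matches.count()


# Inside the pyspark shell, where `sc` is already defined:
# count_matching_lines(sc, ["INFO start", "ERROR disk full", "ERROR timeout"], "ERROR")
```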

| PROS | CONS |
| --- | --- |
| Easy to use. No configuration needed. | Lack of visualizations. |
| Support for different languages (Scala, Python and R). | No way to save your work. |
| Independent of the deployment mode. | |

1-2- Spark Notebooks

Inspired by the IPython Notebook, several Spark notebooks have been developed. None of them is mature enough to be clearly superior, so you may need to define your requirements and then choose from the available options.

Notebooks in general provide features such as:

1- Support for different languages including Markdown, Python and others.
2- The ability to save your work, re-run and share it easily.
3- Different visualization capabilities.
4- Collaboration between different users.

A- Cloudera’s Hue Spark Notebook

Hue Spark Notebook is promising. You can go for it, especially if you are already using Cloudera’s CDH or Hue, but be aware that behind the scenes it mainly depends on Spark on YARN.

Hue Spark Notebook Demo: "Hadoop Tutorial: the new beta Notebook app for Spark & SQL", from the Hue Team on Vimeo.

Hue is not new. It offers more than just a Spark notebook and integrates well with the Hadoop ecosystem components (Oozie, Solr, Impala, HBase, Pig...). The notebook allows using Scala, Python and R interactively on Spark. Hue Spark Notebook depends on Cloudera’s Livy Job Server for submitting batch and interactive jobs.
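
To give an idea of what submitting interactive and batch jobs through Livy looks like, here is a rough sketch that only builds the HTTP requests without sending them. It is based on my understanding of Livy's REST API (`POST /sessions`, `POST /sessions/{id}/statements` and `POST /batches`), so check the endpoints and payload fields against the Livy version you deploy. The host name, the helper functions and the file path are hypothetical.

```python
import json
from urllib.request import Request

# Placeholder address; 8998 is the port Livy listens on by default.
LIVY_URL = "http://livy-host:8998"


def _post(path, payload):
    """Build (but do not send) a JSON POST request to the Livy server."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        LIVY_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(kind="pyspark"):
    """Interactive use: ask Livy to start a session of the given kind."""
    return _post("/sessions", {"kind": kind})


def run_statement(session_id, code):
    """Interactive use: run a snippet of code inside an existing session."""
    return _post("/sessions/{}/statements".format(session_id), {"code": code})


def submit_batch(file_path, args=None):
    """Batch use: submit an application file that the cluster can read."""
    return _post("/batches", {"file": file_path, "args": list(args or [])})


request = run_statement(0, "sc.parallelize(range(100)).count()")
print(request.get_method(), request.full_url)
```

Passing one of these objects to `urllib.request.urlopen` would actually send it. A notebook front end does essentially this on your behalf each time you run a cell.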

B- Jupyter Notebook

The well-known Jupyter notebook now supports Spark via a kernel developed by IBM. It supports Scala, Python and R on Spark. You can try it from here!
Another kernel for Jupyter is also being developed by Microsoft, based on Cloudera’s Livy, under the name sparkmagic.

C- Apache Zeppelin Notebook

Similar to Jupyter, Zeppelin provides a notebook for Spark and supports using Scala and Python. Zeppelin is an Apache Incubator project. The main difference is that Jupyter is built on Python while Zeppelin is built on the JVM.

Zeppelin stands out for its visualization support with Scala.

D- Andy Petrella’s Spark Notebook

Another available notebook is provided by Andy Petrella with support from Lightbend (formerly Typesafe) and Data Fellas. It is based on the Akka framework, mainly targets Scala users and has rich visualization support.

Comparison:

| Jupyter Notebook | Apache Zeppelin | Andy’s Notebook |
| --- | --- | --- |
| Connects to Spark via separate modules called kernels | Spark comes packaged with Zeppelin, although using kernels is possible too | Spark comes packaged with the notebook |
| Scala, Python and R | Scala, Python and R | Scala |
| One context per notebook | Only one context shared across all notebooks | One context per notebook |
| Built on Python | Built on JVM | Built on JVM |
| Can be shared and viewed via a web viewer | Can be shared and viewed via a web viewer | Not available |

There are many tools available in this domain and their feature sets are changing every day. It is hard to keep up, of course, but this was our attempt to explore the existing alternatives. If you have had the opportunity to use any of these tools, please don’t hesitate to share your experience with me.

In this post we focused on the first use case of Spark and discussed the available alternatives for data exploration. A second post will be available soon to cover the second use case in detail.
