Spark for beginners: Introduction


What is Spark?

Apache Spark is an open source framework for Big Data processing, built to perform sophisticated analysis and designed for speed and ease of use. It was originally developed in 2009 at UC Berkeley's AMPLab and became an open source Apache project in 2010.

Spark has several advantages compared with other big data technologies such as Hadoop MapReduce and Storm. First, Spark offers a complete, unified framework for Big Data processing needs across datasets that are diverse in nature (text, graph, etc.) as well as in source (batch or real-time streaming). Then, Spark can run applications on Hadoop clusters up to 100 times faster in memory, and 10 times faster on disk. It lets you quickly write applications in Java, Python or Scala and provides a set of more than 80 high-level operators. Moreover, it can be used interactively to query data from a shell.

Finally, in addition to Map and Reduce operations, Spark supports SQL queries, streaming data, machine learning and graph processing. Developers can use these capabilities on their own or combine them in a single complex processing pipeline.

This first article in the series provides an overview of what Spark is, the suite of tools it offers for Big Data processing, and how Spark is positioned relative to conventional MapReduce solutions.

Hadoop and Spark

Hadoop has been established as a data processing technology for 10 years and has proven to be the solution of choice for processing large volumes of data. MapReduce is a very good solution for single-pass computations, but it is not the most effective for use cases that require multi-pass algorithms. Since each step of a processing workflow consists of a Map phase and a Reduce phase, every use case must be expressed as a sequence of MapReduce patterns to take advantage of this solution. The output data of each step must be stored on the distributed file system before the next step can begin. This approach tends to be slow because of replication and disk storage.
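The MapReduce pattern described above can be sketched in plain Python. This is a toy, single-machine illustration (no Hadoop involved); the sample lines are made up:

```python
from collections import defaultdict
from functools import reduce

# Toy word count in the MapReduce style: a map phase emits (key, value)
# pairs, a shuffle groups them by key, and a reduce phase aggregates
# each group. In Hadoop, each such step is a separate job whose output
# is written to HDFS before the next step starts.

def map_phase(line):
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

lines = ["spark and hadoop", "spark and mapreduce"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
```

A multi-pass algorithm would chain several such map/shuffle/reduce rounds, writing intermediate results to disk between each round, which is exactly where the latency comes from.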

In addition, Hadoop solutions are usually based on clusters that are difficult to set up and administer. They also require the integration of several tools for different big data use cases (such as Mahout for machine learning and Storm for stream processing).

If you want to set up something more complex, you have to string together a series of MapReduce jobs and execute them sequentially; each of these jobs has high latency, and none can begin until the previous one has completely finished.

Spark allows developers to build complex, multi-step data processing pipelines based on directed acyclic graphs (DAGs). Spark supports sharing data in memory across DAGs, so that different jobs can work with the same dataset. Spark runs on top of the Hadoop Distributed File System (HDFS) infrastructure and offers additional features. It is possible to deploy Spark applications on an existing Hadoop v1 cluster (with SIMR, Spark-In-MapReduce), on a Hadoop v2 YARN cluster, or even on Apache Mesos. Rather than seeing Spark as a replacement for Hadoop, it is more correct to see it as an alternative to Hadoop MapReduce. Spark was not intended to replace Hadoop, but to provide a complete and unified solution to support different use cases and needs in the context of big data processing.

The features of Spark

Spark improves on MapReduce through less costly shuffle steps. With in-memory storage and near-real-time processing, performance can be several times faster than with other big data technologies. Spark also supports lazy evaluation of queries, which helps to optimize the processing steps. It offers a high-level API for better productivity and a consistent architectural model for big data solutions.

Spark holds intermediate results in memory rather than writing them to disk, which is very useful when you need to work on the same dataset repeatedly. The runtime is designed to work both in memory and on disk: operators spill to external storage when the data does not fit in memory, which makes it possible to process datasets larger than the aggregate memory of a cluster. Spark tries to store as much as possible in memory before spilling to disk, and it can work with part of the data in memory and the rest on disk.
You need to examine your data and use cases to assess the memory requirements: depending on how much of the work is done in memory, Spark can deliver significant performance benefits. Other features offered by Spark include:

  • Functions beyond Map and Reduce
  • Optimization of arbitrary operator graphs
  • Lazy evaluation of queries, which helps to optimize the overall processing workflow
  • Concise and consistent APIs in Scala, Java and Python
  • An interactive shell for Scala and Python (not yet available in Java)

Spark is written in Scala and runs on the Java Virtual Machine (JVM). The languages currently supported for application development are:

  • Scala
  • Java
  • Python
  • Clojure
  • R

The Spark ecosystem

Alongside the main Spark API, the ecosystem contains additional libraries for big data analysis and machine learning. These libraries include:

  • Spark Streaming : Spark Streaming can be used for processing real-time streaming data. It is based on a micro-batch processing model and represents real-time data as a DStream, that is to say a series of RDDs (Resilient Distributed Datasets).
  • Spark SQL : Spark SQL lets you expose Spark datasets over a JDBC API and run SQL-like queries on them using traditional BI and visualization tools. Spark SQL can be used to extract, transform and load data in different formats (JSON, Parquet, databases) and expose it for ad-hoc queries.
  • Spark MLlib : MLlib is a machine learning library that contains the classic learning algorithms and utilities, such as classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as the underlying optimization primitives.
  • Spark GraphX : GraphX is the new API (in alpha) for graph processing and graph-parallel computation. GraphX extends the Spark RDD by introducing the Resilient Distributed Property Graph, a directed multi-graph with properties attached to vertices and edges. To support graph computation, GraphX exposes a set of fundamental operators (e.g. subgraph, joinVertices, aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.

In addition to these libraries, there are BlinkDB and Tachyon. BlinkDB is an approximate query engine that can be used to execute interactive SQL queries on large volumes of data. It allows the user to trade accuracy for response time: it works by running queries on samples of the data and presents results annotated with meaningful error indicators. Tachyon is a distributed file system for sharing files reliably at memory speed across cluster frameworks such as MapReduce and Spark. It caches frequently read files in memory, avoiding disk access and repeated loading, which allows different frameworks, jobs and queries to access cached files quickly.

There are also adapters for integration with other products, such as Cassandra (Spark Cassandra Connector) and R (SparkR). With the Cassandra Connector, you can use Spark to access data stored in a Cassandra database and perform analytics on that data.

The following diagram (Figure 1) shows the relationships between these different libraries of the ecosystem.

We will explore these libraries in future articles in this series.

The Spark architecture

The Spark architecture includes three main components:

  • Data storage
  • API
  • Resource management framework

Let’s look at each of these components in more detail.

Data storage:

Spark uses the HDFS file system for data storage. It can work with any Hadoop-compatible data source, including HDFS, HBase, Cassandra, etc.


API

The API enables developers to create Spark applications using a standard API, available in Scala, Java and Python.

Resource management

Spark can be deployed as a standalone server or on a distributed computing framework such as Mesos or YARN. Figure 2 illustrates the components of the Spark architecture.

How to install Spark

There are several ways to install and use Spark. You can install it on your machine as a standalone framework, or use one of the Spark virtual machine images available from vendors such as Cloudera, Hortonworks and MapR. You can also use Spark already installed and configured in the cloud (such as Databricks Cloud).

How to run Spark?

Once Spark is installed, whether on a local machine or in the cloud, there are several modes of connecting to the Spark engine, selected through the master URL parameter: local runs Spark locally with a single worker thread, local[K] runs it locally with K worker threads, spark://HOST:PORT connects to a standalone Spark cluster, and mesos://HOST:PORT connects to a Mesos cluster.


How to interact with Spark?

Once Spark is installed and running, you can connect to it using the Spark shell to perform interactive data analysis. The Spark shell is available in Scala and Python; Java does not yet support an interactive shell, so this feature is not yet available in Java. You can use the spark-shell.cmd and pyspark.cmd commands to launch the shell in Scala and Python respectively.

The Web Console

When Spark is running, regardless of the execution mode, you can view the results of your jobs and other statistics in the web console, at the following URL: http://localhost:4040
Figure 3 shows the Spark console, with tabs for stages, storage, environment and executors.