Welcome. In this tutorial we will set up the Spark environment on Windows 10 and run a few tests with Apache Spark to see what this framework offers and learn how to use it. The code for this lab is written in Java and in Scala, which for what we will do is much lighter than Java. Do not worry if you do not know Scala: we will use only its very simple features, and basic knowledge of functional languages is all you need. If that’s not enough, Google is your friend.
- You will need the Java Development Kit (JDK) for Spark to work locally. This is the first step described below.
- Install Scala (Optional)
Note: These instructions apply to Windows. If you use a different operating system, you will have to adapt the system variables and the directory paths to your environment.
Installing the JDK
Download the JDK from Oracle’s site. Version 1.8 is recommended; Spark 2.0.1 also runs on 1.7, which is the version used in the paths below. Check the installation from the bin directory under the JDK directory by typing
java -version. If the installation is correct, this command should display the version of Java installed.
Add the JAVA_HOME variable to the system environment variables with the value: C:\Program Files\Java\jdk1.7.x
Add to the system PATH variable the value: C:\Program Files\Java\jdk1.7.x\bin
Test from the command prompt (cmd) by running java -version.
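As a complement to the command-prompt test, the two variables can also be checked from Java itself. The following is only a minimal sketch: the class name EnvCheck and the helper pathContains are hypothetical names introduced for illustration, and the check assumes a Windows-style PATH whose entries are separated by semicolons.

```java
class EnvCheck {
    // Removes one trailing backslash so "C:\x\bin\" and "C:\x\bin" compare equal.
    static String stripTrailingSlash(String s) {
        return s.endsWith("\\") ? s.substring(0, s.length() - 1) : s;
    }

    // Returns true if dir is one of the ';'-separated entries of a Windows PATH value.
    static boolean pathContains(String pathValue, String dir) {
        if (pathValue == null || dir == null) {
            return false;
        }
        String wanted = stripTrailingSlash(dir.trim());
        for (String entry : pathValue.split(";")) {
            // Windows paths are case-insensitive, so compare without case.
            if (stripTrailingSlash(entry.trim()).equalsIgnoreCase(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String javaHome = System.getenv("JAVA_HOME");
        String path = System.getenv("PATH");
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("JAVA_HOME    = " + javaHome);
        boolean onPath = javaHome != null && pathContains(path, javaHome + "\\bin");
        System.out.println("JAVA_HOME\\bin is on PATH: " + onPath);
    }
}
```

Save it as EnvCheck.java, compile it with javac EnvCheck.java and run it with java EnvCheck from a newly opened command prompt, so that the updated variables are picked up.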
Installing Scala (optional)
Download Scala from the link: http://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.msi
Define the following system environment variables:
- Variable: SCALA_HOME Value: C:\Program Files (x86)\scala
- Variable: system PATH Value: C:\Program Files (x86)\scala\bin
Test from the command prompt by typing scala.
Installing Spark
Download the latest version from the Spark website. The most recent version at the time of this writing is 2.0.1. You can also select a package built for a specific version of Hadoop. I downloaded Spark for Hadoop 2.7, and the file name is
spark-2.0.1-bin-hadoop2.7.tgz. Extract the archive to a local directory, such as D:\spark.
Then download the Windows utilities (winutils.exe) from the GitHub repo https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1 and copy winutils.exe into D:\spark\spark-2.0.1-bin-hadoop2.7\bin. Then add these system environment variables:
- Variable: HADOOP_HOME Value: D:\spark\spark-2.0.1-bin-hadoop2.7
- Variable: SPARK_HOME Value: D:\spark\spark-2.0.1-bin-hadoop2.7
- Variable: system PATH Value: D:\spark\spark-2.0.1-bin-hadoop2.7\bin
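On Windows, Hadoop's libraries look for winutils.exe in the bin directory under HADOOP_HOME, so it is worth confirming that the file is where the variable points. The sketch below is an illustration only: the class name SparkEnvCheck and the helper winutilsPath are hypothetical, and the program simply reports what it finds without changing anything.

```java
import java.io.File;

class SparkEnvCheck {
    // Builds the expected location of winutils.exe: <HADOOP_HOME>\bin\winutils.exe
    static File winutilsPath(String hadoopHome) {
        return new File(new File(hadoopHome, "bin"), "winutils.exe");
    }

    public static void main(String[] args) {
        String hadoopHome = System.getenv("HADOOP_HOME");
        String sparkHome = System.getenv("SPARK_HOME");
        System.out.println("HADOOP_HOME = " + hadoopHome);
        System.out.println("SPARK_HOME  = " + sparkHome);
        if (hadoopHome == null) {
            System.out.println("HADOOP_HOME is not set");
            return;
        }
        // Report whether the file exists at the expected location.
        File winutils = winutilsPath(hadoopHome);
        System.out.println(winutils + (winutils.isFile() ? " found" : " is missing"));
    }
}
```

If the program reports the file as missing, compare the HADOOP_HOME value with the directory into which winutils.exe was actually copied.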
To verify the installation of Spark, open a command prompt in the Spark directory and start the shell with the spark-shell command (bin\spark-shell).
Once this step is done, you can exit the shell by typing :quit.
First Spark application ‘Hello World’
Scala application via the shell
Once Spark is installed and running, you can run analysis queries with the API. Simple commands are available to read data from a text file and process it. We will look at more advanced use cases in future articles in the series. Let’s start by using the API to run the well-known word count example.
Move to the directory that contains your file, e.g.:
Open a Scala shell, then run the following Scala commands:
The cache function is called to keep the created RDD in the cache, so that Spark does not have to recompute it on each subsequent request. Note that cache() is a lazy operation: Spark does not store the data in memory immediately; this happens when an action is first invoked on the RDD. Now we can call the count function to see how many lines are present in the text file: txtData.count()
The following commands perform the word count and display the count next to each word found in the file.
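For readers following the Java track, the logic of a word count can be sketched in plain Java, without Spark. This is not the Spark API and not the shell commands of this tutorial: it only mirrors the usual three steps (split each line into words, pair each word with 1, sum the values per word) on an in-memory list. The class name WordCountSketch and the sample lines are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // Counts how often each whitespace-separated word occurs across all lines.
    static Map<String, Integer> countWords(List<String> lines) {
        // TreeMap keeps the words sorted, which makes the output easy to read.
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            // Step 1: split the line into words.
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // a blank line yields one empty token; skip it
                }
                // Steps 2 and 3: start each word at 1, then add 1 per occurrence.
                Integer current = counts.get(word);
                counts.put(word, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not", "to be");
        for (Map.Entry<String, Integer> entry : countWords(lines).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

In Spark the same steps run on an RDD and are distributed across the cluster, whereas this sketch processes everything in a single loop on one machine.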
Other examples of using the API can be found on the Spark website, in the documentation.
Java application with Eclipse and Maven
To use the Spark API in Java, we’ll use the Eclipse IDE with Maven. Start by downloading Apache Maven 3.3.9 from http://apache.mivzakim.net/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.zip and extract it, e.g. into D:\apache-maven-3.3.9
Then add the following environment variables:
- MAVEN_HOME value: D:\apache-maven-3.3.9
- Path value: D:\apache-maven-3.3.9\bin
Check from the command prompt by running mvn -version.
Let’s move on to installing the Eclipse IDE. We’ll use a light version of Eclipse Luna and then add Maven to Eclipse.
Download Eclipse from https://eclipse.org/downloads/packages/eclipse-ide-java-developers/lunasr2
Extract the content of the archive into a directory and start Eclipse.
Next, integrate Maven into Eclipse as shown in the screenshots:
M2e plugin configuration
Make Eclipse use your own Maven installation by clicking Add and choosing your Maven directory.
Now that you have completed the installation steps, we’ll create our first Spark project in Java.
Open Eclipse and go to File => New Project => Maven Project; see below.
Now add External JARs from the location D:\spark\spark-2.0.1-bin-hadoop2.7\jars (Spark 2.x ships its libraries in the jars directory rather than lib); see below.
Edit pom.xml. Paste the following code:
Write your code or simply paste the following code:
Build the project: from the command prompt, go to the directory where the project is stored and run mvn package.
Test your application with the following spark-submit command:
spark-submit --class sparkWCexample.spWCexample.JavaWordCount --master local F:\workspace\spWCexample\target\spWCexample-1.0-SNAPSHOT.jar
In this article, we saw how the Apache Spark framework, with its standard API, helps us process and analyze data. Spark can use the same file storage system as Hadoop, so it is possible to use Spark and Hadoop together where significant investments have already been made in Hadoop.
You can also combine different types of processing with Spark: Spark SQL, Spark Machine Learning and Spark Streaming, as we shall see in future articles. With Spark’s various integration modes and adapters, you can combine Spark with other technologies. For example, you can use Spark, Kafka and Apache Cassandra together: Kafka for streaming incoming data, Spark for the processing and the Cassandra NoSQL database for storing results.
However, keep in mind that the Spark ecosystem is not yet fully mature and that it needs improvement in some areas, such as security and integration with BI tools.