Hello everyone, I’m excited to start a new series dedicated to a Big Data project.
Big data is an evolving term that describes any voluminous amount of structured, semi-structured, and unstructured data that has the potential to be mined for information.
Big data is often characterized by the 3Vs: volume (the sheer amount of data generated and stored), velocity (the speed at which data is produced and must be processed), and variety (the range of data types and formats, from structured tables to free text).
In this tutorial, we’ll do a simple sentiment analysis of tweets with Spark SQL on a JSON file. The exercise uses Java to retrieve the stream of tweets and Scala for the Spark SQL scripts. You will find the GitHub repo link in the tutorial.
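To give an idea of the sentiment-analysis step before any Spark wiring, here is a minimal sketch of a naive word-list scorer: the kind of pure function you could later register as a Spark SQL UDF and apply to each tweet’s text column. The class name and the word lists are placeholders, not part of the actual project.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical naive sentiment scorer (word lists are illustrative only).
public class SentimentScorer {
    private static final Set<String> POSITIVE = new HashSet<>(
            Arrays.asList("good", "great", "happy", "love"));
    private static final Set<String> NEGATIVE = new HashSet<>(
            Arrays.asList("bad", "sad", "hate", "awful"));

    // Returns (#positive words - #negative words) for a tweet's text.
    public static int score(String text) {
        int score = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(word)) score++;
            if (NEGATIVE.contains(word)) score--;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("I love Spark, it is great")); // → 2
        System.out.println(score("bad day, sad and awful"));    // → -3
    }
}
```

A real analysis would use a proper model or lexicon, but a scorer of this shape is enough to label tweets as positive or negative in a first pass.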
The diagram above illustrates the architecture of our application.
In this chapter, we will walk you through using Spark Streaming to process live tweet streams. Remember, Spark Streaming is a component of Spark that provides highly scalable, fault-tolerant stream processing. These exercises are designed as standalone Java programs that receive and process Twitter’s real-time sample tweet stream. You will find the GitHub Gist links in the tutorial.
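As a rough sketch of what such a standalone program looks like, the following assumes `spark-streaming` and the `spark-streaming-twitter` connector on the classpath, with Twitter OAuth credentials configured (e.g. via `twitter4j` system properties); it is illustrative, not the tutorial’s exact code.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.twitter.TwitterUtils;
import twitter4j.Status;

public class TweetStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("TweetStream").setMaster("local[2]");
        // Micro-batches of 2 seconds.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));

        // Receive Twitter's public sample stream.
        JavaDStream<Status> tweets = TwitterUtils.createStream(jssc);

        // Keep only the tweet text and print a few items of each batch.
        JavaDStream<String> texts = tweets.map(Status::getText);
        texts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```

Note the `local[2]` master: a receiver occupies one thread, so at least two are needed locally for the batches to actually be processed.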
Welcome! In this tutorial we will discover how to connect Spark to a Cassandra database using the Java language. You will find the GitHub Gist links in the tutorial.
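A minimal sketch of that connection, assuming the DataStax `spark-cassandra-connector` on the classpath and a local Cassandra node; the keyspace and table names (`tweets_ks`, `tweets`) are placeholders:

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCassandra {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("SparkCassandra")
                .setMaster("local[*]")
                // Point the connector at the Cassandra node.
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the whole table as an RDD of generic Cassandra rows.
        JavaRDD<CassandraRow> rows =
                javaFunctions(sc).cassandraTable("tweets_ks", "tweets");
        System.out.println("rows in table: " + rows.count());

        sc.stop();
    }
}
```

The connector exposes Cassandra tables as regular RDDs, so everything else (filters, maps, joins) is plain Spark code.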
Welcome! In this tutorial we will discover how to create a RESTful API with MongoDB as the NoSQL database, using the Java language. At the end of this tutorial you will be able to create your own API that interacts with a NoSQL database (MongoDB). You will find the GitHub repo links at the end of the tutorial.
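To make the idea concrete, here is a minimal sketch of one such endpoint, assuming the MongoDB Java driver (`mongodb-driver-sync`) on the classpath, a MongoDB instance on `localhost:27017`, and using the JDK’s built-in `HttpServer` instead of a full web framework; the database and collection names are placeholders.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.sun.net.httpserver.HttpServer;
import org.bson.Document;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class RestApi {
    public static void main(String[] args) throws Exception {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> tweets =
                client.getDatabase("bigdata").getCollection("tweets");

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // GET /tweets returns every document of the collection as a JSON array.
        server.createContext("/tweets", exchange -> {
            StringBuilder body = new StringBuilder("[");
            for (Document doc : tweets.find()) {
                if (body.length() > 1) body.append(",");
                body.append(doc.toJson());
            }
            body.append("]");
            byte[] bytes = body.toString().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(bytes);
            }
        });
        server.start();
    }
}
```

A real API would add POST/PUT/DELETE handlers, query filters, and pagination on top of the same driver calls.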
Welcome! In this tutorial we will discover the Spark environment, install it under Windows 10, and run some tests with Apache Spark to see what this framework offers and learn how to use it. The code for this lab will be written in Java and Scala, which for our purposes is much lighter than Java. Do not worry if you do not know Scala: we will only use very simple features of the language, and basic knowledge of functional languages is all you need. If that’s not enough, Google is your friend.
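A typical first test once the installation works is the classic word count. The sketch below assumes `spark-core` on the classpath and a local text file named `input.txt` (a placeholder path):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Split each line into words, pair each word with 1, then sum per word.
        JavaRDD<String> lines = sc.textFile("input.txt");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.take(10).forEach(System.out::println);
        sc.stop();
    }
}
```

The same pipeline is noticeably shorter in Scala, which is why the lab mixes both languages.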
Apache Spark is an open-source framework for Big Data processing, built to perform sophisticated analysis and designed for speed and ease of use. It was originally developed at UC Berkeley’s AMPLab in 2009 and became an open-source Apache project in 2010.