Spark for Beginners: Project – Part 1

Hello everyone, I’m pleased to start a new series dedicated to a Big Data project.

Project context

Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.
Big data is often characterized by the 3Vs:

  • The extreme volume of data.
  • The wide variety of data types.
  • The velocity at which the data must be processed.

Although big data doesn’t equate to any specific volume of data, the term is often used to describe terabytes, petabytes and even exabytes of data captured over time.
Data analytics relies on descriptive and predictive statistics to derive insights from data. Over the next 10 weeks, we will choose a theme, set our business objectives, collect data, and then compute statistics on it.
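To make the difference between descriptive and predictive statistics concrete, here is a minimal sketch in plain Python. The numbers are toy values invented for illustration (they are not taken from any data source used in this series), and the helper names describe and fit_line are hypothetical.

```python
from statistics import mean, median, stdev

# Toy values invented for illustration: sugar content (g per 100 g) of six products.
sugars = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
# Toy values invented for illustration: energy (kJ per 100 g) of the same products.
energy = [110.0, 140.0, 170.0, 200.0, 230.0, 260.0]


def describe(values):
    """Descriptive statistics: summarize the data we already have."""
    return {"mean": mean(values), "median": median(values), "stdev": stdev(values)}


def fit_line(xs, ys):
    """Predictive statistics: least-squares line y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


summary = describe(sugars)
slope, intercept = fit_line(sugars, energy)

# Use the fitted line to estimate the energy of an unseen product with 14 g of sugar.
predicted = slope * 14.0 + intercept

print(summary)
print(slope, intercept, predicted)
```

The descriptive part only summarizes observed values, while the fitted line is used to estimate a value that was never observed. On real data the fit would of course be noisier than on this toy example.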


We chose the global food system as our theme because food is needed by the human body for energy, to repair and build cells, and to prevent and heal from sickness. While it is possible to obtain nutrients in a scientifically controlled manner, common food is the most efficient way of obtaining energy and nutrients. The global food system is a complex network of consumers and producers, and by any measure its data sets are big. Big data can also be used to change the way people discover restaurants and food.
We will therefore work on several business objectives related to food.

Data source identification

Open Food Facts

Open Food Facts is a free, open, and collaborative database of food products from around the world.
Here are some of its columns:

  • code (text)
  • url (text)
  • product_name (text)
  • categories (text)
  • origins (text)
  • countries (text)
  • energy_100g (numeric)
  • fat_100g (numeric)
  • sugars_100g (numeric)
  • vitamin_d_100g (numeric)
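As a rough sketch of how these columns could be loaded before moving to Spark, the snippet below reads a delimited export with Python's standard csv module and converts the numeric columns to floats. It assumes a tab-separated file with a header row; the function names parse_products and to_float and the sample rows are hypothetical and only serve the illustration.

```python
import csv
import io

# Column names taken from the list above.
TEXT_COLUMNS = ["code", "url", "product_name", "categories", "origins", "countries"]
NUMERIC_COLUMNS = ["energy_100g", "fat_100g", "sugars_100g", "vitamin_d_100g"]


def to_float(raw):
    """Convert a raw cell to float; missing or malformed cells become None."""
    raw = (raw or "").strip()
    if not raw:
        return None
    try:
        return float(raw)
    except ValueError:
        return None


def parse_products(handle, delimiter="\t"):
    """Yield one dict per product, keeping only the columns listed above."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    for row in reader:
        product = {name: (row.get(name) or "").strip() for name in TEXT_COLUMNS}
        for name in NUMERIC_COLUMNS:
            product[name] = to_float(row.get(name))
        yield product


# Made-up sample rows (not real products), with some columns deliberately absent.
SAMPLE = (
    "code\tproduct_name\tcountries\tenergy_100g\tsugars_100g\n"
    "0001\tExample biscuit\tFrance\t2000\t30.5\n"
    "0002\tExample water\tFrance\t0\t\n"
)

products = list(parse_products(io.StringIO(SAMPLE)))
for product in products:
    print(product["product_name"], product["energy_100g"], product["sugars_100g"])
```

Mapping empty cells to None rather than 0 keeps "not reported" distinct from "zero", which matters once averages are computed over nutrient columns.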

Instagram API

Instagram is a free online photo-sharing and social networking platform. It allows members to upload, edit and share photos with other members through the Instagram website, email, and social media sites such as Twitter, Facebook, Tumblr, Foursquare and Flickr. We will use Instagram to collect data about food pictures posted on the platform, in order to identify interesting information on the most popular trends worldwide, those of fine dining of course, for the contemporary gratification of taste, sight and smell.

Twitter API

Twitter is an online social networking service that enables users to send and read short 140-character messages called “tweets”. Registered users can read and post tweets. We will use Twitter to collect data about posted tweets and analyze a spectrum of personality attributes to help discover actionable insights about people and entities.

Yelp API

Yelp connects people with great local businesses. Its users have contributed approximately 108 million cumulative reviews of almost every type of local business, from restaurants, boutiques and salons to dentists. We will use the Yelp API to get local business information and user reviews for over a million businesses in 32 countries, in order to match users with them.