As a data science student, when I think about big data, Hadoop, MapReduce, and Spark immediately come to mind. But how do they work?
The basic idea of this architecture is that storage and processing are handled in two separate layers: processing is done with the MapReduce programming model, and storage is done on HDFS. Hadoop uses a master-slave architecture for both storage and data processing. The master node for data storage is the NameNode. A separate master process, the JobTracker, monitors and coordinates parallel data processing using Hadoop MapReduce. The slaves are the other machines in the Hadoop cluster, which store the data and perform the computations. Each slave node runs a TaskTracker and a DataNode, which execute the assigned tasks and keep them synchronized with the masters. This type of system can be set up either in the cloud or on-premise. The NameNode is a single point of failure when it is not running in high-availability mode, so the Hadoop architecture provides for a standby NameNode to safeguard the system against failures. Earlier versions relied on a Secondary NameNode, which periodically checkpointed the NameNode's metadata but was not a true hot backup.
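As a rough mental model of that split between metadata and data, here is a toy sketch in plain Python. This is not real Hadoop code; the class and method names (`NameNode`, `DataNode`, `add_file`, `locate`) are invented for this example. The point it illustrates is that the NameNode stores only bookkeeping (which blocks make up a file, and which DataNodes hold each block), while the DataNodes hold the actual bytes:

```python
# Toy sketch of HDFS bookkeeping -- illustrative only, not real Hadoop code.
# All names here are invented for this example.

class DataNode:
    """A slave node that stores the actual block data."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}            # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    """The master node: stores only metadata, never the data itself."""
    def __init__(self):
        self.file_to_blocks = {}    # filename -> [block_id, ...]
        self.block_locations = {}   # block_id -> [DataNode, ...] (replicas)

    def add_file(self, filename, blocks, datanodes, replication=2):
        self.file_to_blocks[filename] = []
        for i, data in enumerate(blocks):
            block_id = f"{filename}_blk{i}"
            self.file_to_blocks[filename].append(block_id)
            replicas = datanodes[:replication]   # naive placement policy
            for dn in replicas:
                dn.store(block_id, data)
            self.block_locations[block_id] = replicas

    def locate(self, filename):
        # The NameNode answers "where is this file?" without touching data.
        return {b: [dn.node_id for dn in self.block_locations[b]]
                for b in self.file_to_blocks[filename]}

nodes = [DataNode(f"dn{i}") for i in range(3)]
nn = NameNode()
nn.add_file("log.txt", [b"part-a", b"part-b"], nodes)
print(nn.locate("log.txt"))
```

If the NameNode's metadata is lost, the blocks on the DataNodes are unreachable, which is why it is a single point of failure without a standby.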
Hadoop MapReduce is a programming model in which a huge amount of data is processed in a distributed fashion. The best-known example is Google's original MapReduce. The model has two parts, Map and Reduce. Its advantages are that it saves a lot of time, hides the complexity of distribution, and saves hardware costs. MapReduce processes data in parallel rather than serially.
Users specify the computation in terms of a map function (which specifies the per-record computation) and a reduce function (which specifies how results are aggregated), both of which must meet a few simple requirements. For example, MapReduce requires the operation performed in the reduce task to be both "associative" and "commutative," so that partial results can be combined in any order. This two-stage processing structure is illustrated in the figure below.
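To make the two stages concrete, here is the classic word-count example sketched in plain Python (no Hadoop involved; the `map_fn`, `reduce_fn`, and `mapreduce` names are my own). Note that the reduce operation, addition, is both associative and commutative, which is exactly what lets a real cluster combine partial counts in any order on any machine:

```python
from collections import defaultdict
from functools import reduce

# Map stage: one input record in, zero or more (key, value) pairs out.
def map_fn(line):
    return [(word, 1) for word in line.split()]

# Reduce stage: combine all values for a single key.
# Addition is associative and commutative, so the framework is free
# to apply it in any order, including to partial results.
def reduce_fn(values):
    return reduce(lambda a, b: a + b, values)

def mapreduce(records):
    # Shuffle: group the mapped pairs by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce each group independently (in parallel on a real cluster).
    return {key: reduce_fn(values) for key, values in groups.items()}

lines = ["big data big ideas", "big clusters"]
print(mapreduce(lines))   # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

On a real cluster the map calls run on the machines holding the data blocks, the shuffle moves pairs over the network, and the reduce calls run on yet other machines; the programmer only writes the two functions.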
MapReduce simplified flowchart
It is important to remember that Spark does not require Hadoop; it simply supports storage systems that implement the Hadoop APIs. Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop input format.
What is Spark?
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Unlike most other shells, which let you manipulate data using only the disk and memory of a single machine, Spark's shells allow you to interact with data that is distributed on disk or in memory across many machines, and Spark takes care of distributing this processing automatically.
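The interactive feel of that shell can be sketched in plain Python with a toy class (my own invention, called `ToyRDD` here; real Spark RDDs are lazy and partitioned across machines, which this eager, single-machine sketch ignores). What it does capture is the chained-transformation style you type at the Spark prompt:

```python
class ToyRDD:
    """A toy, single-machine stand-in for Spark's RDD chaining API.
    Real RDDs are lazy, partitioned, and distributed; this is eager and local."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        # Apply f to every element, producing a new dataset.
        return ToyRDD(f(x) for x in self.data)

    def filter(self, pred):
        # Keep only the elements satisfying pred.
        return ToyRDD(x for x in self.data if pred(x))

    def collect(self):
        # Bring the results back to the driver as a plain list.
        return self.data

# Chained transformations, in the style of an interactive Spark session:
result = (ToyRDD(range(10))
          .filter(lambda x: x % 2 == 0)
          .map(lambda x: x * x)
          .collect())
print(result)   # [0, 4, 16, 36, 64]
```

In real Spark, `filter` and `map` build up a plan without touching the data, and only `collect` (an action) triggers distributed execution across the cluster.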
Databricks is the commercial platform built around Spark. It makes cluster creation easy and adds security, scheduling, notebooks, and collaboration features. I use the notebook environment in Databricks and focus on the DataFrame API. Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks. In addition to building the Databricks platform, the company co-organizes massive open online courses about Spark and runs the largest Spark conference, Spark Summit.
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.