Spark

Introduction

Spark is an open-source distributed computing framework used for analytics, graph processing, and machine learning.

Spark has a real-time processing framework that processes large amounts of data every day. Spark is not only used in IT companies; it is also used across industries such as healthcare, banking, stock exchanges, and more.

The primary reason for its popularity is that the Spark architecture is well layered and integrates with other libraries, making it easier to use.

Spark uses a master/slave architecture and has two main daemons:

1. The master daemon

2. The worker daemon

The two important aspects of the Spark architecture are the Spark ecosystem and RDDs.

The Apache Spark ecosystem contains Spark SQL, MLlib, language APIs such as Scala, and the core Spark component.

Spark Core is the base for all parallel data processing, and the libraries built on the core, including SQL and machine learning, allow for processing diverse workloads. Spark includes various libraries and provides quality support for R, Scala, Java, etc.

Spark SQL offers a simple transition for users familiar with other Big Data tools, especially relational databases (RDBMSs).
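
For users coming from an RDBMS background, the transition is mostly a matter of registering data as a table and writing ordinary SQL. A minimal sketch, as it might be typed into the Scala shell (./bin/spark-shell pre-creates the spark session); the table, columns, and values below are made up for illustration:

    import spark.implicits._

    // Register a tiny in-memory DataFrame as a temporary SQL view.
    val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // Familiar SQL syntax runs directly against the registered view.
    spark.sql("SELECT name FROM people WHERE age > 40").show()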

An RDD, or Resilient Distributed Dataset, is considered the building block of a Spark application. The data in an RDD is divided into chunks (partitions), and it is immutable. RDDs support transformations and actions.
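
A small sketch of the transformation/action distinction in the Scala shell, where sc (the SparkContext) is pre-created; the numbers are arbitrary:

    // Create an RDD from a local collection; Spark splits it into partitions.
    val numbers = sc.parallelize(1 to 10)

    // map and filter are transformations: each returns a new, immutable RDD,
    // leaving the original untouched.
    val doubled = numbers.map(_ * 2)
    val large   = doubled.filter(_ > 10)

    // reduce is an action: it actually runs the computation and returns a value.
    val total = large.reduce(_ + _)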

Features of the Apache Spark Architecture

Spark has a large community and a variety of libraries. It provides an interface for programming clusters with built-in parallelism and fault tolerance. Here are some top features of the Apache Spark architecture.

Speed

Compared to Hadoop MapReduce, Spark batch processing can be up to 100 times faster because Spark manages data by dividing it into partitions, so the data can be distributed and processed in parallel, minimizing network traffic.
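
A brief sketch of how partitioning surfaces in the API (shell session; the partition count of 8 is an arbitrary example):

    // Ask Spark to split the collection into 8 partitions up front.
    val data = sc.parallelize(1 to 1000000, 8)
    println(data.getNumPartitions)   // 8

    // Each partition is filtered in parallel by the executors before the
    // final count is assembled on the driver.
    val evens = data.filter(_ % 2 == 0).count()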

Polyglot

Spark provides high-level APIs in R, Python, Java, and Scala, meaning that coding is possible in any of these four languages. It also provides an interactive shell in Scala (./bin/spark-shell in the installation directory) and in Python (./bin/pyspark).

Real-Time Computation

Spark can process real-time (streaming) data and uses in-memory computation to achieve low latency. Spark is designed for high scalability: clusters can run from a single node up to thousands of nodes. It also supports several computational models.
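
As an illustration, a minimal Structured Streaming sketch in Scala; the socket source and the localhost:9999 address are placeholders for a real stream:

    import spark.implicits._

    // Read a text stream from a socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Maintain running word counts in memory as new data arrives.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Continuously print the updated counts to the console.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()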

Hadoop Integration

Spark is relatively new, and most Big Data engineers started their careers with Hadoop, so Spark's compatibility with Hadoop is a huge bonus. While Spark replaces Hadoop's MapReduce component, it can still run on top of a Hadoop cluster, using YARN for resource scheduling.
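
In practice this usually means submitting the application with YARN as the master; a sketch (the class and jar names are placeholders, and HADOOP_CONF_DIR must point at the cluster configuration):

    ./bin/spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyApp \
      my-app.jar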

Machine Learning

MLlib, the machine learning library of Spark, is very useful for data processing since it eliminates the need for separate tools. This gives data engineers a unified engine that’s easy to operate.
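
A minimal MLlib sketch in Scala (the tiny training set is made up; "label" and "features" are MLlib's default column names):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors
    import spark.implicits._

    // A toy labelled dataset.
    val training = Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (1.0, Vectors.dense(0.0, 1.3, 1.0))
    ).toDF("label", "features")

    // Train and apply a model without leaving the Spark engine.
    val lr = new LogisticRegression().setMaxIter(10)
    val model = lr.fit(training)
    model.transform(training).show()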

Lazy Evaluation

One reason Spark is faster than other data processing systems is that it puts off evaluation until it becomes essential. Spark adds transformations to a Directed Acyclic Graph (DAG) of computation, which is executed only after the driver requests an action.
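
A short shell example of this behaviour (the file path is a placeholder):

    // Transformations only record lineage in the DAG; nothing runs yet.
    val lines  = sc.textFile("data.txt")
    val errors = lines.filter(_.contains("ERROR"))

    // Only this action forces the chain above to execute.
    val errorCount = errors.count()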

Architecture

The diagram below explains the complete Spark architecture.

A Spark cluster has a single master and any number of slaves/workers.

The driver and the executors run as individual Java processes. Users can run them on the same machine (a horizontal Spark cluster), on separate machines (a vertical Spark cluster), or in a mixed machine configuration.

Driver in Spark Architecture

The driver is the central point and the entry point of the Spark shell.

The driver program runs the main() function of the application; it is where the SparkContext and RDDs are created, and where transformations and actions are performed.

The Spark driver is responsible for translating user code into actual Spark jobs executed on the cluster.

The Spark driver performs two main tasks: converting user programs into tasks and planning the execution of those tasks by executors. A detailed description of its responsibilities follows:

  • The driver program runs on the master node of the Spark cluster. It schedules job execution and negotiates with the cluster manager.
  • It translates the RDDs into an execution graph and splits the graph into multiple stages.
  • The driver stores metadata about all the Resilient Distributed Datasets and their partitions.
  • The driver program converts a user application into smaller execution units known as tasks. Tasks are then executed by the executors, i.e. the worker processes that run individual tasks.
  • After the tasks have completed, all the executors submit their results to the driver.
  • The driver exposes information about the running Spark application.
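
A minimal sketch of a driver program under these assumptions (the application name and input path are made up): main() creates the SparkContext, the transformations are recorded into the execution graph, and the final action is split into tasks that run on the executors, whose results return to the driver.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountDriver {
      def main(args: Array[String]): Unit = {
        // The SparkContext is created inside the driver's main() method.
        val conf = new SparkConf().setAppName("word-count-driver")
        val sc   = new SparkContext(conf)

        // Transformations: recorded by the driver, not executed yet.
        val counts = sc.textFile("input.txt")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Action: the driver schedules tasks on the executors and collects results.
        counts.take(10).foreach(println)

        sc.stop()
      }
    }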

Executor in Spark Architecture

An executor is a distributed agent responsible for the execution of tasks.

Every Spark application has its own executor processes. Executors usually run for the entire lifetime of a Spark application; this is known as "static allocation of executors". However, users can also opt for dynamic allocation of executors, adding or removing executors dynamically to match the overall workload (a configuration sketch appears at the end of this section).

  • The executor performs all the data processing and returns the results to the driver.
  • It reads data from and writes data to external sources.
  • The executor stores computation results in memory, in cache, or on hard disk drives.
  • It interacts with the storage systems.
  • It provides in-memory storage for RDDs that are cached by user programs, via a service called the Block Manager that resides within each executor. Because RDDs are cached directly inside executors, tasks can run in parallel against the cached data.
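
Dynamic allocation, mentioned above, is switched on through configuration rather than code. A sketch of the relevant spark-defaults.conf entries (the executor bounds are example values; tune them per workload):

    # Let Spark add and remove executors to match the workload.
    spark.dynamicAllocation.enabled        true
    # Example bounds; adjust for the cluster and job.
    spark.dynamicAllocation.minExecutors   2
    spark.dynamicAllocation.maxExecutors   20
    # Dynamic allocation typically also requires the external shuffle service.
    spark.shuffle.service.enabled          true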