Big Data & Spark Introduction
What is Spark
Spark is a Big Data analytics engine with a programming interface for data parallelism and fault tolerance. It currently supports familiar languages such as Java, Scala, Python, and R.
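To make this concrete, here is a minimal sketch of that programming interface using PySpark; the application name and the toy computation are arbitrary choices for illustration.

```python
# Minimal PySpark sketch: create a session, distribute a small dataset,
# and compute on it in parallel. The app name and numbers are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro-example").getOrCreate()

# Split the collection into 4 partitions and square each element in parallel.
nums = spark.sparkContext.parallelize(range(1, 101), numSlices=4)
total = nums.map(lambda x: x * x).sum()
print(total)

spark.stop()
```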
Why Spark
Before we dive into why Spark, we should first understand the basics of Big Data. In a nutshell, as the name suggests, Big Data is simply a very large set of data. In the last decade we have seen explosive data growth, and traditional sequential processing no longer keeps up. Handling data at this scale requires different algorithms and processes, and these techniques are bundled together under the name Big Data processing. The Google File System paper first described the possibility of using several cheap, low-compute machines to process this data instead of one single, costly, high-end machine.
This effectively led to the birth of MapReduce, where a large dataset is broken into several partitions and processed in parallel across many computers/nodes/instances. MapReduce, as the name suggests, works in two phases: it first maps the data and writes the intermediate results to disk, then reduces that data in the next phase. Since these intermediate writes are costly I/O operations, Spark avoids them by keeping the intermediate data in memory instead of writing it to disk between phases. This results in faster, more optimized execution.
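As a rough sketch, the classic word count shows both phases expressed in Spark; the input path "data.txt" is a placeholder, and the cache() call simply illustrates keeping the mapped data in memory for reuse rather than writing it out.

```python
# Word count: a map phase followed by a reduce phase, with the intermediate
# pairs kept in memory. "data.txt" is a placeholder input file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("data.txt")                      # partitioned input
pairs = (lines.flatMap(lambda l: l.split())
              .map(lambda w: (w, 1))
              .cache())                              # map phase, held in memory
counts = pairs.reduceByKey(lambda a, b: a + b)       # reduce phase
print(counts.take(10))

spark.stop()
```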
Apache Spark philosophy
Spark's philosophy is to provide a unified, parallelized data processing platform. Spark was designed as an upgrade over MapReduce, whose primary limitation is writing data to disk between the map and reduce phases, which increases processing time.
Unified Platform
Spark offers multiple ways of running parallel computation, from a single laptop to a cluster of any size, while keeping data processing simple to run. It can serve as a standard tool for data engineers doing data processing, for analysts doing data analysis, and for data scientists applying machine learning algorithms, across a wide range of data volumes.
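As a sketch of that flexibility, the same PySpark application can run on a laptop or be submitted to a cluster by changing only the master setting; the cluster URLs in the comments use the standard forms for Spark standalone and YARN, but the host name is a hypothetical placeholder.

```python
# The same code scales from a laptop to a cluster; only the master changes.
from pyspark.sql import SparkSession

# Local development: use every core on the laptop.
spark = SparkSession.builder.master("local[*]").appName("unified").getOrCreate()

# On a cluster, the master is usually supplied at submission time instead of
# being hard-coded, for example:
#   spark-submit --master spark://<host>:7077 app.py   (Spark standalone)
#   spark-submit --master yarn app.py                  (YARN)
```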
Spark Ecosystem
Spark has several components in its ecosystem. Let's go through a brief overview of each component.
Spark Core
All the I/O capabilities and fault tolerance are encapsulated in Spark Core, specifically the interaction with the Spark cluster and with storage.
Spark SQL
Spark SQL sits on top of Spark Core and provides a data abstraction called DataFrames, which supports semi-structured and structured data processing.
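A brief sketch of the DataFrame and SQL interface; the file "people.json" and its columns are made-up placeholders.

```python
# Spark SQL sketch: load semi-structured JSON as a DataFrame and query it
# with SQL. "people.json", "name", and "age" are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

df = spark.read.json("people.json")     # schema is inferred from the JSON
df.printSchema()

df.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
```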
Spark MLlib
Spark MLlib is the distributed machine learning library that sits on top of Spark Core.
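As a small sketch of the DataFrame-based MLlib API, the example below fits a logistic regression on a tiny in-memory dataset; the numbers are made up purely for illustration.

```python
# MLlib sketch: train a logistic regression model; Spark distributes the work.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

train = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1])),
     (0.0, Vectors.dense([2.0, 1.0])),
     (1.0, Vectors.dense([0.1, 1.2]))],
    ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)
print(model.coefficients)
```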
Spark GraphX
GraphX is the library for distributed graph processing in a parallelized fashion.
Spark Streaming
Spark Streaming ingests data as small batches (in the form of RDDs), which is what supports stream analytics.
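The classic DStream API matches this micro-batch-of-RDDs description (newer applications often use Structured Streaming instead). The sketch below assumes something is writing lines to localhost:9999, for example `nc -lk 9999`, and the 5-second batch interval is arbitrary.

```python
# Streaming sketch: count words in 5-second micro-batches from a socket.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, 5)                        # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                      # print each batch's counts

ssc.start()
ssc.awaitTermination()
```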
SparkR
This is the library that supports R on Spark as a distributed computing component.

Spark Execution Explained:
Execution in the Spark engine can be broken down into two steps.
1. Converting the user program into a logical plan
The Spark driver accepts the user program and passes the code through the analyzer to convert it into a logical plan. The analyzer uses the catalog to resolve the logical plan; the catalog holds all the information about the fields needed for the code to execute successfully, and some of that information can also be derived. The logical plan is then optimized by Spark's Catalyst optimizer, and the output of this step is the optimized logical plan.
2. Converting the optimized logical plan into a physical plan
Spark takes the optimized logical plan and generates several candidate physical plans, which it compares using a cost model. The cheapest candidate is chosen as the best physical plan, and this plan is then sent to the executors to be executed over the underlying RDDs. Both the logical and physical plans can be inspected with explain(), as sketched below.
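Calling explain(True) on a DataFrame prints the parsed, analyzed, and optimized logical plans followed by the chosen physical plan; the DataFrame below is a throwaway example.

```python
# Inspect the plans Spark produces for a simple query.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("plan-sketch").getOrCreate()

df = spark.range(1000).withColumn("double", col("id") * 2)
filtered = df.filter(col("double") > 100).groupBy().count()

# Prints the logical plans and the selected physical plan.
filtered.explain(True)
```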
Sequence of execution steps:
- The user submits an application to Spark; the driver program then takes control of the application.
- The SparkContext breaks the application into a logical plan, which is then converted into a physical plan by the DAG (directed acyclic graph) scheduler. At this point the stages and tasks have been identified. (Details explained in the steps above.)
- The DAG Scheduler runs solely on the driver program; it is created during SparkContext creation, after the Task Scheduler and SchedulerBackend are created. The DAG Scheduler handles computation of the DAG, preferred locations to run tasks, and recovery when shuffle outputs are lost. It typically sends requests to the Task Scheduler to create tasks, and it tracks which RDDs (resilient distributed datasets) are available and which hosts they reside on, so the same computation is not re-run.
- The Task Scheduler takes the tasks submitted by the DAG Scheduler and executes them on worker nodes with the help of the cluster manager, while the DAG Scheduler manages the state of the tasks and RDDs. A small sketch of this flow follows the list.
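A sketch of how these pieces fire in practice; the partition count and the toy data are arbitrary. Nothing runs until the action is called, at which point the driver builds the DAG, the DAG Scheduler splits it into two stages at the shuffle introduced by reduceByKey, and the Task Scheduler launches one task per partition on the workers.

```python
# Scheduler sketch: transformations are lazy; the action triggers the job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scheduler-sketch").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10000), numSlices=8)      # 8 partitions -> 8 tasks
pairs = rdd.map(lambda x: (x % 10, 1))               # first stage: map side
totals = pairs.reduceByKey(lambda a, b: a + b)       # shuffle -> second stage

print(totals.count())    # the action that triggers the job, stages, and tasks
```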

References
https://spark.apache.org/docs/latest/cluster-overview.html
https://books.japila.pl/apache-spark-internals/scheduler/DAGScheduler/#introduction
https://techvidvan.com/tutorials/spark-architecture/