Beginner's Guide to Apache PySpark

Boost your data processing performance with Apache PySpark!

Srivignesh Rajan
Towards Data Science


Apache Spark

Apache Spark is an open-source analytics engine and cluster-computing framework that boosts your data processing performance. Its developers describe it as a lightning-fast unified analytics engine. Spark itself is written entirely in Scala.

Spark is widely used in the fields of big data and machine learning for analytical workloads, and it has been adopted by companies such as Amazon, eBay, and Yahoo.

Features of Spark

  • Spark is polyglot, which means you can use it from more than one programming language. Spark provides high-level APIs in Java, Python, R, SQL, and Scala. The Apache Spark package for Python is called PySpark.
  • Spark supports multiple data formats such as Parquet, CSV (Comma Separated Values), JSON (JavaScript Object Notation), ORC (Optimized Row Columnar), text files, and RDBMS tables (see the reading sketch after this list).
  • Spark has low latency because of its in-memory computation. Spark has been designed to deal with humongous data, so scalability is an inherent feature of Spark.
  • Spark can be seamlessly integrated with Hadoop and has the capability of running on top of Hadoop clusters.
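As a quick illustration of the format support, here is a minimal sketch of reading the same data from a few of these formats (the file paths are hypothetical placeholders):

# Reading different data formats into Spark DataFrames
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("FormatsDemo").getOrCreate()

csv_df = spark.read.csv("data/people.csv", header=True, inferSchema=True)
json_df = spark.read.json("data/people.json")
parquet_df = spark.read.parquet("data/people.parquet")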

How Spark works

  • Spark uses a Master-Slave architecture. The Master node assigns tasks to the slave nodes that reside across the cluster, and the slave nodes execute them.
  • A Spark Session must be created to utilize all the functionalities provided by Spark. A Spark Session is created inside the Driver program. The Driver program resides inside the Master node.
# Example of creating a Spark Session in PySpark
from pyspark.sql import SparkSession

spark = SparkSession.builder.\
    master("local").\
    appName("AppName").\
    getOrCreate()
  • When you read a DataFrame through the Spark Session, the DataFrame is partitioned and stored across the nodes of the cluster so that it can be operated on in parallel. The partitioned data is represented as an RDD (Resilient Distributed Dataset). RDDs are fault-tolerant, which means they are resilient to failures.
  • When an Action is invoked through the Spark Session, Spark creates a DAG (Directed Acyclic Graph) of transformations (which are applied to the partitions of data) and implements them by assigning tasks to the Executors in the slave nodes. A transformation is never executed until an Action is invoked. This behavior of executing a transformation only when an Action is invoked is called Lazy Evaluation (see the sketch after this list).
  • The Driver program running in the Master node assigns Spark jobs to the slave nodes when an Action is invoked. The Spark jobs are split into stages, and those stages are further split into tasks.
  • The slave nodes contain many Executors that receive tasks and execute them in parallel on the partitions of the data. The Executors are also the ones that cache data for in-memory computation.
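To make Lazy Evaluation concrete, here is a minimal sketch (the data and column names are hypothetical): the filter is only executed when the count action runs.

# Lazy evaluation: transformations build a plan, actions execute it
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local").appName("LazyDemo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])

# A transformation: nothing is computed yet, only the plan is recorded
filtered = df.filter(col("id") > 1)

# The count action triggers the job; the filter runs now
print(filtered.count())  # 2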

Spark Transformations

Spark transformations perform operations on RDDs and produce a new RDD. Common transformations include map, flatMap, filter, groupBy, reduceByKey, and join.

Spark transformations are further classified into two types:

  • Narrow transformations
  • Wide transformations

Narrow transformations

Spark transformations are called narrow transformations when the operation does not require Shuffling. A Narrow transformation does not require partitions of data to be shuffled across nodes in the cluster.

Examples of Narrow transformations are map, flatMap, filter, sample, etc.
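As a rough sketch (the data here is hypothetical), map and filter below work on each partition independently, so no data moves between nodes:

# Narrow transformations: each output partition depends on a single input partition
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("NarrowDemo").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5], numSlices=2)

mapped = rdd.map(lambda x: x * 10)          # applied partition by partition
filtered = mapped.filter(lambda x: x > 20)  # still no shuffle involved
print(filtered.collect())  # [30, 40, 50]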

Wide transformations

Spark transformations are called wide transformations when the operation requires shuffling. Shuffling redistributes the partitions of data across the nodes of the cluster so that all the records needed for an operation end up on the same node.

Examples of wide transformations are groupBy, reduceByKey, join, etc.

  • groupBy is a transformation that groups rows by the unique values of a column. Performing this operation is costly in a distributed environment because all the values to be grouped must be collected from the partitions of data that reside on different nodes of the cluster (see the sketch below).
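A minimal sketch of a wide transformation (the DataFrame here is hypothetical); rows with the same key must be shuffled to the same node before they can be aggregated:

# Wide transformation: groupBy forces a shuffle across partitions
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("WideDemo").getOrCreate()
df = spark.createDataFrame(
    [("male", 22), ("female", 38), ("male", 35)], ["Sex", "Age"])

# Rows sharing the same 'Sex' value are shuffled together before counting
df.groupBy("Sex").count().show()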

Actions in Spark

Actions are operations that trigger Spark jobs. Spark does not perform transformations immediately; it requires an Action to trigger the execution of the Spark transformations.

Examples of Spark actions are collect, count, take, first, saveAsTextFile, etc.

  • collect is an action that gathers all the partitions of data residing across the nodes of the cluster and returns them to the Driver program in the Master node (see the sketch below).
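A small sketch of a few actions (hypothetical data), each of which triggers a job and returns a result to the Driver:

# Actions trigger the actual computation and bring results back to the Driver
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("ActionsDemo").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

print(rdd.count())    # 5 -> number of elements
print(rdd.take(2))    # [1, 2] -> first two elements
print(rdd.collect())  # [1, 2, 3, 4, 5] -> every partition gathered in the Driver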

Spark Jobs

Spark jobs are triggered when an Action is invoked. Spark jobs are further divided into Stages and Tasks.

Stages

A Spark job is divided into stages at shuffle boundaries: consecutive narrow transformations are pipelined together into a single stage, while each wide transformation, which requires a shuffle, starts a new stage.

# A Spark job (triggered by the collect action)
df.filter(col('A') > 0).groupBy('A').count().collect()

The code above forms a single Spark job. The filter belongs to one stage and the groupBy aggregation to another, because filter is a narrow transformation while groupBy is a wide transformation that requires a shuffle.

# Stage 1: the narrow filter transformation, applied before the shuffle
df.filter(col('A') > 0)
# Stage 2: the aggregation performed after the shuffle caused by groupBy
.groupBy('A').count()

Tasks

The stages of a Spark job are further divided into tasks. A task is the operation applied to a single partition of the data on a node of the cluster, so each stage launches one task per partition.
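Since a stage runs one task per partition, the partition count tells you how many tasks that stage will launch. A rough sketch with hypothetical data:

# Each partition of the RDD becomes one task in a stage
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("TasksDemo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(100), numSlices=4)

print(rdd.getNumPartitions())  # 4 -> a stage over this RDD runs 4 tasks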

Data preparation using PySpark

Install PySpark

PySpark can be installed by executing the following command:

pip install pyspark

Import the required libraries

import math
import numpy as np
import pandas as pd
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan, when, count, col, isnull, asc, desc, mean
'''Create a spark session'''
spark = SparkSession.builder.\
    master("local").\
    appName("DataWrangling").\
    getOrCreate()

'''Set this configuration to get output similar to pandas'''
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)
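
The examples that follow operate on a DataFrame named df which, judging by the columns and the row count of 891, is the Titanic passenger dataset. A minimal sketch of loading it from a local CSV file (the file path is a hypothetical placeholder):

'''Read the dataset into a DataFrame (hypothetical file path)'''
df = spark.read.csv('titanic.csv', header=True, inferSchema=True)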
'''Find the count of a dataframe'''
df.count()
"""OUTPUT:
891"""

Count of values in a column

df.groupBy('Sex').count()

Find distinct values of a column in a Dataframe

df.select('Embarked').distinct()

Select a specific set of columns in a Dataframe

df.select('Survived', 'Age', 'Ticket').limit(5)

Find the count of missing values

df.select([count(when(isnull(column), column)).alias(column) \
for column in df.columns])

Filtering null and not null values

'''Find the not null values of 'Age' '''
df.filter(col('Age').isNotNull()).limit(5)
'''Another way to find not null values of 'Age' '''
df.filter("Age is not NULL").limit(5)
'''Find the null values of 'Age' '''
df.filter(col('Age').isNull()).limit(5)
'''Another way to find null values of 'Age' '''
df.filter("Age is NULL").limit(5)

Sorting columns

'''Sort "Parch" column in ascending order and "Age" in descending order'''
df.sort(asc('Parch'), desc('Age')).limit(5)

Dropping columns

'''Drop multiple columns'''
df.drop('Age', 'Parch', 'Ticket').limit(5)

Groupby and aggregation

'''Find the mean age of male and female passengers'''
df.groupBy('Sex').agg(mean('Age'))

Summary

  • Spark is a lightning-fast cluster computing framework used for analytical purposes.
  • SparkSession is the entry point to all of Spark's functionality; it is created inside the Driver program, which creates and schedules Spark jobs when actions are invoked.
  • The Executors apply the transformations to the partitions of data in the RDD in parallel.

Connect with me on LinkedIn, Twitter!

Happy Learning!

Thank you!

