PySpark Tutorial | Python Spark | Intellipaat


hey everyone welcome to the session
by intellipaat the entire world is a big data problem managing and scaling
this huge amount of data we see every single day is a problem even to the most
renowned big data architects out there but PYSPARK aims to do this very easily
and in this session we’re gonna learn exactly that we’ll have a quick
introduction to PYSPARK and learn PYSPARK in depth at the same time well
before we begin with this session make sure to subscribe to the Intellipaat
YouTube channel and hit that Bell icon so that you never miss an update from us
so here’s the agenda for today we’ll begin by checking out a quick
introduction to spark and followed by this we’ll check out the architecture
and the components of spark after this we can check out all the various methods
of deployments there are and then we can check out everything there is to know
about RDDs and the operations associated with them after this we can
check out the installation of Spark and some fundamentals along with the same
and after this we can take a quick look at a demo an interactive demo where you
can get hands-on with code and learn at the same time and after this if
you guys have any queries make sure to head down to the comment section below
and do let us know and we’ll be happy to help you out there and if you guys are
looking for end-to-end course certification in PYSPARK Intellipaat
provides the PYSPARK certification training program where you can learn all
of these concepts thoroughly and earn a certificate at the same time well
without further ado let’s begin the class apache spark is one of the most
widely used frameworks when it comes to handling and working with big data on
the other hand Python is one of the most widely used programming languages for
data analysis machine learning and much more so why not use them together this
is where Python with Spark also known as PySpark comes into the picture so
let’s start with the very first topic introduction to PySpark you know what
the average salary of an Apache Spark developer is around 110,000 dollars per
annum so there’s no doubt that spark is used in industry a lot now when it comes
to Python because of its rich library set it is used by the majority of data
scientists and analytics experts today so combining Python and
Spark was a major gift to the community Spark was developed in the Scala language
which is much like Java it compiles the program into bytecode for the JVM for
Spark big data processing so to support Spark with Python the Apache Spark
community released PySpark so guys what do you think of PySpark any rough idea
anyone okay fine I’m assuming that you guys
know that it is something which has to do with both Python as well as
Spark right as I already told you Python is an open source powerful
high-level and object-oriented programming language and yes in terms of
its syntax it’s much easier to use and read when compared to other programming
languages on the other hand Apache Spark is an open source general-purpose
distributed computing engine which is used for processing and dealing with large
amounts of data so here comes PySpark it’s a collaboration of Apache Spark and
Python it is an API for Spark which lets you easily integrate and work with RDDs using
the library called Py4J so let me tell you PySpark is an API where you
merge the simplicity of Python with the power of Apache Spark in order to tame
big data for those of you who are already familiar with Spark and the
concept of RDDs you can think of it as a Python API for Spark
which lets you easily integrate and work with RDDs using a library called Py4J
PySpark offers a PySpark shell which links the Python API to the Spark
core and initializes the SparkContext I hope by now you guys have a brief idea
of what PySpark is all right so let’s move ahead and let me tell you
about some of its advantages this will help you to know more about why you
should use PySpark so as we can see we have speed powerful caching real-time
computation deployment and polyglot so number one we have speed when you
compare it with traditional large-scale data processing frameworks it can be
up to 100x faster all right in terms of powerful caching well
PySpark gives you powerful caching with disk persistence
capabilities third we have real-time computation well there’s a plethora of
data which needs to be processed in real time let me tell you a fact you know
what every minute around 48 hours of video is uploaded to YouTube
Twitter gets around 98 thousand tweets and more than 168 million
emails are sent every single minute well this data comes in bits and pieces
from many different sources it can come in various forms like words images
numbers and so on
so Twitter is a very good example of what’s being generated
in real time we also have websites where statistics like number of visitors
page views and so on are being generated in real time there is so much data and
it is not useful in its raw form we have to process it and extract insight
from it so that it becomes more useful to us right so this is where Spark
Streaming comes into the picture it is exceptionally good at processing
real-time data and is highly scalable it can process enormous amounts of data in
real time without skipping a beat all right
the next feature we can see over here is deployment well you can deploy PySpark
through Mesos Hadoop or even through Spark’s own cluster manager all right next we
have polyglot well simply it means that it supports multiple languages like
Python Java and Scala it basically provides a shell in Scala and in Python
which you can access through the bin folder for Scala as well as for
Python alright a question might arise in your mind like
under what situation should I use Python or under what scenario should I go with
Scala or Java or R all right so let me tell you a fact the most used programming
languages with Spark are Python and Scala all right so now if you are going to
learn PySpark in this tutorial it is important that you should know why and
when to use Spark with Python instead of Spark with Scala so let me just compare
Python and Scala on some parameters this will make things clearer for you to
decide so the first parameter is performance speed
Python is comparatively slower than Scala when used with Spark but
programmers can do much more with Python than with Scala because of the easy interface
that it provides on the other hand Spark is written in Scala so it integrates
very well with Scala making it faster than Python alright so next we have
the learning curve Python is known for its easy syntax and being a high-level
language it is easier to learn whereas Scala has an arcane syntax which
makes it hard to learn but once you get a hold of it you’ll see that it has its
own benefits all right next is the data science libraries Python supports multiple
data science and machine learning libraries but on the other hand Scala
lacks proper data science libraries and tools and Scala does not have proper local
tools and visualizations alright finally the last parameter is complexity so
the Python API has a simple yet comprehensive interface whereas on the
other hand Scala’s syntax and the fact that it produces verbose output
is why it is considered a complex language I hope that things are getting
clearer to you so before we dive deeper into the PySpark
concepts let me tell you about how big tech giants like Yahoo TripAdvisor
Alibaba and other big multinationals are using PySpark
well Yahoo uses Apache Spark for its machine learning capabilities to
personalize its news and web pages and
also for targeted advertising they use Spark with Python to find out what kind of
news users are interested in reading and to categorize the news stories to find out
what kind of users would be interested in reading each category of news next is
TripAdvisor well TripAdvisor uses PySpark to provide advice to
millions of travelers by comparing hundreds of websites to find the best
hotel prices for its customers the time taken to read and process the reviews of
hotels into a readable format is reduced with the help of Apache Spark
all right let’s discuss Alibaba well Alibaba I guess most of you have
heard about it right it is one of the world’s largest e-commerce platforms and
runs some of the largest Apache Spark jobs in the world in order to analyze
hundreds of petabytes of data on its e-commerce platform alright so I guess
now you have an idea of how PySpark is being used in the industry right so
before I continue does anyone have any doubt up till here if yes
please add it to the comment section below and we’ll try to reply
at the earliest so guys if you are a mobile viewer get
ready with your laptop because now we’ll proceed with the installation part so
let’s install PySpark on your system just to keep it real in case you
are stuck you’ll find many installation references out there on the internet
which you can just go ahead and refer to and get it installed so guys in this
tutorial I will be teaching you how you can install PySpark on your Windows
machine alright if you are a Linux user leave your email id below and I’ll
send you the installation steps for it you can just follow them and install it on
your machine right well before we actually proceed
with the installation part there are some prerequisites for PySpark
installation that you must know so as a part of the prerequisites you must have the
Jupyter notebook installed on your machine and you must have an updated version of
Java all right so let’s just open a browser and search for download
PySpark well I’m assuming that you guys have
installed the prerequisites for this session so proceeding ahead let’s
download Spark just google it download Spark the very
first link which we’ll get is the Apache Spark link just click on it and it
will redirect you to the Apache Spark website
as you can see over here the Spark release is 2.4 of November it’s very
recent and the package type which I am selecting is pre-built for Apache Hadoop
2.7 and later you have other options available over here
and finally what I’ll do I’ll just click
on the download link over here it will redirect me to a different page where I
can see over here a mirror link for me I’ll just click it and my
download will start I think it’s a pretty big file so yeah
it’s a pretty big file just make sure that you have a very good internet
connection if you are downloading it for now I’ll cancel this as I have already
downloaded this file so let’s cancel it let me just show you where I have
downloaded it in downloads I have this file Spark 2.4 bin Hadoop 2.7 just
unzip it extract the files
the files I extracted I’ll just
save somewhere like let’s say within the D drive okay by the time it saves
just open your CMD so our next step would be to edit our
environment variables so let’s create some environment variables using the
setx command my first variable is SPARK_HOME
and after that I’ll specify the path of
my Spark files so there’s the path let’s paste it okay next I need another environment
variable HADOOP_HOME then again specify the path of the files
hit enter done now since I want to launch my PySpark shell and Jupyter notebook
I need to define two more parameters over
here PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS all right this will
help me to open the PySpark shell and Jupyter notebook so I’ll write
setx PYSPARK_DRIVER_PYTHON ipython hit enter done and my next parameter is
setx PYSPARK_DRIVER_PYTHON_OPTS
notebook hit enter done alright let’s close this
terminal now let’s open a new command prompt now my first task would be
to change to the bin folder all right so this is the path let’s copy it jump to
the D drive cd space I want to go up to bin
so yeah so there’s my path hit enter and I’m in bin now my next step would be
to execute PySpark with the command pyspark I’ll use the master parameter for setting
the master node and since I want two cores for local
testing I’ll specify master local and how many cores I need two that is local[2] and hit enter
so what it will do it will open the Jupyter notebook on my localhost so yes
we have successfully installed PySpark on our machine if you are facing any
issues comment below with the error message which you got and I will try to
resolve it for you okay so for now just click on new select Python 3 and it will
open a tab where you can write your PySpark
code so this is the Jupyter notebook with Python 3 where I’ll be writing my
PySpark code all right so now is the time that we
start with the fundamentals of PySpark so now the installation is
complete so I guess we should proceed with the fundamentals of PySpark so as a part of PySpark fundamentals
we have SparkContext SparkConf SparkFiles RDDs DataFrame and MLlib so we’ll start
with SparkContext well it is the entry gate for any Spark-derived application or
functionality let’s see in detail what exactly this is so when we run any Spark application a
driver program starts which has a main function and your SparkContext gets
initiated here the driver program then runs the operations inside the executors
on worker nodes now PySpark uses Py4J this
Py4J enables the Python program running in a Python interpreter to dynamically
access Java objects in a Java virtual machine and create a JavaSparkContext by
default PySpark has a SparkContext available as sc so creating a new SparkContext
won’t work over here all right let’s move ahead so here are some of the parameters
related to SparkContext we have master appName sparkHome serializer
gateway conf pyFiles environment jsc profiler_cls batchSize among all of these
only master and appName are the ones which are used most frequently so for this
tutorial we’ll focus on just these two the master represents the URL of the
cluster it connects to whereas the appName is nothing but the name of your job
let me show you this with an example let’s open our Jupyter notebook
so let’s write our very first PySpark code from pyspark import SparkContext
all right and next we’ll define sc as
SparkContext and inside this my first parameter would
be the URL my URL is local and my next parameter is the name of my application
let me name it anything so I guess we are getting an error like
cannot run multiple SparkContexts at once alright so you can resolve it just by
writing sc.stop() okay hit enter and let’s run it again
execute it once more yeah so as you can see here my error is gone
and the statement is executed alright so what I did over here I defined the URL
and the app name now these parameters cannot be changed
throughout the program all right so this was all about
SparkContext so moving on ahead let’s see another
PySpark fundamental so in PySpark fundamentals next we
have SparkConf well it is used to provide the configuration to run a Spark
application on a local machine or a cluster so what exactly is SparkConf well to run a Spark application on a
local machine or a cluster you’ll have to set a few configurations and parameters and
this is where SparkConf comes into the picture it provides the configuration to run a
Spark application initially we create a SparkConf object with
SparkConf() which will load the values from the spark.* Java system properties now you can set different
parameters using the SparkConf object and the parameters you set
will take priority over the system properties
remember once we pass the SparkConf object to Apache Spark it
cannot be modified by any user thus it becomes immutable so we have set setMaster setAppName
get and setSparkHome all right so set is used to set a
configuration property you specify two parameters over here a key and a value
fine in setMaster you specify the master URL and in setAppName you set an
application name then there is get which is used to get the configuration
value of a key and finally there’s setSparkHome which is used to set the
Spark installation path let me show you this with an example this should make
things easier for you to understand all right so let’s move ahead all right so
let’s jump to our Jupyter notebook so let’s define our configuration
for this I’ll use from pyspark.conf import
SparkConf and next is
from pyspark.context import SparkContext all right next what I will do I will
define a variable conf and in it I will call the SparkConf function I’ll set the app name with setAppName
the name of my app I named it PySpark demo app
and I’ll set the master as local[2] fine
now let me show you how this get function works so since I defined
the conf variable I’ll just write over there conf.get and inside that my
key is spark.master so what it will do it will
give me the master URL which I have already set
fine let’s hit enter as you can see we got the URL as
local[2] fine since I have already set my master URL as local[2] now what if I want
to check my app name so conf.get spark.app.name
and hit enter so my Spark application name was PySpark demo app
alright so I
guess we can move ahead so next in the PySpark fundamentals we have
SparkFiles SparkFiles helps in resolving the paths of files added to
Spark all right so what exactly is
SparkFiles let me first tell you what it is so whenever you have to upload your file
to Apache Spark you can upload it using sc.addFile where sc is your
default SparkContext remember all right and whenever you want to retrieve the
path on a worker you can get it using SparkFiles.get
SparkFiles is used to resolve the paths of files added through
SparkContext.addFile ok now let’s talk about its class methods
SparkFiles contains two class methods get and getRootDirectory get is
used to get the path of a file that is
added through SparkContext.addFile and the getRootDirectory function is used to get the path of the root
directory containing the files added through sc.addFile fine let me make things more clear with
a demo all right so let’s check out the demo for SparkFiles from pyspark
import SparkFiles fine next from pyspark I want to import SparkContext
next what I will do I’ll define my path path equal to os.path.join and inside that
I’ll define the path and the file so the path is this one
so this is my path and the name of the file is
fortune five double zero two zero one seven so since I have defined my
path my next step would be to add the file so sc.addFile from where from
this path which I have already mentioned now let’s hit enter okay so yeah it got executed fine now
let’s check for its root directory
SparkFiles.getRootDirectory
fine let’s see so there’s my root directory all right
so moving on ahead in PySpark fundamentals we have RDDs well
this is a very important concept that you have to learn RDD stands for
resilient distributed dataset and it is the building block of every Spark
application so what is an RDD RDD stands for
resilient distributed dataset resilient means that if the data in memory is lost
it can be recreated that is its fault tolerant behavior next is distributed
the data is chunked into partitions and stored in memory across the cluster and
finally dataset means that the initial data can come from a file or it
can be created programmatically fine so next we have the types of operations on
RDDs well we have two types of operations on an RDD one is transformation
and the other is action let’s see them in detail so what is a transformation on an RDD well
Spark RDD transformations are functions that take an RDD as an input
and produce one or many RDDs as output they do not change the input
RDD since as I already told you RDDs are immutable and hence you cannot
change them so what they do is they always produce one or more RDDs by applying
the computation they represent for example the map function filter reduceByKey
etc these transformations are lazy operations on an RDD in Apache Spark they
create one or many new RDDs which execute when an action occurs
hence a transformation creates a new dataset from an existing one
certain transformations can be pipelined which is an optimization method that
Spark uses to improve the performance of computation
next come the action operations an action in Spark returns the final
result of RDD computations it triggers execution using the lineage
graph to load the data into the original RDD carry out all the intermediate
transformations and return the final result to the driver program or write it out to
the file system the lineage graph is the dependency graph of all the parent RDDs of
an RDD actions are RDD operations that produce non-RDD values they
materialize a value in a Spark program an action is one of the ways to send
results from the executors to the driver a few functions which are a part of it are
first take reduce collect and so on alright so using a transformation you can
create an RDD from an existing one but when you have to work with the actual
dataset at that point you have to use an action remember when an action occurs it
does not create a new RDD like a transformation thus actions are RDD
operations that give non-RDD values instead what an action does is send the
value either to the driver or to an external storage system thereby bringing the laziness
of RDDs into motion well here is a summarizing image for you
as you can see when you apply transformation operations like map
filter sortBy or anything what it does is create a new RDD whereas on the
other hand when you perform actions like first take etc then
a result is computed from your existing RDD and returned or saved as a file alright so I think you guys might have a
question in mind like when should you use RDDs right well here are
some of the scenarios in which RDDs are preferably used first is unstructured data when your
data is unstructured for example in the case of a media stream or a stream of text you
can use RDDs second is no schema well this is the case where you don’t care
about imposing a schema such as a columnar format while processing
or accessing data attributes by column name fine next is data
manipulation that is you want to manipulate your data with functional
programming constructs rather than domain-specific expressions and finally
we have low-level transformations that is you want low-level
transformations and actions and control over your dataset alright so these are the
scenarios in which you can use RDDs so next in the PySpark fundamentals we
have DataFrame so this was all about RDDs if you have any doubt don’t hesitate
to add it in the comment section below it will be beneficial for you as
well as for the other learners all right so next in the PySpark fundamentals
we have DataFrame DataFrame is a buzzword in the industry nowadays people
tend to use it with many popular languages used for data analysis like
Python Scala and R so why is it that everyone is using it so much
well a DataFrame generally refers to a data structure which is tabular in
nature it represents rows each of which consists of a number of observations
a row can hold a variety of data types that is it is heterogeneous whereas a
column holds a single data type that is it is homogeneous a DataFrame
usually contains some metadata in addition to the data for example the
column and row names okay you can say that DataFrames are nothing but
two-dimensional data structures similar to a SQL table or a spreadsheet fine
I hope DataFrame is clear to you guys so moving on ahead in the end we
have MLlib MLlib is the machine learning API of Spark
which in turn supports various machine learning algorithms it is a machine learning library that is
provided by Apache Spark to make machine learning scalable and easy
it provides various tools for machine learning algorithms featurization
pipelines persistence and utilities a few machine learning algorithms which are a
part of it are classification regression clustering collaborative filtering
dimensionality reduction and the like as a part of featurization
they have feature extraction transformation and selection and in
pipelines they have tools for constructing evaluating and tuning
machine learning pipelines Spark core forms the vital component of Spark
where it talks to all of the other modules and keeps in check with the
other submodules you know you have Spark SQL you have Spark Streaming you have
MLlib which is the machine learning library then you have GraphX which is used for
graph processing and you know again Streaming is for live streaming and SQL
is to perform you know structured data processing data access and data
manipulation as well so to bring all of this under one particular umbrella
we have Spark core as I’ve already mentioned this forms the you know the
base of the Spark engine guys so our Spark core as we already know it
contains all of the basic functionalities of Spark right so we
have many things which need looking at consider fault recovery
memory management you need task scheduling then you’re gonna need
algorithms you know there are many task scheduling algorithms that is first in
first out or round-robin there’s so many algorithms right so you need
something to stay in constant touch with that and you know to make it readable
for both the program and the user as well again as you already checked out in
the OOPs concepts abstraction forms a major part of our learning right so
again the major abstraction concept here for the Spark core is something called
RDDs guys I’m not sure if you guys are familiar with what RDDs are if
you’re not don’t worry I’ll be walking you guys through it in detail in the
same module basically it is a data unit but we’ll anyway be checking it out
in the next couple of slides so again yeah Spark core provides you know
many APIs which is application programming interfaces for processing data
processing structured data is pretty much easy and going about working with
unstructured data is not impossible per se
but then at the end of the day it’s unstructured think about it right so
structured versus unstructured the example at the top of my head right now
is you know having a good clean bedroom and living room versus a very shabby
room where you’ve thrown all your dresses around and then you’re in a
hurry to pick up your dress and you have no idea where to look for it so
similarly that’s again unstructured data now the next component we’ll be
discussing is Spark SQL again our Spark SQL already provides us APIs and we
already know we can work with structured data using SQL and again people who have
a solid SQL background and people who have been working on it for a long time
can pretty much make use of SQL commands you know to talk with
Spark and go about querying in their language and you know data gathering
data manipulation and storage as well so again with Spark SQL guys there are many
data sources which you can have right usually most of the time the data
sets are pretty much CSV files and again we have JSON that we can pull up
and we can have requests and we can you know pull up a lot of data using
that as well then there’s Parquet or Hive with Cassandra and guys there’s lots of data
sources lots of you know abstraction tools where we can keep pulling in data
as and when we wish right so pretty much that’s what I wanted to
convey here and SQL really makes it easy and if you already know how
you can go about programming with SQL it makes your life even easier because
it is giving you a ready-to-use API to work with Spark and next up we have the
Spark Streaming engine guys so this basically provides us with
really low latency and you know high quality streaming so basically we need
processing tools which go about giving us this result right so Spark
Streaming has again APIs which pretty much give us direct access fast access
and most of all again we’ll be covering an entire module about streaming so
don’t worry about the in-depth part but yeah at this point of time what I want
to do is give you a quick idea of how it works so basically it
gets all of the live input data streams right so there’s continuous data flowing
into it to handle this particular data what Spark Streaming usually goes about
doing is it pretty much splits all of this data
into certain tiny snippets or we call them batches basically
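The batching idea can be illustrated without Spark at all. The sketch below is plain Python, not the Spark Streaming API: the `micro_batches` helper and the fake tweet stream are hypothetical names, and it cuts batches by item count purely to stay self-contained, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of up to batch_size items from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


# A stand-in for a live stream: seven fake tweets produced one at a time.
tweets = (f"tweet {i}" for i in range(7))

batches = list(micro_batches(tweets, 3))
for number, batch in enumerate(batches):
    # Each small batch can now be processed like an ordinary dataset.
    print(number, len(batch), batch)
```

Processing a continuous stream as a sequence of small finite batches is what makes near real-time results possible with batch-style operations.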
it goes about dividing data into batches to give you an example you know consider
Twitter right so Twitter is usually I wouldn’t say spammed but
Twitter is bombarded with so many tweets from everyone around the world
right so imagine all of the data being generated at Twitter so if you have to
process this in real time it is gonna be a little tough but then we can go about
almost pushing for real-time requirements that’s what we call near
real-time and this can be done you can have logs generated you can work on the
logs and you know you can perform analysis on the data as well so you know
you can do sentiment analysis and whatnot using MLlib as well so to go
about doing this and to access that near real-time data right we’re pretty
much going to need Spark Streaming guys and next up coming to MLlib
guess oh this pretty much is the core for sparks machine learning and
artificial intelligence concepts we’re already familiar with the hype which
goes on about machine learning and artificial intelligence right so you
know or to just or take a quick detour or if you guys remember back in the day
I think we were being told that you know machine learning and artificial
intelligence will one day revolutionize the world yes it pretty much is already
revolutionising the world right I mean it has been so subtly integrated into
our world that we don’t even realize it’s there anymore
look around you so if you have an iOS device you
can go about saying hey Siri and that’s artificial
intelligence working for you so everything from turning your lights
on and off at home and turning on the air conditioner to putting
rockets on the Moon and Mars as well so we have machine learning everywhere and
it’s been ever so subtly implemented that it’s not pushed into our face the
concept is not pushed into our face and it’s really fun to work with
so anyway coming back to this particular library right so the machine
learning library with spark has so many inbuilt libraries which
give you access to the plethora of machine learning algorithms
which are present in the field of machine learning and
its concepts and to give you a couple of names of machine learning
algorithms guys we have classification we have regression we have
clustering we have collaborative filtering right so we have linear
regression there are so many algorithms that you have to know and work with
when it comes to machine learning and in that particular case MLlib
has us covered here as well guys also the machine learning library
provides various techniques as well so it’s not that it only provides a
couple of libraries for us and then we go about doing it on our own
without any guidance no so it provides us a couple of techniques of how we can go
about pre-processing our data because at the end of the day data might be
structured or unstructured right so we are converting this particular data into
something usable and that usable part is what we call information so data
as it stands well there’s pretty much nothing you
can do with it so it is cluttered it is unstructured and at the end of
the day it might not be what you’re looking for so consider data as if
you want a mug of coffee guys so if I give you the coffee powder I give you
a cup of milk I give you sugar all separate in three containers well it is
still coffee but at the end of the day you cannot drink from each and expect the
same coffee taste right so that is data and let’s say you brew up a really good
cup of coffee as soon as you mix up all of that and you go through all the process
then that becomes information right so again for coffee to
take it from the coffee plant and then to get it
roasted and ready to grind can be another good analogy for data
pre-processing as well guys and next up we have GraphX so
as beginners I guess you guys might be interested in
graphing out a couple of data sets as well and basically it’s visualization guys
so whenever I have spoken to learners
a learner prefers to find out more about a language more
about a piece of code if it looks really good so with GraphX again you can
graph out everything you want to and basically at the end of the day graphs
are really nice to look at and work with right
and most of all it provides really good analysis and supports computation
for large amounts of data right so again I’ve been giving you analogies in
simple terms just to put the concept in your heads but at the end of the day you
need to realize that the data which you’ll be working with
is huge so at the end of the day instead of looking at a
million numbers it would be really nice to look at probably 10 graphs right
think about it and yeah GraphX comes with a variety of graph
algorithms which work on networks of nodes and edges and a lot of computation
techniques as well and these are mainly
used for disaster recovery systems so say after a
hurricane hits and you need to find out what’s going on what is the level of
damage the population who were affected by it and so on and a
couple of page ranking algorithms use graphs as well and I’ve seen a lot
of financial fraud detection systems too and most of all the first
thing the most important thing that comes into my head is
think about it guys the stock market right so if you keep looking at numbers then
sure that can work too but again looking at graphs if you’re a trader you will
figure out how important graphs are right so GraphX is pretty
much the spark equivalent for that and here’s an abstracted quick view of
what everything looks like so coming to the spark components again we already
know that the programming languages supported by spark are Scala
R Java and Python where a couple of people prefer Python
over Scala just for good readability easy learning and a lot of applications as
well and in terms of libraries we have seen
SQL MLlib GraphX and streaming as well and to tie all of these
components together under one management engine we have the spark
core and then we have something called cluster management tools as well we’ve
already checked out what yarn is so yarn is basically a
resource negotiator so to have your data in clusters and to access the data
from these clusters and work with it you’re gonna need yarn to
talk to spark as well and then we have Apache Mesos again Mesos is similar
to yarn but it gives you very good functionality as well and
we have the spark scheduler again this is another cluster management tool
which I think a majority of users who have real-time dependencies go about
using and then for storage you’re already familiar with HDFS the Hadoop file
system and we already know that you can
convert all of our data to the Hadoop file system’s requirements
and pick it up from each individual node in the network but Apache spark will
also support standalone nodes as well so let’s say you’re experimenting and
you have no instance where you connect to multiple nodes for
your clustering requirements you can go about doing it on a
standalone node as well and to move it to the cloud I think Amazon EC2 does a
good job to help you with storage and to move everything to
the cloud as well and support for relational database management
systems and NoSQL is also mentioned here as well so I would
recommend you quickly pause the screen
and just glance over all of these concepts because these form the
major foundation to go about learning spark and later
down the line once you’re going about mastering this entire thing you might
not need to keep coming back to this diagram but at this point of time I
would highly recommend you to be aware of these concepts guys so without
further ado let’s quickly check out the spark architecture guys so I
have a very simple diagram here to denote what happens so we have something
called the driver program in spark so the driver program is the link between
you as a programmer and spark’s program so for all the data that you input all the
data that you need manipulated to go about working with it
you need spark to understand your data right so again our driver program which
contains this concept called the spark context is used for
this so again thinking about drivers right so recall if you’re
familiar so if you have brand new hardware you do
not talk to your raw machine you need the hardware to talk to the
software and at the end of the day you need to use that so basically you’ll be
using a driver there as well right so again the concept is
pretty similar and simple out here so coming to the spark context the spark
context forms the main entry point for any functionality
which you go about using in spark so basically at the end of the
day our spark context is used to create what we call rdds again I’ll walk you
through it and it is used to create accumulators and to broadcast a
couple of variables and all of that but one thing that at this point of time in
the introduction I would like to tell you guys is you need to know
that for every single instance of spark running you need to
have only one spark context so if you have more than one you cannot run
them side by side and you will get an error guys so one spark context for
every single program and you’ll be good to go so again all of the data that I’ve
inputted goes into our cluster manager where we have a couple
of management tools that we looked at in the previous slide where again this
depends on you whether you’re pulling your data from the cloud you are pulling it
from a couple of different nodes or you’re pulling it from one
standalone node and all of that right so cluster management
I think forms a very vital part because think about it right so if you’re
forcing someone to always use offline data or say use data only from one machine
then it really doesn’t make sense but then if you give them freedom to pull
the data from the cloud pull the data from another node pull it from another
cluster where a cluster is basically a group of nodes and data so when you
give them this amount of freedom then it would really help scalability and
help a lot of users and a lot of developers at the end of the day right so cluster
management is again customized for your requirement where you pull out the data
based on your needs and all of that and next we have the worker nodes guys so
worker nodes at the end of the day are the endpoint of our spark
architecture where the data gets converted into information where we
get the task done and as you can see on your screen we have something called
the executor right so that is what it literally means so worker nodes again
are the endpoints of the system where
you’ll be asking your worker nodes to use the
algorithms to learn something and give you a valid output so
all of that part goes on in the worker nodes guys so again at the end of
the day every worker node needs to be in touch with the
spark context for various reasons and don’t worry I’ll be walking you through
that in a couple of slides but that’s why you see the big arrows up and down
where it keeps talking to each of the nodes so it is important that you talk
to every single node that you’re actively making use of right so if
you’re making use of ten nodes but then you’re only telling your program
that you’ll be pulling data from one single node you know that there are ten
but your program is seeing only one and that really wouldn’t make sense so
getting your program to see all ten inputs
forms a vital part so again as we’ve
already seen spark provides a very well-defined layered architecture which
makes it really easy for a beginner to learn all of these
components and all of the layers are loosely coupled too so again as I
mentioned about the freedom of going about manipulating things for your
requirements you are not asked to learn in a very tightly bound
environment where you cannot wiggle around or do something of your
own and I think this makes spark a
little easy so again the driver program runs on the master node
of the spark cluster which is again a master slave organization where
you need to know at this point that whatever driver program you have
runs on the master node of the cluster and
it basically schedules all job execution and negotiates as I’ve
already mentioned with the cluster manager and it translates
rdds into execution graphs as well so think of rdds as a simple array at this
point of time so a lot of data a lot of numbers usually and
all of that so the entire
architecture here makes it easier to translate these rdds
and numbers into graphs guys so the driver program again
can split up the graphs into a couple of
stages as well so this will simplify your use case when you are going about
coding and working on a very big project guys so again as I’ve
mentioned the role of the executor is to take care of all the
execution of the tasks that you provide and yeah every application has to have
its own executors because again
most of the tasks are different from each other right so again
the executor as I mentioned also performs all of the data processing that
it needs it can read data or write data it can push the data to external sources
it can pull in data from external sources as well and you can see the
small arrows on your left where it talks to the cluster manager as well so
it interacts with a variety of storage systems and this again makes it easier
to go about this so the cluster manager again is the service which is
responsible for gathering everything from your driver
program and gathering all the resources that are required for the executors to work
and you can have the standalone cluster manager as well so
it is one single entity and you will go about using the standalone
cluster manager for your learning journey because I don’t think it would be right
to bombard you with a couple of multi node networks and to go about it that way
right so the cluster manager again as we’ve already walked you through it basically
depends on what you’re doing so it can be a very simple spark application and
it can lead all the way to something really fancy right so yeah so
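Before moving on, the driver and executor split described above can be sketched in plain Python. This is a toy under stated assumptions and not spark itself: threads stand in for executors on worker nodes, a list of lists stands in for a partitioned rdd, and the names `split_into_partitions`, `run_task` and `run_job` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    """Driver side: cut the data into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_task(partition: List[int]) -> int:
    """Executor side: the work done on one partition (here a sum of squares)."""
    return sum(x * x for x in partition)


def run_job(data: List[int], num_executors: int = 3) -> int:
    """Driver side: send one task per partition to the workers and combine results."""
    partitions = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        # each worker thread plays the role of an executor running one task
        partial_results = list(pool.map(run_task, partitions))
    return sum(partial_results)


print(run_job(list(range(10))))
```

The point of the sketch is only the shape of the flow: the driver plans and combines, and the executors do the actual processing on their own slice of the data.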
next up we’ll be checking out spark deployment
guys so there are a variety of spark deployment modes that we have already talked about
and the first most important thing you need to know is that spark works in
standalone mode right so we can go about launching a spark cluster and
you do not need any third party cluster manager to work with spark
and this cluster is basically again used to deploy spark
applications as well and next we have Mesos so again the most important
thing that Mesos can do for spark is it replaces the native cluster manager in
spark and it goes about providing a private cluster
for you along with a number of tools as well and next up you
can have spark on yarn so you can deploy it on top of yarn and
usually spark with Hadoop can enhance spark’s capability by a lot so
it’s like this extra masala we put on a dish right so put that kind of analogy
in your head so spark with Hadoop plays a major role in
upping spark’s game guys so that’s what I want to convey and we have Amazon EC2
as well so Amazon has been providing us with a lot of good cloud
services these days and I’m sure you guys already know what Amazon EC2 is
right so in case you do not know well basically Amazon EC2 stands for
Elastic Compute Cloud so basically it is a
web service that provides secure resizable computing capacity and
at the end of the day it is on the cloud right so
it is meant for scalability and that is how EC2 works and to
push your entire application to the cloud is something really nice guys I
mean I’ve personally worked on a couple of projects with EC2 where I’ve
moved a couple of things there and it makes your life a lot easier with the cloud at
the end of the day right so then we have Kubernetes guys Kubernetes again
is an open source system so basically Kubernetes was founded by Google it was
invented by the developers at Google but right now
it’s being maintained by a separate foundation called the Cloud Native
Computing Foundation and sure Kubernetes is fun to work with as well
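As a rough sketch of how these deployment options show up in practice, the cluster manager is usually selected through the master URL passed to spark. The URL formats below follow the spark documentation, while the helper name `master_url` and the host names and ports in the usage lines are hypothetical. EC2 does not appear because it is where the machines come from and not a master URL of its own.

```python
def master_url(mode: str, host: str = "", port: int = 0) -> str:
    """Return the master URL format for a given deployment mode."""
    formats = {
        "local": "local[*]",                      # run on this machine, all cores
        "standalone": "spark://{host}:{port}",    # spark's own cluster manager
        "mesos": "mesos://{host}:{port}",         # Apache Mesos
        "yarn": "yarn",                           # cluster found via Hadoop config
        "kubernetes": "k8s://https://{host}:{port}",
    }
    if mode not in formats:
        raise ValueError(f"unknown deployment mode: {mode}")
    return formats[mode].format(host=host, port=port)


# hypothetical hosts and ports, for illustration only
print(master_url("standalone", "master-host", 7077))
print(master_url("yarn"))
print(master_url("kubernetes", "k8s-host", 6443))
```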
so what I want to convey with this slide is that in terms of spark and
its deployment you can go about working with it
standalone with Mesos with yarn with EC2 and with Kubernetes as
well so let’s quickly have an introduction to how we can run
applications on yarn guys so one thing that I want you guys to note at
this point of time is that spark is already pre-configured for yarn so you do
not need any additional configuration to run
spark on yarn and I think this is a huge advantage for learners
right so let’s say you have to write a very simple hello world code but then
you have to spend three hours to set up your runtime for
it it really doesn’t make sense and probably yeah you will do it at the end
of the day but as a beginner it would not be right to bother you with it right so the
people who designed spark have thought about this already and
they’ve pre-configured it for yarn so we already know that yarn
is used to control all resource management and to take care of all of our
task scheduling for us and it even provides a couple of security features
as well so again security is a primary aspect for many of the applications that
we write and the developers at spark have got us covered again with spark
on yarn it is again possible to run any application in any mode right so we
have two particular modes we call them cluster mode and client mode so
cluster mode is when you need your application to talk to a couple of nodes
and pick up the data from a couple of nodes and to go
about launching your application basically into this
particular instance guys with client mode again most of the time you just
have one single entity that you have to deal with and you have to go about
working with that entity so the master node will tell it
eventually what that application needs to do but then processing the actual
data making use of the data and then
going about the goal of your application and
ensuring that the goal is met happens there right so
again client mode and cluster mode basically are two ways where you can
launch the spark application guys and the main difference is where the driver runs
so we have something called the spark driver guys
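Here is a minimal sketch of how the two modes are chosen when launching on yarn. The `--master` and `--deploy-mode` flags are real spark-submit options, while the helper name `spark_submit_command` and the file name `my_app.py` are hypothetical. The sketch only builds the command and does not run it.

```python
from typing import List


def spark_submit_command(app_file: str, deploy_mode: str) -> List[str]:
    """Build the spark-submit argument list for running an app on yarn."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        # client:  the driver runs on the machine that runs spark-submit
        # cluster: the driver runs inside the yarn application master
        "--deploy-mode", deploy_mode,
        app_file,
    ]


for mode in ("client", "cluster"):
    # my_app.py is a placeholder name for your own script
    print(" ".join(spark_submit_command("my_app.py", mode)))
```

Building the command as a list keeps each argument separate, so it could later be handed to `subprocess.run` without any shell quoting concerns.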
so the spark driver works as a part of our application master we
call it AM for application master and the management of this
particular spark driver and the management of our client application in terms of
spark is basically done by yarn so as soon as you initiate
your application as soon as you put your application on the deployment you don’t
have to worry about anything else at this point because we have the
yarn container and again the spark driver talks to the yarn container directly so you just
need to have one single process in the yarn container that will take care of
your entire application and what that process does is what we call
driving your application so it can go about driving your application it can
request resources from yarn or it can issue commands to
something called the node manager and you can have your spark executor go about doing
the job that it’s supposed to do as well right so the node manager again
as we’ve already seen talks to all of our
cluster nodes and talks to all of our executors as well so to allocate
the resources and to get the job done and to pull all the data from one single
entry point we have the yarn container guys so again the resource manager yeah
so the driver talks to our resource manager as well where it gets
all the resources and it allocates the resources which are
required to go about doing this as well so at the end of the day after the
executor runs your task then your task is
complete so it is pretty simple you just need to know what the yarn
container does what the node manager does and how resources go about
being allocated and a couple of simple processes guys so again when you
think about it right so in this particular
deployment mode the driver again runs on the host where you are providing it
with the job right so if I’m sitting at my home and I want to run the job from there
the terminology we use is push so as soon as I push the job from there
basically at the end of the day the driver needs to run on the host
you are pushing your job from so the executor will require all of the data and
what it does is it goes on requesting from yarn and yarn will talk
to our application master and the application master will eventually give it all the
data it needs then yarn talks to the resource manager the resource manager will
hand over the resources commands are sent to the executor and at the end of
the day it is executed guys so quickly jumping on to the next concept
which is the spark shell so guys the spark shell
provides the simplest way to go about learning what spark actually is but do
not be fooled by the simplicity of the shell because shell programming is again a
huge asset for any programmer because at the end of the day it is again a very
powerful tool which is used to perform in-depth analysis as well and writing
shell commands I personally find to be fun because you can
go about interacting with the shell directly and get used to it and
see a lot of code so I think for me the spark shell is truly fun guys so our
spark shell again supports three types of
interactive shells guys so we have the spark shell with Scala we have
PySpark with Python and one with R as well so we can go about concentrating on how it
works with Python let’s quickly dive into what the spark web UI is guys so
think about it what do you think the spark web UI is where you can use the
spark web UI and what you can go about doing with it right so quickly take a
minute to see if something pops up in your head or probably if you’ve
seen or read about this somewhere else previously as well so I hope you guys
could figure something out so coming to the concept now well the web UI is
again the web interface of a spark application guys so why would you
need a web interface again at the end of the day the UI is really nice and you
would rather take a look at a quick UI and figure stuff
out really fast rather than getting messy right well mainly it is used for
inspection it is to see what your spark jobs are doing what’s happening with
your particular clusters and all of that guys so consider that every single time
you create a new spark context this will basically tell your
application to launch one instance of your web UI and your web UI
runs as it is on the localhost machine right so there is
this particular address that you can use for it and yeah you can do a lot of things with
that you can go about working with jobs you can check out what the status of
your application is you can check out what’s going on with storage I
mean how much data each
application needs so you can go about having your resource allocator
allocate that you can check out the environment in which it’s being worked
on right so there are a lot of things that you can do and you will be talking to your
executors at the end of the day as well so the web UI is really a fun way to
take a look at that so the jobs tab again is usually used to
show the status of whatever is happening with your current spark jobs
in that particular application guys so as I’ve already mentioned the spark
context you’ll be looking at the context in the jobs tab and basically again this
particular jobs tab consists of usually two pages so it consists of
all jobs where you’re presented with a nice
list of all the jobs and then you could go about finding all the
details for each individual job as well and basically what the tab
does is display your stages per state so let’s say
your application is in an active state that’s
working your application is pending execution your application has completed
execution it has been skipped or it has failed and crashed something
like that so to find out all of this I think the web UI is really helpful guys
and the stages tab again is to tell you basically when your spark application
got initialized guys so the stages tab that we are checking out is
basically used to show the current state of what’s happening so again as
I have already told you the stage can be stopped it can be
executing and there can be a lot of stages as well
and basically you can go about customizing this but let’s not get into
the depth right now again the stages tab includes the all stages page
where the page will tell you the stages of all your applications then you can
move into one stage page and later you have something called the pools
page as well so coming to storage well yes the storage tab again goes
about displaying the information about rdds again if you do not know what
rdds are I hope I am piquing your curiosity on what they are and this entire module
is pretty much about rdds so we will be checking that out in the next couple of
slides again in the storage tab we have something called the summary page where the
summary page basically shows us the storage levels and partitions of all of
these rdds and it shows the sizes and the executors which are used for
this so partitioning our individual data organizing the data in a structured
way possibly and guys the storage tab that we’re talking
about includes two pages one is called the storage page and the other one
is called the rdd page the storage page will greet you with all of the
data on what your application is consuming and the rdd page will tell
you what your resilient distributed datasets are carrying I mean
what’s the weight that they’re carrying what’s the data that you’re presented
with is the resource present in order to be allocated and all of that guys
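The tabs covered so far also have a machine readable counterpart: spark exposes a monitoring REST API under `/api/v1` on the same address as the web UI, which defaults to port 4040 on the driver machine. The sketch below is a hedged illustration that only builds the URLs. The helper names `ui_api_url` and `fetch_json` and the application id `app-1` are hypothetical, and `fetch_json` only works while a spark application is actually running.

```python
import json
from urllib.request import urlopen


def ui_api_url(app_id: str = "", resource: str = "",
               host: str = "localhost", port: int = 4040) -> str:
    """Build a monitoring REST API URL for a running spark application."""
    base = f"http://{host}:{port}/api/v1/applications"
    if not app_id:
        return base  # lists the applications served by this UI
    if not resource:
        return f"{base}/{app_id}"
    return f"{base}/{app_id}/{resource}"


def fetch_json(url: str):
    """Fetch and decode one API response (needs a live spark application)."""
    with urlopen(url) as response:
        return json.load(response)


# "app-1" is a placeholder id; real ids come from the applications listing
for resource in ("jobs", "stages", "storage/rdd"):
    print(ui_api_url("app-1", resource))
```

The three resources printed here line up with the jobs, stages and storage tabs walked through above.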
next up coming to the environment tab the environment tab again basically
displays the values for the different configuration and environment
variables as well why would you need this well at the end
of the day you can be working with Java Scala Python
and you’ll be talking to a lot of system properties at the same time as well so to
go about giving you a quick look at everything I guess this would
make it a very important page again the tab as soon as it’s created
has two parts so one is the parent spark UI which is almost
like admin access to your application let’s say and then you have the app
status store where you can go about finding out what your
application is doing at that point of time and the next important
tab is the executors tab because the executors tab will give you in one single
page a huge summary of what’s going on with your executors that were
basically created for your applications to run so there’s a lot of things it can
display but mainly you’ll be concerned with the memory usage the disk usage the
task that is in hand being processed and the information that
corresponds with it so the storage memory column in the
executors table goes about showing the amount of memory that
that particular application is consuming in real time and it also shows you the
memory which is basically reserved for caching a couple of data sets as well so major
caching actions go on in this stage where we require faster access for near
real-time applications so our executors tab gives us
that big summary there so moving on next we can go about having a quick
introduction to the PySpark shell guys so as we’re already aware the PySpark
shell again exposes the spark programming model directly to Python
and we can go about using Python with spark directly and the
PySpark shell links our entire Python API directly to the
spark core and it is used to initialize the spark context as well and as I’ve
already kept on mentioning for a while now the majority of experts so
say data scientists analytics experts use Python because it has
this amazing huge set of libraries where you can find a library
for even one of your custom requirements as well so at the end of the
day I’m sure people at spark had to integrate Python with it because think
about it so Python gives you access to numerous libraries and then if
you can do all of that along with spark hey guys that’s a win-win right so
integrating Python with spark was definitely a boon yes I can
vouch for that so yeah PySpark again requires Python to be available
on the system path because spark needs to know where Python is
to have that unhindered access basically and that is basically
used to run all of the programs guys so to
quickly give you an idea of what the spark shell looks like here is what it
looks like on a Windows 10 machine well you can go
about installing spark on Windows as well it’s very simple
yes so as soon as you open up your shell and you go
about typing pyspark you will be greeted with a message which shows spark
with its version and this is all some people might like working with the shell and
some people might not prefer it I mean that’s personal preference but
yes I still hold my word that after all working with the shell is fun and
you’ll be learning a lot as well so next let’s quickly have an overview of how we
can submit a PySpark job guys so in terms of PySpark or spark in general
it’s not that all of our code is independent of everything else right so
sometimes we have these native code dependencies it can be on other
projects you can have multiple source codes coming in from multiple
directions and then you need a way to go about packaging them with your
application and you need this code to be basically distributed to
every single node on your spark cluster as well so for this we
usually go about creating an assembly jar file that contains our code
and all of its dependencies so there is a quick requirement for Java as well and
Maven again which has assembly plugins basically for when
we’re creating our jars where we list spark and Hadoop
so again these are certain dependencies that we need to know and work with and
here’s a quick overview as well so after that we’ll have the
assembled jar file and we can basically call the bin spark submit
script which is used as the name suggests to submit
our script while passing our jar file as well so as soon as you hit
spark submit this script is basically used to launch the applications that
you’ve just programmed and to push them onto the cluster guys again a quick recall
spark is already pre-configured for yarn it does not require any additional
configuration to run and yarn controls all of our resource management
scheduling and security and it is possible to run an
application is anymore right as a cluster mode of client mode so to be
launching it and cluster of client mode you know all you just have to go about
using this simple piece of code so you might be wondering all the backslashes
right so if you’re going about using shell programming then or you will you
will be you know using you’ll be exposed to a lot of slashes as well also have a
quick pause of the screen and I know try to try to see if you can figure out I
mean it is not important at this point of time for you to know what this does
but then just see if you can figure something out right so most of your
learning happens due to your perspective so you try to figure it out and on your
own and I think that holds better in memory as well so go about quickly
pausing the cine screen to see if you can figure something out and if you
cannot not to worry anyway we’ll be checking this out and the next couple of
modules so there you know you can get familiar with this as well so again this
is the set of steps which we’ll be using to build the
application so we’ll be creating a document let us say we call it
pyspark_job.py with a .py extension because
again this is a python file and we’ll be
storing it on your local drive and as soon as you open
the file what we’re doing here is printing hello this is a
PySpark job right so you can upload this particular file into our cloud lab’s
local storage guys so if you’re not making use of a virtual machine at this
point of time you can have this tiny hello world code pushed onto our
cloud lab local storage and you can perform these particular steps
to work with it guys so as I’ve already mentioned to
run your application you will be using spark-submit --master and we’ll
be using the YARN cluster here so it’s --master yarn then --deploy-mode
client followed by your file name dot py and it will
pretty much run it and if you look where the
mouse is this is where you have your application output as well so this is
how you would go about submitting a PySpark job guys
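The submit command described above can be sketched in Python as follows. This is only a sketch under assumptions: the file name `pyspark_job.py` and the helper `build_submit_command` are made up for illustration, while `--master` and `--deploy-mode` are the standard `spark-submit` options the walkthrough mentions. The code only builds and prints the command; it does not run Spark.

```python
import shlex


def build_submit_command(script, deploy_mode="client", master="yarn"):
    """Return the spark-submit argument list for one PySpark script."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", master,            # which cluster manager to talk to
        "--deploy-mode", deploy_mode,  # where the driver program runs
        script,                        # the application file comes last
    ]


# pyspark_job.py is the hypothetical hello-world file from the walkthrough.
client_command = build_submit_command("pyspark_job.py")
cluster_command = build_submit_command("pyspark_job.py", deploy_mode="cluster")

# Print the command as one shell line instead of a backslash-continued one.
print(" ".join(shlex.quote(part) for part in client_command))
```

On a machine where Spark is installed, the same list could be handed to `subprocess.run(client_command, check=True)`; keeping the arguments in a list avoids the shell backslashes altogether.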
how can we go about running a PySpark job using
Jupyter notebooks so yeah PySpark can pretty much go about easily
running in its standalone mode that we have discussed and consider this tiny
example I have on the right for you right so this is again Google Colab
I’ll be quickly walking you through the same so in this particular example
firstly we’ll be installing PySpark
we’ll be importing PySpark we’ll be initializing a spark context
followed by that we’ll be performing a calculation so we’ll be using the mod
function this is basically our basic spark job and later we’ll be
using an RDD operation to perform a transformation guys I hope I am
again piquing your interest on what an RDD is so if I have to give you
a small trailer of what RDDs are at this point of time well RDD actually
stands for resilient distributed dataset guys so RDDs are the
fundamental data structures of spark and every dataset in an RDD is again
divided into logical partitions which can be computed on different
nodes of the cluster and all of that anyway we’ll
be coming back to that so let me quickly
open up my Google Colab so we can walk through our PySpark job guys
so Google Colab you can basically access by typing in
colab.research.google.com and guys at the end of the day
this is just a Jupyter notebook which is hosted on Google Cloud so you
can carry your code easily everywhere so I’ve been using Google Colab
for a couple of years now ever since its launch and I find it really easy
basically it’s convenient guys so all you do to run your code is type in
your code so our first code is to check the Java version that we
have and as soon as I go and hit the play button right next to it done so
your code is executed and you do not have to install python you don’t
have to install the dependencies you don’t have to go about
configuring the runtime or the executor nothing so all you do is type in your code hit
play and you’re good so that’s how simple it is guys so basically we
have Java version 11 right now but for our spark applications most of
the time we prefer Java 8 so I’ll be installing
selection 2 which is basically Java 8
and we’ll be downloading spark as well so wget is again used to
download it from a particular server and here is the official link for
picking up spark 2.4.4 so as soon as I hit that
Google Colab will go on downloading spark and putting it in the files section as well
this will take a minute so we will just continue as soon as we’re done
with the download so that took about 20 seconds
that was pretty quick and what we downloaded was in a zipped format right
so it’s in .tgz format so to use this we need to first of
all unzip it right so we use this command
called tar xf where you’re unzipping it into a form where Google Colab
understands what is present inside our spark archive
so after this we have to install something
called findspark can you guys
quickly guess what findspark is
take a second to figure out what findspark is
think about it I mean if you can’t figure it out that’s fine so
basically to keep your spark instance portable we make use of findspark at
the end of the day guys so this is a very tiny module so this should be
I’m guessing installed in under five seconds and that’s done again you need
to set up a couple of environment variables we have to set up JAVA_HOME
and the environment variable SPARK_HOME so there’s the code to
do that and as soon as I hit play that’s done so let us
create a first spark session guys so we’ll be importing findspark
to start your spark session using the findspark.init method and then
from pyspark.sql we’ll be importing
SparkSession and the master setting is just telling it that
we’ll have only one local instance to work with so again at this
point of time you’re doing nothing but just creating an empty session for us
but you need to know that we have further code to check as well so I cannot
just go about executing the further code because a spark session is already running
right now in Colab so let me quickly stop our session so we can
pretty much rerun it again next up we need to install PySpark so
let’s quickly install PySpark this should take
about half a minute and let’s quickly import PySpark again and
begin a new session let’s go about creating our spark context so from
pyspark we’re importing the spark configuration SparkConf and again from the same module
we are importing SparkContext as well then we’re setting a variable conf
equal to SparkConf with setMaster as local telling it that there is only
a local instance which will run you can have any application name that you want
and here we just call it spark basics guys then you have sc which is again our spark
context and building it is pretty much simple guys so you use this
code and this code is pretty much the standard way
to create a spark context of your own so let me run it
again it should take less than 10 seconds to run and here is
what we were discussing in the PPT now we’re defining a mod function where
we’re importing numpy and we’re returning the modulus of the input so again
rdd is equal to sc.parallelize over a range of 1000 mapping mod over it and using the take
function I know that you guys are not familiar with any of these right now but
just to give you a quick example this is how it works guys so basically
an RDD is like a list of all of the elements again it is a structured data type and it
forms the native data structure of spark and don’t worry I have lots of examples
I’ll be showing you a lot of functions a lot of operations that I’ll
be walking you through when it comes to RDDs as well which is actually the
next section that we’ll be checking out I just wanted to get you familiar with
Google Colab guys since we chose to use another python IDE for the
previous modules and I just wanted to show you what you need to
set up your PySpark job guys
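The Colab steps above (environment variables, findspark, spark context, the mod job) might look roughly like the sketch below. Several things here are assumptions rather than the code shown on screen: the two paths in the usage note are typical Colab locations for Java 8 and Spark 2.4.4 and may differ on your machine, and `mod` uses Python’s `%` operator instead of numpy to keep the sketch dependency free. The Spark imports sit inside `run_basic_job` so that the small helpers can be tried even where Spark is not installed.

```python
import os


def configure_environment(java_home, spark_home):
    """Point Spark at Java and at the unpacked Spark folder."""
    os.environ["JAVA_HOME"] = java_home
    os.environ["SPARK_HOME"] = spark_home


def mod(x):
    """Pair a number with its remainder when divided by 2."""
    return (x, x % 2)


def run_basic_job(count=5):
    """Start a local spark context and run the tiny mod job."""
    # Imported here so the helpers above work without Spark installed.
    import findspark
    from pyspark import SparkConf, SparkContext

    findspark.init()  # uses SPARK_HOME to make pyspark importable
    conf = SparkConf().setMaster("local").setAppName("Spark Basics")
    sc = SparkContext.getOrCreate(conf)
    # parallelize builds the RDD, map is the transformation, take is the action.
    return sc.parallelize(range(1000)).map(mod).take(count)
```

In a notebook you would first call something like `configure_environment("/usr/lib/jvm/java-8-openjdk-amd64", "/content/spark-2.4.4-bin-hadoop2.7")` (both paths are hypothetical) and then `run_basic_job()`, which returns the first few `(number, remainder)` pairs.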
so what are spark RDDs guys so I’ve already told you spark RDDs pretty
much stand for resilient distributed datasets but at the end of the day how
do they work why do we need them and what do they do for spark we’ll be checking that out
in this particular module so this module is basically all about RDDs
and we really need to know what they are why we need them and all that
right so RDDs are immutable they are split into
partitions of data so basically partitioned collections of data records
and how can they be created well basically
you can have a couple of functions work on them and I’ll be walking you through the
functions as well we have map filter the groupBy function and
loads of other functions so when you have all these operations done
on your data most of the time you need to save
your data after you do certain operations on it right
so you need a stable storage medium and the medium can be HDFS
as well so after the transformations are done you need a place to
store the result so the transformations being worked on that particular data guys that
is what RDDs are mainly used for so there’s a couple of very important things that I need
to walk you through to start with
the first thing is in-memory computation then RDDs support lazy
evaluation fault tolerance immutability as I mentioned
partitioning persistence and coarse-grained operations as well
what are RDDs again RDDs are the main logical data unit in spark
which is the most important aspect of this module they are again a distributed
collection of objects which can be stored in memory or
on disk and not just of one single machine they can be
stored on the disks of multiple machines and one single RDD can be divided into
multiple logical partitions and these partitions can be stored and
processed on all of the machines in a cluster as well so again RDDs
can be cached they can be used again and again no matter how many
times you require them for any of your future transformations I mean these are
small files so at the end of the day you can keep caching them and
reusing them and coming to lazy evaluation do you guys know what
lazy evaluation is if you do not then again as we’ve already been doing
take a quick guess of what lazy evaluation means so I hope you
could guess it basically you have expressions in your programs
which are bound to variables and normally they are evaluated right away
lazy evaluation is to hold back the evaluation until the
result is actually needed that is basically what lazy evaluation means to delay your
evaluations let’s say you have an expression you will pretty much be delaying
the evaluation of it until its value is required so that is a very simple
explanation of what lazy evaluation is and again RDDs
support lazy evaluation and they support transformations and actions on
these RDDs as well so here’s a quick workflow of how an RDD works so we have
RDDs 1 and 2 which are again clubbed into a single unit and then
RDDs are like arrays right so they’re pretty much structured data so this
structured data gets converted into a graph a DAG that is a directed acyclic
graph and with a directed acyclic graph what usually happens is data flows from
one node to another in a structured manner where data cannot jump back
to an earlier node and each node is connected to another node which is ahead of
it so basically what I want you guys to figure out at this point of time is
that the RDDs go from looking like arrays of a lot of numbers
into a graph where our spark engine can understand
what’s going on so after our DAG scheduler finishes its work where it assembles
it orders it and stores it next up is the cluster manager a
cluster manager will figure out where to pull the data from right so it can be
EC2 it can be via Kubernetes or you can pull the data from a lot of
places so our cluster manager again is used for this so take all of the tasks
from our RDDs consider each individual RDD operation to be a task right so
this entire task set is pretty much pushed on to a task scheduler where the
cluster manager will figure out when to execute what and how long to
allocate a resource for that particular task and so on and so all of this
is pushed to the executor and our executor will go about
executing it and giving us the output that we eventually
require guys so why would we require RDDs think about it so just to answer
that question is the next part of
this module we need to find out what are
the gaps in the existing methodologies that
were there right so firstly I have in-memory computation guys so before the
existence of RDDs what used to happen was we had again loads of data back
in the day as well but all this data was being processed on a hard disk
a slow spinning hard disk and we already know that in-memory
computation can be done and executing something
in your RAM is in many cases a hundred or even a thousand times faster than when
it’s done on your hard disk so with computation again we needed
something to step up our game and we needed in-memory
computation basically so again with computation you’re gonna be using a
lot of storage and at the end of the day if that is slow
it literally is a turn off right so next we have job execution so basically whatever
was existing right so consider MapReduce
basically so MapReduce needs to store
some of its data it needs to have an intermediate storage space which
can be HDFS so it needs a place to store all the intermediate data to
access it and pick it up again with job execution another important point is
that the overall computation of your jobs is
usually slow so probably we can handle a slow execution if
it’s one job or ten jobs and it takes about five or ten minutes but if a
ten minute job was taking about 35 to 40 hours then yeah that would really slow
down things right so again overall it slows down your input/output operations
serialization operations replication operations and so on with parallel
processing again when we didn’t have spark most of the data processing was
not done in parallel first of all and even if an application
supported parallel processing it was not as efficient guys so now we
know the gaps how are RDDs solving the problem for us well
mainly RDDs as I’ve again mentioned have a huge
capability to handle large volumes of data so how do they
do it well basically they process it all in parallel an RDD has multiple
logical partitions and spark can work on each of the
partitions and save the result later guys so you can go about creating an RDD whenever
you want but it’s only executed whenever it is needed so you can go on
creating tens and hundreds of RDDs if you want and since spark
supports lazy evaluation they’ll only be picked up whenever they are needed guys
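To make the idea of lazy evaluation concrete, here is a plain-Python toy, not PySpark itself. `LazyDataset` and `noisy_double` are invented names for this sketch; the class only imitates the behaviour described above, where transformations are recorded and nothing runs until an action asks for the result.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not run."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, func):
        # Returns a new object; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        # Same idea: only the step is remembered.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # The action: only now are the recorded steps applied, in order.
        items = list(self._source)
        for kind, func in self._steps:
            if kind == "map":
                items = [func(item) for item in items]
            else:
                items = [item for item in items if func(item)]
        return items


calls = []


def noisy_double(x):
    calls.append(x)  # lets us observe when the work really happens
    return x * 2


doubled = LazyDataset(range(5)).map(noisy_double)
calls_before_action = len(calls)  # map() only recorded the step
result = doubled.collect()        # the action triggers the computation
```

Checking `calls_before_action` and then `len(calls)` shows that the function was not called at all until `collect()` ran, which is the same "picked up only when needed" behaviour the talk attributes to spark.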
let’s quickly check out some facts that explain why RDDs
are better than what existed before and the first important point I want you
guys to notice is that they support distributed computing guys so
distributed computing these days in terms of big data is again a big boon
right so again spark RDDs are useful for distributed computing which
involves processing lots and lots of data over lots and lots of computers
with respect to lots and lots of jobs right so having this distributed
computing capability is a big thing guys and secondly partitioning well RDDs
again are partitioned so by partitioned what I mean in simple
terms is that the data is split into a couple of logical partitions guys and
the partitioned data is distributed across all the nodes that
are present in the cluster as well so that is again a very important fact of
why RDDs are better and thirdly location stickiness guys
so basically RDDs can define their own placement preference so they can state the
location where their partitions should stick and where they
need to be present for the data requirement
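The partitioning idea can be illustrated with a small plain-Python sketch. This is an analogy only: `split_into_partitions` and `process_partition` are hypothetical helpers, and spark’s own slicing and placement rules may differ. The point is just that each logical partition can be worked on independently and the partial results combined afterwards.

```python
def split_into_partitions(records, num_partitions):
    """Split records into num_partitions contiguous, near-equal chunks."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    records = list(records)
    size, extra = divmod(len(records), num_partitions)
    partitions = []
    start = 0
    for index in range(num_partitions):
        # The first `extra` partitions take one additional record each.
        end = start + size + (1 if index < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def process_partition(partition):
    """Work done independently on one partition (here, a simple sum)."""
    return sum(partition)


partitions = split_into_partitions(range(10), 3)
partial_sums = [process_partition(part) for part in partitions]
total = sum(partial_sums)  # combine the per-partition results
```

In PySpark the number of partitions can be requested with the second argument of `sc.parallelize` and inspected with `rdd.getNumPartitions()`; in a real cluster each partition would be handled by a different executor rather than in a loop.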
so again location stickiness is another very important reason why
RDDs are required next up let’s quickly jump in to find out what are
the features of spark RDDs guys so as
I’ve already mentioned the first major important feature of spark
RDDs is in-memory computation guys so for any RDD all
the data present there can be pretty much pushed into your
memory and you can have it stay in that memory for as long as needed and
I’ve put it first because this has to be the most important feature of spark
RDDs guys and secondly with respect to immutability what you need to know
at this point of time is that an RDD pretty much cannot be changed once it is
created guys so that is what immutable means and sure you can perform certain
transformations on it but a transformation gives you a new RDD and the original is not
changed so RDDs are immutable guys lazy evaluation again the
data is not loaded or transformed until something is executed
which triggers the requirement so you can code 10 or 15 RDDs and
you can just leave them there and they’ll only be picked up when needed right so
supporting lazy evaluation is again another big advantage
coming to cacheability guys this is
again another very important feature of
spark RDDs where cached storage stores all of the intermediate
RDD results in memory so nothing is being pushed to your hard
disk or local storage and the biggest advantage with RDDs is that the
default storage for the cache is in memory guys so this ensures higher
efficiency better speeds of data access and all of that and the next important
thing is parallel data processing so to process data sequentially is
fine but processing your data in parallel side by side is
something which requires a lot of muscle so again we can
be very happy and satisfied with RDDs because RDDs can
process the data in parallel for us and lastly another important
feature of spark RDDs is that they can be typed guys so
your RDDs can be of the type
integer of the type long or of the type float and this again ensures
better readability it is more concise it gives you that tailor-made feeling
where you can have your data in perfect condition
and as soon as you glance at it you know if it’s integer
data if it’s data with large numbers using long or if it’s a decimal number
which uses float next up in this module we need to
check out the ways which are used to create RDDs
in PySpark guys again that’s the symbol of Colab that
you see on the right and we’ll check out how we can create
RDDs in that simple example I showed you right so we’ll be importing the spark
configuration we’ll be importing the spark context and to create a
spark context we’ll be creating a conf object guys so basically this
provides the entire configuration for our spark application you’ll be
setting the application name and you’ll be setting the master node as well
in the example that you see it is local because there’ll be only one
instance in this particular example that I’ll show so we’ll be defining the
app name and the master URL and the master URL is local in this case and
we’ll be using sc and sc is nothing but an object which is created from our
SparkContext guys so we can go about doing another quick set of operations as
well so we can create an RDD using a simple list so we have already
walked you through lists in module two so just quickly revise the
concept of lists so here we have a variable values and we have five
elements in a list and then we’re performing this operation called
sc.parallelize and we’ll be passing
the values and creating the RDD using that guys so sc.parallelize is an
unfamiliar syntax for you at this point of time right
so the parallelize method is something you apply on a collection
and in our case the list is the
collection so when you do that a new
RDD is created for us and all of the elements are
copied into this particular RDD so parallelize pretty much creates
a new RDD and all the syntax needs is parallelize and
a collection passed to it guys and to print an RDD’s values we
usually use the take function guys so
rdd.take of five will pretty much
give us the five items which are present in it and moving on there’s a couple of
things I’ll be showing you practically as well so to upload a file to Google
Colab we have this particular syntax from google.colab import
files and then we’ll be putting it into a variable uploaded is equal to
files.upload again that opens a GUI and I’ll be walking you through it in a second
and we can go about initializing an RDD using a text file
guys so if you have a particular text file
where you have a lot of data and you want your RDD to
hold the data in the text file sure you can do that as well and we
can go about printing the text from the RDD as well so let’s quickly
jump into Colab and we can check out the next set of demos guys
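The two ways of creating an RDD described above, from a list and from a text file, can be sketched as below. The helper names are made up for this sketch, and the file name `spark.txt` is the one used in the talk, so treat the path as an assumption. `sc.parallelize`, `sc.textFile`, `take`, `collect` and `google.colab.files.upload` are the calls the walkthrough refers to; the imports are kept inside functions so the small helpers can be read on their own.

```python
def rdd_from_list(sc, values):
    """Copy a local Python collection into a new RDD."""
    return sc.parallelize(values)


def rdd_from_text_file(sc, path):
    """Create an RDD with one element per line of the text file."""
    return sc.textFile(path)


def preview(rdd, count=5):
    """Bring the first `count` elements back to the driver as a list."""
    return rdd.take(count)


def upload_to_colab():
    """Open Colab's upload widget; only works inside a Colab notebook."""
    from google.colab import files
    return files.upload()


def main():
    # Imported here so the helpers above can be reused without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    values = [1, 2, 3, 4, 5]
    print(preview(rdd_from_list(sc, values)))
    # spark.txt is the uploaded file from the talk; the path is an assumption.
    print(rdd_from_text_file(sc, "spark.txt").collect())
```

Calling `main()` in a notebook cell, after the file has been uploaded, should print the five list values and then the lines of the text file.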
here are the demos that we’re talking about values is equal to one two
three four five it’s a quick list for us parallelize creates our RDD and as soon
as I hit take we’ll have the output guys so uploading files to Colab
again as soon as I run this just see what happens so it’s
still running because it’s waiting for me to pick a file so let me quickly
jump to our desktop we have a spark file which is a text file
basically nothing important I have
some gibberish written in there for this example guys so loading a text
file is again pretty simple sc.textFile of spark.txt the location
here is just spark.txt because we already have it inside the files
section so there is no other path that we have to provide
in the case of Colab guys so as soon as I run
this it will pick it up and as soon as I hit rdd.collect which is
basically used to print the RDD’s data it says Apache Spark with Python
is PySpark so this is the data our RDD has picked up
because that is what I have put in the text file guys so it is as
simple and straightforward as that moving on the next concept that we’ll be
checking out is something called as persistence and caching guys
so to begin with you already know that RDDs support lazy evaluation
right so you can use the same RDD any number of times you want
and spark will go on recomputing it each time as well
sure for a small amount of data that won’t do major
damage but for huge datasets say worth terabytes to
recompute them every single time is not very efficient right so to
avoid this recomputation of a single RDD multiple times you can
ask spark to persist the data so persisting the data will pretty
much store its partitions and hold them there so each node
computes its partitions of the RDD and stores them and
let’s say there is a node which has data persisted on it and it
fails then spark will recompute all of the lost partitions of the data
when it’s required so that is the key phrase here when it’s required not
all the time so again can you only
persist no RDDs come with a method which is called
unpersist so unpersist will let us remove all this data
manually from the cache so let us say you do not require it anymore and
the cache is again a small amount of space right so you should
not push everything onto the cache and have it sit stagnant there
basically you can use the unpersist method to
remove all the RDD data from the cache guys so here’s a quick
persistence level chart I have for you guys so with persistence levels
you need to know that this is basically to give the
user flexibility guys so with the MEMORY_ONLY level you’ll be pushing all
your RDDs into your cache directly that’s why the space used is high and
the CPU time is low because data is accessed faster it is all held
in memory and none is on the disk so again it stores an RDD as a Java
object basically and what happens if your cache is already full and the RDD doesn’t
fit well as much of it as possible is saved into the cache and the other
part which is not cached will be recomputed when needed guys so again
we have MEMORY_ONLY_SER we have MEMORY_AND_DISK we have MEMORY_AND_DISK_SER
and we have DISK_ONLY guys so at this point of time I would just
recommend you pause this particular section to glance through the levels
guys so with DISK_ONLY you have high CPU time which is
not efficient you push everything onto your disk only
and none of it is pushed to the cache none of the intermediate storage is
in the cache and all of your data is on the disk so this again is
inefficient right so quickly pause the screen and have a look at this
persistence chart guys so coming to caching guys we already know that you
know caching your data is very important when your data is being accessed
repeatedly right so again it is very important to go on caching it we
call it a hot dataset because you might require it at every single point of
execution say in an iterative algorithm and whatnot and caches are
fault tolerant as we’ve already discussed so let us quickly
go back to an example where you
have a dataset and you can pretty much have it cached guys so
let me quickly jump into Google Colab and we can work with it so coming to
persistence if you already have an RDD then with persist spark is going to hold it in the
cache and it does not tell you where it is holding it guys so this is again a
very important readability aspect of spark as well and coming to caching
we go ahead with spark.txt again spark.txt is the file which
we imported a couple of minutes ago so with that again we call text_file.cache
so basically text_file.cache is used to hold the data there and
at the end of the day as a developer as a programmer to know that
your data is fault tolerant and it will be there if something messes up is
something important guys so now that we’re done with that
let’s quickly jump into operations that we can perform on RDDs guys there’s
a lot of operations that you can eventually perform but
there’s two main types of operations so the
first one is called a transformation think about it right RDDs are
immutable as we’ve already discussed so to work around this we need something
called a transformation so a transformation basically returns a new
RDD guys so a lot of emphasis on the word new guys since an RDD is immutable
creating a new RDD is unavoidable so creating a new RDD is a
requirement guys again an action what an action will do is basically
return one single value to us and that will make our lives a lot easier
so you will basically figure it out when you’re getting
your hands dirty with a lot of code and yeah without further ado let’s quickly
check out RDD transformations guys
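The difference between a transformation and an action, together with the persist and unpersist calls from the caching part, can be sketched like this. The helper names `is_even` and `count_and_sum_evens` are invented for the example; `filter`, `persist`, `count`, `sum`, `unpersist` and `StorageLevel.MEMORY_ONLY` are PySpark calls. The function expects an existing spark context `sc` and is only a sketch of the pattern, not the code from the session.

```python
def is_even(n):
    """Predicate used by the filter transformation."""
    return n % 2 == 0


def count_and_sum_evens(sc, limit=10):
    """Run one transformation and two actions over numbers below `limit`."""
    # Imported here so is_even can be tried without Spark installed.
    from pyspark import StorageLevel

    numbers = sc.parallelize(range(limit))
    evens = numbers.filter(is_even)          # transformation: a new RDD, nothing runs yet
    evens.persist(StorageLevel.MEMORY_ONLY)  # ask spark to keep it once it is computed
    how_many = evens.count()                 # action: triggers the job and returns a value
    total = evens.sum()                      # second action can reuse the persisted data
    evens.unpersist()                        # remove it from the cache when done
    return how_many, total
```

With the default `limit=10` the even numbers are 0, 2, 4, 6 and 8, so the function returns `(5, 20)`; note that `numbers` itself is never modified, the filter hands back a new RDD.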
so again transformations like the rest of our RDD concepts are all lazy
operations that are basically performed on one RDD and doing this will
eventually lead to creating one or more new
RDDs as well and how will it work well an RDD
transformation pretty much returns a single pointer
to the new RDD so our application needs to know where the new
data is going right so basically the pointer is used to map that so our
transformation will return a pointer to the new RDD and it will
allow us to create the dependency which is required between each of
these RDDs because let’s say you have a couple of RDDs
and each has a unique function to it and
then you need to let the RDDs talk to each other calculate the
data have it stored in one common place and all of that right so there’s usually a
parent RDD and under the parent RDD we have what we call a pointer dependency
towards the child RDDs and we work with that
but at this point of time let me not bombard you with a lot of
in-depth details all you have to know at this point of time is pretty much that there
is a dependency which exists between RDDs guys again spark is
lazy right so by lazy what we mean is lazy evaluation so nothing
will get executed when you only define a transformation it waits until
you have an action that will actually create a trigger so the trigger
is basically what starts the operations so you can have
job creation and execution there as well so an RDD transformation
is basically not a set of data you might be wondering if
it’s a set of data at this point of time but it is not it is a step in a
particular program it can be the only step or it can be one among many steps
and so how the transformation works is it basically tells spark how to get the
data and what to do with it guys so a couple of transformations
that we’ll be walking you through
are basically map we’ll also be checking out flatMap we can check
out filter we can check out mapPartitions mapPartitionsWithIndex we can
check out sample union intersection all of these guys so without further ado
let’s quickly check out each of these operations guys so the first
transformation we’ll be looking at is map, guys. What map does, in simple
terms, is pass every single element through a function, so think of mapping
in literal terms, right? Let’s quickly jump into the code, because it will be
easier to understand with code. Again, we have a list here which is all
strings: spark, rdd, example, sample, and another example. As soon as we use
x.map and pass a lambda function through it, it lets us map each element into
a key-value pair, which you can see here: spark is mapped with 1, rdd is
mapped with 1, sample with 1, example with 1, and all of that. So a lambda
function is what performs the mapping here. Again, this is the code that we
just walked you through. You can map the key-value pairs to be anything that
you want, and it is used to add a little more structure to our data, guys.
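As a quick sketch of the map step just described: this is plain Python so it
runs without Spark, and the PySpark call it mirrors sits in the comment. The
variable names and the SparkContext name sc are assumptions for illustration.

```python
words = ["spark", "rdd", "example", "sample", "example"]

# PySpark equivalent (assuming a SparkContext named sc):
#   sc.parallelize(words).map(lambda w: (w, 1)).collect()
pairs = list(map(lambda w: (w, 1), words))

print(pairs)
# [('spark', 1), ('rdd', 1), ('example', 1), ('sample', 1), ('example', 1)]
```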
So next we have flatMap, guys. What I want you to know at this point is that
flatMap is pretty much exactly like map, but each input item can be mapped to
anywhere from 0 to any number of output items. Why would you do this? Let’s
say you have a function that should return multiple outputs, right? We already
know that the majority of our functions will only return one output, but
let’s say we require a function which returns more than one output, say a
sequence. In that particular case we will require flatMap, guys. So again we
have sc.parallelize, where we’re creating three elements, 2, 3 and 4, and then
we pass it on to flatMap with a lambda function that returns range(1, x). The
collect is added on the same line here instead of doing it separately;
earlier we had separated out y.collect, and you can go about adding it there
as well. So flatMap runs the function on every element and flattens all the
returned values into one collection for us, and as you can check out, the
output in sorted order is 1, 1, 1, 2, 2 and 3, guys. Again, you may not use
flatMap every day, but when you need to map one element to multiple elements
it’s good to know, right?
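Here’s a small sketch of what flatMap is doing with that example, written in
plain Python so it runs anywhere. The names per_element and flattened are just
for illustration, and sc in the comment is an assumed SparkContext.

```python
from itertools import chain

data = [2, 3, 4]

# PySpark equivalent (assuming a SparkContext named sc):
#   sc.parallelize(data).flatMap(lambda x: range(1, x)).collect()
per_element = [list(range(1, x)) for x in data]   # what map alone would give
flattened = list(chain.from_iterable(per_element))

print(per_element)        # [[1], [1, 2], [1, 2, 3]]
print(flattened)          # [1, 1, 2, 1, 2, 3]
print(sorted(flattened))  # [1, 1, 1, 2, 2, 3]
```

The middle line is the difference from map: map would keep one list per input
element, while flatMap gives one flat collection.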
And next up is filter. Filter returns a collection of only the elements that
pass whatever you are filtering on, right? Think about filtering in general:
you’re filtering something based on name, based on ID, based on your roll
number or whatever. Filtering here works the same way as well. You have a
collection, in our case mostly lists, and we provide a condition to it: if
this condition is valid, just collect all those elements for us. So coming to
the filter code, we have 1, 2, 3, 4, 5, and for each individual element we
perform the modulo operation with 2; if the remainder is zero, then we go
about collecting it, guys. Divide 1 by 2 and you get a remainder of 1, so
that’s not picked. Divide 2 by 2 and the remainder is 0, so yes, pick that.
Divide 3 by 2 and you get a remainder of 1, so that will not get picked. 4
divides by 2 with a remainder of 0, so yes; 5 doesn’t. So when you go about
running it, you will get 2 and 4. If you put in a 6 here, then you will get
2, 4 and 6 as well. So it’s as simple as that, guys. Next up, we have the
sample operation.
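Before moving on, the filter walkthrough above can be sketched like this. It
is plain Python so it runs without Spark; the PySpark call is in the comment,
with sc as an assumed SparkContext name.

```python
data = [1, 2, 3, 4, 5]

# PySpark equivalent (assuming a SparkContext named sc):
#   sc.parallelize(data).filter(lambda x: x % 2 == 0).collect()
evens = list(filter(lambda x: x % 2 == 0, data))
print(evens)  # [2, 4]

# Adding a 6 to the input, as suggested in the walkthrough
evens_with_six = [x for x in data + [6] if x % 2 == 0]
print(evens_with_six)  # [2, 4, 6]
```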
Sampling here is the process of picking a small amount of data out of the
RDD, either with replacement or without replacement, right? The sample
doesn’t have to keep any particular order of values; a random number generator
decides which data you get. The sample method supports three main parameters,
guys. The first one is withReplacement: with replacement means the same
element can be sampled multiple times, because every time you sample an
element it is put back and can be picked again. Next up is fraction: with
fraction you set the size of your sample as a fraction of your RDD’s entire
size, roughly speaking, since the result is not guaranteed to be exactly that
size, and you can do this without replacement of data as well. And next we
have seed: the seed parameter acts as the seed for the random number
generator that decides which elements get picked. So again, here we’re
creating numbers with a range up to 9, and sample will go about generating
sample data from it for us, and you can work with that as well, guys.
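To make the three parameters a bit more concrete, here is a rough model of
sampling without replacement. It is plain Python, and the helper name
sample_without_replacement is made up for this sketch. Spark uses its own
random number generator, so the actual elements PySpark picks for a given seed
will differ; only the idea of fraction and seed carries over.

```python
import random

def sample_without_replacement(items, fraction, seed):
    """Keep each element independently with probability `fraction`."""
    rng = random.Random(seed)  # same seed, same picks
    return [x for x in items if rng.random() < fraction]

data = list(range(9))  # 0..8

# PySpark call this loosely mirrors (assuming a SparkContext named sc):
#   sc.parallelize(range(9)).sample(False, 0.5, 42).collect()
first = sample_without_replacement(data, fraction=0.5, seed=42)
second = sample_without_replacement(data, fraction=0.5, seed=42)

print(first == second)          # True: the seed makes the sample repeatable
print(set(first) <= set(data))  # True: every sampled value comes from the input
```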
So to check out another example out here, we can try a different range of
data. But at this point, just quickly pause the screen, take a look at the
code, and then we’ll be jumping to Google Colab, guys. So this is Google
Colab, and I just executed the second piece of code. It is creating sample
data from the values 0 to 8, and at the end of the day we are collecting the
sample data as well. And the next important operation that we should look at
is the union operation, guys.
So the union of two RDDs is basically concatenating and mixing all those
elements, and yes, union does keep repeated elements from both of our RDDs,
guys. So let’s quickly switch to Colab and execute it. We have 1 to 9 in our
first parallelize, and then we have another variable called pair, which takes
the range from 5 to 15. As soon as you perform the union operation on these
and collect it, it is going to merge everything into one collection, and that
is all the output that you’re greeted with, guys. The next operation is
intersection, again similar to union, but then it is the literal
intersection, right? Whatever is in common between both of these RDDs is what
we’ll be getting out of it. So quickly executing this piece of code with the
ranges 1 to 9 and 5 to 15, the numbers being generated here are 5, 6, 7, 8,
which are what is common between 1 to 9 and 5 to 15, right? That’s exactly
what’s printed. So the intersection is the common elements in both the data
sets, and it’s as simple as that, guys. Next we have distinct. Well, with
distinct we are creating a new RDD which holds only the distinct elements,
right? As I’ve already mentioned, union keeps repeated elements, but let’s say
we do not require the repeated elements; sometimes we require only the
distinct elements out of it. To go about doing that, the code is exactly the
same as our old operation, so here again we’re doing the union operation, and
we’ll just be adding the .distinct method, guys. So .distinct tells Spark to
give us only the distinct output and to filter out all of the repeated
elements, guys. Another important thing that you can do with RDDs is to sort
them by something. You can go on mapping elements to another key, which we
will be seeing, but what if you have to go about sorting with respect to your
own requirements, right? So here we have the sortBy function, and by default
it sorts in ascending order, guys. We have random numbers out here, 5, 7, 1,
3, 2, 1, and as you can already see they are not ordered in any fashion. But
as soon as I go ahead and run this, you can check out that it’s ordered,
right? So 1, 1, 2, 3, 5, 7. If we go about adding 10 here, then it will
remain last, because it is the highest number. So again, after the data is
mapped, can we go on sorting it, as we’ve already seen in the slides? Yes, we
can.
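Here’s one sketch covering union, intersection, distinct and sortBy together,
in plain Python so it runs without Spark. Reading the two inputs as range(1, 9)
and range(5, 15) is an assumption based on the walkthrough, and the variable
names are just for illustration.

```python
a = list(range(1, 9))    # 1..8
b = list(range(5, 15))   # 5..14

# rdd_a.union(rdd_b): plain concatenation, duplicates are kept.
union = a + b

# rdd_a.intersection(rdd_b): elements present in both (Spark does not promise an order).
intersection = sorted(set(a) & set(b))

# rdd_a.union(rdd_b).distinct(): repeated elements removed.
distinct = sorted(set(union))

# rdd.sortBy(lambda x: x): ascending by the key function.
ordered = sorted([5, 7, 1, 3, 2, 1], key=lambda x: x)

print(len(union))      # 18
print(intersection)    # [5, 6, 7, 8]
print(len(distinct))   # 14
print(ordered)         # [1, 1, 2, 3, 5, 7]
```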
So this will be ordered again in ascending order of whatever each element is
mapped to: whatever is mapped to 1 comes first, whatever is mapped for L
comes next, then H comes because it is the next one at 10, and lastly we will
get a, which is 26. So sortBy is as simple as that, guys. Let’s quickly check
out the next one. The next is something called mapPartitions, and
mapPartitions is a simple alternative for when you would otherwise use map or
foreach, guys. With mapPartitions, your function is called only once for
every partition. If you remember, in the case of map, what used to happen is
that the function was applied element by element, right? We used to call it
every single time, for every single element of the collection. So to bring up
our efficiency, what we do with mapPartitions is call it only once per
partition. We’ve defined our function f here with a parameter called
iterator, and it does something called yield sum(iterator). What happens here
is that you pick up a partition’s data once and iterate through it once,
rather than calling the function on a single data item at a time. But in that
particular instance, what we saw was not indexed, so you do not know which
partition the data came from. So to get a quick idea of how you can pull up
the data along with its partition index, we
have mapPartitionsWithIndex, guys. It’s a single word, but I’ve split it out
in the title just to get better readability. What mapPartitionsWithIndex does
is return a brand-new RDD for us by applying a function on each partition of
the RDD while tracking the index of the original partition. So we need to
find out what the original partition’s index is, and then we go on keeping
track of it, guys. We have 1, 2, 3, 4, and 4 partitions. Instead of just
passing in an iterator, our function now also takes something called
splitIndex, which gets pushed into it. And as you can see, instead of just
using rdd.mapPartitions and collecting it later, we’re going with
mapPartitionsWithIndex and then running the sum operation on it, guys. Again,
you will not be using all of these transformations on a day-to-day basis, but
it’s really good to know that you have access to all of them and can go on
using them, guys. So next up, we have groupBy.
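Before group by, here is a sketch of the mapPartitions and
mapPartitionsWithIndex examples just described. It is plain Python; the split
helper is a made-up stand-in for how an RDD is cut into partitions, and using
2 partitions for the first example is an assumption.

```python
def split(items, num_partitions):
    """Cut a list into contiguous chunks; a stand-in for RDD partitions."""
    n = len(items)
    return [items[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

def f(iterator):
    yield sum(iterator)          # runs once per partition, not once per element

def g(split_index, iterator):
    yield split_index            # also receives the index of the partition

# Mirrors rdd.mapPartitions(f).collect() with 2 partitions
two_parts = split([1, 2, 3, 4], 2)
sums = [out for part in two_parts for out in f(iter(part))]

# Mirrors rdd.mapPartitionsWithIndex(g).sum() with 4 partitions
four_parts = split([1, 2, 3, 4], 4)
index_sum = sum(out for i, part in enumerate(four_parts) for out in g(i, iter(part)))

print(two_parts)   # [[1, 2], [3, 4]]
print(sums)        # [3, 7]
print(index_sum)   # 6
```

The last value is 0 + 1 + 2 + 3, the sum of the four partition indices.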
So if you guys are familiar with SQL programming, the group by here works
pretty much the same way as well. What we do is return a brand-new RDD, as
we’ve been doing with all the transformations, and we do this by grouping the
objects in the already existing RDD. We’ll be using a grouping key to know
what we are grouping by. So let’s quickly jump into the code where we can
check out groupBy, guys. We have a collection 1, 1, 2, 3, 5, 8, and
rdd.groupBy is what we’ll be using. We’re going to group the values by the
result of modulo 2, and we’ll be collecting that, guys, and later we’re
sorting it as well. So let’s go about executing this. We get 0 with 2 and 8,
that’s valid, and we get 1 with 1, 1, 3, 5 as well. So basically you’re
grouping with respect to a key, you’re getting key-value output out of it,
and then you’re sorting by the key, guys. The sorting is very simple. So the
next thing is something called keyBy. What keyBy does is create a key for
each element of the RDD that we have been working on for a while now, and it
does this using the key function we give it, guys. If you check out the code,
the key function here is a squaring operation, x multiplied with x, and we
have values from the range 0 to 5 with us, and we’re performing the zip
operation on it with y. We don’t know yet what this zip operation means,
which is exactly what we’ll be checking out next, but what you need to know
is that we’ll have a key, and you can have the individual items of your data
arranged with respect to that key, guys.
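Here’s a small sketch of both groupBy and keyBy, in plain Python so it runs
without Spark. The variable names are just for illustration, and keying
range(0, 5) by x * x is an assumption based on the walkthrough.

```python
from collections import defaultdict

# Mirrors rdd.groupBy(lambda x: x % 2): collect the values under each computed key.
groups = defaultdict(list)
for value in [1, 1, 2, 3, 5, 8]:
    groups[value % 2].append(value)
grouped = sorted((key, sorted(values)) for key, values in groups.items())

# Mirrors rdd.keyBy(lambda x: x * x): pair each element with the key computed from it.
keyed = [(x * x, x) for x in range(0, 5)]

print(grouped)  # [(0, [2, 8]), (1, [1, 1, 3, 5])]
print(keyed)    # [(0, 0), (1, 1), (4, 2), (9, 3), (16, 4)]
```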
So what is zip? The zip operation is used to join two RDDs by combining the
elements of each one pairwise, right? Think of it as a literal zip, where the
two sides lock into each other: you have one part of one side meeting one
part of the other. It is something like that here as well. What we do is take
an RDD formed from this list, and we create another collection by combining
the elements of the two RDDs in pairs, so each pair takes one element from
one RDD and one element from another RDD called y, and you can check out the
example on the screen as well. So we will be picking out the elements from
RDD x, with the data values 1 and 2, and merging them with the data values of
y, which has the values 4 and 1, right? It’s pretty much as simple as that.
This will work when we have both of the RDDs of the same size, right? So what
if one of the collections has a bigger size? Python’s built-in zip would pair
up the common elements and ignore whatever is in excess, but RDD zip expects
both RDDs to have the same number of partitions and the same number of
elements in each partition, so don’t count on mismatched sizes working. So
let’s quickly check out our zip function. We have x as the range 0 to 5 and y
as the range 1000 to 1005, we’ll be zipping y with respect to x, and you’re
going to be collecting. So we get 0 comma 1000, 1 comma 1001, 2 comma 1002,
3 comma 1003, and 4 comma 1004.
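The same pairing can be sketched with Python’s built-in zip. One caveat: the
built-in silently stops at the shorter input, while the RDD zip documentation
assumes both RDDs have the same number of partitions and elements per
partition, so this sketch only mirrors the equal-size case. The name sc in the
comment is an assumed SparkContext.

```python
x = range(0, 5)
y = range(1000, 1005)

# PySpark equivalent (assuming a SparkContext named sc):
#   sc.parallelize(x).zip(sc.parallelize(y)).collect()
pairs = list(zip(x, y))
print(pairs)  # [(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]
```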
Perfect. So next, we’ll be checking out our zipWithIndex. Again, it’s a
single word, zipWithIndex, but for readability it is split out separately on
the slide. What we do here is create a new RDD again, pairing all of the
elements that are in our list with their index. So whatever list is present
for us, we’ll be going through all the elements that are present there and
pairing each one with its index as well. You need to know that with Python,
and with Spark as well, so with PySpark, all the indices start with zero,
guys. Since we’re just adding an index, for a, b, c, d the output would give
0, 1, 2, 3, guys: a will be mapped to 0, b will be mapped to 1, c will be
mapped to 2, and d is mapped to index 3.
Pretty simple, right? So next we have the repartition transformation. If you
have ever played around with installing Windows or Linux, then you will be
familiar with what repartitioning a storage medium means: you’ll be changing
the number of partitions, guys. Here it is used to either increase or
decrease the number of partitions in an RDD. So how does it work in the back
end? Well, all it does is an entire shuffle of all of the data, and it creates
new partitions where our data can be placed in a balanced manner, so the data
in all the partitions is distributed evenly as well, guys. So let’s quickly
jump into Google Colab and analyze. The data we have is 1, 2, 3, 4, 5, 6, 7,
and the second argument is 4, the number of partitions. We’re sorting, and
we’re using rdd.glom to see how the elements sit in the partitions. The
output we should expect is 1 as the first element, the second element to be
2, 3, the third element 4, 5, and the fourth element to be 6, 7. Perfect.
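Here’s a sketch of what glom is showing in that output: one list per
partition. It is plain Python, and the slicing rule in the helper is an
assumption chosen because it reproduces the output quoted above; it is not
meant as a statement of how Spark places data in general.

```python
def parallelize(items, num_slices):
    """Cut a list into contiguous slices (assumed rule, chosen for illustration)."""
    n = len(items)
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

# Mirrors: sorted(sc.parallelize([1, 2, 3, 4, 5, 6, 7], 4).glom().collect())
partitions = parallelize([1, 2, 3, 4, 5, 6, 7], 4)

print(partitions)       # [[1], [2, 3], [4, 5], [6, 7]]
print(len(partitions))  # 4
```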
Guys, to find out the length of this, coming back to my presentation: to
find out the number of partitions, you can use repartition together with
glom. So with repartition of 2, then glom, collect and len, as soon as I go
ahead and run this the output is 2, which tells us we’re now working with two
partitions, and two partitions have been created where the data is being
stored uniformly, guys. So it’s pretty simple. Next we have coalesce. Now the
practical part of coalesce is this. Think about it, right: you will have
thousands of partitions when you’re working with a huge data set. Coalesce is
used to combine, to unite, a couple of partitions. Whether the data in them
is similar or not, if you just want to bring it together under fewer
partitions, then coalesce is your friend, guys. So we have two arguments
here, the data 1, 2, 3, 4, 5 and the number 3. As soon as we use glom
together with coalesce, as you can already check out, we have one single
entity which is giving us the data,
and the data is being merged and united into one, guys. Here’s
another example of how coalesce works. With coalesce again, you can already
check out that the elements are split up across partitions when we use just
glom in the first example, and in the second example you’ve gone about
coalescing it and bringing it into one single unit, I mean one single entity,
sorry. So next: what difference does coalesce have with respect to
repartition? With coalesce, you need to know that we’ll be using the existing
partitions which are already present, guys, so it is used to keep the amount
of data being shuffled to a minimum. But repartition always does a full
shuffle of all your data, and it creates brand-new partitions for the RDD,
right? Again, coalesce results in partitions that are stored with different
amounts of data, so not every partition is the same size, whereas repartition
tries to partition your data in such a way that the partitions are almost
equally sized. Coming to the speed of execution, coalesce is usually a lot
faster, because first of all the shuffle is minimized, and second, it reuses
the partitions that already exist instead of rebuilding them. So whenever the
number of partitions needs to be brought down, coalesce is generally faster
than repartition, and repartition is a little slower when you compare it to
coalesce, guys.
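The difference can be modelled in a few lines of plain Python. Both helpers
here are illustrative assumptions, not Spark’s real placement logic: coalesce
is modelled as merging neighbouring partitions, and repartition as dealing
every element out again round-robin.

```python
def coalesce(partitions, n):
    """Merge neighbouring partitions into n groups; elements stay with their partition."""
    k = len(partitions)
    groups = [partitions[i * k // n:(i + 1) * k // n] for i in range(n)]
    return [[x for part in group for x in part] for group in groups]

def repartition(partitions, n):
    """Full shuffle: deal every element out again across n new partitions."""
    flat = [x for part in partitions for x in part]
    return [flat[i::n] for i in range(n)]

# e.g. sc.parallelize([1, 2, 3, 4, 5], 3).glom().collect()
parts = [[1], [2, 3], [4, 5]]

print(coalesce(parts, 1))       # [[1, 2, 3, 4, 5]]
print(coalesce(parts, 2))       # [[1], [2, 3, 4, 5]]
print(repartition(parts, 2))    # [[1, 3, 5], [2, 4]]
```

Notice that the coalesced result is uneven, one element against four, while
the repartitioned one is close to equal, which matches the trade-off
described above.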
So all this while we were checking out transformations with respect to RDDs,
right? Next we can go about checking out all the actions we can perform on
RDDs, guys. As we are already aware, transformations on RDDs create another
brand-new collection for us. Unlike transformations, with actions what we
actually do is create a value which is given back to our Spark driver
program, guys. And the action is what acts as the trigger: an action has to
be run on either a previously constructed RDD or one coming from another data
entry point, guys. With respect to the action functions, there are a couple
that we’ll be checking out: the reduce function, first, takeOrdered, take,
count, collect, foreach, the min and max functions, the standard deviation
function, sum, mean, variance, and all these operations, guys. So let’s
quickly start out with reduce.
So with reduce, what it basically does is bring together all the elements of
a dataset by running them through a function of our choice. So let me
quickly jump to Colab, where we can check out reduce. The syntax is again
pretty simple, guys: for every action and every transformation we have a dot
method which corresponds to it, and the function that the elements need to be
passed through is the parameter that the reduce method is supposed to have.
As soon as we run this, it is going to add all of the elements that are
present in our RDD, guys. So just passing the data through a simple add
function is what we’re doing out here. Here’s another simple example, where
we generate the values with a for over a range, and all these values are
being added together as well and given as the output. We’re running a lambda
function too, like we’ve been doing for a while now. So if you guys are
having a tough time remembering lambda functions, I suggest you quickly jump
into our previous module to get a quick rundown of what a lambda function
actually does, guys. So the next action that we will be checking out is
first, guys. First, if you can guess it, gives us the first element in our
collection, right? So go ahead and hit this: I have 2, 3, 4, and the first
element has to be printed, which is 2. Put in 1, 2, 3, 4 and run .first, and
it is going to give us 1. It is as simple as that, guys. With parallelize
we’ll be inputting our collection, and we’ll be running the .first method on
it. The next action is takeOrdered, guys.
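Before takeOrdered, here is a sketch of reduce and first in plain Python. The
input list for reduce is an assumption, since the walkthrough doesn’t spell
it out; the PySpark calls are in the comments, with sc as an assumed
SparkContext name.

```python
from functools import reduce
from operator import add

# Mirrors sc.parallelize([1, 2, 3, 4, 5]).reduce(add)
total = reduce(add, [1, 2, 3, 4, 5])

# The same thing with a lambda instead of operator.add
total_lambda = reduce(lambda a, b: a + b, [1, 2, 3, 4, 5])

# Mirrors sc.parallelize([2, 3, 4]).first()
first = [2, 3, 4][0]

print(total, total_lambda, first)  # 15 15 2
```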
What we do here is return an array with a certain number of values from the
RDD, taken in sorted order, guys. Consider the example on your screen: we
have the input 1, 5, 3, 9, 4, 0, 2, and we’ll be taking elements in order
with the value 5. What we’re trying to tell it is that we need five values,
and we’ll be taking them in the ordered sense. The first five values in the
ordered sense will be 0, 1, 2, 3, 4. So again, as you can check out, we have
0, 1, 2, 3, 4, because these are the first five ordered elements that you are
outputting from the RDD that is present, guys. Also, next we have take. Take
again picks up your values, but it does it without ordering them: whatever
order the collection was put in is exactly what it considers here, guys. So
we have 1, 5, 3, 9, 4, 0, 2 and so on, and take here will be picking up the
first 5 elements in the order of their occurrence. As you can check out,
1, 5, 3, 9, 4 is being picked up, whereas earlier we were picking values up
in the ordered way. So that’s the difference between takeOrdered and take,
guys.
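Side by side, the two actions look like this in a plain-Python sketch. The
input list is reconstructed from the outputs read out above, so treat it as an
assumption.

```python
import heapq

data = [1, 5, 3, 9, 4, 0, 2]

# Mirrors rdd.takeOrdered(5): the five smallest values, in ascending order.
take_ordered = heapq.nsmallest(5, data)

# Mirrors rdd.take(5): the first five values in the order they were put in.
take = data[:5]

print(take_ordered)  # [0, 1, 2, 3, 4]
print(take)          # [1, 5, 3, 9, 4]
```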
So next we have count. I guess you can already guess, right? It tells us the
number of elements that are present in an RDD. Before we go ahead and hit
count, let’s count manually first: 1, 2, 3, 4, 5, 6, 7, so we have 7
elements. As soon as we run .count on this, it should tell us that there are
7 elements present. Let’s go ahead and add another one just for the sake of
it, and it says 8. So the .count method gives us the number of elements which
are present in that particular RDD, guys. And next we have the collect
action. With the collect action, what we’re trying to do is get all of our
elements back as an array, guys. With transformations we were creating
brand-new RDDs, but with actions we’re not doing that, right? With actions,
let’s say our driver program requires an array, the entire data as one
structured collection: collect does exactly that, and it is what we’ve been
using and seeing for a while, right? We’ve been using collect for a while
now, so I’m sure you guys could have guessed it. The collect action just
pulls the data out and gives us the array output that we’ve been seeing from
our first section. And again, here is another example, which is just a
different collection of elements that we’re pulling out. The difference
between the second one and the first one, if you guys can see clearly, is
that we have the distinct method present here. So we have gnu, cat, rat, dog,
where words were repeated twice, right? Those repeats are being avoided just
because we’re making use of the distinct method, guys. So collect is very
simple, and next
we have saveAsTextFile. Let us say you want all the data which is stored in
your RDD and you want to push it out as a text file, or I’m sorry, pull it as
a text file. If the RDD’s data set has to be presented as text to you, and it
has to be stored somewhere in your local file system or your HDFS, then this
is how we’ll be going about doing it, guys. With saveAsTextFile here, we have
sample data which is generated for us, and as soon as I hit this it gets
written out. If you’re using Linux or a virtual machine, this is usually how
the path goes: slash user, slash bin, slash mydata, and a1 is the output where
the random data is stored, guys. Keep in mind that Spark writes that output
as a directory of part files rather than one single file. Again, instead of
random data we can just go about giving our own data and then having it saved
as a sample file for us as well. Since we’re doing this on Google Colab, as
with a notebook, the files which are output are saved online. But if you’re
doing this on a local machine, you’d have to give the full path of the file
that you need to put. Let’s say you want it on your desktop: you cannot just
say desktop, you have to go about giving the C drive, Users, your user name,
then Desktop, and the file name, right? One important thing that you need to
know is that, with Windows or with Linux, the extension is very important,
guys. I have had a couple of learners come to me with an error where the file
doesn’t open, because your native operating system, such as Windows or Linux,
does not understand the type of file which is being processed. So that is
another important thing which I wanted to tell you guys about. Coming back to
the slides, the next one is foreach.
So with foreach, the practical, literal use of foreach is that we’re passing
every single element into a function of our choice, right? Here we have
defined a function called f with a variable x. What does the function have to
do for us? All it has to do is print. So we have defined the function, it’s
printing, and then all we’re doing is taking this RDD and using the foreach
method to go on printing each element out, guys. So let’s quickly check it
out: it is printing each element as we run the function across them with
foreach, guys. Next we have something called foreachPartition. With
foreachPartition, what we’re trying to do is execute a function once for
every single partition. The access to all the data elements that are
contained in a partition, that is, how we go about moving across every single
element, comes from an iterator that is passed as an argument, guys. So we’ll
be given an iterator to tell us how we can step across each single element
and how we can go about printing it, guys. So foreachPartition is very
simple. We have a function f which runs through each of the items in the
iterator and prints x, and we have 1, 2, 3, 4, 5, which is our RDD. As soon
as we go ahead and run it over all the partitions, you’ll be greeted with the
same output of 1, 2, 3, 4, 5 as well. And now
coming to a little bit of mathematical operations, we’ll check out what min,
max, sum, mean, variance and standard deviation all mean. Spark RDDs support
actions like min, max, sum, mean, variance and standard deviation as well. I
think it’s better to move to Colab and show you. So let’s say we have
numbers, a variable where we’re generating an RDD for the values from
range(1, 100). As soon as you go ahead and sum the values, the value of 1
plus 2 plus 3 and so on, up to 99 because the 100 itself is not included, is
4950. Running numbers.min is going to tell us what the minimum number is, and
in our case the minimum number is 1. Next we have variance. Variance is
calculated by first finding out the mean. The mean is the average, the sum
divided by the count; for evenly spaced values like ours it happens to land
on the middle element of the ordered set. The variance is then how much the
individual values vary with respect to the mean, taken as the average of the
squared differences. Here in our particular case, for the values from 1 to
99, the variance we get is about 816.67, and it remains constant for that
particular dataset, guys. Then numbers.max: what is the max element that is
present? Our range starts at 1 and goes all the way up to 100, with 100 not
included, right? So the maximum number in our particular case is 99. Now,
from 1 to 99, if you check out the mean, the middle element, it will come out
as 50, right? That is exactly what we’re doing here, so numbers.mean will
give us the mean, guys. And standard deviation is how much each single
element deviates with respect to the mean; it is the square root of the
variance. We’ve calculated the mean, then we need to find out how much each
single element moves away from the mean, guys, and in this particular case it
is around 28.5. It is pretty much as simple as that.
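These numbers can be checked with Python’s statistics module. The sketch below
uses the population variance and population standard deviation, which is what
matches the figures quoted above; the variable names are just for
illustration.

```python
import statistics

numbers = list(range(1, 100))   # 1..99, the 100 itself is excluded

total = sum(numbers)
mean = total / len(numbers)
variance = statistics.pvariance(numbers)   # population variance
stdev = statistics.pstdev(numbers)         # population standard deviation

print(total, min(numbers), max(numbers))    # 4950 1 99
print(mean)                                 # 50.0
print(round(variance, 2), round(stdev, 2))  # 816.67 28.58
```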
So next up, we'll quickly check out a couple of RDD functions, guys. We've
already walked through a couple of these, right? So we've seen cache: caching
pretty much stores an RDD so that it is not computed again. With collect,
we'll be returning all of the elements. countByValue will return us a map
with the number of times that each value occurs. Then we have distinct: it
returns an RDD which contains only distinct elements, with no repeated
elements. In terms of filter, filter returns an RDD containing only the
elements that satisfy the given filter function. And foreach: basically we'll
be applying one single function, in our case f, to all of the individual
elements of the RDD that is being worked on, guys. With persist, as you've
already checked out, we set the storage level. If it is memory-only, then all
of the intermediate, cached data is stored in memory all the time, but we can
force it onto our hard disk as well. That is what the persistence-level table
is for, and that's all you need to check out. So several storage levels are
available, and there are a few things to consider before using the persist
operation, because at the end of the day you might require one dataset a lot
of times but not need to get at it very quickly. In that case you can have it
stored on your hard disk, and you wouldn't mind that extra stretch, right?
Something like that. And a couple of other functions that you've already
checked: we have the unpersist function, where we'll be forcing Spark to
remove the persisted blocks from memory. We've checked out union, we've
checked out intersection. count returns the number of elements that are
present in our RDD, and sample returns an RDD containing the given fraction
of the data, guys. So next we can check out countByValue.
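A few of the functions recapped above can be sketched in plain Python to show what each one returns. This is only a stand-in that runs without Spark: the list is the one read out in the session as best it can be recovered from the transcript, and each comment names the PySpark call it imitates.

```python
from collections import Counter

# Values as read out in the session: 1..8, then 2 and 4 again.
data = [1, 2, 3, 4, 5, 6, 7, 8, 2, 4]

# rdd.countByValue(): a map from each value to how often it occurs.
count_by_value = dict(Counter(data))

# rdd.distinct(): only distinct elements (sorted here for a stable order;
# Spark itself makes no ordering promise).
distinct = sorted(set(data))

# rdd.filter(f): keep only the elements for which f returns True.
evens = [x for x in data if x % 2 == 0]

# rdd.count(): the number of elements.
count = len(data)

print(count_by_value)
print(distinct, evens, count)
```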
So, countByValue: let me quickly jump into my Colab where we can set it up.
We have 1, 2, 3, 4, 5, 6, 7, 8, and again 2 and 4, so we have a couple of
repeated values in our list, right? As soon as I run this and count by value,
it tells us that the counts are of integer type, and it maps each value of
the RDD to the number of times it occurs, guys. It is as simple as that.
Next we have toDebugString. When we're debugging, if we have to find out
whether an RDD is cached, how it was built, or what is going on at the back
end, we'll be using toDebugString. So we have two ranges, 1 to 19 and 1 to
13, and we're subtracting one from the other and storing the result in rddC.
As soon as we hit toDebugString, we get a description of that RDD and of the
RDDs it depends on. If you have to reuse it for some other operation in the
future, or check how it came to be, you'll be requiring that description,
guys. So toDebugString is literally used to debug and to let us know more
about an RDD. Let us check out how we can go about working with key-value
pairs, guys. What you need to know at this point is that there are a number
of ways to get paired RDDs in Spark, right? That's what we've been checking
out for a while now. Many formats directly return paired RDDs with their
key-value data when loaded, but in the regular cases we have a regular RDD
and need to convert it into a paired RDD ourselves. So how can we do this?
We've already checked it out: basically we run the map function, where a key
is mapped to a value, creating a pair. So rdd.map is going to assign a
key-value pair to each of our single elements and create the paired RDD for
us, guys. You can check it out here: a1, b1, c1, d1 and e1, then a2, b2, c2,
d2 and e2. These are two different entries, and you can see the key element
a1 being mapped to the rest of the elements in its row, and the key element
a2 being mapped to the rest of the elements in its row. So again, here's
another quick recap of all the transformations that we've checked out.
reduceByKey combines all the values with the same key. The syntax is
rdd.reduceByKey, and you send it the function to combine with, which here is
add. Then groupByKey: it works exactly like GROUP BY in SQL, as I have
mentioned; it groups values with respect to one single key. The mapValues
function applies one single function to all of the values in a paired RDD,
and it does this without changing the key. With flatMapValues, we're applying
one single function that returns an iterator to each of the values, right?
For each element that is returned, we produce a key-value entry with the old
key, guys. We use this in the process called tokenization, but we'll be
looking at that in the next couple of modules, so you can check it out then.
Next we have the keys function, where we just return an RDD containing only
the keys. With sortByKey you can sort by the keys and get back an RDD sorted
by key. Then subtractByKey: subtractByKey removes all the elements whose key
is present in the other RDD, guys. So if the same key is present in the
second RDD, those elements are removed from the resulting RDD. What is the
join function? It performs an inner join, which again is another SQL concept,
between both of the RDDs, guys.
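To make the pair transformations recapped so far a little more concrete, here is a plain-Python sketch of what each one computes. The helper names and the sample pairs are made up for illustration; they only imitate the PySpark calls named in the docstrings and are not Spark's implementation.

```python
from collections import defaultdict
from operator import add

# Hypothetical (key, value) pairs; rdd.map(lambda w: (w, 1)) builds pairs like these.
pairs = [("a", 1), ("b", 1), ("a", 2), ("c", 5), ("b", 3)]
other = [("b", 10), ("d", 20)]


def reduce_by_key(kv, func):
    """Like rdd.reduceByKey(func): combine all values that share a key."""
    out = {}
    for k, v in kv:
        out[k] = func(out[k], v) if k in out else v
    return sorted(out.items())


def group_by_key(kv):
    """Like rdd.groupByKey(): collect the values of each key together."""
    out = defaultdict(list)
    for k, v in kv:
        out[k].append(v)
    return sorted(out.items())


def map_values(kv, func):
    """Like rdd.mapValues(func): change the values, keep the keys."""
    return [(k, func(v)) for k, v in kv]


def flat_map_values(kv, func):
    """Like rdd.flatMapValues(func): one (old key, item) per item returned."""
    return [(k, item) for k, v in kv for item in func(v)]


def subtract_by_key(kv, other_kv):
    """Like rdd.subtractByKey(other): drop keys that appear in other."""
    other_keys = {k for k, _ in other_kv}
    return [(k, v) for k, v in kv if k not in other_keys]


def join(kv, other_kv):
    """Like rdd.join(other): inner join, only keys present in both."""
    return [(k, (v, w)) for k, v in kv for k2, w in other_kv if k == k2]


print(reduce_by_key(pairs, add))
print(group_by_key(pairs))
print(map_values(pairs, lambda v: v * 10))
print(flat_map_values([("a", "x y")], str.split))
print(sorted(k for k, _ in pairs))  # keys(), ordered as sortByKey would order them
print(subtract_by_key(pairs, other))
print(join(pairs, other))
```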
With rightOuterJoin, it performs a join between the two RDDs where the key
must be present in the second RDD, and with leftOuterJoin it performs a join
where the key must be present in the first RDD, guys. So once more: with the
right outer join every key of the second RDD is kept, and with the left outer
join every key of the first RDD is kept. And cogroup: cogroup is used to
group all the data from both of the RDDs where they share the same key. Next
up, we'll check out something called RDD lineage. What is RDD lineage? RDD
lineage is basically a graph of all the parent RDDs that existed before the
transformations. As soon as you apply a transformation, we're already
familiar that another, new RDD is created, that RDDs are immutable, and so
on. So the lineage gives us a graph, a DAG or directed acyclic graph, of the
parent RDDs of that particular transformed RDD, guys. It creates a logical
execution path starting from the parent RDD and going through all of the
transformations which have been done. It's pretty simple. We already checked
it out: it's toDebugString that prints our lineage, and it tells you how the
RDD maps back to its parents. You really require this if you're debugging
your application or if something is wrong with it. To talk about this
further, let's say we have an RDD graph which is the result of all of the
transformations that you have done, and we have an RDD lineage graph
generated from that. Since we have done some transformations, they will
actually be executed in a particular order, guys. So here is a little more
about the toDebugString method. We have been using rdd.toDebugString all this
while. Say you've used the cartesian method, you've used the map method,
you've zipped, you've used keyBy, and then you've used a union on all of
these, right? To give you the path of execution, the route that you have
taken, you'll be using toDebugString again, guys. Let us check out word count
using RDD concepts, guys. What is a word count program? A word count program
returns the frequency of every single word that occurs in a particular file.
Let's say we have an RDD that stores text file data, and later we'll perform
a couple of RDD operations which will eventually give us the word count
output. We've already checked out how to create an RDD from a text file. In
this particular case we'll be uploading the file to Google Colab and letting
Colab take care of our file. We'll be filtering it using a lambda function,
keeping only the lines whose length is greater than zero. We'll be using
flatMap to split the lines into separate words, splitting on the space
character. And lastly we'll be using reduceByKey and sortByKey, where we'll
be counting the frequency of occurrence of each single word, guys.
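The steps just described can be sketched in plain Python, so the logic can be followed without a Spark installation. The sample lines are hypothetical stand-ins for the uploaded file. In PySpark the same pipeline would presumably chain textFile, filter, flatMap, map, reduceByKey and sortByKey, as the comments indicate.

```python
from collections import defaultdict

# Hypothetical file contents; in the session these come from an uploaded text file.
lines = ["Apache Spark with Python is PySpark", "", "PySpark PySpark"]

# filter: keep only lines whose length is greater than zero.
non_empty = [line for line in lines if len(line) > 0]

# flatMap: split every line on the space character into one flat list of words.
words = [word for line in non_empty for word in line.split(" ")]

# map + reduceByKey: pair each word with 1, then add up the 1s per word.
counts = defaultdict(int)
for word in words:
    counts[word] += 1

# sortByKey: order the (word, count) pairs by word.
word_counts = sorted(counts.items())

for word, n in word_counts:
    print(word, n)
```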
Let's quickly jump into Google Colab and do that. Instead of pyspark.txt, let
me use spark.txt, which we already have, right? As soon as I create our RDD,
we're picking the file up, we're adding a filter to it, we're applying a
flatMap to split it into individual words, and we're counting the words,
right? Now as soon as I run this, all of these words show up once. We have
"Apache Spark with Python is PySpark", so all of these occur once, right?
Let's quickly change that. Let's add PySpark to the file a couple more times,
with the S in caps, and save it. Coming back to Google Colab, let's go to the
part where we uploaded the file. There's another simple way to upload files,
guys: all you do is open this tiny tab and hit upload. You can delete the old
file right now, go to the new one, and upload it. Whenever you've uploaded
files, they get deleted as soon as the runtime is recycled. So now my file is
uploaded. Let's create another RDD, because we're picking it up from a new
file. Let's filter it again, let's add flatMap to split all the elements, and
let us perform the word count and sort the words. As soon as I do this, it
says PySpark appears three times. Pretty simple, right? Then you can save
this entire output into a text file, and you can have it under a folder
called wordcount. If you're doing this on your local machine, you need to
give the location of the folder where you want to store the data, guys. The
next simple concept that we'll quickly check out is RDD partitioning and
parallelism, guys. We've been talking about partitions a lot, right? A
partition is nothing but a logical chunk of a large distributed dataset.
Spark manages data using partitions, which helps parallelize the processing
of all the distributed data that we have, and it has to do this with minimal
network traffic. You need to send and receive data between executors as well,
so if half of your time is wasted just sending and pulling data, it really
doesn't make sense. Spark has to do this efficiently, and it does. How? By
default, Spark tries to read the data into an RDD from the nodes that are
close to it. It figures out what the access time of data from each node is,
and this tells it how close the node is to the RDD, right? Then it does the
data pulling and pushing, and that is the criterion by which it keeps network
traffic minimal. So consider that we have a couple of items here, about 25
items on the screen, and the RDD is split into five partitions of five each:
items 1 to 5 are partition 1, 6 to 10 partition 2, 11 to 15 partition 3, 16
to 20 partition 4, and 21 to 25 the fifth partition. As you can check out,
when we map over it, each partition can be processed separately, so we're
using it as efficiently as possible, guys. Since Spark usually accesses
distributed, partitioned data, to optimize the transformations it creates
partitions that can hold a lot of data, so every partition holds a sizeable
chunk of the data, guys. Do we have to program how we partition the data?
Absolutely not, because RDDs get partitioned automatically without any
intervention from the programmer. But Spark gives us functionality to adjust
the size, the number of partitions, or how the data is partitioned, based on
our particular application, guys. The partitions can be inspected using the
def getPartitions: Array[Partition] method. This is a function on an RDD, and
through it we get to know the partitions that are present in that particular
RDD, guys. Coming to the types of partitioning, we have two types: one is
hash partitioning and one is range partitioning, guys.
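Before going into the two types, the earlier picture of 25 items split into five partitions of five can be sketched in plain Python. The helper name split_into_partitions is made up for this illustration and only mimics contiguous slicing; it is not how Spark is implemented.

```python
def split_into_partitions(items, num_partitions):
    """Cut a list into num_partitions contiguous, near-equal slices."""
    n = len(items)
    return [
        items[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


items = list(range(1, 26))  # items 1..25
partitions = split_into_partitions(items, 5)

for number, part in enumerate(partitions, start=1):
    print("partition", number, part)

# With PySpark available, sc.parallelize(items, 5).getNumPartitions()
# would be expected to report 5 here.
print(len(partitions))
```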
taken by the developers at spark where the giving you the customization to
partition and based on your particular needs but there is very important thing
that you need to know is customizing partition is only possible when you’re
using paired rdd’s so you cannot go about using this on single rdd’s guys
and but then so this might feel like a disadvantage of this point of time right
but it is not so again another basic advantage here is that pretty much you
will have data which is very similar right so you have multiple cases of
similar data which is located at the same place then shuffling the data
pretty much in by transforming it by using group by key reduce by key will
again make it an efficient way to pretty much know that you have an index where
you can pull your data from you can arrange it you can sort it based on that
particular index element and this is again an advantage of having paired RDS
as well so coming to the hash partition technique the hash partitioning again is
pretty much the technique where one single key or we call it the hash key
the hash keys basically is used to distribute you know all the elements
across the partition guys so one thing I mean you will not be using this
practically I just want I just found this to be an
important concept that I want to teach you guys so with hash partitioning
basically you need to know that we have something called as the hash key guys so
the hash key is at the end of the day our go-to person know where we’ll be
using it to you know distribute our basically we be using it to track or to
create and distribute all of the values or in terms of different partitions case
So again, we have one single RDD on the top, we have a couple of elements in
that particular RDD, and we'll be hashing the key, which we call the hash
key. We'll be using the hash key to tell where a particular key, and the
element of that key, has to sit, and in which partition. All you need to know
at this point is that the hash key denotes which partition an element has to
go into, guys. Coming to range partitioning: we already know that paired RDDs
have keys, and sometimes these keys will have a particular order, say
ascending or descending. To handle this in an orderly manner, instead of
using a hash key we'll be using range partitioning. So range partitioning
works when you have something that is ordered. How do we go about doing it?
Basically, tuples whose keys fall within the same range end up in the same
partition, and the keys in a range partitioner are partitioned based on the
sorted ranges of keys, guys. The keyword that we are trying to highlight here
is sorted: the keys are partitioned based on the sorted range of keys and the
natural ordering of the keys as well. So how do we go about doing range
partitioning? Well, it's simple. First we need to compute reasonable range
boundaries for our data, right? We cannot have a huge set of boundaries to
work on, because it'll take more time. Following this, we'll be constructing
a partitioner, giving the partitioner those range boundaries and telling it
to partition by our key and work from there. And lastly we'll be shuffling
our RDD values against these ordered range values, guys. We have a couple of
train numbers out here to give you an example. We have the complete train
number and where it runs from: Central Station, North, Northeast, Bayview and
all of that. You can observe that a value from the southern train has been
put into another partition out here, and a value from our first, central one
has been put here as well. So as soon as the new RDD is created using our
range partitioner, it is laid out in a certain order, but you need to know
that the values are being shuffled when the partitioning is changed, guys.
Coming to the last concept of this particular module, we'll be checking out
how we can pass functions to Spark, guys. Most of Spark's transformations,
and some of its actions, depend on passing in functions, as we have already
seen, right? We usually pass functions when we're adding or sorting, and we
can use functions elsewhere as well. Each of the languages that we can work
with, Python, Java and Scala, has a different mechanism for passing a
function. In Python we can pass one function into another function, and we
can do it anonymously as well, which is the lambda function that we'll be
using, along with a couple of other functional APIs provided by Python. Some
other considerations that come into play here are the names of the functions
that we pass, the data that we reference in those functions, and whether that
data needs to be ordered and serialized, guys. So Spark's API, and this is
the most important point on the slide for you, relies heavily on passing
functions in our driver program, so that it knows what should be done when it
runs on the cluster, guys. There are two recommended ways to go about doing
this. The first is defining an anonymous function, which is used for any
short snippet of code. But what if you have a longer piece of code? Sure, you
can use an anonymous function as well, but it is impractical to type it all
out on a single line. So we can use something called static methods, where
we'll be using a global singleton object, guys. Coming to the first, the
anonymous function: we'll be using it for short pieces of code, as I
mentioned. One quick example of a short anonymous function that we've already
discussed is the lambda function, guys.
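The lambda discussed here, one that adds 2 to its input, can be sketched in plain Python as follows. The built-in map stands in for rdd.map so that the snippet runs without Spark, and the names add_two, multiply and constant are made up for illustration.

```python
# An anonymous function: parameters left of the colon, expression on the right.
add_two = lambda x: x + 2

values = [1, 2, 3, 4, 5]

# Plain-Python stand-in for rdd.map(lambda x: x + 2).collect().
result = list(map(add_two, values))
print(result)  # [3, 4, 5, 6, 7]

# Lambdas may also take several parameters, or none at all.
multiply = lambda a, b: a * b
constant = lambda: 42
print(multiply(3, 4), constant())
```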
lambda x: x + 2. This function returns nothing but the sum of 2 and the input
variable. Whatever is on the left of the colon is the list of parameters that
you need to pass, and on the right is the expression that involves those
parameters and performs operations on them. You can name the functions as
well. As you've already seen, we have the values 1, 2, 3, 4, 5 and an RDD is
being created from them; we're calling the map function and having the lambda
give us the plus-two output of whatever the input is, guys. Functions can
take multiple parameters, or they can take no parameters as well, so there's
no compulsion that you have to push values into a function. As you can check
out now, we are adding 2 to each of the elements here, so 1 plus 2 is 3, 2
plus 2 is 4, 3 plus 2 is 5, and so on, right? It is pretty simple, guys. With
the anonymous nature of this function, all it says is that it is defined in
our main program and takes the value of x. So you can have the lambda
function return x + 2, and it doesn't have to be as simple as the x + 2 that
we've shown here: you can do it in more complex ways, depending on your
application, guys. Coming to static methods, let's say we have a quick
example where we are defining an object. Let's call the object MyFunctions
and give it a function called func1. Instead of just passing values to our
function, what we can do is pass by reference as well. If you have to revise
call by reference, you can check out the older module, but to tell you what
call by reference means: a reference to the argument, rather than a copy of
its value, is passed to the formal parameter, so changes made inside the
function are reflected outside the function as well. So instead of creating a
duplicate object and working on that, with static methods it requires nothing
but sending one single reference. But note that when you reference a method
of an object instance, that single object you send will pull in and bring all
of its class data along with it as well. So here's what our simple class
looks like. The class name is MyClass; we have func1, and we have another
function called doStuff. What we're doing is creating an instance of the
class and calling the doStuff method on it. The map inside doStuff references
the func1 method of that instance, and so the whole object. As I showed on
the previous slide, the entire object has to be sent to the cluster, guys. To
give you a quick example with respect to mapping, it is exactly the same as
writing rdd.map where we're mapping x to this.func1(x), guys. So again, Spark
takes care of this for you: when you define an object like this, the entire
object is passed to the cluster, and you can go about mapping it in one
single line.
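A minimal plain-Python sketch of the class described above may help show why the whole object travels along. The big_payload field and the func1_standalone function are hypothetical additions for illustration, and the built-in map stands in for rdd.map.

```python
class MyClass:
    def __init__(self):
        # Hypothetical large field that has nothing to do with func1.
        self.big_payload = list(range(100_000))

    def func1(self, x):
        return x + 2

    def do_stuff(self, values):
        # Stand-in for rdd.map(self.func1): the function handed over is a
        # bound method, so it carries a reference to the whole instance.
        return list(map(self.func1, values))


def func1_standalone(x):
    """Module-level alternative: carries no object along with it."""
    return x + 2


obj = MyClass()
bound = obj.func1

print(obj.do_stuff([1, 2, 3]))
print(bound.__self__ is obj)                  # the method drags obj with it
print(hasattr(func1_standalone, "__self__"))  # the plain function does not
```

On a real cluster the function has to be serialized and shipped to the workers, so a bound method would take big_payload with it, while a module-level function or a lambda that only uses local variables would not.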
Congratulations, guys, you've come to the end. I hope you enjoyed this. Just
before wrapping up, let us quickly walk through the quiz. You can take about
five seconds, as usual, to answer each one. So, which of the following is a
module for structured data processing? And here's another question for you:
which is a distributed graph processing framework on top of Spark? I think
this is a very simple one, right? Just a quick info: in case you guys are
looking for an end-to-end course certification in PySpark, Intellipaat
provides the PySpark certification training program, where you can learn all
of these concepts thoroughly and earn a certificate at the same time. The
link is given in the description box below, so make sure to check it out. I
hope this session was very informative for you all. If you have any queries
regarding this session, make sure to head down to the comment section below
and let us know, and we'll get back to you at the earliest. On that note,
have a nice day.

