
Python PySpark & Big Data Analysis Using Python Made Simple - Udemy

Target audience: All levels
Duration: 84 lectures - 6 hours
Indicative price: €19.99
Language: English
Provider: Udemy

>> Trial lesson

Welcome to the course 'Python PySpark & Big Data Analysis Using Python Made Simple'.

This course is taught by a software engineer who has managed to crack interviews at around 16 software companies.

Sometimes life gives us no time to prepare. There are emergencies in which we have to gather our courage and start bringing the situation under our control rather than being controlled by it. At the end of the day, we all leave this earth empty-handed, but in any given situation we should live and fight in such a way that the whole sequence of actions makes us proud and gives us goosebumps when we think about it even ten years later.

Apache Spark is an open-source processing engine built around speed, ease of use, and analytics. 

Spark was developed to use distributed, in-memory data structures to improve data processing speeds for most workloads; for iterative algorithms it performs up to 100 times faster than Hadoop MapReduce. Spark supports Java, Scala, and Python APIs for ease of development.

The PySpark API enables the use of Python to interact with the Spark programming model. For programmers who are already familiar with Python, the PySpark API provides easy access to the high-performance data processing enabled by Spark's Scala core, without the need to learn any Scala.

Though Scala code is generally more efficient, the PySpark API allows data scientists with Python experience to write their logic in the language most familiar to them. They can use it to perform rapid distributed transformations on large data sets and get the results back in Python-friendly notation.
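
For illustration, here is a minimal sketch of such a distributed transformation, assuming a local Spark installation (the app name "pyspark-sketch" is just a placeholder):

from pyspark import SparkContext

# Start a local SparkContext; "local[*]" uses all available cores on this machine.
sc = SparkContext("local[*]", "pyspark-sketch")
sc.setLogLevel("ERROR")

# Distribute an ordinary Python list, square each element on the workers,
# and collect the results back as an ordinary Python list.
numbers = sc.parallelize([1, 2, 3, 4, 5])
print(numbers.map(lambda n: n * n).collect())   # [1, 4, 9, 16, 25]

sc.stop()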

PySpark transformations (such as map, flatMap, and filter) return resilient distributed datasets (RDDs). Short functions are passed to RDD methods using Python's lambda syntax, while longer functions are defined with the def keyword.
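
For example, a sketch assuming an existing SparkContext named sc (as in the snippets later in this listing):

# Short function: an inline lambda passed to filter().
evens = sc.parallelize(range(10)).filter(lambda n: n % 2 == 0)
print(evens.collect())   # [0, 2, 4, 6, 8]

# Longer function: defined with the def keyword and passed by name to flatMap().
def tokenize(line):
    return line.lower().split()

print(sc.parallelize(["Apache Spark", "PySpark API"]).flatMap(tokenize).collect())
# ['apache', 'spark', 'pyspark', 'api']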

PySpark automatically ships the requested functions to worker nodes. The worker nodes then run the Python processes and push the results back to SparkContext, which stores the data in the RDD. 
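
As a rough sketch of that execution model (again assuming a SparkContext named sc), a function defined on the driver runs once per partition on the workers, and an action such as collect() brings the results back:

# Count the elements of each partition; this function executes on the worker nodes.
def count_in_partition(iterator):
    yield sum(1 for _ in iterator)

rdd = sc.parallelize(range(8), 4)                        # 8 numbers spread over 4 partitions
print(rdd.mapPartitions(count_in_partition).collect())   # [2, 2, 2, 2]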

PySpark offers access via an interactive shell, providing a simple way to learn the API. 
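
For instance, the bundled pyspark shell (a sketch; the startup banner varies by Spark version) comes with a ready-made SparkContext bound to the variable sc, so API calls can be tried out one line at a time:

$ ./bin/pyspark
...
>>> sc.parallelize([1, 2, 3]).map(lambda n: n + 1).collect()
[2, 3, 4]
>>> exit()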

This course contains many programs and single-line statements that explain the use of the PySpark APIs in depth. Through these programs and small data sets, we show how a file containing a large data set is actually analyzed and the required results are returned.

The course duration is around 6 hours. We have followed a question-and-answer approach to explain the PySpark API concepts.
Please check the list of PySpark questions on the course landing page and, if you are interested, enroll in the course.

Note: This course is designed for absolute beginners.

Questions:

>> Create and print an RDD from a Python collection of numbers. The given collection should be distributed across 5 partitions
>> Demonstrate the use of the glom() function
>> Using the range() function, print '1, 3, 5'
>> What is the output of the below statements?
     sc=SparkContext()
     sc.setLogLevel("ERROR")

     sc.range(5).collect()
     sc.range(2, 4).collect()
     sc.range(1, 7, 2).collect()


>> For a given Python collection of numbers in an RDD with a given set of partitions, perform the following:
   -> write a function which calculates the square of each number
   -> apply this function to the specified partitions of the RDD

>> Given the data below:

   [[0, 1], [2, 3], [4, 5]]

   write a statement such that you get the below outputs:

   [0, 1, 16, 25]
   [0, 1]
   [4, 9]
   [16, 25]

>> With the help of SparkContext(), read and display the contents of a text file

>> Explain the use of the union() function

>> Is it possible to combine and print the contents of a text file and the contents of an RDD?

>> Write a program to list a particular directory's text files and their contents

>> Given two functions seqOp and combOp, what is the output of the below statements:
     seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
     combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
     print(sc.parallelize([1, 2, 3, 4], 2).aggregate((0, 0), seqOp, combOp))

>> Given the data set [1, 2], write a statement such that we get the output below:

      [(1, 1), (1, 2), (2, 1), (2, 2)]

>> Given the data: [1, 2, 3, 4, 5].
     What is the difference between the outputs of the below 2 statements:
        print(sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(4).glom().collect())
        print(sc.parallelize([1, 2, 3, 4, 5], 5).coalesce(4).glom().collect())

>> Given two RDDs x and y:
        x = sc.parallelize([("a", 1), ("b", 4)])
        y = sc.parallelize([("a", 2)])

     Write a PySpark statement which produces the output below:
        [('a', ([1], [2])), ('b', ([4], []))]

>> Given the below statement:

      m = sc.parallelize([(1, 2), (3, 4)]).collectAsMap()

      Find a way to print the below values:

      '2'
      '4'

>> Explain the output of the below statement:

     print(sc.parallelize([2, 3, 4]).count())
     output: 3

>> Given the statement:

     rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

     Find a way to count the occurrences of the keys and print the output as below:

     [('a', 2), ('b', 1)]

>> Explain the output of the below statement:

     print(sorted(sc.parallelize([1, 2, 1, 2, 2], 2).countByValue().items()))

     output: [(1, 2), (2, 3)]

>> Given an RDD which contains the elements [1, 1, 2, 3],
     try to print only the first occurrence of each number

     output: [1, 2, 3]

>> Given the below statement:
     rdd = sc.parallelize([1, 2, 3, 4, 5])
     Write a statement to print only [2, 4]

>> Given the data [2, 3, 4], try to print only the first element in the data (i.e., 2)

>>  Given the below statement:
      rdd = sc.parallelize([2, 3, 4])
      Write a statement to get the below output from the above rdd:
      [(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]

>> Given the below statement:
       x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
    Write a statement/statements to get the below output from the above rdd:
       [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]

>> Given the below statement:
     rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
    What is the output of the below statements:
     print(sorted(rdd.foldByKey(0, add).collect()))
     print(sorted(rdd.foldByKey(1, add).collect()))
     print(sorted(rdd.foldByKey(2, add).collect()))

>> Given below statements:
     x = sc.parallelize([("a", 1), ("b", 4)])
     y = sc.parallelize([("a", 2), ("c", 8)]) 
     Write a statement to get the output as 
     [('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))]

>> Is it possible to get the number of partitions in an RDD?

>> Given the below statement:
rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
Write a snippet to get the following output:
[(0, [2, 8]), (1, [1, 1, 3, 5])]

>> Given the below statements:
w = sc.parallelize([("a", 5), ("b", 6)])
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
z = sc.parallelize([("b", 42)])
Write a snippet to get the following output:
[('a', ([5], [1], [2], [])), ('b', ([6], [4], [], [42]))]


>> Given the below statements:
rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
Write a snippet to get the following output:
[1, 2, 3]


>> Given the below statements:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("a", 3)])
Write a snippet to get the following outputs:
[('a', (1, 2)), ('a', (1, 3))]
[('a', (2, 1)), ('a', (3, 1))]


>> For the given data: [0, 1, 2, 3]
Write a statement to get the output as:
[(0, 0), (1, 1), (4, 2), (9, 3)]


>> For the given data: [0, 1, 2, 3, 4] and [0, 1, 2, 3, 4]
Write a statement to get the output as:
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]


>> Given the data:
[(0, 0), (1, 1), (4, 2), (9, 3)]
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
Write a statement to get the output as:
[(0, [[0], [0]]), (1, [[1], [1]]), (2, [[], [2]]), (3, [[], [3]]), (4, [[2], [4]]), (9, [[3], []])]


>> Given the data: [(1, 2), (3, 4)]
Print only '1' and '3'


>> Given the below statements:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
Write a snippet to get the following outputs:
[('a', (1, 2)), ('b', (4, None))]
[('a', (1, 2))]


>> What is the output of the below statements:
rdd = sc.parallelize(["b", "a", "c"])
print(sorted(rdd.map(lambda x: (x, 1)).collect()))


>> What is the output of the below statements:
rdd = sc.parallelize([1, 2, 3, 4], 2)
def f(iterator): yield sum(iterator)
print(rdd.mapPartitions(f).collect())


>> Explain the output of the below code snippet:

rdd = sc.parallelize([1, 2, 3, 4], 4)
def f(splitIndex, iterator):
  yield splitIndex
print(rdd.mapPartitionsWithIndex(f).sum())
output: 6

>> Explain the output of the below code snippet:
x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
def f(x): return len(x)
print(x.mapValues(f).collect())
output: [('a', 3), ('b', 1)]

>> What is the output of the below snippet:
import findspark
findspark.init('/opt/spark-2.2.1-bin-hadoop2.7')
import pyspark
import os
from pyspark import SparkContext

sc=SparkContext()
sc.setLogLevel("ERROR")

print(sc.parallelize([1, 2, 3]).mean())

>> What is the output of the below snippet:
pairs = sc.parallelize([1, 2, 3]).map(lambda x: (x, x))
sets = pairs.partitionBy(2).glom().collect()
print(sets)

>> Given the RDD below:
sc.parallelize([1, 2, 3, 4, 5])
write a statement to get the below output:
output: 15

>> Given the statement below:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
Write a statement to get the below output:
output:
[('a', 2), ('b', 1)]

>> What is the difference between leftOuterJoin() and rightOuterJoin()?

>> Given the below statement:
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
What is the output of the below statements:
print(sc.parallelize(tmp).sortBy(lambda x: x[0]).collect())
print(sc.parallelize(tmp).sortBy(lambda x: x[1]).collect())

>> Given the statements:
x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
y = sc.parallelize([("a", 3), ("c", None)])
Do something to get the output:
[('a', 1), ('b', 4), ('b', 5)]

>> Given the statements:
x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
y = sc.parallelize([("a", 3), ("c", None)])
Do something to get the output:
[('b', 4), ('b', 5)]

>> Given the statement:
sc.parallelize(["a", "b", "c", "d"], 3)
Do something to get the output:
[('a', 0), ('b', 1), ('c', 2), ('d', 3)]

>> Given the statement:
sc.parallelize(["a", "b", "c", "d", "e"], 3)
Do something to get the output:
[('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)]

>> Given the statements:
x = sc.parallelize(range(0,5))
y = sc.parallelize(range(1000, 1005))
Do something to get the output:
[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]

>> What is the output of the given program?

sc=SparkContext()
sc.setLogLevel("ERROR")

data = [["xyz1","a1",1, 2],
  ["xyz1","a2",3,4],
  ["xyz2","a1",5,6],
  ["xyz2","a2",7,8],
  ["xyz3","a1",9,10]]

rdd = sc.parallelize(data,4)
output = rdd.map(lambda y : [y[0],y[1],(y[2]+y[3])/2])
output2 = output.filter(lambda y : "a2" in y)
output4 = output2.takeOrdered(num=3, key = lambda x :-x[2])
print(output4)
output5 = output2.takeOrdered(num=3, key = lambda x :x[2])
print(output5)

>> Output the contents of a text file

>> Output the contents of a CSV file

>> Write a program to save to a sequence file and read from a sequence file

>> Write a program to save data in JSON format and display the contents of a JSON file

>> Write a program to add indices to data sets

>> Write a program to separate odd and even numbers using the filter() function

>> Write a program to explain the concept of the join() function

>> Write a program to explain the concept of the map() function

>> Write a program to explain the concept of the fold() function

>> Write a program to explain the concept of the reduceByKey() function

>> Write a program to explain the concept of the combineByKey() function

>> There are many more programs which showcase how to analyze big data


>> More info