Spark Data Structures & Parallelism

A 4-hour course for intermediate-level data scientists and engineers that covers Spark architecture and fundamentals, including RDDs, DataFrames, and Datasets.

4 hours of instruction

OBJECTIVES

  1. Get familiar with key use cases for Spark and its core features
  2. Be able to manipulate data in Spark using RDDs, DataFrames, and Datasets
  3. Understand how parallel processing works in Spark and be able to use the Spark UI to monitor Spark jobs

PREREQUISITES

Introduction to Scala Collections

SYLLABUS & TOPICS COVERED

  1. Spark Basics
    • Spark use cases and features
    • Spark architecture and components
  2. Spark Data Structures Overview
    • RDDs as the core data structure in Spark
    • RDD features
    • Working with RDDs
  3. Spark DataFrames and Datasets
    • DataFrame features and what makes them different from RDDs
    • Working with DataFrames
  4. Spark UI
    • Core concepts of parallel processing in Spark
    • Using Spark UI to monitor Spark jobs

SOFTWARE REQUIREMENTS

Apache Spark. You will have access to a Python-based JupyterHub environment for this course, so no additional download or installation is required.

About Instructor

DataSociety
