Spark Data Structures & Parallelism

4 hours of instruction

A 4-hour course for intermediate-level data scientists and engineers that covers Spark architecture and fundamentals, including RDDs, DataFrames, and Datasets.

OBJECTIVES

Get familiar with key use cases for Spark and its core features
Be able to manipulate data in Spark using RDDs, DataFrames, and Datasets
Understand how parallel processing works in Spark and be able to use the Spark UI to monitor Spark jobs

PREREQUISITES

Introduction to Scala Collections

SYLLABUS & TOPICS COVERED

Spark Basics
- Spark use cases and features
- Spark architecture and components
Spark Data Structures Overview
- RDDs as the core data structure in Spark
- RDD features
- Working with RDDs
Spark DataFrames and Datasets
- DataFrame features and what makes them different from RDDs
- Working with DataFrames
Spark UI
- Core concepts of parallel processing in Spark
- Using Spark UI to monitor Spark jobs

SOFTWARE REQUIREMENTS

Apache Spark. You will have access to a Python-based JupyterHub environment for this course, so no additional download or installation is required.

About Instructor

DataSociety
