4 hours of instruction
A 4-hour course for intermediate-level data scientists and engineers that covers Spark architecture and fundamentals, including RDDs, DataFrames, and Datasets.
OBJECTIVES
- Get familiar with key use cases for Spark and its core features
- Be able to manipulate data in Spark using RDDs, DataFrames, and Datasets
- Understand how parallel processing works and be able to explore the Spark UI to monitor Spark jobs
PREREQUISITES
Introduction to Scala Collections
SYLLABUS & TOPICS COVERED
- Spark Basics
- Spark use cases and core features
- Spark architecture and components
- Spark Data Structures Overview
- RDDs as the core data structure in Spark
- RDD features
- Working with RDDs
- Spark DataFrames and Datasets
- DataFrame features and what makes them different from RDDs
- Working with DataFrames
- Spark UI
- Core concepts of parallel processing in Spark
- Using Spark UI to monitor Spark jobs
SOFTWARE REQUIREMENTS
Apache Spark. You will have access to a Python-based JupyterHub environment for this course; no additional download or installation is required.