6 hours of instruction
A 6-hour course for intermediate-level data scientists and engineers covering Spark partitions, benchmarking, performance optimization, and monitoring.
OBJECTIVES
- Utilize Spark's built-in parallelism and related features to optimize performance
- Leverage the Spark UI to monitor Spark jobs
PREREQUISITES
Spark Data Structures & Parallelism
SYLLABUS & TOPICS COVERED
- Optimization Methods
- Spark partitions
- Benchmarking performance
- Caching and persistence
- Implementing Optimization
- Role of shared variables in Spark
- Partitioning data in memory vs partitioning on disk
- Optimizing performance and comparing results
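As a preview of the partitioning topics above, the idea behind Spark partitions can be sketched in plain Python. This is a conceptual model only, not Spark's implementation: Spark's HashPartitioner uses the JVM `hashCode` with a non-negative modulus, whereas this sketch uses Python's `hash`; the helper names are illustrative.

```python
import time

def assign_partition(key, num_partitions):
    # Conceptual model of hash partitioning: a record lands in
    # partition hash(key) mod num_partitions, so equal keys
    # always end up in the same partition.
    return hash(key) % num_partitions

def partition_records(records, num_partitions):
    # Distribute (key, value) pairs across partition "buckets".
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[assign_partition(key, num_partitions)].append((key, value))
    return partitions

# Tiny benchmarking pattern (the course measures Spark jobs the
# same way: time a run, change a knob, compare).
records = [(k, k * k) for k in range(1000)]
start = time.perf_counter()
parts = partition_records(records, 8)
elapsed = time.perf_counter() - start

sizes = [len(p) for p in parts]
print(f"partition sizes: {sizes}, total: {sum(sizes)}, took {elapsed:.6f}s")
```

With integer keys the records spread evenly across the 8 buckets; skewed keys would produce unbalanced partitions, which is one of the performance problems the course's optimization methods address.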
SOFTWARE REQUIREMENTS
Apache Spark. You will have access to a Python-based JupyterHub environment for this course; no additional download or installation is required.