6 hours of instruction
A 6-hour course for intermediate-level data scientists and engineers covering Spark partitions, benchmarking, performance optimization, and monitoring.
OBJECTIVES
- Utilize Spark's built-in parallelism and related features to optimize performance
- Leverage the Spark UI to monitor Spark jobs
PREREQUISITES
Spark Data Structures & Parallelism
SYLLABUS & TOPICS COVERED
- Optimization Methods
- Spark partitions
- Benchmarking performance
- Caching and persistence
- Implementing Optimization
- Role of shared variables in Spark
- Partitioning data in memory vs partitioning on disk
- Optimizing performance and comparing results
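As a preview of the partitioning topics above, the idea behind Spark partitions can be sketched in plain Python. This is a conceptual model only, not Spark's implementation: Spark's HashPartitioner uses the JVM `hashCode` with a non-negative modulus, whereas this sketch uses Python's `hash`; the helper names are illustrative.

```python
import time

def assign_partition(key, num_partitions):
    # Conceptual model of hash partitioning: a record lands in
    # partition hash(key) mod num_partitions, so equal keys
    # always end up in the same partition.
    return hash(key) % num_partitions

def partition_records(records, num_partitions):
    # Distribute (key, value) pairs across partition "buckets".
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[assign_partition(key, num_partitions)].append((key, value))
    return partitions

# Tiny benchmarking pattern (the course measures Spark jobs the
# same way: time a run, change a knob, compare).
records = [(k, k * k) for k in range(1000)]
start = time.perf_counter()
parts = partition_records(records, 8)
elapsed = time.perf_counter() - start

sizes = [len(p) for p in parts]
print(f"partition sizes: {sizes}, total: {sum(sizes)}, took {elapsed:.6f}s")
```

With integer keys the records spread evenly across the 8 buckets; skewed keys would produce unbalanced partitions, which is one of the performance problems the course's optimization methods address.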
SOFTWARE REQUIREMENTS
Apache Spark. You will have access to a Python-based JupyterHub environment for this course; no additional download or installation is required.