Spark Partitioning & Optimization

A 6-hour course for intermediate-level data scientists and engineers that covers Spark partitions, benchmarking, performance optimization, and monitoring.

6 hours of instruction

OBJECTIVES

  1. Use Spark's intrinsic parallelism and its features to optimize performance
  2. Use the Spark UI to monitor Spark jobs

PREREQUISITES

Spark Data Structures & Parallelism

SYLLABUS & TOPICS COVERED

  1. Optimization Methods
    • Spark partitions
    • Benchmarking performance
    • Caching and persistence
  2. Implementing Optimization
    • Role of shared variables in Spark
    • Partitioning data in memory vs partitioning on disk
    • Optimizing performance and comparing results

SOFTWARE REQUIREMENTS

Apache Spark. You will have access to a Python-based JupyterHub environment for this course. No additional download or installation is required.

About Instructor

DataSociety