Building Scalable Models in PySpark

Learn how to optimize your code and speed up your data processing with PySpark. In this course, students work through best practices for how and when to use PySpark, explore what PySpark can do, and learn to apply distributed computing within PySpark.

4 hours of instruction


OBJECTIVES

  1. Define the uses of Spark and PySpark and their role in Big Data analysis
  2. Query and analyze data in PySpark
  3. Build scalable machine learning models with PySpark

PREREQUISITES

Foundations of Big Data

SYLLABUS & TOPICS COVERED

  1. Basics
    • Working with data in PySpark
    • RDDs vs. DataFrames vs. Datasets
    • Optimized queries with Datasets
  2. Spark SQL
    • Spark SQL and its use cases
    • DataFrame API and operations on DataFrames
  3. Logistic Regression
    • Logistic regression use cases and the theory behind it
    • Logistic regression implementation in Spark
    • Parallel processing in Spark

SOFTWARE REQUIREMENTS

You will have access to a Python-based JupyterHub environment for this course. No additional download or installation is required.

About Instructor

OpenTeams
