Building Scalable Models in PySpark

Learn how to optimize your code and speed up your data processing with PySpark. In this course, students work through best practices for how and when to use PySpark, explore what PySpark can do, and learn to apply distributed computing within PySpark.

4 hours of instruction


OBJECTIVES

  1. Define the uses of Spark and PySpark and their role in Big Data analysis
  2. Query and analyze data in PySpark
  3. Build scalable machine learning models with PySpark

PREREQUISITES

Foundations of Big Data

SYLLABUS & TOPICS COVERED

  1. Basics
    • Working with data in PySpark
    • RDDs vs. DataFrames vs. Datasets
    • Optimized queries with Datasets
  2. Spark SQL
    • Spark SQL and its use cases
    • DataFrame API and operations on DataFrames
  3. Logistic Regression
    • Logistic regression use cases and the theory behind it
    • Logistic regression implementation in Spark
    • Parallel processing in Spark

SOFTWARE REQUIREMENTS

You will have access to a Python-based JupyterHub environment for this course. No additional download or installation is required.

About Instructor

OpenTeams
