4 hours of instruction
Learn how to optimize your code and speed up existing data processing with PySpark. In this course, students work through best practices for how and when to use PySpark, explore what it can do, and learn how to use distributed computing with it.
OBJECTIVES
- Define the uses of Spark and PySpark and their role in Big Data analysis
- Query and analyze data in PySpark
- Build scalable machine learning models with PySpark
PREREQUISITES
Foundations of Big Data
SYLLABUS & TOPICS COVERED
- Basics
- Working with data in PySpark
- RDDs vs. DataFrames vs. Datasets
- Optimized queries with Datasets
- Spark SQL
- Spark SQL and its use cases
- DataFrame API and operations on DataFrames
- Logistic Regression
- Logistic regression use cases and theory behind it
- Logistic regression implementation in Spark
- Parallel processing in Spark
SOFTWARE REQUIREMENTS
You will have access to a Python-based JupyterHub environment for this course. No additional download or installation is required.