
Price: $1,500.00

This four-day hands-on training course in Cloudera Data Engineering delivers the key concepts and skills developers need to use Apache Spark to build high-performance, parallel applications on the Cloudera Data Platform (CDP). Through its focus on Cloudera Data Engineering, the course gives participants practical skills in managing data and building robust data pipelines.

Through practical exercises, students will practice writing Spark applications that integrate with CDP’s core components. Participants will learn how to use Spark SQL to query structured data, apply Hive features for data ingestion and denormalization, and work with large datasets stored in a distributed file system.

After completing this course, participants will be well-equipped to tackle real-world challenges, building applications that facilitate faster decision-making, enhance analytical capabilities, and support a wide range of use cases across various industries.

During this course, you will learn how to:

  • Distribute, store, and process data in a CDP cluster.
  • Write, configure, and deploy Apache Spark applications.
  • Use Spark interpreters and applications to explore, process, and analyze distributed data.
  • Query data using Spark SQL, DataFrames, and Hive tables.
  • Deploy a Spark application on the Data Engineering Service.

This course is designed for developers and data engineers. All students are expected to have basic Linux experience and basic proficiency in either the Python or Scala programming language. Basic knowledge of SQL is helpful. Prior knowledge of Spark and Hadoop is not required.

Course Outline

HDFS Introduction

  • HDFS Overview
  • HDFS Components and Interactions
  • Additional HDFS Interactions
  • Ozone Overview
  • Exercise: Working with HDFS

YARN Introduction

  • YARN Overview
  • YARN Components and Interactions
  • Working with YARN
  • Exercise: Working with YARN

Working with RDDs

  • Resilient Distributed Datasets (RDDs)
  • Exercise: Working with RDDs

Working with DataFrames

  • Introduction to DataFrames
  • Exercise: Introducing DataFrames
  • Exercise: Reading and Writing DataFrames
  • Exercise: Working with Columns
  • Exercise: Working with Complex Types
  • Exercise: Combining and Splitting DataFrames
  • Exercise: Summarizing and Grouping DataFrames
  • Exercise: Working with UDFs
  • Exercise: Working with Windows

Introduction to Apache Hive

  • About Hive
  • Transforming Data with HiveQL

Working with Apache Hive

  • Exercise: Working with Partitions
  • Exercise: Working with Buckets
  • Exercise: Working with Skew
  • Exercise: Using SerDes to Ingest Text Data
  • Exercise: Using Complex Types to Denormalize Data
Working with Datasets in Scala

  • Working with Datasets in Scala
  • Exercise: Using Datasets in Scala