Data Engineering on Google Cloud

Looking to gain hands-on experience implementing key machine learning processes on Google Cloud? This four-day course will teach you everything you need to know about designing and building data processing systems.

4 day course
Partner of the Year
Virtual, Private
Virtual Classroom
A convenient, interactive learning experience that lets you attend one of our courses from the comfort of your own home or anywhere you can log on. We offer Virtual Classroom on selected live classroom courses; where available, it appears as an option in the location drop-down. These can also be booked as Private Virtual Classrooms for exclusive business sessions.
Private
A private training session for your team. Groups can be of any size, at a location of your choice including our training centres.

As a Google Cloud Partner, we’ve been selected to deliver this four-day Data Engineering course.

Our expert Cloud trainers will combine presentations and demos with practical, lab-orientated workshops to teach you the steps required to design data processing systems. You’ll learn how to build end-to-end data pipelines, analyze data and carry out machine learning.

The course covers structured, unstructured, and streaming data, so if you’re a budding Data Engineer, you’ll leave with the ability and confidence to apply your new skills to a variety of scenarios and datasets.

Our Data Engineering on Google Cloud course is delivered via Virtual Classroom. We also offer it as a private training session that can be delivered virtually or at a location of your choice in South Africa.

This is an intermediate-level course. If you’re looking to master the basics first, you would benefit from our Google Cloud Fundamentals: Big Data & Machine Learning course.

Want to combine the two? Take a look at our Professional Data Engineer track, which will set you on the path to obtaining the Google Data Engineer certification.

Course overview

Who should attend:

This course is intended for developers who are responsible for:

  • Extracting, loading, transforming, cleaning, and validating data
  • Designing pipelines and architectures for data processing
  • Integrating analytics and machine learning capabilities into data pipelines
  • Querying datasets, visualizing query results, and creating reports

What you'll learn:

By the end of this course, you will be able to:

  • Design and build data processing systems on Google Cloud
  • Process batch and streaming data by implementing autoscaling data pipelines on Dataflow
  • Derive business insights from extremely large datasets using BigQuery
  • Leverage unstructured data using Spark and ML APIs on Dataproc
  • Enable instant insights from streaming data
  • Understand ML APIs and BigQuery ML, and learn to use AutoML to create powerful models without coding

Prerequisites

To benefit from this course, participants should have completed our Google Cloud Fundamentals: Big Data & Machine Learning course or have equivalent experience. You should also have:

  • Basic proficiency with a common query language such as SQL
  • Experience with data modeling and ETL (extract, transform, load) activities
  • Experience with developing applications using a common programming language such as Python
  • Familiarity with machine learning and/or statistics

Course agenda

Module 1: Introduction to Data Engineering
  • Explore the role of a data engineer
  • Analyze data engineering challenges
  • Introduction to BigQuery
  • Data lakes and data warehouses
  • Transactional databases versus data warehouses
  • Partner effectively with other data teams
  • Manage data access and governance
  • Build production-ready pipelines
  • Review Google Cloud customer case study
  • Lab: Using BigQuery to do Analysis
Module 2: Building a Data Lake
  • Introduction to data lakes
  • Data storage and ETL options on Google Cloud
  • Building a data lake using Cloud Storage
  • Securing Cloud Storage
  • Storing all sorts of data types
  • Cloud SQL as a relational data lake
Module 3: Building a Data Warehouse
  • The modern data warehouse
  • Introduction to BigQuery
  • Getting started with BigQuery
  • Loading data
  • Exploring schemas
  • Schema design
  • Nested and repeated fields
  • Optimizing with partitioning and clustering
  • Lab: Loading Data into BigQuery
  • Lab: Working with JSON and Array Data in BigQuery
Module 4: Introduction to Building Batch Data Pipelines
  • EL, ELT, ETL
  • Quality considerations
  • How to carry out operations in BigQuery
  • Shortcomings
  • ETL to solve data quality issues
Module 5: Executing Spark on Dataproc
  • The Hadoop ecosystem
  • Run Hadoop on Dataproc
  • Cloud Storage instead of HDFS
  • Optimize Dataproc
  • Lab: Running Apache Spark jobs on Dataproc
Module 6: Serverless Data Processing with Dataflow
  • Introduction to Dataflow
  • Why customers value Dataflow
  • Dataflow pipelines
  • Aggregating with GroupByKey and Combine
  • Side inputs and windows
  • Dataflow templates
  • Dataflow SQL
  • Lab: A Simple Dataflow Pipeline (Python/Java)
  • Lab: MapReduce in Dataflow (Python/Java)
  • Lab: Side inputs (Python/Java)
Module 7: Manage Data Pipelines with Cloud Data Fusion & Cloud Composer
  • Building batch data pipelines visually with Cloud Data Fusion
  • Components
  • UI overview
  • Building a pipeline
  • Exploring data using Wrangler
  • Orchestrating work between Google Cloud services with Cloud Composer
  • Apache Airflow environment
  • DAGs and operators
  • Workflow scheduling
  • Monitoring and logging
  • Lab: Building and Executing a Pipeline Graph in Data Fusion
  • Optional Lab: An introduction to Cloud Composer
Module 8: Introduction to Processing Streaming Data
  • Process Streaming Data
  • Explain streaming data processing
  • Describe the challenges with streaming data
  • Identify the Google Cloud products and tools that can help address streaming data challenges
Module 9: Serverless Messaging with Pub/Sub
  • Introduction to Pub/Sub
  • Pub/Sub push versus pull
  • Publishing with Pub/Sub code
  • Lab: Publish Streaming Data into Pub/Sub
Module 10: Dataflow Streaming Features
  • Streaming data challenges
  • Dataflow windowing
  • Lab: Streaming Data Pipelines
Module 11: High-Throughput BigQuery & Bigtable Streaming Features
  • Streaming into BigQuery and visualizing results
  • High-throughput streaming with Cloud Bigtable
  • Optimizing Cloud Bigtable performance
  • Lab: Streaming Analytics and Dashboards
  • Lab: Streaming Data Pipelines into Bigtable
Module 12: Advanced BigQuery Functionality & Performance
  • Analytic window functions
  • Use WITH clauses
  • GIS functions
  • Performance considerations
  • Lab: Optimizing your BigQuery Queries for Performance
  • Optional Lab: Partitioned Tables in BigQuery
Module 13: Introduction to Analytics & AI
  • What is AI?
  • From ad-hoc data analysis to data-driven decisions
  • Options for ML models on Google Cloud
Module 14: Prebuilt ML Model APIs for Unstructured Data
  • Challenges dealing with unstructured data
  • ML APIs for enriching data
  • Lab: Using the Natural Language API to Classify Unstructured Text
Module 15: Big Data Analytics with Notebooks
  • What’s a notebook?
  • BigQuery magic and ties to Pandas
  • Lab: BigQuery in Jupyter Labs on AI Platform
Module 16: Production ML Pipelines
  • Ways to do ML on Google Cloud
  • Vertex AI Pipelines
  • AI Hub
  • Lab: Running Pipelines on Vertex AI
Module 17: Custom Model Building with SQL in BigQuery ML
  • BigQuery ML for quick model building
  • Supported models
  • Lab option 1: Predict Bike Trip Duration with a Regression Model in BigQuery ML
  • Lab option 2: Movie Recommendations in BigQuery ML
Module 18: Custom Model Building with AutoML
  • Why AutoML?
  • AutoML Vision
  • AutoML NLP
  • AutoML Tables