The Ultimate Databricks Data Engineer Associate Exam Guide for AWS Engineers

A Comprehensive Deep-Dive into Every Exam Topic


Table of Contents

  1. Introduction to Databricks & the Lakehouse Paradigm
  2. Databricks Architecture
  3. Compute — Clusters & Runtimes
  4. Notebooks
  5. Delta Lake & Data Management
  6. Tables, Schemas & Databases
  7. Databricks Utilities (dbutils)
  8. Git Folders / Repos
  9. Unity Catalog & Governance
  10. Job Orchestration & Workflows
  11. Data Ingestion Patterns
  12. Delta Live Tables (DLT) & Pipelines
  13. Apache Spark — High-Level Overview
  14. AWS Integration Specifics
  15. Security & Access Control
  16. Performance & Best Practices
  17. Exam Tips & Summary

Chapter 1 — Introduction to Databricks & the Lakehouse Paradigm

What Is Databricks?

Databricks is a unified data analytics platform built on top of Apache Spark. It was founded by the original creators of Apache Spark and is available as a managed cloud service on AWS, Azure, and GCP. For AWS engineers, Databricks runs inside your AWS account — it provisions and manages Amazon EC2 instances, S3 buckets, IAM roles, VPCs, and other AWS resources on your behalf.

The platform is designed to bridge the gap between data engineering, data science, and business analytics — providing a single pane of glass for all data workloads.

The Lakehouse Architecture

Before Databricks, teams operated in one of two paradigms:

Data Warehouses (Redshift, Snowflake) — structured data and fast SQL queries, but expensive for large raw datasets, with limited ML support and data duplicated across systems.

Data Lakes (S3 + Glue + Athena) — Cheap storage of any format, but no ACID transactions, no data quality guarantees, schema enforcement challenges, and slow query performance without significant tuning.

The Lakehouse combines the best of both:

  • Cheap, open-format storage (S3 + Parquet/Delta)
  • ACID transactions and data reliability
  • Support for SQL, Batch, Streaming, ML, and BI workloads
  • A unified governance layer
  • Open standards (Delta Lake, Apache Iceberg support)

Databricks is the leading commercial implementation of the Lakehouse pattern.

Why This Matters for the Exam

The exam frequently tests whether you understand which layer (Bronze, Silver, Gold) a given operation belongs to, what Delta Lake provides over raw Parquet, and how Databricks differs from a traditional data warehouse or raw S3 data lake.


Chapter 2 — Databricks Architecture

Control Plane vs. Data Plane

This is one of the most fundamental concepts for both the exam and real-world AWS deployments.

Control Plane (Databricks-managed AWS Account)

The control plane lives in Databricks' own AWS account. It contains:

  • Databricks Web Application — the UI you interact with
  • Cluster Manager — orchestrates cluster lifecycle
  • Job Scheduler — triggers and monitors workflow runs
  • Notebook Storage — notebooks are stored in the control plane (unless using Git Folders)
  • Metadata Services — the Hive Metastore (legacy) or Unity Catalog Metastore

The control plane never touches your actual data. It sends instructions to resources in your account.

Data Plane (Your AWS Account)

The data plane lives in your AWS account and contains:

  • EC2 instances — actual Spark driver and worker nodes
  • S3 buckets — your data storage (the DBFS root bucket and your own buckets)
  • VPC, Subnets, Security Groups — network resources
  • IAM Roles & Instance Profiles — permissions for cluster nodes to access S3

When a cluster is launched, the Databricks control plane communicates with your AWS account (via a cross-account IAM role) to spin up EC2 instances inside your VPC.

Why This Split Matters

Security teams love this architecture because your data never leaves your AWS account. Databricks only stores metadata and notebook code. If you use Secrets (more on that later), sensitive values are encrypted in the control plane but never transmitted in plain text.

Databricks Workspace

A workspace is the top-level organizational unit in Databricks. Think of it like an AWS account — it has its own:

  • Users and groups
  • Clusters
  • Notebooks
  • Jobs
  • Data objects (catalogs, schemas, tables)
  • Secrets

On AWS, a workspace is tied to a specific AWS region and a specific S3 bucket (the "root DBFS bucket"). Large organizations often have multiple workspaces (dev, staging, prod) — this is the recommended pattern.

DBFS — Databricks File System

DBFS is a virtual file system layer that abstracts over your underlying storage. It provides a unified path namespace (dbfs:/) that maps to:

  • An S3 bucket managed by Databricks (the root bucket, e.g., s3://databricks-workspace-XXXXXXX/)
  • Other mounted S3 paths
  • External locations (with Unity Catalog)

Key points:

  • DBFS is not a separate storage system — it's a layer over S3
  • The default DBFS root bucket is managed by Databricks but lives in your AWS account
  • You can mount external S3 buckets to DBFS paths (legacy approach)
  • With Unity Catalog, External Locations replace mounts as the preferred approach
  • DBFS paths look like dbfs:/user/hive/warehouse/ or dbfs:/FileStore/

Important DBFS paths to know:

  • dbfs:/FileStore/ — files uploaded via the UI (accessible via /FileStore/ URL)
  • dbfs:/user/hive/warehouse/ — default managed table location (Hive Metastore)
  • dbfs:/databricks/ — internal Databricks paths
  • dbfs:/mnt/<mount-name>/ — mounted external storage (legacy)

Cluster Architecture

Each Databricks cluster is a standard Apache Spark cluster:

Driver Node (EC2)
  |
  |-- Worker Node 1 (EC2)
  |      |-- Executor 1
  |      |-- Executor 2
  |
  |-- Worker Node 2 (EC2)
         |-- Executor 3
         |-- Executor 4
  • Driver — runs the main program, coordinates Spark jobs, hosts the SparkSession
  • Workers — run Spark executors that process data in parallel
  • Executors — JVM processes on workers that execute tasks

Chapter 3 — Compute: Clusters & Runtimes

Cluster Types

All-Purpose Clusters

  • Created manually via the UI, CLI, or API (optionally constrained by a cluster policy)
  • Designed for interactive development — notebooks, ad-hoc queries
  • Can be shared by multiple users simultaneously
  • More expensive because they run continuously (until terminated)
  • Persist between runs; you can restart them
  • Billed per DBU (Databricks Unit) while running

Exam tip: All-purpose clusters are for development. Don't use them for production workflows without a good reason — that's what job clusters are for.

Job Clusters

  • Created automatically when a Databricks Job runs
  • Terminated automatically when the job completes
  • Cannot be shared interactively
  • Much cheaper — you only pay while the job is running
  • Configuration is defined in the job definition
  • Each job run gets a fresh, isolated cluster

Exam tip: For production workloads, always prefer job clusters for cost efficiency and isolation.

SQL Warehouses (Databricks SQL)

  • Specifically designed for SQL analytics workloads
  • Used by Databricks SQL, BI tools (Tableau, Power BI), and dbt
  • Available as Serverless, Pro, or Classic warehouses
  • Auto-scaling and auto-stop built-in
  • Not used for Spark DataFrame API or ML workloads directly

Cluster Configuration

Runtime Versions

The Databricks Runtime (DBR) is the set of software running on your cluster:

  • Databricks Runtime (Standard) — Spark + Delta Lake + standard libraries
  • Databricks Runtime ML — adds ML libraries (MLflow, TensorFlow, PyTorch, scikit-learn)
  • Databricks Runtime for Genomics — specialized for genomics
  • Photon Runtime — includes the Photon query engine (C++ vectorized execution for Delta Lake queries — much faster for SQL)

LTS versions (Long Term Support) — recommended for production. They receive bug fixes for a longer period. Example: 13.3 LTS, 12.2 LTS.

Always use the latest LTS version unless you have a specific reason to use a different version.

Cluster Modes

Standard Mode (Single User)

  • One user owns the cluster
  • Recommended for most production workloads
  • Full Spark features available

Shared Mode

  • Multiple users share the same cluster
  • Isolation via Unity Catalog and fine-grained access control
  • Some Spark features restricted for safety
  • Requires Unity Catalog

Single Node Mode

  • Driver only — no worker nodes
  • Good for small data, ML model training on single machine, development
  • Spark still works (local mode)

Auto-Scaling

Databricks clusters can automatically scale the number of worker nodes based on load:

  • Set a minimum and maximum worker count
  • Databricks monitors the job queue and scales up/down accordingly
  • Scale-down has a configurable timeout (default: 120 seconds of idle)
  • Does not work well with Streaming jobs — use fixed-size clusters for streaming

Auto-Termination

Set an idle timeout (e.g., 60 minutes) after which the cluster terminates automatically if no notebooks are attached or jobs are running. This is critical for cost control.
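
To make this concrete, here's a rough sketch of a cluster spec that combines autoscaling and auto-termination. The field names follow the Clusters API, but every value shown (name, runtime, instance type, limits) is only an illustration, not a recommendation:

# Illustrative cluster spec (Clusters API field names); all values are examples only
import json

cluster_spec = {
    "cluster_name": "dev-autoscaling-cluster",       # hypothetical name
    "spark_version": "13.3.x-scala2.12",             # an LTS runtime
    "node_type_id": "r5.xlarge",                     # memory-optimized instance
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 60,                   # terminate after 1 hour of idle
}

print(json.dumps(cluster_spec, indent=2))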

Instance Types (AWS)

For the exam, you don't need to memorize specific EC2 types, but understand:

  • Memory-optimized (r5, r6i) — good for large DataFrames, joins, aggregations
  • Compute-optimized (c5, c6i) — good for CPU-intensive transforms
  • Storage-optimized (i3) — good for heavy shuffle operations (local NVMe SSDs)
  • GPU instances (p3, g4dn) — for ML Runtime (deep learning)

Instance Pools

An instance pool maintains a set of pre-warmed EC2 instances ready to be allocated to clusters. This dramatically reduces cluster startup time (from ~5-8 minutes to ~30-60 seconds).

Key characteristics:

  • Pools have a minimum idle instance count and maximum capacity
  • Idle instances in a pool are running EC2 instances — you pay AWS for them, but idle pool instances do not incur DBU charges
  • When a cluster uses a pool, it grabs instances from the pool
  • When a cluster terminates, instances return to the pool
  • Configured with a specific instance type and Databricks Runtime

Exam tip: Use instance pools when you need fast cluster startup for jobs or have many short-lived clusters.
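
When a cluster draws from a pool, its spec references the pool ID instead of an instance type. A minimal sketch (the pool ID is a placeholder):

# Cluster spec that pulls nodes from an instance pool (placeholder pool ID)
pooled_cluster_spec = {
    "cluster_name": "fast-start-job-cluster",        # hypothetical name
    "spark_version": "13.3.x-scala2.12",
    "instance_pool_id": "pool-1234567890abcdef",     # replaces node_type_id
    "num_workers": 4,
}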

Cluster Policies

Cluster policies are JSON configurations that restrict and/or set default values for cluster configuration. They help:

  • Control costs (max DBUs per hour, max instance count)
  • Enforce standards (specific runtimes, instance types)
  • Simplify UX for non-technical users
  • Apply governance rules

Only workspace admins can create policies. Users can only create clusters that conform to the policies assigned to them.

// Example: restrict cluster to specific runtime
{
  "spark_version": {
    "type": "fixed",
    "value": "13.3.x-scala2.12"
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 60
  }
}

DBUs — Databricks Units

DBUs (Databricks Units) are the billing unit for Databricks — a normalized measure of processing capacity consumed per hour of compute. The DBU rate varies by:

  • Cluster type (All-Purpose vs. Jobs vs. SQL Warehouse)
  • Workload type (Data Engineering, Data Analytics, etc.)
  • Whether Photon is enabled

You pay both EC2 costs (to AWS) AND DBU costs (to Databricks).
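
For a rough feel of how the two charges add up, here's a toy estimate — every rate below is a made-up placeholder, so plug in your own DBU and EC2 pricing:

# Rough cost model: total = EC2 cost + DBU cost (placeholder rates, not real prices)
nodes = 5                  # 1 driver + 4 workers
hours = 2.0
ec2_rate = 0.25            # assumed $/hour per instance (paid to AWS)
dbu_per_node_hour = 0.75   # assumed DBUs consumed per node-hour
dbu_price = 0.15           # assumed $/DBU for a jobs workload (paid to Databricks)

ec2_cost = nodes * hours * ec2_rate
dbu_cost = nodes * hours * dbu_per_node_hour * dbu_price
print(f"EC2: ${ec2_cost:.2f}, DBU: ${dbu_cost:.2f}, total: ${ec2_cost + dbu_cost:.2f}")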


Chapter 4 — Notebooks

What Is a Databricks Notebook?

A notebook is a web-based, interactive document containing cells of code and markdown. It's the primary development interface in Databricks. Notebooks in Databricks have superpowers compared to Jupyter notebooks — they integrate tightly with the platform.

Notebook Basics

Languages

Each notebook has a default language (Python, SQL, Scala, R). However, you can mix languages within a single notebook using magic commands:

  • %python — execute a Python cell
  • %sql — execute a SQL cell
  • %scala — execute a Scala cell
  • %r — execute an R cell
  • %md — render Markdown (documentation)
  • %sh — execute shell commands on the driver node
  • %fs — shortcut for dbutils.fs commands
  • %run — run another notebook inline (powerful for modularization)
  • %pip — install Python packages in the notebook environment
  • %conda — use conda for package management (deprecated in newer runtimes)

Key insight about %run: When you %run another notebook, it executes in the same SparkSession and scope as the calling notebook. Variables, functions, and DataFrames defined in the sub-notebook are available in the parent notebook. This is different from a function call — it's more like an include.
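
For example — assuming a sibling notebook ./setup_helpers that defines base_path and load_customers (all three names are hypothetical):

# Cell 1 — include the helper notebook (hypothetical path)
%run ./setup_helpers

# Cell 2 — names defined in setup_helpers are now in this notebook's scope
print(base_path)                        # a variable defined in setup_helpers
customers_df = load_customers(spark)    # a function defined in setup_helpers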

Key insight about %sh: Shell commands run on the driver node only, not on worker nodes. Output is printed inline. You can install packages with %sh pip install ... but %pip is preferred because it installs on all nodes.

Widget Parameters

Notebooks can accept input parameters using widgets:

dbutils.widgets.text("table_name", "default_value", "Table Name")
table_name = dbutils.widgets.get("table_name")

Widget types:

  • text — free-text input
  • dropdown — select from a list
  • combobox — dropdown with free-text option
  • multiselect — select multiple values

Widgets are particularly useful when notebooks are run as jobs — you can pass parameter values at runtime.

Notebook Results & Visualization

  • The last expression in a cell is displayed as output
  • DataFrames are displayed as interactive tables with display(df)
  • display() supports charts — bar, line, pie, map, etc.
  • print() outputs plain text
  • Markdown cells render formatted documentation inline

Notebook Lifecycle

Notebooks can be in these states:

  • Detached — not connected to a cluster, cannot run code
  • Attached — connected to a running or starting cluster
  • Running — actively executing cells

You attach a notebook to a cluster by selecting the cluster from the dropdown in the top-right. Multiple notebooks can be attached to the same all-purpose cluster simultaneously.

Auto-Save and Versioning

Notebooks auto-save frequently. Databricks maintains a revision history for every notebook — you can see the full history and revert to any previous version. This is separate from Git versioning (covered later).

Exporting Notebooks

Notebooks can be exported as:

  • .ipynb — Jupyter Notebook format
  • .py — Python source file (with # COMMAND ---------- separators)
  • .scala — Scala source
  • .sql — SQL source
  • .dbc — Databricks archive (proprietary, includes all cells and results)
  • .html — HTML with rendered output (for sharing results)

Notebook Context & SparkSession

In every Databricks notebook, these objects are pre-initialized:

  • spark — the SparkSession
  • sc — the SparkContext
  • dbutils — Databricks Utilities
  • display() — the display function
  • displayHTML() — render HTML content

You never need to create a SparkSession manually in a notebook.


Chapter 5 — Delta Lake & Data Management

This is the most important chapter for the exam. Delta Lake underpins everything in Databricks.

What Is Delta Lake?

Delta Lake is an open-source storage layer that brings ACID transactions and reliability to data lakes. It uses standard Parquet files for data storage but adds a transaction log on top that enables:

  • ACID transactions
  • Schema enforcement and evolution
  • Time travel (data versioning)
  • Upserts and deletes (DML operations)
  • Streaming and batch unification
  • Optimized read/write operations

Delta Lake is the default table format in Databricks. When you create a table without specifying a format, it's Delta.

Delta Lake Architecture

The Delta Transaction Log

The transaction log (located in the _delta_log/ folder of every Delta table) is the backbone of Delta Lake. It's a folder containing:

  • JSON files (0000000000.json, 0000000001.json, ...) — each represents one transaction
  • Checkpoint files (.parquet) — every 10 commits, a full checkpoint is created for performance

Every operation (write, update, delete, schema change) creates a new entry in the transaction log. The log is append-only — you never modify existing log entries.

Reading a Delta table:

  1. Spark reads the _delta_log/ to determine the current state
  2. It identifies which Parquet files are "live" (not removed by deletes/updates)
  3. It reads only those Parquet files

Writing to a Delta table:

  1. Spark writes new Parquet files to the table directory
  2. Atomically commits a new JSON entry to _delta_log/
  3. The transaction either fully commits or fully fails — no partial writes
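
You can see this structure for yourself by listing the log folder of any Delta table (the table path below is a placeholder):

# Inspect the transaction log of a Delta table (placeholder path)
table_path = "s3://my-bucket/delta/employees"          # hypothetical location
for f in dbutils.fs.ls(f"{table_path}/_delta_log/"):
    print(f.name)   # zero-padded *.json commits plus periodic *.checkpoint.parquet files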

ACID Properties in Delta Lake

Atomicity — A transaction either fully succeeds or fully fails. If your Spark job fails halfway, no partial data is committed.

Consistency — Schema enforcement ensures data always conforms to the table's schema.

Isolation — Concurrent reads and writes don't interfere. Readers always see a consistent snapshot. Writers use optimistic concurrency control.

Durability — Once committed, data persists to S3.

Delta Lake Operations

Creating Delta Tables

-- SQL approach (creates a managed table in default schema)
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE,
  department STRING
)
USING DELTA;

-- Or from an existing data source
CREATE TABLE employees
USING DELTA
AS SELECT * FROM parquet.`/path/to/parquet/`;

-- CTAS (Create Table As Select)
CREATE TABLE silver_employees
AS SELECT id, name, salary FROM bronze_employees WHERE salary > 0;

Writing Data

# Overwrite entire table
df.write.format("delta").mode("overwrite").saveAsTable("employees")

# Append to table
df.write.format("delta").mode("append").saveAsTable("employees")

# Write to a path (external table pattern)
df.write.format("delta").save("s3://my-bucket/delta/employees/")

Reading Data

# Read a table
df = spark.read.table("employees")

# Read from path
df = spark.read.format("delta").load("s3://my-bucket/delta/employees/")

# SQL
spark.sql("SELECT * FROM employees WHERE department = 'Engineering'")

MERGE (Upsert)

MERGE is one of Delta Lake's most powerful features — it allows you to upsert (update + insert) data efficiently:

MERGE INTO target_table AS target
USING source_table AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET target.name = source.name,
             target.salary = source.salary
WHEN NOT MATCHED THEN
  INSERT (id, name, salary, department)
  VALUES (source.id, source.name, source.salary, source.department)
WHEN NOT MATCHED BY SOURCE THEN
  DELETE

Key MERGE clauses:

  • WHEN MATCHED — row exists in both target and source
  • WHEN NOT MATCHED (or WHEN NOT MATCHED BY TARGET) — row in source but not target (INSERT)
  • WHEN NOT MATCHED BY SOURCE — row in target but not in source (useful for DELETE)

You can have multiple WHEN MATCHED clauses with different conditions.
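
For example, a conditional clause can turn source rows flagged as deleted into deletes while other matches become updates — the tables and columns below are hypothetical:

# MERGE with multiple, conditional WHEN MATCHED clauses (hypothetical tables/columns)
spark.sql("""
  MERGE INTO silver.customers AS t
  USING staging.customer_updates AS s
  ON t.customer_id = s.customer_id
  WHEN MATCHED AND s.is_deleted = true THEN DELETE
  WHEN MATCHED THEN UPDATE SET t.email = s.email, t.updated_at = s.updated_at
  WHEN NOT MATCHED THEN INSERT *
""")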

DELETE and UPDATE

-- Delete rows
DELETE FROM employees WHERE department = 'Legacy';

-- Update rows
UPDATE employees SET salary = salary * 1.10 WHERE department = 'Engineering';

These are true DML operations — not possible with raw Parquet files!

Time Travel

Time travel is the ability to query historical versions of a Delta table. Every commit creates a new version.

-- Query version 5
SELECT * FROM employees VERSION AS OF 5;

-- Query as of a timestamp
SELECT * FROM employees TIMESTAMP AS OF '2024-01-15 10:00:00';

-- In Python
df = spark.read.format("delta") \
  .option("versionAsOf", 5) \
  .load("/path/to/table")

Use cases:

  • Audit and compliance — what did the data look like at a specific time?
  • Recovery — accidentally deleted data? Roll back!
  • Reproducibility — ML experiments can reference exact data versions

How long can you go back?

Time travel is limited by the table's retention settings: the transaction log is kept for 30 days by default, and the data files for old versions are removed by the VACUUM command (default file retention: 7 days).

RESTORE

You can restore a table to a previous version:

RESTORE TABLE employees TO VERSION AS OF 3;
RESTORE TABLE employees TO TIMESTAMP AS OF '2024-01-10';

This creates a new commit that restores the state — it doesn't delete history.

Schema Enforcement and Evolution

Schema Enforcement (Schema Validation)

By default, Delta Lake rejects writes that don't match the table's schema. This prevents silent data corruption.

If you try to write a DataFrame with extra columns or incompatible types to a Delta table, you'll get an AnalysisException.
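
Here's a minimal sketch of enforcement in action, reusing the employees table from above (the extra column is deliberately wrong):

# Schema enforcement: appending a mismatched DataFrame raises an AnalysisException
from pyspark.sql.utils import AnalysisException

bad_df = spark.createDataFrame([(1, "Ada", "oops")], ["id", "name", "extra_col"])
try:
    bad_df.write.format("delta").mode("append").saveAsTable("employees")
except AnalysisException as e:
    print(f"Write rejected: {e}")   # schema mismatch detected, nothing was committed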

Schema Evolution

You can allow schema changes using mergeSchema or overwriteSchema:

# Add new columns to the table schema automatically
df.write.format("delta") \
  .mode("append") \
  .option("mergeSchema", "true") \
  .save("/path/to/table")

# Completely replace the schema (dangerous!)
df.write.format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .save("/path/to/table")

mergeSchema — adds new columns from the incoming data to the schema. Existing columns are preserved.
overwriteSchema — replaces the entire schema (requires overwrite mode).

To rename or drop columns, first enable column mapping: ALTER TABLE <table> SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name').

Delta Lake Table Properties and Optimization

OPTIMIZE

OPTIMIZE compacts small Parquet files into larger ones. Small files are a common performance problem — each Parquet file write from a Spark task creates a file, and tables with many small files are slow to read.

OPTIMIZE employees;

-- With Z-ORDER (more on this below)
OPTIMIZE employees ZORDER BY (department, salary);

OPTIMIZE is idempotent and safe to run at any time.

Z-Ordering

Z-ORDER is a multi-dimensional data clustering technique. It co-locates related data (based on specified columns) within Parquet files. This dramatically improves query performance when filtering on those columns.

OPTIMIZE employees ZORDER BY (department);

After Z-ordering by department, queries like WHERE department = 'Engineering' will skip most files (data skipping), reading only files that contain that department's data.

Z-ORDER is most effective for:

  • High-cardinality columns used in WHERE filters
  • Columns used in JOIN conditions
  • Up to 3-4 columns (diminishing returns beyond that)

VACUUM

VACUUM removes old Parquet files that are no longer referenced by the current or recent transaction log entries:

-- Remove data files no longer referenced and older than the default retention period (7 days)
VACUUM employees;

-- Specify a retention period
VACUUM employees RETAIN 168 HOURS;  -- 7 days

-- DANGER: DRY RUN first!
VACUUM employees DRY RUN;

Important: The spark.databricks.delta.retentionDurationCheck.enabled setting (on by default) prevents VACUUM from running with a retention of less than 7 days. Override it carefully.

After VACUUM, time travel beyond the retention period is impossible.

ANALYZE

ANALYZE TABLE employees COMPUTE STATISTICS;
ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS department, salary;

Computes statistics (min, max, null count, distinct count) that the Spark query optimizer uses for better query planning.

Auto Optimize (Databricks-specific Delta feature)

Databricks adds two auto-optimization features:

Auto Compaction — automatically runs a compaction after writes if files are too small

ALTER TABLE employees SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true');

Optimized Writes — repartitions data before writing to produce right-sized files

ALTER TABLE employees SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');

Or enable globally in Spark config:

spark.databricks.delta.optimizeWrite.enabled = true
spark.databricks.delta.autoCompact.enabled = true

Liquid Clustering (Newer Feature)

Liquid Clustering is a newer alternative to Z-Ordering and partitioning that provides:

  • Flexible, incremental clustering
  • No need to know clustering columns upfront
  • Better performance than Z-ORDER for most workloads
  • Can change clustering columns without rewriting the whole table
CREATE TABLE employees
CLUSTER BY (department, salary)
AS SELECT * FROM source;

-- Re-cluster as data grows
OPTIMIZE employees;

Delta Table History

DESCRIBE HISTORY employees;

Shows all versions of the table with:

  • Version number
  • Timestamp
  • User who made the change
  • Operation (WRITE, DELETE, MERGE, OPTIMIZE, etc.)
  • Operation parameters

Table Partitioning

Partitioning physically separates data into subdirectories based on a column value:

CREATE TABLE sales (
  id INT,
  date DATE,
  region STRING,
  amount DOUBLE
)
PARTITIONED BY (region, date);

Data is stored as:

s3://bucket/sales/region=US/date=2024-01-01/*.parquet
s3://bucket/sales/region=EU/date=2024-01-01/*.parquet

When to partition:

  • Tables with billions of rows
  • Columns with low cardinality (< 10,000 distinct values)
  • Columns consistently used as filters
  • Columns used in GROUP BY at the partition level (e.g., date for daily processing)

When NOT to partition:

  • Small tables (< 1GB)
  • High-cardinality columns (creates too many small files)
  • Columns rarely used as filters

Exam tip: Over-partitioning is a common mistake. Liquid Clustering is generally preferred for new tables in Databricks.

Delta Change Data Feed (CDF)

Change Data Feed (also called Change Data Capture) captures every row-level change (INSERT, UPDATE, DELETE) to a Delta table and exposes it as a readable stream or batch.

Enable it on a table:

ALTER TABLE employees SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
-- OR at creation:
CREATE TABLE employees (...) TBLPROPERTIES (delta.enableChangeDataFeed = true);

Read the changes:

# Batch - read changes between versions
changes = spark.read.format("delta") \
  .option("readChangeFeed", "true") \
  .option("startingVersion", 5) \
  .option("endingVersion", 10) \
  .table("employees")

# Streaming - continuously process changes
stream = spark.readStream.format("delta") \
  .option("readChangeFeed", "true") \
  .option("startingVersion", "latest") \
  .table("employees")

Each change row has additional columns:

  • _change_type — insert, update_preimage, update_postimage, or delete
  • _commit_version — which version of the table
  • _commit_timestamp — when the change was committed

Use cases:

  • Efficient Silver → Gold propagation (only process changed rows)
  • Building CDC pipelines
  • Audit logs

Chapter 6 — Tables, Schemas & Databases

Managed vs. External Tables

This is a critical concept for the exam.

Managed Tables

  • Databricks (via the metastore) controls both the metadata AND the data files
  • Data is stored in the default warehouse location (DBFS or Unity Catalog managed storage)
  • When you DROP TABLE managed_table, both metadata AND data are deleted
  • No LOCATION keyword in the CREATE statement
CREATE TABLE managed_employees (
  id INT,
  name STRING
);
-- Data stored at: dbfs:/user/hive/warehouse/managed_employees/

External Tables

  • Databricks controls the metadata only
  • Data lives at a user-specified location (S3, ADLS, etc.)
  • When you DROP TABLE external_table, only metadata is deleted; data remains in S3
  • Uses LOCATION keyword
CREATE TABLE external_employees (
  id INT,
  name STRING
)
LOCATION 's3://my-data-bucket/employees/';

When to use external tables:

  • Data shared with other tools (Athena, Glue, EMR)
  • Data that should outlive the metastore
  • Data owned by another team
  • Large datasets already at a specific S3 path

When to use managed tables:

  • Data fully managed by Databricks
  • Simplicity — don't want to manage S3 paths manually
  • Unity Catalog managed storage is preferred

Views

Databricks supports three view types:

Regular Views

CREATE VIEW engineering_employees AS
SELECT * FROM employees WHERE department = 'Engineering';
  • Stored in the metastore
  • Query is re-executed each time
  • Can reference tables across schemas

Temporary Views

CREATE TEMP VIEW temp_high_earners AS
SELECT * FROM employees WHERE salary > 100000;
  • Only visible to the current SparkSession (your notebook session)
  • Dropped when the SparkSession ends
  • Not visible to other users or notebooks

Global Temporary Views

CREATE GLOBAL TEMP VIEW global_active AS
SELECT * FROM employees WHERE status = 'active';

-- Access via global_temp schema
SELECT * FROM global_temp.global_active;
  • Shared across SparkSessions on the same cluster
  • Dropped when the cluster terminates
  • Rarely used — prefer regular views for persistence

Schemas (Databases)

In Databricks, Schema and Database are synonymous (SQL standard uses Schema; Databricks/Hive historically uses Database). The two-level namespace is:

database.table
employees_db.employees

With Unity Catalog, it becomes three-level:

catalog.schema.table
main.employees_db.employees

Creating and Using Schemas

CREATE SCHEMA IF NOT EXISTS employees_db
COMMENT 'Employee data'
LOCATION 's3://my-bucket/employees_db/';  -- optional, for external schema

USE employees_db;  -- sets default schema for subsequent queries

SHOW SCHEMAS;
DESCRIBE SCHEMA employees_db;
DROP SCHEMA employees_db CASCADE;  -- CASCADE drops all tables in it

DESCRIBE Commands

Critical for exploring metadata:

-- Basic table info (column names and types)
DESCRIBE TABLE employees;

-- Detailed table info (storage location, format, partitions, etc.)
DESCRIBE DETAIL employees;

-- Table history (all versions)
DESCRIBE HISTORY employees;

-- Extended info (everything + table properties)
DESCRIBE EXTENDED employees;

SHOW Commands

SHOW TABLES;                    -- tables in current schema
SHOW TABLES IN employees_db;   -- tables in specific schema
SHOW SCHEMAS;                   -- all schemas
SHOW DATABASES;                 -- same as SHOW SCHEMAS
SHOW COLUMNS IN employees;     -- columns in a table
SHOW PARTITIONS employees;     -- partitions
SHOW TBLPROPERTIES employees;  -- table properties
SHOW CREATE TABLE employees;   -- show the CREATE statement

Table Cloning

Delta Lake supports two cloning modes:

Shallow Clone

CREATE TABLE employees_backup
SHALLOW CLONE employees;
  • Creates a new table that references the same Parquet files as the original
  • Only the metadata (transaction log) is copied — no data movement
  • Extremely fast and cheap
  • Good for: creating test copies, point-in-time snapshots for experiments

Important: If you run VACUUM on the original table, shallow clones may break! The underlying files may be deleted.

Deep Clone

CREATE TABLE employees_backup
DEEP CLONE employees;
  • Fully copies both metadata AND data files
  • Independent of the original table
  • Good for: migration, backup, production copies
  • Can be expensive for large tables

You can also clone incrementally:

CREATE OR REPLACE TABLE employees_backup
DEEP CLONE employees;

Run this repeatedly and it only copies new/changed files.


Chapter 7 — Databricks Utilities (dbutils)

dbutils is a utility library available in all Databricks notebooks. It provides programmatic access to many platform features.

dbutils.fs — File System Utilities

Think of this as a programmatic interface to DBFS and cloud storage.

# List files and directories
dbutils.fs.ls("dbfs:/")
dbutils.fs.ls("s3://my-bucket/data/")

# Move/rename
dbutils.fs.mv("dbfs:/source/file.txt", "dbfs:/dest/file.txt")

# Copy
dbutils.fs.cp("dbfs:/source/", "dbfs:/dest/", recurse=True)

# Delete
dbutils.fs.rm("dbfs:/path/to/file.txt")
dbutils.fs.rm("dbfs:/path/to/folder/", recurse=True)

# Create directory
dbutils.fs.mkdirs("dbfs:/my/new/directory/")

# Read file content (small files only!)
content = dbutils.fs.head("dbfs:/path/to/file.txt", maxBytes=65536)

# Mount an S3 bucket (legacy - prefer External Locations with UC)
dbutils.fs.mount(
  source="s3a://my-bucket/",
  mount_point="/mnt/my-bucket",
  extra_configs={"fs.s3a.aws.credentials.provider": "..."}
)

# List mounts
dbutils.fs.mounts()

# Unmount
dbutils.fs.unmount("/mnt/my-bucket")

The %fs magic command is a shorthand:

%fs ls /
%fs rm -r /path/to/delete/

dbutils.secrets — Secrets Management

One of the most security-critical utilities. Secrets allow you to store sensitive values (API keys, passwords, connection strings) without hardcoding them in notebooks or code.

Secret Scopes

A secret scope is a named collection of secrets.

Databricks-backed scopes — secrets are stored encrypted in the Databricks control plane and retrieved at runtime. On AWS, this is the supported scope type (Azure Key Vault-backed scopes exist only on Azure).

If your secrets live in AWS Secrets Manager, fetch them from the cluster with the AWS SDK via an instance profile, or sync them into a Databricks-backed scope.

# Create a Databricks-backed scope (Databricks CLI)
databricks secrets create-scope --scope my-scope

# Put a secret
databricks secrets put --scope my-scope --key db-password

# List secrets (only shows keys, never values)
databricks secrets list --scope my-scope

Using Secrets in Notebooks

# Retrieve a secret - returns a string
password = dbutils.secrets.get(scope="my-scope", key="db-password")

# IMPORTANT: If you print a secret, Databricks redacts it
print(password)  # Output: [REDACTED]

# List scopes
dbutils.secrets.listScopes()

# List keys in a scope (doesn't show values)
dbutils.secrets.list("my-scope")

Critical exam point: You can never see the actual value of a secret from within a notebook. print() always shows [REDACTED]. This is by design for security.

dbutils.widgets — Notebook Parameters

Covered in the Notebooks chapter. Key methods:

dbutils.widgets.text("param_name", "default_value", "Display Label")
dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"])
dbutils.widgets.get("param_name")  # retrieve value
dbutils.widgets.remove("param_name")
dbutils.widgets.removeAll()

dbutils.notebook — Notebook Orchestration

This enables calling other notebooks programmatically:

# Run another notebook with a timeout (seconds)
result = dbutils.notebook.run(
  "/path/to/notebook",
  timeout_seconds=300,
  arguments={"param1": "value1", "param2": "value2"}
)

# Exit a notebook with a return value
dbutils.notebook.exit("SUCCESS")
dbutils.notebook.exit(json.dumps({"status": "success", "rows_processed": 1000}))

Difference between %run and dbutils.notebook.run():

Feature                     %run               dbutils.notebook.run()
Executes in same session    Yes                No (new session)
Variables shared            Yes                No
Can capture return value    No                 Yes
Parallel execution          No                 Yes (multiple calls)
Timeout support             No                 Yes
Pass parameters             No (use widgets)   Yes

%run is for modular code sharing. dbutils.notebook.run() is for orchestration.
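
Because every dbutils.notebook.run() call gets its own run, you can fan out several calls in parallel from a driver notebook — a common orchestration pattern. The notebook path and parameters below are hypothetical:

# Fan out dbutils.notebook.run() calls in parallel threads (hypothetical notebook/params)
from concurrent.futures import ThreadPoolExecutor

regions = ["us", "eu", "apac"]

def run_for_region(region):
    return dbutils.notebook.run(
        "/ETL/process_region",          # hypothetical notebook path
        timeout_seconds=1800,
        arguments={"region": region},
    )

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_for_region, regions))

print(results)   # each entry is whatever the child notebook passed to dbutils.notebook.exit()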

dbutils.data — Preview Data

# Display summary statistics (counts, missing values, distributions) for a DataFrame
dbutils.data.summarize(df)

dbutils.library — Library Management (Deprecated)

In newer runtimes, use %pip instead. But for completeness:

# Deprecated, use %pip
dbutils.library.installPyPI("pandas", version="1.3.0")

dbutils.jobs — Job Context (Within Job Runs)

# Pass values between tasks when executing as part of a multi-task job
dbutils.jobs.taskValues.set(key="output_path", value="/path/to/output")
result = dbutils.jobs.taskValues.get(taskKey="upstream_task", key="output_path")

This is used in multi-task jobs to pass values between tasks.


Chapter 8 — Git Folders / Repos

What Are Git Folders?

Git Folders (formerly called Databricks Repos) is Databricks' native Git integration. It allows you to:

  • Connect a Databricks workspace folder to a Git repository
  • Pull, push, commit, create branches, and merge within the Databricks UI
  • Version-control your notebooks alongside application code
  • Implement CI/CD workflows for Databricks code

Supported Git Providers

  • GitHub (github.com and GitHub Enterprise)
  • GitLab
  • Bitbucket (Cloud and Server)
  • Azure DevOps
  • AWS CodeCommit

Setting Up Git Integration

  1. Go to User Settings → Git Integration
  2. Select your Git provider
  3. Enter your personal access token (PAT) or OAuth credentials
  4. Databricks stores credentials in the control plane (encrypted)

Working with Git Folders

Cloning a Repo

In the Databricks UI:

  1. Navigate to the Repos section (or Workspace → Git Folders)
  2. Click "Add Repo"
  3. Enter the repository URL
  4. Optionally specify a branch

This creates a folder in your workspace that mirrors the Git repository.

Git Operations in the UI

Within a Git Folder, you can:

  • Pull — fetch and merge latest changes from remote
  • Push — push local commits to remote
  • Commit — stage and commit changes
  • Branch — create or switch branches
  • Merge — merge branches
  • Reset — revert uncommitted changes

What Gets Versioned

Inside a Git Folder:

  • Notebooks — stored as .py, .sql, .scala, or .r files with special Databricks metadata comments
  • Python modules — regular .py files
  • SQL files — .sql files
  • Config files — requirements.txt, setup.py, etc.

Important: Regular workspace notebooks (outside Git Folders) are not in Git. Only notebooks inside a Git Folder are version-controlled.

CI/CD with Git Folders

A typical CI/CD workflow:

Developer (feature branch)
  → Git Folder in dev workspace
  → Runs tests in dev
  → Creates Pull Request on GitHub
  → PR review and approval
  → Merge to main
  → CI/CD pipeline (GitHub Actions / AWS CodePipeline)
    → Runs automated tests
    → Deploys to staging workspace (via Databricks CLI)
    → Deploys to production workspace (via Databricks CLI)

Databricks CLI for CI/CD

# Install CLI
pip install databricks-cli

# Configure
databricks configure --token

# Import a notebook
databricks workspace import local_notebook.py /Remote/Path/notebook --language PYTHON

# Export a notebook
databricks workspace export /Remote/Path/notebook local_notebook.py

# Deploy a job
databricks jobs create --json-file job_config.json

Databricks Asset Bundles (DABs) are the modern approach to CI/CD — a YAML-based framework for deploying Databricks resources (jobs, pipelines, notebooks, etc.) across environments.

Databricks Asset Bundles (DABs)

DABs use a databricks.yml configuration file:

bundle:
  name: my-data-pipeline

workspace:
  host: https://your-workspace.cloud.databricks.com

targets:
  dev:
    mode: development
    workspace:
      host: https://dev-workspace.databricks.com

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.databricks.com

resources:
  jobs:
    my_etl_job:
      name: My ETL Job
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/ingest

Deploy with:

databricks bundle deploy --target dev
databricks bundle run my_etl_job --target dev

Chapter 9 — Unity Catalog & Governance

What Is Unity Catalog?

Unity Catalog (UC) is Databricks' unified governance solution for data and AI. It provides:

  • Centralized metastore — single source of truth for all metadata across workspaces
  • Fine-grained access control — table, column, and row-level security
  • Data lineage — automatic tracking of data origin and transformations
  • Auditing — full audit log of all data access
  • Data discovery — searchable catalog with tagging and documentation

Before Unity Catalog, each Databricks workspace had its own Hive Metastore — workspace-local, no cross-workspace sharing, no fine-grained governance. Unity Catalog replaces this with a centralized, cross-workspace solution.

Three-Level Namespace

Unity Catalog introduces a three-level namespace:

catalog.schema.table

main.sales.transactions

Compared to the legacy two-level namespace:

schema.table  (or database.table)
sales.transactions

Catalog — the top-level container. You can have multiple catalogs for different environments, business units, or data domains.

Schema — equivalent to a database in the old model. Contains tables, views, and volumes.

Table/View/Volume — the actual data objects.

Navigating the Namespace

-- Set the default catalog
USE CATALOG main;

-- Set the default schema
USE SCHEMA sales;

-- Fully qualified reference
SELECT * FROM main.sales.transactions;

-- Partially qualified (using defaults)
SELECT * FROM sales.transactions;

Unity Catalog Object Hierarchy

Metastore (one per region)
  └── Catalog
        └── Schema
              ├── Table (Managed or External)
              ├── View
              ├── Volume
              └── Function

Metastore — The top-level Unity Catalog object. You create one metastore per region in your Databricks account, and all workspaces in that region attach to it.

Catalog — First-level namespace. Create catalogs for environments (dev, staging, prod) or business units.

Schema — Second-level namespace. Organize tables by domain.

Table — Same as before, but now with UC governance features.

Volume — A governance-enabled storage abstraction for non-tabular data (files, images, audio). Similar to a managed directory with access control.

CREATE VOLUME main.raw_data.audio_files;
-- Volume files are then accessible at the path /Volumes/main/raw_data/audio_files/
-- (list them from a notebook with %fs ls or dbutils.fs.ls)

Function — User-defined functions stored in the catalog.

Managed vs. External Storage in Unity Catalog

UC Managed Tables

  • Data stored in the UC metastore's managed storage location (an S3 bucket you configure)
  • When dropped, data is deleted
  • UC manages all file organization and optimization

External Tables in Unity Catalog

  • Data lives in an External Location (governed S3 path)
  • When dropped, only metadata is removed; data persists

External Locations

An External Location is a registered S3 path that UC governs. It requires:

  1. A Storage Credential — an IAM role that Databricks can use to access S3
  2. An External Location — the combination of Storage Credential + S3 path
-- First, an admin creates a storage credential (IAM role ARN)
CREATE STORAGE CREDENTIAL my_s3_credential
WITH IAM_ROLE 'arn:aws:iam::123456789:role/my-databricks-s3-role';

-- Then create an external location
CREATE EXTERNAL LOCATION my_data_location
URL 's3://my-data-bucket/data/'
WITH (STORAGE CREDENTIAL my_s3_credential);

-- Now create an external table at that location
CREATE TABLE main.sales.transactions
LOCATION 's3://my-data-bucket/data/transactions/';

Benefits over DBFS mounts:

  • Access control — only users granted permissions to the External Location can access data
  • Audit logging — all accesses are logged
  • Cross-workspace — works across all workspaces connected to the metastore

Access Control in Unity Catalog

UC uses a privilege-based model with GRANT and REVOKE:

Privilege Types

On Catalogs:

  • USE CATALOG — allows using the catalog (required for all access)
  • CREATE SCHEMA — create schemas within the catalog
  • ALL PRIVILEGES — grants all privileges

On Schemas:

  • USE SCHEMA — allows using the schema
  • CREATE TABLE — create tables
  • CREATE VIEW — create views
  • ALL PRIVILEGES

On Tables:

  • SELECT — read data
  • MODIFY — INSERT, UPDATE, DELETE
  • ALL PRIVILEGES

On External Locations:

  • READ FILES — read files from this location
  • WRITE FILES — write files to this location
  • CREATE EXTERNAL TABLE — create external tables at this location

Granting Privileges

-- Grant table access to a user
GRANT SELECT ON TABLE main.sales.transactions TO `user@example.com`;

-- Grant schema access to a group
GRANT USE SCHEMA, SELECT ON SCHEMA main.sales TO `data_analysts`;

-- Grant catalog access
GRANT USE CATALOG ON CATALOG main TO `data_team`;

-- Revoke
REVOKE SELECT ON TABLE main.sales.transactions FROM `user@example.com`;

-- Show grants
SHOW GRANTS ON TABLE main.sales.transactions;

Privilege Inheritance

Privileges cascade down the hierarchy:

  • Granting ALL PRIVILEGES on a catalog implicitly covers all schemas, tables, and views within it
  • However, USE CATALOG doesn't automatically grant USE SCHEMA — each level must be granted explicitly
  • To read a table, a user needs: USE CATALOG + USE SCHEMA + SELECT ON TABLE
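
Putting that together, a minimal grant sequence for read-only access might look like this (the group name is hypothetical; the objects reuse the examples above):

# Minimal privileges for a group to read one table (hypothetical group name)
for stmt in [
    "GRANT USE CATALOG ON CATALOG main TO `data_analysts`",
    "GRANT USE SCHEMA ON SCHEMA main.sales TO `data_analysts`",
    "GRANT SELECT ON TABLE main.sales.transactions TO `data_analysts`",
]:
    spark.sql(stmt)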

Row-Level Security and Column Masking

Unity Catalog supports fine-grained access control:

Row Filters — restrict which rows a user can see:

-- Create a row filter function
CREATE FUNCTION filter_by_region(region STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('us_team'), region = 'US', TRUE);

-- Apply to table
ALTER TABLE main.sales.transactions
SET ROW FILTER filter_by_region ON (region);

Column Masks — mask sensitive column values:

-- Create a mask function
CREATE FUNCTION mask_ssn(ssn STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('hr_team'), ssn, '***-**-****');

-- Apply to column
ALTER TABLE main.hr.employees
ALTER COLUMN ssn SET MASK mask_ssn;

Data Lineage

Unity Catalog automatically tracks data lineage — how data flows between tables, views, and notebooks. No configuration required.

You can view lineage in the Catalog Explorer UI:

  • Table lineage — upstream and downstream tables
  • Column lineage — trace a specific column back to its source

This is invaluable for:

  • Impact analysis (what breaks if I change this table?)
  • Debugging data quality issues
  • Compliance and auditing

Audit Logs

All Unity Catalog operations are logged to a system table:

-- Query the audit log (system catalog)
SELECT *
FROM system.access.audit
WHERE event_time > '2024-01-01'
  AND action_name = 'databricksUserEvent'
LIMIT 100;

Tags and Documentation

UC supports tagging objects with metadata:

ALTER TABLE main.sales.transactions
SET TAGS ('pii' = 'true', 'domain' = 'sales', 'owner' = 'sales-team@company.com');

COMMENT ON TABLE main.sales.transactions IS 'All sales transaction records from POS systems';
COMMENT ON COLUMN main.sales.transactions.customer_id IS 'Unique customer identifier, PII';

Delta Sharing

Delta Sharing is an open standard for sharing Delta Lake data across organizations without copying the data.

-- Create a share
CREATE SHARE my_data_share;

-- Add tables to the share
ALTER SHARE my_data_share
ADD TABLE main.sales.transactions;

-- Create a recipient
CREATE RECIPIENT partner_company
USING ID 'partner-databricks-id';

-- Grant access
GRANT SELECT ON SHARE my_data_share TO RECIPIENT partner_company;

The recipient accesses the data via the Delta Sharing protocol from their own Databricks (or other supported tool) — they never get direct S3 access.


Chapter 10 — Job Orchestration & Workflows

Databricks Workflows

Databricks Workflows (formerly Jobs) is the native orchestration engine for running Databricks workloads. It supports:

  • Single-task or multi-task jobs
  • Scheduling (cron syntax)
  • Triggered runs
  • Retry policies
  • Alerting (email, webhooks, Slack)
  • Job clusters (ephemeral) or all-purpose clusters

Job Concepts

Task

A task is the atomic unit of work in a Workflow. Each task can be:

  • Notebook task — runs a Databricks notebook
  • Python script task — runs a .py file
  • Python wheel task — runs a function from a packaged Python wheel
  • Spark JAR task — runs a Java/Scala Spark application
  • SQL task — runs SQL (against a SQL Warehouse)
  • Delta Live Tables task — triggers a DLT pipeline update
  • dbt task — runs a dbt project
  • Pipeline task — runs a DLT pipeline
  • Run job task — triggers another job (job chaining)
  • For each task — iterates over a list and runs a sub-task for each item

Multi-Task Jobs (Task Graphs)

Jobs can have multiple tasks connected by dependencies, forming a DAG (Directed Acyclic Graph):

[ingest_raw] → [validate_data] → [transform_silver] → [aggregate_gold]
                                                    ↘
                                                      [generate_report]

Tasks can run in parallel (if no dependency between them) or sequentially.

# Example job configuration structure
{
  "name": "Daily ETL Pipeline",
  "tasks": [
    {
      "task_key": "ingest",
      "notebook_task": {"notebook_path": "/ETL/01_ingest"}
    },
    {
      "task_key": "transform",
      "depends_on": [{"task_key": "ingest"}],
      "notebook_task": {"notebook_path": "/ETL/02_transform"}
    },
    {
      "task_key": "aggregate",
      "depends_on": [{"task_key": "transform"}],
      "notebook_task": {"notebook_path": "/ETL/03_aggregate"}
    }
  ]
}

Task Values (Passing Data Between Tasks)

Tasks in a multi-task job can pass data to downstream tasks using task values:

# In the upstream task (e.g., "ingest")
dbutils.jobs.taskValues.set(key="rows_ingested", value=1500)
dbutils.jobs.taskValues.set(key="output_path", value="/data/silver/2024-01-15/")

# In the downstream task (e.g., "transform")
rows = dbutils.jobs.taskValues.get(taskKey="ingest", key="rows_ingested", default=0)
path = dbutils.jobs.taskValues.get(taskKey="ingest", key="output_path")

Scheduling Jobs

Jobs can be triggered by:

  1. Manual trigger — click "Run Now" in the UI
  2. Cron schedule — standard cron syntax
  3. File arrival trigger — runs when a new file appears in a storage path
  4. Continuous trigger — reruns as soon as the previous run completes
  5. API trigger — POST to the Jobs API
# Cron examples
0 0 * * *       # Daily at midnight UTC
0 6 * * 1       # Every Monday at 6 AM
*/15 * * * *    # Every 15 minutes
0 2 1 * *       # First of every month at 2 AM

Cluster Configuration for Jobs

Each task can have its own cluster configuration:

  • Job cluster — created and terminated per-run (recommended for production)
  • All-purpose cluster — use an existing running cluster (development only)
  • Instance pool — use instances from a pool for faster startup

For multi-task jobs, different tasks can use different cluster configurations (e.g., a small cluster for data validation, a large cluster for heavy transformation).

Shared job clusters — multiple tasks in the same job run share a single cluster (cost-efficient when tasks don't need isolation).

Retry and Error Handling

Each task supports:

{
  "max_retries": 3,
  "min_retry_interval_millis": 60000,
  "retry_on_timeout": true,
  "timeout_seconds": 3600
}

Conditional task execution — you can run a task based on the outcome of a previous task:

  • all_success (default) — run only if all upstream tasks succeeded
  • all_done — run regardless of upstream status
  • at_least_one_success — run if at least one upstream succeeded
  • all_failed — run only if all upstream failed (useful for cleanup/alerting on failure)
  • at_least_one_failed — run if at least one upstream failed

Job Parameters

Jobs accept parameters at two levels:

Job-level parameters — available to all tasks:

{"run_date": "2024-01-15", "environment": "prod"}

Task-level parameters — specific to a task (e.g., widget values for a notebook task).

Access in notebooks via dbutils.widgets.get() or as job parameters.

Monitoring and Alerting

Job Run Status: Each run shows Pending, Running, Succeeded, Failed, Timedout, or Canceled.

Email Notifications:

{
  "email_notifications": {
    "on_success": ["team@company.com"],
    "on_failure": ["oncall@company.com", "manager@company.com"],
    "on_start": [],
    "no_alert_for_skipped_runs": true
  }
}

Webhook notifications — send HTTP callbacks to Slack, PagerDuty, custom endpoints on job events.

System Tables for Monitoring (Unity Catalog):

SELECT * FROM system.lakeflow.job_run_timeline
WHERE workspace_id = current_workspace_id()
  AND period_start_time > '2024-01-01'
ORDER BY period_start_time DESC;

Repair Runs

If a job fails partway through a multi-task DAG, you can repair the run — re-running only the failed tasks (and their dependents) without re-running successful tasks. This is huge for cost savings in long pipelines.


Chapter 11 — Data Ingestion Patterns

Bronze-Silver-Gold (Medallion Architecture)

This is the standard data organization pattern in Databricks:

Bronze Layer (Raw Ingestion)

  • Exact copy of source data — no transformations
  • All formats preserved (JSON, CSV, XML, Avro, Parquet)
  • Append-only (never delete from Bronze)
  • May store as Delta (preferred) or original format
  • Adds ingestion metadata: ingestion_timestamp, source_file, batch_id
  • This is your "system of record" — if anything goes wrong downstream, reprocess from Bronze
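
As a sketch, a Bronze append that attaches the ingestion metadata mentioned above might look like this — the path, batch ID, and target table are hypothetical:

# Bronze ingestion: land raw data as-is, plus ingestion metadata columns
from pyspark.sql.functions import current_timestamp, input_file_name, lit

raw = spark.read.json("s3://my-bucket/raw/events/2024-01-15/")    # hypothetical path

bronze = (raw
    .withColumn("ingestion_timestamp", current_timestamp())
    .withColumn("source_file", input_file_name())
    .withColumn("batch_id", lit("2024-01-15-run-01")))            # hypothetical batch id

bronze.write.format("delta").mode("append").saveAsTable("bronze.events")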

Silver Layer (Cleansed & Enriched)

  • Data validated, cleaned, and normalized
  • Schema enforced (Delta)
  • Deduplication applied
  • Null handling and type casting done
  • Joins to reference/lookup tables
  • Row-level quality checks
  • Still somewhat raw — business logic not fully applied
  • Query-friendly but not aggregated

Gold Layer (Business-Level Aggregates)

  • Aggregated, denormalized, business-specific views
  • Optimized for specific use cases (reporting, ML features)
  • Often organized by business domain or consumer
  • Direct source for BI tools, dashboards, ML models
  • Examples: daily sales totals, monthly active users, feature store tables

Batch Ingestion

Reading from S3

# Read CSV from S3
df = spark.read \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .csv("s3://my-bucket/raw/employees/*.csv")

# Read JSON
df = spark.read.json("s3://my-bucket/raw/events/")

# Read Parquet
df = spark.read.parquet("s3://my-bucket/raw/transactions/")

# Read with explicit schema (preferred - avoids schema inference overhead)
from pyspark.sql.types import *
schema = StructType([
  StructField("id", IntegerType(), False),
  StructField("name", StringType(), True),
  StructField("amount", DoubleType(), True)
])
df = spark.read.schema(schema).json("s3://my-bucket/raw/")

Auto Loader

Auto Loader is Databricks' incremental data ingestion mechanism. It efficiently processes new files as they arrive in S3, tracking which files have been processed.

# Auto Loader with cloudFiles format
df = spark.readStream \
  .format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .option("cloudFiles.schemaLocation", "dbfs:/checkpoints/schema/") \
  .option("cloudFiles.inferColumnTypes", "true") \
  .load("s3://my-bucket/raw/events/")

# Write the stream
# trigger(availableNow=True) processes all currently available files, then stops
df.writeStream \
  .format("delta") \
  .option("checkpointLocation", "dbfs:/checkpoints/events/") \
  .option("mergeSchema", "true") \
  .trigger(availableNow=True) \
  .toTable("bronze.events")

Key Auto Loader features:

  • Schema inference and evolution — automatically detects and adapts to new columns
  • File notification mode — uses AWS SQS/SNS for efficient, scalable file detection (recommended for large volumes)
  • Directory listing mode — periodically lists the directory (simpler, lower setup, but less efficient for large buckets)
  • Schema location — stores the inferred schema separately so it persists across runs
  • Checkpoint — tracks exactly which files have been processed (exactly-once guarantee)
  • rescuedDataColumn — captures data that didn't match the expected schema into a separate column

Auto Loader vs. COPY INTO:

| Feature | Auto Loader | COPY INTO |
|---|---|---|
| Use case | Streaming incremental loads | Batch incremental loads |
| Tracking | Checkpoint-based | Built-in idempotency (tracks loaded files) |
| Scale | Millions of files | Thousands of files |
| Schema evolution | Automatic | Manual |
| API | readStream | SQL command |

COPY INTO

COPY INTO is a SQL command for idempotent batch data loading:

COPY INTO bronze.events
FROM 's3://my-bucket/raw/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('inferSchema' = 'true', 'mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
  • Idempotent — running the same COPY INTO multiple times doesn't duplicate data; it only loads new files
  • Tracks loaded files internally
  • Good for scheduled batch loading
  • Simpler than Auto Loader but less scalable

Streaming Ingestion

Structured Streaming

Spark Structured Streaming is the engine underlying streaming in Databricks. Write streaming logic using the same DataFrame API as batch.

# Read from Kafka (on MSK - Amazon Managed Streaming for Kafka)
df = spark.readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") \
  .option("subscribe", "my-topic") \
  .option("startingOffsets", "latest") \
  .load()

# Parse the Kafka message value (typically JSON)
from pyspark.sql.functions import from_json, col
schema = StructType([...])
parsed = df.select(from_json(col("value").cast("string"), schema).alias("data")) \
  .select("data.*")

# Write to Delta
parsed.writeStream \
  .format("delta") \
  .outputMode("append") \
  .option("checkpointLocation", "/checkpoints/kafka_stream/") \
  .table("bronze.events")

Trigger Options

# Micro-batch triggered every 30 seconds
.trigger(processingTime='30 seconds')

# Process all available data, then stop (batch-like)
.trigger(availableNow=True)

# Continuous processing (low-latency, experimental)
.trigger(continuous='1 second')

# Default: process as fast as possible (micro-batch)

Output Modes

  • append — only new rows are written (most common for Delta)
  • complete — all rows of the result are written/overwritten (for aggregations)
  • update — only rows that changed since the last trigger are written (supported by only some sinks)

Checkpointing

Checkpoints store the streaming state — which offsets have been processed, aggregate states, etc. Always set a checkpoint location for production streaming jobs.

.option("checkpointLocation", "s3://my-bucket/checkpoints/my-stream/")

If a streaming job crashes and restarts, it resumes from the checkpoint — exactly-once processing.

foreachBatch

foreachBatch lets you apply arbitrary batch operations to each micro-batch of a stream:

def process_batch(batch_df, batch_id):
  # Complex transformations, MERGE, etc.
  batch_df.write.format("delta").mode("append").table("silver.events")
  # Send metrics to monitoring system

stream.writeStream \
  .foreachBatch(process_batch) \
  .option("checkpointLocation", "/checkpoints/") \
  .start()

This is powerful because you can use any batch API (including MERGE) within a stream.
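
Because MERGE is the most common reason to reach for foreachBatch, here is a minimal upsert sketch. The table name silver.customers, the id key, and the stream variable (the parsed stream from above) are illustrative assumptions.

from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Deduplicate within the micro-batch, then MERGE into the Silver table
    latest = batch_df.dropDuplicates(["id"])
    target = DeltaTable.forName(spark, "silver.customers")
    (target.alias("t")
        .merge(latest.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(stream.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/checkpoints/customers_upsert/")
    .start())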


Chapter 12 — Delta Live Tables (DLT) & Pipelines

What Are Delta Live Tables?

Delta Live Tables (DLT) is a declarative pipeline framework for building reliable data pipelines. Instead of writing imperative Spark code and managing the execution yourself, you declare what your tables should look like, and DLT figures out how to build them, run them, and keep them up-to-date.

Key Philosophy

In traditional ETL:

  • You write code that runs sequentially
  • You manage retries, error handling, checkpointing manually
  • You handle schema evolution manually
  • You figure out when to run each step and in what order

In DLT:

  • You declare table definitions
  • DLT analyzes dependencies and builds a DAG automatically
  • DLT handles retries, checkpointing, error handling
  • DLT enforces data quality with expectations
  • DLT supports both batch and streaming seamlessly

DLT Core Concepts

Defining Tables

In DLT notebooks, you use the @dlt.table decorator (Python) or CREATE OR REFRESH LIVE TABLE (SQL):

import dlt
from pyspark.sql.functions import *

# Bronze: Raw ingestion with Auto Loader
@dlt.table(
  name="raw_events",
  comment="Raw event data from S3",
  table_properties={"quality": "bronze"}
)
def raw_events():
  return (
    spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/checkpoints/raw_events_schema/")
      .load("s3://my-bucket/raw/events/")
  )
-- SQL equivalent
CREATE OR REFRESH STREAMING LIVE TABLE raw_events
COMMENT "Raw event data from S3"
AS SELECT * FROM cloud_files('s3://my-bucket/raw/events/', 'json',
  map('cloudFiles.inferColumnTypes', 'true'))

Live Tables vs. Streaming Live Tables

Live Tables (Materialized Views)

  • Process data in batch
  • Fully recomputed each run (by default)
  • Best for: aggregations, small tables, tables that need to be completely refreshed
@dlt.table
def daily_sales_summary():
  return (
    dlt.read("silver_transactions")
      .groupBy("date", "region")
      .agg(sum("amount").alias("total_sales"))
  )

Streaming Live Tables

  • Process data incrementally using Structured Streaming
  • Only process new data each run (huge cost savings)
  • Best for: high-volume append-only data (logs, events, IoT)
@dlt.table
def silver_events():
  return (
    dlt.read_stream("raw_events")
      .filter(col("event_type").isNotNull())
      .withColumn("processed_at", current_timestamp())
  )

Key difference: Use dlt.read() for batch (live table) dependencies, dlt.read_stream() for streaming dependencies.

Data Quality Expectations

Expectations are DLT's data quality enforcement mechanism. They're constraints you define on your tables. DLT tracks violations and can:

  1. @dlt.expect — warn on violations (data flows through, violations recorded as metrics)
  2. @dlt.expect_or_drop — drop violating rows (bad data doesn't contaminate downstream)
  3. @dlt.expect_or_fail — fail the pipeline on any violation (strict validation)
@dlt.table
@dlt.expect("valid_customer_id", "customer_id IS NOT NULL")
@dlt.expect("positive_amount", "amount > 0")
@dlt.expect_or_drop("valid_email", "email RLIKE '^[^@]+@[^@]+\\.[^@]+$'")
@dlt.expect_or_fail("no_negative_ids", "id >= 0")
def silver_transactions():
  return dlt.read_stream("bronze_transactions")
-- SQL expectations
CREATE OR REFRESH STREAMING LIVE TABLE silver_transactions (
  CONSTRAINT valid_customer_id EXPECT (customer_id IS NOT NULL),
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
  CONSTRAINT no_negative_ids EXPECT (id >= 0) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(LIVE.bronze_transactions)

Viewing expectation metrics in the pipeline UI shows you:

  • How many rows passed/failed each constraint
  • Trends over time

Pipeline Configuration

A DLT pipeline is the deployment unit. One pipeline can contain multiple notebooks, each defining tables/views.

Key pipeline settings:

  • Pipeline mode: TRIGGERED (batch) or CONTINUOUS (streaming, low latency)
  • Target schema: where to store the resulting tables
  • Storage location: where DLT stores internal state, logs, and tables
  • Cluster configuration: compute for the pipeline
  • Notifications: email on success/failure
  • Channel: CURRENT or PREVIEW (preview gets new DLT features first)

Triggered pipelines — run once, process the available data, then stop. They are started on a schedule or manually.

Continuous pipelines — run indefinitely, processing data as it arrives. Use them for near-real-time requirements.

Pipeline UI and Monitoring

The DLT UI provides:

  • DAG visualization — see all tables and their dependencies as a graph
  • Data quality metrics — expectation pass/fail rates per table
  • Table event logs — see exactly what happened during each update
  • Row statistics — rows read, written, failed per table per run

Event log is also queryable as a Delta table:

SELECT * FROM delta.`<pipeline-storage-location>/system/events`
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC;

DLT with Auto Loader

Auto Loader + DLT is the canonical pattern for Bronze ingestion:

@dlt.table(
  name="bronze_orders",
  table_properties={
    "quality": "bronze",
    "pipelines.reset.allowed": "false"  # prevent accidental full reload
  }
)
def bronze_orders():
  return (
    spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", 
              "/pipelines/my_pipeline/schema/orders/")
      .option("header", "true")
      .option("cloudFiles.inferColumnTypes", "true")
      .load("s3://my-bucket/raw/orders/")
      .withColumn("_source_file", input_file_name())
      .withColumn("_ingestion_time", current_timestamp())
  )

Apply Changes (CDC with DLT)

DLT has native support for Change Data Capture (CDC) via APPLY CHANGES INTO:

# Declare the streaming target table that APPLY CHANGES will maintain
dlt.create_streaming_table("customers")

# Apply changes from a source (CDC stream)
dlt.apply_changes(
  target="customers",
  source="customers_cdc",
  keys=["customer_id"],
  sequence_by="update_timestamp",  # determines order of changes
  apply_as_deletes=expr("operation = 'DELETE'"),
  except_column_list=["operation", "update_timestamp"],
  stored_as_scd_type=1  # Type 1: overwrite, Type 2: keep history
)

SCD Type 1 — just overwrite the record with the latest values
SCD Type 2 — keep history by creating new records with __START_AT and __END_AT columns

-- SQL equivalent
CREATE OR REFRESH STREAMING LIVE TABLE customers;

APPLY CHANGES INTO LIVE.customers
FROM STREAM(LIVE.customers_cdc)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY update_timestamp
STORED AS SCD TYPE 1;

DLT Table Properties and Access

Tables created by DLT can be accessed by:

  • Other DLT pipelines
  • Notebooks (after the pipeline has run)
  • SQL analytics
  • Jobs

Tables are stored in the target catalog/schema you specify in the pipeline configuration.

-- After pipeline runs, query the tables normally
SELECT * FROM main.silver.orders;

Limitations of DLT

  • DLT tables cannot be modified outside the pipeline (read-only externally)
  • You cannot manually write to a DLT-managed table
  • Full pipeline resets can be expensive (reprocess all data from scratch)
  • Limited to certain compute configurations

Chapter 13 — Apache Spark: High-Level Overview

Spark is the distributed computing engine underlying everything in Databricks. For the exam, you need a solid conceptual understanding. Here are the key topics at a high level:

Core Concepts

SparkContext / SparkSession

  • SparkContext — the entry point for RDD-based Spark (legacy)
  • SparkSession — the unified entry point for DataFrames, SQL, and Streaming (current)
  • Pre-created as spark in Databricks notebooks

RDDs (Resilient Distributed Datasets)

  • Low-level distributed data structure (the original Spark API)
  • Immutable, partitioned collection of records distributed across nodes
  • Rarely used directly today — use DataFrames/Datasets instead
  • Still underpins the execution engine

DataFrames

  • High-level, schema-aware, distributed collection of data organized in named columns
  • Think of it as a distributed SQL table
  • Lazy evaluation — operations build a logical plan until an action triggers execution

Datasets (Scala/Java)

  • Type-safe version of DataFrames (for JVM languages)
  • Not directly applicable in Python (PySpark uses DataFrames exclusively)

Transformations vs. Actions

Transformations — lazy operations that build a logical plan:

  • select(), filter(), groupBy(), join(), withColumn(), orderBy(), limit()
  • Do NOT trigger computation immediately
  • Can be narrow (don't require data movement between partitions) or wide (require shuffle)

Actions — trigger actual computation:

  • count(), show(), collect(), write(), save(), take()
  • Results are either returned to the driver or written to storage

Understanding lazy evaluation is key — Spark optimizes the entire plan before running.
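
A tiny illustration of lazy evaluation, reusing the bronze.events table from earlier (the column names are illustrative): the chained transformations only build a plan, and nothing executes until the count() action.

# Transformations: no Spark job is launched yet
events = spark.table("bronze.events")
errors = (events
    .filter("event_type = 'error'")                # narrow transformation
    .select("event_id", "timestamp", "payload")    # projection
    .withColumnRenamed("timestamp", "event_time"))

# Action: Catalyst optimizes the whole plan, then the job runs
print(errors.count())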

Catalyst Optimizer and Tungsten

Catalyst Optimizer — Spark's query optimization engine:

  1. Parses logical plan from DataFrame operations
  2. Analyzes and resolves column references
  3. Optimizes the logical plan (predicate pushdown, projection pruning, constant folding)
  4. Generates multiple physical plans
  5. Selects the best physical plan using cost-based optimization

Tungsten — Memory management and code generation:

  • Off-heap memory management
  • Whole-stage code generation (Java bytecode generation for tight loops)
  • Column-based binary format (no Java object overhead)

Photon — Databricks' native vectorized execution engine (C++):

  • Dramatically faster than standard Spark for SQL and Delta queries
  • Enabled via the Photon Runtime
  • Drop-in replacement — same API, faster execution

Partitions and Shuffles

Partitions — how data is distributed across executors:

  • spark.default.parallelism — default partition count for RDDs
  • spark.sql.shuffle.partitions — default partition count after a shuffle (default: 200)
  • On Databricks, Adaptive Query Execution (AQE) automatically adjusts partition counts

Shuffle — data redistribution across partitions/executors (wide transformation):

  • Triggered by: groupBy, join, orderBy, distinct, repartition
  • Expensive — involves writing data to disk and network transfer
  • Minimize shuffles for performance

repartition() vs coalesce() (see the sketch after this list):

  • repartition(n) — full shuffle to exactly n partitions (use when increasing or changing distribution)
  • coalesce(n) — narrow operation to reduce to n partitions (more efficient for reducing partition count, no full shuffle)
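
A minimal sketch of the difference; the partition counts and the customer_id column are illustrative.

# Full shuffle: redistribute into 200 partitions, optionally by a key (e.g. before a big join)
df_repart = df.repartition(200, "customer_id")

# Narrow: collapse down to 10 partitions without a full shuffle (e.g. to write fewer files)
df_coalesced = df.coalesce(10)

print(df_repart.rdd.getNumPartitions(), df_coalesced.rdd.getNumPartitions())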

Key DataFrame Operations

Transformations:

  • select() — projection (choose columns)
  • filter() / where() — row filtering
  • withColumn() — add/modify a column
  • withColumnRenamed() — rename a column
  • drop() — remove columns
  • groupBy().agg() — group and aggregate
  • join() — combine two DataFrames
  • union() / unionByName() — combine rows
  • orderBy() / sort() — sort rows
  • distinct() — deduplicate
  • dropDuplicates() — deduplicate (can specify subset of columns)
  • explode() — expand array/map column into rows
  • pivot() — pivot rows to columns
  • cache() / persist() — cache in memory
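
A short chain that combines several of the operations above, assuming a hypothetical silver.orders table:

from pyspark.sql.functions import col, countDistinct, sum as sum_

summary = (spark.table("silver.orders")
    .filter(col("status") == "COMPLETED")
    .withColumn("revenue", col("quantity") * col("unit_price"))
    .groupBy("region")
    .agg(sum_("revenue").alias("total_revenue"),
         countDistinct("customer_id").alias("customers"))
    .orderBy(col("total_revenue").desc()))

summary.show()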

Common Functions (pyspark.sql.functions):

  • col(), lit() — column references and literals
  • when().otherwise() — conditional logic (like CASE WHEN)
  • coalesce() — first non-null value
  • to_date(), to_timestamp() — type casting for dates
  • date_format(), date_add(), datediff() — date operations
  • split(), regexp_replace(), trim(), upper() — string operations
  • sum(), count(), avg(), max(), min() — aggregations
  • rank(), dense_rank(), row_number() — window functions
  • collect_list(), collect_set() — aggregate to arrays
  • from_json(), to_json() — JSON parsing
  • struct(), array(), map() — complex type creation
  • explode(), posexplode() — array/map to rows

Window Functions

Window functions perform calculations across a sliding window of rows:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, rank, row_number, lag, sum

# Define window spec
window = Window.partitionBy("department").orderBy(col("salary").desc())

# Apply window function
df.withColumn("rank", rank().over(window)) \
  .withColumn("running_total", sum("salary").over(
    Window.partitionBy("department").orderBy("hire_date")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)
  ))

Joins

Join types: inner, left (or left_outer), right (or right_outer), full (or full_outer), left_semi, left_anti, cross

result = orders.join(customers, "customer_id", "inner")
result = orders.join(customers, orders.customer_id == customers.id, "left")

Broadcast Join — when one table is small, broadcast it to all executors to avoid shuffle:

from pyspark.sql.functions import broadcast
result = large_table.join(broadcast(small_table), "key")

Adaptive Query Execution (AQE)

AQE (enabled by default in Databricks) optimizes query plans at runtime based on actual data statistics:

  • Coalescing shuffle partitions — reduces the number of post-shuffle partitions dynamically
  • Converting sort-merge join to broadcast join — if a table turns out to be small
  • Skew join optimization — handles data skew by splitting large partitions
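
These behaviors are controlled by standard Spark confs that you can inspect or toggle per session; a small sketch:

# AQE master switch plus the two sub-features described above
print(spark.conf.get("spark.sql.adaptive.enabled"))
print(spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled"))
print(spark.conf.get("spark.sql.adaptive.skewJoin.enabled"))

# Toggle per session if you want to compare query plans with and without AQE
spark.conf.set("spark.sql.adaptive.enabled", "true")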

Caching

from pyspark import StorageLevel

df.cache()  # DataFrame cache() defaults to MEMORY_AND_DISK
df.persist(StorageLevel.MEMORY_AND_DISK)
df.unpersist()  # Remove from cache
spark.catalog.cacheTable("my_table")
spark.catalog.uncacheTable("my_table")

Cache intermediate results that are reused multiple times in the same job.


Chapter 14 — AWS Integration Specifics

IAM and Authentication

Instance Profiles (Cluster IAM)

EC2 instances in a Databricks cluster can be assigned an IAM Instance Profile. This allows code running on the cluster to access AWS resources using AWS credentials — without hardcoding access keys.

# With an instance profile attached, this just works:
df = spark.read.parquet("s3://my-bucket/data/")
# Spark automatically uses the EC2 instance profile credentials

Setup:

  1. Create an IAM role with S3 permissions
  2. Create an instance profile from that role
  3. Register the instance profile ARN in Databricks (admin step)
  4. Users can then select that instance profile when creating a cluster

Cross-Account IAM Role

The Databricks control plane uses a cross-account IAM role in your AWS account to:

  • Create/terminate EC2 instances
  • Attach/detach EBS volumes
  • Configure security groups
  • Tag resources

This role has the minimal permissions needed — it cannot read your S3 data.

S3 Integration

Direct S3 Access (Recommended with Unity Catalog)

With Unity Catalog and External Locations, access S3 directly:

df = spark.read.format("delta").load("s3://my-bucket/delta/employees/")

DBFS Mount (Legacy)

dbutils.fs.mount(
  source="s3a://my-bucket/data/",
  mount_point="/mnt/my-data",
  extra_configs={
    "fs.s3a.aws.credentials.provider": 
      "com.amazonaws.auth.InstanceProfileCredentialsProvider"
  }
)
# Use as:
df = spark.read.parquet("/mnt/my-data/employees/")

S3 Path Formats

  • s3:// — standard AWS S3 URL scheme (used by AWS tools)
  • s3a:// — Hadoop's S3A connector (used by Spark, more efficient than s3n://)
  • dbfs:/mnt/<mount>/ — via DBFS mount

VPC Configuration

Databricks can be deployed in your VPC in two modes:

Default VPC — Databricks creates and manages a VPC in your account. Simple but less control.

Customer-managed VPC (VPC Injection) — You provide the VPC, subnets, and security groups. Required for:

  • Existing security/network policies
  • PrivateLink (no public internet)
  • Custom DNS
  • VPC peering with RDS, MSK, etc.

PrivateLink

For enterprises requiring no public internet traffic, Databricks supports AWS PrivateLink:

  • Control plane to data plane communication over private AWS network
  • No data leaves the AWS backbone
  • Requires customer-managed VPC

AWS Services Integration

Amazon MSK (Kafka)

df = spark.readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "b-1.msk-cluster.kafka.us-east-1.amazonaws.com:9092") \
  .option("subscribe", "transactions") \
  .option("kafka.security.protocol", "SASL_SSL") \
  .option("kafka.sasl.mechanism", "AWS_MSK_IAM") \
  .option("kafka.sasl.jaas.config", "software.amazon.msk.auth...") \
  .load()

Amazon Kinesis

df = spark.readStream \
  .format("kinesis") \
  .option("streamName", "my-stream") \
  .option("region", "us-east-1") \
  .option("initialPosition", "TRIM_HORIZON") \
  .load()

AWS Glue Data Catalog Integration

Databricks can use the AWS Glue Data Catalog as a Hive Metastore replacement:

spark.hadoop.hive.metastore.client.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory

This allows Databricks to discover tables registered in Glue that were created by Athena, Glue ETL, or other services.

Note: With Unity Catalog, Glue integration is less necessary — UC provides a superior alternative.

Amazon Redshift

# Read from Redshift
df = spark.read \
  .format("com.databricks.spark.redshift") \
  .option("url", "jdbc:redshift://cluster.us-east-1.redshift.amazonaws.com:5439/mydb") \
  .option("dbtable", "public.sales") \
  .option("tempdir", "s3://my-bucket/temp/") \
  .option("aws_iam_role", "arn:aws:iam::123456789:role/RedshiftS3Role") \
  .load()

Amazon RDS / Aurora

# JDBC connection
df = spark.read \
  .format("jdbc") \
  .option("url", "jdbc:mysql://my-rds.cluster.us-east-1.rds.amazonaws.com:3306/mydb") \
  .option("dbtable", "employees") \
  .option("user", dbutils.secrets.get("scope", "rds-user")) \
  .option("password", dbutils.secrets.get("scope", "rds-password")) \
  .option("numPartitions", 10) \
  .option("partitionColumn", "id") \
  .option("lowerBound", "1") \
  .option("upperBound", "1000000") \
  .load()

AWS Secrets Manager Backed Secret Scope

For AWS organizations already using AWS Secrets Manager:

# Create a secret scope backed by AWS Secrets Manager
databricks secrets create-scope \
  --scope aws-secrets \
  --scope-backend-type AWS_SECRETS_MANAGER \
  --region us-east-1
# Use secrets stored in AWS Secrets Manager
password = dbutils.secrets.get(scope="aws-secrets", key="my-secret-name")

Chapter 15 — Security & Access Control

Workspace-Level Security

User Management

  • Users are managed at the Databricks account level (account console)
  • SCIM (System for Cross-domain Identity Management) provisioning from identity providers (Okta, Azure AD, AWS SSO)
  • Groups simplify permission management — assign users to groups, grant permissions to groups

Permission Levels

For workspace objects (notebooks, clusters, jobs):

| Level | Notebooks | Clusters | Jobs |
|---|---|---|---|
| CAN READ | View | None | View results |
| CAN RUN | Run | None | Run job |
| CAN EDIT | Edit | Can restart | Edit settings |
| CAN MANAGE | All | Full control | Full control |
| IS OWNER | All + delete | Create/delete | Create/delete |

Cluster-Level Security

  • Cluster creation permission — not everyone can create clusters (controlled via admin settings or cluster policies)
  • Cluster access permission — who can attach notebooks to or terminate a cluster
  • Cluster policies — restrict what configurations users can apply

Table ACLs (Legacy — Pre-Unity Catalog)

Before Unity Catalog, table-level access control in the Hive Metastore was managed via GRANT statements on the cluster with table ACLs enabled. This is being replaced by Unity Catalog.

Network Security

IP Access Lists

Restrict workspace access to specific IP ranges:

  • Administrators can configure allowed IP ranges
  • Users outside the allowed ranges cannot access the workspace
  • Configured via Admin Settings → IP Access Lists

Encryption

  • Encryption at rest — S3 data is encrypted using AWS KMS (AWS-managed or customer-managed keys)
  • Encryption in transit — TLS/SSL for all communication between control plane, data plane, and clients
  • Customer-managed keys — for workspaces with strict security requirements, customers can manage their own KMS keys for notebook encryption and cluster EBS volumes

Compliance

Databricks supports various compliance certifications relevant for AWS:

  • SOC 2 Type II
  • ISO 27001
  • HIPAA (with Business Associate Agreement)
  • PCI DSS
  • FedRAMP (government use)

Chapter 16 — Performance & Best Practices

File Size and Compaction

The "small file problem" is one of the most common performance issues:

  • Each Spark task writes one output file
  • Thousands of small files → slow reads (S3 overhead per file)
  • Target file size: 128MB to 1GB for Parquet

Solutions:

  • OPTIMIZE command (manual)
  • Auto Compaction (automatic)
  • Optimized Writes (on write)
  • Liquid Clustering (ongoing)
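
As a sketch of how the manual and automatic options above are switched on, assuming an orders Delta table (the ZORDER column is illustrative):

# One-off compaction, optionally co-locating data for better file skipping
spark.sql("OPTIMIZE orders ZORDER BY (customer_id)")

# Opt the table into Optimized Writes and Auto Compaction for future writes
spark.sql("""
  ALTER TABLE orders SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")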

Data Skew

Data skew occurs when one or more partitions are much larger than others, causing certain executors to take much longer (stragglers).

Detection: in the Spark UI, look for stages where a single task takes 10x longer than the others.

Solutions:

  • Salting — add a random key suffix to distribute skewed keys (see the sketch after this list)
  • AQE skew optimization — automatic in Databricks (handles skew joins)
  • Repartition before a join on the skewed key
  • Broadcast join — if the smaller table is small enough
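
A minimal salting sketch, assuming big_df and small_df share a skewed join column named key; the salt factor of 8 is arbitrary.

from pyspark.sql.functions import array, explode, floor, lit, rand

SALT = 8  # number of salt buckets (arbitrary)

# Large side: assign each row to a random salt bucket
big_salted = big_df.withColumn("salt", floor(rand() * SALT).cast("int"))

# Small side: replicate every row once per bucket so the join still matches
small_salted = small_df.withColumn("salt", explode(array(*[lit(i) for i in range(SALT)])))

# Join on the original key plus the salt, spreading the hot key across tasks
joined = big_salted.join(small_salted, ["key", "salt"]).drop("salt")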

Caching Strategy

  • Cache DataFrames that are used multiple times in the same session
  • Use the Databricks disk cache (formerly the Delta cache), enabled at the cluster level via spark.databricks.io.cache.enabled, for repeatedly read Delta/Parquet files
  • Don't cache unnecessarily — caching takes memory and initialization time

Partitioning Strategy

  • Partition by date for time-series data (daily/monthly partitions)
  • Avoid partitioning by high-cardinality columns
  • Use Liquid Clustering as a more flexible alternative
  • Keep partition count manageable (< 10,000 partitions per table)
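
A sketch of both approaches; the table and column names are purely illustrative.

# Classic partitioning: a low-cardinality date column, one folder per day
(df.write.format("delta")
   .partitionBy("sale_date")
   .saveAsTable("gold.daily_sales"))

# Liquid Clustering alternative: no fixed folders, clustering keys can evolve later
spark.sql("""
  CREATE TABLE gold.daily_sales_lc (
    sale_date DATE, customer_id BIGINT, amount DOUBLE
  ) CLUSTER BY (sale_date, customer_id)
""")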

Schema Best Practices

  • Always define explicit schemas for production pipelines
  • Avoid inferSchema in production (expensive, can produce wrong types)
  • Use StringType initially if unsure, cast downstream
  • Enable mergeSchema only when needed — it can hide bugs

Z-Ordering vs. Partitioning vs. Liquid Clustering

| | Partitioning | Z-Order | Liquid Clustering |
|---|---|---|---|
| Data skipping | Folder-level | File-level | File-level |
| Works on | Low-cardinality | High-cardinality | Any cardinality |
| Ongoing maintenance | None | Manual OPTIMIZE | Auto (OPTIMIZE) |
| Flexibility | Low (fixed at creation) | Medium | High |
| Recommended for | Rarely for new pipelines | Existing tables | New tables |

Chapter 17 — Additional Databricks Concepts

Databricks SQL (DBSQL)

Databricks SQL is the analytics-focused layer of Databricks:

  • SQL Editor — web-based SQL query interface
  • SQL Warehouses — dedicated compute for SQL (serverless or classic)
  • Dashboards — build visualizations and dashboards from SQL results
  • Alerts — trigger notifications based on query results
  • Query History — see all executed queries

Serverless SQL Warehouses — compute is managed by Databricks, instant startup, automatic scaling, no cluster configuration needed.

Databricks MLflow

MLflow is an open-source ML lifecycle platform built into Databricks:

  • Tracking — log parameters, metrics, and artifacts from ML experiments
  • Models — store, version, and deploy ML models
  • Registry — model versioning and stage management (Staging, Production, Archived)
  • Projects — package ML code for reproducibility
import mlflow
import mlflow.sklearn

with mlflow.start_run():
  # Train model
  model.fit(X_train, y_train)

  # Log params
  mlflow.log_param("n_estimators", 100)
  mlflow.log_metric("accuracy", accuracy)

  # Log model
  mlflow.sklearn.log_model(model, "my-model")

Feature Store

Databricks Feature Store is a centralized repository for ML features:

  • Store features as Delta tables
  • Reuse features across teams and models
  • Automatic feature lineage
  • Point-in-time lookups for training data

Databricks Connect

Databricks Connect allows you to run Spark code from your local IDE (VS Code, IntelliJ, etc.) against a remote Databricks cluster:

  • Write code locally with full IDE support (autocomplete, debugging)
  • Execute on the remote cluster
  • Great for development — combine local tooling with cloud compute
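
A minimal sketch, assuming the Spark Connect-based databricks-connect package is installed locally and the workspace connection is configured via a Databricks CLI profile or environment variables:

from databricks.connect import DatabricksSession

# Builds a SparkSession that executes on the remote cluster/workspace
spark = DatabricksSession.builder.getOrCreate()

df = spark.read.table("samples.nyctaxi.trips")  # a built-in sample table
print(df.count())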

Databricks REST API

Almost everything in Databricks can be automated via the REST API:

  • Create/manage clusters, jobs, notebooks
  • Trigger job runs
  • Manage permissions
  • List and manage tables

The Databricks CLI wraps the REST API for command-line use.
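
As a hedged example, triggering a job run with the run-now endpoint; the host, token, job ID, and notebook parameter are placeholders.

import requests

HOST = "https://<workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                   # placeholder

resp = requests.post(
    f"{HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": 123, "notebook_params": {"run_date": "2024-01-01"}},
)
resp.raise_for_status()
print(resp.json()["run_id"])  # ID of the triggered run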

Databricks Lakehouse Monitoring

Lakehouse Monitoring provides data quality monitoring for Delta tables:

  • Profile data distributions over time
  • Detect drift (statistical changes in data)
  • Generate monitoring metrics as Delta tables
  • Works for both batch and streaming tables

Data Contracts and Quality

Beyond DLT expectations, common data quality patterns:

  • Great Expectations — open-source data quality library integrated with Databricks
  • Delta constraints — SQL CHECK constraints on Delta tables
  • Data quality monitoring — Lakehouse Monitoring
-- Add a CHECK constraint
ALTER TABLE orders ADD CONSTRAINT valid_quantity CHECK (quantity > 0);

Chapter 18 — Exam Tips & Topic Summary

Exam Structure

The Databricks Data Engineer Associate exam (exam code: Databricks-Certified-Data-Engineer-Associate) typically covers:

  • ~20% Databricks Lakehouse Platform — architecture, workspace, clusters, notebooks
  • ~30% ELT with Spark and Delta Lake — transformations, Delta features, SQL
  • ~20% Incremental Data Processing — Auto Loader, Structured Streaming, DLT
  • ~20% Production Pipelines — jobs, workflows, DLT pipelines
  • ~10% Data Governance — Unity Catalog, permissions, data access

High-Frequency Exam Topics

Managed vs. External Tables

  • Know the DROP behavior (metadata only vs. data+metadata)
  • Know when to use each

Delta Lake Features

  • Time travel (VERSION AS OF, TIMESTAMP AS OF)
  • MERGE syntax (WHEN MATCHED, WHEN NOT MATCHED, WHEN NOT MATCHED BY SOURCE)
  • OPTIMIZE and Z-ORDER
  • VACUUM and data retention
  • Schema enforcement vs. evolution
  • CDF (Change Data Feed)
  • SHALLOW CLONE vs. DEEP CLONE

Cluster Types

  • All-Purpose for development
  • Job clusters for production (auto-terminated, cost-efficient)
  • SQL Warehouses for SQL analytics
  • Instance pools for fast startup

%run vs. dbutils.notebook.run()

  • %run shares scope (variables, SparkSession)
  • dbutils.notebook.run() runs independently, can capture return value, can run in parallel

Streaming

  • Trigger types (availableNow, processingTime, continuous)
  • Checkpoint location importance
  • foreachBatch for complex streaming operations

DLT

  • @dlt.expect (warn) vs. @dlt.expect_or_drop (drop row) vs. @dlt.expect_or_fail (fail pipeline)
  • Streaming vs. batch tables in DLT
  • APPLY CHANGES INTO for CDC

Auto Loader

  • cloudFiles format
  • Schema inference and schemaLocation
  • File notification vs. directory listing mode

Unity Catalog

  • Three-level namespace (catalog.schema.table)
  • Privilege hierarchy (must grant USE at each level)
  • Managed vs. External tables in UC
  • Storage Credentials and External Locations

dbutils.secrets

  • Secrets are always [REDACTED] in output
  • Two scope types: Databricks-backed and AWS Secrets Manager-backed

Common Traps

  1. VACUUM removes time travel capability — after VACUUM, you can't travel back to removed versions
  2. DROP TABLE on a managed table deletes data — external tables only delete metadata
  3. %run in a notebook runs in the same scope — not isolation like dbutils.notebook.run()
  4. Auto-scaling doesn't work well with streaming — use fixed clusters
  5. OPTIMIZE doesn't run automatically — unless Auto Compaction is enabled
  6. Schema inference in production is bad — use explicit schemas
  7. Temp views are session-scoped — not visible to other notebooks
  8. Job clusters are created per-run and terminated after — you can't manually start/stop them
  9. DLT tables are read-only outside the pipeline — you can't write to them directly
  10. Z-ORDER is applied as a clause of OPTIMIZE (OPTIMIZE t ZORDER BY ...) — it never runs automatically

Quick Reference: Key SQL Commands

-- Time Travel
SELECT * FROM t VERSION AS OF 5;
SELECT * FROM t TIMESTAMP AS OF '2024-01-01';

-- History and Metadata
DESCRIBE HISTORY t;
DESCRIBE DETAIL t;
DESCRIBE EXTENDED t;

-- Maintenance
OPTIMIZE t;
OPTIMIZE t ZORDER BY (col1, col2);
VACUUM t RETAIN 168 HOURS;
RESTORE TABLE t TO VERSION AS OF 3;

-- Clone
CREATE TABLE t_backup SHALLOW CLONE t;
CREATE TABLE t_backup DEEP CLONE t;

-- MERGE
MERGE INTO target USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

-- DML
UPDATE t SET col = value WHERE condition;
DELETE FROM t WHERE condition;

-- CDC
ALTER TABLE t SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
SELECT * FROM table_changes('t', 5, 10);

-- UC Permissions
GRANT SELECT ON TABLE catalog.schema.table TO `user@company.com`;
REVOKE SELECT ON TABLE catalog.schema.table FROM `user@company.com`;
SHOW GRANTS ON TABLE catalog.schema.table;

Quick Reference: Key Python Patterns

# Auto Loader (Bronze ingestion)
spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .option("cloudFiles.schemaLocation", "/checkpoints/schema/") \
  .load("s3://bucket/path/")

# Streaming write
df.writeStream.format("delta") \
  .option("checkpointLocation", "/checkpoints/stream/") \
  .trigger(availableNow=True) \
  .table("bronze.events")

# Secrets
dbutils.secrets.get(scope="my-scope", key="my-key")

# Notebook parameters
dbutils.widgets.get("param_name")

# Task values (multi-task jobs)
dbutils.jobs.taskValues.set("key", "value")
dbutils.jobs.taskValues.get(taskKey="upstream", key="key")

# Run another notebook
result = dbutils.notebook.run("/path/to/notebook", 300, {"arg": "value"})

# Delta MERGE in Python
from delta.tables import DeltaTable
dt = DeltaTable.forName(spark, "target_table")
dt.alias("target").merge(
  source.alias("source"),
  "target.id = source.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

DLT Quick Reference

import dlt

@dlt.table
@dlt.expect("not_null_id", "id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
@dlt.expect_or_fail("positive_id", "id > 0")
def my_silver_table():
  return dlt.read_stream("my_bronze_table") \
    .filter(...)
-- SQL DLT
CREATE OR REFRESH STREAMING LIVE TABLE silver_orders (
  CONSTRAINT valid_id EXPECT (id IS NOT NULL),
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.bronze_orders) WHERE ...;

Conclusion

The Databricks Data Engineer Associate exam tests your ability to design, build, and maintain data pipelines on the Databricks Lakehouse Platform. The most important areas to master are:

  1. Delta Lake fundamentals — This is everywhere. Know ACID, time travel, MERGE, OPTIMIZE/VACUUM/ZORDER inside out.

  2. Medallion Architecture — Understand Bronze/Silver/Gold and what belongs in each layer.

  3. DLT and streaming — Know the declarative pipeline approach, expectations, and Auto Loader.

  4. Cluster types and their appropriate use cases — All-purpose for dev, Job clusters for prod, SQL Warehouses for analytics.

  5. Unity Catalog — The three-level namespace, privilege model, and the difference between managed and external tables.

  6. Orchestration — Multi-task jobs, task dependencies, retry policies, and task values.

  7. dbutils — Know every submodule: fs, secrets, widgets, notebook, and jobs.

The exam is scenario-based — it will describe a business requirement and ask you to identify the correct Databricks tool or approach. Read each question carefully, eliminate clearly wrong answers, and apply the principles covered in this guide.

Good luck with your Databricks Data Engineer Associate certification! 🚀


This guide covers the Databricks Data Engineer Associate exam curriculum. Always check the official Databricks certification exam guide for the latest topic list.
