A Comprehensive Deep-Dive into Every Exam Topic
Table of Contents
- Introduction to Databricks & the Lakehouse Paradigm
- Databricks Architecture
- Compute — Clusters & Runtimes
- Notebooks
- Delta Lake & Data Management
- Tables, Schemas & Databases
- Databricks Utilities (dbutils)
- Git Folders / Repos
- Unity Catalog & Governance
- Job Orchestration & Workflows
- Data Ingestion Patterns
- Delta Live Tables (DLT) & Pipelines
- Apache Spark — High-Level Overview
- AWS Integration Specifics
- Security & Access Control
- Performance & Best Practices
- Exam Tips & Summary
Chapter 1 — Introduction to Databricks & the Lakehouse Paradigm
What Is Databricks?
Databricks is a unified data analytics platform built on top of Apache Spark. It was founded by the original creators of Apache Spark and is available as a managed cloud service on AWS, Azure, and GCP. For AWS engineers, Databricks runs inside your AWS account — it provisions and manages Amazon EC2 instances, S3 buckets, IAM roles, VPCs, and other AWS resources on your behalf.
The platform is designed to bridge the gap between data engineering, data science, and business analytics — providing a single pane of glass for all data workloads.
The Lakehouse Architecture
Before Databricks, teams operated in one of two paradigms:
Data Warehouses (Redshift, Snowflake) — Structured, fast SQL queries, but expensive for large raw datasets, limited support for ML, and data duplication.
Data Lakes (S3 + Glue + Athena) — Cheap storage of any format, but no ACID transactions, no data quality guarantees, schema enforcement challenges, and slow query performance without significant tuning.
The Lakehouse combines the best of both:
- Cheap, open-format storage (S3 + Parquet/Delta)
- ACID transactions and data reliability
- Support for SQL, Batch, Streaming, ML, and BI workloads
- A unified governance layer
- Open standards (Delta Lake, Apache Iceberg support)
Databricks is the leading commercial implementation of the Lakehouse pattern.
Why This Matters for the Exam
The exam frequently tests whether you understand which layer (Bronze, Silver, Gold) a given operation belongs to, what Delta Lake provides over raw Parquet, and how Databricks differs from a traditional data warehouse or raw S3 data lake.
Chapter 2 — Databricks Architecture
Control Plane vs. Data Plane
This is one of the most fundamental concepts for both the exam and real-world AWS deployments.
Control Plane (Databricks-managed AWS Account)
The control plane lives in Databricks' own AWS account. It contains:
- Databricks Web Application — the UI you interact with
- Cluster Manager — orchestrates cluster lifecycle
- Job Scheduler — triggers and monitors workflow runs
- Notebook Storage — notebooks are stored in the control plane (unless using Git Folders)
- Metadata Services — the Hive Metastore (legacy) or Unity Catalog Metastore
The control plane never touches your actual data. It sends instructions to resources in your account.
Data Plane (Your AWS Account)
The data plane lives in your AWS account and contains:
- EC2 instances — actual Spark driver and worker nodes
- S3 buckets — your data storage (the DBFS root bucket and your own buckets)
- VPC, Subnets, Security Groups — network resources
- IAM Roles & Instance Profiles — permissions for cluster nodes to access S3
When a cluster is launched, the Databricks control plane communicates with your AWS account (via a cross-account IAM role) to spin up EC2 instances inside your VPC.
Why This Split Matters
Security teams love this architecture because your data never leaves your AWS account. Databricks only stores metadata and notebook code. If you use Secrets (more on that later), sensitive values are encrypted in the control plane but never transmitted in plain text.
Databricks Workspace
A workspace is the top-level organizational unit in Databricks. Think of it like an AWS account — it has its own:
- Users and groups
- Clusters
- Notebooks
- Jobs
- Data objects (catalogs, schemas, tables)
- Secrets
On AWS, a workspace is tied to a specific AWS region and a specific S3 bucket (the "root DBFS bucket"). Large organizations often have multiple workspaces (dev, staging, prod) — this is the recommended pattern.
DBFS — Databricks File System
DBFS is a virtual file system layer that abstracts over your underlying storage. It provides a unified path namespace (dbfs:/) that maps to:
- An S3 bucket managed by Databricks (the root bucket, e.g., s3://databricks-workspace-XXXXXXX/)
- Other mounted S3 paths
- External locations (with Unity Catalog)
Key points:
- DBFS is not a separate storage system — it's a layer over S3
- The default DBFS root bucket is managed by Databricks but lives in your AWS account
- You can mount external S3 buckets to DBFS paths (legacy approach)
- With Unity Catalog, External Locations replace mounts as the preferred approach
- DBFS paths look like dbfs:/user/hive/warehouse/ or dbfs:/FileStore/
Important DBFS paths to know:
- dbfs:/FileStore/ — files uploaded via the UI (accessible via a /FileStore/ URL)
- dbfs:/user/hive/warehouse/ — default managed table location (Hive Metastore)
- dbfs:/databricks/ — internal Databricks paths
- dbfs:/mnt/<mount-name>/ — mounted external storage (legacy)
Cluster Architecture
Each Databricks cluster is a standard Apache Spark cluster:
Driver Node (EC2)
|
|-- Worker Node 1 (EC2)
| |-- Executor 1
| |-- Executor 2
|
|-- Worker Node 2 (EC2)
|-- Executor 3
|-- Executor 4
- Driver — runs the main program, coordinates Spark jobs, hosts the SparkSession
- Workers — run Spark executors that process data in parallel
- Executors — JVM processes on workers that execute tasks
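As a quick illustration of this split, here is a minimal sketch (nothing beyond the pre-created spark session is assumed): the driver builds the plan and collects the small result, while the actual row processing runs as tasks inside the executors.
# Lazy: the driver only builds a query plan here, no data is touched yet
df = spark.range(0, 100_000_000)
doubled = df.selectExpr("id * 2 AS doubled")

# An action triggers a Spark job: tasks execute on the workers' executors,
# and only the final scalar comes back to the driver
print(doubled.count())

# Total task slots (cores) available across the cluster's executors
print(spark.sparkContext.defaultParallelism)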
Chapter 3 — Compute: Clusters & Runtimes
Cluster Types
All-Purpose Clusters
- Created manually (UI, CLI, or API), optionally governed by a cluster policy
- Designed for interactive development — notebooks, ad-hoc queries
- Can be shared by multiple users simultaneously
- More expensive because they run continuously (until terminated)
- Persist between runs; you can restart them
- Billed per DBU (Databricks Unit) while running
Exam tip: All-purpose clusters are for development. Avoid using them for scheduled production workloads — that's what job clusters are for.
Job Clusters
- Created automatically when a Databricks Job runs
- Terminated automatically when the job completes
- Cannot be shared interactively
- Much cheaper — you only pay while the job is running
- Configuration is defined in the job definition
- Each job run gets a fresh, isolated cluster
Exam tip: For production workloads, always prefer job clusters for cost efficiency and isolation.
SQL Warehouses (Databricks SQL)
- Specifically designed for SQL analytics workloads
- Used by Databricks SQL, BI tools (Tableau, Power BI), and dbt
- Available as Classic, Pro, or Serverless warehouses
- Auto-scaling and auto-stop built-in
- Not used for Spark DataFrame API or ML workloads directly
Cluster Configuration
Runtime Versions
The Databricks Runtime (DBR) is the set of software running on your cluster:
- Databricks Runtime (Standard) — Spark + Delta Lake + standard libraries
- Databricks Runtime ML — adds ML libraries (MLflow, TensorFlow, PyTorch, scikit-learn)
- Databricks Runtime for Genomics — specialized for genomics
- Photon Runtime — includes the Photon query engine (C++ vectorized execution for Delta Lake queries — much faster for SQL)
LTS versions (Long Term Support) — recommended for production. They receive bug fixes for a longer period. Example: 13.3 LTS, 12.2 LTS.
Always use the latest LTS version unless you have a specific reason to use a different version.
Cluster Modes
Standard Mode (Single User)
- One user owns the cluster
- Recommended for most production workloads
- Full Spark features available
Shared Mode
- Multiple users share the same cluster
- Isolation via Unity Catalog and fine-grained access control
- Some Spark features restricted for safety
- Requires Unity Catalog
Single Node Mode
- Driver only — no worker nodes
- Good for small data, ML model training on single machine, development
- Spark still works (local mode)
Auto-Scaling
Databricks clusters can automatically scale the number of worker nodes based on load:
- Set a minimum and maximum worker count
- Databricks monitors the job queue and scales up/down accordingly
- Scale-down has a configurable timeout (default: 120 seconds of idle)
- Does not work well with Streaming jobs — use fixed-size clusters for streaming
Auto-Termination
Set an idle timeout (e.g., 60 minutes) after which the cluster terminates automatically if no notebooks are attached or jobs are running. This is critical for cost control.
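To make these knobs concrete, here is a hedged sketch of a cluster definition submitted through the Clusters REST API from Python — the workspace URL, token, and instance type are placeholders, and the values should be adapted to your environment:
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                         # placeholder

cluster_spec = {
    "cluster_name": "etl-dev",
    "spark_version": "13.3.x-scala2.12",                  # an LTS runtime
    "node_type_id": "r5.xlarge",                           # memory-optimized example
    "autoscale": {"min_workers": 2, "max_workers": 8},     # auto-scaling bounds
    "autotermination_minutes": 60,                         # terminate after 60 idle minutes
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())   # returns the new cluster_id on success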
Instance Types (AWS)
For the exam, you don't need to memorize specific EC2 types, but understand:
- Memory-optimized (r5, r6i) — good for large DataFrames, joins, aggregations
- Compute-optimized (c5, c6i) — good for CPU-intensive transforms
- Storage-optimized (i3) — good for heavy shuffle operations (local NVMe SSDs)
- GPU instances (p3, g4dn) — for ML Runtime (deep learning)
Instance Pools
An instance pool maintains a set of pre-warmed EC2 instances ready to be allocated to clusters. This dramatically reduces cluster startup time (from ~5-8 minutes to ~30-60 seconds).
Key characteristics:
- Pools have a minimum idle instance count and maximum capacity
- Idle pool instances are running EC2 instances — you pay AWS for them, but they incur no DBU charges while idle
- When a cluster uses a pool, it grabs instances from the pool
- When a cluster terminates, instances return to the pool
- Configured with a specific instance type and Databricks Runtime
Exam tip: Use instance pools when you need fast cluster startup for jobs or have many short-lived clusters.
Cluster Policies
Cluster policies are JSON configurations that restrict and/or set default values for cluster configuration. They help:
- Control costs (max DBUs per hour, max instance count)
- Enforce standards (specific runtimes, instance types)
- Simplify UX for non-technical users
- Apply governance rules
Only workspace admins can create policies. Users can only create clusters that conform to the policies assigned to them.
// Example: restrict cluster to specific runtime
{
"spark_version": {
"type": "fixed",
"value": "13.3.x-scala2.12"
},
"autotermination_minutes": {
"type": "fixed",
"value": 60
}
}
DBUs — Databricks Units
DBUs (Databricks Units) are the billing unit for Databricks. A DBU is a normalized unit of processing capability consumed per hour of compute; how many DBUs a cluster burns per hour depends on its instance types and node count. The DBU rate varies by:
- Cluster type (All-Purpose vs. Jobs vs. SQL Warehouse)
- Workload type (Data Engineering, Data Analytics, etc.)
- Whether Photon is enabled
You pay both EC2 costs (to AWS) AND DBU costs (to Databricks).
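A back-of-the-envelope calculation helps internalize this. All figures below are invented for illustration — look up your actual DBU rate, per-instance DBU emission, and EC2 pricing.
# Hypothetical figures for illustration only
dbu_rate_per_hour = 0.15     # $/DBU for Jobs Compute (example)
dbus_per_node_hour = 0.75    # DBUs a node emits per hour (example)
ec2_price_per_hour = 0.252   # $/hour for the chosen instance type (example)

nodes = 5                    # 1 driver + 4 workers
hours = 2.0

databricks_cost = dbu_rate_per_hour * dbus_per_node_hour * nodes * hours
aws_cost = ec2_price_per_hour * nodes * hours

print(f"Databricks (DBU) cost: ${databricks_cost:.2f}")
print(f"AWS (EC2) cost: ${aws_cost:.2f}")
print(f"Total: ${databricks_cost + aws_cost:.2f}")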
Chapter 4 — Notebooks
What Is a Databricks Notebook?
A notebook is a web-based, interactive document containing cells of code and markdown. It's the primary development interface in Databricks. Notebooks in Databricks have superpowers compared to Jupyter notebooks — they integrate tightly with the platform.
Notebook Basics
Languages
Each notebook has a default language (Python, SQL, Scala, R). However, you can mix languages within a single notebook using magic commands:
- %python — execute a Python cell
- %sql — execute a SQL cell
- %scala — execute a Scala cell
- %r — execute an R cell
- %md — render Markdown (documentation)
- %sh — execute shell commands on the driver node
- %fs — shortcut for dbutils.fs commands
- %run — run another notebook inline (powerful for modularization)
- %pip — install Python packages in the notebook environment
- %conda — use conda for package management (less common)
Key insight about %run: When you %run another notebook, it executes in the same SparkSession and scope as the calling notebook. Variables, functions, and DataFrames defined in the sub-notebook are available in the parent notebook. This is different from a function call — it's more like an include.
Key insight about %sh: Shell commands run on the driver node only, not on worker nodes. Output is printed inline. You can install packages with %sh pip install ... but %pip is preferred because it installs on all nodes.
Widget Parameters
Notebooks can accept input parameters using widgets:
dbutils.widgets.text("table_name", "default_value", "Table Name")
table_name = dbutils.widgets.get("table_name")
Widget types:
- text — free-text input
- dropdown — select from a list
- combobox — dropdown with free-text option
- multiselect — select multiple values
Widgets are particularly useful when notebooks are run as jobs — you can pass parameter values at runtime.
Notebook Results & Visualization
- The last expression in a cell is displayed as output
- DataFrames are displayed as interactive tables with display(df)
- display() supports charts — bar, line, pie, map, etc.
- print() outputs plain text
- Markdown cells render formatted documentation inline
Notebook Lifecycle
Notebooks can be in these states:
- Detached — not connected to a cluster, cannot run code
- Attached — connected to a running or starting cluster
- Running — actively executing cells
You attach a notebook to a cluster by selecting the cluster from the dropdown in the top-right. Multiple notebooks can be attached to the same all-purpose cluster simultaneously.
Auto-Save and Versioning
Notebooks auto-save frequently. Databricks maintains a revision history for every notebook — you can see the full history and revert to any previous version. This is separate from Git versioning (covered later).
Exporting Notebooks
Notebooks can be exported as:
- .ipynb — Jupyter Notebook format
- .py — Python source file (with # COMMAND ---------- separators)
- .scala — Scala source
- .sql — SQL source
- .dbc — Databricks archive (proprietary, includes all cells and results)
- .html — HTML with rendered output (for sharing results)
Notebook Context & SparkSession
In every Databricks notebook, these objects are pre-initialized:
- spark — the SparkSession
- sc — the SparkContext
- dbutils — Databricks Utilities
- display() — the display function
- displayHTML() — render HTML content
You never need to create a SparkSession manually in a notebook.
Chapter 5 — Delta Lake & Data Management
This is the most important chapter for the exam. Delta Lake underpins everything in Databricks.
What Is Delta Lake?
Delta Lake is an open-source storage layer that brings ACID transactions and reliability to data lakes. It uses standard Parquet files for data storage but adds a transaction log on top that enables:
- ACID transactions
- Schema enforcement and evolution
- Time travel (data versioning)
- Upserts and deletes (DML operations)
- Streaming and batch unification
- Optimized read/write operations
Delta Lake is the default table format in Databricks. When you create a table without specifying a format, it's Delta.
Delta Lake Architecture
The Delta Transaction Log
The transaction log (located in the _delta_log/ folder of every Delta table) is the backbone of Delta Lake. It's a folder containing:
- JSON files (0000000000.json, 0000000001.json, ...) — each represents one transaction
- Checkpoint files (.parquet) — every 10 commits, a full checkpoint is created for performance
Every operation (write, update, delete, schema change) creates a new entry in the transaction log. The log is append-only — you never modify existing log entries.
Reading a Delta table:
- Spark reads the _delta_log/ to determine the current state
- It identifies which Parquet files are "live" (not removed by deletes/updates)
- It reads only those Parquet files
Writing to a Delta table:
- Spark writes new Parquet files to the table directory
- Atomically commits a new JSON entry to _delta_log/
- The transaction either fully commits or fully fails — no partial writes
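You can inspect this mechanism directly — a minimal sketch, assuming an existing Delta table at the hypothetical managed-table path below:
import json

table_path = "dbfs:/user/hive/warehouse/employees/"   # hypothetical table location

# The table directory holds Parquet data files plus the _delta_log/ folder
display(dbutils.fs.ls(table_path))

# Each numbered JSON file in the log is one committed transaction
log_files = sorted(f.path for f in dbutils.fs.ls(table_path + "_delta_log/")
                   if f.path.endswith(".json"))
display(log_files)

# Peek at the first commit: each line is a JSON action (commitInfo, metaData, add, remove, ...)
first_commit = dbutils.fs.head(log_files[0])
for line in first_commit.splitlines()[:5]:
    print(list(json.loads(line).keys()))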
ACID Properties in Delta Lake
Atomicity — A transaction either fully succeeds or fully fails. If your Spark job fails halfway, no partial data is committed.
Consistency — Schema enforcement ensures data always conforms to the table's schema.
Isolation — Concurrent reads and writes don't interfere. Readers always see a consistent snapshot. Writers use optimistic concurrency control.
Durability — Once committed, data persists to S3.
Delta Lake Operations
Creating Delta Tables
-- SQL approach (creates a managed table in default schema)
CREATE TABLE employees (
id INT,
name STRING,
salary DOUBLE,
department STRING
)
USING DELTA;
-- Or from an existing data source
CREATE TABLE employees
USING DELTA
AS SELECT * FROM parquet.`/path/to/parquet/`;
-- CTAS (Create Table As Select)
CREATE TABLE silver_employees
AS SELECT id, name, salary FROM bronze_employees WHERE salary > 0;
Writing Data
# Overwrite entire table
df.write.format("delta").mode("overwrite").saveAsTable("employees")
# Append to table
df.write.format("delta").mode("append").saveAsTable("employees")
# Write to a path (external table pattern)
df.write.format("delta").save("s3://my-bucket/delta/employees/")
Reading Data
# Read a table
df = spark.read.table("employees")
# Read from path
df = spark.read.format("delta").load("s3://my-bucket/delta/employees/")
# SQL
spark.sql("SELECT * FROM employees WHERE department = 'Engineering'")
MERGE (Upsert)
MERGE is one of Delta Lake's most powerful features — it allows you to upsert (update + insert) data efficiently:
MERGE INTO target_table AS target
USING source_table AS source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET target.name = source.name,
target.salary = source.salary
WHEN NOT MATCHED THEN
INSERT (id, name, salary, department)
VALUES (source.id, source.name, source.salary, source.department)
WHEN NOT MATCHED BY SOURCE THEN
DELETE
Key MERGE clauses:
- WHEN MATCHED — row exists in both target and source
- WHEN NOT MATCHED (or WHEN NOT MATCHED BY TARGET) — row in source but not target (INSERT)
- WHEN NOT MATCHED BY SOURCE — row in target but not in source (useful for DELETE)
You can have multiple WHEN MATCHED clauses with different conditions.
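The same upsert can be written with the Python DeltaTable API — a sketch assuming a source_df DataFrame and an existing Delta table named target_table with a matching id column:
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target_table")

(target.alias("t")
    .merge(source_df.alias("s"), "t.id = s.id")          # join condition for the upsert
    .whenMatchedUpdate(set={"name": "s.name", "salary": "s.salary"})
    .whenNotMatchedInsertAll()                            # insert rows missing from the target
    .execute())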
DELETE and UPDATE
-- Delete rows
DELETE FROM employees WHERE department = 'Legacy';
-- Update rows
UPDATE employees SET salary = salary * 1.10 WHERE department = 'Engineering';
These are true DML operations — not possible with raw Parquet files!
Time Travel
Time travel is the ability to query historical versions of a Delta table. Every commit creates a new version.
-- Query version 5
SELECT * FROM employees VERSION AS OF 5;
-- Query as of a timestamp
SELECT * FROM employees TIMESTAMP AS OF '2024-01-15 10:00:00';
-- In Python
df = spark.read.format("delta") \
.option("versionAsOf", 5) \
.load("/path/to/table")
Use cases:
- Audit and compliance — what did the data look like at a specific time?
- Recovery — accidentally deleted data? Roll back!
- Reproducibility — ML experiments can reference exact data versions
How long can you go back?
Time travel is limited by the data retention period (default: 30 days). Old versions are removed by the VACUUM command.
RESTORE
You can restore a table to a previous version:
RESTORE TABLE employees TO VERSION AS OF 3;
RESTORE TABLE employees TO TIMESTAMP AS OF '2024-01-10';
This creates a new commit that restores the state — it doesn't delete history.
Schema Enforcement and Evolution
Schema Enforcement (Schema Validation)
By default, Delta Lake rejects writes that don't match the table's schema. This prevents silent data corruption.
If you try to write a DataFrame with extra columns or incompatible types to a Delta table, you'll get an AnalysisException.
Schema Evolution
You can allow schema changes using mergeSchema or overwriteSchema:
# Add new columns to the table schema automatically
df.write.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save("/path/to/table")
# Completely replace the schema (dangerous!)
df.write.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.save("/path/to/table")
mergeSchema — adds new columns from the incoming data to the schema. Existing columns are preserved.
overwriteSchema — replaces the entire schema (requires overwrite mode).
To rename or drop columns in SQL, first enable column mapping with ALTER TABLE <table> SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name').
Delta Lake Table Properties and Optimization
OPTIMIZE
OPTIMIZE compacts small Parquet files into larger ones. Small files are a common performance problem — each Parquet file write from a Spark task creates a file, and tables with many small files are slow to read.
OPTIMIZE employees;
-- With Z-ORDER (more on this below)
OPTIMIZE employees ZORDER BY (department, salary);
OPTIMIZE is idempotent and safe to run at any time.
Z-Ordering
Z-ORDER is a multi-dimensional data clustering technique. It co-locates related data (based on specified columns) within Parquet files. This dramatically improves query performance when filtering on those columns.
OPTIMIZE employees ZORDER BY (department);
After Z-ordering by department, queries like WHERE department = 'Engineering' will skip most files (data skipping), reading only files that contain that department's data.
Z-ORDER is most effective for:
- High-cardinality columns used in WHERE filters
- Columns used in JOIN conditions
- Up to 3-4 columns (diminishing returns beyond that)
VACUUM
VACUUM removes old Parquet files that are no longer referenced by the current or recent transaction log entries:
-- Remove files older than the default retention period (7 days for VACUUM, 30 days for time travel)
VACUUM employees;
-- Specify a retention period
VACUUM employees RETAIN 168 HOURS; -- 7 days
-- DANGER: DRY RUN first!
VACUUM employees DRY RUN;
Important: By default, spark.databricks.delta.retentionDurationCheck.enabled is true, which prevents VACUUM from running with a retention of less than 7 days. Override it carefully.
After VACUUM, time travel beyond the retention period is impossible.
ANALYZE
ANALYZE TABLE employees COMPUTE STATISTICS;
ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS department, salary;
Computes statistics (min, max, null count, distinct count) that the Spark query optimizer uses for better query planning.
Auto Optimize (Databricks-specific Delta feature)
Databricks adds two auto-optimization features:
Auto Compaction — automatically runs a compaction after writes if files are too small
ALTER TABLE employees SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true');
Optimized Writes — repartitions data before writing to produce right-sized files
ALTER TABLE employees SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');
Or enable globally in Spark config:
spark.databricks.delta.optimizeWrite.enabled = true
spark.databricks.delta.autoCompact.enabled = true
Liquid Clustering (Newer Feature)
Liquid Clustering is a newer alternative to Z-Ordering and partitioning that provides:
- Flexible, incremental clustering
- No need to know clustering columns upfront
- Better performance than Z-ORDER for most workloads
- Can change clustering columns without rewriting the whole table
CREATE TABLE employees
CLUSTER BY (department, salary)
AS SELECT * FROM source;
-- Re-cluster as data grows
OPTIMIZE employees;
Delta Table History
DESCRIBE HISTORY employees;
Shows all versions of the table with:
- Version number
- Timestamp
- User who made the change
- Operation (WRITE, DELETE, MERGE, OPTIMIZE, etc.)
- Operation parameters
Table Partitioning
Partitioning physically separates data into subdirectories based on a column value:
CREATE TABLE sales (
id INT,
date DATE,
region STRING,
amount DOUBLE
)
PARTITIONED BY (region, date);
Data is stored as:
s3://bucket/sales/region=US/date=2024-01-01/*.parquet
s3://bucket/sales/region=EU/date=2024-01-01/*.parquet
When to partition:
- Tables with billions of rows
- Columns with low cardinality (< 10,000 distinct values)
- Columns consistently used as filters
- Columns used in GROUP BY at the partition level (e.g., date for daily processing)
When NOT to partition:
- Small tables (< 1GB)
- High-cardinality columns (creates too many small files)
- Columns rarely used as filters
Exam tip: Over-partitioning is a common mistake. Liquid Clustering is generally preferred for new tables in Databricks.
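For reference, the equivalent partitioned write with the DataFrame API — a sketch assuming a sales_df DataFrame with the columns from the SQL example above:
# Produces one subdirectory per (region, date) combination, as shown above
(sales_df.write
    .format("delta")
    .partitionBy("region", "date")
    .mode("overwrite")
    .saveAsTable("sales"))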
Delta Change Data Feed (CDF)
Change Data Feed (Delta's change data capture capability) captures every row-level change (INSERT, UPDATE, DELETE) to a Delta table and exposes it as a readable stream or batch.
Enable it on a table:
ALTER TABLE employees SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
-- OR at creation:
CREATE TABLE employees (...) TBLPROPERTIES (delta.enableChangeDataFeed = true);
Read the changes:
# Batch - read changes between versions
changes = spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 5) \
.option("endingVersion", 10) \
.table("employees")
# Streaming - continuously process changes
stream = spark.readStream.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", "latest") \
.table("employees")
Each change row has additional columns:
- _change_type — insert, update_preimage, update_postimage, or delete
- _commit_version — which version of the table the change belongs to
- _commit_timestamp — when the change was committed
Use cases:
- Efficient Silver → Gold propagation (only process changed rows)
- Building CDC pipelines
- Audit logs
Chapter 6 — Tables, Schemas & Databases
Managed vs. External Tables
This is a critical concept for the exam.
Managed Tables
- Databricks (via the metastore) controls both the metadata AND the data files
- Data is stored in the default warehouse location (DBFS or Unity Catalog managed storage)
- When you DROP TABLE managed_table, both metadata AND data are deleted
- No LOCATION keyword in the CREATE statement
CREATE TABLE managed_employees (
id INT,
name STRING
);
-- Data stored at: dbfs:/user/hive/warehouse/managed_employees/
External Tables
- Databricks controls the metadata only
- Data lives at a user-specified location (S3, ADLS, etc.)
- When you DROP TABLE external_table, only metadata is deleted; data remains in S3
- Uses the LOCATION keyword
CREATE TABLE external_employees (
id INT,
name STRING
)
LOCATION 's3://my-data-bucket/employees/';
When to use external tables:
- Data shared with other tools (Athena, Glue, EMR)
- Data that should outlive the metastore
- Data owned by another team
- Large datasets already at a specific S3 path
When to use managed tables:
- Data fully managed by Databricks
- Simplicity — don't want to manage S3 paths manually
- Unity Catalog managed storage is preferred
Views
Databricks supports three view types:
Regular Views
CREATE VIEW engineering_employees AS
SELECT * FROM employees WHERE department = 'Engineering';
- Stored in the metastore
- Query is re-executed each time
- Can reference tables across schemas
Temporary Views
CREATE TEMP VIEW temp_high_earners AS
SELECT * FROM employees WHERE salary > 100000;
- Only visible to the current SparkSession (your notebook session)
- Dropped when the SparkSession ends
- Not visible to other users or notebooks
Global Temporary Views
CREATE GLOBAL TEMP VIEW global_active AS
SELECT * FROM employees WHERE status = 'active';
-- Access via global_temp schema
SELECT * FROM global_temp.global_active;
- Shared across SparkSessions on the same cluster
- Dropped when the cluster terminates
- Rarely used — prefer regular views for persistence
Schemas (Databases)
In Databricks, Schema and Database are synonymous (SQL standard uses Schema; Databricks/Hive historically uses Database). The two-level namespace is:
database.table
employees_db.employees
With Unity Catalog, it becomes three-level:
catalog.schema.table
main.employees_db.employees
Creating and Using Schemas
CREATE SCHEMA IF NOT EXISTS employees_db
COMMENT 'Employee data'
LOCATION 's3://my-bucket/employees_db/'; -- optional, for external schema
USE employees_db; -- sets default schema for subsequent queries
SHOW SCHEMAS;
DESCRIBE SCHEMA employees_db;
DROP SCHEMA employees_db CASCADE; -- CASCADE drops all tables in it
DESCRIBE Commands
Critical for exploring metadata:
-- Basic table info (column names and types)
DESCRIBE TABLE employees;
-- Detailed table info (storage location, format, partitions, etc.)
DESCRIBE DETAIL employees;
-- Table history (all versions)
DESCRIBE HISTORY employees;
-- Extended info (everything + table properties)
DESCRIBE EXTENDED employees;
SHOW Commands
SHOW TABLES; -- tables in current schema
SHOW TABLES IN employees_db; -- tables in specific schema
SHOW SCHEMAS; -- all schemas
SHOW DATABASES; -- same as SHOW SCHEMAS
SHOW COLUMNS IN employees; -- columns in a table
SHOW PARTITIONS employees; -- partitions
SHOW TBLPROPERTIES employees; -- table properties
SHOW CREATE TABLE employees; -- show CREATE statement
Table Cloning
Delta Lake supports two cloning modes:
Shallow Clone
CREATE TABLE employees_backup
SHALLOW CLONE employees;
- Creates a new table that references the same Parquet files as the original
- Only the metadata (transaction log) is copied — no data movement
- Extremely fast and cheap
- Good for: creating test copies, point-in-time snapshots for experiments
Important: If you run VACUUM on the original table, shallow clones may break! The underlying files may be deleted.
Deep Clone
CREATE TABLE employees_backup
DEEP CLONE employees;
- Fully copies both metadata AND data files
- Independent of the original table
- Good for: migration, backup, production copies
- Can be expensive for large tables
You can also clone incrementally:
CREATE OR REPLACE TABLE employees_backup
DEEP CLONE employees;
Run this repeatedly and it only copies new/changed files.
Chapter 7 — Databricks Utilities (dbutils)
dbutils is a utility library available in all Databricks notebooks. It provides programmatic access to many platform features.
dbutils.fs — File System Utilities
Think of this as a programmatic interface to DBFS and cloud storage.
# List files and directories
dbutils.fs.ls("dbfs:/")
dbutils.fs.ls("s3://my-bucket/data/")
# Move/rename
dbutils.fs.mv("dbfs:/source/file.txt", "dbfs:/dest/file.txt")
# Copy
dbutils.fs.cp("dbfs:/source/", "dbfs:/dest/", recurse=True)
# Delete
dbutils.fs.rm("dbfs:/path/to/file.txt")
dbutils.fs.rm("dbfs:/path/to/folder/", recurse=True)
# Create directory
dbutils.fs.mkdirs("dbfs:/my/new/directory/")
# Read file content (small files only!)
content = dbutils.fs.head("dbfs:/path/to/file.txt", maxBytes=65536)
# Mount an S3 bucket (legacy - prefer External Locations with UC)
dbutils.fs.mount(
source="s3a://my-bucket/",
mount_point="/mnt/my-bucket",
extra_configs={"fs.s3a.aws.credentials.provider": "..."}
)
# List mounts
dbutils.fs.mounts()
# Unmount
dbutils.fs.unmount("/mnt/my-bucket")
The %fs magic command is a shorthand:
%fs ls /
%fs rm -r /path/to/delete/
dbutils.secrets — Secrets Management
One of the most security-critical utilities. Secrets allow you to store sensitive values (API keys, passwords, connection strings) without hardcoding them in notebooks or code.
Secret Scopes
A secret scope is a named collection of secrets. Two types:
Databricks-backed scopes — secrets stored in the Databricks control plane, encrypted at rest.
AWS Secrets Manager-backed scopes — secrets stored in AWS Secrets Manager; Databricks fetches them dynamically.
# Create a Databricks-backed scope (Databricks CLI)
databricks secrets create-scope --scope my-scope
# Put a secret
databricks secrets put --scope my-scope --key db-password
# List secrets (only shows keys, never values)
databricks secrets list --scope my-scope
Using Secrets in Notebooks
# Retrieve a secret - returns a string
password = dbutils.secrets.get(scope="my-scope", key="db-password")
# IMPORTANT: If you print a secret, Databricks redacts it
print(password) # Output: [REDACTED]
# List scopes
dbutils.secrets.listScopes()
# List keys in a scope (doesn't show values)
dbutils.secrets.list("my-scope")
Critical exam point: You can never see the actual value of a secret from within a notebook. print() always shows [REDACTED]. This is by design for security.
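A common pattern is feeding secrets into a connection that needs credentials — a sketch assuming the scope, keys, and JDBC endpoint below exist in your environment:
# Credentials are pulled at runtime and never appear in the notebook source
jdbc_user = dbutils.secrets.get(scope="my-scope", key="db-username")
jdbc_password = dbutils.secrets.get(scope="my-scope", key="db-password")

df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://my-db-host:5432/analytics")   # hypothetical endpoint
    .option("dbtable", "public.customers")
    .option("user", jdbc_user)
    .option("password", jdbc_password)
    .load())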
dbutils.widgets — Notebook Parameters
Covered in the Notebooks chapter. Key methods:
dbutils.widgets.text("param_name", "default_value", "Display Label")
dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"])
dbutils.widgets.get("param_name") # retrieve value
dbutils.widgets.remove("param_name")
dbutils.widgets.removeAll()
dbutils.notebook — Notebook Orchestration
This enables calling other notebooks programmatically:
# Run another notebook with a timeout (seconds)
result = dbutils.notebook.run(
"/path/to/notebook",
timeout_seconds=300,
arguments={"param1": "value1", "param2": "value2"}
)
# Exit a notebook with a return value
dbutils.notebook.exit("SUCCESS")
dbutils.notebook.exit(json.dumps({"status": "success", "rows_processed": 1000}))
Difference between %run and dbutils.notebook.run():
| Feature | %run | dbutils.notebook.run() |
|---|---|---|
| Executes in same session | Yes | No (new session) |
| Variables shared | Yes | No |
| Can capture return value | No | Yes |
| Parallel execution | No | Yes (multiple calls) |
| Timeout support | No | Yes |
| Pass parameters | No (use widgets) | Yes |
%run is for modular code sharing. dbutils.notebook.run() is for orchestration.
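Because dbutils.notebook.run() is a blocking call that returns the sub-notebook's exit value, you can fan out several notebooks in parallel with plain Python threads — a sketch with hypothetical notebook paths:
from concurrent.futures import ThreadPoolExecutor

notebooks = ["/ETL/load_customers", "/ETL/load_orders", "/ETL/load_products"]   # hypothetical paths

def run_notebook(path):
    # Each call executes in its own notebook session on the attached cluster
    return dbutils.notebook.run(path, timeout_seconds=1800, arguments={"env": "dev"})

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_notebook, notebooks))

print(results)   # each entry is whatever the sub-notebook passed to dbutils.notebook.exit()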
dbutils.data — Preview Data
# Compute and display summary statistics for a DataFrame
dbutils.data.summarize(df)
dbutils.library — Library Management (Deprecated)
In newer runtimes, use %pip instead. But for completeness:
# Deprecated, use %pip
dbutils.library.installPyPI("pandas", version="1.3.0")
dbutils.jobs — Job Context (Within Job Runs)
# Get the current run ID when executing as part of a job
dbutils.jobs.taskValues.set(key="output_path", value="/path/to/output")
result = dbutils.jobs.taskValues.get(taskKey="upstream_task", key="output_path")
This is used in multi-task jobs to pass values between tasks.
Chapter 8 — Git Folders / Repos
What Are Git Folders?
Git Folders (formerly called Databricks Repos) is Databricks' native Git integration. It allows you to:
- Connect a Databricks workspace folder to a Git repository
- Pull, push, commit, create branches, and merge within the Databricks UI
- Version-control your notebooks alongside application code
- Implement CI/CD workflows for Databricks code
Supported Git Providers
- GitHub (github.com and GitHub Enterprise)
- GitLab
- Bitbucket (Cloud and Server)
- Azure DevOps
- AWS CodeCommit
Setting Up Git Integration
- Go to User Settings → Git Integration
- Select your Git provider
- Enter your personal access token (PAT) or OAuth credentials
- Databricks stores credentials in the control plane (encrypted)
Working with Git Folders
Cloning a Repo
In the Databricks UI:
- Navigate to the Repos section (or Workspace → Git Folders)
- Click "Add Repo"
- Enter the repository URL
- Optionally specify a branch
This creates a folder in your workspace that mirrors the Git repository.
Git Operations in the UI
Within a Git Folder, you can:
- Pull — fetch and merge latest changes from remote
- Push — push local commits to remote
- Commit — stage and commit changes
- Branch — create or switch branches
- Merge — merge branches
- Reset — revert uncommitted changes
What Gets Versioned
Inside a Git Folder:
- Notebooks — stored as .py, .sql, .scala, or .r files with special Databricks metadata comments
- Python modules — regular .py files
- SQL files — .sql files
- Config files — requirements.txt, setup.py, etc.
Important: Regular workspace notebooks (outside Git Folders) are not in Git. Only notebooks inside a Git Folder are version-controlled.
CI/CD with Git Folders
A typical CI/CD workflow:
Developer (feature branch)
→ Git Folder in dev workspace
→ Runs tests in dev
→ Creates Pull Request on GitHub
→ PR review and approval
→ Merge to main
→ CI/CD pipeline (GitHub Actions / AWS CodePipeline)
→ Runs automated tests
→ Deploys to staging workspace (via Databricks CLI)
→ Deploys to production workspace (via Databricks CLI)
Databricks CLI for CI/CD
# Install CLI
pip install databricks-cli
# Configure
databricks configure --token
# Import a notebook
databricks workspace import local_notebook.py /Remote/Path/notebook --language PYTHON
# Export a notebook
databricks workspace export /Remote/Path/notebook local_notebook.py
# Deploy a job
databricks jobs create --json-file job_config.json
Databricks Asset Bundles (DABs) are the modern approach to CI/CD — a YAML-based framework for deploying Databricks resources (jobs, pipelines, notebooks, etc.) across environments.
Databricks Asset Bundles (DABs)
DABs use a databricks.yml configuration file:
bundle:
  name: my-data-pipeline
workspace:
  host: https://your-workspace.cloud.databricks.com
targets:
  dev:
    mode: development
    workspace:
      host: https://dev-workspace.databricks.com
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.databricks.com
resources:
  jobs:
    my_etl_job:
      name: My ETL Job
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/ingest
Deploy with:
databricks bundle deploy --target dev
databricks bundle run my_etl_job --target dev
Chapter 9 — Unity Catalog & Governance
What Is Unity Catalog?
Unity Catalog (UC) is Databricks' unified governance solution for data and AI. It provides:
- Centralized metastore — single source of truth for all metadata across workspaces
- Fine-grained access control — table, column, and row-level security
- Data lineage — automatic tracking of data origin and transformations
- Auditing — full audit log of all data access
- Data discovery — searchable catalog with tagging and documentation
Before Unity Catalog, each Databricks workspace had its own Hive Metastore — workspace-local, no cross-workspace sharing, no fine-grained governance. Unity Catalog replaces this with a centralized, cross-workspace solution.
Three-Level Namespace
Unity Catalog introduces a three-level namespace:
catalog.schema.table
main.sales.transactions
Compared to the legacy two-level namespace:
schema.table (or database.table)
sales.transactions
Catalog — the top-level container. You can have multiple catalogs for different environments, business units, or data domains.
Schema — equivalent to a database in the old model. Contains tables, views, and volumes.
Table/View/Volume — the actual data objects.
Navigating the Namespace
-- Set the default catalog
USE CATALOG main;
-- Set the default schema
USE SCHEMA sales;
-- Fully qualified reference
SELECT * FROM main.sales.transactions;
-- Partially qualified (using defaults)
SELECT * FROM sales.transactions;
Unity Catalog Object Hierarchy
Metastore (one per region)
└── Catalog
└── Schema
├── Table (Managed or External)
├── View
├── Volume
└── Function
Metastore — The top-level Unity Catalog object. Created once per Databricks account region. All workspaces in the same account and region share the same metastore (or different metastores if you create multiple).
Catalog — First-level namespace. Create catalogs for environments (dev, staging, prod) or business units.
Schema — Second-level namespace. Organize tables by domain.
Table — Same as before, but now with UC governance features.
Volume — A governance-enabled storage abstraction for non-tabular data (files, images, audio). Similar to a managed directory with access control.
CREATE VOLUME main.raw_data.audio_files;
-- Access volume files
ls /Volumes/main/raw_data/audio_files/
Function — User-defined functions stored in the catalog.
Managed vs. External Storage in Unity Catalog
UC Managed Tables
- Data stored in the UC metastore's managed storage location (an S3 bucket you configure)
- When dropped, data is deleted
- UC manages all file organization and optimization
External Tables in Unity Catalog
- Data lives in an External Location (governed S3 path)
- When dropped, only metadata is removed; data persists
External Locations
An External Location is a registered S3 path that UC governs. It requires:
- A Storage Credential — an IAM role that Databricks can use to access S3
- An External Location — the combination of Storage Credential + S3 path
-- First, an admin creates a storage credential (IAM role ARN)
CREATE STORAGE CREDENTIAL my_s3_credential
WITH IAM_ROLE 'arn:aws:iam::123456789:role/my-databricks-s3-role';
-- Then create an external location
CREATE EXTERNAL LOCATION my_data_location
URL 's3://my-data-bucket/data/'
WITH (STORAGE CREDENTIAL my_s3_credential);
-- Now create an external table at that location
CREATE TABLE main.sales.transactions
LOCATION 's3://my-data-bucket/data/transactions/';
Benefits over DBFS mounts:
- Access control — only users granted permissions to the External Location can access data
- Audit logging — all accesses are logged
- Cross-workspace — works across all workspaces connected to the metastore
Access Control in Unity Catalog
UC uses a privilege-based model with GRANT and REVOKE:
Privilege Types
On Catalogs:
- USE CATALOG — allows using the catalog (required for all access)
- CREATE SCHEMA — create schemas within the catalog
- ALL PRIVILEGES — grants all privileges
On Schemas:
- USE SCHEMA — allows using the schema
- CREATE TABLE — create tables
- CREATE VIEW — create views
- ALL PRIVILEGES — grants all privileges
On Tables:
- SELECT — read data
- MODIFY — INSERT, UPDATE, DELETE
- ALL PRIVILEGES — grants all privileges
On External Locations:
- READ FILES — read files from this location
- WRITE FILES — write files to this location
- CREATE EXTERNAL TABLE — create external tables at this location
Granting Privileges
-- Grant table access to a user
GRANT SELECT ON TABLE main.sales.transactions TO `user@example.com`;
-- Grant schema access to a group
GRANT USE SCHEMA, SELECT ON SCHEMA main.sales TO `data_analysts`;
-- Grant catalog access
GRANT USE CATALOG ON CATALOG main TO `data_team`;
-- Revoke
REVOKE SELECT ON TABLE main.sales.transactions FROM `user@example.com`;
-- Show grants
SHOW GRANTS ON TABLE main.sales.transactions;
Privilege Inheritance
Privileges cascade down the hierarchy:
- Granting ALL PRIVILEGES on a catalog implicitly covers all schemas, tables, and views within it
- However, USE CATALOG doesn't automatically grant USE SCHEMA — each level must be granted explicitly
- To read a table, a user needs USE CATALOG + USE SCHEMA + SELECT ON TABLE (see the sketch below)
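Putting that together, the minimal grant set for read access to one table looks like this — a sketch issued from a notebook with spark.sql, using a hypothetical group name:
group = "`data_analysts`"   # hypothetical account-level group

# All three levels are required before the group can query the table
spark.sql(f"GRANT USE CATALOG ON CATALOG main TO {group}")
spark.sql(f"GRANT USE SCHEMA ON SCHEMA main.sales TO {group}")
spark.sql(f"GRANT SELECT ON TABLE main.sales.transactions TO {group}")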
Row-Level Security and Column Masking
Unity Catalog supports fine-grained access control:
Row Filters — restrict which rows a user can see:
-- Create a row filter function
CREATE FUNCTION filter_by_region(region STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('us_team'), region = 'US', TRUE);
-- Apply to table
ALTER TABLE main.sales.transactions
SET ROW FILTER filter_by_region ON (region);
Column Masks — mask sensitive column values:
-- Create a mask function
CREATE FUNCTION mask_ssn(ssn STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('hr_team'), ssn, '***-**-****');
-- Apply to column
ALTER TABLE main.hr.employees
ALTER COLUMN ssn SET MASK mask_ssn;
Data Lineage
Unity Catalog automatically tracks data lineage — how data flows between tables, views, and notebooks. No configuration required.
You can view lineage in the Catalog Explorer UI:
- Table lineage — upstream and downstream tables
- Column lineage — trace a specific column back to its source
This is invaluable for:
- Impact analysis (what breaks if I change this table?)
- Debugging data quality issues
- Compliance and auditing
Audit Logs
All Unity Catalog operations are logged to a system table:
-- Query the audit log (system catalog)
SELECT *
FROM system.access.audit
WHERE event_time > '2024-01-01'
AND action_name = 'databricksUserEvent'
LIMIT 100;
Tags and Documentation
UC supports tagging objects with metadata:
ALTER TABLE main.sales.transactions
SET TAGS ('pii' = 'true', 'domain' = 'sales', 'owner' = 'sales-team@company.com');
COMMENT ON TABLE main.sales.transactions IS 'All sales transaction records from POS systems';
COMMENT ON COLUMN main.sales.transactions.customer_id IS 'Unique customer identifier, PII';
Delta Sharing
Delta Sharing is an open standard for sharing Delta Lake data across organizations without copying the data.
-- Create a share
CREATE SHARE my_data_share;
-- Add tables to the share
ALTER SHARE my_data_share
ADD TABLE main.sales.transactions;
-- Create a recipient
CREATE RECIPIENT partner_company
USING ID 'partner-databricks-id';
-- Grant access
GRANT SELECT ON SHARE my_data_share TO RECIPIENT partner_company;
The recipient accesses the data via the Delta Sharing protocol from their own Databricks (or other supported tool) — they never get direct S3 access.
Chapter 10 — Job Orchestration & Workflows
Databricks Workflows
Databricks Workflows (formerly Jobs) is the native orchestration engine for running Databricks workloads. It supports:
- Single-task or multi-task jobs
- Scheduling (cron syntax)
- Triggered runs
- Retry policies
- Alerting (email, webhooks, Slack)
- Job clusters (ephemeral) or all-purpose clusters
Job Concepts
Task
A task is the atomic unit of work in a Workflow. Each task can be:
- Notebook task — runs a Databricks notebook
- Python script task — runs a .py file
- Python wheel task — runs a function from a packaged Python wheel
- Spark JAR task — runs a Java/Scala Spark application
- SQL task — runs SQL (against a SQL Warehouse)
- Delta Live Tables task — triggers a DLT pipeline update
- dbt task — runs a dbt project
- Pipeline task — runs a DLT pipeline
- Run job task — triggers another job (job chaining)
- For each task — iterates over a list and runs a sub-task for each item
Multi-Task Jobs (Task Graphs)
Jobs can have multiple tasks connected by dependencies, forming a DAG (Directed Acyclic Graph):
[ingest_raw] → [validate_data] → [transform_silver] → [aggregate_gold]
↘
[generate_report]
Tasks can run in parallel (if no dependency between them) or sequentially.
# Example job configuration structure
{
"name": "Daily ETL Pipeline",
"tasks": [
{
"task_key": "ingest",
"notebook_task": {"notebook_path": "/ETL/01_ingest"}
},
{
"task_key": "transform",
"depends_on": [{"task_key": "ingest"}],
"notebook_task": {"notebook_path": "/ETL/02_transform"}
},
{
"task_key": "aggregate",
"depends_on": [{"task_key": "transform"}],
"notebook_task": {"notebook_path": "/ETL/03_aggregate"}
}
]
}
Task Values (Passing Data Between Tasks)
Tasks in a multi-task job can pass data to downstream tasks using task values:
# In the upstream task (e.g., "ingest")
dbutils.jobs.taskValues.set(key="rows_ingested", value=1500)
dbutils.jobs.taskValues.set(key="output_path", value="/data/silver/2024-01-15/")
# In the downstream task (e.g., "transform")
rows = dbutils.jobs.taskValues.get(taskKey="ingest", key="rows_ingested", default=0)
path = dbutils.jobs.taskValues.get(taskKey="ingest", key="output_path")
Scheduling Jobs
Jobs can be triggered by:
- Manual trigger — click "Run Now" in the UI
- Cron schedule — standard cron syntax
- File arrival trigger — runs when a new file appears in a storage path
- Continuous trigger — reruns as soon as the previous run completes
- API trigger — POST to the Jobs API
# Cron examples
0 0 * * * # Daily at midnight UTC
0 6 * * 1 # Every Monday at 6 AM
*/15 * * * * # Every 15 minutes
0 2 1 * * # First of every month at 2 AM
Cluster Configuration for Jobs
Each task can have its own cluster configuration:
- Job cluster — created and terminated per-run (recommended for production)
- All-purpose cluster — use an existing running cluster (development only)
- Instance pool — use instances from a pool for faster startup
For multi-task jobs, different tasks can use different cluster configurations (e.g., a small cluster for data validation, a large cluster for heavy transformation).
Shared job clusters — multiple tasks in the same job run share a single cluster (cost-efficient when tasks don't need isolation).
Retry and Error Handling
Each task supports:
{
"max_retries": 3,
"min_retry_interval_millis": 60000,
"retry_on_timeout": true,
"timeout_seconds": 3600
}
Conditional task execution — you can run a task based on the outcome of a previous task:
- all_success (default) — run only if all upstream tasks succeeded
- all_done — run regardless of upstream status
- at_least_one_success — run if at least one upstream succeeded
- all_failed — run only if all upstream failed (useful for cleanup/alerting on failure)
- at_least_one_failed — run if at least one upstream failed
Job Parameters
Jobs accept parameters at two levels:
Job-level parameters — available to all tasks:
{"run_date": "2024-01-15", "environment": "prod"}
Task-level parameters — specific to a task (e.g., widget values for a notebook task).
Access in notebooks via dbutils.widgets.get() or as job parameters.
Monitoring and Alerting
Job Run Status: Each run shows Pending, Running, Succeeded, Failed, Timed Out, or Canceled.
Email Notifications:
{
"email_notifications": {
"on_success": ["team@company.com"],
"on_failure": ["oncall@company.com", "manager@company.com"],
"on_start": [],
"no_alert_for_skipped_runs": true
}
}
Webhook notifications — send HTTP callbacks to Slack, PagerDuty, custom endpoints on job events.
System Tables for Monitoring (Unity Catalog):
SELECT * FROM system.lakeflow.job_run_timeline
WHERE workspace_id = current_workspace_id()
AND period_start_time > '2024-01-01'
ORDER BY period_start_time DESC;
Repair Runs
If a job fails partway through a multi-task DAG, you can repair the run — re-running only the failed tasks (and their dependents) without re-running successful tasks. This is huge for cost savings in long pipelines.
Chapter 11 — Data Ingestion Patterns
Bronze-Silver-Gold (Medallion Architecture)
This is the standard data organization pattern in Databricks:
Bronze Layer (Raw Ingestion)
- Exact copy of source data — no transformations
- All formats preserved (JSON, CSV, XML, Avro, Parquet)
- Append-only (never delete from Bronze)
- May store as Delta (preferred) or original format
- Adds ingestion metadata: ingestion_timestamp, source_file, batch_id
- This is your "system of record" — if anything goes wrong downstream, reprocess from Bronze
Silver Layer (Cleansed & Enriched)
- Data validated, cleaned, and normalized
- Schema enforced (Delta)
- Deduplication applied
- Null handling and type casting done
- Joins to reference/lookup tables
- Row-level quality checks
- Still somewhat raw — business logic not fully applied
- Query-friendly but not aggregated
Gold Layer (Business-Level Aggregates)
- Aggregated, denormalized, business-specific views
- Optimized for specific use cases (reporting, ML features)
- Often organized by business domain or consumer
- Direct source for BI tools, dashboards, ML models
- Examples: daily sales totals, monthly active users, feature store tables
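A condensed PySpark sketch of the three layers — table names, paths, and columns are illustrative, and the bronze/silver/gold schemas are assumed to exist:
from pyspark.sql.functions import current_timestamp, input_file_name, col, sum as sum_

# Bronze: raw copy plus ingestion metadata
bronze = (spark.read.json("s3://my-bucket/raw/orders/")            # hypothetical source path
    .withColumn("ingestion_timestamp", current_timestamp())
    .withColumn("source_file", input_file_name()))
bronze.write.format("delta").mode("append").saveAsTable("bronze.orders")

# Silver: enforce types, drop bad rows, deduplicate
silver = (spark.read.table("bronze.orders")
    .filter(col("order_id").isNotNull())
    .withColumn("amount", col("amount").cast("double"))
    .dropDuplicates(["order_id"]))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: business-level aggregate ready for BI
gold = (spark.read.table("silver.orders")
    .groupBy("order_date", "region")
    .agg(sum_("amount").alias("total_sales")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_sales")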
Batch Ingestion
Reading from S3
# Read CSV from S3
df = spark.read \
.option("header", "true") \
.option("inferSchema", "true") \
.csv("s3://my-bucket/raw/employees/*.csv")
# Read JSON
df = spark.read.json("s3://my-bucket/raw/events/")
# Read Parquet
df = spark.read.parquet("s3://my-bucket/raw/transactions/")
# Read with explicit schema (preferred - avoids schema inference overhead)
from pyspark.sql.types import *
schema = StructType([
StructField("id", IntegerType(), False),
StructField("name", StringType(), True),
StructField("amount", DoubleType(), True)
])
df = spark.read.schema(schema).json("s3://my-bucket/raw/")
Auto Loader
Auto Loader is Databricks' incremental data ingestion mechanism. It efficiently processes new files as they arrive in S3, tracking which files have been processed.
# Auto Loader with cloudFiles format
df = spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.option("cloudFiles.schemaLocation", "dbfs:/checkpoints/schema/") \
.option("cloudFiles.inferColumnTypes", "true") \
.load("s3://my-bucket/raw/events/")
# Write the stream
df.writeStream \
.format("delta") \
.option("checkpointLocation", "dbfs:/checkpoints/events/") \
.option("mergeSchema", "true") \
.trigger(availableNow=True) \
.table("bronze.events")   # availableNow: process all available files, then stop
Key Auto Loader features:
- Schema inference and evolution — automatically detects and adapts to new columns
- File notification mode — uses AWS SQS/SNS for efficient, scalable file detection (recommended for large volumes)
- Directory listing mode — periodically lists the directory (simpler, lower setup, but less efficient for large buckets)
- Schema location — stores the inferred schema separately so it persists across runs
- Checkpoint — tracks exactly which files have been processed (exactly-once guarantee)
- rescuedDataColumn — captures data that didn't match the expected schema into a separate column
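The file notification mode mentioned above is enabled with a single option — a sketch built on the earlier example; Databricks sets up the SQS/SNS resources itself, provided the cluster's instance profile has the required permissions:
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")                  # SQS/SNS-based file discovery
    .option("cloudFiles.schemaLocation", "dbfs:/checkpoints/schema/")
    .load("s3://my-bucket/raw/events/"))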
Auto Loader vs. COPY INTO:
| Feature | Auto Loader | COPY INTO |
|---|---|---|
| Use case | Streaming incremental loads | Batch incremental loads |
| Tracking | Checkpoint-based | Built-in idempotency |
| Scale | Millions of files | Thousands of files |
| Schema evolution | Automatic | Manual |
| API | readStream | SQL command |
COPY INTO
COPY INTO is a SQL command for idempotent batch data loading:
COPY INTO bronze.events
FROM 's3://my-bucket/raw/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('inferSchema' = 'true', 'mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
- Idempotent — running the same COPY INTO multiple times doesn't duplicate data; it only loads new files
- Tracks loaded files internally
- Good for scheduled batch loading
- Simpler than Auto Loader but less scalable
Streaming Ingestion
Structured Streaming
Spark Structured Streaming is the engine underlying streaming in Databricks. Write streaming logic using the same DataFrame API as batch.
# Read from Kafka (on MSK - Amazon Managed Streaming for Kafka)
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") \
.option("subscribe", "my-topic") \
.option("startingOffsets", "latest") \
.load()
# Parse the Kafka message value (typically JSON)
from pyspark.sql.functions import from_json, col
schema = StructType([...])
parsed = df.select(from_json(col("value").cast("string"), schema).alias("data")) \
.select("data.*")
# Write to Delta
parsed.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/checkpoints/kafka_stream/") \
.table("bronze.events")
Trigger Options
# Micro-batch triggered every 30 seconds
.trigger(processingTime='30 seconds')
# Process all available data, then stop (batch-like)
.trigger(availableNow=True)
# Continuous processing (low-latency, experimental)
.trigger(continuous='1 second')
# Default: process as fast as possible (micro-batch)
Output Modes
- append — only new rows are written (most common for Delta)
- complete — all rows of the result are written/overwritten (for aggregations)
- update — only rows changed since the last trigger are written (for sinks that support in-place updates, e.g., via foreachBatch)
Checkpointing
Checkpoints store the streaming state — which offsets have been processed, aggregate states, etc. Always set a checkpoint location for production streaming jobs.
.option("checkpointLocation", "s3://my-bucket/checkpoints/my-stream/")
If a streaming job crashes and restarts, it resumes from the checkpoint — exactly-once processing.
foreachBatch
foreachBatch lets you apply arbitrary batch operations to each micro-batch of a stream:
def process_batch(batch_df, batch_id):
    # Complex transformations, MERGE, etc.
    batch_df.write.format("delta").mode("append").table("silver.events")
    # Send metrics to monitoring system
stream.writeStream \
.foreachBatch(process_batch) \
.option("checkpointLocation", "/checkpoints/") \
.start()
This is powerful because you can use any batch API (including MERGE) within a stream.
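A typical use is a streaming upsert into a Silver table — a sketch assuming the stream DataFrame from the example above and an existing Delta table silver.events keyed on id:
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # MERGE isn't a native streaming sink, but it works per micro-batch here
    target = DeltaTable.forName(spark, "silver.events")
    (target.alias("t")
        .merge(batch_df.dropDuplicates(["id"]).alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(stream.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/checkpoints/silver_events/")
    .start())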
Chapter 12 — Delta Live Tables (DLT) & Pipelines
What Are Delta Live Tables?
Delta Live Tables (DLT) is a declarative pipeline framework for building reliable data pipelines. Instead of writing imperative Spark code and managing the execution yourself, you declare what your tables should look like, and DLT figures out how to build them, run them, and keep them up-to-date.
Key Philosophy
In traditional ETL:
- You write code that runs sequentially
- You manage retries, error handling, checkpointing manually
- You handle schema evolution manually
- You figure out when to run each step and in what order
In DLT:
- You declare table definitions
- DLT analyzes dependencies and builds a DAG automatically
- DLT handles retries, checkpointing, error handling
- DLT enforces data quality with expectations
- DLT supports both batch and streaming seamlessly
DLT Core Concepts
Defining Tables
In DLT notebooks, you use the @dlt.table decorator (Python) or CREATE OR REFRESH LIVE TABLE (SQL):
import dlt
from pyspark.sql.functions import *
# Bronze: Raw ingestion with Auto Loader
@dlt.table(
    name="raw_events",
    comment="Raw event data from S3",
    table_properties={"quality": "bronze"}
)
def raw_events():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.schemaLocation", "/checkpoints/raw_events_schema/")
            .load("s3://my-bucket/raw/events/")
    )
-- SQL equivalent
CREATE OR REFRESH STREAMING LIVE TABLE raw_events
COMMENT "Raw event data from S3"
AS SELECT * FROM cloud_files('s3://my-bucket/raw/events/', 'json',
map('cloudFiles.inferColumnTypes', 'true'))
Live Tables vs. Streaming Live Tables
Live Tables (Materialized Views)
- Process data in batch
- Fully recomputed each run (by default)
- Best for: aggregations, small tables, tables that need to be completely refreshed
@dlt.table
def daily_sales_summary():
return (
dlt.read("silver_transactions")
.groupBy("date", "region")
.agg(sum("amount").alias("total_sales"))
)
Streaming Live Tables
- Process data incrementally using Structured Streaming
- Only process new data each run (huge cost savings)
- Best for: high-volume append-only data (logs, events, IoT)
@dlt.table
def silver_events():
return (
dlt.read_stream("raw_events")
.filter(col("event_type").isNotNull())
.withColumn("processed_at", current_timestamp())
)
Key difference: Use dlt.read() for batch (live table) dependencies, dlt.read_stream() for streaming dependencies.
Data Quality Expectations
Expectations are DLT's data quality enforcement mechanism. They're constraints you define on your tables. DLT tracks violations and can:
- `@dlt.expect` — warn on violations (data flows through, violations recorded as metrics)
- `@dlt.expect_or_drop` — drop violating rows (bad data doesn't contaminate downstream)
- `@dlt.expect_or_fail` — fail the pipeline on any violation (strict validation)
@dlt.table
@dlt.expect("valid_customer_id", "customer_id IS NOT NULL")
@dlt.expect("positive_amount", "amount > 0")
@dlt.expect_or_drop("valid_email", "email RLIKE '^[^@]+@[^@]+\\.[^@]+$'")
@dlt.expect_or_fail("no_negative_ids", "id >= 0")
def silver_transactions():
return dlt.read_stream("bronze_transactions")
-- SQL expectations
CREATE OR REFRESH STREAMING LIVE TABLE silver_transactions (
CONSTRAINT valid_customer_id EXPECT (customer_id IS NOT NULL),
CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
CONSTRAINT no_negative_ids EXPECT (id >= 0) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(LIVE.bronze_transactions)
Viewing expectation metrics in the pipeline UI shows you:
- How many rows passed/failed each constraint
- Trends over time
Pipeline Configuration
A DLT pipeline is the deployment unit. One pipeline can contain multiple notebooks, each defining tables/views.
Key pipeline settings:
- Pipeline mode: `TRIGGERED` (batch) or `CONTINUOUS` (streaming, low latency)
- Target schema: where to store the resulting tables
- Storage location: where DLT stores internal state, logs, and tables
- Cluster configuration: compute for the pipeline
- Notifications: email on success/failure
- Channel: `CURRENT` or `PREVIEW` (preview gets new DLT features first)
Triggered pipelines — runs once, processes available data, then stops. Triggered via a schedule or manually.
Continuous pipelines — runs indefinitely, processing data as it arrives. Used for near-real-time requirements.
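For orientation, a hedged sketch of creating such a pipeline programmatically through the Pipelines REST API. The endpoint and field names follow the public API; the host, token, notebook path, target schema, and storage location below are illustrative assumptions:
import requests
host = "https://<workspace-host>"                    # assumption: your workspace URL
token = "<personal-access-token>"                    # assumption: a valid token
pipeline_spec = {
    "name": "orders_pipeline",
    "storage": "s3://my-bucket/pipelines/orders/",   # where DLT keeps state and logs
    "target": "silver",                              # target schema for the resulting tables
    "continuous": False,                             # False = triggered, True = continuous
    "libraries": [{"notebook": {"path": "/Repos/team/dlt/orders_pipeline"}}],
}
resp = requests.post(
    f"{host}/api/2.0/pipelines",
    headers={"Authorization": f"Bearer {token}"},
    json=pipeline_spec,
)
print(resp.json())                                   # returns the new pipeline's ID on success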
Pipeline UI and Monitoring
The DLT UI provides:
- DAG visualization — see all tables and their dependencies as a graph
- Data quality metrics — expectation pass/fail rates per table
- Table event logs — see exactly what happened during each update
- Row statistics — rows read, written, failed per table per run
Event log is also queryable as a Delta table:
SELECT * FROM delta.`<pipeline-storage-location>/system/events`
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC;
DLT with Auto Loader
Auto Loader + DLT is the canonical pattern for Bronze ingestion:
@dlt.table(
name="bronze_orders",
table_properties={
"quality": "bronze",
"pipelines.reset.allowed": "false" # prevent accidental full reload
}
)
def bronze_orders():
return (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("cloudFiles.schemaLocation",
"/pipelines/my_pipeline/schema/orders/")
.option("header", "true")
.option("cloudFiles.inferColumnTypes", "true")
.load("s3://my-bucket/raw/orders/")
.withColumn("_source_file", input_file_name())
.withColumn("_ingestion_time", current_timestamp())
)
Apply Changes (CDC with DLT)
DLT has native support for Change Data Capture (CDC) via APPLY CHANGES INTO:
# Define the target streaming table that APPLY CHANGES will maintain
dlt.create_streaming_table("customers")
# Apply changes from a source (CDC stream)
dlt.apply_changes(
target="customers",
source="customers_cdc",
keys=["customer_id"],
sequence_by="update_timestamp", # determines order of changes
apply_as_deletes=expr("operation = 'DELETE'"),
except_column_list=["operation", "update_timestamp"],
stored_as_scd_type=1 # Type 1: overwrite, Type 2: keep history
)
SCD Type 1 — just overwrite the record with the latest values
SCD Type 2 — keep history by creating new records with __START_AT and __END_AT columns
-- SQL equivalent
CREATE OR REFRESH STREAMING LIVE TABLE customers;
APPLY CHANGES INTO LIVE.customers
FROM STREAM(LIVE.customers_cdc)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY update_timestamp
STORED AS SCD TYPE 1;
DLT Table Properties and Access
Tables created by DLT can be accessed by:
- Other DLT pipelines
- Notebooks (after the pipeline has run)
- SQL analytics
- Jobs
Tables are stored in the target catalog/schema you specify in the pipeline configuration.
-- After pipeline runs, query the tables normally
SELECT * FROM main.silver.orders;
Limitations of DLT
- DLT tables cannot be modified outside the pipeline (read-only externally)
- You cannot manually write to a DLT-managed table
- Full pipeline resets can be expensive (reprocess all data from scratch)
- Limited to certain compute configurations
Chapter 13 — Apache Spark: High-Level Overview
Spark is the distributed computing engine underlying everything in Databricks. For the exam, you need a solid conceptual understanding. Here are the key topics at a high level:
Core Concepts
SparkContext / SparkSession
- `SparkContext` — the entry point for RDD-based Spark (legacy)
- `SparkSession` — the unified entry point for DataFrames, SQL, and Streaming (current)
- Pre-created as `spark` in Databricks notebooks
RDDs (Resilient Distributed Datasets)
- Low-level distributed data structure (the original Spark API)
- Immutable, partitioned collection of records distributed across nodes
- Rarely used directly today — use DataFrames/Datasets instead
- Still underpins the execution engine
DataFrames
- High-level, schema-aware, distributed collection of data organized in named columns
- Think of it as a distributed SQL table
- Lazy evaluation — operations build a logical plan until an action triggers execution
Datasets (Scala/Java)
- Type-safe version of DataFrames (for JVM languages)
- Not directly applicable in Python (PySpark uses DataFrames exclusively)
Transformations vs. Actions
Transformations — lazy operations that build a logical plan:
- `select()`, `filter()`, `groupBy()`, `join()`, `withColumn()`, `orderBy()`, `limit()`
- Do NOT trigger computation immediately
- Can be narrow (don't require data movement between partitions) or wide (require shuffle)
Actions — trigger actual computation:
- `count()`, `show()`, `collect()`, `write()`, `save()`, `take()`
- Results are either returned to the driver or written to storage
Understanding lazy evaluation is key — Spark optimizes the entire plan before running.
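A minimal sketch of lazy evaluation (illustrative table and column names): the first three operations only build a plan, and nothing executes until the action.
df = spark.table("silver.orders")
pending = (
    df.filter("status = 'PENDING'")                  # transformation: no computation yet
      .select("order_id", "amount")                  # transformation: still just a plan
      .withColumnRenamed("amount", "usd")            # transformation
)
pending.count()                                      # action: the whole plan is optimized, then executed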
Catalyst Optimizer and Tungsten
Catalyst Optimizer — Spark's query optimization engine:
- Parses logical plan from DataFrame operations
- Analyzes and resolves column references
- Optimizes the logical plan (predicate pushdown, projection pruning, constant folding)
- Generates multiple physical plans
- Selects the best physical plan using cost-based optimization
Tungsten — Memory management and code generation:
- Off-heap memory management
- Whole-stage code generation (Java bytecode generation for tight loops)
- Column-based binary format (no Java object overhead)
Photon — Databricks' native vectorized execution engine (C++):
- Dramatically faster than standard Spark for SQL and Delta queries
- Enabled via the Photon Runtime
- Drop-in replacement — same API, faster execution
Partitions and Shuffles
Partitions — how data is distributed across executors:
- `spark.default.parallelism` — default partition count for RDDs
- `spark.sql.shuffle.partitions` — default partition count after a shuffle (default: 200)
- On Databricks, Adaptive Query Execution (AQE) automatically adjusts partition counts
Shuffle — data redistribution across partitions/executors (wide transformation):
- Triggered by: `groupBy`, `join`, `orderBy`, `distinct`, `repartition`
- Expensive — involves writing data to disk and network transfer
- Minimize shuffles for performance
repartition() vs coalesce():
- `repartition(n)` — full shuffle to exactly n partitions (use when increasing partition count or changing distribution)
- `coalesce(n)` — narrow operation to reduce to n partitions (more efficient for reducing partition count, no full shuffle)
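A minimal sketch (table name and target paths are illustrative): coalesce() skips the shuffle when you only need fewer partitions before writing.
df = spark.table("silver.orders")
df.repartition(200, "region") \
  .write.format("delta").mode("overwrite").save("s3://my-bucket/tmp/orders_by_region/")
df.coalesce(8) \
  .write.format("delta").mode("overwrite").save("s3://my-bucket/tmp/orders_small/")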
Key DataFrame Operations
Transformations:
- `select()` — projection (choose columns)
- `filter()` / `where()` — row filtering
- `withColumn()` — add/modify a column
- `withColumnRenamed()` — rename a column
- `drop()` — remove columns
- `groupBy().agg()` — group and aggregate
- `join()` — combine two DataFrames
- `union()` / `unionByName()` — combine rows
- `orderBy()` / `sort()` — sort rows
- `distinct()` — deduplicate
- `dropDuplicates()` — deduplicate (can specify a subset of columns)
- `explode()` — expand an array/map column into rows
- `pivot()` — pivot rows to columns
- `cache()` / `persist()` — cache in memory
Common Functions (pyspark.sql.functions):
- `col()`, `lit()` — column references and literals
- `when().otherwise()` — conditional logic (like CASE WHEN)
- `coalesce()` — first non-null value
- `to_date()`, `to_timestamp()` — type casting for dates
- `date_format()`, `date_add()`, `datediff()` — date operations
- `split()`, `regexp_replace()`, `trim()`, `upper()` — string operations
- `sum()`, `count()`, `avg()`, `max()`, `min()` — aggregations
- `rank()`, `dense_rank()`, `row_number()` — window functions
- `collect_list()`, `collect_set()` — aggregate to arrays
- `from_json()`, `to_json()` — JSON parsing
- `struct()`, `array()`, `create_map()` — complex type creation
- `explode()`, `posexplode()` — array/map to rows
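A minimal sketch combining several of these functions (illustrative table and column names):
from pyspark.sql.functions import col, lit, when, to_date, regexp_replace, upper
cleaned = (
    spark.table("bronze.customers")
        .withColumn("signup_date", to_date(col("signup_ts")))                 # cast string to date
        .withColumn("phone", regexp_replace(col("phone"), "[^0-9]", ""))      # strip non-digits
        .withColumn("country", upper(col("country")))
        .withColumn("tier", when(col("lifetime_value") > 1000, lit("gold"))
                            .otherwise(lit("standard")))                      # CASE WHEN logic
)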
Window Functions
Window functions perform calculations across a sliding window of rows:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, rank, row_number, lag, sum
# Define window spec
window = Window.partitionBy("department").orderBy(col("salary").desc())
# Apply window function
df.withColumn("rank", rank().over(window)) \
.withColumn("running_total", sum("salary").over(
Window.partitionBy("department").orderBy("hire_date")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
))
Joins
Join types: inner, left (or left_outer), right (or right_outer), full (or full_outer), left_semi, left_anti, cross
result = orders.join(customers, "customer_id", "inner")
result = orders.join(customers, orders.customer_id == customers.id, "left")
Broadcast Join — when one table is small, broadcast it to all executors to avoid shuffle:
from pyspark.sql.functions import broadcast
result = large_table.join(broadcast(small_table), "key")
Adaptive Query Execution (AQE)
AQE (enabled by default in Databricks) optimizes query plans at runtime based on actual data statistics:
- Coalescing shuffle partitions — reduces the number of post-shuffle partitions dynamically
- Converting sort-merge join to broadcast join — if a table turns out to be small
- Skew join optimization — handles data skew by splitting large partitions
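AQE is controlled through Spark SQL configs. A short sketch of inspecting and toggling them (these keys exist in open-source Spark 3.x, and Databricks enables them by default):
print(spark.conf.get("spark.sql.adaptive.enabled"))
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # coalesce shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # skew join optimization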
Caching
from pyspark import StorageLevel
df.cache()  # Cache (defaults to MEMORY_AND_DISK for DataFrames)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.unpersist()  # Remove from cache
spark.catalog.cacheTable("my_table")
spark.catalog.uncacheTable("my_table")
Cache intermediate results that are reused multiple times in the same job.
Chapter 14 — AWS Integration Specifics
IAM and Authentication
Instance Profiles (Cluster IAM)
EC2 instances in a Databricks cluster can be assigned an IAM Instance Profile. This allows code running on the cluster to access AWS resources using AWS credentials — without hardcoding access keys.
# With an instance profile attached, this just works:
df = spark.read.parquet("s3://my-bucket/data/")
# Spark automatically uses the EC2 instance profile credentials
Setup:
- Create an IAM role with S3 permissions
- Create an instance profile from that role
- Register the instance profile ARN in Databricks (admin step)
- Users can then select that instance profile when creating a cluster
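The first two steps can be scripted on the AWS side, for example with boto3. A hedged sketch; the role and profile names are illustrative, and the role is assumed to already exist with S3 permissions:
import boto3
iam = boto3.client("iam")
iam.create_instance_profile(InstanceProfileName="databricks-s3-access")
iam.add_role_to_instance_profile(
    InstanceProfileName="databricks-s3-access",
    RoleName="databricks-s3-role",                   # existing role with S3 permissions
)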
Cross-Account IAM Role
The Databricks control plane uses a cross-account IAM role in your AWS account to:
- Create/terminate EC2 instances
- Attach/detach EBS volumes
- Configure security groups
- Tag resources
This role has the minimal permissions needed — it cannot read your S3 data.
S3 Integration
Direct S3 Access (Recommended with Unity Catalog)
With Unity Catalog and External Locations, access S3 directly:
df = spark.read.format("delta").load("s3://my-bucket/delta/employees/")
DBFS Mount (Legacy)
dbutils.fs.mount(
source="s3a://my-bucket/data/",
mount_point="/mnt/my-data",
extra_configs={
"fs.s3a.aws.credentials.provider":
"com.amazonaws.auth.InstanceProfileCredentialsProvider"
}
)
# Use as:
df = spark.read.parquet("/mnt/my-data/employees/")
S3 Path Formats
- `s3://` — standard AWS S3 URL scheme (used by AWS tools)
- `s3a://` — Hadoop's S3A connector (used by Spark, more efficient than the older `s3n://`)
- `dbfs:/mnt/<mount>/` — via a DBFS mount
VPC Configuration
Databricks can be deployed in your VPC in two modes:
Default VPC — Databricks creates and manages a VPC in your account. Simple but less control.
Customer-managed VPC (VPC Injection) — You provide the VPC, subnets, and security groups. Required for:
- Existing security/network policies
- PrivateLink (no public internet)
- Custom DNS
- VPC peering with RDS, MSK, etc.
PrivateLink
For enterprises requiring no public internet traffic, Databricks supports AWS PrivateLink:
- Control plane to data plane communication over private AWS network
- No data leaves the AWS backbone
- Requires customer-managed VPC
AWS Services Integration
Amazon MSK (Kafka)
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "b-1.msk-cluster.kafka.us-east-1.amazonaws.com:9092") \
.option("subscribe", "transactions") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.mechanism", "AWS_MSK_IAM") \
.option("kafka.sasl.jaas.config", "software.amazon.msk.auth...") \
.load()
Amazon Kinesis
df = spark.readStream \
.format("kinesis") \
.option("streamName", "my-stream") \
.option("region", "us-east-1") \
.option("initialPosition", "TRIM_HORIZON") \
.load()
AWS Glue Data Catalog Integration
Databricks can use the AWS Glue Data Catalog as a Hive Metastore replacement:
spark.hadoop.hive.metastore.client.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
This allows Databricks to discover tables registered in Glue that were created by Athena, Glue ETL, or other services.
Note: With Unity Catalog, Glue integration is less necessary — UC provides a superior alternative.
Amazon Redshift
# Read from Redshift
df = spark.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://cluster.us-east-1.redshift.amazonaws.com:5439/mydb") \
.option("dbtable", "public.sales") \
.option("tempdir", "s3://my-bucket/temp/") \
.option("aws_iam_role", "arn:aws:iam::123456789:role/RedshiftS3Role") \
.load()
Amazon RDS / Aurora
# JDBC connection
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:mysql://my-rds.cluster.us-east-1.rds.amazonaws.com:3306/mydb") \
.option("dbtable", "employees") \
.option("user", dbutils.secrets.get("scope", "rds-user")) \
.option("password", dbutils.secrets.get("scope", "rds-password")) \
.option("numPartitions", 10) \
.option("partitionColumn", "id") \
.option("lowerBound", "1") \
.option("upperBound", "1000000") \
.load()
AWS Secrets Manager Backed Secret Scope
For AWS organizations already using AWS Secrets Manager:
# Create a UC-backed scope (links to AWS Secrets Manager)
databricks secrets create-scope \
--scope aws-secrets \
--scope-backend-type AWS_SECRETS_MANAGER \
--region us-east-1
# Use secrets stored in AWS Secrets Manager
password = dbutils.secrets.get(scope="aws-secrets", key="my-secret-name")
Chapter 15 — Security & Access Control
Workspace-Level Security
User Management
- Users are managed at the Databricks account level (account console)
- SCIM (System for Cross-domain Identity Management) provisioning from identity providers (Okta, Azure AD, AWS SSO)
- Groups simplify permission management — assign users to groups, grant permissions to groups
Permission Levels
For workspace objects (notebooks, clusters, jobs):
| Level | Notebooks | Clusters | Jobs |
|---|---|---|---|
| CAN READ | View | None | View results |
| CAN RUN | Run | None | Run job |
| CAN EDIT | Edit | Can restart | Edit settings |
| CAN MANAGE | All | Full control | Full control |
| IS OWNER | All + delete | Create/delete | Create/delete |
Cluster-Level Security
- Cluster creation permission — not everyone can create clusters (controlled via admin settings or cluster policies)
- Cluster access permission — who can attach notebooks to or terminate a cluster
- Cluster policies — restrict what configurations users can apply
Table ACLs (Legacy — Pre-Unity Catalog)
Before Unity Catalog, table-level access control in the Hive Metastore was managed via GRANT statements on the cluster with table ACLs enabled. This is being replaced by Unity Catalog.
Network Security
IP Access Lists
Restrict workspace access to specific IP ranges:
- Administrators can configure allowed IP ranges
- Users outside the allowed ranges cannot access the workspace
- Configured via Admin Settings → IP Access Lists
Encryption
- Encryption at rest — S3 data is encrypted using AWS KMS (AWS-managed or customer-managed keys)
- Encryption in transit — TLS/SSL for all communication between control plane, data plane, and clients
- Customer-managed keys — for workspaces with strict security requirements, customers can manage their own KMS keys for notebook encryption and cluster EBS volumes
Compliance
Databricks supports various compliance certifications relevant for AWS:
- SOC 2 Type II
- ISO 27001
- HIPAA (with Business Associate Agreement)
- PCI DSS
- FedRAMP (government use)
Chapter 16 — Performance & Best Practices
File Size and Compaction
The "small file problem" is one of the most common performance issues:
- Each Spark task writes one output file
- Thousands of small files → slow reads (S3 overhead per file)
- Target file size: 128MB to 1GB for Parquet
Solutions:
- `OPTIMIZE` command (manual)
- Auto Compaction (automatic)
- Optimized Writes (on write)
- Liquid Clustering (ongoing)
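A short sketch of the first three options (illustrative table name): manual OPTIMIZE, plus table properties that enable Optimized Writes and Auto Compaction for future writes.
spark.sql("OPTIMIZE silver.orders")
spark.sql("""
  ALTER TABLE silver.orders SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")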
Data Skew
Data skew occurs when one or more partitions are much larger than others, causing certain executors to take much longer (stragglers).
Detection: Look for Tasks in Spark UI where one task takes 10x longer than others.
Solutions:
- Salting — add a random key suffix to distribute skewed keys
- AQE skew optimization — automatic in Databricks (handles skew joins)
- Repartition before a join on the skewed key
- Broadcast join — if the smaller table is small enough
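A minimal sketch of the salting approach (table names, the join key, and the salt range are illustrative): the skewed side gets a random salt, and the small side is replicated across every salt value so the join keys still match.
from pyspark.sql.functions import rand, explode, array, lit
NUM_SALTS = 16
skewed = spark.table("silver.clicks") \
    .withColumn("salt", (rand() * NUM_SALTS).cast("int"))
dims = spark.table("silver.campaigns") \
    .withColumn("salt", explode(array(*[lit(i) for i in range(NUM_SALTS)])))
joined = skewed.join(dims, ["campaign_id", "salt"]).drop("salt")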
Caching Strategy
- Cache DataFrames that are used multiple times in the same session
- Use the Delta disk cache (disk-based caching of Delta files) at the cluster level for repeatedly read Delta tables
- Don't cache unnecessarily — caching takes memory and initialization time
Partitioning Strategy
- Partition by date for time-series data (daily/monthly partitions)
- Avoid partitioning by high-cardinality columns
- Use Liquid Clustering as a more flexible alternative
- Keep partition count manageable (< 10,000 partitions per table)
Schema Best Practices
- Always define explicit schemas for production pipelines
- Avoid `inferSchema` in production (expensive, can produce wrong types)
- Use `StringType` initially if unsure, cast downstream
- Enable `mergeSchema` only when needed — it can hide bugs
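A minimal sketch of an explicit schema for a production read (field names and the S3 path are illustrative):
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType
orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
    StructField("order_ts", TimestampType(), nullable=True),
])
orders = (
    spark.read.format("json")
        .schema(orders_schema)                       # no inferSchema: types are fixed, reads are cheaper
        .load("s3://my-bucket/raw/orders/")
)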
Z-Ordering vs. Partitioning vs. Liquid Clustering
| | Partitioning | Z-Order | Liquid Clustering |
|---|---|---|---|
| Data skipping | Folder-level | File-level | File-level |
| Works on | Low-cardinality | High-cardinality | Any cardinality |
| Ongoing maintenance | None | Manual OPTIMIZE | Auto (OPTIMIZE) |
| Flexibility | Low (fixed at creation) | Medium | High |
| Recommended for | New pipelines: rarely | Existing tables | New tables |
Chapter 17 — Additional Databricks Concepts
Databricks SQL (DBSQL)
Databricks SQL is the analytics-focused layer of Databricks:
- SQL Editor — web-based SQL query interface
- SQL Warehouses — dedicated compute for SQL (serverless or classic)
- Dashboards — build visualizations and dashboards from SQL results
- Alerts — trigger notifications based on query results
- Query History — see all executed queries
Serverless SQL Warehouses — compute is managed by Databricks, instant startup, automatic scaling, no cluster configuration needed.
Databricks MLflow
MLflow is an open-source ML lifecycle platform built into Databricks:
- Tracking — log parameters, metrics, and artifacts from ML experiments
- Models — store, version, and deploy ML models
- Registry — model versioning and stage management (Staging, Production, Archived)
- Projects — package ML code for reproducibility
import mlflow
import mlflow.sklearn
with mlflow.start_run():
# Train model
model.fit(X_train, y_train)
# Log params
mlflow.log_param("n_estimators", 100)
mlflow.log_metric("accuracy", accuracy)
# Log model
mlflow.sklearn.log_model(model, "my-model")
Feature Store
Databricks Feature Store is a centralized repository for ML features:
- Store features as Delta tables
- Reuse features across teams and models
- Automatic feature lineage
- Point-in-time lookups for training data
Databricks Connect
Databricks Connect allows you to run Spark code from your local IDE (VS Code, IntelliJ, etc.) against a remote Databricks cluster:
- Write code locally with full IDE support (autocomplete, debugging)
- Execute on the remote cluster
- Great for development — combine local tooling with cloud compute
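A hedged sketch using the current Databricks Connect (the databricks-connect package for recent runtimes); connection details are assumed to come from your local Databricks CLI profile or environment variables, and the sample table is illustrative:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()      # attaches to the remote cluster
spark.table("samples.nyctaxi.trips").limit(5).show() # regular PySpark, executed remotely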
Databricks REST API
Almost everything in Databricks can be automated via the REST API:
- Create/manage clusters, jobs, notebooks
- Trigger job runs
- Manage permissions
- List and manage tables
The Databricks CLI wraps the REST API for command-line use.
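A hedged sketch of triggering a job run through the Jobs 2.1 run-now endpoint with plain requests (host, token, job ID, and parameters are illustrative assumptions):
import requests
host = "https://<workspace-host>"
token = "<personal-access-token>"
resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": 123, "notebook_params": {"run_date": "2024-01-01"}},
)
print(resp.json())                                   # contains the run_id of the triggered run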
Databricks Lakehouse Monitoring
Lakehouse Monitoring provides data quality monitoring for Delta tables:
- Profile data distributions over time
- Detect drift (statistical changes in data)
- Generate monitoring metrics as Delta tables
- Works for both batch and streaming tables
Data Contracts and Quality
Beyond DLT expectations, common data quality patterns:
- Great Expectations — open-source data quality library integrated with Databricks
- Delta constraints — SQL CHECK constraints on Delta tables
- Data quality monitoring — Lakehouse Monitoring
-- Add a CHECK constraint
ALTER TABLE orders ADD CONSTRAINT valid_quantity CHECK (quantity > 0);
Chapter 18 — Exam Tips & Topic Summary
Exam Structure
The Databricks Data Engineer Associate exam (exam code: Databricks-Certified-Data-Engineer-Associate) typically covers:
- ~20% Databricks Lakehouse Platform — architecture, workspace, clusters, notebooks
- ~30% ELT with Spark and Delta Lake — transformations, Delta features, SQL
- ~20% Incremental Data Processing — Auto Loader, Structured Streaming, DLT
- ~20% Production Pipelines — jobs, workflows, DLT pipelines
- ~10% Data Governance — Unity Catalog, permissions, data access
High-Frequency Exam Topics
Managed vs. External Tables
- Know the DROP behavior (metadata only vs. data+metadata)
- Know when to use each
Delta Lake Features
- Time travel (VERSION AS OF, TIMESTAMP AS OF)
- MERGE syntax (WHEN MATCHED, WHEN NOT MATCHED, WHEN NOT MATCHED BY SOURCE)
- OPTIMIZE and Z-ORDER
- VACUUM and data retention
- Schema enforcement vs. evolution
- CDF (Change Data Feed)
- SHALLOW CLONE vs. DEEP CLONE
Cluster Types
- All-Purpose for development
- Job clusters for production (auto-terminated, cost-efficient)
- SQL Warehouses for SQL analytics
- Instance pools for fast startup
%run vs. dbutils.notebook.run()
- `%run` shares scope (variables, SparkSession)
- `dbutils.notebook.run()` runs independently, can capture return value, can run in parallel
Streaming
- Trigger types (availableNow, processingTime, continuous)
- Checkpoint location importance
- foreachBatch for complex streaming operations
DLT
- `@dlt.expect` (warn) vs. `@dlt.expect_or_drop` (drop row) vs. `@dlt.expect_or_fail` (fail pipeline)
- Streaming vs. batch tables in DLT
- `APPLY CHANGES INTO` for CDC
Auto Loader
- `cloudFiles` format
- Schema inference and `schemaLocation`
- File notification vs. directory listing mode
Unity Catalog
- Three-level namespace (catalog.schema.table)
- Privilege hierarchy (must grant USE at each level)
- Managed vs. External tables in UC
- Storage Credentials and External Locations
dbutils.secrets
- Secrets are always `[REDACTED]` in output
- Two scope types: Databricks-backed and AWS Secrets Manager-backed
Common Traps
- `VACUUM` removes time travel capability — after VACUUM, you can't travel back to removed versions
- `DROP TABLE` on a managed table deletes data — external tables only delete metadata
- `%run` in a notebook runs in the same scope — not isolated like `dbutils.notebook.run()`
- Auto-scaling doesn't work well with streaming — use fixed clusters
- OPTIMIZE doesn't run automatically — unless Auto Compaction is enabled
- Schema inference in production is bad — use explicit schemas
- Temp views are session-scoped — not visible to other notebooks
- Job clusters are created per-run and terminated after — you can't manually start/stop them
- DLT tables are read-only outside the pipeline — you can't write to them directly
- Z-ORDER is only effective when combined with OPTIMIZE — it's not automatic
Quick Reference: Key SQL Commands
-- Time Travel
SELECT * FROM t VERSION AS OF 5;
SELECT * FROM t TIMESTAMP AS OF '2024-01-01';
-- History and Metadata
DESCRIBE HISTORY t;
DESCRIBE DETAIL t;
DESCRIBE EXTENDED t;
-- Maintenance
OPTIMIZE t;
OPTIMIZE t ZORDER BY (col1, col2);
VACUUM t RETAIN 168 HOURS;
RESTORE TABLE t TO VERSION AS OF 3;
-- Clone
CREATE TABLE t_backup SHALLOW CLONE t;
CREATE TABLE t_backup DEEP CLONE t;
-- MERGE
MERGE INTO target USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;
-- DML
UPDATE t SET col = value WHERE condition;
DELETE FROM t WHERE condition;
-- CDC
ALTER TABLE t SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
SELECT * FROM table_changes('t', 5, 10);
-- UC Permissions
GRANT SELECT ON TABLE catalog.schema.table TO `user@company.com`;
REVOKE SELECT ON TABLE catalog.schema.table FROM `user@company.com`;
SHOW GRANTS ON TABLE catalog.schema.table;
Quick Reference: Key Python Patterns
# Auto Loader (Bronze ingestion)
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.option("cloudFiles.schemaLocation", "/checkpoints/schema/") \
.load("s3://bucket/path/")
# Streaming write
df.writeStream.format("delta") \
.option("checkpointLocation", "/checkpoints/stream/") \
.trigger(availableNow=True) \
.table("bronze.events")
# Secrets
dbutils.secrets.get(scope="my-scope", key="my-key")
# Notebook parameters
dbutils.widgets.get("param_name")
# Task values (multi-task jobs)
dbutils.jobs.taskValues.set("key", "value")
dbutils.jobs.taskValues.get(taskKey="upstream", key="key")
# Run another notebook
result = dbutils.notebook.run("/path/to/notebook", 300, {"arg": "value"})
# Delta MERGE in Python
from delta.tables import DeltaTable
dt = DeltaTable.forName(spark, "target_table")
dt.alias("target").merge(
source.alias("source"),
"target.id = source.id"
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
DLT Quick Reference
import dlt
@dlt.table
@dlt.expect("not_null_id", "id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
@dlt.expect_or_fail("positive_id", "id > 0")
def my_silver_table():
return dlt.read_stream("my_bronze_table") \
.filter(...)
-- SQL DLT
CREATE OR REFRESH STREAMING LIVE TABLE silver_orders (
CONSTRAINT valid_id EXPECT (id IS NOT NULL),
CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.bronze_orders) WHERE ...;
Conclusion
The Databricks Data Engineer Associate exam tests your ability to design, build, and maintain data pipelines on the Databricks Lakehouse Platform. The most important areas to master are:
Delta Lake fundamentals — This is everywhere. Know ACID, time travel, MERGE, OPTIMIZE/VACUUM/ZORDER inside out.
Medallion Architecture — Understand Bronze/Silver/Gold and what belongs in each layer.
DLT and streaming — Know the declarative pipeline approach, expectations, and Auto Loader.
Cluster types and their appropriate use cases — All-purpose for dev, Job clusters for prod, SQL Warehouses for analytics.
Unity Catalog — The three-level namespace, privilege model, and the difference between managed and external tables.
Orchestration — Multi-task jobs, task dependencies, retry policies, and task values.
dbutils — Know every submodule: `fs`, `secrets`, `widgets`, `notebook`, and `jobs`.
The exam is scenario-based — it will describe a business requirement and ask you to identify the correct Databricks tool or approach. Read each question carefully, eliminate clearly wrong answers, and apply the principles covered in this guide.
Good luck with your Databricks Data Engineer Associate certification! 🚀
This guide covers the Databricks Data Engineer Associate exam curriculum. Always check the official Databricks certification exam guide for the latest topic list.