A Comprehensive Deep-Dive into Every Exam Topic
Table of Contents
- Introduction to Databricks & the Lakehouse Paradigm
- Databricks Architecture
- Compute — Clusters & Runtimes
- Notebooks
- Delta Lake & Data Management
- Tables, Schemas & Databases
- Databricks Utilities (dbutils)
- Git Folders / Repos
- Unity Catalog & Governance
- Job Orchestration & Workflows
- Data Ingestion Patterns
- Delta Live Tables (DLT) & Pipelines
- Apache Spark — High-Level Overview
- AWS Integration Specifics
- Security & Access Control
- Performance & Best Practices
- Exam Tips & Summary
Chapter 1 — Introduction to Databricks & the Lakehouse Paradigm
What Is Databricks?
Databricks is a unified data analytics platform built on top of Apache Spark. It was founded by the original creators of Apache Spark and is available as a managed cloud service on AWS, Azure, and GCP. For AWS engineers, Databricks runs inside your AWS account — it provisions and manages Amazon EC2 instances, S3 buckets, IAM roles, VPCs, and other AWS resources on your behalf.
The platform is designed to bridge the gap between data engineering, data science, and business analytics — providing a single pane of glass for all data workloads.
The Lakehouse Architecture
Before Databricks, teams operated in one of two paradigms:
Data Warehouses (Redshift, Snowflake) — Structured, fast SQL queries, but expensive for large raw datasets, limited support for ML, and data duplication.
Data Lakes (S3 + Glue + Athena) — Cheap storage of any format, but no ACID transactions, no data quality guarantees, schema enforcement challenges, and slow query performance without significant tuning.
The Lakehouse combines the best of both:
- Cheap, open-format storage (S3 + Parquet/Delta)
- ACID transactions and data reliability
- Support for SQL, Batch, Streaming, ML, and BI workloads
- A unified governance layer
- Open standards (Delta Lake, Apache Iceberg support)
Databricks is the leading commercial implementation of the Lakehouse pattern.
Why This Matters for the Exam
The exam frequently tests whether you understand which layer (Bronze, Silver, Gold) a given operation belongs to, what Delta Lake provides over raw Parquet, and how Databricks differs from a traditional data warehouse or raw S3 data lake.
Chapter 2 — Databricks Architecture
Control Plane vs. Data Plane
This is one of the most fundamental concepts for both the exam and real-world AWS deployments.
Control Plane (Databricks-managed AWS Account)
The control plane lives in Databricks' own AWS account. It contains:
- Databricks Web Application — the UI you interact with
- Cluster Manager — orchestrates cluster lifecycle
- Job Scheduler — triggers and monitors workflow runs
- Notebook Storage — notebooks are stored in the control plane (unless using Git Folders)
- Metadata Services — the Hive Metastore (legacy) or Unity Catalog Metastore
The control plane never touches your actual data. It sends instructions to resources in your account.
Data Plane (Your AWS Account)
The data plane lives in your AWS account and contains:
- EC2 instances — actual Spark driver and worker nodes
- S3 buckets — your data storage (the DBFS root bucket and your own buckets)
- VPC, Subnets, Security Groups — network resources
- IAM Roles & Instance Profiles — permissions for cluster nodes to access S3
When a cluster is launched, the Databricks control plane communicates with your AWS account (via a cross-account IAM role) to spin up EC2 instances inside your VPC.
Why This Split Matters
Security teams love this architecture because your data never leaves your AWS account. Databricks only stores metadata and notebook code. If you use Secrets (more on that later), sensitive values are encrypted in the control plane but never transmitted in plain text.
Databricks Workspace
A workspace is the top-level organizational unit in Databricks. Think of it like an AWS account — it has its own:
- Users and groups
- Clusters
- Notebooks
- Jobs
- Data objects (catalogs, schemas, tables)
- Secrets
On AWS, a workspace is tied to a specific AWS region and a specific S3 bucket (the "root DBFS bucket"). Large organizations often have multiple workspaces (dev, staging, prod) — this is the recommended pattern.
DBFS — Databricks File System
DBFS is a virtual file system layer that abstracts over your underlying storage. It provides a unified path namespace (dbfs:/) that maps to:
- An S3 bucket managed by Databricks (the root bucket, e.g., s3://databricks-workspace-XXXXXXX/)
- Other mounted S3 paths
- External locations (with Unity Catalog)
Key points:
- DBFS is not a separate storage system — it's a layer over S3
- The default DBFS root bucket is managed by Databricks but lives in your AWS account
- You can mount external S3 buckets to DBFS paths (legacy approach)
- With Unity Catalog, External Locations replace mounts as the preferred approach
- DBFS paths look like dbfs:/user/hive/warehouse/ or dbfs:/FileStore/
Important DBFS paths to know:
- dbfs:/FileStore/ — files uploaded via the UI (accessible via a /FileStore/ URL)
- dbfs:/user/hive/warehouse/ — default managed table location (Hive Metastore)
- dbfs:/databricks/ — internal Databricks paths
- dbfs:/mnt/<mount-name>/ — mounted external storage (legacy)
Cluster Architecture
Each Databricks cluster is a standard Apache Spark cluster:
Driver Node (EC2)
|
|-- Worker Node 1 (EC2)
| |-- Executor 1
| |-- Executor 2
|
|-- Worker Node 2 (EC2)
|-- Executor 3
|-- Executor 4
- Driver — runs the main program, coordinates Spark jobs, hosts the SparkSession
- Workers — run Spark executors that process data in parallel
- Executors — JVM processes on workers that execute tasks
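As a quick illustration of this split, here is a minimal sketch (nothing beyond the pre-created spark session is assumed): the driver builds the plan and collects the small result, while the actual row processing runs as tasks inside the executors.
# Lazy: the driver only builds a query plan here, no data is touched yet
df = spark.range(0, 100_000_000)
doubled = df.selectExpr("id * 2 AS doubled")

# An action triggers a Spark job: tasks execute on the workers' executors,
# and only the final scalar comes back to the driver
print(doubled.count())

# Total task slots (cores) available across the cluster's executors
print(spark.sparkContext.defaultParallelism)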
Chapter 3 — Compute: Clusters & Runtimes
Cluster Types
All-Purpose Clusters
- Created manually (UI, CLI, or API), optionally governed by a cluster policy
- Designed for interactive development — notebooks, ad-hoc queries
- Can be shared by multiple users simultaneously
- More expensive because they run continuously (until terminated)
- Persist between runs; you can restart them
- Billed per DBU (Databricks Unit) while running
Exam tip: All-purpose clusters are for development. Avoid using them for scheduled production workloads — that's what job clusters are for.
Job Clusters
- Created automatically when a Databricks Job runs
- Terminated automatically when the job completes
- Cannot be shared interactively
- Much cheaper — you only pay while the job is running
- Configuration is defined in the job definition
- Each job run gets a fresh, isolated cluster
Exam tip: For production workloads, always prefer job clusters for cost efficiency and isolation.
SQL Warehouses (Databricks SQL)
- Specifically designed for SQL analytics workloads
- Used by Databricks SQL, BI tools (Tableau, Power BI), and dbt
- Available as Classic, Pro, or Serverless warehouses
- Auto-scaling and auto-stop built-in
- Not used for Spark DataFrame API or ML workloads directly
Cluster Configuration
Runtime Versions
The Databricks Runtime (DBR) is the set of software running on your cluster:
- Databricks Runtime (Standard) — Spark + Delta Lake + standard libraries
- Databricks Runtime ML — adds ML libraries (MLflow, TensorFlow, PyTorch, scikit-learn)
- Databricks Runtime for Genomics — specialized for genomics
- Photon Runtime — includes the Photon query engine (C++ vectorized execution for Delta Lake queries — much faster for SQL)
LTS versions (Long Term Support) — recommended for production. They receive bug fixes for a longer period. Example: 13.3 LTS, 12.2 LTS.
Always use the latest LTS version unless you have a specific reason to use a different version.
Cluster Modes
Standard Mode (Single User)
- One user owns the cluster
- Recommended for most production workloads
- Full Spark features available
Shared Mode
- Multiple users share the same cluster
- Isolation via Unity Catalog and fine-grained access control
- Some Spark features restricted for safety
- Requires Unity Catalog
Single Node Mode
- Driver only — no worker nodes
- Good for small data, ML model training on single machine, development
- Spark still works (local mode)
Auto-Scaling
Databricks clusters can automatically scale the number of worker nodes based on load:
- Set a minimum and maximum worker count
- Databricks monitors the job queue and scales up/down accordingly
- Scale-down has a configurable timeout (default: 120 seconds of idle)
- Does not work well with Streaming jobs — use fixed-size clusters for streaming
Auto-Termination
Set an idle timeout (e.g., 60 minutes) after which the cluster terminates automatically if no notebooks are attached or jobs are running. This is critical for cost control.
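To make these knobs concrete, here is a hedged sketch of a cluster definition submitted through the Clusters REST API from Python — the workspace URL, token, and instance type are placeholders, and the values should be adapted to your environment:
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
TOKEN = "<personal-access-token>"                         # placeholder

cluster_spec = {
    "cluster_name": "etl-dev",
    "spark_version": "13.3.x-scala2.12",                  # an LTS runtime
    "node_type_id": "r5.xlarge",                           # memory-optimized example
    "autoscale": {"min_workers": 2, "max_workers": 8},     # auto-scaling bounds
    "autotermination_minutes": 60,                         # terminate after 60 idle minutes
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())   # returns the new cluster_id on success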
Instance Types (AWS)
For the exam, you don't need to memorize specific EC2 types, but understand:
- Memory-optimized (r5, r6i) — good for large DataFrames, joins, aggregations
- Compute-optimized (c5, c6i) — good for CPU-intensive transforms
- Storage-optimized (i3) — good for heavy shuffle operations (local NVMe SSDs)
- GPU instances (p3, g4dn) — for ML Runtime (deep learning)
Instance Pools
An instance pool maintains a set of pre-warmed EC2 instances ready to be allocated to clusters. This dramatically reduces cluster startup time (from ~5-8 minutes to ~30-60 seconds).
Key characteristics:
- Pools have a minimum idle instance count and maximum capacity
- Idle pool instances are running EC2 instances — you pay AWS for them, but they incur no DBU charges while idle
- When a cluster uses a pool, it grabs instances from the pool
- When a cluster terminates, instances return to the pool
- Configured with a specific instance type and Databricks Runtime
Exam tip: Use instance pools when you need fast cluster startup for jobs or have many short-lived clusters.
Cluster Policies
Cluster policies are JSON configurations that restrict and/or set default values for cluster configuration. They help:
- Control costs (max DBUs per hour, max instance count)
- Enforce standards (specific runtimes, instance types)
- Simplify UX for non-technical users
- Apply governance rules
Only workspace admins can create policies. Users can only create clusters that conform to the policies assigned to them.
// Example: restrict cluster to specific runtime
{
"spark_version": {
"type": "fixed",
"value": "13.3.x-scala2.12"
},
"autotermination_minutes": {
"type": "fixed",
"value": 60
}
}
DBUs — Databricks Units
DBUs (Databricks Units) are the billing unit for Databricks. A DBU is a normalized unit of processing capability consumed per hour of compute; how many DBUs a cluster burns per hour depends on its instance types and node count. The DBU rate varies by:
- Cluster type (All-Purpose vs. Jobs vs. SQL Warehouse)
- Workload type (Data Engineering, Data Analytics, etc.)
- Whether Photon is enabled
You pay both EC2 costs (to AWS) AND DBU costs (to Databricks).
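A back-of-the-envelope calculation helps internalize this. All figures below are invented for illustration — look up your actual DBU rate, per-instance DBU emission, and EC2 pricing.
# Hypothetical figures for illustration only
dbu_rate_per_hour = 0.15     # $/DBU for Jobs Compute (example)
dbus_per_node_hour = 0.75    # DBUs a node emits per hour (example)
ec2_price_per_hour = 0.252   # $/hour for the chosen instance type (example)

nodes = 5                    # 1 driver + 4 workers
hours = 2.0

databricks_cost = dbu_rate_per_hour * dbus_per_node_hour * nodes * hours
aws_cost = ec2_price_per_hour * nodes * hours

print(f"Databricks (DBU) cost: ${databricks_cost:.2f}")
print(f"AWS (EC2) cost: ${aws_cost:.2f}")
print(f"Total: ${databricks_cost + aws_cost:.2f}")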
Chapter 4 — Notebooks
What Is a Databricks Notebook?
A notebook is a web-based, interactive document containing cells of code and markdown. It's the primary development interface in Databricks. Notebooks in Databricks have superpowers compared to Jupyter notebooks — they integrate tightly with the platform.
Notebook Basics
Languages
Each notebook has a default language (Python, SQL, Scala, R). However, you can mix languages within a single notebook using magic commands:
- %python — execute a Python cell
- %sql — execute a SQL cell
- %scala — execute a Scala cell
- %r — execute an R cell
- %md — render Markdown (documentation)
- %sh — execute shell commands on the driver node
- %fs — shortcut for dbutils.fs commands
- %run — run another notebook inline (powerful for modularization)
- %pip — install Python packages in the notebook environment
- %conda — use conda for package management (less common)
Key insight about %run: When you %run another notebook, it executes in the same SparkSession and scope as the calling notebook. Variables, functions, and DataFrames defined in the sub-notebook are available in the parent notebook. This is different from a function call — it's more like an include.
Key insight about %sh: Shell commands run on the driver node only, not on worker nodes. Output is printed inline. You can install packages with %sh pip install ... but %pip is preferred because it installs on all nodes.
Widget Parameters
Notebooks can accept input parameters using widgets:
dbutils.widgets.text("table_name", "default_value", "Table Name")
table_name = dbutils.widgets.get("table_name")
Widget types:
- text — free-text input
- dropdown — select from a list
- combobox — dropdown with free-text option
- multiselect — select multiple values
Widgets are particularly useful when notebooks are run as jobs — you can pass parameter values at runtime.
Notebook Results & Visualization
- The last expression in a cell is displayed as output
- DataFrames are displayed as interactive tables with display(df)
- display() supports charts — bar, line, pie, map, etc.
- print() outputs plain text
- Markdown cells render formatted documentation inline
Notebook Lifecycle
Notebooks can be in these states:
- Detached — not connected to a cluster, cannot run code
- Attached — connected to a running or starting cluster
- Running — actively executing cells
You attach a notebook to a cluster by selecting the cluster from the dropdown in the top-right. Multiple notebooks can be attached to the same all-purpose cluster simultaneously.
Auto-Save and Versioning
Notebooks auto-save frequently. Databricks maintains a revision history for every notebook — you can see the full history and revert to any previous version. This is separate from Git versioning (covered later).
Exporting Notebooks
Notebooks can be exported as:
- .ipynb — Jupyter Notebook format
- .py — Python source file (with # COMMAND ---------- separators)
- .scala — Scala source
- .sql — SQL source
- .dbc — Databricks archive (proprietary, includes all cells and results)
- .html — HTML with rendered output (for sharing results)
Notebook Context & SparkSession
In every Databricks notebook, these objects are pre-initialized:
- spark — the SparkSession
- sc — the SparkContext
- dbutils — Databricks Utilities
- display() — the display function
- displayHTML() — render HTML content
You never need to create a SparkSession manually in a notebook.
Chapter 5 — Delta Lake & Data Management
This is the most important chapter for the exam. Delta Lake underpins everything in Databricks.
What Is Delta Lake?
Delta Lake is an open-source storage layer that brings ACID transactions and reliability to data lakes. It uses standard Parquet files for data storage but adds a transaction log on top that enables:
- ACID transactions
- Schema enforcement and evolution
- Time travel (data versioning)
- Upserts and deletes (DML operations)
- Streaming and batch unification
- Optimized read/write operations
Delta Lake is the default table format in Databricks. When you create a table without specifying a format, it's Delta.
Delta Lake Architecture
The Delta Transaction Log
The transaction log (located in the _delta_log/ folder of every Delta table) is the backbone of Delta Lake. It's a folder containing:
- JSON files (0000000000.json, 0000000001.json, ...) — each represents one transaction
- Checkpoint files (.parquet) — every 10 commits, a full checkpoint is created for performance
Every operation (write, update, delete, schema change) creates a new entry in the transaction log. The log is append-only — you never modify existing log entries.
Reading a Delta table:
- Spark reads the _delta_log/ to determine the current state
- It identifies which Parquet files are "live" (not removed by deletes/updates)
- It reads only those Parquet files
Writing to a Delta table:
- Spark writes new Parquet files to the table directory
- Atomically commits a new JSON entry to _delta_log/
- The transaction either fully commits or fully fails — no partial writes
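You can inspect this mechanism directly — a minimal sketch, assuming an existing Delta table at the hypothetical managed-table path below:
import json

table_path = "dbfs:/user/hive/warehouse/employees/"   # hypothetical table location

# The table directory holds Parquet data files plus the _delta_log/ folder
display(dbutils.fs.ls(table_path))

# Each numbered JSON file in the log is one committed transaction
log_files = sorted(f.path for f in dbutils.fs.ls(table_path + "_delta_log/")
                   if f.path.endswith(".json"))
display(log_files)

# Peek at the first commit: each line is a JSON action (commitInfo, metaData, add, remove, ...)
first_commit = dbutils.fs.head(log_files[0])
for line in first_commit.splitlines()[:5]:
    print(list(json.loads(line).keys()))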
ACID Properties in Delta Lake
Atomicity — A transaction either fully succeeds or fully fails. If your Spark job fails halfway, no partial data is committed.
Consistency — Schema enforcement ensures data always conforms to the table's schema.
Isolation — Concurrent reads and writes don't interfere. Readers always see a consistent snapshot. Writers use optimistic concurrency control.
Durability — Once committed, data persists to S3.
Delta Lake Operations
Creating Delta Tables
-- SQL approach (creates a managed table in default schema)
CREATE TABLE employees (
id INT,
name STRING,
salary DOUBLE,
department STRING
)
USING DELTA;
-- Or from an existing data source
CREATE TABLE employees
USING DELTA
AS SELECT * FROM parquet.`/path/to/parquet/`;
-- CTAS (Create Table As Select)
CREATE TABLE silver_employees
AS SELECT id, name, salary FROM bronze_employees WHERE salary > 0;
Writing Data
# Overwrite entire table
df.write.format("delta").mode("overwrite").saveAsTable("employees")
# Append to table
df.write.format("delta").mode("append").saveAsTable("employees")
# Write to a path (external table pattern)
df.write.format("delta").save("s3://my-bucket/delta/employees/")
Reading Data
# Read a table
df = spark.read.table("employees")
# Read from path
df = spark.read.format("delta").load("s3://my-bucket/delta/employees/")
# SQL
spark.sql("SELECT * FROM employees WHERE department = 'Engineering'")
MERGE (Upsert)
MERGE is one of Delta Lake's most powerful features — it allows you to upsert (update + insert) data efficiently:
MERGE INTO target_table AS target
USING source_table AS source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET target.name = source.name,
target.salary = source.salary
WHEN NOT MATCHED THEN
INSERT (id, name, salary, department)
VALUES (source.id, source.name, source.salary, source.department)
WHEN NOT MATCHED BY SOURCE THEN
DELETE
Key MERGE clauses:
- WHEN MATCHED — row exists in both target and source
- WHEN NOT MATCHED (or WHEN NOT MATCHED BY TARGET) — row in source but not target (INSERT)
- WHEN NOT MATCHED BY SOURCE — row in target but not in source (useful for DELETE)
You can have multiple WHEN MATCHED clauses with different conditions.
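The same upsert can be written with the Python DeltaTable API — a sketch assuming a source_df DataFrame and an existing Delta table named target_table with a matching id column:
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target_table")

(target.alias("t")
    .merge(source_df.alias("s"), "t.id = s.id")          # join condition for the upsert
    .whenMatchedUpdate(set={"name": "s.name", "salary": "s.salary"})
    .whenNotMatchedInsertAll()                            # insert rows missing from the target
    .execute())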
DELETE and UPDATE
-- Delete rows
DELETE FROM employees WHERE department = 'Legacy';
-- Update rows
UPDATE employees SET salary = salary * 1.10 WHERE department = 'Engineering';
These are true DML operations — not possible with raw Parquet files!
Time Travel
Time travel is the ability to query historical versions of a Delta table. Every commit creates a new version.
-- Query version 5
SELECT * FROM employees VERSION AS OF 5;
-- Query as of a timestamp
SELECT * FROM employees TIMESTAMP AS OF '2024-01-15 10:00:00';
-- In Python
df = spark.read.format("delta") \
.option("versionAsOf", 5) \
.load("/path/to/table")
Use cases:
- Audit and compliance — what did the data look like at a specific time?
- Recovery — accidentally deleted data? Roll back!
- Reproducibility — ML experiments can reference exact data versions
How long can you go back?
Time travel is limited by the data retention period (default: 30 days). Old versions are removed by the VACUUM command.
RESTORE
You can restore a table to a previous version:
RESTORE TABLE employees TO VERSION AS OF 3;
RESTORE TABLE employees TO TIMESTAMP AS OF '2024-01-10';
This creates a new commit that restores the state — it doesn't delete history.
Schema Enforcement and Evolution
Schema Enforcement (Schema Validation)
By default, Delta Lake rejects writes that don't match the table's schema. This prevents silent data corruption.
If you try to write a DataFrame with extra columns or incompatible types to a Delta table, you'll get an AnalysisException.
Schema Evolution
You can allow schema changes using mergeSchema or overwriteSchema:
# Add new columns to the table schema automatically
df.write.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save("/path/to/table")
# Completely replace the schema (dangerous!)
df.write.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.save("/path/to/table")
mergeSchema — adds new columns from the incoming data to the schema. Existing columns are preserved.
overwriteSchema — replaces the entire schema (requires overwrite mode).
To rename or drop columns in SQL, first enable column mapping with ALTER TABLE <table> SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name').
Delta Lake Table Properties and Optimization
OPTIMIZE
OPTIMIZE compacts small Parquet files into larger ones. Small files are a common performance problem — each Parquet file write from a Spark task creates a file, and tables with many small files are slow to read.
OPTIMIZE employees;
-- With Z-ORDER (more on this below)
OPTIMIZE employees ZORDER BY (department, salary);
OPTIMIZE is idempotent and safe to run at any time.
Z-Ordering
Z-ORDER is a multi-dimensional data clustering technique. It co-locates related data (based on specified columns) within Parquet files. This dramatically improves query performance when filtering on those columns.
OPTIMIZE employees ZORDER BY (department);
After Z-ordering by department, queries like WHERE department = 'Engineering' will skip most files (data skipping), reading only files that contain that department's data.
Z-ORDER is most effective for:
- High-cardinality columns used in WHERE filters
- Columns used in JOIN conditions
- Up to 3-4 columns (diminishing returns beyond that)
VACUUM
VACUUM removes old Parquet files that are no longer referenced by the current or recent transaction log entries:
-- Remove files older than the default retention period (7 days for VACUUM, 30 days for time travel)
VACUUM employees;
-- Specify a retention period
VACUUM employees RETAIN 168 HOURS; -- 7 days
-- DANGER: DRY RUN first!
VACUUM employees DRY RUN;
Important: By default, spark.databricks.delta.retentionDurationCheck.enabled is true, which prevents VACUUM from running with a retention of less than 7 days. Override it carefully.
After VACUUM, time travel beyond the retention period is impossible.
ANALYZE
ANALYZE TABLE employees COMPUTE STATISTICS;
ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS department, salary;
Computes statistics (min, max, null count, distinct count) that the Spark query optimizer uses for better query planning.
Auto Optimize (Databricks-specific Delta feature)
Databricks adds two auto-optimization features:
Auto Compaction — automatically runs a compaction after writes if files are too small
ALTER TABLE employees SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true');
Optimized Writes — repartitions data before writing to produce right-sized files
ALTER TABLE employees SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');
Or enable globally in Spark config:
spark.databricks.delta.optimizeWrite.enabled = true
spark.databricks.delta.autoCompact.enabled = true
Liquid Clustering (Newer Feature)
Liquid Clustering is a newer alternative to Z-Ordering and partitioning that provides:
- Flexible, incremental clustering
- No need to know clustering columns upfront
- Better performance than Z-ORDER for most workloads
- Can change clustering columns without rewriting the whole table
CREATE TABLE employees
CLUSTER BY (department, salary)
AS SELECT * FROM source;
-- Re-cluster as data grows
OPTIMIZE employees;
Delta Table History
DESCRIBE HISTORY employees;
Shows all versions of the table with:
- Version number
- Timestamp
- User who made the change
- Operation (WRITE, DELETE, MERGE, OPTIMIZE, etc.)
- Operation parameters
Table Partitioning
Partitioning physically separates data into subdirectories based on a column value:
CREATE TABLE sales (
id INT,
date DATE,
region STRING,
amount DOUBLE
)
PARTITIONED BY (region, date);
Data is stored as:
s3://bucket/sales/region=US/date=2024-01-01/*.parquet
s3://bucket/sales/region=EU/date=2024-01-01/*.parquet
When to partition:
- Tables with billions of rows
- Columns with low cardinality (< 10,000 distinct values)
- Columns consistently used as filters
- Columns used in GROUP BY at the partition level (e.g., date for daily processing)
When NOT to partition:
- Small tables (< 1GB)
- High-cardinality columns (creates too many small files)
- Columns rarely used as filters
Exam tip: Over-partitioning is a common mistake. Liquid Clustering is generally preferred for new tables in Databricks.
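For reference, the equivalent partitioned write with the DataFrame API — a sketch assuming a sales_df DataFrame with the columns from the SQL example above:
# Produces one subdirectory per (region, date) combination, as shown above
(sales_df.write
    .format("delta")
    .partitionBy("region", "date")
    .mode("overwrite")
    .saveAsTable("sales"))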
Delta Change Data Feed (CDF)
Change Data Feed (Delta's change data capture capability) captures every row-level change (INSERT, UPDATE, DELETE) to a Delta table and exposes it as a readable stream or batch.
Enable it on a table:
ALTER TABLE employees SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
-- OR at creation:
CREATE TABLE employees (...) TBLPROPERTIES (delta.enableChangeDataFeed = true);
Read the changes:
# Batch - read changes between versions
changes = spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 5) \
.option("endingVersion", 10) \
.table("employees")
# Streaming - continuously process changes
stream = spark.readStream.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", "latest") \
.table("employees")
Each change row has additional columns:
- _change_type — insert, update_preimage, update_postimage, or delete
- _commit_version — which version of the table the change belongs to
- _commit_timestamp — when the change was committed
Use cases:
- Efficient Silver → Gold propagation (only process changed rows)
- Building CDC pipelines
- Audit logs
Chapter 6 — Tables, Schemas & Databases
Managed vs. External Tables
This is a critical concept for the exam.
Managed Tables
- Databricks (via the metastore) controls both the metadata AND the data files
- Data is stored in the default warehouse location (DBFS or Unity Catalog managed storage)
- When you DROP TABLE managed_table, both metadata AND data are deleted
- No LOCATION keyword in the CREATE statement
CREATE TABLE managed_employees (
id INT,
name STRING
);
-- Data stored at: dbfs:/user/hive/warehouse/managed_employees/
External Tables
- Databricks controls the metadata only
- Data lives at a user-specified location (S3, ADLS, etc.)
- When you DROP TABLE external_table, only metadata is deleted; data remains in S3
- Uses the LOCATION keyword
CREATE TABLE external_employees (
id INT,
name STRING
)
LOCATION 's3://my-data-bucket/employees/';
When to use external tables:
- Data shared with other tools (Athena, Glue, EMR)
- Data that should outlive the metastore
- Data owned by another team
- Large datasets already at a specific S3 path
When to use managed tables:
- Data fully managed by Databricks
- Simplicity — don't want to manage S3 paths manually
- Unity Catalog managed storage is preferred
Views
Databricks supports three view types:
Regular Views
CREATE VIEW engineering_employees AS
SELECT * FROM employees WHERE department = 'Engineering';
- Stored in the metastore
- Query is re-executed each time
- Can reference tables across schemas
Temporary Views
CREATE TEMP VIEW temp_high_earners AS
SELECT * FROM employees WHERE salary > 100000;
- Only visible to the current SparkSession (your notebook session)
- Dropped when the SparkSession ends
- Not visible to other users or notebooks
Global Temporary Views
CREATE GLOBAL TEMP VIEW global_active AS
SELECT * FROM employees WHERE status = 'active';
-- Access via global_temp schema
SELECT * FROM global_temp.global_active;
- Shared across SparkSessions on the same cluster
- Dropped when the cluster terminates
- Rarely used — prefer regular views for persistence
Schemas (Databases)
In Databricks, Schema and Database are synonymous (SQL standard uses Schema; Databricks/Hive historically uses Database). The two-level namespace is:
database.table
employees_db.employees
With Unity Catalog, it becomes three-level:
catalog.schema.table
main.employees_db.employees
Creating and Using Schemas
CREATE SCHEMA IF NOT EXISTS employees_db
COMMENT 'Employee data'
LOCATION 's3://my-bucket/employees_db/'; -- optional, for external schema
USE employees_db; -- sets default schema for subsequent queries
SHOW SCHEMAS;
DESCRIBE SCHEMA employees_db;
DROP SCHEMA employees_db CASCADE; -- CASCADE drops all tables in it
DESCRIBE Commands
Critical for exploring metadata:
-- Basic table info (column names and types)
DESCRIBE TABLE employees;
-- Detailed table info (storage location, format, partitions, etc.)
DESCRIBE DETAIL employees;
-- Table history (all versions)
DESCRIBE HISTORY employees;
-- Extended info (everything + table properties)
DESCRIBE EXTENDED employees;
SHOW Commands
SHOW TABLES; -- tables in current schema
SHOW TABLES IN employees_db; -- tables in specific schema
SHOW SCHEMAS; -- all schemas
SHOW DATABASES; -- same as SHOW SCHEMAS
SHOW COLUMNS IN employees; -- columns in a table
SHOW PARTITIONS employees; -- partitions
SHOW TBLPROPERTIES employees; -- table properties
SHOW CREATE TABLE employees; -- show CREATE statement
Table Cloning
Delta Lake supports two cloning modes:
Shallow Clone
CREATE TABLE employees_backup
SHALLOW CLONE employees;
- Creates a new table that references the same Parquet files as the original
- Only the metadata (transaction log) is copied — no data movement
- Extremely fast and cheap
- Good for: creating test copies, point-in-time snapshots for experiments
Important: If you run VACUUM on the original table, shallow clones may break! The underlying files may be deleted.
Deep Clone
CREATE TABLE employees_backup
DEEP CLONE employees;
- Fully copies both metadata AND data files
- Independent of the original table
- Good for: migration, backup, production copies
- Can be expensive for large tables
You can also clone incrementally:
CREATE OR REPLACE TABLE employees_backup
DEEP CLONE employees;
Run this repeatedly and it only copies new/changed files.
Chapter 7 — Databricks Utilities (dbutils)
dbutils is a utility library available in all Databricks notebooks. It provides programmatic access to many platform features.
dbutils.fs — File System Utilities
Think of this as a programmatic interface to DBFS and cloud storage.
# List files and directories
dbutils.fs.ls("dbfs:/")
dbutils.fs.ls("s3://my-bucket/data/")
# Move/rename
dbutils.fs.mv("dbfs:/source/file.txt", "dbfs:/dest/file.txt")
# Copy
dbutils.fs.cp("dbfs:/source/", "dbfs:/dest/", recurse=True)
# Delete
dbutils.fs.rm("dbfs:/path/to/file.txt")
dbutils.fs.rm("dbfs:/path/to/folder/", recurse=True)
# Create directory
dbutils.fs.mkdirs("dbfs:/my/new/directory/")
# Read file content (small files only!)
content = dbutils.fs.head("dbfs:/path/to/file.txt", maxBytes=65536)
# Mount an S3 bucket (legacy - prefer External Locations with UC)
dbutils.fs.mount(
source="s3a://my-bucket/",
mount_point="/mnt/my-bucket",
extra_configs={"fs.s3a.aws.credentials.provider": "..."}
)
# List mounts
dbutils.fs.mounts()
# Unmount
dbutils.fs.unmount("/mnt/my-bucket")
The %fs magic command is a shorthand:
%fs ls /
%fs rm -r /path/to/delete/
dbutils.secrets — Secrets Management
One of the most security-critical utilities. Secrets allow you to store sensitive values (API keys, passwords, connection strings) without hardcoding them in notebooks or code.
Secret Scopes
A secret scope is a named collection of secrets. Two types:
Databricks-backed scopes — secrets stored in the Databricks control plane, encrypted at rest.
AWS Secrets Manager-backed scopes — secrets stored in AWS Secrets Manager; Databricks fetches them dynamically.
# Create a Databricks-backed scope (Databricks CLI)
databricks secrets create-scope --scope my-scope
# Put a secret
databricks secrets put --scope my-scope --key db-password
# List secrets (only shows keys, never values)
databricks secrets list --scope my-scope
Using Secrets in Notebooks
# Retrieve a secret - returns a string
password = dbutils.secrets.get(scope="my-scope", key="db-password")
# IMPORTANT: If you print a secret, Databricks redacts it
print(password) # Output: [REDACTED]
# List scopes
dbutils.secrets.listScopes()
# List keys in a scope (doesn't show values)
dbutils.secrets.list("my-scope")
Critical exam point: You can never see the actual value of a secret from within a notebook. print() always shows [REDACTED]. This is by design for security.
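A common pattern is feeding secrets into a connection that needs credentials — a sketch assuming the scope, keys, and JDBC endpoint below exist in your environment:
# Credentials are pulled at runtime and never appear in the notebook source
jdbc_user = dbutils.secrets.get(scope="my-scope", key="db-username")
jdbc_password = dbutils.secrets.get(scope="my-scope", key="db-password")

df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://my-db-host:5432/analytics")   # hypothetical endpoint
    .option("dbtable", "public.customers")
    .option("user", jdbc_user)
    .option("password", jdbc_password)
    .load())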
dbutils.widgets — Notebook Parameters
Covered in the Notebooks chapter. Key methods:
dbutils.widgets.text("param_name", "default_value", "Display Label")
dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"])
dbutils.widgets.get("param_name") # retrieve value
dbutils.widgets.remove("param_name")
dbutils.widgets.removeAll()
dbutils.notebook — Notebook Orchestration
This enables calling other notebooks programmatically:
# Run another notebook with a timeout (seconds)
result = dbutils.notebook.run(
"/path/to/notebook",
timeout_seconds=300,
arguments={"param1": "value1", "param2": "value2"}
)
# Exit a notebook with a return value
dbutils.notebook.exit("SUCCESS")
dbutils.notebook.exit(json.dumps({"status": "success", "rows_processed": 1000}))
Difference between %run and dbutils.notebook.run():
| Feature | %run | dbutils.notebook.run() |
|---|---|---|
| Executes in same session | Yes | No (new session) |
| Variables shared | Yes | No |
| Can capture return value | No | Yes |
| Parallel execution | No | Yes (multiple calls) |
| Timeout support | No | Yes |
| Pass parameters | No (use widgets) | Yes |
%run is for modular code sharing. dbutils.notebook.run() is for orchestration.
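Because dbutils.notebook.run() is a blocking call that returns the sub-notebook's exit value, you can fan out several notebooks in parallel with plain Python threads — a sketch with hypothetical notebook paths:
from concurrent.futures import ThreadPoolExecutor

notebooks = ["/ETL/load_customers", "/ETL/load_orders", "/ETL/load_products"]   # hypothetical paths

def run_notebook(path):
    # Each call executes in its own notebook session on the attached cluster
    return dbutils.notebook.run(path, timeout_seconds=1800, arguments={"env": "dev"})

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_notebook, notebooks))

print(results)   # each entry is whatever the sub-notebook passed to dbutils.notebook.exit()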
dbutils.data — Preview Data
# Compute and display summary statistics for a DataFrame
dbutils.data.summarize(df)
dbutils.library — Library Management (Deprecated)
In newer runtimes, use %pip instead. But for completeness:
# Deprecated, use %pip
dbutils.library.installPyPI("pandas", version="1.3.0")
dbutils.jobs — Job Context (Within Job Runs)
# Get the current run ID when executing as part of a job
dbutils.jobs.taskValues.set(key="output_path", value="/path/to/output")
result = dbutils.jobs.taskValues.get(taskKey="upstream_task", key="output_path")
This is used in multi-task jobs to pass values between tasks.
Chapter 8 — Git Folders / Repos
What Are Git Folders?
Git Folders (formerly called Databricks Repos) is Databricks' native Git integration. It allows you to:
- Connect a Databricks workspace folder to a Git repository
- Pull, push, commit, create branches, and merge within the Databricks UI
- Version-control your notebooks alongside application code
- Implement CI/CD workflows for Databricks code
Supported Git Providers
- GitHub (github.com and GitHub Enterprise)
- GitLab
- Bitbucket (Cloud and Server)
- Azure DevOps
- AWS CodeCommit
Setting Up Git Integration
- Go to User Settings → Git Integration
- Select your Git provider
- Enter your personal access token (PAT) or OAuth credentials
- Databricks stores credentials in the control plane (encrypted)
Working with Git Folders
Cloning a Repo
In the Databricks UI:
- Navigate to the Repos section (or Workspace → Git Folders)
- Click "Add Repo"
- Enter the repository URL
- Optionally specify a branch
This creates a folder in your workspace that mirrors the Git repository.
Git Operations in the UI
Within a Git Folder, you can:
- Pull — fetch and merge latest changes from remote
- Push — push local commits to remote
- Commit — stage and commit changes
- Branch — create or switch branches
- Merge — merge branches
- Reset — revert uncommitted changes
What Gets Versioned
Inside a Git Folder:
- Notebooks — stored as .py, .sql, .scala, or .r files with special Databricks metadata comments
- Python modules — regular .py files
- SQL files — .sql files
- Config files — requirements.txt, setup.py, etc.
Important: Regular workspace notebooks (outside Git Folders) are not in Git. Only notebooks inside a Git Folder are version-controlled.
CI/CD with Git Folders
A typical CI/CD workflow:
Developer (feature branch)
→ Git Folder in dev workspace
→ Runs tests in dev
→ Creates Pull Request on GitHub
→ PR review and approval
→ Merge to main
→ CI/CD pipeline (GitHub Actions / AWS CodePipeline)
→ Runs automated tests
→ Deploys to staging workspace (via Databricks CLI)
→ Deploys to production workspace (via Databricks CLI)
Databricks CLI for CI/CD
# Install CLI
pip install databricks-cli
# Configure
databricks configure --token
# Import a notebook
databricks workspace import local_notebook.py /Remote/Path/notebook --language PYTHON
# Export a notebook
databricks workspace export /Remote/Path/notebook local_notebook.py
# Deploy a job
databricks jobs create --json-file job_config.json
Databricks Asset Bundles (DABs) are the modern approach to CI/CD — a YAML-based framework for deploying Databricks resources (jobs, pipelines, notebooks, etc.) across environments.
Databricks Asset Bundles (DABs)
DABs use a databricks.yml configuration file:
bundle:
  name: my-data-pipeline
workspace:
  host: https://your-workspace.cloud.databricks.com
targets:
  dev:
    mode: development
    workspace:
      host: https://dev-workspace.databricks.com
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.databricks.com
resources:
  jobs:
    my_etl_job:
      name: My ETL Job
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/ingest
Deploy with:
databricks bundle deploy --target dev
databricks bundle run my_etl_job --target dev
Chapter 9 — Unity Catalog & Governance
What Is Unity Catalog?
Unity Catalog (UC) is Databricks' unified governance solution for data and AI. It provides:
- Centralized metastore — single source of truth for all metadata across workspaces
- Fine-grained access control — table, column, and row-level security
- Data lineage — automatic tracking of data origin and transformations
- Auditing — full audit log of all data access
- Data discovery — searchable catalog with tagging and documentation
Before Unity Catalog, each Databricks workspace had its own Hive Metastore — workspace-local, no cross-workspace sharing, no fine-grained governance. Unity Catalog replaces this with a centralized, cross-workspace solution.
Three-Level Namespace
Unity Catalog introduces a three-level namespace:
catalog.schema.table
main.sales.transactions
Compared to the legacy two-level namespace:
schema.table (or database.table)
sales.transactions
Catalog — the top-level container. You can have multiple catalogs for different environments, business units, or data domains.
Schema — equivalent to a database in the old model. Contains tables, views, and volumes.
Table/View/Volume — the actual data objects.
Navigating the Namespace
-- Set the default catalog
USE CATALOG main;
-- Set the default schema
USE SCHEMA sales;
-- Fully qualified reference
SELECT * FROM main.sales.transactions;
-- Partially qualified (using defaults)
SELECT * FROM sales.transactions;
Unity Catalog Object Hierarchy
Metastore (one per region)
└── Catalog
└── Schema
├── Table (Managed or External)
├── View
├── Volume
└── Function
Metastore — The top-level Unity Catalog object. Created once per Databricks account region. All workspaces in the same account and region share the same metastore (or different metastores if you create multiple).
Catalog — First-level namespace. Create catalogs for environments (dev, staging, prod) or business units.
Schema — Second-level namespace. Organize tables by domain.
Table — Same as before, but now with UC governance features.
Volume — A governance-enabled storage abstraction for non-tabular data (files, images, audio). Similar to a managed directory with access control.
CREATE VOLUME main.raw_data.audio_files;
-- Access volume files
ls /Volumes/main/raw_data/audio_files/
Function — User-defined functions stored in the catalog.
Managed vs. External Storage in Unity Catalog
UC Managed Tables
- Data stored in the UC metastore's managed storage location (an S3 bucket you configure)
- When dropped, data is deleted
- UC manages all file organization and optimization
External Tables in Unity Catalog
- Data lives in an External Location (governed S3 path)
- When dropped, only metadata is removed; data persists
External Locations
An External Location is a registered S3 path that UC governs. It requires:
- A Storage Credential — an IAM role that Databricks can use to access S3
- An External Location — the combination of Storage Credential + S3 path
-- First, an admin creates a storage credential (IAM role ARN)
CREATE STORAGE CREDENTIAL my_s3_credential
WITH IAM_ROLE 'arn:aws:iam::123456789:role/my-databricks-s3-role';
-- Then create an external location
CREATE EXTERNAL LOCATION my_data_location
URL 's3://my-data-bucket/data/'
WITH (STORAGE CREDENTIAL my_s3_credential);
-- Now create an external table at that location
CREATE TABLE main.sales.transactions
LOCATION 's3://my-data-bucket/data/transactions/';
Benefits over DBFS mounts:
- Access control — only users granted permissions to the External Location can access data
- Audit logging — all accesses are logged
- Cross-workspace — works across all workspaces connected to the metastore
Access Control in Unity Catalog
UC uses a privilege-based model with GRANT and REVOKE:
Privilege Types
On Catalogs:
- USE CATALOG — allows using the catalog (required for all access)
- CREATE SCHEMA — create schemas within the catalog
- ALL PRIVILEGES — grants all privileges
On Schemas:
- USE SCHEMA — allows using the schema
- CREATE TABLE — create tables
- CREATE VIEW — create views
- ALL PRIVILEGES — grants all privileges
On Tables:
- SELECT — read data
- MODIFY — INSERT, UPDATE, DELETE
- ALL PRIVILEGES — grants all privileges
On External Locations:
- READ FILES — read files from this location
- WRITE FILES — write files to this location
- CREATE EXTERNAL TABLE — create external tables at this location
Granting Privileges
-- Grant table access to a user
GRANT SELECT ON TABLE main.sales.transactions TO `user@example.com`;
-- Grant schema access to a group
GRANT USE SCHEMA, SELECT ON SCHEMA main.sales TO `data_analysts`;
-- Grant catalog access
GRANT USE CATALOG ON CATALOG main TO `data_team`;
-- Revoke
REVOKE SELECT ON TABLE main.sales.transactions FROM `user@example.com`;
-- Show grants
SHOW GRANTS ON TABLE main.sales.transactions;
Privilege Inheritance
Privileges cascade down the hierarchy:
- Granting ALL PRIVILEGES on a catalog implicitly covers all schemas, tables, and views within it
- However, USE CATALOG doesn't automatically grant USE SCHEMA — each level must be granted explicitly
- To read a table, a user needs USE CATALOG + USE SCHEMA + SELECT ON TABLE (see the sketch below)
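Putting that together, the minimal grant set for read access to one table looks like this — a sketch issued from a notebook with spark.sql, using a hypothetical group name:
group = "`data_analysts`"   # hypothetical account-level group

# All three levels are required before the group can query the table
spark.sql(f"GRANT USE CATALOG ON CATALOG main TO {group}")
spark.sql(f"GRANT USE SCHEMA ON SCHEMA main.sales TO {group}")
spark.sql(f"GRANT SELECT ON TABLE main.sales.transactions TO {group}")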
Row-Level Security and Column Masking
Unity Catalog supports fine-grained access control:
Row Filters — restrict which rows a user can see:
-- Create a row filter function
CREATE FUNCTION filter_by_region(region STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('us_team'), region = 'US', TRUE);
-- Apply to table
ALTER TABLE main.sales.transactions
SET ROW FILTER filter_by_region ON (region);
Column Masks — mask sensitive column values:
-- Create a mask function
CREATE FUNCTION mask_ssn(ssn STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('hr_team'), ssn, '***-**-****');
-- Apply to column
ALTER TABLE main.hr.employees
ALTER COLUMN ssn SET MASK mask_ssn;
Data Lineage
Unity Catalog automatically tracks data lineage — how data flows between tables, views, and notebooks. No configuration required.
You can view lineage in the Catalog Explorer UI:
- Table lineage — upstream and downstream tables
- Column lineage — trace a specific column back to its source
This is invaluable for:
- Impact analysis (what breaks if I change this table?)
- Debugging data quality issues
- Compliance and auditing
Audit Logs
All Unity Catalog operations are logged to a system table:
-- Query the audit log (system catalog)
SELECT *
FROM system.access.audit
WHERE event_time > '2024-01-01'
AND action_name = 'databricksUserEvent'
LIMIT 100;
Tags and Documentation
UC supports tagging objects with metadata:
ALTER TABLE main.sales.transactions
SET TAGS ('pii' = 'true', 'domain' = 'sales', 'owner' = 'sales-team@company.com');
COMMENT ON TABLE main.sales.transactions IS 'All sales transaction records from POS systems';
COMMENT ON COLUMN main.sales.transactions.customer_id IS 'Unique customer identifier, PII';
Delta Sharing
Delta Sharing is an open standard for sharing Delta Lake data across organizations without copying the data.
-- Create a share
CREATE SHARE my_data_share;
-- Add tables to the share
ALTER SHARE my_data_share
ADD TABLE main.sales.transactions;
-- Create a recipient
CREATE RECIPIENT partner_company
USING ID 'partner-databricks-id';
-- Grant access
GRANT SELECT ON SHARE my_data_share TO RECIPIENT partner_company;
The recipient accesses the data via the Delta Sharing protocol from their own Databricks (or other supported tool) — they never get direct S3 access.
Chapter 10 — Job Orchestration & Workflows
Databricks Workflows
Databricks Workflows (formerly Jobs) is the native orchestration engine for running Databricks workloads. It supports:
- Single-task or multi-task jobs
- Scheduling (cron syntax)
- Triggered runs
- Retry policies
- Alerting (email, webhooks, Slack)
- Job clusters (ephemeral) or all-purpose clusters
Job Concepts
Task
A task is the atomic unit of work in a Workflow. Each task can be:
- Notebook task — runs a Databricks notebook
- Python script task — runs a .py file
- Python wheel task — runs a function from a packaged Python wheel
- Spark JAR task — runs a Java/Scala Spark application
- SQL task — runs SQL (against a SQL Warehouse)
- Delta Live Tables task — triggers a DLT pipeline update
- dbt task — runs a dbt project
- Pipeline task — runs a DLT pipeline
- Run job task — triggers another job (job chaining)
- For each task — iterates over a list and runs a sub-task for each item
Multi-Task Jobs (Task Graphs)
Jobs can have multiple tasks connected by dependencies, forming a DAG (Directed Acyclic Graph):
[ingest_raw] → [validate_data] → [transform_silver] → [aggregate_gold]
↘
[generate_report]
Tasks can run in parallel (if no dependency between them) or sequentially.
# Example job configuration structure
{
"name": "Daily ETL Pipeline",
"tasks": [
{
"task_key": "ingest",
"notebook_task": {"notebook_path": "/ETL/01_ingest"}
},
{
"task_key": "transform",
"depends_on": [{"task_key": "ingest"}],
"notebook_task": {"notebook_path": "/ETL/02_transform"}
},
{
"task_key": "aggregate",
"depends_on": [{"task_key": "transform"}],
"notebook_task": {"notebook_path": "/ETL/03_aggregate"}
}
]
}
Task Values (Passing Data Between Tasks)
Tasks in a multi-task job can pass data to downstream tasks using task values:
# In the upstream task (e.g., "ingest")
dbutils.jobs.taskValues.set(key="rows_ingested", value=1500)
dbutils.jobs.taskValues.set(key="output_path", value="/data/silver/2024-01-15/")
# In the downstream task (e.g., "transform")
rows = dbutils.jobs.taskValues.get(taskKey="ingest", key="rows_ingested", default=0)
path = dbutils.jobs.taskValues.get(taskKey="ingest", key="output_path")
Scheduling Jobs
Jobs can be triggered by:
- Manual trigger — click "Run Now" in the UI
- Cron schedule — standard cron syntax
- File arrival trigger — runs when a new file appears in a storage path
- Continuous trigger — reruns as soon as the previous run completes
- API trigger — POST to the Jobs API
# Cron examples
0 0 * * * # Daily at midnight UTC
0 6 * * 1 # Every Monday at 6 AM
*/15 * * * * # Every 15 minutes
0 2 1 * * # First of every month at 2 AM
Cluster Configuration for Jobs
Each task can have its own cluster configuration:
- Job cluster — created and terminated per-run (recommended for production)
- All-purpose cluster — use an existing running cluster (development only)
- Instance pool — use instances from a pool for faster startup
For multi-task jobs, different tasks can use different cluster configurations (e.g., a small cluster for data validation, a large cluster for heavy transformation).
Shared job clusters — multiple tasks in the same job run share a single cluster (cost-efficient when tasks don't need isolation).
Retry and Error Handling
Each task supports:
{
"max_retries": 3,
"min_retry_interval_millis": 60000,
"retry_on_timeout": true,
"timeout_seconds": 3600
}
Conditional task execution — you can run a task based on the outcome of a previous task:
- all_success (default) — run only if all upstream tasks succeeded
- all_done — run regardless of upstream status
- at_least_one_success — run if at least one upstream succeeded
- all_failed — run only if all upstream failed (useful for cleanup/alerting on failure)
- at_least_one_failed — run if at least one upstream failed
Job Parameters
Jobs accept parameters at two levels:
Job-level parameters — available to all tasks:
{"run_date": "2024-01-15", "environment": "prod"}
Task-level parameters — specific to a task (e.g., widget values for a notebook task).
Access in notebooks via dbutils.widgets.get() or as job parameters.
Monitoring and Alerting
Job Run Status: Each run shows Pending, Running, Succeeded, Failed, Timed Out, or Canceled.
Email Notifications:
{
"email_notifications": {
"on_success": ["team@company.com"],
"on_failure": ["oncall@company.com", "manager@company.com"],
"on_start": [],
"no_alert_for_skipped_runs": true
}
}
Webhook notifications — send HTTP callbacks to Slack, PagerDuty, custom endpoints on job events.
System Tables for Monitoring (Unity Catalog):
SELECT * FROM system.lakeflow.job_run_timeline
WHERE workspace_id = current_workspace_id()
AND period_start_time > '2024-01-01'
ORDER BY period_start_time DESC;
Repair Runs
If a job fails partway through a multi-task DAG, you can repair the run — re-running only the failed tasks (and their dependents) without re-running successful tasks. This is huge for cost savings in long pipelines.
Chapter 11 — Data Ingestion Patterns
Bronze-Silver-Gold (Medallion Architecture)
This is the standard data organization pattern in Databricks:
Bronze Layer (Raw Ingestion)
- Exact copy of source data — no transformations
- All formats preserved (JSON, CSV, XML, Avro, Parquet)
- Append-only (never delete from Bronze)
- May store as Delta (preferred) or original format
- Adds ingestion metadata: ingestion_timestamp, source_file, batch_id
- This is your "system of record" — if anything goes wrong downstream, reprocess from Bronze
Silver Layer (Cleansed & Enriched)
- Data validated, cleaned, and normalized
- Schema enforced (Delta)
- Deduplication applied
- Null handling and type casting done
- Joins to reference/lookup tables
- Row-level quality checks
- Still somewhat raw — business logic not fully applied
- Query-friendly but not aggregated
Gold Layer (Business-Level Aggregates)
- Aggregated, denormalized, business-specific views
- Optimized for specific use cases (reporting, ML features)
- Often organized by business domain or consumer
- Direct source for BI tools, dashboards, ML models
- Examples: daily sales totals, monthly active users, feature store tables
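A condensed PySpark sketch of the three layers — table names, paths, and columns are illustrative, and the bronze/silver/gold schemas are assumed to exist:
from pyspark.sql.functions import current_timestamp, input_file_name, col, sum as sum_

# Bronze: raw copy plus ingestion metadata
bronze = (spark.read.json("s3://my-bucket/raw/orders/")            # hypothetical source path
    .withColumn("ingestion_timestamp", current_timestamp())
    .withColumn("source_file", input_file_name()))
bronze.write.format("delta").mode("append").saveAsTable("bronze.orders")

# Silver: enforce types, drop bad rows, deduplicate
silver = (spark.read.table("bronze.orders")
    .filter(col("order_id").isNotNull())
    .withColumn("amount", col("amount").cast("double"))
    .dropDuplicates(["order_id"]))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: business-level aggregate ready for BI
gold = (spark.read.table("silver.orders")
    .groupBy("order_date", "region")
    .agg(sum_("amount").alias("total_sales")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_sales")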
Batch Ingestion
Reading from S3
# Read CSV from S3
df = spark.read \
.option("header", "true") \
.option("inferSchema", "true") \
.csv("s3://my-bucket/raw/employees/*.csv")
# Read JSON
df = spark.read.json("s3://my-bucket/raw/events/")
# Read Parquet
df = spark.read.parquet("s3://my-bucket/raw/transactions/")
# Read with explicit schema (preferred - avoids schema inference overhead)
from pyspark.sql.types import *
schema = StructType([
StructField("id", IntegerType(), False),
StructField("name", StringType(), True),
StructField("amount", DoubleType(), True)
])
df = spark.read.schema(schema).json("s3://my-bucket/raw/")
Auto Loader
Auto Loader is Databricks' incremental data ingestion mechanism. It efficiently processes new files as they arrive in S3, tracking which files have been processed.
# Auto Loader with cloudFiles format
df = spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.option("cloudFiles.schemaLocation", "dbfs:/checkpoints/schema/") \
.option("cloudFiles.inferColumnTypes", "true") \
.load("s3://my-bucket/raw/events/")
# Write the stream
df.writeStream \
.format("delta") \
.option("checkpointLocation", "dbfs:/checkpoints/events/") \
.option("mergeSchema", "true") \
.trigger(availableNow=True) \
.table("bronze.events")   # availableNow: process all available files, then stop
Key Auto Loader features:
- Schema inference and evolution — automatically detects and adapts to new columns
- File notification mode — uses AWS SQS/SNS for efficient, scalable file detection (recommended for large volumes)
- Directory listing mode — periodically lists the directory (simpler, lower setup, but less efficient for large buckets)
- Schema location — stores the inferred schema separately so it persists across runs
- Checkpoint — tracks exactly which files have been processed (exactly-once guarantee)
- rescuedDataColumn — captures data that didn't match the expected schema into a separate column
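The file notification mode mentioned above is enabled with a single option — a sketch built on the earlier example; Databricks sets up the SQS/SNS resources itself, provided the cluster's instance profile has the required permissions:
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")                  # SQS/SNS-based file discovery
    .option("cloudFiles.schemaLocation", "dbfs:/checkpoints/schema/")
    .load("s3://my-bucket/raw/events/"))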
Auto Loader vs. COPY INTO:
| Feature | Auto Loader | COPY INTO |
|---|---|---|
| Use case | Streaming incremental loads | Batch incremental loads |
| Tracking | Checkpoint-based | Built-in idempotency |
| Scale | Millions of files | Thousands of files |
| Schema evolution | Automatic | Manual |
| API | readStream | SQL command |
COPY INTO
COPY INTO is a SQL command for idempotent batch data loading:
COPY INTO bronze.events
FROM 's3://my-bucket/raw/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('inferSchema' = 'true', 'mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
- Idempotent — running the same COPY INTO multiple times doesn't duplicate data; it only loads new files
- Tracks loaded files internally
- Good for scheduled batch loading
- Simpler than Auto Loader but less scalable
Streaming Ingestion
Structured Streaming
Spark Structured Streaming is the engine underlying streaming in Databricks. Write streaming logic using the same DataFrame API as batch.
# Read from Kafka (on MSK - Amazon Managed Streaming for Kafka)
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") \
.option("subscribe", "my-topic") \
.option("startingOffsets", "latest") \
.load()
# Parse the Kafka message value (typically JSON)
from pyspark.sql.functions import from_json, col
schema = StructType([...])
parsed = df.select(from_json(col("value").cast("string"), schema).alias("data")) \
.select("data.*")
# Write to Delta
parsed.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/checkpoints/kafka_stream/") \
.table("bronze.events")
Trigger Options
# Micro-batch triggered every 30 seconds
.trigger(processingTime='30 seconds')
# Process all available data, then stop (batch-like)
.trigger(availableNow=True)
# Continuous processing (low-latency, experimental)
.trigger(continuous='1 second')
# Default: process as fast as possible (micro-batch)
Output Modes
- append — only new rows are written (most common for Delta)
- complete — all rows of the result are written/overwritten (for aggregations)
- update — only rows changed since the last trigger are written (for sinks that support in-place updates, e.g., via foreachBatch)
Checkpointing
Checkpoints store the streaming state — which offsets have been processed, aggregate states, etc. Always set a checkpoint location for production streaming jobs.
.option("checkpointLocation", "s3://my-bucket/checkpoints/my-stream/")
If a streaming job crashes and restarts, it resumes from the checkpoint — exactly-once processing.
foreachBatch
foreachBatch lets you apply arbitrary batch operations to each micro-batch of a stream:
def process_batch(batch_df, batch_id):
    # Complex transformations, MERGE, etc.
    batch_df.write.format("delta").mode("append").table("silver.events")
    # Send metrics to monitoring system
stream.writeStream \
.foreachBatch(process_batch) \
.option("checkpointLocation", "/checkpoints/") \
.start()
This is powerful because you can use any batch API (including MERGE) within a stream.
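A typical use is a streaming upsert into a Silver table — a sketch assuming the stream DataFrame from the example above and an existing Delta table silver.events keyed on id:
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # MERGE isn't a native streaming sink, but it works per micro-batch here
    target = DeltaTable.forName(spark, "silver.events")
    (target.alias("t")
        .merge(batch_df.dropDuplicates(["id"]).alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(stream.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/checkpoints/silver_events/")
    .start())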
Chapter 12 — Delta Live Tables (DLT) & Pipelines
What Are Delta Live Tables?
Delta Live Tables (DLT) is a declarative pipeline framework for building reliable data pipelines. Instead of writing imperative Spark code and managing the execution yourself, you declare what your tables should look like, and DLT figures out how to build them, run them, and keep them up-to-date.
Key Philosophy
In traditional ETL:
- You write code that runs sequentially
- You manage retries, error handling, checkpointing manually
- You handle schema evolution manually
- You figure out when to run each step and in what order
In DLT:
- You declare table definitions
- DLT analyzes dependencies and builds a DAG automatically
- DLT handles retries, checkpointing, error handling
- DLT enforces data quality with expectations
- DLT supports both batch and streaming seamlessly
DLT Core Concepts
Defining Tables
In DLT notebooks, you use the @dlt.table decorator (Python) or CREATE OR REFRESH LIVE TABLE (SQL):
import dlt
from pyspark.sql.functions import *
# Bronze: Raw ingestion with Auto Loader
@dlt.table(
    name="raw_events",
    comment="Raw event data from S3",
    table_properties={"quality": "bronze"}
)
def raw_events():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.schemaLocation", "/checkpoints/raw_events_schema/")
            .load("s3://my-bucket/raw/events/")
    )
-- SQL equivalent
CREATE OR REFRESH STREAMING LIVE TABLE raw_events
COMMENT "Raw event data from S3"
AS SELECT * FROM cloud_files('s3://my-bucket/raw/events/', 'json',
map('cloudFiles.inferColumnTypes', 'true'))
Live Tables vs. Streaming Live Tables
Live Tables (Materialized Views)
- Process data in batch
- Fully recomputed each run (by default)
- Best for: aggregations, small tables, tables that need to be completely refreshed
@dlt.table
def daily_sales_summary():
return (
dlt.read("silver_transactions")
.groupBy("date", "region")
.agg(sum("amount").alias("total_sales"))
)
Streaming Live Tables
- Process data incrementally using Structured Streaming
- Only process new data each run (huge cost savings)
- Best for: high-volume append-only data (logs, events, IoT)
@dlt.table
def silver_events():
return (
dlt.read_stream("raw_events")
.filter(col("event_type").isNotNull())
.withColumn("processed_at", current_timestamp())
)
Key difference: Use dlt.read() for batch (live table) dependencies, dlt.read_stream() for streaming dependencies.
Data Quality Expectations
Expectations are DLT's data quality enforcement mechanism. They're constraints you define on your tables. DLT tracks violations and can:
- `@dlt.expect` — warn on violations (data flows through, violations recorded as metrics)
- `@dlt.expect_or_drop` — drop violating rows (bad data doesn't contaminate downstream)
- `@dlt.expect_or_fail` — fail the pipeline on any violation (strict validation)
@dlt.table
@dlt.expect("valid_customer_id", "customer_id IS NOT NULL")
@dlt.expect("positive_amount", "amount > 0")
@dlt.expect_or_drop("valid_email", "email RLIKE '^[^@]+@[^@]+\\.[^@]+$'")
@dlt.expect_or_fail("no_negative_ids", "id >= 0")
def silver_transactions():
return dlt.read_stream("bronze_transactions")
-- SQL expectations
CREATE OR REFRESH STREAMING LIVE TABLE silver_transactions (
CONSTRAINT valid_customer_id EXPECT (customer_id IS NOT NULL),
CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
CONSTRAINT no_negative_ids EXPECT (id >= 0) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(LIVE.bronze_transactions)
Viewing expectation metrics in the pipeline UI shows you:
- How many rows passed/failed each constraint
- Trends over time
Pipeline Configuration
A DLT pipeline is the deployment unit. One pipeline can contain multiple notebooks, each defining tables/views.
Key pipeline settings:
- Pipeline mode: `TRIGGERED` (batch) or `CONTINUOUS` (streaming, low latency)
- Target schema: where to store the resulting tables
- Storage location: where DLT stores internal state, logs, and tables
- Cluster configuration: compute for the pipeline
- Notifications: email on success/failure
- Channel: `CURRENT` or `PREVIEW` (preview gets new DLT features first)
Triggered pipelines — runs once, processes available data, then stops. Triggered via a schedule or manually.
Continuous pipelines — runs indefinitely, processing data as it arrives. Used for near-real-time requirements.
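For orientation, a hedged sketch of creating such a pipeline programmatically through the Pipelines REST API. The endpoint and field names follow the public API; the host, token, notebook path, target schema, and storage location below are illustrative assumptions:
import requests
host = "https://<workspace-host>"                    # assumption: your workspace URL
token = "<personal-access-token>"                    # assumption: a valid token
pipeline_spec = {
    "name": "orders_pipeline",
    "storage": "s3://my-bucket/pipelines/orders/",   # where DLT keeps state and logs
    "target": "silver",                              # target schema for the resulting tables
    "continuous": False,                             # False = triggered, True = continuous
    "libraries": [{"notebook": {"path": "/Repos/team/dlt/orders_pipeline"}}],
}
resp = requests.post(
    f"{host}/api/2.0/pipelines",
    headers={"Authorization": f"Bearer {token}"},
    json=pipeline_spec,
)
print(resp.json())                                   # returns the new pipeline's ID on success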
Pipeline UI and Monitoring
The DLT UI provides:
- DAG visualization — see all tables and their dependencies as a graph
- Data quality metrics — expectation pass/fail rates per table
- Table event logs — see exactly what happened during each update
- Row statistics — rows read, written, failed per table per run
Event log is also queryable as a Delta table:
SELECT * FROM delta.`<pipeline-storage-location>/system/events`
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC;
DLT with Auto Loader
Auto Loader + DLT is the canonical pattern for Bronze ingestion:
@dlt.table(
name="bronze_orders",
table_properties={
"quality": "bronze",
"pipelines.reset.allowed": "false" # prevent accidental full reload
}
)
def bronze_orders():
return (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("cloudFiles.schemaLocation",
"/pipelines/my_pipeline/schema/orders/")
.option("header", "true")
.option("cloudFiles.inferColumnTypes", "true")
.load("s3://my-bucket/raw/orders/")
.withColumn("_source_file", input_file_name())
.withColumn("_ingestion_time", current_timestamp())
)
Apply Changes (CDC with DLT)
DLT has native support for Change Data Capture (CDC) via APPLY CHANGES INTO:
# Define the target streaming table that APPLY CHANGES will maintain
dlt.create_streaming_table("customers")
# Apply changes from a source (CDC stream)
dlt.apply_changes(
target="customers",
source="customers_cdc",
keys=["customer_id"],
sequence_by="update_timestamp", # determines order of changes
apply_as_deletes=expr("operation = 'DELETE'"),
except_column_list=["operation", "update_timestamp"],
stored_as_scd_type=1 # Type 1: overwrite, Type 2: keep history
)
SCD Type 1 — just overwrite the record with the latest values
SCD Type 2 — keep history by creating new records with __START_AT and __END_AT columns
-- SQL equivalent
CREATE OR REFRESH STREAMING LIVE TABLE customers;
APPLY CHANGES INTO LIVE.customers
FROM STREAM(LIVE.customers_cdc)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY update_timestamp
STORED AS SCD TYPE 1;
DLT Table Properties and Access
Tables created by DLT can be accessed by:
- Other DLT pipelines
- Notebooks (after the pipeline has run)
- SQL analytics
- Jobs
Tables are stored in the target catalog/schema you specify in the pipeline configuration.
-- After pipeline runs, query the tables normally
SELECT * FROM main.silver.orders;
Limitations of DLT
- DLT tables cannot be modified outside the pipeline (read-only externally)
- You cannot manually write to a DLT-managed table
- Full pipeline resets can be expensive (reprocess all data from scratch)
- Limited to certain compute configurations
Chapter 13 — Apache Spark: High-Level Overview
Spark is the distributed computing engine underlying everything in Databricks. For the exam, you need a solid conceptual understanding. Here are the key topics at a high level:
Core Concepts
SparkContext / SparkSession
- `SparkContext` — the entry point for RDD-based Spark (legacy)
- `SparkSession` — the unified entry point for DataFrames, SQL, and Streaming (current)
- Pre-created as `spark` in Databricks notebooks
RDDs (Resilient Distributed Datasets)
- Low-level distributed data structure (the original Spark API)
- Immutable, partitioned collection of records distributed across nodes
- Rarely used directly today — use DataFrames/Datasets instead
- Still underpins the execution engine
DataFrames
- High-level, schema-aware, distributed collection of data organized in named columns
- Think of it as a distributed SQL table
- Lazy evaluation — operations build a logical plan until an action triggers execution
Datasets (Scala/Java)
- Type-safe version of DataFrames (for JVM languages)
- Not directly applicable in Python (PySpark uses DataFrames exclusively)
Transformations vs. Actions
Transformations — lazy operations that build a logical plan:
- `select()`, `filter()`, `groupBy()`, `join()`, `withColumn()`, `orderBy()`, `limit()`
- Do NOT trigger computation immediately
- Can be narrow (don't require data movement between partitions) or wide (require shuffle)
Actions — trigger actual computation:
- `count()`, `show()`, `collect()`, `write()`, `save()`, `take()`
- Results are either returned to the driver or written to storage
Understanding lazy evaluation is key — Spark optimizes the entire plan before running.
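A minimal sketch of lazy evaluation (illustrative table and column names): the first three operations only build a plan, and nothing executes until the action.
df = spark.table("silver.orders")
pending = (
    df.filter("status = 'PENDING'")                  # transformation: no computation yet
      .select("order_id", "amount")                  # transformation: still just a plan
      .withColumnRenamed("amount", "usd")            # transformation
)
pending.count()                                      # action: the whole plan is optimized, then executed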
Catalyst Optimizer and Tungsten
Catalyst Optimizer — Spark's query optimization engine:
- Parses logical plan from DataFrame operations
- Analyzes and resolves column references
- Optimizes the logical plan (predicate pushdown, projection pruning, constant folding)
- Generates multiple physical plans
- Selects the best physical plan using cost-based optimization
Tungsten — Memory management and code generation:
- Off-heap memory management
- Whole-stage code generation (Java bytecode generation for tight loops)
- Column-based binary format (no Java object overhead)
Photon — Databricks' native vectorized execution engine (C++):
- Dramatically faster than standard Spark for SQL and Delta queries
- Enabled via the Photon Runtime
- Drop-in replacement — same API, faster execution
Partitions and Shuffles
Partitions — how data is distributed across executors:
- `spark.default.parallelism` — default partition count for RDDs
- `spark.sql.shuffle.partitions` — default partition count after a shuffle (default: 200)
- On Databricks, Adaptive Query Execution (AQE) automatically adjusts partition counts
Shuffle — data redistribution across partitions/executors (wide transformation):
- Triggered by: `groupBy`, `join`, `orderBy`, `distinct`, `repartition`
- Expensive — involves writing data to disk and network transfer
- Minimize shuffles for performance
repartition() vs coalesce():
- `repartition(n)` — full shuffle to exactly n partitions (use when increasing partition count or changing distribution)
- `coalesce(n)` — narrow operation to reduce to n partitions (more efficient for reducing partition count, no full shuffle)
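A minimal sketch (table name and target paths are illustrative): coalesce() skips the shuffle when you only need fewer partitions before writing.
df = spark.table("silver.orders")
df.repartition(200, "region") \
  .write.format("delta").mode("overwrite").save("s3://my-bucket/tmp/orders_by_region/")
df.coalesce(8) \
  .write.format("delta").mode("overwrite").save("s3://my-bucket/tmp/orders_small/")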
Key DataFrame Operations
Transformations:
- `select()` — projection (choose columns)
- `filter()` / `where()` — row filtering
- `withColumn()` — add/modify a column
- `withColumnRenamed()` — rename a column
- `drop()` — remove columns
- `groupBy().agg()` — group and aggregate
- `join()` — combine two DataFrames
- `union()` / `unionByName()` — combine rows
- `orderBy()` / `sort()` — sort rows
- `distinct()` — deduplicate
- `dropDuplicates()` — deduplicate (can specify a subset of columns)
- `explode()` — expand an array/map column into rows
- `pivot()` — pivot rows to columns
- `cache()` / `persist()` — cache in memory
Common Functions (pyspark.sql.functions):
- `col()`, `lit()` — column references and literals
- `when().otherwise()` — conditional logic (like CASE WHEN)
- `coalesce()` — first non-null value
- `to_date()`, `to_timestamp()` — type casting for dates
- `date_format()`, `date_add()`, `datediff()` — date operations
- `split()`, `regexp_replace()`, `trim()`, `upper()` — string operations
- `sum()`, `count()`, `avg()`, `max()`, `min()` — aggregations
- `rank()`, `dense_rank()`, `row_number()` — window functions
- `collect_list()`, `collect_set()` — aggregate to arrays
- `from_json()`, `to_json()` — JSON parsing
- `struct()`, `array()`, `create_map()` — complex type creation
- `explode()`, `posexplode()` — array/map to rows
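A minimal sketch combining several of these functions (illustrative table and column names):
from pyspark.sql.functions import col, lit, when, to_date, regexp_replace, upper
cleaned = (
    spark.table("bronze.customers")
        .withColumn("signup_date", to_date(col("signup_ts")))                 # cast string to date
        .withColumn("phone", regexp_replace(col("phone"), "[^0-9]", ""))      # strip non-digits
        .withColumn("country", upper(col("country")))
        .withColumn("tier", when(col("lifetime_value") > 1000, lit("gold"))
                            .otherwise(lit("standard")))                      # CASE WHEN logic
)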
Window Functions
Window functions perform calculations across a sliding window of rows:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, rank, row_number, lag, sum
# Define window spec
window = Window.partitionBy("department").orderBy(col("salary").desc())
# Apply window function
df.withColumn("rank", rank().over(window)) \
.withColumn("running_total", sum("salary").over(
Window.partitionBy("department").orderBy("hire_date")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
))
Joins
Join types: inner, left (or left_outer), right (or right_outer), full (or full_outer), left_semi, left_anti, cross
result = orders.join(customers, "customer_id", "inner")
result = orders.join(customers, orders.customer_id == customers.id, "left")
Broadcast Join — when one table is small, broadcast it to all executors to avoid shuffle:
from pyspark.sql.functions import broadcast
result = large_table.join(broadcast(small_table), "key")
Adaptive Query Execution (AQE)
AQE (enabled by default in Databricks) optimizes query plans at runtime based on actual data statistics:
- Coalescing shuffle partitions — reduces the number of post-shuffle partitions dynamically
- Converting sort-merge join to broadcast join — if a table turns out to be small
- Skew join optimization — handles data skew by splitting large partitions
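AQE is controlled through Spark SQL configs. A short sketch of inspecting and toggling them (these keys exist in open-source Spark 3.x, and Databricks enables them by default):
print(spark.conf.get("spark.sql.adaptive.enabled"))
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # coalesce shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # skew join optimization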
Caching
from pyspark import StorageLevel
df.cache()  # Cache (defaults to MEMORY_AND_DISK for DataFrames)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.unpersist()  # Remove from cache
spark.catalog.cacheTable("my_table")
spark.catalog.uncacheTable("my_table")
Cache intermediate results that are reused multiple times in the same job.
Chapter 14 — AWS Integration Specifics
IAM and Authentication
Instance Profiles (Cluster IAM)
EC2 instances in a Databricks cluster can be assigned an IAM Instance Profile. This allows code running on the cluster to access AWS resources using AWS credentials — without hardcoding access keys.
# With an instance profile attached, this just works:
df = spark.read.parquet("s3://my-bucket/data/")
# Spark automatically uses the EC2 instance profile credentials
Setup:
- Create an IAM role with S3 permissions
- Create an instance profile from that role
- Register the instance profile ARN in Databricks (admin step)
- Users can then select that instance profile when creating a cluster
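The first two steps can be scripted on the AWS side, for example with boto3. A hedged sketch; the role and profile names are illustrative, and the role is assumed to already exist with S3 permissions:
import boto3
iam = boto3.client("iam")
iam.create_instance_profile(InstanceProfileName="databricks-s3-access")
iam.add_role_to_instance_profile(
    InstanceProfileName="databricks-s3-access",
    RoleName="databricks-s3-role",                   # existing role with S3 permissions
)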
Cross-Account IAM Role
The Databricks control plane uses a cross-account IAM role in your AWS account to:
- Create/terminate EC2 instances
- Attach/detach EBS volumes
- Configure security groups
- Tag resources
This role has the minimal permissions needed — it cannot read your S3 data.
S3 Integration
Direct S3 Access (Recommended with Unity Catalog)
With Unity Catalog and External Locations, access S3 directly:
df = spark.read.format("delta").load("s3://my-bucket/delta/employees/")
DBFS Mount (Legacy)
dbutils.fs.mount(
source="s3a://my-bucket/data/",
mount_point="/mnt/my-data",
extra_configs={
"fs.s3a.aws.credentials.provider":
"com.amazonaws.auth.InstanceProfileCredentialsProvider"
}
)
# Use as:
df = spark.read.parquet("/mnt/my-data/employees/")
S3 Path Formats
- `s3://` — standard AWS S3 URL scheme (used by AWS tools)
- `s3a://` — Hadoop's S3A connector (used by Spark, more efficient than the older `s3n://`)
- `dbfs:/mnt/<mount>/` — via a DBFS mount
VPC Configuration
Databricks can be deployed in your VPC in two modes:
Default VPC — Databricks creates and manages a VPC in your account. Simple but less control.
Customer-managed VPC (VPC Injection) — You provide the VPC, subnets, and security groups. Required for:
- Existing security/network policies
- PrivateLink (no public internet)
- Custom DNS
- VPC peering with RDS, MSK, etc.
PrivateLink
For enterprises requiring no public internet traffic, Databricks supports AWS PrivateLink:
- Control plane to data plane communication over private AWS network
- No data leaves the AWS backbone
- Requires customer-managed VPC
AWS Services Integration
Amazon MSK (Kafka)
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "b-1.msk-cluster.kafka.us-east-1.amazonaws.com:9092") \
.option("subscribe", "transactions") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.mechanism", "AWS_MSK_IAM") \
.option("kafka.sasl.jaas.config", "software.amazon.msk.auth...") \
.load()
Amazon Kinesis
df = spark.readStream \
.format("kinesis") \
.option("streamName", "my-stream") \
.option("region", "us-east-1") \
.option("initialPosition", "TRIM_HORIZON") \
.load()
AWS Glue Data Catalog Integration
Databricks can use the AWS Glue Data Catalog as a Hive Metastore replacement:
spark.hadoop.hive.metastore.client.factory.class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
This allows Databricks to discover tables registered in Glue that were created by Athena, Glue ETL, or other services.
Note: With Unity Catalog, Glue integration is less necessary — UC provides a superior alternative.
Amazon Redshift
# Read from Redshift
df = spark.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://cluster.us-east-1.redshift.amazonaws.com:5439/mydb") \
.option("dbtable", "public.sales") \
.option("tempdir", "s3://my-bucket/temp/") \
.option("aws_iam_role", "arn:aws:iam::123456789:role/RedshiftS3Role") \
.load()
Amazon RDS / Aurora
# JDBC connection
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:mysql://my-rds.cluster.us-east-1.rds.amazonaws.com:3306/mydb") \
.option("dbtable", "employees") \
.option("user", dbutils.secrets.get("scope", "rds-user")) \
.option("password", dbutils.secrets.get("scope", "rds-password")) \
.option("numPartitions", 10) \
.option("partitionColumn", "id") \
.option("lowerBound", "1") \
.option("upperBound", "1000000") \
.load()
AWS Secrets Manager Backed Secret Scope
For AWS organizations already using AWS Secrets Manager:
# Create a UC-backed scope (links to AWS Secrets Manager)
databricks secrets create-scope \
--scope aws-secrets \
--scope-backend-type AWS_SECRETS_MANAGER \
--region us-east-1
# Use secrets stored in AWS Secrets Manager
password = dbutils.secrets.get(scope="aws-secrets", key="my-secret-name")
Chapter 15 — Security & Access Control
Workspace-Level Security
User Management
- Users are managed at the Databricks account level (account console)
- SCIM (System for Cross-domain Identity Management) provisioning from identity providers (Okta, Azure AD, AWS SSO)
- Groups simplify permission management — assign users to groups, grant permissions to groups
Permission Levels
For workspace objects (notebooks, clusters, jobs):
| Level | Notebooks | Clusters | Jobs |
|---|---|---|---|
| CAN READ | View | None | View results |
| CAN RUN | Run | None | Run job |
| CAN EDIT | Edit | Can restart | Edit settings |
| CAN MANAGE | All | Full control | Full control |
| IS OWNER | All + delete | Create/delete | Create/delete |
Cluster-Level Security
- Cluster creation permission — not everyone can create clusters (controlled via admin settings or cluster policies)
- Cluster access permission — who can attach notebooks to or terminate a cluster
- Cluster policies — restrict what configurations users can apply
Table ACLs (Legacy — Pre-Unity Catalog)
Before Unity Catalog, table-level access control in the Hive Metastore was managed via GRANT statements on the cluster with table ACLs enabled. This is being replaced by Unity Catalog.
Network Security
IP Access Lists
Restrict workspace access to specific IP ranges:
- Administrators can configure allowed IP ranges
- Users outside the allowed ranges cannot access the workspace
- Configured via Admin Settings → IP Access Lists
Encryption
- Encryption at rest — S3 data is encrypted using AWS KMS (AWS-managed or customer-managed keys)
- Encryption in transit — TLS/SSL for all communication between control plane, data plane, and clients
- Customer-managed keys — for workspaces with strict security requirements, customers can manage their own KMS keys for notebook encryption and cluster EBS volumes
Compliance
Databricks supports various compliance certifications relevant for AWS:
- SOC 2 Type II
- ISO 27001
- HIPAA (with Business Associate Agreement)
- PCI DSS
- FedRAMP (government use)
Chapter 16 — Performance & Best Practices
File Size and Compaction
The "small file problem" is one of the most common performance issues:
- Each Spark task writes one output file
- Thousands of small files → slow reads (S3 overhead per file)
- Target file size: 128MB to 1GB for Parquet
Solutions:
- `OPTIMIZE` command (manual)
- Auto Compaction (automatic)
- Optimized Writes (on write)
- Liquid Clustering (ongoing)
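A short sketch of the first three options (illustrative table name): manual OPTIMIZE, plus table properties that enable Optimized Writes and Auto Compaction for future writes.
spark.sql("OPTIMIZE silver.orders")
spark.sql("""
  ALTER TABLE silver.orders SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")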
Data Skew
Data skew occurs when one or more partitions are much larger than others, causing certain executors to take much longer (stragglers).
Detection: Look for Tasks in Spark UI where one task takes 10x longer than others.
Solutions:
- Salting — add a random key suffix to distribute skewed keys
- AQE skew optimization — automatic in Databricks (handles skew joins)
- Repartition before a join on the skewed key
- Broadcast join — if the smaller table is small enough
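A minimal sketch of the salting approach (table names, the join key, and the salt range are illustrative): the skewed side gets a random salt, and the small side is replicated across every salt value so the join keys still match.
from pyspark.sql.functions import rand, explode, array, lit
NUM_SALTS = 16
skewed = spark.table("silver.clicks") \
    .withColumn("salt", (rand() * NUM_SALTS).cast("int"))
dims = spark.table("silver.campaigns") \
    .withColumn("salt", explode(array(*[lit(i) for i in range(NUM_SALTS)])))
joined = skewed.join(dims, ["campaign_id", "salt"]).drop("salt")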
Caching Strategy
- Cache DataFrames that are used multiple times in the same session
- Use the Delta disk cache (disk-based caching of Delta files) at the cluster level for repeatedly read Delta tables
- Don't cache unnecessarily — caching takes memory and initialization time
Partitioning Strategy
- Partition by date for time-series data (daily/monthly partitions)
- Avoid partitioning by high-cardinality columns
- Use Liquid Clustering as a more flexible alternative
- Keep partition count manageable (< 10,000 partitions per table)
Schema Best Practices
- Always define explicit schemas for production pipelines
- Avoid `inferSchema` in production (expensive, can produce wrong types)
- Use `StringType` initially if unsure, cast downstream
- Enable `mergeSchema` only when needed — it can hide bugs
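A minimal sketch of an explicit schema for a production read (field names and the S3 path are illustrative):
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType
orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
    StructField("order_ts", TimestampType(), nullable=True),
])
orders = (
    spark.read.format("json")
        .schema(orders_schema)                       # no inferSchema: types are fixed, reads are cheaper
        .load("s3://my-bucket/raw/orders/")
)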
Z-Ordering vs. Partitioning vs. Liquid Clustering
| | Partitioning | Z-Order | Liquid Clustering |
|---|---|---|---|
| Data skipping | Folder-level | File-level | File-level |
| Works on | Low-cardinality | High-cardinality | Any cardinality |
| Ongoing maintenance | None | Manual OPTIMIZE | Auto (OPTIMIZE) |
| Flexibility | Low (fixed at creation) | Medium | High |
| Recommended for | New pipelines: rarely | Existing tables | New tables |
Chapter 17 — Additional Databricks Concepts
Databricks SQL (DBSQL)
Databricks SQL is the analytics-focused layer of Databricks:
- SQL Editor — web-based SQL query interface
- SQL Warehouses — dedicated compute for SQL (serverless or classic)
- Dashboards — build visualizations and dashboards from SQL results
- Alerts — trigger notifications based on query results
- Query History — see all executed queries
Serverless SQL Warehouses — compute is managed by Databricks, instant startup, automatic scaling, no cluster configuration needed.
Databricks MLflow
MLflow is an open-source ML lifecycle platform built into Databricks:
- Tracking — log parameters, metrics, and artifacts from ML experiments
- Models — store, version, and deploy ML models
- Registry — model versioning and stage management (Staging, Production, Archived)
- Projects — package ML code for reproducibility
import mlflow
import mlflow.sklearn
with mlflow.start_run():
# Train model
model.fit(X_train, y_train)
# Log params
mlflow.log_param("n_estimators", 100)
mlflow.log_metric("accuracy", accuracy)
# Log model
mlflow.sklearn.log_model(model, "my-model")
Feature Store
Databricks Feature Store is a centralized repository for ML features:
- Store features as Delta tables
- Reuse features across teams and models
- Automatic feature lineage
- Point-in-time lookups for training data
Databricks Connect
Databricks Connect allows you to run Spark code from your local IDE (VS Code, IntelliJ, etc.) against a remote Databricks cluster:
- Write code locally with full IDE support (autocomplete, debugging)
- Execute on the remote cluster
- Great for development — combine local tooling with cloud compute
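A hedged sketch using the current Databricks Connect (the databricks-connect package for recent runtimes); connection details are assumed to come from your local Databricks CLI profile or environment variables, and the sample table is illustrative:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()      # attaches to the remote cluster
spark.table("samples.nyctaxi.trips").limit(5).show() # regular PySpark, executed remotely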
Databricks REST API
Almost everything in Databricks can be automated via the REST API:
- Create/manage clusters, jobs, notebooks
- Trigger job runs
- Manage permissions
- List and manage tables
The Databricks CLI wraps the REST API for command-line use.
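A hedged sketch of triggering a job run through the Jobs 2.1 run-now endpoint with plain requests (host, token, job ID, and parameters are illustrative assumptions):
import requests
host = "https://<workspace-host>"
token = "<personal-access-token>"
resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": 123, "notebook_params": {"run_date": "2024-01-01"}},
)
print(resp.json())                                   # contains the run_id of the triggered run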
Databricks Lakehouse Monitoring
Lakehouse Monitoring provides data quality monitoring for Delta tables:
- Profile data distributions over time
- Detect drift (statistical changes in data)
- Generate monitoring metrics as Delta tables
- Works for both batch and streaming tables
Data Contracts and Quality
Beyond DLT expectations, common data quality patterns:
- Great Expectations — open-source data quality library integrated with Databricks
- Delta constraints — SQL CHECK constraints on Delta tables
- Data quality monitoring — Lakehouse Monitoring
-- Add a CHECK constraint
ALTER TABLE orders ADD CONSTRAINT valid_quantity CHECK (quantity > 0);
Chapter 18 — Exam Tips & Topic Summary
Exam Structure
The Databricks Data Engineer Associate exam (exam code: Databricks-Certified-Data-Engineer-Associate) typically covers:
- ~20% Databricks Lakehouse Platform — architecture, workspace, clusters, notebooks
- ~30% ELT with Spark and Delta Lake — transformations, Delta features, SQL
- ~20% Incremental Data Processing — Auto Loader, Structured Streaming, DLT
- ~20% Production Pipelines — jobs, workflows, DLT pipelines
- ~10% Data Governance — Unity Catalog, permissions, data access
High-Frequency Exam Topics
Managed vs. External Tables
- Know the DROP behavior (metadata only vs. data+metadata)
- Know when to use each
Delta Lake Features
- Time travel (VERSION AS OF, TIMESTAMP AS OF)
- MERGE syntax (WHEN MATCHED, WHEN NOT MATCHED, WHEN NOT MATCHED BY SOURCE)
- OPTIMIZE and Z-ORDER
- VACUUM and data retention
- Schema enforcement vs. evolution
- CDF (Change Data Feed)
- SHALLOW CLONE vs. DEEP CLONE
Cluster Types
- All-Purpose for development
- Job clusters for production (auto-terminated, cost-efficient)
- SQL Warehouses for SQL analytics
- Instance pools for fast startup
%run vs. dbutils.notebook.run()
- `%run` shares scope (variables, SparkSession)
- `dbutils.notebook.run()` runs independently, can capture return value, can run in parallel
Streaming
- Trigger types (availableNow, processingTime, continuous)
- Checkpoint location importance
- foreachBatch for complex streaming operations
DLT
- `@dlt.expect` (warn) vs. `@dlt.expect_or_drop` (drop row) vs. `@dlt.expect_or_fail` (fail pipeline)
- Streaming vs. batch tables in DLT
- `APPLY CHANGES INTO` for CDC
Auto Loader
- `cloudFiles` format
- Schema inference and `schemaLocation`
- File notification vs. directory listing mode
Unity Catalog
- Three-level namespace (catalog.schema.table)
- Privilege hierarchy (must grant USE at each level)
- Managed vs. External tables in UC
- Storage Credentials and External Locations
dbutils.secrets
- Secrets are always `[REDACTED]` in output
- Two scope types: Databricks-backed and AWS Secrets Manager-backed
Common Traps
- `VACUUM` removes time travel capability — after VACUUM, you can't travel back to removed versions
- `DROP TABLE` on a managed table deletes data — external tables only delete metadata
- `%run` in a notebook runs in the same scope — not isolated like `dbutils.notebook.run()`
- Auto-scaling doesn't work well with streaming — use fixed clusters
- OPTIMIZE doesn't run automatically — unless Auto Compaction is enabled
- Schema inference in production is bad — use explicit schemas
- Temp views are session-scoped — not visible to other notebooks
- Job clusters are created per-run and terminated after — you can't manually start/stop them
- DLT tables are read-only outside the pipeline — you can't write to them directly
- Z-ORDER is only effective when combined with OPTIMIZE — it's not automatic
Quick Reference: Key SQL Commands
-- Time Travel
SELECT * FROM t VERSION AS OF 5;
SELECT * FROM t TIMESTAMP AS OF '2024-01-01';
-- History and Metadata
DESCRIBE HISTORY t;
DESCRIBE DETAIL t;
DESCRIBE EXTENDED t;
-- Maintenance
OPTIMIZE t;
OPTIMIZE t ZORDER BY (col1, col2);
VACUUM t RETAIN 168 HOURS;
RESTORE TABLE t TO VERSION AS OF 3;
-- Clone
CREATE TABLE t_backup SHALLOW CLONE t;
CREATE TABLE t_backup DEEP CLONE t;
-- MERGE
MERGE INTO target USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;
-- DML
UPDATE t SET col = value WHERE condition;
DELETE FROM t WHERE condition;
-- CDC
ALTER TABLE t SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
SELECT * FROM table_changes('t', 5, 10);
-- UC Permissions
GRANT SELECT ON TABLE catalog.schema.table TO `user@company.com`;
REVOKE SELECT ON TABLE catalog.schema.table FROM `user@company.com`;
SHOW GRANTS ON TABLE catalog.schema.table;
Quick Reference: Key Python Patterns
# Auto Loader (Bronze ingestion)
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.option("cloudFiles.schemaLocation", "/checkpoints/schema/") \
.load("s3://bucket/path/")
# Streaming write
df.writeStream.format("delta") \
.option("checkpointLocation", "/checkpoints/stream/") \
.trigger(availableNow=True) \
.table("bronze.events")
# Secrets
dbutils.secrets.get(scope="my-scope", key="my-key")
# Notebook parameters
dbutils.widgets.get("param_name")
# Task values (multi-task jobs)
dbutils.jobs.taskValues.set("key", "value")
dbutils.jobs.taskValues.get(taskKey="upstream", key="key")
# Run another notebook
result = dbutils.notebook.run("/path/to/notebook", 300, {"arg": "value"})
# Delta MERGE in Python
from delta.tables import DeltaTable
dt = DeltaTable.forName(spark, "target_table")
dt.alias("target").merge(
source.alias("source"),
"target.id = source.id"
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
DLT Quick Reference
import dlt
@dlt.table
@dlt.expect("not_null_id", "id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
@dlt.expect_or_fail("positive_id", "id > 0")
def my_silver_table():
return dlt.read_stream("my_bronze_table") \
.filter(...)
-- SQL DLT
CREATE OR REFRESH STREAMING LIVE TABLE silver_orders (
CONSTRAINT valid_id EXPECT (id IS NOT NULL),
CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.bronze_orders) WHERE ...;
Conclusion
The Databricks Data Engineer Associate exam tests your ability to design, build, and maintain data pipelines on the Databricks Lakehouse Platform. The most important areas to master are:
Delta Lake fundamentals — This is everywhere. Know ACID, time travel, MERGE, OPTIMIZE/VACUUM/ZORDER inside out.
Medallion Architecture — Understand Bronze/Silver/Gold and what belongs in each layer.
DLT and streaming — Know the declarative pipeline approach, expectations, and Auto Loader.
Cluster types and their appropriate use cases — All-purpose for dev, Job clusters for prod, SQL Warehouses for analytics.
Unity Catalog — The three-level namespace, privilege model, and the difference between managed and external tables.
Orchestration — Multi-task jobs, task dependencies, retry policies, and task values.
dbutils — Know every submodule: `fs`, `secrets`, `widgets`, `notebook`, and `jobs`.
The exam is scenario-based — it will describe a business requirement and ask you to identify the correct Databricks tool or approach. Read each question carefully, eliminate clearly wrong answers, and apply the principles covered in this guide.
Good luck with your Databricks Data Engineer Associate certification! 🚀
This guide covers the Databricks Data Engineer Associate exam curriculum. Always check the official Databricks certification exam guide for the latest topic list.