Gowtham Potureddi

Posted on Jun 11

Python timedelta & datetime for Data Engineers: Time Math, Windows & Time Zones

#python #sql #dataengineering #interview

python timedelta sounds like a trivia question — until the on-call pager fires at 02:30 because a "+24 hours" calculation crossed a DST boundary and silently re-processed yesterday's revenue. Time math is the single largest category of silent correctness bugs in analytics pipelines, and the only durable defence is to learn the four primitives the standard library actually ships, the three states a timestamp can live in (naive, aware-local, UTC), and the two ergonomics layers (pandas, polars) that wrap them for vectorised work.

This guide is the cheat sheet you wished existed the first time datetime + timedelta(days=1) produced a row that was one hour off in production. It walks through the object anatomy of datetime / date / time / timedelta, the modern zoneinfo story versus the legacy pytz API, the UTC-normalisation contract every warehouse-bound pipeline must enforce, the pandas-versus-polars trade-off for vectorised time arithmetic, and the three time-window patterns — tumbling, hopping, and session — that cover ~95% of streaming-style DE problems. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

When you want hands-on reps immediately after reading, drill the date-time practice library →, rehearse on time-series problems →, and stack the windowing muscles with sliding-window drills →.

On this page

Why time math is the silent bug magnet
timedelta vs datetime — the object anatomy
Time zones done right — zoneinfo, pytz, UTC
pandas Timedelta vs polars Duration
Sliding, tumbling & session windows
Cheat sheet — Python time-math recipes
Frequently asked questions
Practice on PipeCode

1. Why time math is the silent bug magnet in data pipelines

Time bugs are not edge cases — they are the predictable consequence of mixing three timestamp states and pretending the calendar is a simple number line

The one-sentence invariant: every timestamp in your pipeline lives in exactly one of three states — naive (no timezone), aware-local (wall clock plus IANA zone), or UTC — and every silent time bug is caused by an implicit, undeclared conversion between two of those states. Once you treat the state transitions as explicit contracts, the entire family of "the report is off by one hour" tickets becomes a small set of mechanical checks instead of a guessing game.

The three classes of time bugs.

Naive-mixing bugs. Two timestamps are subtracted; one has tzinfo, the other does not. Python raises TypeError: can't subtract offset-naive and offset-aware datetimes if you are lucky, and silently produces a wrong answer (via .replace(tzinfo=None)) if you are not.
DST jump bugs. A pipeline adds timedelta(days=1) to a local-time anchor and lands one hour off on March or November weekends. The arithmetic is correct; the semantics of "the same wall clock time tomorrow" cannot be expressed as a fixed-length duration.
Off-by-one window bugs. A "last 24 hours" window is computed using <= today and >= today - 1 day, but the comparison is against a date while the data carries a datetime. The window silently includes 25 hours of data (or 23, depending on which end is open).

Why "just add 24 hours" breaks during DST transitions.

Two days a year in every DST-observing jurisdiction, a wall clock loses (or gains) an hour. The naive arithmetic dt + timedelta(hours=24) always advances the underlying instant by exactly 24 hours — which is not the same as "the same wall-clock time tomorrow." On the spring-forward day, that means 02:30 + 24 hours is 03:30 the next day, not 02:30. If the next-day batch keys on "process all rows where ts is between 02:30 yesterday and 02:30 today," the batch under-counts by one hour. The fix is to anchor the math in UTC, advance there, and convert back at display time — or, when "same wall clock tomorrow" is the literal business contract, use dateutil.relativedelta(days=1) which is calendar-aware.

The cost of mixing aware and naive datetimes.

A pipeline that stores created_at as UTC-aware in the warehouse and writes a downstream report that compares it against datetime.utcnow() (which is naive in the standard library) will raise TypeError on every join. Teams "fix" this by calling .replace(tzinfo=None) on the warehouse column — which strips the contract and silently lies about the timestamp. Six months later, an analyst plots "events per hour by region" and discovers every region was actually using UTC because the strip happened before the region-based timezone conversion.

The 4 primitives every Python DE must master.

datetime.datetime — the workhorse: year, month, day, hour, minute, second, microsecond, plus optional tzinfo. Represents a point in time. Supports arithmetic with timedelta.
datetime.date — calendar date only: year, month, day. No time, no timezone. Useful for partition keys, billing periods, and report dates.
datetime.time — clock time only: hour, minute, second, microsecond, optional tzinfo. Rarely useful alone — almost always you want a full datetime.
datetime.timedelta — a duration: days, seconds, microseconds. Notably, nothing larger than days — no months, no years (because calendar units have variable length).

What interviewers listen for.

Do you say "store in UTC, display in local TZ" the moment a timestamp question lands? — required answer.
Do you reach for zoneinfo.ZoneInfo before pytz.timezone when asked to attach a zone? — modern-stack signal.
Do you mention that datetime.utcnow() returns a naive datetime as a known footgun? — senior signal.
Do you distinguish tumbling, hopping, and session windows by name when asked to bucket events? — streaming-DE signal.

The 2026 reality.

zoneinfo (Python 3.9+) is the modern, standard-library answer for IANA time zones. pytz is deprecated as the default but still ubiquitous in legacy code.
pandas 2.x uses nanosecond-resolution Timestamp and Timedelta by default; polars matches with pl.Datetime and pl.Duration and adds time-aware group_by_dynamic for tumbling and hopping windows out of the box.
dateutil.relativedelta is the canonical calendar-aware delta — the only safe way to add "one month" or "one year" without writing your own calendar arithmetic.

Worked example — the four object types in one shell session

Detailed explanation. The fastest way to internalise the object model is to call type() on every intermediate value in a short arithmetic sequence. Interviewers love asking "what is the type of dt - dt?" because juniors say "a number" and seniors say "a timedelta."

Question. Show that datetime - datetime is a timedelta, that datetime + timedelta is a datetime, and that date + timedelta(days=1) is a date — but date + timedelta(seconds=3600) is also a date (it silently ignores the sub-day component).

Input. No table — just two literal datetimes.

Code.

from datetime import datetime, date, timedelta

a = datetime(2026, 6, 5, 10, 0, 0)
b = datetime(2026, 6, 4, 22, 0, 0)

delta = a - b                # timedelta
print(type(delta), delta)    # <class 'datetime.timedelta'> 0:12:00:00

next_event = a + timedelta(hours=6)
print(type(next_event), next_event)  # <class 'datetime.datetime'> 2026-06-05 16:00:00

d = date(2026, 6, 5)
print(d + timedelta(days=1))         # 2026-06-06 (a date)
print(d + timedelta(seconds=3600))   # 2026-06-05 (still a date — hours dropped!)

Step-by-step explanation.

a - b produces a timedelta of 12 hours. Python expresses that internally as (days=0, seconds=43200, microseconds=0).
a + timedelta(hours=6) advances the point in time by six hours; the result is a datetime, not a timedelta.
date + timedelta(days=1) advances the calendar by one day; the result is a date.
date + timedelta(seconds=3600) is the surprising case: date has no sub-day component, so the seconds part of the delta is silently rounded toward zero. The result is still 2026-06-05. This is a frequent off-by-one trap when a function expects a datetime and receives a date.

Output.

Expression	Type	Value
`a - b`	`timedelta`	`0:12:00`
`a + timedelta(hours=6)`	`datetime`	`2026-06-05 16:00:00`
`date + timedelta(days=1)`	`date`	`2026-06-06`
`date + timedelta(seconds=3600)`	`date`	`2026-06-05`

Rule of thumb. When the question involves hours, minutes, or any sub-day arithmetic, work in datetime, not date. If your function signature accepts "a timestamp," type-annotate it as datetime and convert at the boundary — never let a date sneak into sub-day math.

Worked example — a churn pipeline that drifted by one day per quarter

Detailed explanation. A SaaS team computes "users active in the last 90 days" by subtracting 90 days from today and filtering events. The pipeline runs in America/New_York, the events are stored in UTC, and today is date.today() — which uses the system timezone of the box (the prod box is UTC, but the analyst's laptop is Eastern). Over a quarter, the report drifted by one day; nobody could reproduce it locally because each environment used a different today.

Question. Rewrite the "active in the last 90 days" computation so it is fully UTC-anchored and reproducible across boxes regardless of the system timezone.

Input. Conceptual — an events table with event_ts in UTC.

Code.

from datetime import datetime, timedelta, timezone

# BROKEN — today() uses the system timezone of the host
def cutoff_broken():
    from datetime import date
    return date.today() - timedelta(days=90)   # date in host TZ

# CORRECT — anchor in UTC explicitly, then take the date
def cutoff_correct():
    now_utc = datetime.now(timezone.utc)
    return (now_utc - timedelta(days=90)).date()

Step-by-step explanation.

date.today() calls the OS for the wall-clock date in the host's local timezone. On a UTC-configured box at 01:00 UTC, this returns 2026-06-05; on an Eastern-configured analyst laptop at the same instant (which is 20:00 EDT on June 4), it returns 2026-06-04. Same instant, different today.
datetime.now(timezone.utc) returns the current UTC instant as a tz-aware datetime. Subtracting 90 days produces a UTC instant 90 days earlier; calling .date() gives the UTC calendar date — identical on every host.
The cutoff is now a host-independent contract: the report uses the UTC calendar 90 days ago, regardless of where it runs.
Document the contract in the report header so downstream consumers know the report is anchored in UTC, not local time. The choice is a policy decision; the goal is to make it explicit.

Output.

Function	Run on UTC host	Run on Eastern host
`cutoff_broken()`	`2026-03-07`	`2026-03-06`
`cutoff_correct()`	`2026-03-07`	`2026-03-07`

Rule of thumb. Anywhere a report uses today as part of a calculation, write it as datetime.now(timezone.utc).date() — never date.today() — and put a one-line comment naming UTC. The cost is two extra words; the savings is one quarter of debugging at year-end.

Worked example — timedelta has no months, and why that is correct

Detailed explanation. A new analyst writes dt + timedelta(months=1) and gets TypeError: 'months' is an invalid keyword argument. They reach for timedelta(days=30) as the "obvious" workaround — and silently land one to three days off depending on which month it is. The fix is dateutil.relativedelta.

Question. Add one month to 2026-01-31. Show why timedelta(days=30) is wrong and how relativedelta(months=1) is right.

Input. A single literal anchor: 2026-01-31.

Code.

from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta

anchor = datetime(2026, 1, 31)

# BROKEN — "30 days" is the average month, not the actual one
naive_plus_month = anchor + timedelta(days=30)
print(naive_plus_month)       # 2026-03-02 (wrong — landed in March)

# CORRECT — calendar-aware addition
correct_plus_month = anchor + relativedelta(months=1)
print(correct_plus_month)     # 2026-02-28 (clamped to end of Feb)

Step-by-step explanation.

timedelta(days=30) is a fixed-length duration of 30 days. Adding it to January 31 advances exactly 30 days, landing on March 2 — past the entire month of February.
relativedelta(months=1) is a calendar-aware delta. It increments the month field by 1, then clamps the day to the last valid day of the target month. January 31 plus one month yields February 28 (or 29 in a leap year).
The same logic applies to "one year ago": relativedelta(years=1) handles leap-year edge cases (Feb 29 + 1 year → Feb 28).
timedelta deliberately rejects months and years because their length is not constant. The standard library forces you to be explicit about which semantic you want.

Output.

Expression	Result
`2026-01-31 + timedelta(days=30)`	`2026-03-02`
`2026-01-31 + relativedelta(months=1)`	`2026-02-28`

Rule of thumb. The moment a business spec says "month" or "year," reach for dateutil.relativedelta. timedelta is only safe for fixed-length durations: hours, minutes, seconds, microseconds, and "exactly N days."

Python interview question on a drifting time-window pipeline

A senior interviewer often opens with: "Walk me through every place a time-window pipeline can silently produce a wrong row count — naive timestamps, DST jumps, host timezone, calendar deltas, off-by-one window edges — and show how you would catch each one in a code review." It blends the three classes of time bugs and the four time primitives into a single audit.

Solution Using a UTC-anchored, explicit-contract pattern

from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo
from dateutil.relativedelta import relativedelta

# 1) Always anchor "now" in UTC, never the host
now_utc = datetime.now(timezone.utc)

# 2) Build the window in UTC with explicit timedeltas
window_end   = now_utc
window_start = window_end - timedelta(days=7)

# 3) For "one month ago," use relativedelta — never timedelta(days=30)
month_ago = now_utc - relativedelta(months=1)

# 4) Convert to local only at display, never at filter
display_tz   = ZoneInfo("America/New_York")
display_end  = window_end.astimezone(display_tz)
display_start = window_start.astimezone(display_tz)

Step-by-step trace.

Step	Value at 2026-06-05 09:00 UTC
`now_utc`	`2026-06-05 09:00:00+00:00`
`window_end`	`2026-06-05 09:00:00+00:00`
`window_start`	`2026-05-29 09:00:00+00:00`
`month_ago`	`2026-05-05 09:00:00+00:00`
`display_end` (EDT)	`2026-06-05 05:00:00-04:00`
`display_start` (EDT)	`2026-05-29 05:00:00-04:00`

After the audit, the team rewrites every "now" call to datetime.now(timezone.utc) and adds a CI check that fails any PR that introduces a bare datetime.now() without an argument.

Output:

Metric	Value
window length (hours)	168
month_ago day-of-month preserved	yes
display TZ offset (EDT)	-04:00
host-dependent calls remaining	0

Why this works — concept by concept:

UTC as the storage and arithmetic layer — every comparison, subtraction, and window boundary is computed in UTC. Local timezones only enter at display time, when the human-facing TZ is known.
datetime.now(timezone.utc) returns aware — unlike datetime.utcnow() (naive) or datetime.now() (host-local), this idiom is unambiguous and self-documenting.
timedelta for fixed-length durations — seven days, 168 hours, 60 minutes — all safe. The library refuses to encode variable-length units, which forces the engineer to make a choice.
relativedelta for calendar arithmetic — months, years, day-of-week math. Calendar-aware, leap-year aware, end-of-month aware.
astimezone() for display — the conversion preserves the underlying instant and changes only the wall-clock representation. The audit forbids .replace(tzinfo=…) for conversion because that changes the instant.
Cost — every operation is O(1) scalar; no I/O, no allocations beyond the new datetime object. Cheap insurance against an entire family of silent bugs.

Python
Topic — date-time
Date-time problems (Python)

Practice →

2. timedelta vs datetime — the object anatomy every DE must know

`python timedelta` is a duration; `datetime.datetime` is a point in time — keeping the two types straight closes 80% of time-math bugs

The mental model in one line: datetime answers when; timedelta answers how long — and every legal arithmetic combination produces one of those two shapes, never a raw number. Once you commit to "duration vs instant" as the type discipline, the entire arithmetic surface of datetime - datetime, datetime + timedelta, and timedelta * 2 becomes a small finite set of rules you can carry in your head.

The four objects in one table.

Object	Fields	Tz-aware?	Purpose
`datetime.datetime`	year, month, day, hour, minute, second, microsecond, tzinfo	yes (optional)	a point in time
`datetime.date`	year, month, day	no	a calendar date
`datetime.time`	hour, minute, second, microsecond, tzinfo	yes (optional)	a clock time
`datetime.timedelta`	days, seconds, microseconds	n/a	a duration

The arithmetic rules.

datetime - datetime → timedelta. Both must be tz-aware or both must be naive — mixing raises TypeError.
datetime + timedelta → datetime. The result preserves the timezone of the input.
timedelta + timedelta → timedelta. Pure duration math.
timedelta * 3 → timedelta. Scaling a duration is legal.
datetime + datetime → TypeError. There is no semantic for "two points in time added together."
date + timedelta(days=...) → date. Sub-day components of the delta are silently dropped.

Why timedelta has no months or years.

Months range from 28 to 31 days; years are 365 or 366. A duration must have a fixed length to behave like a number. Encoding "one month" as a fixed delta would be lying. The standard library forces you to choose between a fixed duration (timedelta) and a calendar delta (dateutil.relativedelta).

total_seconds() — the safe duration comparator.

timedelta exposes three integer fields internally — .days, .seconds (0–86399), .microseconds — but comparing two deltas by looking at one field is a bug magnet. td.total_seconds() collapses the whole duration into a single float; comparing two floats is unambiguous. Use this every time you need to threshold ("more than 30 minutes?") or rank durations.

Common interview probes on type discipline.

"What is datetime(2026,1,1) - datetime(2025,1,1)?" — answer: timedelta(days=365) (no tz, fixed-length).
"How do you get the number of hours in a timedelta?" — answer: td.total_seconds() / 3600 — not td.seconds / 3600 (which ignores the days field).
"Why does date + timedelta(hours=1) return the same date?" — because date has no time component; the hours part is silently truncated.
"What is the type of datetime - timedelta?" — datetime (preserving tz).

Worked example — parsing, adding, subtracting across the four types

Detailed explanation. A pipeline ingests two ISO 8601 strings, parses them into datetime, computes the duration between them, advances by a fixed timedelta, and finally extracts only the date for partitioning. Walking through the type at each step is the cleanest way to learn the contracts.

Question. Given two ISO timestamps, parse them, compute the gap in hours, advance the earlier one by 36 hours, and extract its calendar date. Show the type of every intermediate.

Input. Two strings: "2026-06-04T10:00:00" and "2026-06-05T16:00:00".

Code.

from datetime import datetime, timedelta

s1 = "2026-06-04T10:00:00"
s2 = "2026-06-05T16:00:00"

a = datetime.fromisoformat(s1)
b = datetime.fromisoformat(s2)

gap = b - a                           # timedelta
gap_hours = gap.total_seconds() / 3600

forward = a + timedelta(hours=36)     # datetime
partition_day = forward.date()        # date

print(type(a).__name__, a)
print(type(gap).__name__, gap, gap_hours, "hours")
print(type(forward).__name__, forward)
print(type(partition_day).__name__, partition_day)

Step-by-step explanation.

datetime.fromisoformat parses both ISO strings into naive datetime objects. (Both strings lack an offset, so both are naive — and naive-with-naive subtraction is legal.)
b - a produces a timedelta of one day and six hours.
.total_seconds() / 3600 converts the duration into a float number of hours (30.0). Never use .seconds / 3600 here — that would ignore the .days part and return 6.0.
a + timedelta(hours=36) produces a new datetime 36 hours after a: 2026-06-05 22:00:00.
.date() strips the time component and returns a date (2026-06-05). Useful for warehouse partition keys.

Output.

Step	Type	Value
parse `s1`	`datetime`	`2026-06-04 10:00:00`
`b - a`	`timedelta`	`1 day, 6:00:00`
gap_hours	`float`	`30.0`
`a + timedelta(hours=36)`	`datetime`	`2026-06-05 22:00:00`
`.date()`	`date`	`2026-06-05`

Rule of thumb. Every time you cross a type boundary (string → datetime, datetime → date, datetime → timedelta), write the type next to the variable in a comment for the first few lines of a new pipeline. Once the contracts are clear, the comments can go — but they save reviewers from guessing.

Worked example — total_seconds() is the only safe comparison

Detailed explanation. Two timedelta instances compare correctly with < and > because Python normalises them — but the moment you start displaying them or thresholding them, attribute access on .seconds or .days becomes a trap. total_seconds() is unambiguous.

Question. Given a stream of API-call durations as timedelta objects, write a function that returns the share of calls slower than 250ms. Show the wrong-attribute trap and the safe form.

Input. A list of timedelta durations (mix of sub-second and multi-second).

Code.

from datetime import timedelta

durations = [
    timedelta(milliseconds=120),
    timedelta(milliseconds=300),
    timedelta(seconds=2, milliseconds=50),
    timedelta(microseconds=99_000),    # 99 ms
]

threshold = timedelta(milliseconds=250)

# BROKEN — .seconds ignores microseconds < 1s and the .days field
slow_broken = sum(1 for d in durations if d.seconds > 0)   # only multi-second

# CORRECT — total_seconds() is the single source of truth
slow_correct = sum(1 for d in durations if d.total_seconds() > threshold.total_seconds())

print(slow_broken, slow_correct, len(durations))

Step-by-step explanation.

d.seconds returns the seconds part of the internal representation in the range [0, 86400). A duration of 120 milliseconds has seconds=0 and microseconds=120_000. So d.seconds > 0 is FALSE for the sub-second durations — the broken filter undercounts.
d.total_seconds() returns the whole duration as a float: 0.12, 0.3, 2.05, 0.099. Comparing each against 0.25 correctly classifies two as "slow" (300ms and 2050ms).
The same trap applies to .days — a timedelta of 36 hours has .days = 1, not 1.5. Always go through total_seconds() (or for very long deltas, divide by 86400 for days as a float).
The Python operators < and > between two timedelta instances do compare correctly because the dunder methods normalise; the danger only appears when one side is an attribute access.

Output.

Method	Slow count
`.seconds > 0` (broken)	1
`.total_seconds() > 0.25` (correct)	2

Rule of thumb. Treat .days, .seconds, .microseconds as implementation details of timedelta. The public API for comparison and conversion is total_seconds(). The cost is one method call per row; the savings is "your threshold function is not silently wrong on millisecond data."

Worked example — the date + timedelta(hours=…) trap

Detailed explanation. A common helper takes a date and an hour offset to land on "noon next Friday." If the caller accidentally passes a date (instead of a datetime) and a timedelta(hours=12), the helper silently returns the same date — because date + timedelta ignores sub-day components.

Question. Show that date(2026, 6, 5) + timedelta(hours=12) is the same date, and rewrite the helper so it accepts a date and produces a datetime at the requested local hour.

Input. A single date and an integer hour.

Code.

from datetime import date, datetime, timedelta, time

d = date(2026, 6, 5)

# Silent bug — sub-day component dropped, result is still a date at 00:00
result_bug = d + timedelta(hours=12)
print(type(result_bug).__name__, result_bug)   # date 2026-06-05

# Correct — combine the date with a time(hour=12) to produce a datetime
result_ok = datetime.combine(d, time(hour=12))
print(type(result_ok).__name__, result_ok)     # datetime 2026-06-05 12:00:00

Step-by-step explanation.

date + timedelta(hours=12): the timedelta has .days = 0 and .seconds = 43200. Adding it to a date advances the .days part by zero — the .seconds part is silently dropped because date cannot represent hours.
The bug: result_bug is 2026-06-05 (the same date), not 2026-06-05 12:00:00. The caller asked for "noon today" and got "today" — type-wise still a date, with no time component.
The fix: datetime.combine(date, time) is the canonical way to lift a date into a datetime at a specific clock time. The result is a proper datetime and downstream arithmetic in hours works.
Optionally annotate the helper with def at_hour(d: date, hour: int) -> datetime: and call combine internally; the type hint blocks accidental date-in / date-out pairs at code-review time.

Output.

Expression	Type	Value
`d + timedelta(hours=12)`	`date`	`2026-06-05`
`datetime.combine(d, time(12))`	`datetime`	`2026-06-05 12:00:00`

Rule of thumb. If a helper signature accepts a date but the body does sub-day math, the body should call datetime.combine(d, time(hour=...)) at the very top — or the signature should accept a datetime instead. Type confusion is the bug.

Python interview question on duration math correctness

A senior interviewer might frame this as: "Given a sequence of session start and end timestamps, compute the median session length in seconds. Your function must handle empty input and naive-versus-aware mixing safely." This is a probe on total_seconds(), type discipline, and edge-case handling.

Solution Using `timedelta.total_seconds()` and a median helper

from datetime import datetime, timedelta
from statistics import median

def median_session_seconds(sessions: list[tuple[datetime, datetime]]) -> float | None:
    """Return median session length in seconds, or None on empty input."""
    if not sessions:
        return None
    durations: list[float] = []
    for start, end in sessions:
        if (start.tzinfo is None) != (end.tzinfo is None):
            raise ValueError("Mixed naive and aware datetimes in same session")
        delta: timedelta = end - start
        durations.append(delta.total_seconds())
    return median(durations)

Step-by-step trace.

start	end	delta	total_seconds
09:00	09:30	0:30:00	1800.0
09:05	10:35	1:30:00	5400.0
10:00	10:02	0:02:00	120.0

Sorted: [120.0, 1800.0, 5400.0]. Median = the middle element = 1800.0.

Output:

Metric	Value
sessions	3
median_session_seconds	1800.0

Why this works — concept by concept:

end - start yields a timedelta — leveraging the arithmetic contract; no manual second math, no error-prone field access.
total_seconds() is unit-safe — produces a float in seconds. Multiplying by 1/60 gives minutes, by 1/3600 gives hours — no ambiguity.
Naive-vs-aware guard — the explicit check raises a clear error on the boundary case instead of relying on Python's TypeError propagation, which is harder to debug at the call site.
statistics.median for the aggregate — pure-Python, sorted-list median, O(n log n). For large inputs, switch to numpy's np.median (O(n) average via partial sort).
None on empty input — explicit "no data" sentinel that the caller can pattern-match; better than raising or returning 0.0 (which would look like a valid metric).
Cost — O(n) for the delta loop, O(n log n) for the median. Memory is O(n) for the durations list; for streaming, switch to a reservoir or a running-quantile sketch.

Python
Topic — time-series
Time-series problems (Python)

Practice →

3. Time zones done right — zoneinfo, pytz, UTC normalization

`zoneinfo` is the modern answer; `pytz` is the legacy survivor — and "store in UTC, display in local" is the single rule that defuses 90% of DST bugs

The mental model in one line: every timestamp in your pipeline must be either UTC-aware in storage or local-aware at display — and the conversion happens at exactly one boundary, never in the middle. Once you internalise the layered architecture (ingest → normalise → store → convert at display), the entire family of "the report says 23:00 but the dashboard says 00:00 next day" tickets stops appearing.

Naive vs aware in one paragraph.

A datetime is naive if its tzinfo is None, and aware otherwise. Naive datetimes are "wall clock with no contract" — they could be UTC, local, or anything in between, and the only way to know is to read the surrounding code (or a comment that may or may not be accurate). Aware datetimes carry a tzinfo (typically a ZoneInfo instance) that encodes the IANA zone and lets Python do the right thing on DST transitions. The golden rule: once a timestamp crosses into your codebase, attach a tzinfo immediately, and never let a naive one back out.

The golden rule, expanded.

Store in UTC. Warehouse columns should be TIMESTAMPTZ (Postgres / Snowflake / BigQuery) or TIMESTAMP_TZ with an explicit UTC normalisation at ingest.
Display in local. When a human reads the value, convert to their local zone via dt.astimezone(user_tz). The conversion happens at the renderer, never in the warehouse.
Never mix. A query that joins a UTC column to a local-time column without a conversion is always a bug — even if it accidentally produces the right answer for the first few rows.

zoneinfo.ZoneInfo — the 2026 default.

Stdlib. from zoneinfo import ZoneInfo (Python 3.9+). No external dependency, no pip install.
IANA database. Backed by the system's tzdata or the tzdata PyPI package on Windows. Updates with the OS.
Construction. ZoneInfo("America/New_York"), ZoneInfo("Asia/Tokyo"), ZoneInfo("UTC").
Usage. Attach via dt.replace(tzinfo=ZoneInfo(...)) only if the wall clock you have is in that zone; convert via dt.astimezone(ZoneInfo(...)) to change which zone the display uses.

pytz — the legacy survivor.

pytz.timezone(...).localize(naive_dt) is the correct way to attach a pytz zone — never use naive_dt.replace(tzinfo=pytz.timezone(...)) because that produces the historic LMT offset (e.g. -04:56 for New York instead of -05:00).
is_dst flag. During the fall-back hour, localize(naive_dt, is_dst=True) and localize(naive_dt, is_dst=False) disambiguate the two valid wall-clock interpretations.
astimezone works the same as in zoneinfo once the timestamp is properly attached.
Why it survives. Older codebases still depend on it, and migration is gradual — but for new code, use zoneinfo.

The conversion flow in detail.

Parse the input string to a naive or partially-aware datetime.
Attach the source timezone (the zone the wall clock represents).
Convert to UTC for storage and arithmetic.
Store in a TIMESTAMPTZ column or as an ISO 8601 string with offset.
On read, convert back to the user's local zone with astimezone(user_tz) before display.

DST traps.

Spring forward (skipped hour). On the second Sunday of March in US Eastern, 02:00 → 03:00 — the clock skips an hour. 02:30 does not exist on that day. zoneinfo raises NonExistentTimeError when you try to construct an aware datetime at that wall time; pytz requires is_dst handling.
Fall back (ambiguous hour). On the first Sunday of November in US Eastern, 02:00 → 01:00 — the clock repeats an hour. 01:30 exists twice. zoneinfo uses the fold attribute (fold=0 is the first occurrence, fold=1 is the second); pytz uses is_dst=True/False.

Why ISO 8601 with offset is the only safe wire format.

A timestamp on the wire is just a string — and the only string format that survives every system, library, and human reading is ISO 8601 with an explicit offset (2026-06-05T09:00:00+00:00 or 2026-06-05T09:00:00Z). Stripping the offset is the single most common cause of "wait, why does this look like 5am instead of 9am?" bugs.

Worked example — attaching zoneinfo and converting to UTC

Detailed explanation. A CSV from a vendor contains timestamps in America/New_York local time, with no offset attached. The pipeline must parse, attach the source zone, and normalise to UTC for warehouse storage.

Question. Given the string "2026-06-05 09:00:00" representing 9 AM in New York, attach the right tzinfo and convert to UTC.

Input. A single naive timestamp string "2026-06-05 09:00:00" known to be New York wall-clock.

Code.

from datetime import datetime
from zoneinfo import ZoneInfo

raw = "2026-06-05 09:00:00"
naive = datetime.fromisoformat(raw)

# Attach the source zone (does NOT change the wall clock)
ny_aware = naive.replace(tzinfo=ZoneInfo("America/New_York"))
print(ny_aware)                    # 2026-06-05 09:00:00-04:00

# Convert to UTC for storage (changes the wall clock to UTC)
utc = ny_aware.astimezone(ZoneInfo("UTC"))
print(utc)                         # 2026-06-05 13:00:00+00:00

Step-by-step explanation.

datetime.fromisoformat parses the raw string into a naive datetime: 2026-06-05 09:00:00 with no offset.
naive.replace(tzinfo=ZoneInfo("America/New_York")) attaches the timezone without changing the wall clock. The display now reads 09:00:00-04:00 — the same 9am, but now Python knows it represents Eastern Daylight Time (UTC-4 in June).
.astimezone(ZoneInfo("UTC")) converts to a different zone, preserving the underlying instant and changing the wall clock. 09:00 EDT is 13:00 UTC. The result is fully aware and safe to store.
The two operations are distinct: replace says "this wall clock was in zone X"; astimezone says "what is this instant's wall clock in zone Y?" Never use replace to convert.

Output.

Step	Display
naive parse	`2026-06-05 09:00:00`
`replace(tzinfo=ZoneInfo("America/New_York"))`	`2026-06-05 09:00:00-04:00`
`.astimezone(ZoneInfo("UTC"))`	`2026-06-05 13:00:00+00:00`

Rule of thumb. replace(tzinfo=…) is for labelling a naive datetime with its source zone. astimezone(…) is for converting an aware datetime to a different zone. If you confuse the two, you will silently shift every timestamp by the source-zone offset.

Worked example — DST spring forward (skipped hour)

Detailed explanation. On 2026-03-08 (the second Sunday of March), New York wall clocks jump from 01:59:59 to 03:00:00. The wall time 02:30 does not exist. Constructing an aware datetime at that wall time is a contract violation that zoneinfo surfaces immediately.

Question. Show what happens when you try to construct datetime(2026, 3, 8, 2, 30, tzinfo=ZoneInfo("America/New_York")), and how to detect and resolve it.

Input. A bad wall clock: 2026-03-08 02:30.

Code.

from datetime import datetime
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")
bad = datetime(2026, 3, 8, 2, 30, tzinfo=ny)
print(bad)                # zoneinfo does not raise on construction —
                          # it picks the next-valid offset (-04:00 = post-jump)
print(bad.utcoffset())    # -1 day, 20:00:00 → -04:00 (already past DST)

# Round-trip: convert to UTC then back. The wall clock changes!
roundtrip = bad.astimezone(ZoneInfo("UTC")).astimezone(ny)
print(roundtrip)          # 2026-03-08 02:30:00-04:00 (same — already post-jump)

Step-by-step explanation.

zoneinfo chooses the post-jump offset (-04:00) for a wall clock that falls inside the skipped hour. The constructed datetime is "consistent" but represents an instant that does not match the wall clock the user typed.
Converting to UTC and back returns the same 02:30-04:00 — proving that the constructed value is internally consistent, just not what the user intended.
To detect a skipped wall clock, compare dt.utcoffset() before and after the DST jump and look for a discontinuity, or query the IANA database for transitions in the local day.
In practice, never accept a naive wall clock that falls in the DST-skipped band. Sanitise the input: round forward to 03:00, raise an error, or surface the ambiguity to the user.

Output.

Step	Display
naive `2026-03-08 02:30` + NY	`2026-03-08 02:30:00-04:00`
utcoffset	`-04:00` (post-jump)
round-trip UTC → NY	`2026-03-08 02:30:00-04:00`

Rule of thumb. Spring-forward wall clocks are user errors, not engine bugs. The fix is at the data-ingestion layer: validate that every naive wall clock the system accepts is a valid local time on its date, and surface the ambiguity rather than silently picking an offset.

Worked example — DST fall back (ambiguous hour)

Detailed explanation. On 2026-11-01 (the first Sunday of November), New York wall clocks repeat 01:00 → 02:00. The wall time 01:30 happens twice — once as EDT (-04:00) and once as EST (-05:00). Without the fold attribute, you have a 50/50 chance of being right.

Question. Construct both occurrences of 2026-11-01 01:30 in New York and confirm they map to different UTC instants.

Input. A repeated wall clock: 2026-11-01 01:30.

Code.

from datetime import datetime
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")
first  = datetime(2026, 11, 1, 1, 30, tzinfo=ny, fold=0)  # the EDT occurrence
second = datetime(2026, 11, 1, 1, 30, tzinfo=ny, fold=1)  # the EST occurrence

print(first.utcoffset(),  first.astimezone(ZoneInfo("UTC")))
print(second.utcoffset(), second.astimezone(ZoneInfo("UTC")))

Step-by-step explanation.

fold=0 selects the first occurrence — the EDT one with offset -04:00. Its UTC equivalent is 05:30 UTC.
fold=1 selects the second occurrence — the EST one with offset -05:00. Its UTC equivalent is 06:30 UTC.
The two aware datetimes are different points in time, separated by exactly one hour, even though their wall-clock display is identical.
pytz callers use tz.localize(naive, is_dst=True) (EDT) or is_dst=False (EST) — the same disambiguation, different API surface.

Output.

Variant	utcoffset	UTC equivalent
fold=0 (EDT)	`-04:00`	`2026-11-01 05:30:00+00:00`
fold=1 (EST)	`-05:00`	`2026-11-01 06:30:00+00:00`

Rule of thumb. When ingesting fall-back wall clocks, default to fold=0 (the first / "before transition" interpretation) and surface the ambiguity in a log line. If the source system records the offset separately, use the offset to choose the fold — never guess.

Worked example — `datetime.utcnow()` is naive (and dangerous)

Detailed explanation. datetime.utcnow() returns the current UTC wall clock as a naive datetime — no tzinfo attached. Code that compares this to a UTC-aware warehouse column raises TypeError. Worse, the "fix" of attaching UTC via .replace(tzinfo=ZoneInfo("UTC")) works correctly only because the value happened to be UTC — and that contract is invisible to readers.

Question. Show the difference between datetime.utcnow() and datetime.now(timezone.utc), and the correct idiom for "current instant as a UTC-aware datetime."

Input. None — same instant under both calls.

Code.

from datetime import datetime, timezone

# BROKEN — utcnow returns naive
naive_utc = datetime.utcnow()
print(naive_utc.tzinfo)       # None  ← contract not encoded

# CORRECT — now(tz) returns aware
aware_utc = datetime.now(timezone.utc)
print(aware_utc.tzinfo)       # datetime.timezone.utc

# Comparing the two raises
try:
    aware_utc - naive_utc
except TypeError as e:
    print("TypeError:", e)

Step-by-step explanation.

datetime.utcnow() returns the current UTC wall clock but does not attach tzinfo. Any downstream code that does not know the convention will treat it as "naive local."
datetime.now(timezone.utc) (or equivalently datetime.now(ZoneInfo("UTC"))) attaches UTC to the result. The contract is encoded in the value itself, not in a comment.
Subtracting the two raises TypeError because mixing naive and aware is illegal. This is a feature, not a bug — it forces the contract to be made explicit.
New code should prefer datetime.now(timezone.utc); legacy code that uses utcnow() should wrap with .replace(tzinfo=timezone.utc) at the boundary, ideally with a comment naming the convention.

Output.

Expression	Result
`datetime.utcnow().tzinfo`	`None`
`datetime.now(timezone.utc).tzinfo`	`datetime.timezone.utc`
`aware - naive`	`TypeError`

Rule of thumb. Treat datetime.utcnow() as a deprecated API. The replacement is datetime.now(timezone.utc) — same instant, but the timezone is encoded in the value.

Python interview question on ingesting mixed naive and aware timestamps

A senior interviewer often frames this as: "Your ingestion job pulls CSVs from three vendors. Vendor A sends UTC ISO strings with Z. Vendor B sends naive local-time strings labelled America/New_York. Vendor C sends ISO strings with offsets. Write a single function that normalises every input to UTC-aware."

Solution Using a single normalisation funnel with `zoneinfo`

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def normalise_to_utc(s: str, source_zone: str | None = None) -> datetime:
    """Parse a timestamp string and return a UTC-aware datetime.

    Rules:
      - If the string carries an offset (ISO 8601), use it.
      - Else, if source_zone is given, attach it and convert to UTC.
      - Else, raise — refuse to guess.
    """
    dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
    if dt.tzinfo is not None:
        return dt.astimezone(timezone.utc)
    if source_zone is None:
        raise ValueError(f"Naive datetime {s!r} requires source_zone")
    aware = dt.replace(tzinfo=ZoneInfo(source_zone))
    return aware.astimezone(timezone.utc)

Step-by-step trace.

Vendor	Raw	source_zone	Parsed	Result (UTC)
A	`2026-06-05T09:00:00Z`	n/a	aware UTC	`2026-06-05 09:00:00+00:00`
B	`2026-06-05 09:00:00`	`America/New_York`	naive	`2026-06-05 13:00:00+00:00`
C	`2026-06-05T09:00:00+05:30`	n/a	aware IST	`2026-06-05 03:30:00+00:00`

The function rejects any naive input without an explicit source_zone, forcing the caller to declare the convention rather than silently guessing.

Output:

Vendor	Normalised UTC
A	`2026-06-05 09:00:00+00:00`
B	`2026-06-05 13:00:00+00:00`
C	`2026-06-05 03:30:00+00:00`

Why this works — concept by concept:

One funnel, one output type — every path returns a UTC-aware datetime. No conditional types, no Optional[tzinfo] downstream, no per-vendor handling in the pipeline body.
fromisoformat for ISO 8601 — handles offsets, microseconds, and the Z suffix (after the .replace("Z", "+00:00") shim). Stdlib, no extra dependency.
Explicit source_zone for naive inputs — refuses to guess. Every caller declares the convention; ambiguity becomes a compile-time choice, not a runtime bug.
astimezone(timezone.utc) for normalisation — preserves the underlying instant; changes the wall clock to UTC. The warehouse-bound value is always UTC-aware.
No pytz in the new code path — zoneinfo is stdlib, modern, and avoids the localize / LMT historical-offset trap.
Cost — O(1) per timestamp; the parse and convert are both constant-time scalar operations. The function is safe to call millions of times per ingestion run.

Python
Topic — date-time
Date-time problems (Python)

Practice →

4. pandas Timedelta vs polars Duration — vectorized time math

Both engines speak the same time-math language — the choice between them is one of scale, ergonomics, and ecosystem, not correctness

The mental model in one line: pandas and polars expose the same five time primitives — datetime, duration, bucket, window, timezone — under different keywords; if you can name the primitive, you can do the operation in either engine. Once you know the rosetta-stone mapping, swapping engines becomes a syntax exercise instead of a redesign.

The pandas trio.

pd.Timestamp — pandas' replacement for datetime. Nanosecond resolution by default (since pandas 2.0). Indexable, vectorisable.
pd.Timedelta — pandas' replacement for timedelta. Same nanosecond resolution. Accepts strings: pd.Timedelta("7D"), pd.Timedelta("1H 30min").
pd.DatetimeIndex — a sorted, dtyped index of Timestamp. Backs every time-based pandas operation: resample, rolling, between_time, tz_convert.

The polars equivalents.

pl.Datetime — polars' datetime type. Nanosecond, microsecond, or millisecond precision (configurable per column).
pl.Duration — polars' duration type. Same precision options. Constructed via pl.duration(days=…, hours=…) or from a string.
group_by_dynamic — polars' time-aware group-by. Supports tumbling, hopping, and (with helper) session windows out of the box.

Resample and rolling — the pandas workhorses.

df.resample("1H").sum() — tumbling 1-hour buckets. The index must be a DatetimeIndex. Aliases: "1H", "1D", "1W-MON", "15min", "5T".
df.rolling("7D").mean() — sliding 7-day window. The window is time-based when the index is datetime — it adapts to irregular timestamps instead of using fixed row counts.
df.between_time("09:00", "17:00") — filter rows by clock-time only, ignoring date. Useful for business-hours metrics.

Bucketing operations.

pandas: s.dt.floor("1H"), s.dt.ceil("15min"), s.dt.round("D"). Returns a Series.
polars: col("ts").dt.truncate("1h"), col("ts").dt.round("15m"). Returns an Expr.
Use: pre-bucket a timestamp column before joining or grouping. floor is the workhorse; round only for symmetric thresholds.

Time-zone aware columns.

pandas: s.dt.tz_localize("UTC") attaches UTC; s.dt.tz_convert("America/NY") converts. Both must be called explicitly — a DatetimeIndex is naive by default.
polars: col("ts").dt.replace_time_zone("UTC") then dt.convert_time_zone("America/NY"). Same two-step ritual.
Gotcha: concatenating a tz-aware Series with a tz-naive Series in pandas silently coerces to object dtype, killing every vectorised operation downstream.

Performance — the polars edge.

Lazy execution. pl.scan_csv(...).filter(...).group_by_dynamic(...).collect() builds a query plan, optimises it, and executes only the needed columns. pandas executes eagerly row-by-row from the first operation.
Multi-threaded. polars uses all CPU cores by default. pandas is single-threaded except for specific Cython kernels.
Real-world. On a 50 M-row datetime column with a group_by_dynamic aggregation, polars typically runs 10–50× faster than the pandas equivalent — the gap widens with row count.

Mixing tz-aware and tz-naive in pandas — the silent dtype trap.

import pandas as pd
a = pd.Series(pd.to_datetime(["2026-06-05 09:00"]).tz_localize("UTC"))
b = pd.Series(pd.to_datetime(["2026-06-05 09:00"]))   # naive
mixed = pd.concat([a, b])
print(mixed.dtype)   # object  ← every vectorised dt.* operation now silently broken

Common interview probes.

"Resample vs rolling?" — resample is tumbling (non-overlapping); rolling is sliding (overlapping, time-based windows when the index is datetime).
"How do you convert a timezone in polars?" — col("ts").dt.convert_time_zone("America/NY") on an aware column; replace_time_zone first if naive.
"Why does my pandas datetime column suddenly become object dtype?" — almost always tz-aware + tz-naive concat.
"How do you do a 5-minute hopping window with a 1-hour period in pandas?" — there is no direct primitive; you build it via pd.Grouper plus merge_asof or use polars' group_by_dynamic(every="5m", period="1h") directly.

Worked example — same 7-day rolling revenue in pandas and polars

Detailed explanation. A team computes 7-day rolling revenue per day from a revenue table. Writing it in both engines side by side is the cleanest way to learn the rosetta stone: same primitives, different syntax.

Question. Given an events table with ts and revenue, compute 7-day rolling revenue. Show both the pandas and the polars solution.

Input.

ts	revenue
2026-06-01	100
2026-06-02	150
2026-06-03	200
2026-06-04	50
2026-06-05	300

Code.

import pandas as pd
import polars as pl

# pandas — set DatetimeIndex, then rolling("7D")
pdf = pd.DataFrame({"ts": pd.to_datetime(["2026-06-01","2026-06-02","2026-06-03","2026-06-04","2026-06-05"]),
                    "revenue": [100, 150, 200, 50, 300]}).set_index("ts")
pdf["rev_7d"] = pdf["revenue"].rolling("7D").sum()

# polars — group_by_dynamic with every=1d period=7d
ldf = pl.DataFrame({"ts": ["2026-06-01","2026-06-02","2026-06-03","2026-06-04","2026-06-05"],
                    "revenue": [100, 150, 200, 50, 300]}).with_columns(pl.col("ts").str.to_date())
out = ldf.group_by_dynamic("ts", every="1d", period="7d").agg(pl.col("revenue").sum().alias("rev_7d"))

Step-by-step explanation.

The pandas form sets a DatetimeIndex and uses rolling("7D"). The string "7D" makes the window time-based (not row-count-based); it includes every row whose timestamp falls within 7 days of the current row's timestamp.
The polars form uses group_by_dynamic with every="1d" (one bucket per day) and period="7d" (each bucket spans 7 days). The behaviour is equivalent: each output row covers a 7-day window ending at its ts.
Both compute the same revenue: a running window sum, anchored to each input timestamp. The arithmetic is identical; the spelling differs.
On 1 M rows of input, polars typically runs ~10× faster because of lazy planning, multi-threading, and SIMD; on the 5-row toy example above, the difference is negligible.

Output.

ts	revenue	rev_7d
2026-06-01	100	100
2026-06-02	150	250
2026-06-03	200	450
2026-06-04	50	500
2026-06-05	300	800

Rule of thumb. When the team's stack already includes pandas and the volume is small, stick with rolling("7D"). When the volume exceeds ~10 M rows or the latency budget is tight, port to group_by_dynamic in polars — the rewrite is mechanical, the speedup is significant.

Worked example — tz_localize then tz_convert (pandas)

Detailed explanation. A pandas DataFrame ingested from a CSV has naive timestamps known to be in America/New_York. To store them in UTC, you must tz_localize first (attach the source zone), then tz_convert (change to UTC). Skipping the localize step raises an error.

Question. Given a pandas Timestamp column of naive New York local times, convert it to UTC-aware.

Input. Three naive timestamps representing NY local time.

Code.

import pandas as pd

s = pd.Series(pd.to_datetime([
    "2026-06-05 09:00",
    "2026-06-05 10:00",
    "2026-06-05 11:00",
]))
print(s.dt.tz)            # None — naive

# Step 1 — attach the source zone (does not change wall clock)
s_ny = s.dt.tz_localize("America/New_York")
print(s_ny.dt.tz)         # America/New_York

# Step 2 — convert to UTC
s_utc = s_ny.dt.tz_convert("UTC")
print(s_utc)              # 13:00, 14:00, 15:00 UTC

Step-by-step explanation.

The raw column is naive — dt.tz is None. Calling dt.tz_convert(...) directly raises TypeError: Cannot convert tz-naive Timestamps, use tz_localize to localize.
dt.tz_localize("America/New_York") attaches the zone without changing the wall clock. The display now shows the same hours with -04:00 (EDT in June).
dt.tz_convert("UTC") converts to UTC: each timestamp's wall clock shifts forward by 4 hours, and the offset becomes +00:00.
Storing s_utc in the warehouse (e.g. as TIMESTAMPTZ) preserves the contract end-to-end. On read, the consuming dashboard can convert back via dt.tz_convert(user_tz).

Output.

Step	First value
naive	`2026-06-05 09:00:00`
`tz_localize("America/New_York")`	`2026-06-05 09:00:00-04:00`
`tz_convert("UTC")`	`2026-06-05 13:00:00+00:00`

Rule of thumb. tz_localize answers "what zone was this naive wall clock in?"; tz_convert answers "what is this aware instant's wall clock in zone Y?" Skipping localize on a naive column is a hard error; using tz_convert to "attach" a zone is a logic bug.

Worked example — floor to hour, then group (polars)

Detailed explanation. A common ETL step is to bucket timestamps to the hour, then count events per bucket. Polars' dt.truncate is the analog of pandas' dt.floor; both round down to the nearest unit. Polars expresses the whole pipeline as a single expression chain.

Question. Given a polars DataFrame of events with sub-second timestamps, compute the count of events per hourly bucket.

Input.

ts	event
2026-06-05 09:03:11	login
2026-06-05 09:47:02	login
2026-06-05 10:05:00	login
2026-06-05 10:55:32	login

Code.

import polars as pl
from datetime import datetime

df = pl.DataFrame({
    "ts": [datetime(2026,6,5,9,3,11), datetime(2026,6,5,9,47,2),
           datetime(2026,6,5,10,5,0), datetime(2026,6,5,10,55,32)],
    "event": ["login"] * 4,
})

out = (df
       .with_columns(pl.col("ts").dt.truncate("1h").alias("hour_bucket"))
       .group_by("hour_bucket")
       .agg(pl.len().alias("n_events"))
       .sort("hour_bucket"))

Step-by-step explanation.

dt.truncate("1h") rounds every ts down to the nearest hour. 09:47:02 becomes 09:00:00; 10:55:32 becomes 10:00:00.
group_by("hour_bucket") collapses identical bucket values into a single group key.
pl.len().alias("n_events") counts rows per group. Each output row corresponds to one bucket.
sort("hour_bucket") ensures the output is chronological. group_by is unordered in polars, so an explicit sort is required for downstream consumers that assume time order.

Output.

hour_bucket	n_events
2026-06-05 09:00:00	2
2026-06-05 10:00:00	2

Rule of thumb. Use dt.truncate("1h") (polars) or dt.floor("1H") (pandas) for tumbling buckets. Reach for group_by_dynamic (polars) or resample (pandas) when you need hopping or sliding behaviour, not just floor-then-group.

Python interview question on a vectorised time-series rollup

A senior interviewer might say: "You have a 50 M-row events table with ts (UTC) and revenue. Compute hourly revenue per region, with the result sorted by region then hour. Write it in both pandas and polars and tell me which scales better."

Solution Using `group_by_dynamic` in polars and `groupby + resample` in pandas

import polars as pl
import pandas as pd

# polars — lazy, multi-threaded
out_pl = (pl.scan_parquet("events.parquet")
            .group_by_dynamic("ts", every="1h", by="region")
            .agg(pl.col("revenue").sum().alias("rev"))
            .sort(["region", "ts"])
            .collect())

# pandas — eager
df = pd.read_parquet("events.parquet")
out_pd = (df.set_index("ts")
            .groupby("region")["revenue"]
            .resample("1H").sum()
            .reset_index()
            .rename(columns={"revenue": "rev"})
            .sort_values(["region", "ts"]))

Step-by-step trace.

Engine	Plan	Row count after group-by
polars (lazy)	scan → group_by_dynamic → sort	24 × N_regions per day
pandas (eager)	read → set_index → groupby → resample	24 × N_regions per day

The polars plan is built lazily and optimised before execution; the pandas pipeline materialises after set_index and again after groupby. The arithmetic outcome is identical.

Output:

region	ts	rev
EU	2026-06-05 00:00:00	12 340.50
EU	2026-06-05 01:00:00	8 901.25
US	2026-06-05 00:00:00	22 105.75
US	2026-06-05 01:00:00	17 882.10

Why this works — concept by concept:

group_by_dynamic as the time-aware group-by — combines the bucketing (every="1h") and the group key (by="region") into one operation, planned together by the polars optimiser.
pandas resample after groupby — each group's index is independently resampled to hourly. The two-step pipeline is more verbose but conceptually equivalent.
Lazy execution scales — pl.scan_parquet reads only the columns the plan needs; set_index in pandas materialises the entire frame in memory before the resample.
Multi-threading — polars distributes the per-region aggregation across cores; pandas single-threads through one Python kernel.
Sorted output contract — both pipelines end with an explicit sort. Group-by results are unordered in polars and order-by-key in pandas; downstream consumers that rely on (region, ts) order need the sort to be explicit.
Cost — polars: O(n) for the scan, O(buckets × regions) for the aggregate, multi-threaded → wall-clock scales sub-linearly. pandas: O(n) read, O(n) groupby, O(buckets) per group → linear single-thread → 10–50× slower at 50 M rows.

Python
Topic — time-series
Time-series problems (Python)

Practice →

5. Sliding, tumbling & session windows — the time-window patterns

Three window shapes cover ~95% of streaming-style DE problems — and the only thing separating "easy" from "interview-hard" is whether you can name the gap semantic

The mental model in one line: a time window has a length, a step, and a closing rule — tumbling is length=step, hopping is length≥step, and session is "closed when the gap exceeds X". Once you can quote those three shapes by name, every "bucket events into…" interview prompt collapses to "which shape does the business want?"

Tumbling windows.

Definition. Fixed-length, non-overlapping buckets that tile the timeline. every == period.
Examples. Hourly revenue, daily DAU, 15-minute event counts.
pandas: df.resample("1H").sum() on a DatetimeIndex.
polars: df.group_by_dynamic("ts", every="1h").agg(...).
Semantics. Each event belongs to exactly one bucket. Closed on the left, open on the right by default in both engines.

Hopping / sliding windows.

Definition. Fixed-length, overlapping buckets. period > every. Synonyms: hopping (in Flink / Beam) and sliding (in some libraries) refer to the same shape.
Examples. "1-hour window every 5 minutes" for trailing-hour metrics; "7-day rolling revenue updated daily."
pandas: rolling("1H") on a DatetimeIndex for the time-based sliding; no native every parameter — you bucket via Grouper then merge_asof if you need true hopping.
polars: group_by_dynamic("ts", every="5m", period="1h") is the direct primitive.
Semantics. Each event belongs to multiple buckets (period / every of them on average). Aggregates can be heavy on memory; use incremental algorithms for high-frequency hops.

Session windows.

Definition. Variable-length, gap-closed buckets. A session is "events with no inter-event gap larger than X."
Examples. User sessions on a website (30-min inactivity gap), conversation threads (10-min gap), trade bursts (1-min gap).
pandas: No direct primitive; compute via diff + cumsum of a "new session" boolean.
polars: Same pattern via expression chain; also supports group_by_dynamic with an artificial session_id column.
Semantics. Each event belongs to exactly one session. Session boundaries depend on the data, not the clock — two users with different click patterns get different session counts on the same day.

Watermarks and late data.

Watermark. "No more events older than W will arrive." In streaming systems (Flink, Beam, Spark Structured Streaming), the watermark advances as event time progresses, and windows close once the watermark passes their right edge.
Late-arriving data. Events whose timestamps are older than the watermark when they arrive. They either re-open the closed window (idempotent re-aggregation) or are dropped to a side-output (lossy but bounded latency).
Batch implication. Even pure-batch pipelines have a watermark — implicit in "we re-run the previous 7 days every night to absorb late corrections." Naming the watermark explicitly makes the contract auditable.

Anti-pattern: building windows with Python loops.

# DON'T do this
windows = []
for ts in event_ts:
    start = ts - timedelta(hours=1)
    windows.append(sum(e.amount for e in events if start <= e.ts <= ts))

This is O(n²) for n events and a nested for over events per row. Vectorised forms via groupby + cumsum, rolling, or group_by_dynamic are O(n log n) at worst and 100–1000× faster.

Common interview probes on windows.

"Tumbling vs hopping?" — tumbling is every == period; hopping is period > every.
"How do you detect a session in pure SQL?" — LAG(ts) OVER (PARTITION BY user ORDER BY ts); flag rows where the gap exceeds the threshold; SUM(flag) OVER (...) as the session id.
"What is event-time vs processing-time?" — event-time is when the event happened; processing-time is when the system saw it. Always window on event-time for correctness.
"Why are session windows harder than tumbling?" — variable length means each session's boundaries depend on its own data; you cannot pre-bucket.

Worked example — tumbling: hourly event count

Detailed explanation. The canonical first window question: "how many events per hour?" Both engines have a direct primitive; the answer is a one-liner.

Question. Given an events DataFrame with sub-second timestamps, compute the event count per hour. Show the polars solution.

Input.

ts	event
2026-06-05 09:03:11	x
2026-06-05 09:47:02	x
2026-06-05 10:05:00	x
2026-06-05 11:55:32	x

Code.

import polars as pl
from datetime import datetime

df = pl.DataFrame({
    "ts": [datetime(2026,6,5,9,3,11), datetime(2026,6,5,9,47,2),
           datetime(2026,6,5,10,5,0), datetime(2026,6,5,11,55,32)],
    "event": ["x"] * 4,
})

out = (df
       .group_by_dynamic("ts", every="1h")
       .agg(pl.len().alias("n"))
       .sort("ts"))

Step-by-step explanation.

group_by_dynamic("ts", every="1h") declares: "bucket the ts column into 1-hour tumbling windows." Each bucket starts at the previous hour boundary (floor) and is non-overlapping with its neighbours.
pl.len().alias("n") counts the rows in each bucket. The expression evaluates per bucket, not per row.
The output is one row per non-empty bucket. Empty buckets (e.g. the hour where no events occurred) are not emitted by default — use closed="left" and explicit reindexing if you need them.
sort("ts") produces chronological output; group_by_dynamic does not guarantee order in polars.

Output.

ts	n
2026-06-05 09:00:00	2
2026-06-05 10:00:00	1
2026-06-05 11:00:00	1

Rule of thumb. When the spec says "per hour," "per day," or "per N minutes," and the buckets are non-overlapping, the answer is tumbling — resample in pandas, group_by_dynamic(every=…) in polars. Anything else is over-engineering.

Worked example — hopping: 1-hour window every 5 minutes

Detailed explanation. A real-time dashboard shows "events in the last hour, refreshed every 5 minutes." Each output row spans 60 minutes but the rows are spaced 5 minutes apart — so each event appears in ~12 output rows.

Question. Given the same events DataFrame, compute a 1-hour window every 5 minutes.

Input. Same as the previous example.

Code.

out = (df
       .group_by_dynamic("ts", every="5m", period="1h")
       .agg(pl.len().alias("n_last_hour"))
       .sort("ts"))

Step-by-step explanation.

every="5m" declares the step between consecutive bucket starts. The dashboard refreshes every 5 minutes.
period="1h" declares the length of each bucket. Each row aggregates the last hour of events.
Each event appears in period / every = 60min / 5min = 12 buckets on average. Memory and compute scale with that multiplier — for high-frequency hops, pick incremental algorithms (cumulative sums) or accept the cost.
The output's left edge (first bucket) is the floor of the smallest ts to the nearest 5-minute boundary; the right edge of each bucket is bucket_start + period.

Output (showing first three buckets).

ts (bucket start)	n_last_hour
2026-06-05 08:05:00	0 (events from 08:05 to 09:05)
...	...
2026-06-05 09:00:00	2
2026-06-05 09:50:00	3

Rule of thumb. Hopping is the right answer when the spec contains both a window length ("last hour") and a refresh cadence ("every 5 minutes"). If the lengths are equal, you have tumbling; if period > every, you have hopping.

Worked example — session: 30-minute inactivity gap

Detailed explanation. A web analytics team defines a "user session" as a sequence of events with no gap longer than 30 minutes. Two events 31 minutes apart belong to different sessions; two events 29 minutes apart belong to the same.

Question. Given a clicks DataFrame with user_id and ts, assign a session_id such that consecutive events within 30 minutes share an id.

Input.

user_id	ts
1	2026-06-05 09:00:00
1	2026-06-05 09:15:00
1	2026-06-05 09:50:00
1	2026-06-05 10:45:00
2	2026-06-05 09:05:00

Code.

import polars as pl

df = pl.DataFrame({
    "user_id": [1, 1, 1, 1, 2],
    "ts": ["2026-06-05 09:00:00", "2026-06-05 09:15:00",
           "2026-06-05 09:50:00", "2026-06-05 10:45:00",
           "2026-06-05 09:05:00"],
}).with_columns(pl.col("ts").str.to_datetime())

out = (df.sort(["user_id", "ts"])
         .with_columns([
             (pl.col("ts").diff().over("user_id")
              > pl.duration(minutes=30)).fill_null(True).alias("new_session"),
         ])
         .with_columns(
             pl.col("new_session").cum_sum().over("user_id").alias("session_id"),
         ))

Step-by-step explanation.

Sort by (user_id, ts) so the diff is sensible within each user.
pl.col("ts").diff().over("user_id") computes the inter-event gap within each user; the first event per user has a NULL diff.
> pl.duration(minutes=30) produces a boolean: TRUE when the gap exceeds 30 minutes (new session boundary), FALSE otherwise. fill_null(True) treats each user's first event as a new session start.
cum_sum().over("user_id") assigns 1 to the first event, 2 to the next session-start, and so on — giving a contiguous session_id per user.

Output.

user_id	ts	new_session	session_id
1	09:00:00	TRUE	1
1	09:15:00	FALSE	1
1	09:50:00	FALSE	1
1	10:45:00	TRUE	2
2	09:05:00	TRUE	1

Rule of thumb. Session windows always reduce to "boolean flag + cumsum." Once you have the session_id, downstream aggregates (session length, session revenue) are plain group-bys.

Worked example — event-time vs processing-time

Detailed explanation. A late-arriving event from a mobile client lands in the pipeline 90 minutes after it actually happened. Bucketing by processing-time puts it in the wrong hour; bucketing by event-time puts it where it semantically belongs. The difference matters for revenue accounting, anomaly detection, and any user-visible metric.

Question. Given two columns event_ts and processing_ts, show how the hourly count differs when you group by each.

Input.

event_ts	processing_ts
2026-06-05 09:30	2026-06-05 09:31
2026-06-05 09:45	2026-06-05 09:46
2026-06-05 09:55	2026-06-05 11:25 (late!)

Code.

out_event = (df.group_by_dynamic("event_ts", every="1h")
               .agg(pl.len().alias("n"))
               .sort("event_ts"))

out_proc  = (df.group_by_dynamic("processing_ts", every="1h")
               .agg(pl.len().alias("n"))
               .sort("processing_ts"))

Step-by-step explanation.

event_ts reflects when the user actually clicked. Every event belongs to the 09:00 hour.
processing_ts reflects when the pipeline saw the event. The late one lands in the 11:00 hour even though it was generated at 09:55.
Bucketing by event-time gives 09:00 → 3 events; bucketing by processing-time gives 09:00 → 2 and 11:00 → 1 — a misleading view of "what happened."
The right answer for user-visible metrics is event-time, with a watermark policy for closing late-arriving buckets.

Output (event-time):

event_ts (hour)	n
09:00	3

Output (processing-time):

processing_ts (hour)	n
09:00	2
11:00	1

Rule of thumb. Always default to event-time for correctness. Use processing-time only when the question is operational ("how loaded was the ingest pipeline this hour?"), not analytical.

Python interview question on session detection

A senior interviewer might frame this as: "Given a clicks DataFrame with user_id and ts, define a user session as consecutive events within 30 minutes. Return a DataFrame of (user_id, session_id, session_start, session_end, n_events)."

Solution Using `diff + cum_sum` for session_id then `group_by`

import polars as pl

def sessionise(df: pl.DataFrame, gap_minutes: int = 30) -> pl.DataFrame:
    s = (df.sort(["user_id", "ts"])
           .with_columns(
               (pl.col("ts").diff().over("user_id")
                > pl.duration(minutes=gap_minutes))
               .fill_null(True).alias("new_sess"))
           .with_columns(
               pl.col("new_sess").cum_sum().over("user_id").alias("session_id")))
    return (s.group_by(["user_id", "session_id"])
              .agg([pl.col("ts").min().alias("session_start"),
                    pl.col("ts").max().alias("session_end"),
                    pl.len().alias("n_events")])
              .sort(["user_id", "session_id"]))

Step-by-step trace.

user_id	ts	gap	new_sess	session_id
1	09:00	NULL	TRUE	1
1	09:15	15m	FALSE	1
1	09:50	35m	TRUE	2
1	10:45	55m	TRUE	3
2	09:05	NULL	TRUE	1

cum_sum over user_id increments only on the TRUE flags, producing a contiguous integer id per user that matches the gap-closure semantics.

Output:

user_id	session_id	session_start	session_end	n_events
1	1	09:00	09:15	2
1	2	09:50	09:50	1
1	3	10:45	10:45	1
2	1	09:05	09:05	1

Why this works — concept by concept:

diff().over(user_id) — computes the inter-event gap within each user. Polars handles the partition implicitly; pandas would write groupby("user_id")["ts"].diff().
Boolean flag + cum_sum — the classic session-detection idiom. Every TRUE marks a new session; the running sum assigns a contiguous id.
fill_null(True) — the first event per user has a NULL diff; treating it as True (new session) seeds the id correctly.
group_by for the summary — once session_id exists, the rest is a plain group-by with min, max, and len. No special window primitive needed.
Linear cost — diff is O(1) per row; cum_sum is O(n) over a sorted partition; group_by is O(n). Whole pipeline is O(n log n) for the initial sort, O(n) thereafter — orders of magnitude faster than a Python loop.
Cost — O(n log n) sort + O(n) for the rest. On 10 M clicks across 100 K users, this runs in seconds on a laptop with polars.

Python
Topic — sliding-window
Sliding-window problems (Python)

Practice →

Worked example — late data and watermarks

Detailed explanation. A streaming job emits hourly revenue rollups. The watermark advances as event-time progresses; once it passes 10:00, the 09:00 bucket is "closed." A 09:55 event that arrives at 11:30 is late — the job's policy determines whether it reopens the 09:00 bucket (with a re-emit) or is dropped to a side-output.

Question. Given the events below with event_ts and arrival_ts, mark which events arrived after the 1-hour-late watermark.

Input.

event_ts	arrival_ts
09:30	09:31
09:55	11:30 (late!)
10:10	10:11

Code.

from datetime import timedelta
import polars as pl

# Watermark: events more than 1 hour late are flagged
df_marked = df.with_columns(
    (pl.col("arrival_ts") - pl.col("event_ts") > pl.duration(hours=1)).alias("is_late"),
)

Step-by-step explanation.

The watermark contract: "we consider an event 'on time' if arrival_ts - event_ts <= 1 hour." Anything else is late.
The 09:30 event arrives at 09:31 — 1 minute late. is_late = FALSE.
The 09:55 event arrives at 11:30 — 1h35m late. is_late = TRUE. The 09:00 bucket has already been emitted; the policy choice is reopen-and-re-emit or drop-to-side-output.
The 10:10 event arrives at 10:11 — 1 minute late, before the 10:00 bucket closes. is_late = FALSE.

Output.

event_ts	arrival_ts	is_late
09:30	09:31	FALSE
09:55	11:30	TRUE
10:10	10:11	FALSE

Rule of thumb. State the watermark policy in the pipeline's documentation, not in a comment. Late data is a business decision (drop vs reopen vs side-output), not an engineering one — and the policy must be auditable.

Cheat sheet — Python time-math recipes

Parse ISO 8601. datetime.fromisoformat(s) — handles offsets, microseconds, and (after replacing Z with +00:00) Zulu time.
Current UTC, aware. datetime.now(timezone.utc) — never datetime.utcnow() (naive) or datetime.now() (host-local).
Attach a source zone. dt.replace(tzinfo=ZoneInfo("America/New_York")) — labels a naive wall clock without changing it.
Convert to UTC. dt.astimezone(ZoneInfo("UTC")) — preserves the instant, changes the wall clock to UTC.
Convert to display zone. dt.astimezone(ZoneInfo("America/Tokyo")) — same operation, different target zone.
Add 30 days (fixed length). dt + timedelta(days=30).
Add 1 month (calendar). dt + relativedelta(months=1) — leap-year and end-of-month aware.
Difference in seconds. (a - b).total_seconds() — never (a - b).seconds (ignores .days).
Round / floor / ceil to hour. pandas: s.dt.floor("1H"); polars: pl.col("ts").dt.truncate("1h").
Tumbling 1-hour rollup. pandas: df.resample("1H").sum(); polars: df.group_by_dynamic("ts", every="1h").agg(...).
Hopping window. polars: df.group_by_dynamic("ts", every="5m", period="1h").agg(...).
Sliding (time-based) window. pandas: df.rolling("7D").sum() on a DatetimeIndex.
Session detection. (diff > threshold).cum_sum().over("user") → contiguous session_id.
Localize tz-naive pandas Series. s.dt.tz_localize("UTC") then dt.tz_convert("America/NY") — never .tz_convert on a naive Series.
DST-safe future timestamp. anchor in UTC, add timedelta, convert at display.
Count NULL ts. (s.isna() | s.isnull()).sum() in pandas; df.select(pl.col("ts").is_null().sum()) in polars.
Ingestion contract. require ISO 8601 with offset; reject naive strings without an explicit source_zone.
Pandas object-dtype trap. never concat a tz-aware and a tz-naive Series — explicitly localize the naive one first.

Frequently asked questions

Should I use `pytz` or `zoneinfo` in 2026?

Use zoneinfo — it is part of the standard library since Python 3.9, has no pip install cost, and uses the system's IANA tzdata (or the tzdata PyPI package on Windows). pytz is deprecated as the default and survives only in legacy code. The two key differences: zoneinfo uses the fold attribute to disambiguate the fall-back hour (fold=0 for the first occurrence, fold=1 for the second), while pytz uses is_dst=True/False; zoneinfo lets you attach a zone via dt.replace(tzinfo=ZoneInfo(...)) directly, while pytz requires tz.localize(dt) to avoid the historical LMT offset trap. For new code, default to zoneinfo and migrate pytz callsites at your leisure.

Why does `datetime.utcnow()` return a naive datetime?

Because the standard library, when datetime was designed in 2002, treated tzinfo as optional and utcnow() was specified to return the wall-clock time in UTC without attaching a tzinfo — the rationale being that "you obviously already know it is UTC because you called utcnow()." In practice this convention has been a footgun for two decades because the contract is encoded in the function name, not the value, so any downstream code that lost track of the source paths will treat it as naive-local. The 2026 best practice is to call datetime.now(timezone.utc) instead — same instant, but the timezone is part of the value, and any comparison or subtraction with another aware datetime works without TypeError. New code should never use utcnow(); legacy code should wrap it with .replace(tzinfo=timezone.utc) at the boundary, ideally behind a thin helper that documents the convention.

How do I add months to a datetime if `timedelta` has no months?

Use dateutil.relativedelta: from dateutil.relativedelta import relativedelta; dt + relativedelta(months=3) is the canonical answer. relativedelta is calendar-aware — it increments the month field by 3 and clamps the day to the last valid day of the target month, handling leap years and end-of-month edge cases correctly. timedelta(days=30 * 3) is not equivalent — it advances exactly 90 days, which is one to three days different from "three calendar months later" depending on which months are involved. The reason timedelta deliberately refuses months and years is that those units have variable length (28-31 days for a month; 365 or 366 days for a year); encoding them as a fixed-duration delta would be a lie. The standard library forces you to choose between a fixed duration (timedelta) and a calendar delta (relativedelta), which is exactly the choice you want to make explicitly.

What's the difference between `pd.Timedelta` and `datetime.timedelta`?

Both represent a duration; pd.Timedelta is pandas' version with nanosecond resolution (since pandas 2.0) and vectorised operations across whole columns, while datetime.timedelta is the standard library's version with microsecond resolution and scalar-only behaviour. Functionally they interop: pd.Timedelta(timedelta(hours=1)) round-trips cleanly, and arithmetic between a pd.Timestamp and a datetime.timedelta works. The reason to prefer pd.Timedelta inside a pandas pipeline is that the operations vectorise — s + pd.Timedelta("1D") is a single C-level loop across the whole column, while a Python timedelta would force per-row Python dispatch. The reason to use plain datetime.timedelta outside pandas is that it is stdlib, dependency-free, and forces the same arithmetic discipline (total_seconds() instead of .seconds, no months/years). The polars equivalent is pl.duration(hours=1), which has the same vectorised semantics.

How do I handle DST transitions safely?

The single rule that defuses DST: store and compute in UTC; convert at display time only. UTC has no DST, so any arithmetic on UTC instants behaves consistently year-round. When you need to express a business rule in local time ("trigger at 09:00 New York every day"), convert the next trigger time into UTC and store it as a UTC instant — re-compute it from the local rule on every iteration so you never end up with a stale UTC timestamp that drifted across a DST boundary. For the spring-forward gap (02:30 does not exist), zoneinfo will silently pick the post-jump offset, so validate naive wall clocks at the ingestion layer and reject inputs that fall in the skipped band. For the fall-back ambiguity (01:30 happens twice), use fold=0 (first occurrence) by default and surface the ambiguity in a log line; if the source system records the offset separately, use the offset to choose the fold. Never use timedelta(days=1) to mean "the same wall-clock time tomorrow" in local time — that semantic requires dateutil.relativedelta(days=1) plus a local-zone-aware datetime.

Why is my pandas DataFrame's datetime column suddenly object dtype?

Almost always because a tz-aware Series and a tz-naive Series were concatenated, joined, or stacked. pandas refuses to coerce the two into a single tz-aware or tz-naive column (the contracts conflict), so it falls back to object dtype — which silently breaks every downstream dt.* accessor and most arithmetic. The diagnosis is one line: print(s.dtype) should print datetime64[ns, UTC] or datetime64[ns] — if it prints object, you have the bug. The fix is to localize every naive Series at its source (s.dt.tz_localize("UTC") or whatever the convention is) before the concat. The defensive pattern is a CI lint that scans for pd.concat calls and flags any path where the inputs have mismatched dt.tz. The same trap exists in polars but is louder: polars raises a type error on the concat instead of silently coercing, so it surfaces at the first run rather than at the dashboard render.

Practice on PipeCode

Drill the date-time practice library → for datetime / timedelta / zoneinfo reps.
Rehearse on time-series problems (Python) → for the pd.resample / polars group_by_dynamic family.
Sharpen sliding-window drills (Python) → for hopping windows and session-detection patterns.
Layer the window functions library → for SQL-side LAG / LEAD / partitioned time math.
Stack the Python language practice library → for end-to-end ingestion-pipeline reps.
For the broader surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Sharpen the Python axis with the Python for data engineering interviews course →.
For long-form schema craft, work through data modelling for DE interviews →.

Pipecode.ai is Leetcode for Data Engineering — every time-math recipe above ships with hands-on practice rooms where you write the UTC normalisation funnel, the `group_by_dynamic` rollup, and the session-detection `diff + cum_sum` against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your fix to a DST off-by-one actually behaves the same on pandas as on polars.

Practice date-time problems now →
Sliding-window drills →

1. Why time math is the silent bug magnet in data pipelines

Time bugs are not edge cases — they are the predictable consequence of mixing three timestamp states and pretending the calendar is a simple number line

Worked example — the four object types in one shell session

Worked example — a churn pipeline that drifted by one day per quarter

Worked example — timedelta has no months, and why that is correct

Python interview question on a drifting time-window pipeline

Solution Using a UTC-anchored, explicit-contract pattern

2. timedelta vs datetime — the object anatomy every DE must know

python timedelta is a duration; datetime.datetime is a point in time — keeping the two types straight closes 80% of time-math bugs

Worked example — parsing, adding, subtracting across the four types

Worked example — total_seconds() is the only safe comparison

Worked example — the date + timedelta(hours=…) trap

Python interview question on duration math correctness

Solution Using timedelta.total_seconds() and a median helper

3. Time zones done right — zoneinfo, pytz, UTC normalization

zoneinfo is the modern answer; pytz is the legacy survivor — and "store in UTC, display in local" is the single rule that defuses 90% of DST bugs

Worked example — attaching zoneinfo and converting to UTC

Worked example — DST spring forward (skipped hour)

Worked example — DST fall back (ambiguous hour)

Worked example — datetime.utcnow() is naive (and dangerous)

Python interview question on ingesting mixed naive and aware timestamps

Solution Using a single normalisation funnel with zoneinfo

4. pandas Timedelta vs polars Duration — vectorized time math

Both engines speak the same time-math language — the choice between them is one of scale, ergonomics, and ecosystem, not correctness

Worked example — same 7-day rolling revenue in pandas and polars

Worked example — tz_localize then tz_convert (pandas)

Worked example — floor to hour, then group (polars)

Python interview question on a vectorised time-series rollup

Solution Using group_by_dynamic in polars and groupby + resample in pandas

5. Sliding, tumbling & session windows — the time-window patterns

Three window shapes cover ~95% of streaming-style DE problems — and the only thing separating "easy" from "interview-hard" is whether you can name the gap semantic

Worked example — tumbling: hourly event count

Worked example — hopping: 1-hour window every 5 minutes

Worked example — session: 30-minute inactivity gap

Worked example — event-time vs processing-time

Python interview question on session detection

Solution Using diff + cum_sum for session_id then group_by

Worked example — late data and watermarks

Cheat sheet — Python time-math recipes

Frequently asked questions

Should I use pytz or zoneinfo in 2026?

Why does datetime.utcnow() return a naive datetime?

How do I add months to a datetime if timedelta has no months?

What's the difference between pd.Timedelta and datetime.timedelta?

How do I handle DST transitions safely?

Why is my pandas DataFrame's datetime column suddenly object dtype?

Practice on PipeCode

`python timedelta` is a duration; `datetime.datetime` is a point in time — keeping the two types straight closes 80% of time-math bugs

Solution Using `timedelta.total_seconds()` and a median helper

`zoneinfo` is the modern answer; `pytz` is the legacy survivor — and "store in UTC, display in local" is the single rule that defuses 90% of DST bugs

Worked example — `datetime.utcnow()` is naive (and dangerous)

Solution Using a single normalisation funnel with `zoneinfo`

Solution Using `group_by_dynamic` in polars and `groupby + resample` in pandas

Solution Using `diff + cum_sum` for session_id then `group_by`

Should I use `pytz` or `zoneinfo` in 2026?

Why does `datetime.utcnow()` return a naive datetime?

How do I add months to a datetime if `timedelta` has no months?

What's the difference between `pd.Timedelta` and `datetime.timedelta`?