python timedelta sounds like a trivia question — until the on-call pager fires at 02:30 because a "+24 hours" calculation crossed a DST boundary and silently re-processed yesterday's revenue. Time math is the single largest category of silent correctness bugs in analytics pipelines, and the only durable defence is to learn the four primitives the standard library actually ships, the three states a timestamp can live in (naive, aware-local, UTC), and the two ergonomics layers (pandas, polars) that wrap them for vectorised work.
This guide is the cheat sheet you wished existed the first time datetime + timedelta(days=1) produced a row that was one hour off in production. It walks through the object anatomy of datetime / date / time / timedelta, the modern zoneinfo story versus the legacy pytz API, the UTC-normalisation contract every warehouse-bound pipeline must enforce, the pandas-versus-polars trade-off for vectorised time arithmetic, and the three time-window patterns — tumbling, hopping, and session — that cover ~95% of streaming-style DE problems. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the date-time practice library →, rehearse on time-series problems →, and stack the windowing muscles with sliding-window drills →.
On this page
- Why time math is the silent bug magnet
- timedelta vs datetime — the object anatomy
- Time zones done right — zoneinfo, pytz, UTC
- pandas Timedelta vs polars Duration
- Sliding, tumbling & session windows
- Cheat sheet — Python time-math recipes
- Frequently asked questions
- Practice on PipeCode
1. Why time math is the silent bug magnet in data pipelines
Time bugs are not edge cases — they are the predictable consequence of mixing three timestamp states and pretending the calendar is a simple number line
The one-sentence invariant: every timestamp in your pipeline lives in exactly one of three states — naive (no timezone), aware-local (wall clock plus IANA zone), or UTC — and every silent time bug is caused by an implicit, undeclared conversion between two of those states. Once you treat the state transitions as explicit contracts, the entire family of "the report is off by one hour" tickets becomes a small set of mechanical checks instead of a guessing game.
The three classes of time bugs.
-
Naive-mixing bugs. Two timestamps are subtracted; one has
tzinfo, the other does not. Python raisesTypeError: can't subtract offset-naive and offset-aware datetimesif you are lucky, and silently produces a wrong answer (via.replace(tzinfo=None)) if you are not. -
DST jump bugs. A pipeline adds
timedelta(days=1)to a local-time anchor and lands one hour off on March or November weekends. The arithmetic is correct; the semantics of "the same wall clock time tomorrow" cannot be expressed as a fixed-length duration. -
Off-by-one window bugs. A "last 24 hours" window is computed using
<= todayand>= today - 1 day, but the comparison is against adatewhile the data carries adatetime. The window silently includes 25 hours of data (or 23, depending on which end is open).
Why "just add 24 hours" breaks during DST transitions.
Two days a year in every DST-observing jurisdiction, a wall clock loses (or gains) an hour. The naive arithmetic dt + timedelta(hours=24) always advances the underlying instant by exactly 24 hours — which is not the same as "the same wall-clock time tomorrow." On the spring-forward day, that means 02:30 + 24 hours is 03:30 the next day, not 02:30. If the next-day batch keys on "process all rows where ts is between 02:30 yesterday and 02:30 today," the batch under-counts by one hour. The fix is to anchor the math in UTC, advance there, and convert back at display time — or, when "same wall clock tomorrow" is the literal business contract, use dateutil.relativedelta(days=1) which is calendar-aware.
The cost of mixing aware and naive datetimes.
A pipeline that stores created_at as UTC-aware in the warehouse and writes a downstream report that compares it against datetime.utcnow() (which is naive in the standard library) will raise TypeError on every join. Teams "fix" this by calling .replace(tzinfo=None) on the warehouse column — which strips the contract and silently lies about the timestamp. Six months later, an analyst plots "events per hour by region" and discovers every region was actually using UTC because the strip happened before the region-based timezone conversion.
The 4 primitives every Python DE must master.
-
datetime.datetime— the workhorse: year, month, day, hour, minute, second, microsecond, plus optionaltzinfo. Represents a point in time. Supports arithmetic withtimedelta. -
datetime.date— calendar date only: year, month, day. No time, no timezone. Useful for partition keys, billing periods, and report dates. -
datetime.time— clock time only: hour, minute, second, microsecond, optionaltzinfo. Rarely useful alone — almost always you want a fulldatetime. -
datetime.timedelta— a duration:days,seconds,microseconds. Notably, nothing larger than days — no months, no years (because calendar units have variable length).
What interviewers listen for.
- Do you say "store in UTC, display in local TZ" the moment a timestamp question lands? — required answer.
- Do you reach for
zoneinfo.ZoneInfobeforepytz.timezonewhen asked to attach a zone? — modern-stack signal. - Do you mention that
datetime.utcnow()returns a naive datetime as a known footgun? — senior signal. - Do you distinguish tumbling, hopping, and session windows by name when asked to bucket events? — streaming-DE signal.
The 2026 reality.
-
zoneinfo(Python 3.9+) is the modern, standard-library answer for IANA time zones.pytzis deprecated as the default but still ubiquitous in legacy code. -
pandas2.x uses nanosecond-resolutionTimestampandTimedeltaby default;polarsmatches withpl.Datetimeandpl.Durationand adds time-awaregroup_by_dynamicfor tumbling and hopping windows out of the box. -
dateutil.relativedeltais the canonical calendar-aware delta — the only safe way to add "one month" or "one year" without writing your own calendar arithmetic.
Worked example — the four object types in one shell session
Detailed explanation. The fastest way to internalise the object model is to call type() on every intermediate value in a short arithmetic sequence. Interviewers love asking "what is the type of dt - dt?" because juniors say "a number" and seniors say "a timedelta."
Question. Show that datetime - datetime is a timedelta, that datetime + timedelta is a datetime, and that date + timedelta(days=1) is a date — but date + timedelta(seconds=3600) is also a date (it silently ignores the sub-day component).
Input. No table — just two literal datetimes.
Code.
from datetime import datetime, date, timedelta
a = datetime(2026, 6, 5, 10, 0, 0)
b = datetime(2026, 6, 4, 22, 0, 0)
delta = a - b # timedelta
print(type(delta), delta) # <class 'datetime.timedelta'> 0:12:00:00
next_event = a + timedelta(hours=6)
print(type(next_event), next_event) # <class 'datetime.datetime'> 2026-06-05 16:00:00
d = date(2026, 6, 5)
print(d + timedelta(days=1)) # 2026-06-06 (a date)
print(d + timedelta(seconds=3600)) # 2026-06-05 (still a date — hours dropped!)
Step-by-step explanation.
-
a - bproduces atimedeltaof 12 hours. Python expresses that internally as(days=0, seconds=43200, microseconds=0). -
a + timedelta(hours=6)advances the point in time by six hours; the result is adatetime, not atimedelta. -
date + timedelta(days=1)advances the calendar by one day; the result is adate. -
date + timedelta(seconds=3600)is the surprising case:datehas no sub-day component, so thesecondspart of the delta is silently rounded toward zero. The result is still2026-06-05. This is a frequent off-by-one trap when a function expects adatetimeand receives adate.
Output.
| Expression | Type | Value |
|---|---|---|
a - b |
timedelta |
0:12:00 |
a + timedelta(hours=6) |
datetime |
2026-06-05 16:00:00 |
date + timedelta(days=1) |
date |
2026-06-06 |
date + timedelta(seconds=3600) |
date |
2026-06-05 |
Rule of thumb. When the question involves hours, minutes, or any sub-day arithmetic, work in datetime, not date. If your function signature accepts "a timestamp," type-annotate it as datetime and convert at the boundary — never let a date sneak into sub-day math.
Worked example — a churn pipeline that drifted by one day per quarter
Detailed explanation. A SaaS team computes "users active in the last 90 days" by subtracting 90 days from today and filtering events. The pipeline runs in America/New_York, the events are stored in UTC, and today is date.today() — which uses the system timezone of the box (the prod box is UTC, but the analyst's laptop is Eastern). Over a quarter, the report drifted by one day; nobody could reproduce it locally because each environment used a different today.
Question. Rewrite the "active in the last 90 days" computation so it is fully UTC-anchored and reproducible across boxes regardless of the system timezone.
Input. Conceptual — an events table with event_ts in UTC.
Code.
from datetime import datetime, timedelta, timezone
# BROKEN — today() uses the system timezone of the host
def cutoff_broken():
from datetime import date
return date.today() - timedelta(days=90) # date in host TZ
# CORRECT — anchor in UTC explicitly, then take the date
def cutoff_correct():
now_utc = datetime.now(timezone.utc)
return (now_utc - timedelta(days=90)).date()
Step-by-step explanation.
-
date.today()calls the OS for the wall-clock date in the host's local timezone. On a UTC-configured box at 01:00 UTC, this returns2026-06-05; on an Eastern-configured analyst laptop at the same instant (which is 20:00 EDT on June 4), it returns2026-06-04. Same instant, differenttoday. -
datetime.now(timezone.utc)returns the current UTC instant as a tz-aware datetime. Subtracting 90 days produces a UTC instant 90 days earlier; calling.date()gives the UTC calendar date — identical on every host. - The cutoff is now a host-independent contract: the report uses the UTC calendar 90 days ago, regardless of where it runs.
- Document the contract in the report header so downstream consumers know the report is anchored in UTC, not local time. The choice is a policy decision; the goal is to make it explicit.
Output.
| Function | Run on UTC host | Run on Eastern host |
|---|---|---|
cutoff_broken() |
2026-03-07 |
2026-03-06 |
cutoff_correct() |
2026-03-07 |
2026-03-07 |
Rule of thumb. Anywhere a report uses today as part of a calculation, write it as datetime.now(timezone.utc).date() — never date.today() — and put a one-line comment naming UTC. The cost is two extra words; the savings is one quarter of debugging at year-end.
Worked example — timedelta has no months, and why that is correct
Detailed explanation. A new analyst writes dt + timedelta(months=1) and gets TypeError: 'months' is an invalid keyword argument. They reach for timedelta(days=30) as the "obvious" workaround — and silently land one to three days off depending on which month it is. The fix is dateutil.relativedelta.
Question. Add one month to 2026-01-31. Show why timedelta(days=30) is wrong and how relativedelta(months=1) is right.
Input. A single literal anchor: 2026-01-31.
Code.
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
anchor = datetime(2026, 1, 31)
# BROKEN — "30 days" is the average month, not the actual one
naive_plus_month = anchor + timedelta(days=30)
print(naive_plus_month) # 2026-03-02 (wrong — landed in March)
# CORRECT — calendar-aware addition
correct_plus_month = anchor + relativedelta(months=1)
print(correct_plus_month) # 2026-02-28 (clamped to end of Feb)
Step-by-step explanation.
-
timedelta(days=30)is a fixed-length duration of 30 days. Adding it to January 31 advances exactly 30 days, landing on March 2 — past the entire month of February. -
relativedelta(months=1)is a calendar-aware delta. It increments the month field by 1, then clamps the day to the last valid day of the target month. January 31 plus one month yields February 28 (or 29 in a leap year). - The same logic applies to "one year ago":
relativedelta(years=1)handles leap-year edge cases (Feb 29 + 1 year → Feb 28). -
timedeltadeliberately rejectsmonthsandyearsbecause their length is not constant. The standard library forces you to be explicit about which semantic you want.
Output.
| Expression | Result |
|---|---|
2026-01-31 + timedelta(days=30) |
2026-03-02 |
2026-01-31 + relativedelta(months=1) |
2026-02-28 |
Rule of thumb. The moment a business spec says "month" or "year," reach for dateutil.relativedelta. timedelta is only safe for fixed-length durations: hours, minutes, seconds, microseconds, and "exactly N days."
Python interview question on a drifting time-window pipeline
A senior interviewer often opens with: "Walk me through every place a time-window pipeline can silently produce a wrong row count — naive timestamps, DST jumps, host timezone, calendar deltas, off-by-one window edges — and show how you would catch each one in a code review." It blends the three classes of time bugs and the four time primitives into a single audit.
Solution Using a UTC-anchored, explicit-contract pattern
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo
from dateutil.relativedelta import relativedelta
# 1) Always anchor "now" in UTC, never the host
now_utc = datetime.now(timezone.utc)
# 2) Build the window in UTC with explicit timedeltas
window_end = now_utc
window_start = window_end - timedelta(days=7)
# 3) For "one month ago," use relativedelta — never timedelta(days=30)
month_ago = now_utc - relativedelta(months=1)
# 4) Convert to local only at display, never at filter
display_tz = ZoneInfo("America/New_York")
display_end = window_end.astimezone(display_tz)
display_start = window_start.astimezone(display_tz)
Step-by-step trace.
| Step | Value at 2026-06-05 09:00 UTC |
|---|---|
now_utc |
2026-06-05 09:00:00+00:00 |
window_end |
2026-06-05 09:00:00+00:00 |
window_start |
2026-05-29 09:00:00+00:00 |
month_ago |
2026-05-05 09:00:00+00:00 |
display_end (EDT) |
2026-06-05 05:00:00-04:00 |
display_start (EDT) |
2026-05-29 05:00:00-04:00 |
After the audit, the team rewrites every "now" call to datetime.now(timezone.utc) and adds a CI check that fails any PR that introduces a bare datetime.now() without an argument.
Output:
| Metric | Value |
|---|---|
| window length (hours) | 168 |
| month_ago day-of-month preserved | yes |
| display TZ offset (EDT) | -04:00 |
| host-dependent calls remaining | 0 |
Why this works — concept by concept:
- UTC as the storage and arithmetic layer — every comparison, subtraction, and window boundary is computed in UTC. Local timezones only enter at display time, when the human-facing TZ is known.
-
datetime.now(timezone.utc)returns aware — unlikedatetime.utcnow()(naive) ordatetime.now()(host-local), this idiom is unambiguous and self-documenting. -
timedeltafor fixed-length durations — seven days, 168 hours, 60 minutes — all safe. The library refuses to encode variable-length units, which forces the engineer to make a choice. -
relativedeltafor calendar arithmetic — months, years, day-of-week math. Calendar-aware, leap-year aware, end-of-month aware. -
astimezone()for display — the conversion preserves the underlying instant and changes only the wall-clock representation. The audit forbids.replace(tzinfo=…)for conversion because that changes the instant. - Cost — every operation is O(1) scalar; no I/O, no allocations beyond the new datetime object. Cheap insurance against an entire family of silent bugs.
Python
Topic — date-time
Date-time problems (Python)
2. timedelta vs datetime — the object anatomy every DE must know
python timedelta is a duration; datetime.datetime is a point in time — keeping the two types straight closes 80% of time-math bugs
The mental model in one line: datetime answers when; timedelta answers how long — and every legal arithmetic combination produces one of those two shapes, never a raw number. Once you commit to "duration vs instant" as the type discipline, the entire arithmetic surface of datetime - datetime, datetime + timedelta, and timedelta * 2 becomes a small finite set of rules you can carry in your head.
The four objects in one table.
| Object | Fields | Tz-aware? | Purpose |
|---|---|---|---|
datetime.datetime |
year, month, day, hour, minute, second, microsecond, tzinfo | yes (optional) | a point in time |
datetime.date |
year, month, day | no | a calendar date |
datetime.time |
hour, minute, second, microsecond, tzinfo | yes (optional) | a clock time |
datetime.timedelta |
days, seconds, microseconds | n/a | a duration |
The arithmetic rules.
-
datetime - datetime→timedelta. Both must be tz-aware or both must be naive — mixing raisesTypeError. -
datetime + timedelta→datetime. The result preserves the timezone of the input. -
timedelta + timedelta→timedelta. Pure duration math. -
timedelta * 3→timedelta. Scaling a duration is legal. -
datetime + datetime→TypeError. There is no semantic for "two points in time added together." -
date + timedelta(days=...)→date. Sub-day components of the delta are silently dropped.
Why timedelta has no months or years.
Months range from 28 to 31 days; years are 365 or 366. A duration must have a fixed length to behave like a number. Encoding "one month" as a fixed delta would be lying. The standard library forces you to choose between a fixed duration (timedelta) and a calendar delta (dateutil.relativedelta).
total_seconds() — the safe duration comparator.
timedelta exposes three integer fields internally — .days, .seconds (0–86399), .microseconds — but comparing two deltas by looking at one field is a bug magnet. td.total_seconds() collapses the whole duration into a single float; comparing two floats is unambiguous. Use this every time you need to threshold ("more than 30 minutes?") or rank durations.
Common interview probes on type discipline.
- "What is
datetime(2026,1,1) - datetime(2025,1,1)?" — answer:timedelta(days=365)(no tz, fixed-length). - "How do you get the number of hours in a
timedelta?" — answer:td.total_seconds() / 3600— nottd.seconds / 3600(which ignores thedaysfield). - "Why does
date + timedelta(hours=1)return the same date?" — becausedatehas no time component; thehourspart is silently truncated. - "What is the type of
datetime - timedelta?" —datetime(preserving tz).
Worked example — parsing, adding, subtracting across the four types
Detailed explanation. A pipeline ingests two ISO 8601 strings, parses them into datetime, computes the duration between them, advances by a fixed timedelta, and finally extracts only the date for partitioning. Walking through the type at each step is the cleanest way to learn the contracts.
Question. Given two ISO timestamps, parse them, compute the gap in hours, advance the earlier one by 36 hours, and extract its calendar date. Show the type of every intermediate.
Input. Two strings: "2026-06-04T10:00:00" and "2026-06-05T16:00:00".
Code.
from datetime import datetime, timedelta
s1 = "2026-06-04T10:00:00"
s2 = "2026-06-05T16:00:00"
a = datetime.fromisoformat(s1)
b = datetime.fromisoformat(s2)
gap = b - a # timedelta
gap_hours = gap.total_seconds() / 3600
forward = a + timedelta(hours=36) # datetime
partition_day = forward.date() # date
print(type(a).__name__, a)
print(type(gap).__name__, gap, gap_hours, "hours")
print(type(forward).__name__, forward)
print(type(partition_day).__name__, partition_day)
Step-by-step explanation.
-
datetime.fromisoformatparses both ISO strings into naivedatetimeobjects. (Both strings lack an offset, so both are naive — and naive-with-naive subtraction is legal.) -
b - aproduces atimedeltaof one day and six hours. -
.total_seconds() / 3600converts the duration into a float number of hours (30.0). Never use.seconds / 3600here — that would ignore the.dayspart and return 6.0. -
a + timedelta(hours=36)produces a newdatetime36 hours aftera:2026-06-05 22:00:00. -
.date()strips the time component and returns adate(2026-06-05). Useful for warehouse partition keys.
Output.
| Step | Type | Value |
|---|---|---|
parse s1
|
datetime |
2026-06-04 10:00:00 |
b - a |
timedelta |
1 day, 6:00:00 |
| gap_hours | float |
30.0 |
a + timedelta(hours=36) |
datetime |
2026-06-05 22:00:00 |
.date() |
date |
2026-06-05 |
Rule of thumb. Every time you cross a type boundary (string → datetime, datetime → date, datetime → timedelta), write the type next to the variable in a comment for the first few lines of a new pipeline. Once the contracts are clear, the comments can go — but they save reviewers from guessing.
Worked example — total_seconds() is the only safe comparison
Detailed explanation. Two timedelta instances compare correctly with < and > because Python normalises them — but the moment you start displaying them or thresholding them, attribute access on .seconds or .days becomes a trap. total_seconds() is unambiguous.
Question. Given a stream of API-call durations as timedelta objects, write a function that returns the share of calls slower than 250ms. Show the wrong-attribute trap and the safe form.
Input. A list of timedelta durations (mix of sub-second and multi-second).
Code.
from datetime import timedelta
durations = [
timedelta(milliseconds=120),
timedelta(milliseconds=300),
timedelta(seconds=2, milliseconds=50),
timedelta(microseconds=99_000), # 99 ms
]
threshold = timedelta(milliseconds=250)
# BROKEN — .seconds ignores microseconds < 1s and the .days field
slow_broken = sum(1 for d in durations if d.seconds > 0) # only multi-second
# CORRECT — total_seconds() is the single source of truth
slow_correct = sum(1 for d in durations if d.total_seconds() > threshold.total_seconds())
print(slow_broken, slow_correct, len(durations))
Step-by-step explanation.
-
d.secondsreturns the seconds part of the internal representation in the range[0, 86400). A duration of 120 milliseconds hasseconds=0andmicroseconds=120_000. Sod.seconds > 0is FALSE for the sub-second durations — the broken filter undercounts. -
d.total_seconds()returns the whole duration as a float:0.12, 0.3, 2.05, 0.099. Comparing each against0.25correctly classifies two as "slow" (300ms and 2050ms). - The same trap applies to
.days— atimedeltaof 36 hours has.days = 1, not1.5. Always go throughtotal_seconds()(or for very long deltas, divide by 86400 for days as a float). - The Python operators
<and>between twotimedeltainstances do compare correctly because the dunder methods normalise; the danger only appears when one side is an attribute access.
Output.
| Method | Slow count |
|---|---|
.seconds > 0 (broken) |
1 |
.total_seconds() > 0.25 (correct) |
2 |
Rule of thumb. Treat .days, .seconds, .microseconds as implementation details of timedelta. The public API for comparison and conversion is total_seconds(). The cost is one method call per row; the savings is "your threshold function is not silently wrong on millisecond data."
Worked example — the date + timedelta(hours=…) trap
Detailed explanation. A common helper takes a date and an hour offset to land on "noon next Friday." If the caller accidentally passes a date (instead of a datetime) and a timedelta(hours=12), the helper silently returns the same date — because date + timedelta ignores sub-day components.
Question. Show that date(2026, 6, 5) + timedelta(hours=12) is the same date, and rewrite the helper so it accepts a date and produces a datetime at the requested local hour.
Input. A single date and an integer hour.
Code.
from datetime import date, datetime, timedelta, time
d = date(2026, 6, 5)
# Silent bug — sub-day component dropped, result is still a date at 00:00
result_bug = d + timedelta(hours=12)
print(type(result_bug).__name__, result_bug) # date 2026-06-05
# Correct — combine the date with a time(hour=12) to produce a datetime
result_ok = datetime.combine(d, time(hour=12))
print(type(result_ok).__name__, result_ok) # datetime 2026-06-05 12:00:00
Step-by-step explanation.
-
date + timedelta(hours=12): thetimedeltahas.days = 0and.seconds = 43200. Adding it to adateadvances the.dayspart by zero — the.secondspart is silently dropped becausedatecannot represent hours. - The bug:
result_bugis2026-06-05(the same date), not2026-06-05 12:00:00. The caller asked for "noon today" and got "today" — type-wise still a date, with no time component. - The fix:
datetime.combine(date, time)is the canonical way to lift adateinto adatetimeat a specific clock time. The result is a properdatetimeand downstream arithmetic in hours works. - Optionally annotate the helper with
def at_hour(d: date, hour: int) -> datetime:and callcombineinternally; the type hint blocks accidentaldate-in /date-out pairs at code-review time.
Output.
| Expression | Type | Value |
|---|---|---|
d + timedelta(hours=12) |
date |
2026-06-05 |
datetime.combine(d, time(12)) |
datetime |
2026-06-05 12:00:00 |
Rule of thumb. If a helper signature accepts a date but the body does sub-day math, the body should call datetime.combine(d, time(hour=...)) at the very top — or the signature should accept a datetime instead. Type confusion is the bug.
Python interview question on duration math correctness
A senior interviewer might frame this as: "Given a sequence of session start and end timestamps, compute the median session length in seconds. Your function must handle empty input and naive-versus-aware mixing safely." This is a probe on total_seconds(), type discipline, and edge-case handling.
Solution Using timedelta.total_seconds() and a median helper
from datetime import datetime, timedelta
from statistics import median
def median_session_seconds(sessions: list[tuple[datetime, datetime]]) -> float | None:
"""Return median session length in seconds, or None on empty input."""
if not sessions:
return None
durations: list[float] = []
for start, end in sessions:
if (start.tzinfo is None) != (end.tzinfo is None):
raise ValueError("Mixed naive and aware datetimes in same session")
delta: timedelta = end - start
durations.append(delta.total_seconds())
return median(durations)
Step-by-step trace.
| start | end | delta | total_seconds |
|---|---|---|---|
| 09:00 | 09:30 | 0:30:00 | 1800.0 |
| 09:05 | 10:35 | 1:30:00 | 5400.0 |
| 10:00 | 10:02 | 0:02:00 | 120.0 |
Sorted: [120.0, 1800.0, 5400.0]. Median = the middle element = 1800.0.
Output:
| Metric | Value |
|---|---|
| sessions | 3 |
| median_session_seconds | 1800.0 |
Why this works — concept by concept:
-
end - startyields a timedelta — leveraging the arithmetic contract; no manual second math, no error-prone field access. -
total_seconds()is unit-safe — produces a float in seconds. Multiplying by 1/60 gives minutes, by 1/3600 gives hours — no ambiguity. -
Naive-vs-aware guard — the explicit check raises a clear error on the boundary case instead of relying on Python's
TypeErrorpropagation, which is harder to debug at the call site. -
statistics.medianfor the aggregate — pure-Python, sorted-list median, O(n log n). For large inputs, switch to numpy'snp.median(O(n) average via partial sort). -
Noneon empty input — explicit "no data" sentinel that the caller can pattern-match; better than raising or returning0.0(which would look like a valid metric). - Cost — O(n) for the delta loop, O(n log n) for the median. Memory is O(n) for the durations list; for streaming, switch to a reservoir or a running-quantile sketch.
Python
Topic — time-series
Time-series problems (Python)
3. Time zones done right — zoneinfo, pytz, UTC normalization
zoneinfo is the modern answer; pytz is the legacy survivor — and "store in UTC, display in local" is the single rule that defuses 90% of DST bugs
The mental model in one line: every timestamp in your pipeline must be either UTC-aware in storage or local-aware at display — and the conversion happens at exactly one boundary, never in the middle. Once you internalise the layered architecture (ingest → normalise → store → convert at display), the entire family of "the report says 23:00 but the dashboard says 00:00 next day" tickets stops appearing.
Naive vs aware in one paragraph.
A datetime is naive if its tzinfo is None, and aware otherwise. Naive datetimes are "wall clock with no contract" — they could be UTC, local, or anything in between, and the only way to know is to read the surrounding code (or a comment that may or may not be accurate). Aware datetimes carry a tzinfo (typically a ZoneInfo instance) that encodes the IANA zone and lets Python do the right thing on DST transitions. The golden rule: once a timestamp crosses into your codebase, attach a tzinfo immediately, and never let a naive one back out.
The golden rule, expanded.
-
Store in UTC. Warehouse columns should be
TIMESTAMPTZ(Postgres / Snowflake / BigQuery) orTIMESTAMP_TZwith an explicit UTC normalisation at ingest. -
Display in local. When a human reads the value, convert to their local zone via
dt.astimezone(user_tz). The conversion happens at the renderer, never in the warehouse. - Never mix. A query that joins a UTC column to a local-time column without a conversion is always a bug — even if it accidentally produces the right answer for the first few rows.
zoneinfo.ZoneInfo — the 2026 default.
-
Stdlib.
from zoneinfo import ZoneInfo(Python 3.9+). No external dependency, nopip install. -
IANA database. Backed by the system's tzdata or the
tzdataPyPI package on Windows. Updates with the OS. -
Construction.
ZoneInfo("America/New_York"),ZoneInfo("Asia/Tokyo"),ZoneInfo("UTC"). -
Usage. Attach via
dt.replace(tzinfo=ZoneInfo(...))only if the wall clock you have is in that zone; convert viadt.astimezone(ZoneInfo(...))to change which zone the display uses.
pytz — the legacy survivor.
-
pytz.timezone(...).localize(naive_dt)is the correct way to attach apytzzone — never usenaive_dt.replace(tzinfo=pytz.timezone(...))because that produces the historic LMT offset (e.g.-04:56for New York instead of-05:00). -
is_dstflag. During the fall-back hour,localize(naive_dt, is_dst=True)andlocalize(naive_dt, is_dst=False)disambiguate the two valid wall-clock interpretations. -
astimezoneworks the same as inzoneinfoonce the timestamp is properly attached. -
Why it survives. Older codebases still depend on it, and migration is gradual — but for new code, use
zoneinfo.
The conversion flow in detail.
- Parse the input string to a naive or partially-aware datetime.
- Attach the source timezone (the zone the wall clock represents).
- Convert to UTC for storage and arithmetic.
-
Store in a
TIMESTAMPTZcolumn or as an ISO 8601 string with offset. -
On read, convert back to the user's local zone with
astimezone(user_tz)before display.
DST traps.
-
Spring forward (skipped hour). On the second Sunday of March in US Eastern, 02:00 → 03:00 — the clock skips an hour.
02:30does not exist on that day.zoneinforaisesNonExistentTimeErrorwhen you try to construct an aware datetime at that wall time;pytzrequiresis_dsthandling. -
Fall back (ambiguous hour). On the first Sunday of November in US Eastern, 02:00 → 01:00 — the clock repeats an hour.
01:30exists twice.zoneinfouses thefoldattribute (fold=0is the first occurrence,fold=1is the second);pytzusesis_dst=True/False.
Why ISO 8601 with offset is the only safe wire format.
A timestamp on the wire is just a string — and the only string format that survives every system, library, and human reading is ISO 8601 with an explicit offset (2026-06-05T09:00:00+00:00 or 2026-06-05T09:00:00Z). Stripping the offset is the single most common cause of "wait, why does this look like 5am instead of 9am?" bugs.
Worked example — attaching zoneinfo and converting to UTC
Detailed explanation. A CSV from a vendor contains timestamps in America/New_York local time, with no offset attached. The pipeline must parse, attach the source zone, and normalise to UTC for warehouse storage.
Question. Given the string "2026-06-05 09:00:00" representing 9 AM in New York, attach the right tzinfo and convert to UTC.
Input. A single naive timestamp string "2026-06-05 09:00:00" known to be New York wall-clock.
Code.
from datetime import datetime
from zoneinfo import ZoneInfo
raw = "2026-06-05 09:00:00"
naive = datetime.fromisoformat(raw)
# Attach the source zone (does NOT change the wall clock)
ny_aware = naive.replace(tzinfo=ZoneInfo("America/New_York"))
print(ny_aware) # 2026-06-05 09:00:00-04:00
# Convert to UTC for storage (changes the wall clock to UTC)
utc = ny_aware.astimezone(ZoneInfo("UTC"))
print(utc) # 2026-06-05 13:00:00+00:00
Step-by-step explanation.
-
datetime.fromisoformatparses the raw string into a naive datetime:2026-06-05 09:00:00with no offset. -
naive.replace(tzinfo=ZoneInfo("America/New_York"))attaches the timezone without changing the wall clock. The display now reads09:00:00-04:00— the same 9am, but now Python knows it represents Eastern Daylight Time (UTC-4 in June). -
.astimezone(ZoneInfo("UTC"))converts to a different zone, preserving the underlying instant and changing the wall clock. 09:00 EDT is 13:00 UTC. The result is fully aware and safe to store. - The two operations are distinct:
replacesays "this wall clock was in zone X";astimezonesays "what is this instant's wall clock in zone Y?" Never usereplaceto convert.
Output.
| Step | Display |
|---|---|
| naive parse | 2026-06-05 09:00:00 |
replace(tzinfo=ZoneInfo("America/New_York")) |
2026-06-05 09:00:00-04:00 |
.astimezone(ZoneInfo("UTC")) |
2026-06-05 13:00:00+00:00 |
Rule of thumb. replace(tzinfo=…) is for labelling a naive datetime with its source zone. astimezone(…) is for converting an aware datetime to a different zone. If you confuse the two, you will silently shift every timestamp by the source-zone offset.
Worked example — DST spring forward (skipped hour)
Detailed explanation. On 2026-03-08 (the second Sunday of March), New York wall clocks jump from 01:59:59 to 03:00:00. The wall time 02:30 does not exist. Constructing an aware datetime at that wall time is a contract violation that zoneinfo surfaces immediately.
Question. Show what happens when you try to construct datetime(2026, 3, 8, 2, 30, tzinfo=ZoneInfo("America/New_York")), and how to detect and resolve it.
Input. A bad wall clock: 2026-03-08 02:30.
Code.
from datetime import datetime
from zoneinfo import ZoneInfo
ny = ZoneInfo("America/New_York")
bad = datetime(2026, 3, 8, 2, 30, tzinfo=ny)
print(bad) # zoneinfo does not raise on construction —
# it picks the next-valid offset (-04:00 = post-jump)
print(bad.utcoffset()) # -1 day, 20:00:00 → -04:00 (already past DST)
# Round-trip: convert to UTC then back. The wall clock changes!
roundtrip = bad.astimezone(ZoneInfo("UTC")).astimezone(ny)
print(roundtrip) # 2026-03-08 02:30:00-04:00 (same — already post-jump)
Step-by-step explanation.
-
zoneinfochooses the post-jump offset (-04:00) for a wall clock that falls inside the skipped hour. The constructed datetime is "consistent" but represents an instant that does not match the wall clock the user typed. - Converting to UTC and back returns the same
02:30-04:00— proving that the constructed value is internally consistent, just not what the user intended. - To detect a skipped wall clock, compare
dt.utcoffset()before and after the DST jump and look for a discontinuity, or query the IANA database for transitions in the local day. - In practice, never accept a naive wall clock that falls in the DST-skipped band. Sanitise the input: round forward to 03:00, raise an error, or surface the ambiguity to the user.
Output.
| Step | Display |
|---|---|
naive 2026-03-08 02:30 + NY |
2026-03-08 02:30:00-04:00 |
| utcoffset |
-04:00 (post-jump) |
| round-trip UTC → NY | 2026-03-08 02:30:00-04:00 |
Rule of thumb. Spring-forward wall clocks are user errors, not engine bugs. The fix is at the data-ingestion layer: validate that every naive wall clock the system accepts is a valid local time on its date, and surface the ambiguity rather than silently picking an offset.
Worked example — DST fall back (ambiguous hour)
Detailed explanation. On 2026-11-01 (the first Sunday of November), New York wall clocks repeat 01:00 → 02:00. The wall time 01:30 happens twice — once as EDT (-04:00) and once as EST (-05:00). Without the fold attribute, you have a 50/50 chance of being right.
Question. Construct both occurrences of 2026-11-01 01:30 in New York and confirm they map to different UTC instants.
Input. A repeated wall clock: 2026-11-01 01:30.
Code.
from datetime import datetime
from zoneinfo import ZoneInfo
ny = ZoneInfo("America/New_York")
first = datetime(2026, 11, 1, 1, 30, tzinfo=ny, fold=0) # the EDT occurrence
second = datetime(2026, 11, 1, 1, 30, tzinfo=ny, fold=1) # the EST occurrence
print(first.utcoffset(), first.astimezone(ZoneInfo("UTC")))
print(second.utcoffset(), second.astimezone(ZoneInfo("UTC")))
Step-by-step explanation.
-
fold=0selects the first occurrence — the EDT one with offset-04:00. Its UTC equivalent is05:30 UTC. -
fold=1selects the second occurrence — the EST one with offset-05:00. Its UTC equivalent is06:30 UTC. - The two aware datetimes are different points in time, separated by exactly one hour, even though their wall-clock display is identical.
-
pytzcallers usetz.localize(naive, is_dst=True)(EDT) oris_dst=False(EST) — the same disambiguation, different API surface.
Output.
| Variant | utcoffset | UTC equivalent |
|---|---|---|
| fold=0 (EDT) | -04:00 |
2026-11-01 05:30:00+00:00 |
| fold=1 (EST) | -05:00 |
2026-11-01 06:30:00+00:00 |
Rule of thumb. When ingesting fall-back wall clocks, default to fold=0 (the first / "before transition" interpretation) and surface the ambiguity in a log line. If the source system records the offset separately, use the offset to choose the fold — never guess.
Worked example — datetime.utcnow() is naive (and dangerous)
Detailed explanation. datetime.utcnow() returns the current UTC wall clock as a naive datetime — no tzinfo attached. Code that compares this to a UTC-aware warehouse column raises TypeError. Worse, the "fix" of attaching UTC via .replace(tzinfo=ZoneInfo("UTC")) works correctly only because the value happened to be UTC — and that contract is invisible to readers.
Question. Show the difference between datetime.utcnow() and datetime.now(timezone.utc), and the correct idiom for "current instant as a UTC-aware datetime."
Input. None — same instant under both calls.
Code.
from datetime import datetime, timezone
# BROKEN — utcnow returns naive
naive_utc = datetime.utcnow()
print(naive_utc.tzinfo) # None ← contract not encoded
# CORRECT — now(tz) returns aware
aware_utc = datetime.now(timezone.utc)
print(aware_utc.tzinfo) # datetime.timezone.utc
# Comparing the two raises
try:
aware_utc - naive_utc
except TypeError as e:
print("TypeError:", e)
Step-by-step explanation.
-
datetime.utcnow()returns the current UTC wall clock but does not attachtzinfo. Any downstream code that does not know the convention will treat it as "naive local." -
datetime.now(timezone.utc)(or equivalentlydatetime.now(ZoneInfo("UTC"))) attaches UTC to the result. The contract is encoded in the value itself, not in a comment. - Subtracting the two raises
TypeErrorbecause mixing naive and aware is illegal. This is a feature, not a bug — it forces the contract to be made explicit. - New code should prefer
datetime.now(timezone.utc); legacy code that usesutcnow()should wrap with.replace(tzinfo=timezone.utc)at the boundary, ideally with a comment naming the convention.
Output.
| Expression | Result |
|---|---|
datetime.utcnow().tzinfo |
None |
datetime.now(timezone.utc).tzinfo |
datetime.timezone.utc |
aware - naive |
TypeError |
Rule of thumb. Treat datetime.utcnow() as a deprecated API. The replacement is datetime.now(timezone.utc) — same instant, but the timezone is encoded in the value.
Python interview question on ingesting mixed naive and aware timestamps
A senior interviewer often frames this as: "Your ingestion job pulls CSVs from three vendors. Vendor A sends UTC ISO strings with Z. Vendor B sends naive local-time strings labelled America/New_York. Vendor C sends ISO strings with offsets. Write a single function that normalises every input to UTC-aware."
Solution Using a single normalisation funnel with zoneinfo
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
def normalise_to_utc(s: str, source_zone: str | None = None) -> datetime:
"""Parse a timestamp string and return a UTC-aware datetime.
Rules:
- If the string carries an offset (ISO 8601), use it.
- Else, if source_zone is given, attach it and convert to UTC.
- Else, raise — refuse to guess.
"""
dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
if dt.tzinfo is not None:
return dt.astimezone(timezone.utc)
if source_zone is None:
raise ValueError(f"Naive datetime {s!r} requires source_zone")
aware = dt.replace(tzinfo=ZoneInfo(source_zone))
return aware.astimezone(timezone.utc)
Step-by-step trace.
| Vendor | Raw | source_zone | Parsed | Result (UTC) |
|---|---|---|---|---|
| A | 2026-06-05T09:00:00Z |
n/a | aware UTC | 2026-06-05 09:00:00+00:00 |
| B | 2026-06-05 09:00:00 |
America/New_York |
naive | 2026-06-05 13:00:00+00:00 |
| C | 2026-06-05T09:00:00+05:30 |
n/a | aware IST | 2026-06-05 03:30:00+00:00 |
The function rejects any naive input without an explicit source_zone, forcing the caller to declare the convention rather than silently guessing.
Output:
| Vendor | Normalised UTC |
|---|---|
| A | 2026-06-05 09:00:00+00:00 |
| B | 2026-06-05 13:00:00+00:00 |
| C | 2026-06-05 03:30:00+00:00 |
Why this works — concept by concept:
-
One funnel, one output type — every path returns a UTC-aware datetime. No conditional types, no
Optional[tzinfo]downstream, no per-vendor handling in the pipeline body. -
fromisoformatfor ISO 8601 — handles offsets, microseconds, and theZsuffix (after the.replace("Z", "+00:00")shim). Stdlib, no extra dependency. -
Explicit
source_zonefor naive inputs — refuses to guess. Every caller declares the convention; ambiguity becomes a compile-time choice, not a runtime bug. -
astimezone(timezone.utc)for normalisation — preserves the underlying instant; changes the wall clock to UTC. The warehouse-bound value is always UTC-aware. -
No
pytzin the new code path —zoneinfois stdlib, modern, and avoids thelocalize/LMThistorical-offset trap. - Cost — O(1) per timestamp; the parse and convert are both constant-time scalar operations. The function is safe to call millions of times per ingestion run.
Python
Topic — date-time
Date-time problems (Python)
4. pandas Timedelta vs polars Duration — vectorized time math
Both engines speak the same time-math language — the choice between them is one of scale, ergonomics, and ecosystem, not correctness
The mental model in one line: pandas and polars expose the same five time primitives — datetime, duration, bucket, window, timezone — under different keywords; if you can name the primitive, you can do the operation in either engine. Once you know the rosetta-stone mapping, swapping engines becomes a syntax exercise instead of a redesign.
The pandas trio.
-
pd.Timestamp— pandas' replacement fordatetime. Nanosecond resolution by default (since pandas 2.0). Indexable, vectorisable. -
pd.Timedelta— pandas' replacement fortimedelta. Same nanosecond resolution. Accepts strings:pd.Timedelta("7D"),pd.Timedelta("1H 30min"). -
pd.DatetimeIndex— a sorted, dtyped index ofTimestamp. Backs every time-based pandas operation:resample,rolling,between_time,tz_convert.
The polars equivalents.
-
pl.Datetime— polars' datetime type. Nanosecond, microsecond, or millisecond precision (configurable per column). -
pl.Duration— polars' duration type. Same precision options. Constructed viapl.duration(days=…, hours=…)or from a string. -
group_by_dynamic— polars' time-aware group-by. Supports tumbling, hopping, and (with helper) session windows out of the box.
Resample and rolling — the pandas workhorses.
-
df.resample("1H").sum()— tumbling 1-hour buckets. The index must be aDatetimeIndex. Aliases:"1H","1D","1W-MON","15min","5T". -
df.rolling("7D").mean()— sliding 7-day window. The window is time-based when the index is datetime — it adapts to irregular timestamps instead of using fixed row counts. -
df.between_time("09:00", "17:00")— filter rows by clock-time only, ignoring date. Useful for business-hours metrics.
Bucketing operations.
-
pandas:
s.dt.floor("1H"),s.dt.ceil("15min"),s.dt.round("D"). Returns a Series. -
polars:
col("ts").dt.truncate("1h"),col("ts").dt.round("15m"). Returns an Expr. -
Use: pre-bucket a timestamp column before joining or grouping.
flooris the workhorse;roundonly for symmetric thresholds.
Time-zone aware columns.
-
pandas:
s.dt.tz_localize("UTC")attaches UTC;s.dt.tz_convert("America/NY")converts. Both must be called explicitly — aDatetimeIndexis naive by default. -
polars:
col("ts").dt.replace_time_zone("UTC")thendt.convert_time_zone("America/NY"). Same two-step ritual. - Gotcha: concatenating a tz-aware Series with a tz-naive Series in pandas silently coerces to object dtype, killing every vectorised operation downstream.
Performance — the polars edge.
-
Lazy execution.
pl.scan_csv(...).filter(...).group_by_dynamic(...).collect()builds a query plan, optimises it, and executes only the needed columns. pandas executes eagerly row-by-row from the first operation. - Multi-threaded. polars uses all CPU cores by default. pandas is single-threaded except for specific Cython kernels.
-
Real-world. On a 50 M-row datetime column with a
group_by_dynamicaggregation, polars typically runs 10–50× faster than the pandas equivalent — the gap widens with row count.
Mixing tz-aware and tz-naive in pandas — the silent dtype trap.
import pandas as pd
a = pd.Series(pd.to_datetime(["2026-06-05 09:00"]).tz_localize("UTC"))
b = pd.Series(pd.to_datetime(["2026-06-05 09:00"])) # naive
mixed = pd.concat([a, b])
print(mixed.dtype) # object ← every vectorised dt.* operation now silently broken
Common interview probes.
- "Resample vs rolling?" — resample is tumbling (non-overlapping); rolling is sliding (overlapping, time-based windows when the index is datetime).
- "How do you convert a timezone in polars?" —
col("ts").dt.convert_time_zone("America/NY")on an aware column;replace_time_zonefirst if naive. - "Why does my pandas datetime column suddenly become object dtype?" — almost always tz-aware + tz-naive concat.
- "How do you do a 5-minute hopping window with a 1-hour period in pandas?" — there is no direct primitive; you build it via
pd.Grouperplusmerge_asofor use polars'group_by_dynamic(every="5m", period="1h")directly.
Worked example — same 7-day rolling revenue in pandas and polars
Detailed explanation. A team computes 7-day rolling revenue per day from a revenue table. Writing it in both engines side by side is the cleanest way to learn the rosetta stone: same primitives, different syntax.
Question. Given an events table with ts and revenue, compute 7-day rolling revenue. Show both the pandas and the polars solution.
Input.
| ts | revenue |
|---|---|
| 2026-06-01 | 100 |
| 2026-06-02 | 150 |
| 2026-06-03 | 200 |
| 2026-06-04 | 50 |
| 2026-06-05 | 300 |
Code.
import pandas as pd
import polars as pl
# pandas — set DatetimeIndex, then rolling("7D")
pdf = pd.DataFrame({"ts": pd.to_datetime(["2026-06-01","2026-06-02","2026-06-03","2026-06-04","2026-06-05"]),
"revenue": [100, 150, 200, 50, 300]}).set_index("ts")
pdf["rev_7d"] = pdf["revenue"].rolling("7D").sum()
# polars — group_by_dynamic with every=1d period=7d
ldf = pl.DataFrame({"ts": ["2026-06-01","2026-06-02","2026-06-03","2026-06-04","2026-06-05"],
"revenue": [100, 150, 200, 50, 300]}).with_columns(pl.col("ts").str.to_date())
out = ldf.group_by_dynamic("ts", every="1d", period="7d").agg(pl.col("revenue").sum().alias("rev_7d"))
Step-by-step explanation.
- The pandas form sets a
DatetimeIndexand usesrolling("7D"). The string"7D"makes the window time-based (not row-count-based); it includes every row whose timestamp falls within 7 days of the current row's timestamp. - The polars form uses
group_by_dynamicwithevery="1d"(one bucket per day) andperiod="7d"(each bucket spans 7 days). The behaviour is equivalent: each output row covers a 7-day window ending at itsts. - Both compute the same revenue: a running window sum, anchored to each input timestamp. The arithmetic is identical; the spelling differs.
- On 1 M rows of input, polars typically runs ~10× faster because of lazy planning, multi-threading, and SIMD; on the 5-row toy example above, the difference is negligible.
Output.
| ts | revenue | rev_7d |
|---|---|---|
| 2026-06-01 | 100 | 100 |
| 2026-06-02 | 150 | 250 |
| 2026-06-03 | 200 | 450 |
| 2026-06-04 | 50 | 500 |
| 2026-06-05 | 300 | 800 |
Rule of thumb. When the team's stack already includes pandas and the volume is small, stick with rolling("7D"). When the volume exceeds ~10 M rows or the latency budget is tight, port to group_by_dynamic in polars — the rewrite is mechanical, the speedup is significant.
Worked example — tz_localize then tz_convert (pandas)
Detailed explanation. A pandas DataFrame ingested from a CSV has naive timestamps known to be in America/New_York. To store them in UTC, you must tz_localize first (attach the source zone), then tz_convert (change to UTC). Skipping the localize step raises an error.
Question. Given a pandas Timestamp column of naive New York local times, convert it to UTC-aware.
Input. Three naive timestamps representing NY local time.
Code.
import pandas as pd
s = pd.Series(pd.to_datetime([
"2026-06-05 09:00",
"2026-06-05 10:00",
"2026-06-05 11:00",
]))
print(s.dt.tz) # None — naive
# Step 1 — attach the source zone (does not change wall clock)
s_ny = s.dt.tz_localize("America/New_York")
print(s_ny.dt.tz) # America/New_York
# Step 2 — convert to UTC
s_utc = s_ny.dt.tz_convert("UTC")
print(s_utc) # 13:00, 14:00, 15:00 UTC
Step-by-step explanation.
- The raw column is naive —
dt.tzisNone. Callingdt.tz_convert(...)directly raisesTypeError: Cannot convert tz-naive Timestamps, use tz_localize to localize. -
dt.tz_localize("America/New_York")attaches the zone without changing the wall clock. The display now shows the same hours with-04:00(EDT in June). -
dt.tz_convert("UTC")converts to UTC: each timestamp's wall clock shifts forward by 4 hours, and the offset becomes+00:00. - Storing
s_utcin the warehouse (e.g. asTIMESTAMPTZ) preserves the contract end-to-end. On read, the consuming dashboard can convert back viadt.tz_convert(user_tz).
Output.
| Step | First value |
|---|---|
| naive | 2026-06-05 09:00:00 |
tz_localize("America/New_York") |
2026-06-05 09:00:00-04:00 |
tz_convert("UTC") |
2026-06-05 13:00:00+00:00 |
Rule of thumb. tz_localize answers "what zone was this naive wall clock in?"; tz_convert answers "what is this aware instant's wall clock in zone Y?" Skipping localize on a naive column is a hard error; using tz_convert to "attach" a zone is a logic bug.
Worked example — floor to hour, then group (polars)
Detailed explanation. A common ETL step is to bucket timestamps to the hour, then count events per bucket. Polars' dt.truncate is the analog of pandas' dt.floor; both round down to the nearest unit. Polars expresses the whole pipeline as a single expression chain.
Question. Given a polars DataFrame of events with sub-second timestamps, compute the count of events per hourly bucket.
Input.
| ts | event |
|---|---|
| 2026-06-05 09:03:11 | login |
| 2026-06-05 09:47:02 | login |
| 2026-06-05 10:05:00 | login |
| 2026-06-05 10:55:32 | login |
Code.
import polars as pl
from datetime import datetime
df = pl.DataFrame({
"ts": [datetime(2026,6,5,9,3,11), datetime(2026,6,5,9,47,2),
datetime(2026,6,5,10,5,0), datetime(2026,6,5,10,55,32)],
"event": ["login"] * 4,
})
out = (df
.with_columns(pl.col("ts").dt.truncate("1h").alias("hour_bucket"))
.group_by("hour_bucket")
.agg(pl.len().alias("n_events"))
.sort("hour_bucket"))
Step-by-step explanation.
-
dt.truncate("1h")rounds everytsdown to the nearest hour.09:47:02becomes09:00:00;10:55:32becomes10:00:00. -
group_by("hour_bucket")collapses identical bucket values into a single group key. -
pl.len().alias("n_events")counts rows per group. Each output row corresponds to one bucket. -
sort("hour_bucket")ensures the output is chronological.group_byis unordered in polars, so an explicit sort is required for downstream consumers that assume time order.
Output.
| hour_bucket | n_events |
|---|---|
| 2026-06-05 09:00:00 | 2 |
| 2026-06-05 10:00:00 | 2 |
Rule of thumb. Use dt.truncate("1h") (polars) or dt.floor("1H") (pandas) for tumbling buckets. Reach for group_by_dynamic (polars) or resample (pandas) when you need hopping or sliding behaviour, not just floor-then-group.
Python interview question on a vectorised time-series rollup
A senior interviewer might say: "You have a 50 M-row events table with ts (UTC) and revenue. Compute hourly revenue per region, with the result sorted by region then hour. Write it in both pandas and polars and tell me which scales better."
Solution Using group_by_dynamic in polars and groupby + resample in pandas
import polars as pl
import pandas as pd
# polars — lazy, multi-threaded
out_pl = (pl.scan_parquet("events.parquet")
.group_by_dynamic("ts", every="1h", by="region")
.agg(pl.col("revenue").sum().alias("rev"))
.sort(["region", "ts"])
.collect())
# pandas — eager
df = pd.read_parquet("events.parquet")
out_pd = (df.set_index("ts")
.groupby("region")["revenue"]
.resample("1H").sum()
.reset_index()
.rename(columns={"revenue": "rev"})
.sort_values(["region", "ts"]))
Step-by-step trace.
| Engine | Plan | Row count after group-by |
|---|---|---|
| polars (lazy) | scan → group_by_dynamic → sort | 24 × N_regions per day |
| pandas (eager) | read → set_index → groupby → resample | 24 × N_regions per day |
The polars plan is built lazily and optimised before execution; the pandas pipeline materialises after set_index and again after groupby. The arithmetic outcome is identical.
Output:
| region | ts | rev |
|---|---|---|
| EU | 2026-06-05 00:00:00 | 12 340.50 |
| EU | 2026-06-05 01:00:00 | 8 901.25 |
| US | 2026-06-05 00:00:00 | 22 105.75 |
| US | 2026-06-05 01:00:00 | 17 882.10 |
Why this works — concept by concept:
-
group_by_dynamicas the time-aware group-by — combines the bucketing (every="1h") and the group key (by="region") into one operation, planned together by the polars optimiser. -
pandas
resampleaftergroupby— each group's index is independently resampled to hourly. The two-step pipeline is more verbose but conceptually equivalent. -
Lazy execution scales —
pl.scan_parquetreads only the columns the plan needs;set_indexin pandas materialises the entire frame in memory before the resample. - Multi-threading — polars distributes the per-region aggregation across cores; pandas single-threads through one Python kernel.
-
Sorted output contract — both pipelines end with an explicit
sort. Group-by results are unordered in polars and order-by-key in pandas; downstream consumers that rely on(region, ts)order need the sort to be explicit. - Cost — polars: O(n) for the scan, O(buckets × regions) for the aggregate, multi-threaded → wall-clock scales sub-linearly. pandas: O(n) read, O(n) groupby, O(buckets) per group → linear single-thread → 10–50× slower at 50 M rows.
Python
Topic — time-series
Time-series problems (Python)
5. Sliding, tumbling & session windows — the time-window patterns
Three window shapes cover ~95% of streaming-style DE problems — and the only thing separating "easy" from "interview-hard" is whether you can name the gap semantic
The mental model in one line: a time window has a length, a step, and a closing rule — tumbling is length=step, hopping is length≥step, and session is "closed when the gap exceeds X". Once you can quote those three shapes by name, every "bucket events into…" interview prompt collapses to "which shape does the business want?"
Tumbling windows.
-
Definition. Fixed-length, non-overlapping buckets that tile the timeline.
every == period. - Examples. Hourly revenue, daily DAU, 15-minute event counts.
-
pandas:
df.resample("1H").sum()on aDatetimeIndex. -
polars:
df.group_by_dynamic("ts", every="1h").agg(...). - Semantics. Each event belongs to exactly one bucket. Closed on the left, open on the right by default in both engines.
Hopping / sliding windows.
-
Definition. Fixed-length, overlapping buckets.
period > every. Synonyms: hopping (in Flink / Beam) and sliding (in some libraries) refer to the same shape. - Examples. "1-hour window every 5 minutes" for trailing-hour metrics; "7-day rolling revenue updated daily."
-
pandas:
rolling("1H")on aDatetimeIndexfor the time-based sliding; no nativeeveryparameter — you bucket viaGrouperthenmerge_asofif you need true hopping. -
polars:
group_by_dynamic("ts", every="5m", period="1h")is the direct primitive. -
Semantics. Each event belongs to multiple buckets (
period / everyof them on average). Aggregates can be heavy on memory; use incremental algorithms for high-frequency hops.
Session windows.
- Definition. Variable-length, gap-closed buckets. A session is "events with no inter-event gap larger than X."
- Examples. User sessions on a website (30-min inactivity gap), conversation threads (10-min gap), trade bursts (1-min gap).
-
pandas: No direct primitive; compute via
diff+cumsumof a "new session" boolean. -
polars: Same pattern via expression chain; also supports
group_by_dynamicwith an artificial session_id column. - Semantics. Each event belongs to exactly one session. Session boundaries depend on the data, not the clock — two users with different click patterns get different session counts on the same day.
Watermarks and late data.
- Watermark. "No more events older than W will arrive." In streaming systems (Flink, Beam, Spark Structured Streaming), the watermark advances as event time progresses, and windows close once the watermark passes their right edge.
- Late-arriving data. Events whose timestamps are older than the watermark when they arrive. They either re-open the closed window (idempotent re-aggregation) or are dropped to a side-output (lossy but bounded latency).
- Batch implication. Even pure-batch pipelines have a watermark — implicit in "we re-run the previous 7 days every night to absorb late corrections." Naming the watermark explicitly makes the contract auditable.
Anti-pattern: building windows with Python loops.
# DON'T do this
windows = []
for ts in event_ts:
start = ts - timedelta(hours=1)
windows.append(sum(e.amount for e in events if start <= e.ts <= ts))
This is O(n²) for n events and a nested for over events per row. Vectorised forms via groupby + cumsum, rolling, or group_by_dynamic are O(n log n) at worst and 100–1000× faster.
Common interview probes on windows.
- "Tumbling vs hopping?" — tumbling is
every == period; hopping isperiod > every. - "How do you detect a session in pure SQL?" —
LAG(ts) OVER (PARTITION BY user ORDER BY ts); flag rows where the gap exceeds the threshold;SUM(flag) OVER (...)as the session id. - "What is event-time vs processing-time?" — event-time is when the event happened; processing-time is when the system saw it. Always window on event-time for correctness.
- "Why are session windows harder than tumbling?" — variable length means each session's boundaries depend on its own data; you cannot pre-bucket.
Worked example — tumbling: hourly event count
Detailed explanation. The canonical first window question: "how many events per hour?" Both engines have a direct primitive; the answer is a one-liner.
Question. Given an events DataFrame with sub-second timestamps, compute the event count per hour. Show the polars solution.
Input.
| ts | event |
|---|---|
| 2026-06-05 09:03:11 | x |
| 2026-06-05 09:47:02 | x |
| 2026-06-05 10:05:00 | x |
| 2026-06-05 11:55:32 | x |
Code.
import polars as pl
from datetime import datetime
df = pl.DataFrame({
"ts": [datetime(2026,6,5,9,3,11), datetime(2026,6,5,9,47,2),
datetime(2026,6,5,10,5,0), datetime(2026,6,5,11,55,32)],
"event": ["x"] * 4,
})
out = (df
.group_by_dynamic("ts", every="1h")
.agg(pl.len().alias("n"))
.sort("ts"))
Step-by-step explanation.
-
group_by_dynamic("ts", every="1h")declares: "bucket thetscolumn into 1-hour tumbling windows." Each bucket starts at the previous hour boundary (floor) and is non-overlapping with its neighbours. -
pl.len().alias("n")counts the rows in each bucket. The expression evaluates per bucket, not per row. - The output is one row per non-empty bucket. Empty buckets (e.g. the hour where no events occurred) are not emitted by default — use
closed="left"and explicit reindexing if you need them. -
sort("ts")produces chronological output;group_by_dynamicdoes not guarantee order in polars.
Output.
| ts | n |
|---|---|
| 2026-06-05 09:00:00 | 2 |
| 2026-06-05 10:00:00 | 1 |
| 2026-06-05 11:00:00 | 1 |
Rule of thumb. When the spec says "per hour," "per day," or "per N minutes," and the buckets are non-overlapping, the answer is tumbling — resample in pandas, group_by_dynamic(every=…) in polars. Anything else is over-engineering.
Worked example — hopping: 1-hour window every 5 minutes
Detailed explanation. A real-time dashboard shows "events in the last hour, refreshed every 5 minutes." Each output row spans 60 minutes but the rows are spaced 5 minutes apart — so each event appears in ~12 output rows.
Question. Given the same events DataFrame, compute a 1-hour window every 5 minutes.
Input. Same as the previous example.
Code.
out = (df
.group_by_dynamic("ts", every="5m", period="1h")
.agg(pl.len().alias("n_last_hour"))
.sort("ts"))
Step-by-step explanation.
-
every="5m"declares the step between consecutive bucket starts. The dashboard refreshes every 5 minutes. -
period="1h"declares the length of each bucket. Each row aggregates the last hour of events. - Each event appears in
period / every = 60min / 5min = 12buckets on average. Memory and compute scale with that multiplier — for high-frequency hops, pick incremental algorithms (cumulative sums) or accept the cost. - The output's left edge (first bucket) is the floor of the smallest
tsto the nearest 5-minute boundary; the right edge of each bucket isbucket_start + period.
Output (showing first three buckets).
| ts (bucket start) | n_last_hour |
|---|---|
| 2026-06-05 08:05:00 | 0 (events from 08:05 to 09:05) |
| ... | ... |
| 2026-06-05 09:00:00 | 2 |
| 2026-06-05 09:50:00 | 3 |
Rule of thumb. Hopping is the right answer when the spec contains both a window length ("last hour") and a refresh cadence ("every 5 minutes"). If the lengths are equal, you have tumbling; if period > every, you have hopping.
Worked example — session: 30-minute inactivity gap
Detailed explanation. A web analytics team defines a "user session" as a sequence of events with no gap longer than 30 minutes. Two events 31 minutes apart belong to different sessions; two events 29 minutes apart belong to the same.
Question. Given a clicks DataFrame with user_id and ts, assign a session_id such that consecutive events within 30 minutes share an id.
Input.
| user_id | ts |
|---|---|
| 1 | 2026-06-05 09:00:00 |
| 1 | 2026-06-05 09:15:00 |
| 1 | 2026-06-05 09:50:00 |
| 1 | 2026-06-05 10:45:00 |
| 2 | 2026-06-05 09:05:00 |
Code.
import polars as pl
df = pl.DataFrame({
"user_id": [1, 1, 1, 1, 2],
"ts": ["2026-06-05 09:00:00", "2026-06-05 09:15:00",
"2026-06-05 09:50:00", "2026-06-05 10:45:00",
"2026-06-05 09:05:00"],
}).with_columns(pl.col("ts").str.to_datetime())
out = (df.sort(["user_id", "ts"])
.with_columns([
(pl.col("ts").diff().over("user_id")
> pl.duration(minutes=30)).fill_null(True).alias("new_session"),
])
.with_columns(
pl.col("new_session").cum_sum().over("user_id").alias("session_id"),
))
Step-by-step explanation.
- Sort by
(user_id, ts)so thediffis sensible within each user. -
pl.col("ts").diff().over("user_id")computes the inter-event gap within each user; the first event per user has a NULL diff. -
> pl.duration(minutes=30)produces a boolean: TRUE when the gap exceeds 30 minutes (new session boundary), FALSE otherwise.fill_null(True)treats each user's first event as a new session start. -
cum_sum().over("user_id")assigns 1 to the first event, 2 to the next session-start, and so on — giving a contiguoussession_idper user.
Output.
| user_id | ts | new_session | session_id |
|---|---|---|---|
| 1 | 09:00:00 | TRUE | 1 |
| 1 | 09:15:00 | FALSE | 1 |
| 1 | 09:50:00 | FALSE | 1 |
| 1 | 10:45:00 | TRUE | 2 |
| 2 | 09:05:00 | TRUE | 1 |
Rule of thumb. Session windows always reduce to "boolean flag + cumsum." Once you have the session_id, downstream aggregates (session length, session revenue) are plain group-bys.
Worked example — event-time vs processing-time
Detailed explanation. A late-arriving event from a mobile client lands in the pipeline 90 minutes after it actually happened. Bucketing by processing-time puts it in the wrong hour; bucketing by event-time puts it where it semantically belongs. The difference matters for revenue accounting, anomaly detection, and any user-visible metric.
Question. Given two columns event_ts and processing_ts, show how the hourly count differs when you group by each.
Input.
| event_ts | processing_ts |
|---|---|
| 2026-06-05 09:30 | 2026-06-05 09:31 |
| 2026-06-05 09:45 | 2026-06-05 09:46 |
| 2026-06-05 09:55 | 2026-06-05 11:25 (late!) |
Code.
out_event = (df.group_by_dynamic("event_ts", every="1h")
.agg(pl.len().alias("n"))
.sort("event_ts"))
out_proc = (df.group_by_dynamic("processing_ts", every="1h")
.agg(pl.len().alias("n"))
.sort("processing_ts"))
Step-by-step explanation.
-
event_tsreflects when the user actually clicked. Every event belongs to the 09:00 hour. -
processing_tsreflects when the pipeline saw the event. The late one lands in the 11:00 hour even though it was generated at 09:55. - Bucketing by event-time gives
09:00 → 3events; bucketing by processing-time gives09:00 → 2and11:00 → 1— a misleading view of "what happened." - The right answer for user-visible metrics is event-time, with a watermark policy for closing late-arriving buckets.
Output (event-time):
| event_ts (hour) | n |
|---|---|
| 09:00 | 3 |
Output (processing-time):
| processing_ts (hour) | n |
|---|---|
| 09:00 | 2 |
| 11:00 | 1 |
Rule of thumb. Always default to event-time for correctness. Use processing-time only when the question is operational ("how loaded was the ingest pipeline this hour?"), not analytical.
Python interview question on session detection
A senior interviewer might frame this as: "Given a clicks DataFrame with user_id and ts, define a user session as consecutive events within 30 minutes. Return a DataFrame of (user_id, session_id, session_start, session_end, n_events)."
Solution Using diff + cum_sum for session_id then group_by
import polars as pl
def sessionise(df: pl.DataFrame, gap_minutes: int = 30) -> pl.DataFrame:
s = (df.sort(["user_id", "ts"])
.with_columns(
(pl.col("ts").diff().over("user_id")
> pl.duration(minutes=gap_minutes))
.fill_null(True).alias("new_sess"))
.with_columns(
pl.col("new_sess").cum_sum().over("user_id").alias("session_id")))
return (s.group_by(["user_id", "session_id"])
.agg([pl.col("ts").min().alias("session_start"),
pl.col("ts").max().alias("session_end"),
pl.len().alias("n_events")])
.sort(["user_id", "session_id"]))
Step-by-step trace.
| user_id | ts | gap | new_sess | session_id |
|---|---|---|---|---|
| 1 | 09:00 | NULL | TRUE | 1 |
| 1 | 09:15 | 15m | FALSE | 1 |
| 1 | 09:50 | 35m | TRUE | 2 |
| 1 | 10:45 | 55m | TRUE | 3 |
| 2 | 09:05 | NULL | TRUE | 1 |
cum_sum over user_id increments only on the TRUE flags, producing a contiguous integer id per user that matches the gap-closure semantics.
Output:
| user_id | session_id | session_start | session_end | n_events |
|---|---|---|---|---|
| 1 | 1 | 09:00 | 09:15 | 2 |
| 1 | 2 | 09:50 | 09:50 | 1 |
| 1 | 3 | 10:45 | 10:45 | 1 |
| 2 | 1 | 09:05 | 09:05 | 1 |
Why this works — concept by concept:
-
diff().over(user_id)— computes the inter-event gap within each user. Polars handles the partition implicitly; pandas would writegroupby("user_id")["ts"].diff(). -
Boolean flag +
cum_sum— the classic session-detection idiom. Every TRUE marks a new session; the running sum assigns a contiguous id. -
fill_null(True)— the first event per user has a NULLdiff; treating it asTrue(new session) seeds the id correctly. -
group_byfor the summary — oncesession_idexists, the rest is a plain group-by withmin,max, andlen. No special window primitive needed. -
Linear cost —
diffis O(1) per row;cum_sumis O(n) over a sorted partition;group_byis O(n). Whole pipeline is O(n log n) for the initial sort, O(n) thereafter — orders of magnitude faster than a Python loop. - Cost — O(n log n) sort + O(n) for the rest. On 10 M clicks across 100 K users, this runs in seconds on a laptop with polars.
Python
Topic — sliding-window
Sliding-window problems (Python)
Worked example — late data and watermarks
Detailed explanation. A streaming job emits hourly revenue rollups. The watermark advances as event-time progresses; once it passes 10:00, the 09:00 bucket is "closed." A 09:55 event that arrives at 11:30 is late — the job's policy determines whether it reopens the 09:00 bucket (with a re-emit) or is dropped to a side-output.
Question. Given the events below with event_ts and arrival_ts, mark which events arrived after the 1-hour-late watermark.
Input.
| event_ts | arrival_ts |
|---|---|
| 09:30 | 09:31 |
| 09:55 | 11:30 (late!) |
| 10:10 | 10:11 |
Code.
from datetime import timedelta
import polars as pl
# Watermark: events more than 1 hour late are flagged
df_marked = df.with_columns(
(pl.col("arrival_ts") - pl.col("event_ts") > pl.duration(hours=1)).alias("is_late"),
)
Step-by-step explanation.
- The watermark contract: "we consider an event 'on time' if
arrival_ts - event_ts <= 1 hour." Anything else is late. - The 09:30 event arrives at 09:31 — 1 minute late.
is_late = FALSE. - The 09:55 event arrives at 11:30 — 1h35m late.
is_late = TRUE. The 09:00 bucket has already been emitted; the policy choice is reopen-and-re-emit or drop-to-side-output. - The 10:10 event arrives at 10:11 — 1 minute late, before the 10:00 bucket closes.
is_late = FALSE.
Output.
| event_ts | arrival_ts | is_late |
|---|---|---|
| 09:30 | 09:31 | FALSE |
| 09:55 | 11:30 | TRUE |
| 10:10 | 10:11 | FALSE |
Rule of thumb. State the watermark policy in the pipeline's documentation, not in a comment. Late data is a business decision (drop vs reopen vs side-output), not an engineering one — and the policy must be auditable.
Cheat sheet — Python time-math recipes
-
Parse ISO 8601.
datetime.fromisoformat(s)— handles offsets, microseconds, and (after replacingZwith+00:00) Zulu time. -
Current UTC, aware.
datetime.now(timezone.utc)— neverdatetime.utcnow()(naive) ordatetime.now()(host-local). -
Attach a source zone.
dt.replace(tzinfo=ZoneInfo("America/New_York"))— labels a naive wall clock without changing it. -
Convert to UTC.
dt.astimezone(ZoneInfo("UTC"))— preserves the instant, changes the wall clock to UTC. -
Convert to display zone.
dt.astimezone(ZoneInfo("America/Tokyo"))— same operation, different target zone. -
Add 30 days (fixed length).
dt + timedelta(days=30). -
Add 1 month (calendar).
dt + relativedelta(months=1)— leap-year and end-of-month aware. -
Difference in seconds.
(a - b).total_seconds()— never(a - b).seconds(ignores.days). -
Round / floor / ceil to hour. pandas:
s.dt.floor("1H"); polars:pl.col("ts").dt.truncate("1h"). -
Tumbling 1-hour rollup. pandas:
df.resample("1H").sum(); polars:df.group_by_dynamic("ts", every="1h").agg(...). -
Hopping window. polars:
df.group_by_dynamic("ts", every="5m", period="1h").agg(...). -
Sliding (time-based) window. pandas:
df.rolling("7D").sum()on aDatetimeIndex. -
Session detection.
(diff > threshold).cum_sum().over("user")→ contiguoussession_id. -
Localize tz-naive pandas Series.
s.dt.tz_localize("UTC")thendt.tz_convert("America/NY")— never.tz_converton a naive Series. -
DST-safe future timestamp. anchor in UTC, add
timedelta, convert at display. -
Count NULL
ts.(s.isna() | s.isnull()).sum()in pandas;df.select(pl.col("ts").is_null().sum())in polars. -
Ingestion contract. require ISO 8601 with offset; reject naive strings without an explicit
source_zone. -
Pandas object-dtype trap. never
concata tz-aware and a tz-naive Series — explicitly localize the naive one first.
Frequently asked questions
Should I use pytz or zoneinfo in 2026?
Use zoneinfo — it is part of the standard library since Python 3.9, has no pip install cost, and uses the system's IANA tzdata (or the tzdata PyPI package on Windows). pytz is deprecated as the default and survives only in legacy code. The two key differences: zoneinfo uses the fold attribute to disambiguate the fall-back hour (fold=0 for the first occurrence, fold=1 for the second), while pytz uses is_dst=True/False; zoneinfo lets you attach a zone via dt.replace(tzinfo=ZoneInfo(...)) directly, while pytz requires tz.localize(dt) to avoid the historical LMT offset trap. For new code, default to zoneinfo and migrate pytz callsites at your leisure.
Why does datetime.utcnow() return a naive datetime?
Because the standard library, when datetime was designed in 2002, treated tzinfo as optional and utcnow() was specified to return the wall-clock time in UTC without attaching a tzinfo — the rationale being that "you obviously already know it is UTC because you called utcnow()." In practice this convention has been a footgun for two decades because the contract is encoded in the function name, not the value, so any downstream code that lost track of the source paths will treat it as naive-local. The 2026 best practice is to call datetime.now(timezone.utc) instead — same instant, but the timezone is part of the value, and any comparison or subtraction with another aware datetime works without TypeError. New code should never use utcnow(); legacy code should wrap it with .replace(tzinfo=timezone.utc) at the boundary, ideally behind a thin helper that documents the convention.
How do I add months to a datetime if timedelta has no months?
Use dateutil.relativedelta: from dateutil.relativedelta import relativedelta; dt + relativedelta(months=3) is the canonical answer. relativedelta is calendar-aware — it increments the month field by 3 and clamps the day to the last valid day of the target month, handling leap years and end-of-month edge cases correctly. timedelta(days=30 * 3) is not equivalent — it advances exactly 90 days, which is one to three days different from "three calendar months later" depending on which months are involved. The reason timedelta deliberately refuses months and years is that those units have variable length (28-31 days for a month; 365 or 366 days for a year); encoding them as a fixed-duration delta would be a lie. The standard library forces you to choose between a fixed duration (timedelta) and a calendar delta (relativedelta), which is exactly the choice you want to make explicitly.
What's the difference between pd.Timedelta and datetime.timedelta?
Both represent a duration; pd.Timedelta is pandas' version with nanosecond resolution (since pandas 2.0) and vectorised operations across whole columns, while datetime.timedelta is the standard library's version with microsecond resolution and scalar-only behaviour. Functionally they interop: pd.Timedelta(timedelta(hours=1)) round-trips cleanly, and arithmetic between a pd.Timestamp and a datetime.timedelta works. The reason to prefer pd.Timedelta inside a pandas pipeline is that the operations vectorise — s + pd.Timedelta("1D") is a single C-level loop across the whole column, while a Python timedelta would force per-row Python dispatch. The reason to use plain datetime.timedelta outside pandas is that it is stdlib, dependency-free, and forces the same arithmetic discipline (total_seconds() instead of .seconds, no months/years). The polars equivalent is pl.duration(hours=1), which has the same vectorised semantics.
How do I handle DST transitions safely?
The single rule that defuses DST: store and compute in UTC; convert at display time only. UTC has no DST, so any arithmetic on UTC instants behaves consistently year-round. When you need to express a business rule in local time ("trigger at 09:00 New York every day"), convert the next trigger time into UTC and store it as a UTC instant — re-compute it from the local rule on every iteration so you never end up with a stale UTC timestamp that drifted across a DST boundary. For the spring-forward gap (02:30 does not exist), zoneinfo will silently pick the post-jump offset, so validate naive wall clocks at the ingestion layer and reject inputs that fall in the skipped band. For the fall-back ambiguity (01:30 happens twice), use fold=0 (first occurrence) by default and surface the ambiguity in a log line; if the source system records the offset separately, use the offset to choose the fold. Never use timedelta(days=1) to mean "the same wall-clock time tomorrow" in local time — that semantic requires dateutil.relativedelta(days=1) plus a local-zone-aware datetime.
Why is my pandas DataFrame's datetime column suddenly object dtype?
Almost always because a tz-aware Series and a tz-naive Series were concatenated, joined, or stacked. pandas refuses to coerce the two into a single tz-aware or tz-naive column (the contracts conflict), so it falls back to object dtype — which silently breaks every downstream dt.* accessor and most arithmetic. The diagnosis is one line: print(s.dtype) should print datetime64[ns, UTC] or datetime64[ns] — if it prints object, you have the bug. The fix is to localize every naive Series at its source (s.dt.tz_localize("UTC") or whatever the convention is) before the concat. The defensive pattern is a CI lint that scans for pd.concat calls and flags any path where the inputs have mismatched dt.tz. The same trap exists in polars but is louder: polars raises a type error on the concat instead of silently coercing, so it surfaces at the first run rather than at the dashboard render.
Practice on PipeCode
- Drill the date-time practice library → for
datetime/timedelta/zoneinforeps. - Rehearse on time-series problems (Python) → for the
pd.resample/polars group_by_dynamicfamily. - Sharpen sliding-window drills (Python) → for hopping windows and session-detection patterns.
- Layer the window functions library → for SQL-side
LAG/LEAD/ partitioned time math. - Stack the Python language practice library → for end-to-end ingestion-pipeline reps.
- For the broader surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the Python axis with the Python for data engineering interviews course →.
- For long-form schema craft, work through data modelling for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every time-math recipe above ships with hands-on practice rooms where you write the UTC normalisation funnel, the `group_by_dynamic` rollup, and the session-detection `diff + cum_sum` against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your fix to a DST off-by-one actually behaves the same on pandas as on polars.





Top comments (0)