Alex Merced

Posted on Jun 11

Apache Data Lakehouse Weekly: June 4 to June 11, 2026

#database #dataengineering #news #opensource

The lakehouse community spent this week arguing about versions, and the arguments mattered. Parquet contributors produced the single largest thread across all five projects with a 40-message debate on what Parquet versioning should even mean, while Iceberg cut four release candidates of its C++ implementation in under a week and locked in a patch release plan for its two production lines. Underneath the release activity, a quieter theme connected everything: how these projects make decisions. Polaris debated merge button mechanics and HTTP status codes, Parquet contributors insisted that working group syncs cannot replace mailing list consensus, and Arrow wrote down rules for AI-generated code reviews. The formats are maturing, and so is the governance around them.

Before getting into each project, the raw numbers set the scene. The five dev lists combined for 358 emails this week. Iceberg led with 135 emails across 34 threads from 51 distinct participants, followed by Polaris at 114 emails across 23 threads from a tight group of 14 regulars. Parquet concentrated 72 emails into only 7 threads, which tells you its conversations ran deep rather than wide. Arrow posted 24 emails across 11 threads from 18 participants, and DataFusion rounded things out at 13 emails across 6 threads. The shape of those numbers matters as much as the totals. Iceberg's breadth reflects a project with a dozen parallel workstreams from spec evolution to language implementations to community events. Polaris's depth from a small group reflects a project where a core team is hammering out operational fundamentals. And Parquet's concentration reflects a community wrestling with a handful of existential questions all at once.

Apache Iceberg

The Iceberg dev list logged 135 emails across 34 threads from 51 participants this week, and the headline work happened in the spec.

Ryan Blue's vote to add a draft bitmap spec to git drew 14 messages and broad support, with binding +1s from Amogh Jahagirdar and others, plus non-binding approval from Micah Kornfield, who left clarity comments for implementers. The bitmap format targets small bitmaps, and the discussion surfaced a practical wrinkle worth watching. Péter Váry supported the move but flagged that delete vectors will need good compression if the community wants to store them in metadata files. Kornfield also asked Ryan a sharp process question: given the limited nature of the vote, what are the decision factors for actually promoting the draft to a finalized spec? That question echoes through several other Iceberg threads this week, because the project is increasingly comfortable landing draft specifications in git and iterating in the open rather than perfecting documents in Google Docs first.

The most consequential design debate centered on the REST catalog protocol. A discussion on adding an X-Iceberg-Client-Capabilities header to the REST spec evolved into a full conversation about a v2 loadTable endpoint. Ryan Blue laid out the case for v2, including optional locations, optional snapshots, and moving credentials out of properties. Russell Spitzer agreed those are good reasons but questioned whether a v2 endpoint actually changes the capability negotiation problem the header was meant to solve. The sharpest pushback came from Christian Thiel of the Lakekeeper project, who challenged the sentiment that a v2 loadTable should mandate that clients fail when they encounter unsupported restrictions. His argument is grounded in adoption reality: a v2 endpoint gets adopted for many reasons, and strict failure semantics create friction for clients that have nothing to do with the restriction features. Kurtis Wright backed the v2 direction after missing the original community meeting discussion. This thread is the one to follow if you build or operate REST catalogs, because the outcome shapes how every engine negotiates features with every catalog for years.

Step back and the stakes become clearer. The REST catalog spec is now the contract that binds the entire commercial Iceberg ecosystem together. Every managed catalog service, every query engine, and every standalone tool implements some slice of it, and those slices increasingly diverge in subtle ways. A capabilities header gives clients a standard way to declare what they understand, which lets catalogs make informed decisions about what to return. A v2 loadTable goes further by fixing accumulated design debt in the most heavily trafficked endpoint in the protocol. The tension Thiel identified is the classic protocol evolution dilemma: strict semantics protect correctness for new features like fine-grained access control, where a client silently ignoring a row filter is a security incident, but strictness also slows adoption by punishing clients for capabilities unrelated to their workload. How the community threads that needle will determine whether v2 arrives as a clean upgrade path or a compatibility minefield. The fact that catalog implementers like Thiel, engine maintainers like Spitzer, and spec authors like Blue are all in the same thread arguing in good faith is the system working as designed.
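The strictness trade-off is easier to see in miniature. The following is a minimal Python sketch, not anything taken from the REST spec or the thread: the capability names, the comma-separated header format, and the `decide` function are all hypothetical. It shows one possible policy, where a catalog refuses a load only when the specific table needs a capability the client did not declare, so clients working with unrestricted tables are unaffected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadDecision:
    allowed: bool
    reason: str


def parse_capabilities(header_value: str) -> frozenset:
    """Parse a capabilities header value.

    Assumption: a comma-separated list of names. The real header
    format was still under discussion and may look nothing like this.
    """
    return frozenset(p.strip() for p in header_value.split(",") if p.strip())


def decide(client_caps: frozenset, table_requires: frozenset) -> LoadDecision:
    """Refuse only when this table needs something the client lacks."""
    missing = table_requires - client_caps
    if missing:
        return LoadDecision(False, "client lacks: " + ", ".join(sorted(missing)))
    return LoadDecision(True, "ok")


# Hypothetical capability names, for illustration only.
caps = parse_capabilities("row-filters, column-masks")
print(decide(caps, frozenset({"row-filters"})))         # allowed
print(decide(frozenset(), frozenset({"row-filters"})))  # refused
print(decide(frozenset(), frozenset()))                 # allowed, table has no restrictions
```

A stricter policy would refuse any client that omits a capability whether or not the table uses it. Where to draw that line is what the thread is working out.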

Prashant Singh's summary of the dedicated sync on finer-grained read restrictions connects directly to that capabilities debate. The room landed on capabilities handling as a core piece of the fine-grained access control (FGAC) design, and Singh posted the recording and an AI-assisted summary for those who could not attend. Sung Yun extended the FGAC conversation with a thoughtful post on a write-path gap for field-id-bound policies during schema evolution. The read side of the proposal binds row filters and masks to field IDs so they survive schema evolution safely, but Yun points out that the write path has no equivalent story yet. Securing reads while leaving writes unguarded is a half-finished lock, so expect this gap to get attention as the proposal matures.

Security work continued on a second front. Adam Szita published a spec proposal for KMS credential vending through the REST catalog, separating credential management for KMS and Vault systems from the broader table encryption discussion. The intent is to let catalogs vend KMS credentials the same way they vend storage credentials today, which would make table-level encryption practical in multi-engine deployments where distributing key access manually does not scale.

On the release front, Amogh Jahagirdar kicked off planning for 1.11.1 and 1.10.3 patch releases after encountering a bug where the Spark rewrite manifests procedure fails to carry over first row IDs correctly. The thread gathered 13 messages and quick consensus. Steven Wu pointed to the existing 1.11.1 milestone, Yufei Gu and Daniel Weeks added their support, and Weeks made the operating principle explicit: keep the 1.10 backports narrow so the release stays easy and helps anyone who has not yet moved forward. Meanwhile Neelesh Salian opened planning for Apache Iceberg 1.12.0 with a direct acknowledgment that 1.11.0 took roughly eight months from 1.10.0, longer than the project wants. Steven Wu's response captured the philosophy the community is converging on: with a regular release habit, nobody needs to hold the release train for their feature, because the next train leaves in two to three months. Salian also published the Iceberg 1.11 feature branch retrospective conclusion over on the Polaris list crossover thread, where Alexandre Dutra summarized the community's honest feedback by recommending the feature branch experiment not be repeated.

The C++ implementation provided the week's endurance story. Junwang Zhao proposed RC0 of Apache Iceberg C++ 0.3.0 on June 6, and what followed was a sprint through RC1, RC2, and RC3 by June 11. Each candidate fixed issues the previous one surfaced. Matt Topol's RC2 verification caught real gaps in the release tooling, including undocumented meson and gtest requirements and an SSL workaround needed for the curl dependency, and Gang Wu called for improving the release script to catch similar issues automatically. By RC3, verification reports were coming in clean from macOS and Ubuntu environments across multiple contributors including Steven Wu, Raúl Cumplido, and Tanmay Rauth. Four release candidates in a week is not a failure story. It is what a healthy verification culture looks like when a young implementation is still hardening its release process.

Spec precision got its own dedicated attention. Andrei Tserakhau called a vote to clarify that the day partition transform's result type is date in the spec, gathering ten messages of support including binding +1s from Matt Topol and others within hours. The companion discussion on the Avro schema ambiguity for day transform fields in manifests shows why this dry-sounding clarification matters: Tserakhau noted the ambiguity bit someone again just last week on the Go side, where compacting a Spark-written table produced incompatible manifests. Kevin Liu suggested keeping the spec explanation format agnostic, and the fix landed in PR review. Small spec ambiguities compound into real interoperability bugs once five language implementations write the same metadata.
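For readers who have not looked at the transform itself, here is a small Python sketch of what a day transform computes, assuming the spec's definition of days counted from 1970-01-01 and evaluating timestamps in UTC. The function names are mine, not from any Iceberg library. The stored number is identical whether a manifest schema declares the field as a plain int or as a date; what differs is the declared type, and that is where implementations can disagree.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = date(1970, 1, 1)


def day_transform(ts: datetime) -> int:
    """Days from 1970-01-01 for a timezone-aware timestamp, taken in UTC."""
    return (ts.astimezone(timezone.utc).date() - EPOCH).days


def as_date(days: int) -> date:
    """Interpret the same ordinal as a calendar date."""
    return EPOCH + timedelta(days=days)


ts = datetime(2026, 6, 11, 23, 30, tzinfo=timezone.utc)
days = day_transform(ts)
print(days, as_date(days))  # one value, two possible declared types
```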

The function catalog work crossed a milestone when huaxin gao's vote on REST spec endpoints for listing and loading functions passed with ten +1 votes, five of them binding. Szehon Ho used his +1 to suggest tracking a specific-name for convenience over definition-id so engines can refer to each overloaded version of a function. With the spec change merging, Iceberg moves closer to catalogs that serve shared function definitions to every connected engine, which matters enormously for teams tired of reimplementing the same UDFs in Spark, Trino, and Flink.

The variant data type push kept its momentum through two threads and a sync. Neelesh Salian posted the variant tracking document and sync notes, and the follow-up discussion on variant shredding policy across Iceberg implementations tackled a subtle problem: aligning not just on the type definition but on how implementations shred variant values into columnar storage. Kurtis Wright praised the community for aligning on implementations rather than stopping at types. Shredding policy differences between engines would produce files that are technically spec compliant but perform wildly differently depending on which engine wrote them, so this alignment work protects the performance portability that makes Iceberg valuable.

Performance optimization proposals arrived from Varun Lakhyani, who opened two related threads on cutting S3 request counts. His proposal to combine three GET calls for Parquet reads targets small file workloads where Iceberg currently issues two GETs for the footer and one for data when a single GET could fetch the whole file. The companion idea to store Parquet footer size in Iceberg metadata would let readers skip footer discovery entirely. For workloads on object storage where request costs and latency dominate, a two-thirds reduction in GET calls for small files is real money.
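The footer discovery step is worth seeing concretely. A Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic `PAR1`, so a reader that knows nothing about the file must fetch that tail before it can fetch the footer. The sketch below, in Python with hypothetical function names, computes the footer's byte range both ways: from the tail bytes, and from a footer size already known from table metadata. It works on byte offsets only, makes no S3 calls, and assumes an unencrypted footer.

```python
import struct

MAGIC = b"PAR1"   # magic for files with an unencrypted footer
TAIL_LEN = 8      # 4-byte little-endian footer length + 4-byte magic


def footer_range_from_known_size(file_size: int, footer_len: int) -> tuple:
    """Footer (start, end) offsets when the size is already known (no tail read)."""
    end = file_size - TAIL_LEN
    start = end - footer_len
    if start < len(MAGIC):
        raise ValueError("footer length is larger than the file allows")
    return start, end


def footer_range_from_tail(file_size: int, tail: bytes) -> tuple:
    """Same range, derived from the file's last 8 bytes."""
    if len(tail) != TAIL_LEN or tail[4:] != MAGIC:
        raise ValueError("tail does not look like a Parquet file ending")
    (footer_len,) = struct.unpack("<I", tail[:4])
    return footer_range_from_known_size(file_size, footer_len)


# Build a fake file in memory: magic, data bytes, footer, length, magic.
footer = b"F" * 37
blob = MAGIC + b"D" * 100 + footer + struct.pack("<I", len(footer)) + MAGIC
start, end = footer_range_from_tail(len(blob), blob[-TAIL_LEN:])
print(blob[start:end] == footer)
```

Collapsing three requests into one cuts the count by (3 - 1) / 3, which is where the two-thirds figure comes from.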

Looking further ahead, Daniel Weeks proposed default value expressions for the v4 spec, building on the earlier expressions proposal to let defaults be computed rather than constant. Xiening Dai and Maninder Parmar continued working through global snapshot consistency for Iceberg tables, comparing a commit sequence number approach against a batch LoadTables API and concluding the two are complementary rather than contradictory. Mukund Thakur asked for review on his proposal for repartitioning old partition spec data files, which has been waiting since mid-May. Robert Kruszewski noticed that Iceberg's arrow-java dependency is more than two years old at 15.0.2 and offered to drive the upgrade to 19.0.0. And Joana Hrotkó proposed exposing the commit retry exhaustion reason in failure messages, a small operability win for anyone who has stared at an opaque commit failure at 2 AM.

Community infrastructure had a moment too. Bob Thomson from ASF Infra reported that Iceberg is the top consumer of shared GitHub-hosted runners over the last seven days, with overall utilization maxing out daily. The timing was good, because Vova Kolmakov had already proposed running JDK 21 tests only on main and nightly builds to halve PR runner minutes, and Ajantha Bhat pointed to his open PR doing exactly that plus incremental CI builds, which has been waiting for review. On the events side, the Iceberg Summit 2027 location discussion turned into a friendly bidding war, with Viktor Kessler pitching Barcelona, Paris, and Berlin under the banner of making Iceberg global, while Danica Fine reminded everyone that Lakehouse Day EU in Glasgow this October already gives the EMEA community a major gathering, co-located with Community Over Code and with its agenda now live. Kessler also announced the Iceberg Community Meetup Europe in Munich on July 22. Alex Stephen shared a healthy Iceberg Terraform Provider update with namespace and table management now supported, and huaxin gao posted notes from both the constraint support sync and the index support sync series.

Apache Polaris

Polaris generated 114 emails across 23 threads this week, and the volume tells you something: this project is in the thick of working out what a production catalog service owes its operators.

The biggest thread by message count was, surprisingly, about the merge button. Jean-Baptiste Onofré opened a PR to enable all three GitHub merge actions, adding merge commits and rebase-and-merge alongside the existing squash-and-merge, and the thread ran to 23 messages. Yong Zheng merged it before seeing the discussion, offered to revert, and JB waved it off with characteristic calm. The substantive objection came from Alexandre Dutra, who sees some value in rebase-and-merge when used wisely but struggles to imagine a useful case for merge commits, and worries about what happens when someone uses the wrong button on a messy branch. Twenty-three messages about merge strategies sounds like bikeshedding until you remember that commit history is how a project audits itself, and Polaris contributors clearly care about getting their development hygiene right while the project is still young enough to set habits.

The week's best protocol discussion came from Nándor Kollár, who asked the community to settle the correct HTTP status code for table and view rename conflicts when a conflicting operation is in progress. The current behavior returns a 500, which Dmitri Bourlatchkov reviewed and declared most certainly not correct, since 5xx codes signal fundamental service failure beyond the client's control. The candidates each have problems: 503 implies the whole service is unhealthy, 429 means rate limiting and is not defined for rename in the Iceberg REST spec, and 409 traditionally signals a conflict the client should not blindly retry. Seventeen messages in, the thread had become a genuinely useful seminar on REST semantics for catalog operations. The resolution matters beyond Polaris, because whatever convention Polaris adopts will influence how clients across the ecosystem implement retry logic for concurrent catalog operations.
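To see why the choice of code leaks into every client, consider a minimal retry policy. This Python sketch is hypothetical and does not describe how Polaris or any particular Iceberg client behaves; the `Action` enum and `action_for` function are invented for illustration, and the mapping simply follows the conventional readings of each code described above.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    FAIL = "fail"


def action_for(status: int, retry_on_conflict: bool = False) -> Action:
    """Map an HTTP status to a client decision (illustrative policy only)."""
    if status in (429, 503):
        # Conventionally "try again later".
        return Action.RETRY
    if status == 409:
        # Conflicts are not blindly retried unless the caller opts in.
        return Action.RETRY if retry_on_conflict else Action.FAIL
    # Everything else, including a bare 500, surfaces as a failure.
    return Action.FAIL


for code in (500, 503, 429, 409):
    print(code, action_for(code).value)
```

Under a policy like this one, the same transient rename collision is either retried or surfaced to the user depending purely on which code the server picks.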

Operational maturity drove a cluster of related threads on events and metrics. Yong Zheng raised the need for a mechanism to purge the events and metrics tables, since Polaris now persists both event streams and Iceberg metrics with no retention story. Kollár noted the urgency grows as event persistence expands to more event types, and Bourlatchkov suggested the Admin tool as the natural home, similar to the existing NoSQL maintenance task. Zheng followed with a proposal for filters on Iceberg metrics reporting, sketching expressions that match on catalog, namespace, and table name. Bourlatchkov floated CEL as the filter language before recalling that prior community consensus leaned toward removing CEL, leaving include and exclude lists with glob patterns as the likely landing spot. The largest design question in this cluster came from Yufei Gu, who proposed routing Iceberg scan and commit metrics through the events subsystem rather than maintaining a parallel persistence path, since synchronous metrics persistence chokes the Polaris persistence layer. Anand Kumar Sankaran noted with a smile that his original metrics PR proposed exactly this before the community decided to keep them separate, and flagged that any change here is a breaking schema migration. Dutra found the events approach appealing but wants performance overhead evaluated thoroughly first.
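As a rough picture of what include and exclude lists with glob patterns could look like, here is a short Python sketch using the standard library's `fnmatch`. The dotted `catalog.namespace.table` naming, the `should_report` function, and the rule that an empty include list means "include everything" are assumptions for illustration, not the Polaris design.

```python
from fnmatch import fnmatchcase


def should_report(name: str, include: list, exclude: list) -> bool:
    """Decide whether metrics for a table should be reported.

    Assumption: `name` is "catalog.namespace.table". Note that in
    fnmatch, `*` also matches dots, so "prod.*" covers all namespaces.
    """
    included = not include or any(fnmatchcase(name, p) for p in include)
    excluded = any(fnmatchcase(name, p) for p in exclude)
    return included and not excluded


include = ["prod.*"]
exclude = ["*.tmp_*"]
for name in ("prod.sales.orders", "prod.sales.tmp_load", "dev.sales.orders"):
    print(name, should_report(name, include, exclude))
```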

That events subsystem got its own scrutiny in Dutra's thread on event delivery ordering and concurrency guarantees, prompted by a PR that shifted delivery to a blocking executor. The previous behavior implicitly relied on Vert.x event bus semantics that nobody had written down. Kollár argued listeners should be documented as thread-safe and that strict ordering rarely matters as long as every event arrives, and Gu took the pragmatic position: keep ordered delivery as the only behavior now, and introduce unordered delivery only if a real need appears. Documenting implicit guarantees before users depend on them accidentally is exactly the kind of unglamorous work that separates production infrastructure from promising prototypes.

JB's Polaris Directories proposal advanced after several months of design work, and the discussion sharpened around one architectural question: where does the scanner live? Gu argued that if the scanning component sits completely outside Polaris, the user experience becomes confusing, with Polaris storing only directory configuration while real work happens elsewhere. JB clarified his two-step plan, landing configuration and high-level architecture first, then building the scanning service as part of Polaris proper. Romain Manni-Bucau pushed on extensibility, asking whether users can plug in their own metadata and whether scanning will be streaming friendly rather than batch only. Directories would give Polaris a way to govern data that has not yet been formalized into Iceberg tables, which extends the catalog's reach into the messy reality of most data lakes.

Release machinery is turning for Apache Polaris 1.6.0, targeted around June 26. EJ Wang reported no must-have blockers and plans to cut from main, while Adnan Hemani asked to land one PR first, a fix for a documentation versioning issue that had gone unreported for a while. JB updated the release process documentation to match. In parallel, the project took a step toward friendlier adoption when Yong Zheng proposed promoting the polaris CLI from PyPI as the recommended setup for non-development use, sparing users a full repository clone. Gu, JB, and Hemani all backed it immediately.

Two storage-layer threads rounded out the design work. Gu's proposal for making unique table locations the default won quick support from Russell Spitzer, who endorsed taking determinism out of table creation paths as a safety improvement. Bourlatchkov raised an important operational catch: with randomized locations, long-running staged create operations like CTAS face a credential refresh problem, connecting to the credential refresh discussion Gu had flagged earlier in the week and to active design work on the Iceberg side. Bourlatchkov also recapped community sync consensus on supporting multiple storage configurations per catalog, with authorization aspects deferred. And the Iceberg table encryption discussion continued between Gu and Bourlatchkov, working through whether Polaris can realistically test against encrypted Iceberg tables today. The answer is yes with caveats, and the work proceeds incrementally starting with internal Polaris workflows that touch encrypted files.

Testing infrastructure produced this week's most quietly notable line. In the object storage mock testing thread, Russell Spitzer shared a proof of concept he implemented with Claude's help, comparing approaches for testing file operations without real cloud containers. Robert Stupp agreed the POC clarifies the layering problem and they converged on a split: synthetic FileIO for generated listings and pure file operation behavior, real containers where fidelity matters. Bourlatchkov also opened threads on retiring the regtests code in favor of Yong's new Spark smoke tests, fixing a Principal Role validation regex through a REST spec change, and a subtle JSONB reformatting issue in PostgreSQL persistence that argues for semantic JSON comparison in entity tests.
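The JSONB point generalizes: PostgreSQL's jsonb type does not preserve whitespace or object key order, so a document read back can differ textually from the one written while meaning the same thing. A minimal Python sketch of string versus semantic comparison follows; the sample payloads and the `json_equal` helper are made up for illustration.

```python
import json


def json_equal(a: str, b: str) -> bool:
    """Compare two JSON documents by parsed value, not by text."""
    return json.loads(a) == json.loads(b)


written = '{"b": 1, "a": {"x": [1, 2]}}'
read_back = '{"a": {"x": [1, 2]}, "b": 1}'   # reordered and respaced

print(written == read_back)            # text comparison
print(json_equal(written, read_back))  # semantic comparison
```

Array order still matters in this comparison, which is usually the desired behavior for entity tests.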

The lineage conversation kept building. Adnan Hemani and Robert Stupp continued their OpenLineage follow-up by working through what Polaris should do when lineage events reference non-Polaris datasets on both ends, with Stupp calling for broader community input because the options on the table represent materially different commitments. And Sankaran proposed a GCP counterpart to AWS STS session tags so Polaris can correlate vended-credential data access back to the catalog operation that issued the credential on Google Cloud, closing an auditability gap between cloud providers.

Taken together, the week's Polaris threads sketch the profile of a catalog growing into production responsibilities. Almost nothing this week was about new catalog features in the demo sense. Instead the community worked on retention for its own telemetry, correct HTTP semantics under concurrency, documented threading guarantees, credential lifecycle edge cases in staged writes, audit correlation across clouds, and test infrastructure that does not require a cloud bill. This is the unglamorous middle phase of an infrastructure project's life, after the architecture is proven and before the enterprise checklists are fully satisfied, and how a community handles this phase predicts whether operators will trust it with their metadata five years from now. The Polaris regulars, a group of roughly fourteen people this week, are handling it with notable discipline, and the 1.6.0 release later this month will carry the early fruits of that work.

Apache Arrow

Arrow had a steadier week at 24 emails across 11 threads, anchored by a release and a governance decision about AI tooling.

Andrew Lamb shepherded Apache Arrow Rust 59.0.0 through its RC2 vote after RC1 hit a verification problem that Ed Seidl fixed. Verification reports came in from Seidl on RHEL 8, Raúl Cumplido on Debian 14 with Rust 1.96, Adam Reeve on Fedora 44, and L. C. Hsieh, and Lamb announced the result with five +1 votes, four binding, publishing to crates.io. The arrow-rs release train remains one of the most reliable in the ecosystem, which matters because half the Rust data infrastructure world, DataFusion included, builds directly on it.

The discussion on automatic GitHub Copilot reviews produced one of the more thoughtful AI governance conversations in the ASF right now. After two weeks of testing, Cumplido found the reviews useful for ready PRs but wants them disabled for drafts, since a draft signals work in progress and an immediate bot review adds noise. Lamb agreed they help as an initial pass and pushed for documenting what contributors are expected to do with bot feedback. Sutou Kouhei synthesized the feedback into a PR with a pragmatic split: first-time contributors get one policy, returning contributors another. Alenka Frim asked the practical question nobody had answered, which is when Copilot actually considers itself satisfied with a PR, since nobody had seen it grant an approval. Arrow is writing down norms for AI participation in code review while most projects are still improvising, and other communities will likely copy this homework.

The format itself saw movement on two fronts. The arrow.range canonical extension type discussion wrestled with naming and semantics for bounded ranges, with Felipe Oliveira Carvalho proposing distinct types per boundary closedness, half-open, closed, and the variations between, rather than a single parameterized type. And the variant type support thread surfaced a coordination problem: Gang Wu pointed out that several duplicate efforts are underway on variant support in Arrow C++, including work by his colleague Zehua that iceberg-cpp already depends on. Micah Kornfield confirmed community interest and pointed to the freshly opened tracking issue. Duplicate implementations of the same type are wasted effort the dev list exists to prevent, so expect consolidation here.

The Arrow family also grew. Following the donation vote, Benjamin Philip transferred the Arrow Erlang repository to the ASF, and Kouhei confirmed it now lives at apache/arrow-erlang with repository setup landing next week. Flight SQL picked up two small protocol wins, with Pedro Matias closing the vote on the is_update field for prepared statement results with four binding +1s and work proceeding on Go, Java, ADBC, and JDBC implementations, while Richie Black's COLUMN_DEF addition to Flight SQL JDBC schema metadata moved through its own vote. And in a thread that touches Arrow's measurement culture, Rok Mihevc and Jonathan Keane discussed the status of conbench, Arrow's continuous benchmarking project, with Mihevc interested in having his agents work on it and Keane happy to see anyone pick it up. The phrase "having your agents work on it" passing without comment in an ASF dev thread says plenty about where 2026 is.

Arrow's quieter week should not be mistaken for a quiet project. The format has reached the stage where its biggest contributions happen downstream, in arrow-rs powering DataFusion and a growing share of the Rust analytics ecosystem, in ADBC and Flight SQL steadily replacing bespoke wire protocols, and in the C++ library serving as the substrate for iceberg-cpp and the engines built on it. That last dependency is why the variant duplication issue deserves a faster resolution than it might otherwise get. With Iceberg, Parquet, and Spark all converging on variant as the standard answer for semi-structured data, Arrow C++ sits in the critical path for every engine that wants to read shredded variant columns efficiently, and parallel implementations mean review attention is split exactly where the ecosystem can least afford it. Wu naming the problem publicly, with a disclaimer about his colleague's involvement, is the dev list doing its job.

Apache Parquet

Parquet packed 72 emails into just 7 threads, and one of them was the week's heavyweight across the entire lakehouse ecosystem.

The Future of Parquet Versioning discussion ran to 40 messages and pulled in nearly everyone who matters to the format: Ed Seidl, Andrew Lamb, Antoine Pitrou, Micah Kornfield, Daniel Weeks, Russell Spitzer, Ryan Blue, Fokko Driesprong, and Andrew Bell. The thread got off to an inauspicious start when the Google Doc anchoring the discussion started throwing terms of service violations for Seidl, Lamb, and others, an ironic argument for keeping foundational decisions in plain text on the mailing list. The substance is the question Parquet has deferred for a decade: what does a version number actually promise? Bell asked the question every practitioner asks, which is how a reader knows it has the tooling to read a given file, and what the hesitation is to simply bump version numbers. Seidl's answer exposed the uncomfortable status quo: today there is no in-use mechanism beyond parsing the created_by string, which means readers infer capabilities from writer name-dropping. The debate continues over whether Parquet should adopt feature flags, real version increments, or some hybrid, and the outcome will define how the format evolves for its second decade.

The reason this debate is happening now, rather than five years ago, is that Parquet's roadmap has filled up with changes that strain the old informal model. Variant types, geometry types, new statistics, the footer redesign, and dense encodings are all arriving in a short window, and each one forces the same question of how a reader discovers it can safely consume a file. The created_by approach worked when two or three writers dominated and everyone could memorize each other's quirks. With a dozen serious implementations across Java, C++, Rust, Go, and Python, capability discovery by string parsing is a correctness bug waiting to happen at every reader-writer pairing. The versioning thread is really an interoperability thread wearing a version number costume, and the contributors arguing in it know that whatever mechanism wins must serve files that will still be read decades from now. Formats outlive engines, and they outlive companies. That is precisely why 40 messages of careful argument is time well spent.
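To make the fragility concrete, here is a minimal Python sketch of the kind of parsing a reader ends up doing. It assumes the common "application version x.y.z (build ...)" shape for created_by; the regular expression, the function name, and the sample strings are illustrative and not taken from any Parquet implementation.

```python
import re
from typing import Optional, Tuple

# Assumed shape: "<application> version <dotted digits> ..."
_CREATED_BY = re.compile(r"^(?P<app>.+?) version (?P<version>\d+(?:\.\d+)*)")


def parse_created_by(created_by: str) -> Optional[Tuple[str, Tuple[int, ...]]]:
    """Return (application, version tuple), or None if the string does not fit."""
    m = _CREATED_BY.match(created_by)
    if m is None:
        return None
    version = tuple(int(part) for part in m.group("version").split("."))
    return m.group("app"), version


# Sample strings are made up to show the shapes a reader may encounter.
print(parse_created_by("parquet-mr version 1.12.3 (build abc123)"))
print(parse_created_by("some-new-writer 0.1"))  # does not fit the pattern
```

Every writer that deviates from the expected shape, or that fixes a bug without the reader learning the new version threshold, becomes a special case in code like this.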

Lamb attacked the same problem from the documentation side. Convinced by recent discussions that the community must document what V1 and V2 actually mean, messy reality included, he spent several days producing a feature-by-version documentation page. Pitrou pushed back with a precise objection: the page invents an a posteriori meaning for V1 and V2, and he questioned why parquet-format 2.0.0 deserves to be singled out as a meaningful boundary. Lamb conceded that earlier drafts did try to invent definitions and revised toward describing what shipped rather than what the labels should have meant. This exchange is the versioning debate in miniature. The community is discovering that before it can design future versioning, it has to agree on a truthful account of past versioning.

While the philosophy unfolded, the release train kept moving. Gang Wu confirmed in the 2.13.0 release discussion that making ColumnMetaData.path_in_schema optional needs more discussion and will not block the release, with Fokko Driesprong and Kornfield agreeing to proceed. The vote on Apache Parquet Format 2.13.0 RC0 collected binding +1s from Kornfield and others, with Seidl's vote carrying the best line of the week: we have waited long enough for usable float statistics. Sortable floating point statistics have been a known gap for years, and 2.13.0 finally closes it.

The footer redesign work formalized its process. Jiayi Wang scheduled session 2 of the Parquet Footer Working Group, moving to a biweekly cadence, and Pitrou immediately raised the governance flag: for a change as foundational as the footer, decisions cannot be made in sync calls and merely reported to the list afterward. Wang agreed without hesitation, committing that syncs will inform but the mailing list will decide. Given that the footer working group is rethinking how every Parquet reader on earth bootstraps file access, insisting on mailing list primacy is not process pedantry. It is how the ASF model protects a format that multiple competing vendors depend on.

Two type system proposals advanced. Burak Yavuz moved the new File logical type proposal from design doc to pull requests against parquet-format and the reference implementation, after the Parquet sync aligned on keeping the field simple and minimalistic. Daniel Weeks followed up with additional context from the sync discussion. A File logical type gives engines a standard way to represent file references inside Parquet data, which matters for multimodal and document-heavy workloads where tables increasingly point at external binary content. And Divjot Arora closed the loop on the long-running INT96 statistics question, announcing the community has settled on introducing a new ColumnOrder to signal statistics validity for INT96 columns. Seidl endorsed it immediately, noting a new ColumnOrder is far preferable to parsing created_by strings, and offered a Rust proof of concept once the format PR lands. Notice the pattern: two separate threads this week independently identified created_by string parsing as the anti-pattern to eliminate.

Apache DataFusion

DataFusion makes its second appearance in this newsletter with a lighter week by volume, 13 emails across 6 threads, but the quality of its release process was on full display.

The vote on Apache DataFusion 54.0.0 RC1 featured the kind of drama that proves verification works. Matt Butrovich cast a -1 after Comet, the Spark accelerator built on DataFusion, showed large performance regressions on TPC-H and TPC-DS at scale factor 1000 that appeared related to Parquet metadata parsing. Andrew Lamb connected it to a similar report from Adam in the Vortex project tied to new metadata cache size limits. Butrovich investigated further, found that Adam's issue went through the ListingTable API, which Comet does not use, could not reproduce the regression in DataFusion alone, and retracted his -1 while deferring the Comet upgrade pending more investigation. Lamb then announced that the release was approved with 11 +1 votes, 7 of them binding. A downstream consumer running scale factor 1000 benchmarks against a release candidate, and the project taking the result seriously, is exactly how the Rust data stack has earned its reputation.

Lamb also submitted the ASF board report after crowdsourcing input from the community, and opened the 2026 Q3-Q4 roadmap discussion with a tracking ticket inviting the community to say where it wants the project to go. Recognition arrived from inside the foundation too, with Rich Bowen inviting the project to a PlusOne.apache.org interview, citing the 54.0 release, the new Java bindings, and a remarkable growth trajectory. Meanwhile Bob Thomson's infra review brought good news on the resource front: DataFusion has dropped out of the top consumers of ASF shared GitHub runners after recent CI optimization work that Oleks V. helped drive, the same week Iceberg learned it now tops that list. One project's playbook is sitting right there for the other to borrow.

Cross-Project Themes

The week's loudest theme is that format governance is becoming as important as format features. Parquet's 40-message versioning debate, Pitrou's insistence that footer decisions happen on the list rather than in syncs, Iceberg's question about when a draft spec in git becomes a finalized spec, and even the Polaris merge button thread are all the same conversation: as these projects become load-bearing infrastructure for the industry, the process by which they change matters as much as the changes themselves. Two separate Parquet threads independently named created_by string parsing as the failure mode to engineer away, which is what happens when a format relies on convention where it needs specification. Iceberg's day transform clarification, prompted by a real interoperability bug between Spark-written and Go-compacted tables, is the same lesson at smaller scale.

The variant type is now a genuinely cross-project effort, and this week showed both its promise and its coordination cost. Iceberg contributors aligned on shredding policy across implementations, Arrow surfaced duplicate variant implementations in C++ that need consolidation, and iceberg-cpp already depends on one of them. Semi-structured data support is arriving across the whole stack at once, which is exactly why the alignment syncs Neelesh Salian is running matter.

Metadata efficiency formed a third connective thread: Iceberg proposals to cut GET calls and store footer sizes, the Parquet footer working group rethinking file bootstrap, and a DataFusion release candidate nearly held up by metadata cache behavior all point at the same bottleneck. The data files are fast. The metadata round trips are the tax everyone is now optimizing.

Finally, AI is quietly becoming part of how these communities work. Arrow is writing policy for Copilot reviews, Russell Spitzer prototyped Polaris test infrastructure with Claude's help, Iceberg syncs circulate AI-assisted summaries, and Rok Mihevc casually offered his agents for conbench maintenance. None of this was framed as remarkable by the participants, which is the remarkable part.

For practitioners, the week distills into three watch items. First, if you operate REST catalogs or pin client versions in production, the v2 loadTable and capabilities outcome will eventually reach your upgrade planning, so the time to read that thread is before the vote rather than after. Second, the metadata efficiency work across Iceberg and Parquet signals that small-file performance on object storage is getting first-class attention at the format level, which may relieve pressure on some of the compaction gymnastics that teams perform today, even though compaction remains essential for the foreseeable future. Third, the float statistics fix in parquet-format 2.13.0 and the INT96 ColumnOrder decision both close long-standing correctness gaps in predicate pushdown. Engines will pick these up over the coming release cycles, so expect quiet query performance improvements on float-heavy datasets without changing a line of your own code.

Looking Ahead

Watch for the Iceberg C++ 0.3.0 RC3 result and the outcome of the v2 loadTable capabilities debate, which will shape REST catalog evolution well beyond this release cycle. Polaris 1.6.0 branches around June 26, the Parquet footer working group reconvenes June 23 with its mailing-list-first commitment in place, and the parquet-format 2.13.0 vote should close with float statistics finally fixed. The Iceberg patch releases 1.11.1 and 1.10.3 should move to votes shortly, and the Parquet versioning thread shows no sign of slowing down. The Iceberg variant shredding alignment and the Arrow C++ variant consolidation are worth tracking as a pair, since the semi-structured data story only works if both layers land compatible implementations. On the community calendar, Munich hosts the Iceberg Europe meetup July 22, Lakehouse Day EU registration is open for Glasgow in October, and the Iceberg Summit 2027 location conversation is just getting started, with European cities making an energetic early case. If the past week is any guide, the next one will be busy.

Resources & Further Learning

Get Started with Dremio

Try Dremio Free — Build your lakehouse on Iceberg with a free trial
Build a Lakehouse with Iceberg, Parquet, Polaris & Arrow — Learn how Dremio brings the open lakehouse stack together

Free Downloads