

FSx for ONTAP S3 Access Points Lakehouse — What Works, What Doesn't, and Why

TL;DR

Amazon FSx for ONTAP S3 Access Points let you access NAS file data through S3-compatible APIs — without first copying source files to S3.

I tested multiple analytics, AI/ML, and lakehouse access patterns across AWS-native services, open-source engines, and third-party platforms. The results fall into four categories:

| Verified in this series ✅ | Candidate (AWS-documented) 🔎 | Partially resolved, not production-ready ⚠️ | Not suitable for this path ❌ |
| --- | --- | --- | --- |
| Athena, Glue, EMR Spark, Redshift Spectrum, DuckDB Lambda, Trino, Snowflake (with AWS_ACCESS_POINT_ARN) | Bedrock KB, Lake Formation, Quick | Databricks UC (session policy partially resolved; UC table creation and directory listing still blocked) | Delta / Iceberg / Hudi transactional write paths |

The pattern: Read-oriented analytics and flat-file writes (such as Parquet append) worked reliably in my validation environment. Transactional table-format write paths failed in this validation because they require commit semantics (atomic rename, conditional metadata update) that were not satisfied through the FSx S3 AP path.

GitHub Repository: fsxn-lakehouse-integrations


Validation Vocabulary

| Term | Meaning |
| --- | --- |
| Verified | Worked in my test environment with evidence in verification-pack/ |
| Candidate | AWS-documented or related-series path that still requires workload-specific validation |
| Blocked | Failed due to integration-layer behavior observed in validation |
| Not suitable | Failed because required table-format semantics were unavailable or incompatible |

When this article says "Verified," it means the behavior was observed in my test environment and evidence is available. It does not mean production certification or vendor support guarantee.


Why This Matters

Enterprise organizations store petabytes of file data on NAS (NFS/SMB). To analyze this data with modern tools, they typically:

  1. Copy data from NAS to S3 (ETL pipeline)
  2. Register in a catalog (Glue, Unity Catalog)
  3. Query with analytics platform

FSx for ONTAP S3 Access Points eliminate step 1. The same files accessible via NFS/SMB are now queryable via S3 API — zero source-file movement, zero sync pipeline, zero duplicate storage.

Before: NFS/SMB → [ETL Copy] → S3 → Analytics Platform
After:  NFS/SMB ←→ FSx for ONTAP ←→ S3 Access Point → Analytics Platform
                    (same data, same volume)

Note for regulated workloads: "Zero data movement" means source files do not need to be copied from FSx for ONTAP to S3 for the tested access paths. However, metadata, query results, logs, embeddings, temporary files, and derived datasets may still be created by the consuming service. See Note for Regulated Workloads below.


From File Access to AI-Ready Data

Eliminating the copy pipeline is step one. The real business value comes from what you do next — turning raw file data into AI-ready data products that drive business outcomes.

The engines validated in this series form a multi-engine data product journey:

FSx for ONTAP (source of truth)
  ↓ S3 Access Point (zero-copy access)
  ├── Athena / Redshift Spectrum → Ad-hoc discovery, data profiling
  ├── Glue / EMR Spark → ETL, curated Parquet/Iceberg datasets
  ├── DuckDB Lambda → Lightweight validation, cost-optimized queries
  ├── Snowflake External Table → Governed analytics, Cortex AI (summarize, RAG, sentiment)
  ├── Lake Formation → Fine-grained access control (column/row/tag)
  └── Databricks (via DataSync → S3) → ML training, feature engineering, Mosaic AI

The key insight: You don't need to pick one engine. Each platform excels at a different stage of the data product lifecycle:

| Stage | Best engine | Why |
| --- | --- | --- |
| Discover | Athena, DuckDB Lambda | Cheapest way to explore what's in your NAS data |
| Profile & validate | Glue Crawler, DuckDB | Schema discovery, data quality checks |
| Transform & curate | Glue ETL, EMR Spark | Medallion architecture, write curated Parquet |
| Govern | Lake Formation, Snowflake | Column/row/tag access control, governance tags |
| Analyze & share | Snowflake, Redshift Spectrum | Governed analytics, data sharing, Cortex AI |
| Train & predict | Databricks, EMR | ML training, feature store, model serving |

This is not "pick one platform" — it's "use the right engine for each stage, with FSx for ONTAP as the single source of truth."

Concrete Scenario: Manufacturing Quality Team

A manufacturing company stores 50 TB of inspection images, sensor logs, and quality reports on FSx for ONTAP (NFS). Today, the quality team waits 24 hours for nightly batch copies to S3 before they can query defect patterns.

Day 1: Enable S3 Access Point → Athena query confirms sensor data is readable ($0.005). Quality team discovers 3 months of unanalyzed inspection data.

Day 7: Snowflake External Table + AWS_ACCESS_POINT_ARN → Quality team runs CORTEX.SUMMARIZE on inspection notes, CORTEX.SENTIMENT on operator feedback. No COPY INTO needed. Governance tags applied to sensitive equipment IDs.

Day 14: Dynamic Table (TARGET_LAG = '1 hour') enriches sensor data with AI-generated anomaly scores. Cortex Search Service enables "find similar defects" across 3 years of history.

Day 30: Curated quality dataset shared with supplier via Snowflake Data Sharing — governed, auditable, no file transfer. Supplier sees only their component data (Row Access Policy).

Business outcome: Data freshness 24h → 0h. Defect detection time 3 days → 1 hour. Supplier collaboration setup 2 weeks → 1 day.

ONTAP value in this journey:

  • Snapshot: If a bad data load corrupts curated datasets, revert to previous Snapshot in seconds — no re-ingestion needed
  • Dedup + Compression: 50 TB of raw inspection data stored at ~30 TB effective (typical 1.5-2x reduction)
  • FlexClone: Data science team gets an instant zero-copy clone for ML experimentation — no impact on production NFS workloads
  • Multi-protocol: Factory systems continue writing via NFS unchanged. Analytics reads via S3 AP. Same bytes, no sync

Why this matters for partners: This scenario is a single customer journey that touches Athena (discovery), Snowflake (AI + governance + sharing), and ONTAP (source of truth + Snapshot for recovery). Each stage is independently valuable and independently sellable.

Open Table Format: The Multi-Platform Bridge

For customers using both Snowflake and Databricks, the curated datasets created in this journey can be shared via open table formats:

FSx for ONTAP (source) → DataSync → S3 → Snowflake Managed Iceberg Table
                                              ↓
                                    Same Iceberg metadata on S3
                                              ↓
                          ┌───────────────────┼───────────────────┐
                          │                   │                   │
                    Databricks UC       AWS Glue/Athena      Other engines
                    (read Iceberg)      (read Iceberg)       (read Iceberg)

Snowflake Managed Iceberg Tables write data in open Iceberg format to customer-owned S3 storage. This means:

  • No vendor lock-in: Data is in open format, owned by the customer
  • Multi-engine access: Databricks, Athena, EMR, Trino can all read the same Iceberg tables
  • Snowflake manages lifecycle: OPTIMIZE, Time Travel, governance — without locking data into proprietary format
  • ONTAP remains source of truth: Raw data stays on FSx for ONTAP; only curated subsets are promoted to Iceberg

Zero-ETL design principle: The goal is not "zero processing" — it's "no hand-built copy pipelines, no duplicate storage management, no stale data." Where a platform requires data in S3 (Databricks UC, Delta/Iceberg writes), use DataSync as a managed bridge — not a custom ETL pipeline.


Security Model

Every request to FSx for ONTAP S3 Access Points must pass two authorization layers (AWS documentation):

  1. S3-side authorization: IAM identity policy, S3 Access Point policy, VPC endpoint policy (if applicable), SCP
  2. FSx-side authorization: Associated UNIX or Windows file system user permissions on the underlying volume

Both layers must permit the request. A permissive IAM policy does not override restrictive file system permissions, and vice versa.
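The dual-layer rule can be sketched as a simple conjunction. This is a minimal illustration, not how AWS evaluates policies internally: the class and function names are hypothetical, and each check is reduced to a boolean (real IAM evaluation also involves explicit denies, conditions, and more).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3SideDecision:
    """Outcome of each S3-side check (hypothetical names, for illustration only)."""
    iam_identity_policy: bool
    access_point_policy: bool
    vpc_endpoint_policy: bool = True   # treated as True when no VPC endpoint is in the path
    scp: bool = True                   # treated as True when no SCP restricts the call


def request_allowed(s3_side: S3SideDecision, fsx_user_permits: bool) -> bool:
    """A request succeeds only if every S3-side check AND the file system user permit it."""
    s3_ok = all((
        s3_side.iam_identity_policy,
        s3_side.access_point_policy,
        s3_side.vpc_endpoint_policy,
        s3_side.scp,
    ))
    return s3_ok and fsx_user_permits


# Permissive IAM, but the associated file system user cannot read the file: denied.
permissive_iam = S3SideDecision(iam_identity_policy=True, access_point_policy=True)
print(request_allowed(permissive_iam, fsx_user_permits=False))  # False
```

When debugging an AccessDenied, this framing suggests checking the file system user's permissions on the volume as well as the IAM and access point policies.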


The Compatibility Map

Verified (Evidence in verification-pack)

| Platform | Pattern | Benchmark | Cost/Query |
| --- | --- | --- | --- |
| Athena | Serverless SQL via Glue Catalog | 54.8 MB/s (5M rows in 2.2s) | ~$0.0005 |
| DuckDB Lambda | In-process analytics (arm64) | 10K rows in 452ms (warm) | ~$0.00001 |
| EMR Spark | Distributed Spark SQL | 10K rows read+write in 16s | ~$0.001 |
| Redshift Spectrum | DWH + external data JOIN | 5M rows in 4.3s | ~$0.005 |
| Trino | Open-source distributed SQL | 5M rows in 1.5s | Compute cost only |
| Glue ETL | PySpark medallion pipeline | 10K rows transform in 64s | ~$0.02 |

Candidate (AWS-documented, requires workload validation)

| Platform | Pattern | Notes |
| --- | --- | --- |
| Lake Formation | Governance overlay | Table/column-level access behavior observed; production workload validation needed |
| Bedrock KB | RAG document ingestion | Per AWS tutorial; permission-aware retrieval requires separate validation |

Blocked in Validation (Third-Party Platforms)

| Platform | Symptom | Root Cause | Workaround |
| --- | --- | --- | --- |
| Databricks (Unity Catalog) | Subdirectory ls → AccessDenied; CREATE TABLE → fails | Session policy partially resolved with access_point field; prefix-level listing and UC table creation still blocked | Explicit-path spark.read works but without UC table registration, governance features (lineage, tags, fine-grained access) cannot be applied; Instance Profile + boto3 for full access (bypasses UC entirely) |
| Snowflake (External Stage) | ✅ Works with AWS_ACCESS_POINT_ARN | AWS_ACCESS_POINT_ARN stage parameter resolves session policy for GetObject | Full zero-copy analytics: SELECT, External Table, Directory Table, Cortex AI (summarize/translate/sentiment), Governance Tags, Row/Column policies; all verified. See [Part 3] |

Databricks update (2026-05-24): Setting the access_point field on the UC External Location partially resolves the session policy issue. Top-level dbutils.fs.ls, dbutils.fs.head, and spark.read with explicit file paths now succeed. However, UC table creation (CREATE TABLE LOCATION) fails, subdirectory listing is blocked, and write operations are denied. Without UC table registration, Unity Catalog governance features — lineage tracking, fine-grained access control, governance tags, and audit — cannot be applied to the data. This means the data is technically readable but not governable through UC. Support case active — awaiting guidance on table creation and prefix-level access.

Support cases filed with both vendors.

Not Suitable for This Path (Table Format Constraints)

| Format | Write Operation | Why It Failed in Validation |
| --- | --- | --- |
| Delta Lake | INSERT/MERGE/VACUUM | Requires conditional writes (If-None-Match) for _delta_log/ commit; FSx for ONTAP S3 AP returns 501 Not Implemented |
| Apache Iceberg | CREATE TABLE/INSERT | S3FileIO metadata write requires conditional writes for atomic commit; same root cause as Delta |
| Apache Hudi | Upsert/Compaction | Timeline commit requires atomic rename; not available on FSx for ONTAP S3 AP |

Important distinction (read vs write): The failures above are for write operations only. Reading pre-existing Iceberg/Delta tables (where metadata and data files already exist on storage) is theoretically possible via GetObject — but has not been validated on FSx for ONTAP S3 AP. If you have Iceberg tables written to standard S3 (via EMR/Glue), those can be registered in Glue Data Catalog and queried alongside FSx for ONTAP external tables from the same Athena/Redshift session.

In this validation, transactional table writes failed because the tested engines required conditional writes (If-None-Match / put-if-absent) that FSx for ONTAP S3 AP does not support (returns 501 Not Implemented). AWS feature request submitted (May 2026). See API support documentation.
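To make the put-if-absent dependency concrete, here is a minimal in-memory sketch of why a commit protocol needs conditional writes. It is a toy model, not a real S3 client or any table format's actual commit code: `InMemoryStore`, `try_commit`, and both exception classes are hypothetical names, with the exceptions standing in for HTTP 412 and 501 responses.

```python
class PreconditionFailed(Exception):
    """Models HTTP 412: the key already exists."""


class NotImplementedByStore(Exception):
    """Models HTTP 501: the store rejects conditional writes."""


class InMemoryStore:
    """Toy object store used only for this illustration."""

    def __init__(self, supports_conditional_put: bool):
        self.supports_conditional_put = supports_conditional_put
        self.objects = {}

    def put(self, key: str, body: bytes, if_none_match: bool = False) -> None:
        if if_none_match:
            if not self.supports_conditional_put:
                raise NotImplementedByStore(key)
            if key in self.objects:
                raise PreconditionFailed(key)
        self.objects[key] = body


def try_commit(store: InMemoryStore, version: int, writer: str) -> str:
    """Try to claim one log entry with put-if-absent, as a transactional commit would."""
    key = f"_delta_log/{version:020d}.json"
    try:
        store.put(key, writer.encode(), if_none_match=True)
        return "committed"
    except PreconditionFailed:
        return "conflict"       # another writer won this version; retry on the next one
    except NotImplementedByStore:
        return "unsupported"    # no safe way to claim the version on this path


s3_like = InMemoryStore(supports_conditional_put=True)
print(try_commit(s3_like, 1, "writer-a"))   # committed
print(try_commit(s3_like, 1, "writer-b"))   # conflict

ap_like = InMemoryStore(supports_conditional_put=False)
print(try_commit(ap_like, 1, "writer-a"))   # unsupported
```

Falling back to an unconditional put in the "unsupported" case would let two writers silently overwrite the same log entry, which is why the engines fail rather than proceed.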

What DOES work for writes: Flat Parquet/CSV append via PutObject (Athena CTAS, Glue ETL write-back, EMR Spark write, DuckDB COPY TO).


8. ListObjectsV2 Latency Is a Product-Level Characteristic

ListObjectsV2 on FSx S3 AP exhibits higher latency than standard S3 (observed 30-80x for small directories). AWS Support confirmed this as a product-level performance characteristic (May 2026), not an environmental or configuration issue. GetObject latency is acceptable (2-10x standard S3), but listing operations are disproportionately slow. A feature request has been submitted with targets of <1s for <100 files and <3s for <1000 files. Account for this in query planning and catalog refresh operations.
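One way to plan around slow listings is to reuse a listing result for a bounded time instead of re-listing on every query. The sketch below is an assumption-laden illustration, not something validated in this series: `CachedLister` and `fake_list` are hypothetical, and `list_fn` stands in for whatever ListObjectsV2-style call your client makes.

```python
import time
from typing import Callable, Dict, List, Tuple


class CachedLister:
    """Wraps a slow list function and reuses each result for ttl_seconds."""

    def __init__(self, list_fn: Callable[[str], List[str]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._list_fn = list_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def list(self, prefix: str) -> List[str]:
        now = self._clock()
        hit = self._cache.get(prefix)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]                  # served from cache, no listing call
        keys = self._list_fn(prefix)       # the slow listing call
        self._cache[prefix] = (now, keys)
        return keys


calls = []


def fake_list(prefix: str) -> List[str]:
    """Stand-in for a real listing call (hypothetical)."""
    calls.append(prefix)
    return [f"{prefix}part-0000.parquet"]


lister = CachedLister(fake_list, ttl_seconds=300)
lister.list("sensors/dt=2026-05-01/")
lister.list("sensors/dt=2026-05-01/")
print(len(calls))  # 1
```

The trade-off is freshness: files written via NFS after the listing was cached stay invisible until the TTL expires, so the TTL should reflect how stale a listing the workload can tolerate.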

Benchmark Methodology

All benchmark numbers should be read with the following context:

| Parameter | Value |
| --- | --- |
| FSx for ONTAP deployment type | Single-AZ |
| Provisioned throughput | 128 MB/s |
| Region | ap-northeast-1 |
| Dataset shape | 10K rows (250 KB) and 5M rows (103 MB), single Parquet file |
| Run type | Warm (unless noted as cold start) |
| Network path | Internet-origin AP (no VPC attachment for managed services) |

Future benchmark runs will also capture: prefix depth, file count per prefix, average object size, p50/p90/p95/p99 latency where available, and cold/warm/repeated run count.

FSx S3 AP latency is in the tens of milliseconds range, and throughput depends on the file system's provisioned throughput capacity (AWS documentation). These benchmarks are sizing references from one test environment, not service limits or guarantees.
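As a rough sizing aid, the Athena cost figure for the 103 MB dataset can be approximated from data scanned. This sketch assumes a price of $5 per TB scanned and a 10 MB per-query minimum; both constants are assumptions to check against current Athena pricing for your region, and the function name is illustrative.

```python
ASSUMED_PRICE_PER_TB_USD = 5.00   # assumption: verify against current regional pricing
ASSUMED_MIN_BILLED_MB = 10        # assumption: per-query minimum billed scan
MB_PER_TB = 1024 * 1024


def athena_scan_cost_usd(scanned_mb: float) -> float:
    """Estimate the cost of one query from the amount of data it scans."""
    billed_mb = max(scanned_mb, ASSUMED_MIN_BILLED_MB)
    return billed_mb / MB_PER_TB * ASSUMED_PRICE_PER_TB_USD


# 5M-row dataset from the methodology table: 103 MB in a single Parquet file.
print(f"{athena_scan_cost_usd(103):.5f}")  # 0.00049
```

Under these assumptions the estimate lands close to the ~$0.0005 reported for Athena; columnar pruning or partition projection would reduce the scanned size and the cost with it.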


Architecture Decision Guide

Q: Do you need to WRITE transactional tables (Delta/Iceberg)?
  → Yes: Use native S3 for write path; FSx S3 AP for read-only source data
  → No: FSx S3 AP can handle the read-oriented and flat-file write patterns validated in this series

Q: Do you need lower latency or higher concurrency than provisioned throughput can deliver?
  → Yes: Use native S3
  → No: FSx S3 AP (tens of ms, provisioned throughput)

Q: Do you have existing NAS data you want to analyze?
  → Yes: FSx S3 AP eliminates the copy pipeline
  → No: Native S3 may be simpler

Q: Do you need NFS/SMB access alongside S3 analytics?
  → Yes: FSx S3 AP (multi-protocol on same data)
  → No: Evaluate based on above
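The decision guide can also be expressed as a small function, which is handy for recording the outcome of an architecture review. This is a simplified sketch: the `Workload` field names and the wording of each recommendation are illustrative paraphrases of the guide, not an official rule set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    """Answers to the four questions in the guide (field names are illustrative)."""
    writes_transactional_tables: bool
    needs_latency_or_concurrency_beyond_fsx: bool
    has_existing_nas_data: bool
    needs_nfs_smb_alongside_s3: bool


def recommend(w: Workload) -> List[str]:
    """Collect one recommendation per question answered 'Yes'."""
    notes = []
    if w.writes_transactional_tables:
        notes.append("Native S3 for the Delta/Iceberg write path; FSx S3 AP as read-only source")
    if w.needs_latency_or_concurrency_beyond_fsx:
        notes.append("Native S3 for latency- or concurrency-critical access")
    if w.has_existing_nas_data:
        notes.append("FSx S3 AP to query NAS data without a copy pipeline")
    if w.needs_nfs_smb_alongside_s3:
        notes.append("FSx S3 AP for multi-protocol access to the same data")
    if not notes:
        notes.append("Native S3 may be simpler")
    return notes


# Example: existing NAS data, NFS clients stay in place, curated tables use Iceberg.
for line in recommend(Workload(True, False, True, True)):
    print(line)
```

A workload can land on both paths at once, which matches the hybrid pattern described earlier: FSx S3 AP for source data, native S3 for transactional table storage.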

Decision Criteria

Scale when:

  • Business metric improves (freshness, cost, time-to-insight)
  • Governance path is approved
  • Performance impact is within threshold

Adjust when:

  • Engine works but governance or performance needs redesign
  • Staging to native S3 is required for write path

Stop when:

  • Transactional table write semantics are mandatory on the same path
  • Vendor session policy blocks production path with no approved workaround
  • Security owner rejects the access model

Why FSx for ONTAP, Not Just S3?

A common objection: "Why not put data directly on S3 and skip FSx for ONTAP entirely?"

| Consideration | S3 only | FSx for ONTAP + S3 AP |
| --- | --- | --- |
| Existing NFS/SMB workloads | Must migrate or maintain dual paths | No change; existing apps continue on NFS/SMB |
| Storage efficiency | No dedup/compression (pay for every byte) | ONTAP dedup + compression (1.5-2x typical reduction) |
| Point-in-time recovery | S3 Versioning (per-object, costly at scale) | ONTAP Snapshot (volume-level, instant, space-efficient) |
| Dev/test data provisioning | Full copy required | FlexClone (instant zero-copy clone) |
| Multi-protocol access | S3 only | NFS + SMB + S3 on same data simultaneously |
| Application changes needed | Yes (rewrite to S3 SDK) | No (NFS/SMB unchanged, S3 AP is additive) |

The answer: If you're starting fresh with no existing file data, S3 is simpler. If you have existing NAS workloads (and most enterprises do), FSx for ONTAP lets you add analytics without disrupting applications or duplicating data.


Business Value Hypotheses

| Business issue | Baseline metric | Expected value | Validation path | Decision owner |
| --- | --- | --- | --- | --- |
| NAS analytics requires nightly copy to S3 | Copy pipeline runtime, freshness lag | Reduce data freshness lag to near-zero | Athena / Glue / EMR direct query | Data platform owner |
| Enterprise documents are hard to search | Avg search time per user | Faster document discovery | Bedrock KB / permission-aware RAG | Information management owner |
| ETL pipeline duplicates storage | Duplicate storage cost | Lower copy and storage overhead | Glue / EMR write-back to same volume | Storage / FinOps owner |
| Platform selection is unclear | Weeks spent on PoC | Faster architecture decision | This compatibility map | Architecture lead |

Partner Offer Paths

| Customer need | Suggested offer | Exit decision |
| --- | --- | --- |
| Query NAS data without copy | Athena / Redshift Spectrum validation pilot | Scale / adjust / stop |
| ETL from NAS to curated Parquet | Glue or EMR Serverless validation sprint | Production design / stage to S3 |
| RAG over enterprise documents | Bedrock KB / permission-aware RAG assessment | Proceed only with authorization model validated |
| Databricks lakehouse integration | UC External Location with access_point field for read; staging to native S3 for Delta write | File-level read works under UC; subdirectory listing and table creation pending vendor resolution |
| Transactional table write | Native S3 table storage design | FSx S3 AP as source, not table log storage |

The purpose of these offers is not to force every workload onto FSx S3 AP, but to quickly identify the right access path, the right engine, and the right stop condition.


Why Snowflake When Athena Is $0.0005/Query?

Partners and customers often ask: "Athena is serverless and cheap — why would I pay for Snowflake?"

Capability Athena Snowflake External Table
Ad-hoc SQL query
Cost per query (10K rows) ~$0.0005 ~$0.01 (XS warehouse)
AI functions (summarize, translate, sentiment) ✅ Zero-copy, no COPY INTO
Governance tags + column masking Via Lake Formation (separate setup) ✅ Built-in (same platform)
Row Access Policy Via Lake Formation ✅ Built-in
Data Sharing (cross-org, governed) ✅ Snowflake Data Sharing / Marketplace
Cortex Search (semantic RAG) ✅ (requires Dynamic Table)
Materialized View on external data

The answer: Athena is the cheapest discovery tool. Snowflake is the fastest path to governed AI + data sharing. They are not competing — they serve different stages of the data product lifecycle. Start with Athena to prove the data is queryable, then graduate to Snowflake when the customer needs AI functions, governance, or cross-organization data sharing.


Key Technical Findings

1. Internet-Origin AP Required for Managed Services

In this validation, managed service paths (Athena, Glue, Redshift Spectrum, Bedrock) required internet-origin access points because the service access path did not originate from the customer VPC. Validate this per service, region, and network configuration.

2. Parquet Timestamp Compatibility

In this validation, pandas and DuckDB produced Parquet files with nanosecond timestamps by default, and Spark (Glue, EMR) could not read those files. Write timestamps at microsecond resolution for cross-engine compatibility.

3. EMRFS vs S3A

EMR's EMRFS (s3://) natively supports S3 AP aliases. The S3A FileSystem (s3a://) does NOT work with AP aliases (URL parsing error). Use s3:// prefix in EMR.

4. DuckDB httpfs Configuration

DuckDB requires s3_url_style = 'path' and explicit s3_endpoint to work with S3 AP aliases. In Lambda, also set home_directory = '/tmp'.
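The settings above can be assembled as plain SQL statements before a connection is opened. This is a sketch under stated assumptions: `duckdb_s3ap_settings` is a hypothetical helper, and the endpoint value in the usage line is only an illustrative guess, since the right endpoint depends on your region and setup.

```python
from typing import List


def duckdb_s3ap_settings(s3_endpoint: str, in_lambda: bool = False) -> List[str]:
    """Build the DuckDB SET statements described above.

    s3_endpoint is supplied by the caller; this sketch does not guess it.
    """
    if "'" in s3_endpoint:
        raise ValueError("endpoint must not contain a single quote")
    statements = []
    if in_lambda:
        # Lambda's file system is writable only under /tmp.
        statements.append("SET home_directory = '/tmp';")
    statements.append("SET s3_url_style = 'path';")
    statements.append(f"SET s3_endpoint = '{s3_endpoint}';")
    return statements


# The endpoint below is an assumption for illustration only.
for stmt in duckdb_s3ap_settings("s3.ap-northeast-1.amazonaws.com", in_lambda=True):
    print(stmt)
```

Each statement can then be passed to `execute()` on a DuckDB connection after the httpfs extension is loaded.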

5. Trino Hive Connector

Trino requires hive.s3.path-style-access=true and explicit hive.s3.endpoint to resolve S3 AP aliases. Same pattern as DuckDB — path-style access is the key.

6. S3 Gateway Endpoint Routing

VPC-attached compute (Lambda in VPC, EC2) may experience timeouts when accessing FSx S3 AP through an S3 Gateway VPC Endpoint. The FSx S3 AP alias resolves to s3-r-w.<region>.amazonaws.com which may not route correctly through the Gateway endpoint. Workaround: use NAT Gateway or place compute outside VPC. See FSx S3 AP Networking Considerations.

7. Session Policy Is the Common Blocker for Third-Party Platforms

The session policy issue is not unique to one vendor in this validation. It may affect any analytics platform that applies restrictive AssumeRole session policies designed around standard S3 bucket ARN patterns. AWS-native services work because they use IAM roles directly without intermediary session policies.
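The mechanism can be illustrated with a toy model: when a session policy is attached to an assumed role, the effective permissions are the intersection of the role's policy and the session policy. The sketch below reduces policies to lists of allowed resource patterns and uses made-up ARNs; it assumes the access point ARN follows the standard S3 access point shape and ignores actions, conditions, and explicit denies.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Optional


def allowed(resource_arn: str, allow_patterns: Iterable[str]) -> bool:
    """True if any Allow pattern matches the resource ARN (wildcards only)."""
    return any(fnmatchcase(resource_arn, p) for p in allow_patterns)


def effective_allow(resource_arn: str, role_patterns: Iterable[str],
                    session_patterns: Optional[Iterable[str]] = None) -> bool:
    """With a session policy, access needs BOTH the role policy and the session policy."""
    if not allowed(resource_arn, role_patterns):
        return False
    if session_patterns is None:      # role used directly, no session policy
        return True
    return allowed(resource_arn, session_patterns)


# Hypothetical ARNs for illustration.
ap_object = "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/my-ap/object/data/a.parquet"
role = ["arn:aws:s3:ap-northeast-1:111122223333:accesspoint/my-ap/object/*"]
bucket_style_session = ["arn:aws:s3:::my-bucket/*"]

print(effective_allow(ap_object, role))                        # True
print(effective_allow(ap_object, role, bucket_style_session))  # False
```

The role is correct in both calls; only the bucket-shaped session policy fails to match the access point ARN, which is consistent with the behavior observed for third-party platforms in this validation.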


Note for Regulated Workloads

"Zero data movement" means source files do not need to be copied from FSx for ONTAP to S3 for the tested access paths. However, metadata, query results, logs, embeddings, temporary files, and derived datasets may still be created by the consuming service.

For regulated workloads, validate:

  • Data classification of source and derived data
  • Derived data location (query results, embeddings, temp files)
  • Encryption and key ownership at each layer
  • Audit log coverage (CloudTrail, platform logs, ONTAP audit)
  • Retention and deletion policy
  • Approval owner and expiration date

Bedrock KB is a strong candidate for RAG over NAS documents, but regulated use cases must validate permission-aware retrieval, data classification, human review requirements, and residual risk acceptance before production use.

For regulated workloads, do not start a PoC until the data owner, security owner, and platform owner agree on the allowed prefixes, derived data locations, logging scope, rollback plan, and approval expiration date.

Assurance artifacts to prepare:

  • Non-technical overview for stakeholders
  • Data flow diagram (source → AP → service → output)
  • Access control summary (dual-layer authorization)
  • Audit evidence summary
  • Rollback plan
  • Residual risk register

Store these artifacts with an approval ID, owner, review date, and expiration date so the PoC decision can be audited later.


GenAI / RAG Evaluation Metrics

For GenAI and RAG workloads on FSx for ONTAP data, measure:

  • Retrieval accuracy (relevant documents returned)
  • Permission-aware retrieval pass rate (unauthorized documents NOT returned)
  • Hallucination reduction vs baseline
  • Data freshness lag (NFS write → S3 AP availability)
  • Human review workload
  • User time saved vs previous search method

Start with read-only, permission-aware, human-review-attached PoC before production deployment.
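The permission-aware retrieval pass rate can be computed per query as "every returned document is within the caller's authorized set." The following is a minimal sketch with hypothetical document IDs and a hypothetical function name; how you obtain the authorized set (for example from file system ACLs) is outside its scope.

```python
from typing import Iterable, Set


def permission_aware_pass_rate(results: Iterable[Set[str]],
                               authorized: Iterable[Set[str]]) -> float:
    """Fraction of queries whose returned documents are all within the authorized set."""
    pairs = list(zip(results, authorized))
    if not pairs:
        raise ValueError("no queries to evaluate")
    passed = sum(1 for returned, permitted in pairs if returned <= permitted)
    return passed / len(pairs)


# Hypothetical evaluation data: the second query leaks a document the user may not read.
returned_docs = [{"doc-a"}, {"doc-b", "doc-secret"}, set()]
authorized_docs = [{"doc-a", "doc-b"}, {"doc-b"}, {"doc-a"}]
print(permission_aware_pass_rate(returned_docs, authorized_docs))  # 0.6666666666666666
```

For regulated workloads the target for this metric is usually 100%, so a single failing query is worth investigating individually rather than averaging away.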


Series Index

This is the series overview for "FSx for ONTAP S3 Access Points × Lakehouse Deep Dive."

| Part | Platform | Status | URL |
| --- | --- | --- | --- |
| Part 1 | Athena — Query NAS Data In Place | ✅ Published | dev.to |
| Part 2 | Databricks — A Layer-by-Layer Validation of Observed Boundaries | ✅ Published | |
| Part 3 | Snowflake — From 'Access Denied' to Working External Tables | ✅ Resolved | |
| Part 4 | DuckDB Lambda — Serverless for $0.00001/query | Ready to publish | |
| Part 5 | EMR Spark — Read-Write ETL Pipeline | Ready to publish | |
| Part 6 | Redshift Spectrum — DWH Meets NAS Data | Coming soon | |
| Part 7 | Trino — Open-Source SQL on NAS Data | Coming soon | |
| Summary | This article (Overview — What Works and What Doesn't) | Ready to publish | |

Note: This overview article can be published as the final "summary" post in the series, or as a standalone reference.

Update to Part 1 (Athena)

Since Part 1 was published, additional verification has been completed and published as a v1.1 update:

  • CTAS write-back: Verified as WORKING (3.7s, writes Parquet back to FSxN S3 AP)
  • Partition projection: Verified with Hive-style partitioning
  • Benchmark: 54.8 MB/s peak throughput (5M rows, 103 MB scan in 2.2s)
  • 9/9 negative tests pass: Unauthorized access correctly denied

Try It Yourself

git clone https://github.com/Yoshiki0705/fsxn-lakehouse-integrations.git
cd fsxn-lakehouse-integrations

# Deploy base infrastructure
aws cloudformation deploy \
  --template-file shared/cloudformation/fsxn-s3ap-base.yaml \
  --stack-name fsxn-lakehouse-base \
  --capabilities CAPABILITY_IAM

# Validate connectivity
python shared/scripts/validate-access.py --access-point-alias <your-ap-alias>

# Choose your platform: integrations/athena/, integrations/duckdb/, etc.

Each integration directory includes a README, CloudFormation template, deployment script, and sample queries.


What's Next

  • Databricks UC + access_point field — partial success confirmed (2026-05-24); awaiting vendor guidance on subdirectory listing and table creation
  • Snowflake AWS_ACCESS_POINT_ARN: resolved (2026-05-24); SELECT and External Table work with stage parameter
  • Apache Iceberg community engagement (S3FileIO + AP alias support)
  • ONTAP feature quantification (dedup ratio, snapshot RTO): unblocked (orphaned DNS/AD configuration removed, S3 AP recovered 2026-05-24)
  • Redshift Spectrum and Trino deep-dive articles
  • Customer PoC execution with measured business outcomes

Operational Lessons Learned

S3 AP Timeout Caused by Orphaned DNS/AD Configuration (2026-05-24)

During this series validation, all S3 APs on one SVM became unresponsive for 7+ days. Root cause: the SVM had DNS servers configured for an AD domain that no longer existed. When the S3 AP backend processes requests on an AD-joined SVM, ONTAP's name-service stack attempts DNS resolution for user-mapping — if DNS is unreachable, requests block until timeout.

Key findings:

  • Disabling customer-configured FPolicy did NOT fix the issue
  • A separate SVM without DNS/AD worked normally on the same file system
  • Removing the orphaned CIFS/DNS configuration restored S3 AP instantly

Prevention: Do not leave orphaned DNS/AD configurations on SVMs used for S3 AP access. If AD is decommissioned, clean up vserver cifs and vserver services dns settings. See FSx S3 AP Networking — Section 7 for full details.




This series is based on hands-on verification, not documentation review. Every "Verified" claim has a corresponding evidence record in the verification-pack/ directory.

Disclaimer: This article is an independent validation report and does not represent AWS, NetApp, Databricks, or Snowflake official guidance. Product behavior, support status, and platform capabilities may change. Always validate in your own environment and consult vendor documentation and support channels.
