Yoshiki Fujiwara(藤原善基)@AWS Community Builder for AWS Community Builders

Posted on May 24 • Edited on May 28

FSx for ONTAP S3 Access Points Lakehouse — What Works, What Doesn't, and Why

#aws #amazonfsxfornetappontap #lakehouse #dataengineering

TL;DR

Amazon FSx for ONTAP S3 Access Points let you access NAS file data through S3-compatible APIs — without first copying source files to S3.

I tested multiple analytics, AI/ML, and lakehouse access patterns across AWS-native services, open-source engines, and third-party platforms. The results fall into four categories:

Verified in this series ✅	Candidate (AWS-documented) 🔎	Partially resolved, not production-ready ⚠️	Not suitable for this path ❌
Athena, Glue, EMR Spark, Redshift Spectrum, DuckDB Lambda, Trino, Snowflake (with `AWS_ACCESS_POINT_ARN`)	Bedrock KB, Lake Formation, Quick	Databricks UC (session policy partially resolved; UC table creation and directory listing still blocked)	Delta / Iceberg / Hudi transactional write paths

The pattern: Read-oriented analytics and flat-file writes (such as Parquet append) worked reliably in my validation environment. Transactional table-format write paths failed in this validation because they require commit semantics (atomic rename, conditional metadata update) that were not satisfied through the FSx S3 AP path.

GitHub Repository: fsxn-lakehouse-integrations

Validation Vocabulary

Term	Meaning
Verified	Worked in my test environment with evidence in verification-pack/
Candidate	AWS-documented or related-series path that still requires workload-specific validation
Blocked	Failed due to integration-layer behavior observed in validation
Not suitable	Failed because required table-format semantics were unavailable or incompatible

When this article says "Verified," it means the behavior was observed in my test environment and evidence is available. It does not imply production certification or a vendor support guarantee.

Why This Matters

Enterprise organizations store petabytes of file data on NAS (NFS/SMB). To analyze this data with modern tools, they typically:

1. Copy data from NAS to S3 (ETL pipeline)
2. Register in a catalog (Glue, Unity Catalog)
3. Query with an analytics platform

FSx for ONTAP S3 Access Points eliminate step 1. The same files accessible via NFS/SMB are now queryable via S3 API — zero source-file movement, zero sync pipeline, zero duplicate storage.

Before: NFS/SMB → [ETL Copy] → S3 → Analytics Platform
After:  NFS/SMB ←→ FSx for ONTAP ←→ S3 Access Point → Analytics Platform
                    (same data, same volume)
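As a minimal sketch of the "After" path, the function below reads one object through an S3 Access Point alias. It assumes a client that behaves like `boto3.client("s3")`; the alias goes where a bucket name would normally go. The function name is illustrative, and the client is injected so the sketch carries no hard dependency on AWS credentials.

```python
from typing import Any


def read_via_access_point(client: Any, ap_alias: str, key: str) -> bytes:
    """Read one object through an S3 Access Point alias.

    `client` is expected to behave like boto3.client("s3"). The alias is
    passed as the Bucket argument, exactly where a bucket name would go.
    """
    response = client.get_object(Bucket=ap_alias, Key=key)
    return response["Body"].read()
```

With boto3 installed, a call might look like `read_via_access_point(boto3.client("s3"), "<your-ap-alias>", "sensors/data.parquet")`, where the key is a hypothetical placeholder.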

Note for regulated workloads: "Zero data movement" means source files do not need to be copied from FSx for ONTAP to S3 for the tested access paths. However, metadata, query results, logs, embeddings, temporary files, and derived datasets may still be created by the consuming service. See Note for Regulated Workloads below.

From File Access to AI-Ready Data

Eliminating the copy pipeline is step one. The real business value comes from what you do next — turning raw file data into AI-ready data products that drive business outcomes.

The engines validated in this series form a multi-engine data product journey:

FSx for ONTAP (source of truth)
  ↓ S3 Access Point (zero-copy access)
  ├── Athena / Redshift Spectrum → Ad-hoc discovery, data profiling
  ├── Glue / EMR Spark → ETL, curated Parquet/Iceberg datasets
  ├── DuckDB Lambda → Lightweight validation, cost-optimized queries
  ├── Snowflake External Table → Governed analytics, Cortex AI (summarize, RAG, sentiment)
  ├── Lake Formation → Fine-grained access control (column/row/tag)
  └── Databricks (via DataSync → S3) → ML training, feature engineering, Mosaic AI

The key insight: You don't need to pick one engine. Each platform excels at a different stage of the data product lifecycle:

Stage	Best engine	Why
Discover	Athena, DuckDB Lambda	Cheapest way to explore what's in your NAS data
Profile & validate	Glue Crawler, DuckDB	Schema discovery, data quality checks
Transform & curate	Glue ETL, EMR Spark	Medallion architecture, write curated Parquet
Govern	Lake Formation, Snowflake	Column/row/tag access control, governance tags
Analyze & share	Snowflake, Redshift Spectrum	Governed analytics, data sharing, Cortex AI
Train & predict	Databricks, EMR	ML training, feature store, model serving

This is not "pick one platform" — it's "use the right engine for each stage, with FSx for ONTAP as the single source of truth."

Concrete Scenario: Manufacturing Quality Team

A manufacturing company stores 50 TB of inspection images, sensor logs, and quality reports on FSx for ONTAP (NFS). Today, the quality team waits 24 hours for nightly batch copies to S3 before they can query defect patterns.

Day 1: Enable S3 Access Point → Athena query confirms sensor data is readable ($0.005). Quality team discovers 3 months of unanalyzed inspection data.

Day 7: Snowflake External Table + AWS_ACCESS_POINT_ARN → Quality team runs CORTEX.SUMMARIZE on inspection notes, CORTEX.SENTIMENT on operator feedback. No COPY INTO needed. Governance tags applied to sensitive equipment IDs.

Day 14: Dynamic Table (TARGET_LAG = '1 hour') enriches sensor data with AI-generated anomaly scores. Cortex Search Service enables "find similar defects" across 3 years of history.

Day 30: Curated quality dataset shared with supplier via Snowflake Data Sharing — governed, auditable, no file transfer. Supplier sees only their component data (Row Access Policy).

Target business outcome (illustrative, not measured): Data freshness 24h → 0h. Defect detection time 3 days → 1 hour. Supplier collaboration setup 2 weeks → 1 day.

ONTAP value in this journey:

Snapshot: If a bad data load corrupts curated datasets, revert to previous Snapshot in seconds — no re-ingestion needed
Dedup + Compression: 50 TB of raw inspection data stored at ~30 TB effective (typical 1.5-2x reduction)
FlexClone: Data science team gets an instant zero-copy clone for ML experimentation — no impact on production NFS workloads
Multi-protocol: Factory systems continue writing via NFS unchanged. Analytics reads via S3 AP. Same bytes, no sync

Why this matters for partners: This scenario is a single customer journey that touches Athena (discovery), Snowflake (AI + governance + sharing), and ONTAP (source of truth + Snapshot for recovery). Each stage is independently valuable and independently sellable.

Open Table Format: The Multi-Platform Bridge

For customers using both Snowflake and Databricks, the curated datasets created in this journey can be shared via open table formats:

FSx for ONTAP (source) → DataSync → S3 → Snowflake Managed Iceberg Table
                                              ↓
                                    Same Iceberg metadata on S3
                                              ↓
                          ┌───────────────────┼───────────────────┐
                          │                   │                   │
                    Databricks UC       AWS Glue/Athena      Other engines
                    (read Iceberg)      (read Iceberg)       (read Iceberg)

Snowflake Managed Iceberg Tables write data in open Iceberg format to customer-owned S3 storage. This means:

No vendor lock-in: Data is in open format, owned by the customer
Multi-engine access: Databricks, Athena, EMR, Trino can all read the same Iceberg tables
Snowflake manages lifecycle: OPTIMIZE, Time Travel, governance — without locking data into proprietary format
ONTAP remains source of truth: Raw data stays on FSx for ONTAP; only curated subsets are promoted to Iceberg

Zero-ETL design principle: The goal is not "zero processing" — it's "no hand-built copy pipelines, no duplicate storage management, no stale data." Where a platform requires data in S3 (Databricks UC, Delta/Iceberg writes), use DataSync as a managed bridge — not a custom ETL pipeline.

Security Model

Every request to FSx for ONTAP S3 Access Points must pass two authorization layers (AWS documentation):

S3-side authorization: IAM identity policy, S3 Access Point policy, VPC endpoint policy (if applicable), SCP
FSx-side authorization: Associated UNIX or Windows file system user permissions on the underlying volume

Both layers must permit the request. A permissive IAM policy does not override restrictive file system permissions, and vice versa.
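The "both layers must permit" rule can be modeled as a logical AND, which is handy when triaging an AccessDenied. This is only an illustrative model with hypothetical names; the real evaluation happens inside AWS and ONTAP.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AuthorizationResult:
    s3_side_allowed: bool   # IAM, access point policy, VPC endpoint policy, SCP
    fsx_side_allowed: bool  # UNIX or Windows file system user permissions

    @property
    def allowed(self) -> bool:
        # Neither layer can override the other: both must permit the request.
        return self.s3_side_allowed and self.fsx_side_allowed

    def denied_by(self) -> List[str]:
        """Name the layer(s) to inspect when a request is denied."""
        layers = []
        if not self.s3_side_allowed:
            layers.append("s3-side")
        if not self.fsx_side_allowed:
            layers.append("fsx-side")
        return layers
```

For example, `AuthorizationResult(True, False).denied_by()` points at the file system permissions even though the IAM policy is permissive.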

The Compatibility Map

Verified (Evidence in verification-pack)

Platform	Pattern	Benchmark	Cost/Query
Athena	Serverless SQL via Glue Catalog	54.8 MB/s (5M rows in 2.2s)	~$0.0005
DuckDB Lambda	In-process analytics (arm64)	10K rows in 452ms (warm)	~$0.00001
EMR Spark	Distributed Spark SQL	10K rows read+write in 16s	~$0.001
Redshift Spectrum	DWH + external data JOIN	5M rows in 4.3s	~$0.005
Trino	Open-source distributed SQL	5M rows in 1.5s	Compute cost only
Glue ETL	PySpark medallion pipeline	10K rows transform in 64s	~$0.02
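The Athena cost figure can be sanity-checked with simple arithmetic. The sketch below assumes scan-based pricing of $5 per TB with a 10 MB per-query minimum. Both values are assumptions to confirm against current pricing for your region; neither comes from the verification pack.

```python
def estimate_scan_cost_usd(
    scanned_mb: float,
    price_per_tb_usd: float = 5.0,  # assumed list price, verify per region
    minimum_mb: float = 10.0,       # assumed per-query billing minimum
) -> float:
    """Estimate the cost of one scan-priced query."""
    billable_mb = max(scanned_mb, minimum_mb)
    billable_tb = billable_mb / (1024 * 1024)  # MB -> TB (binary units)
    return billable_tb * price_per_tb_usd


# The 5M-row benchmark scanned 103 MB.
cost = estimate_scan_cost_usd(103)
print(f"~${cost:.4f} per query")  # ~$0.0005 per query
```

Under these assumptions the 103 MB scan lands at roughly $0.0005, in line with the table above.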

Candidate (AWS-documented, requires workload validation)

Platform	Pattern	Notes
Lake Formation	Governance overlay	Table/column-level access behavior observed; production workload validation needed
Bedrock KB	RAG document ingestion	Per AWS tutorial; permission-aware retrieval requires separate validation

Blocked in Validation (Third-Party Platforms)

Platform	Symptom	Root Cause	Workaround
Databricks (Unity Catalog)	Subdirectory ls → AccessDenied; CREATE TABLE → fails	Session policy partially resolved with `access_point` field; prefix-level listing and UC table creation still blocked	Explicit-path spark.read works but without UC table registration, governance features (lineage, tags, fine-grained access) cannot be applied; Instance Profile + boto3 for full access (bypasses UC entirely)
Snowflake (External Stage)	✅ Works with `AWS_ACCESS_POINT_ARN`	`AWS_ACCESS_POINT_ARN` stage parameter resolves session policy for GetObject	Full zero-copy analytics: SELECT, External Table, Directory Table, Cortex AI (summarize/translate/sentiment), Governance Tags, Row/Column policies — all verified. See [Part 3]

Databricks update (2026-05-24): Setting the access_point field on the UC External Location partially resolves the session policy issue. Top-level dbutils.fs.ls, dbutils.fs.head, and spark.read with explicit file paths now succeed. However, UC table creation (CREATE TABLE LOCATION) fails, subdirectory listing is blocked, and write operations are denied. Without UC table registration, Unity Catalog governance features — lineage tracking, fine-grained access control, governance tags, and audit — cannot be applied to the data. This means the data is technically readable but not governable through UC. Support case active — awaiting guidance on table creation and prefix-level access.

Support cases filed with both vendors.

Not Suitable for This Path (Table Format Constraints)

Format	Write Operation	Why It Failed in Validation
Delta Lake	INSERT/MERGE/VACUUM	Requires conditional writes (`If-None-Match`) for `_delta_log/` commit — FSx for ONTAP S3 AP returns 501 Not Implemented
Apache Iceberg	CREATE TABLE/INSERT	S3FileIO metadata write requires conditional writes for atomic commit — same root cause as Delta
Apache Hudi	Upsert/Compaction	Timeline commit requires atomic rename — not available on FSx for ONTAP S3 AP

Important distinction (read vs write): The failures above are for write operations only. Reading pre-existing Iceberg/Delta tables (where metadata and data files already exist on storage) is theoretically possible via GetObject — but has not been validated on FSx for ONTAP S3 AP. If you have Iceberg tables written to standard S3 (via EMR/Glue), those can be registered in Glue Data Catalog and queried alongside FSx for ONTAP external tables from the same Athena/Redshift session.

In this validation, transactional table writes failed because the tested engines required conditional writes (If-None-Match / put-if-absent) that FSx for ONTAP S3 AP does not support (returns 501 Not Implemented). AWS feature request submitted (May 2026). See API support documentation.
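To check this behavior in your own environment, a small probe can issue a put-if-absent request and inspect the HTTP status. The sketch assumes a client that behaves like `boto3.client("s3")` in a version that accepts the `IfNoneMatch` parameter on `put_object`. The function name is hypothetical, and errors other than 501 and 412 are re-raised rather than interpreted.

```python
from typing import Any


def supports_put_if_absent(client: Any, bucket: str, key: str) -> bool:
    """Return False if the endpoint rejects If-None-Match with HTTP 501.

    `client` is expected to behave like boto3.client("s3"). Any other
    error (for example AccessDenied) is re-raised rather than guessed at.
    """
    try:
        client.put_object(Bucket=bucket, Key=key, Body=b"", IfNoneMatch="*")
    except Exception as exc:  # botocore's ClientError carries a .response dict
        response = getattr(exc, "response", None) or {}
        status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
        if status == 501:
            return False  # conditional writes not implemented on this path
        if status == 412:
            return True   # precondition failed: the key already existed
        raise
    return True
```

A successful probe leaves a zero-byte object behind, so use a throwaway key and delete it afterwards.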

What DOES work for writes: Flat Parquet/CSV append via PutObject (Athena CTAS, Glue ETL write-back, EMR Spark write, DuckDB COPY TO).
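One way to see why flat-file append needs no commit protocol: every write creates a new, uniquely named object, so writers never rename or overwrite each other's files. The sketch below is a hedged illustration with a hypothetical naming scheme; engines such as Spark or Athena choose their own file names.

```python
import uuid
from typing import Any


def append_part(client: Any, bucket: str, prefix: str, body: bytes,
                suffix: str = ".parquet") -> str:
    """Write one new file under `prefix` and return its key.

    Each call creates a uniquely named object, so writers never overwrite
    or rename each other's files and no commit log is involved.
    """
    key = f"{prefix.rstrip('/')}/part-{uuid.uuid4().hex}{suffix}"
    client.put_object(Bucket=bucket, Key=key, Body=body)
    return key
```

Readers that list the prefix simply pick up the new file on the next scan. The trade-off is that there is no atomic multi-file commit, which is exactly what the table formats above require.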

8. ListObjectsV2 Latency Is a Product-Level Characteristic

ListObjectsV2 on FSx S3 AP exhibits higher latency than standard S3 (observed 30-80x for small directories). AWS Support confirmed this as a product-level performance characteristic (May 2026) — not an environmental or configuration issue. GetObject performance is acceptable (2-10x S3), but listing operations are disproportionately slow. Feature request submitted with targets: <1s for <100 files, <3s for <1000 files. Plan for this in query planning and catalog refresh operations.

Benchmark Methodology

All benchmark numbers should be read with the following context:

Parameter	Value
FSx for ONTAP deployment type	Single-AZ
Provisioned throughput	128 MB/s
Region	ap-northeast-1
Dataset shape	10K rows (250 KB) and 5M rows (103 MB), single Parquet file
Run type	Warm (unless noted as cold start)
Network path	Internet-origin AP (no VPC attachment for managed services)

Future benchmark runs will also capture: prefix depth, file count per prefix, average object size, p50/p90/p95/p99 latency where available, and cold/warm/repeated run count.
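A minimal sketch of how repeated-run latency percentiles could be captured, using only the Python standard library. The helper names are hypothetical, and the percentile method (inclusive interpolation) is my choice rather than something the verification pack prescribes.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(operation: Callable[[], object], runs: int = 20) -> List[float]:
    """Run `operation` repeatedly and return each duration in milliseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def summarize(durations_ms: List[float]) -> Dict[str, float]:
    """Return p50/p90/p95/p99 using inclusive percentile interpolation."""
    cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}
```

To time listing through the access point with boto3, the operation could be `lambda: s3.list_objects_v2(Bucket="<your-ap-alias>", Prefix="sensors/")`, where the prefix is a placeholder. Record cold and warm runs separately, since the first call often dominates.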

FSx S3 AP latency is in the tens of milliseconds range, and throughput depends on the file system's provisioned throughput capacity (AWS documentation). These benchmarks are sizing references from one test environment, not service limits or guarantees.

Architecture Decision Guide

Q: Do you need to WRITE transactional tables (Delta/Iceberg)?
  → Yes: Use native S3 for write path; FSx S3 AP for read-only source data
  → No: FSx S3 AP can handle the read-oriented and flat-file write patterns validated in this series

Q: Do you need lower latency than the FSx S3 AP path provides, or concurrency beyond the file system's provisioned throughput?
  → Yes: Use native S3
  → No: FSx S3 AP (tens of ms, provisioned throughput)

Q: Do you have existing NAS data you want to analyze?
  → Yes: FSx S3 AP eliminates the copy pipeline
  → No: Native S3 may be simpler

Q: Do you need NFS/SMB access alongside S3 analytics?
  → Yes: FSx S3 AP (multi-protocol on same data)
  → No: Evaluate based on above
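The guide above can be encoded as a small function for use in an intake checklist. Evaluating the questions in this order is my assumption; the guide itself does not rank them, and the parameter names are hypothetical.

```python
def recommend_storage_path(
    needs_transactional_table_writes: bool,
    exceeds_fsx_latency_or_concurrency: bool,
    has_existing_nas_data: bool,
    needs_nfs_smb_alongside_s3: bool,
) -> str:
    """Apply the four questions in order and return a short recommendation."""
    if needs_transactional_table_writes:
        return "native S3 for the write path; FSx S3 AP as read-only source"
    if exceeds_fsx_latency_or_concurrency:
        return "native S3"
    if has_existing_nas_data or needs_nfs_smb_alongside_s3:
        return "FSx S3 AP"
    return "native S3 may be simpler"
```

For instance, a team with existing NAS data and no transactional writes gets `"FSx S3 AP"`.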

Decision Criteria

Scale when:

Business metric improves (freshness, cost, time-to-insight)
Governance path is approved
Performance impact is within threshold

Adjust when:

Engine works but governance or performance needs redesign
Staging to native S3 is required for write path

Stop when:

Transactional table write semantics are mandatory on the same path
Vendor session policy blocks production path with no approved workaround
Security owner rejects the access model

Why FSx for ONTAP, Not Just S3?

A common objection: "Why not put data directly on S3 and skip FSx for ONTAP entirely?"

Consideration	S3 only	FSx for ONTAP + S3 AP
Existing NFS/SMB workloads	Must migrate or maintain dual paths	No change — existing apps continue on NFS/SMB
Storage efficiency	No dedup/compression (pay for every byte)	ONTAP dedup + compression (1.5-2x typical reduction)
Point-in-time recovery	S3 Versioning (per-object, costly at scale)	ONTAP Snapshot (volume-level, instant, space-efficient)
Dev/test data provisioning	Full copy required	FlexClone (instant zero-copy clone)
Multi-protocol access	S3 only	NFS + SMB + S3 on same data simultaneously
Application changes needed	Yes (rewrite to S3 SDK)	No (NFS/SMB unchanged, S3 AP is additive)

The answer: If you're starting fresh with no existing file data, S3 is simpler. If you have existing NAS workloads (and most enterprises do), FSx for ONTAP lets you add analytics without disrupting applications or duplicating data.

Business Value Hypotheses

Business issue	Baseline metric	Expected value	Validation path	Decision owner
NAS analytics requires nightly copy to S3	Copy pipeline runtime, freshness lag	Reduce data freshness lag to near-zero	Athena / Glue / EMR direct query	Data platform owner
Enterprise documents are hard to search	Avg search time per user	Faster document discovery	Bedrock KB / permission-aware RAG	Information management owner
ETL pipeline duplicates storage	Duplicate storage cost	Lower copy and storage overhead	Glue / EMR write-back to same volume	Storage / FinOps owner
Platform selection is unclear	Weeks spent on PoC	Faster architecture decision	This compatibility map	Architecture lead

Partner Offer Paths

Customer need	Suggested offer	Exit decision
Query NAS data without copy	Athena / Redshift Spectrum validation pilot	Scale / adjust / stop
ETL from NAS to curated Parquet	Glue or EMR Serverless validation sprint	Production design / stage to S3
RAG over enterprise documents	Bedrock KB / permission-aware RAG assessment	Proceed only with authorization model validated
Databricks lakehouse integration	UC External Location with `access_point` field for read; staging to native S3 for Delta write	File-level read works under UC; subdirectory listing and table creation pending vendor resolution
Transactional table write	Native S3 table storage design	FSx S3 AP as source, not table log storage

The purpose of these offers is not to force every workload onto FSx S3 AP, but to quickly identify the right access path, the right engine, and the right stop condition.

Why Snowflake When Athena Is ~$0.0005/Query?

Partners and customers often ask: "Athena is serverless and cheap — why would I pay for Snowflake?"

Capability	Athena	Snowflake External Table
Ad-hoc SQL query	✅	✅
Cost per query (10K rows)	~$0.0005	~$0.01 (XS warehouse)
AI functions (summarize, translate, sentiment)	❌	✅ Zero-copy, no COPY INTO
Governance tags + column masking	Via Lake Formation (separate setup)	✅ Built-in (same platform)
Row Access Policy	Via Lake Formation	✅ Built-in
Data Sharing (cross-org, governed)	❌	✅ Snowflake Data Sharing / Marketplace
Cortex Search (semantic RAG)	❌	✅ (requires Dynamic Table)
Materialized View on external data	❌	✅

The answer: Athena is the cheapest discovery tool. Snowflake is the fastest path to governed AI + data sharing. They are not competing — they serve different stages of the data product lifecycle. Start with Athena to prove the data is queryable, then graduate to Snowflake when the customer needs AI functions, governance, or cross-organization data sharing.

Key Technical Findings

1. Internet-Origin AP Required for Managed Services

In this validation, managed service paths (Athena, Glue, Redshift Spectrum, Bedrock) required internet-origin access points because the service access path did not originate from the customer VPC. Validate this per service, region, and network configuration.

2. Parquet Timestamp Compatibility

Parquet files written from pandas or DuckDB can carry nanosecond-resolution timestamps. In this validation, Spark (Glue, EMR) could not read those files. For cross-engine compatibility, write timestamps at microsecond resolution.

3. EMRFS vs S3A

EMR's EMRFS (s3://) natively supports S3 AP aliases. The S3A FileSystem (s3a://) does NOT work with AP aliases (URL parsing error). Use s3:// prefix in EMR.
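If job configurations already contain `s3a://` paths, a small helper can normalize them before they reach Spark on EMR. This is a sketch with a hypothetical function name; it only rewrites the scheme and does not validate the alias.

```python
from urllib.parse import urlsplit


def to_emrfs_uri(uri: str) -> str:
    """Rewrite an s3a:// URI to the s3:// scheme handled by EMRFS."""
    parts = urlsplit(uri)
    if parts.scheme not in {"s3", "s3a"}:
        raise ValueError(f"not an S3-style URI: {uri!r}")
    return f"s3://{parts.netloc}{parts.path}"
```

For example, `to_emrfs_uri("s3a://<your-ap-alias>/curated/data.parquet")` returns the same path with the `s3://` prefix.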

4. DuckDB httpfs Configuration

DuckDB requires s3_url_style = 'path' and explicit s3_endpoint to work with S3 AP aliases. In Lambda, also set home_directory = '/tmp'.
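The settings above can be collected into a list of statements to run on a DuckDB connection before the first query. This is a sketch: the function name is hypothetical, `s3_region` is an extra setting I include for completeness, and the endpoint value shown in the usage note is an assumption to replace with the one that works in your environment.

```python
from typing import List


def duckdb_s3ap_settings(region: str, endpoint: str,
                         in_lambda: bool = False) -> List[str]:
    """Build the statements to run before querying an S3 AP alias."""
    statements = [
        "INSTALL httpfs",
        "LOAD httpfs",
        f"SET s3_region = '{region}'",
        f"SET s3_endpoint = '{endpoint}'",
        "SET s3_url_style = 'path'",
    ]
    if in_lambda:
        # Lambda's only writable path is /tmp, so point DuckDB's home there
        # before any extension is installed.
        statements.insert(0, "SET home_directory = '/tmp'")
    return statements
```

Typical use would be `for stmt in duckdb_s3ap_settings("ap-northeast-1", "s3.ap-northeast-1.amazonaws.com", in_lambda=True): con.execute(stmt)`, followed by a `read_parquet('s3://<your-ap-alias>/...')` query.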

5. Trino Hive Connector

Trino requires hive.s3.path-style-access=true and explicit hive.s3.endpoint to resolve S3 AP aliases. Same pattern as DuckDB — path-style access is the key.
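For reference, the two properties can be rendered into a catalog file fragment. The sketch below only covers the S3-related lines named above plus `connector.name`; metastore settings are omitted, and the function name and endpoint value are hypothetical.

```python
def render_trino_hive_catalog(endpoint: str) -> str:
    """Render the S3-related lines of a Hive catalog properties file."""
    properties = {
        "connector.name": "hive",
        "hive.s3.path-style-access": "true",
        "hive.s3.endpoint": endpoint,
    }
    return "\n".join(f"{k}={v}" for k, v in properties.items()) + "\n"
```

The output can be merged into the catalog's properties file alongside whatever metastore configuration your deployment already uses.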

6. S3 Gateway Endpoint Routing

VPC-attached compute (Lambda in VPC, EC2) may experience timeouts when accessing FSx S3 AP through an S3 Gateway VPC Endpoint. The FSx S3 AP alias resolves to s3-r-w.<region>.amazonaws.com which may not route correctly through the Gateway endpoint. Workaround: use NAT Gateway or place compute outside VPC. See FSx S3 AP Networking Considerations.

7. Session Policy Is the Common Blocker for Third-Party Platforms

The session policy issue is not unique to one vendor in this validation. It may affect any analytics platform that applies restrictive AssumeRole session policies designed around standard S3 bucket ARN patterns. AWS-native services work because they use IAM roles directly without intermediary session policies.
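A simplified illustration of the mechanism: the effective permissions of an assumed-role session are the intersection of the role's policies and the session policy, so a session policy scoped to a bucket ARN pattern does not cover an access point ARN. The ARNs, account ID, and names below are hypothetical, and `fnmatchcase` is only a rough stand-in for IAM wildcard matching, not the vendors' actual policy logic.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def resource_allowed(resource_arn: str, allowed_patterns: Iterable[str]) -> bool:
    """Return True if any pattern matches (simplified stand-in for IAM matching)."""
    return any(fnmatchcase(resource_arn, pattern) for pattern in allowed_patterns)


# Hypothetical session policy scoped to a classic bucket ARN pattern.
session_policy_resources = ["arn:aws:s3:::analytics-bucket/*"]

bucket_object = "arn:aws:s3:::analytics-bucket/data/part-0.parquet"
access_point_object = (
    "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/my-fsxn-ap"
    "/object/data/part-0.parquet"
)

print(resource_allowed(bucket_object, session_policy_resources))        # True
print(resource_allowed(access_point_object, session_policy_resources))  # False
```

This is why a platform-side option that adds the access point ARN to the generated session policy can unblock reads without any change on the AWS side.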

Note for Regulated Workloads

"Zero data movement" means source files do not need to be copied from FSx for ONTAP to S3 for the tested access paths. However, metadata, query results, logs, embeddings, temporary files, and derived datasets may still be created by the consuming service.

For regulated workloads, validate:

Data classification of source and derived data
Derived data location (query results, embeddings, temp files)
Encryption and key ownership at each layer
Audit log coverage (CloudTrail, platform logs, ONTAP audit)
Retention and deletion policy
Approval owner and expiration date

Bedrock KB is a strong candidate for RAG over NAS documents, but regulated use cases must validate permission-aware retrieval, data classification, human review requirements, and residual risk acceptance before production use.

For regulated workloads, do not start a PoC until the data owner, security owner, and platform owner agree on the allowed prefixes, derived data locations, logging scope, rollback plan, and approval expiration date.

Assurance artifacts to prepare:

Non-technical overview for stakeholders
Data flow diagram (source → AP → service → output)
Access control summary (dual-layer authorization)
Audit evidence summary
Rollback plan
Residual risk register

Store these artifacts with an approval ID, owner, review date, and expiration date so the PoC decision can be audited later.

GenAI / RAG Evaluation Metrics

For GenAI and RAG workloads on FSx for ONTAP data, measure:

Retrieval accuracy (relevant documents returned)
Permission-aware retrieval pass rate (unauthorized documents NOT returned)
Hallucination reduction vs baseline
Data freshness lag (NFS write → S3 AP availability)
Human review workload
User time saved vs previous search method
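The permission-aware pass rate can be computed directly from evaluation logs. In this sketch a query passes only if every returned document is in the caller's authorized set; that definition and the function name are my assumptions, so adapt them to your own evaluation protocol.

```python
from typing import Iterable, Set, Tuple


def permission_aware_pass_rate(
    results: Iterable[Tuple[Set[str], Set[str]]],
) -> float:
    """Fraction of queries that returned no unauthorized document.

    Each item is (returned_doc_ids, authorized_doc_ids) for one query.
    """
    results = list(results)
    if not results:
        raise ValueError("no query results to evaluate")
    passed = sum(1 for returned, authorized in results if returned <= authorized)
    return passed / len(results)
```

For regulated workloads the acceptance threshold is usually 100%, so a single failing query is worth investigating rather than averaging away.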

Start with a read-only, permission-aware PoC with human review attached before moving to production deployment.

Series Index

This is the series overview for "FSx for ONTAP S3 Access Points × Lakehouse Deep Dive."

Part	Platform	Status	URL
Part 1	Athena — Query NAS Data In Place	✅ Published	dev.to
Part 2	Databricks — A Layer-by-Layer Validation of Observed Boundaries	✅ Published	—
Part 3	Snowflake — From 'Access Denied' to Working External Tables	✅ Resolved	—
Part 4	DuckDB Lambda — Serverless for $0.00001/query	Ready to publish	—
Part 5	EMR Spark — Read-Write ETL Pipeline	Ready to publish	—
Part 6	Redshift Spectrum — DWH Meets NAS Data	Coming soon	—
Part 7	Trino — Open-Source SQL on NAS Data	Coming soon	—
Summary	This article (Overview — What Works and What Doesn't)	Ready to publish	—

Note: This overview article can be published as the final "summary" post in the series, or as a standalone reference.

Update to Part 1 (Athena)

Since Part 1 was published, additional verification has been completed and published as a v1.1 update:

CTAS write-back: Verified as WORKING (3.7s, writes Parquet back to FSxN S3 AP)
Partition projection: Verified with Hive-style partitioning
Benchmark: 54.8 MB/s peak throughput (5M rows, 103 MB scan in 2.2s)
9/9 negative tests pass: Unauthorized access correctly denied

Try It Yourself

git clone https://github.com/Yoshiki0705/fsxn-lakehouse-integrations.git
cd fsxn-lakehouse-integrations

# Deploy base infrastructure
aws cloudformation deploy \
  --template-file shared/cloudformation/fsxn-s3ap-base.yaml \
  --stack-name fsxn-lakehouse-base \
  --capabilities CAPABILITY_IAM

# Validate connectivity
python shared/scripts/validate-access.py --access-point-alias <your-ap-alias>

# Choose your platform: integrations/athena/, integrations/duckdb/, etc.

Each integration directory includes a README, CloudFormation template, deployment script, and sample queries.

What's Next

Databricks UC + access_point field — partial success confirmed (2026-05-24); awaiting vendor guidance on subdirectory listing and table creation
Snowflake AWS_ACCESS_POINT_ARN — resolved (2026-05-24); SELECT and External Table work with stage parameter
Apache Iceberg community engagement (S3FileIO + AP alias support)
ONTAP feature quantification (dedup ratio, Snapshot RTO): unblocked after the orphaned DNS/AD configuration was removed and the S3 AP recovered (2026-05-24)
Redshift Spectrum and Trino deep-dive articles
Customer PoC execution with measured business outcomes

Operational Lessons Learned

S3 AP Timeout Caused by Orphaned DNS/AD Configuration (2026-05-24)

During this series validation, all S3 APs on one SVM became unresponsive for 7+ days. Root cause: the SVM had DNS servers configured for an AD domain that no longer existed. When the S3 AP backend processes requests on an AD-joined SVM, ONTAP's name-service stack attempts DNS resolution for user-mapping — if DNS is unreachable, requests block until timeout.

Key findings:

Disabling customer-configured FPolicy did NOT fix the issue
A separate SVM without DNS/AD worked normally on the same file system
Removing the orphaned CIFS/DNS configuration restored S3 AP instantly

Prevention: Do not leave orphaned DNS/AD configurations on SVMs used for S3 AP access. If AD is decommissioned, clean up vserver cifs and vserver services dns settings. See FSx S3 AP Networking — Section 7 for full details.

References

This series is based on hands-on verification, not documentation review. Every "Verified" claim has a corresponding evidence record in the verification-pack/ directory.

Disclaimer: This article is an independent validation report and does not represent AWS, NetApp, Databricks, or Snowflake official guidance. Product behavior, support status, and platform capabilities may change. Always validate in your own environment and consult vendor documentation and support channels.
