TL;DR: Deployed 100+ AI models to edge devices. Discovered the hard way that manual optimization sucks. Built a tool to automate it. Sharing everything I learned.
The Problem Nobody Talks About
You spend weeks training the perfect computer vision model. 98% accuracy. Beautiful loss curves. Your team is celebrating.
Then someone asks: "Can we run this on a Jetson Nano?"
And suddenly, your 2GB PyTorch masterpiece becomes a 500MB problem.
This was me six months ago. I had a YOLOv8 model that needed to run on edge hardware for a robotics project. The model worked perfectly in the cloud. On a Jetson Nano? 12 FPS. Unusable.
I needed 30+ FPS for real-time detection.
The Manual Optimization Rabbit Hole
Here's what I tried first (spoiler: it was painful):
Attempt 1: TensorFlow Lite Conversion
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
Result: Model size went from 980MB → 780MB. Not enough. Inference time barely improved.
Time wasted: 8 hours fighting compatibility issues between TensorFlow versions.
Attempt 2: Manual INT8 Quantization
import numpy as np

converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

def representative_dataset():
    # Random data for calibration -- this turned out to be the mistake
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter.representative_dataset = representative_dataset
tflite_model = converter.convert()
Result: Model crashed on inference. Accuracy dropped to 73% (from 98%). Completely broken.
Time wasted: Another 12 hours debugging why my calibration dataset wasn't working.
Attempt 3: ONNX Runtime + TensorRT
This one actually worked. But here's what it took:
- Convert PyTorch → ONNX (3 hours, fighting version conflicts)
- Optimize ONNX graph (2 hours, manual layer fusion)
- Convert to TensorRT engine (4 hours, hardware-specific tuning)
- Profile and fix precision issues (6 hours of trial and error)
Total time: 3 days for ONE model.
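For reference, the TensorRT step boiled down to something like this — a minimal sketch with the TensorRT Python API (paths and flags here are illustrative; the per-model precision tuning is what actually ate the hours):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # FP16 is usually the sweet spot on Jetson

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)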
Final result:
- ✅ Size: 245MB (75% reduction)
- ✅ Latency: 33ms (2.5x faster)
- ✅ Accuracy: 97.2% (0.8% loss)
It worked! But I had 15 more models to optimize.
The "Oh Crap" Moment
At that rate, optimizing all my models would take 45 days of full-time work.
I started Googling for tools. Found OctoAI. Perfect solution.
Then I read:
"OctoAI acquired by NVIDIA. Platform shutting down October 2024."
Great. 😑
Neural Magic? Enterprise-only. $50K minimum.
Edge Impulse? Microcontroller focus, not for my use case.
There was no affordable, automated solution for regular developers.
What I Learned From 100+ Model Compressions
Over the next 3 months, I compressed over 100 different models. Here's the non-obvious stuff nobody tells you:
1. INT8 Quantization is Magic (When Done Right)
Average results across 100+ models:
- 📦 Compression: 4x smaller
- ⚡ Speedup: 2-3x faster
- 🎯 Accuracy loss: 0.5-1.5%
But here's the catch: calibration dataset matters more than model architecture.
Bad calibration = 10% accuracy loss 😱
Good calibration = 0.5% accuracy loss 🎉
My calibration strategy:
# Use 1000 representative samples from the validation set
calibration_data = validation_set.sample(n=1000, random_state=42)

def representative_dataset_gen():
    for sample in calibration_data:
        yield [sample.numpy()]

converter.representative_dataset = representative_dataset_gen
Game changer. Accuracy loss went from 3% to 0.5%.
2. Not All Layers Should Be Quantized
I was quantizing everything to INT8. Rookie mistake.
Some layers (especially first conv and last FC layers) are super sensitive to quantization.
Better approach: Mixed precision quantization
| Layer Type | Precision | Reason |
|---|---|---|
| First Conv | FP16 | Sensitive to input variations |
| Middle Conv Layers | INT8 | Biggest size savings |
| Attention Layers | FP16 | Critical for accuracy |
| Final FC Layer | FP16 | Output quality matters |
| Batch Norm | INT8 | Can be fused anyway |
Result: 6x compression with 0.3% accuracy loss instead of 3%.
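To make the table concrete, here's roughly how you can keep sensitive layers out of INT8 with ONNX Runtime's static quantization. Treat it as a sketch: the node names, the ValidationSetReader class, and the load_validation_samples() loader are placeholders, and the excluded layers stay in float here (the hardware compiler can then run them in FP16).

import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)

class ValidationSetReader(CalibrationDataReader):
    """Feeds real validation samples (not random noise) to the calibrator."""
    def __init__(self, samples, input_name="input"):
        self.input_name = input_name
        self.samples = iter(samples)

    def get_next(self):
        sample = next(self.samples, None)
        if sample is None:
            return None
        return {self.input_name: np.asarray(sample, dtype=np.float32)}

samples = load_validation_samples(n=1000)  # hypothetical loader: preprocessed validation images

quantize_static(
    model_input="model.onnx",
    model_output="model_int8_mixed.onnx",
    calibration_data_reader=ValidationSetReader(samples),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
    # Hypothetical node names: keep the first conv and the final FC out of INT8
    nodes_to_exclude=["/backbone/conv1/Conv", "/head/fc/Gemm"],
)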
3. Framework Conversion is a Minefield
Success rates I observed:
- PyTorch → TFLite directly: 60% success
- PyTorch → ONNX → TFLite: 85% success
- TensorFlow → ONNX → TFLite: 90% success
The trick: Always go through ONNX as an intermediate step.
import torch

# Export PyTorch to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=13,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)
# Optimize ONNX graph
import onnxruntime as ort
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model_optimized.onnx"
session = ort.InferenceSession("model.onnx", sess_options)
Free 15-20% speedup just from graph fusion.
4. Hardware-Specific Optimization is Non-Negotiable
Generic optimization: 2x speedup
Hardware-specific optimization: 5x speedup
Here's what works for each platform:
| Hardware | Best Framework | Key Optimization |
|---|---|---|
| NVIDIA Jetson | TensorRT | FP16 + layer fusion |
| Raspberry Pi | TFLite + XNNPACK | INT8 quantization |
| iOS (iPhone) | CoreML | Neural Engine offload |
| Android | TFLite + NNAPI | GPU delegate |
| Edge TPU | TFLite + Edge TPU compiler | INT8 required |
Don't try to use the same optimized model everywhere. It won't work.
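As a small example of what "hardware-specific" means in practice, one way to do it on a Jetson is to let ONNX Runtime hand the graph to TensorRT instead of running generic CUDA kernels. A minimal sketch (the provider options are standard TensorRT execution provider settings; the input tensor is a stand-in):

import numpy as np
import onnxruntime as ort

# On a Jetson: hand the graph to TensorRT (FP16), fall back to CUDA, then CPU
providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,           # FP16 kernels on the Jetson GPU
        "trt_engine_cache_enable": True,   # cache built engines between runs
        "trt_engine_cache_path": "./trt_cache",
    }),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

session = ort.InferenceSession("model_optimized.onnx", providers=providers)

input_tensor = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in input
outputs = session.run(None, {"input": input_tensor})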
5. Pruning is Overrated (for Most Use Cases)
Everyone talks about pruning. I tried it extensively:
Structured pruning:
- ✅ 30% size reduction
- ❌ 5% accuracy loss
- ❌ Marginal speedup on real hardware
Unstructured pruning:
- ✅ 50% size reduction
- ❌ No speedup on real hardware (sparse ops aren't optimized)
- ❌ Complicated to maintain
My take: Quantization first. Pruning only if you REALLY need that extra 20% size reduction.
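If you still want to experiment with it, here's a tiny sketch of both flavors using PyTorch's torch.nn.utils.prune (the toy model is purely illustrative):

import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU(), nn.Conv2d(64, 128, 3))

for module in model.modules():
    if isinstance(module, nn.Conv2d):
        # Unstructured: zero out 50% of individual weights.
        # Great sparsity on paper, but dense kernels ignore the zeros at runtime.
        prune.l1_unstructured(module, name="weight", amount=0.5)

        # Structured (alternative): drop 30% of whole output channels,
        # which is what actually shrinks compute on real hardware.
        # prune.ln_structured(module, name="weight", amount=0.3, n=2, dim=0)

        prune.remove(module, "weight")  # bake the pruning into the weights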
The Breaking Point
After manually optimizing 47 models, I hit a wall.
Each model took 4-8 hours. I was burning out. And I still had:
- ❌ 15 models for the robotics project
- ❌ 22 models for a computer vision pipeline
- ❌ 18 models for a client's IoT deployment
I did the math: 220+ hours of manual work remaining.
That's when I decided to automate it.
Building the Solution
I built a pipeline that handles everything I was doing manually:
Step 1: Model Analysis
import numpy as np
import onnx
import torch
import tensorflow as tf

def analyze_model(model_path):
    """Auto-detect framework and analyze architecture"""
    # Detect framework from the file extension and count parameters
    if model_path.endswith(('.pt', '.pth')):
        framework = 'pytorch'
        model = torch.load(model_path)
        total_params = sum(p.numel() for p in model.parameters())
    elif model_path.endswith('.h5'):
        framework = 'tensorflow'
        model = tf.keras.models.load_model(model_path)
        total_params = model.count_params()
    elif model_path.endswith('.onnx'):
        framework = 'onnx'
        model = onnx.load(model_path)
        total_params = sum(int(np.prod(init.dims)) for init in model.graph.initializer)
    else:
        raise ValueError(f'Unsupported model format: {model_path}')

    # Identify quantization-sensitive layers (first conv, final FC, attention blocks)
    sensitive_layers = identify_sensitive_layers(model)

    return {
        'framework': framework,
        'total_params': total_params,
        'sensitive_layers': sensitive_layers
    }
Step 2: Automated Compression
def compress_model(model, target_hardware, compression_level):
    """Apply mixed precision quantization"""
    # Generate calibration dataset
    calibration_data = generate_calibration_dataset(model)

    # Define quantization config
    if compression_level == 'aggressive':
        default_precision = 'int8'
        sensitive_precision = 'int8'   # quantize even the sensitive layers
    elif compression_level == 'balanced':
        default_precision = 'int8'
        sensitive_precision = 'fp16'
    else:  # conservative
        default_precision = 'fp16'
        sensitive_precision = 'fp16'

    # Apply mixed precision
    quantized_model = apply_mixed_precision(
        model,
        default_precision=default_precision,
        sensitive_layers_precision=sensitive_precision,
        calibration_data=calibration_data
    )

    # Hardware-specific compilation
    if target_hardware == 'jetson':
        final_model = compile_tensorrt(quantized_model)
    elif target_hardware == 'raspberry_pi':
        final_model = compile_tflite_xnnpack(quantized_model)
    elif target_hardware == 'mobile':
        final_model = compile_coreml(quantized_model)
    else:
        raise ValueError(f'Unsupported target hardware: {target_hardware}')

    return final_model
Step 3: Validation
def validate_compression(original_model, compressed_model, test_dataset):
    """Benchmark and validate accuracy"""
    # Size comparison
    original_size = get_model_size(original_model)
    compressed_size = get_model_size(compressed_model)
    compression_ratio = original_size / compressed_size

    # Latency benchmark
    original_latency = benchmark_latency(original_model, test_dataset)
    compressed_latency = benchmark_latency(compressed_model, test_dataset)
    speedup = original_latency / compressed_latency

    # Accuracy validation
    original_acc = evaluate_accuracy(original_model, test_dataset)
    compressed_acc = evaluate_accuracy(compressed_model, test_dataset)
    accuracy_loss = original_acc - compressed_acc

    return {
        'compression_ratio': compression_ratio,
        'speedup': speedup,
        'accuracy_loss': accuracy_loss
    }
Tech stack:
- Backend: FastAPI (async job queue with Redis)
- ML: ONNX Runtime, TFLite, PyTorch, llama.cpp
- Compute: RunPod GPU instances (70% cheaper than AWS)
- Storage: S3-compatible object storage
The whole pipeline runs in 3-8 minutes depending on model size.
What took me 4-8 hours manually now takes under 10 minutes.
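To give a feel for the plumbing (not the actual service code), here's a minimal sketch of that async pattern: FastAPI takes the upload, pushes a job onto a Redis list, and a separate GPU worker consumes it and writes results back. Names like compression_jobs are placeholders.

import json
import uuid

import redis
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
r = redis.Redis(host="localhost", port=6379, db=0)
JOB_QUEUE = "compression_jobs"  # placeholder queue name

@app.post("/jobs")
async def submit_job(model: UploadFile = File(...), target_hardware: str = "jetson"):
    job_id = str(uuid.uuid4())
    model_path = f"/tmp/{job_id}_{model.filename}"
    with open(model_path, "wb") as f:
        f.write(await model.read())

    # Enqueue the compression job for the GPU worker
    r.rpush(JOB_QUEUE, json.dumps({
        "job_id": job_id,
        "model_path": model_path,
        "target_hardware": target_hardware,
    }))
    return {"job_id": job_id, "status": "queued"}

@app.get("/jobs/{job_id}")
def job_status(job_id: str):
    # The worker writes benchmark results to result:<job_id> when done
    result = r.get(f"result:{job_id}")
    return json.loads(result) if result else {"job_id": job_id, "status": "pending"}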
Real-World Results
Here are some standout cases from the 100+ models I've compressed:
Case 1: YOLOv8-Large (Object Detection)
| Metric | Before | After | Change |
|---|---|---|---|
| Size | 980MB | 245MB | 75% smaller |
| Latency | 85ms | 33ms | 2.5x faster |
| mAP | 98.2% | 97.4% | 0.8% loss |
Deployed on a Jetson Nano for real-time drone detection.
Case 2: BERT-Base (NLP)
| Metric | Before | After | Change |
|---|---|---|---|
| Size | 440MB | 110MB | 75% smaller |
| Latency | 120ms | 45ms | 2.7x faster |
| Accuracy | 94.3% | 93.8% | 0.5% loss |
Deployed on a Raspberry Pi 4 for on-device sentiment analysis.
Case 3: MobileNetV3 (Image Classification)
| Metric | Before | After | Change |
|---|---|---|---|
| Size | 21MB | 5.2MB | 75% smaller |
| Latency | 18ms | 7ms | 2.6x faster |
| Top-1 Acc | 75.2% | 74.6% | 0.6% loss |
Deployed in an Android app with 100M+ users.
Pattern: Consistent 4x compression with 2-3x speedup and <1% accuracy loss.
The Bigger Picture
Edge AI is exploding. The market is going from $24.9B (2025) to $118.69B by 2033.
But here's the problem: most AI is still trained in the cloud and deployed in the cloud.
The real world needs AI at the edge:
- 🚁 Drones that detect objects in real-time
- 🤖 Robots that navigate autonomously
- 🏥 Medical devices that process data locally (HIPAA compliance)
- 📷 Smart cameras that work without internet
- 🔋 IoT sensors that run for years on battery
And getting models onto these devices is still way too hard.
What I Wish Someone Had Told Me 6 Months Ago
1. Start with quantization, not pruning
It's the 80/20 solution. You'll get 4x compression with minimal effort.
2. Always use ONNX as an intermediate format
It saves so much pain with framework conversions.
3. Calibration dataset quality > model architecture
Use 1000+ representative samples from your actual validation set.
4. Hardware-specific optimization is non-optional
Generic models won't cut it. Optimize for your target hardware.
5. Measure everything on real hardware
Latency, throughput, memory, power consumption. Not just in theory.
6. Don't trust the accuracy number from quantization tools
Always validate on your actual test set with your actual metrics.
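Here's a minimal sketch of what that looks like for a classification model in TFLite — a plain accuracy loop over your real test data, handling the INT8 input scaling explicitly (loading test_images/test_labels is up to you):

import numpy as np
import tensorflow as tf

def tflite_accuracy(tflite_path, test_images, test_labels):
    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    correct = 0
    for image, label in zip(test_images, test_labels):
        x = image[np.newaxis, ...].astype(np.float32)
        if inp['dtype'] == np.int8:
            # Quantized input: apply the model's own scale / zero-point
            scale, zero_point = inp['quantization']
            x = (x / scale + zero_point).astype(np.int8)
        interpreter.set_tensor(inp['index'], x)
        interpreter.invoke()
        pred = interpreter.get_tensor(out['index'])
        correct += int(np.argmax(pred) == label)
    return correct / len(test_labels)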
Key Takeaways
If you're deploying models to edge devices, here's your action plan:
# The compression playbook
def optimize_for_edge(model, target_hardware):
    # 1. Convert to ONNX first (universal format)
    onnx_model = convert_to_onnx(model)

    # 2. Apply INT8 quantization with good calibration
    calibration_data = sample_validation_set(n=1000)
    quantized_model = quantize_int8(onnx_model, calibration_data)

    # 3. Hardware-specific compilation
    if target_hardware == 'jetson':
        final_model = compile_tensorrt(quantized_model)
    elif target_hardware == 'raspberry_pi':
        final_model = compile_tflite(quantized_model)

    # 4. Validate on real hardware
    metrics = benchmark_on_device(final_model, target_hardware)

    return final_model, metrics
Expected results:
- 4x smaller models
- 2-3x faster inference
- <1% accuracy loss
Try It Yourself
If you're facing similar challenges, I built this into a platform that automates the entire process.
🔗 Try it here: https://edge-ai-alpha.vercel.app/
Free tier: 5 compressions/month, no credit card needed.
Supports:
- ✅ PyTorch, TensorFlow, ONNX models
- ✅ Export to TFLite, CoreML, TensorRT, ONNX Runtime
- ✅ Automatic calibration dataset generation
- ✅ Hardware-specific optimization profiles
What's Next?
I'm working on:
- 🔧 Automated hyperparameter tuning for compression
- 🌐 Federated learning support (train on edge, aggregate in cloud)
- 🎛️ Custom hardware profiles (add your own device specs)
- 📦 Multi-model ensemble optimization
Let's Discuss!
What's your experience with model deployment?
- Ever tried quantization? How did it go?
- What's your biggest pain point with edge AI?
- Any horror stories to share? 😅
Drop a comment below! I'll try to respond to everyone.
Thanks for reading! If you found this helpful, consider:
- ❤️ Reacting with a ❤️ or 🦄
- 💬 Sharing your own edge deployment experiences
- 🔖 Bookmarking for later reference
Happy optimizing! 🚀
— Patel Darshit