I Spent 3 Months Compressing AI Models So You Don't Have To – Here's What I Learned

TL;DR: Deployed 100+ AI models to edge devices. Discovered the hard way that manual optimization sucks. Built a tool to automate it. Sharing everything I learned.


The Problem Nobody Talks About

You spend weeks training the perfect computer vision model. 98% accuracy. Beautiful loss curves. Your team is celebrating.

Then someone asks: "Can we run this on a Jetson Nano?"

And suddenly, your 2GB PyTorch masterpiece becomes a 500MB problem.

This was me six months ago. I had a YOLOv8 model that needed to run on edge hardware for a robotics project. The model worked perfectly in the cloud. On a Jetson Nano? 12 FPS. Unusable.

I needed 30+ FPS for real-time detection.


The Manual Optimization Rabbit Hole

Here's what I tried first (spoiler: it was painful):

Attempt 1: TensorFlow Lite Conversion

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

Result: Model size went from 980MB → 780MB. Not enough. Inference time barely improved.

Time wasted: 8 hours fighting compatibility issues between TensorFlow versions.


Attempt 2: Manual INT8 Quantization

import numpy as np

converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

def representative_dataset():
    # Random noise as calibration data, which turned out to be the real problem (see below)
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter.representative_dataset = representative_dataset
tflite_model = converter.convert()

Result: Model crashed on inference. Accuracy dropped to 73% (from 98%). Completely broken.

Time wasted: Another 12 hours debugging why my calibration dataset wasn't working.


Attempt 3: ONNX Runtime + TensorRT

This one actually worked. But here's what it took:

  1. Convert PyTorch → ONNX (3 hours, fighting version conflicts)
  2. Optimize ONNX graph (2 hours, manual layer fusion)
  3. Convert to TensorRT engine (4 hours, hardware-specific tuning)
  4. Profile and fix precision issues (6 hours of trial and error)

Total time: 3 days for ONE model.
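Just the engine build (step 3), for reference, looked roughly like this. It's a sketch with the TensorRT 8.x Python API, not my exact script, and the file names are placeholders:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Parse the ONNX graph into a TensorRT network
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

# Build an FP16 engine (this is the slow, hardware-specific part)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine_bytes = builder.build_serialized_network(network, config)

with open("model.engine", "wb") as f:
    f.write(engine_bytes)

That snippet hides the real pain: the parser errors and precision mismatches in steps 2 and 4 are where the hours actually went.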

Final result:

  • ✅ Size: 245MB (75% reduction)
  • ✅ Latency: 33ms (2.5x faster)
  • ✅ Accuracy: 97.2% (0.8% loss)

It worked! But I had 15 more models to optimize.


The "Oh Crap" Moment

At that rate, optimizing all my models would take 45 days of full-time work.

I started Googling for tools. Found OctoAI. Perfect solution.

Then I read:

"OctoAI acquired by NVIDIA. Platform shutting down October 2024."

Great. 😑

Neural Magic? Enterprise-only. $50K minimum.

Edge Impulse? Microcontroller focus, not for my use case.

There was no affordable, automated solution for regular developers.


What I Learned From 100+ Model Compressions

Over the next 3 months, I compressed over 100 different models. Here's the non-obvious stuff nobody tells you:

1. INT8 Quantization is Magic (When Done Right)

Average results across 100+ models:

  • 📦 Compression: 4x smaller
  • ⚡ Speedup: 2-3x faster
  • 🎯 Accuracy loss: 0.5-1.5%

But here's the catch: calibration dataset matters more than model architecture.

Bad calibration = 10% accuracy loss 😱

Good calibration = 0.5% accuracy loss 🎉

My calibration strategy:

# Use ~1000 representative samples from the real validation set, not random noise
# (assuming validation_set is a tf.data.Dataset of (image, label) pairs)
calibration_data = validation_set.shuffle(10_000, seed=42).take(1000)

def representative_dataset_gen():
    for image, _ in calibration_data:
        yield [tf.expand_dims(tf.cast(image, tf.float32), 0)]

converter.representative_dataset = representative_dataset_gen

Game changer. Accuracy loss went from 3% to 0.5%.


2. Not All Layers Should Be Quantized

I was quantizing everything to INT8. Rookie mistake.

Some layers (especially first conv and last FC layers) are super sensitive to quantization.

Better approach: Mixed precision quantization

| Layer Type | Precision | Reason |
|---|---|---|
| First Conv | FP16 | Sensitive to input variations |
| Middle Conv Layers | INT8 | Biggest size savings |
| Attention Layers | FP16 | Critical for accuracy |
| Final FC Layer | FP16 | Output quality matters |
| Batch Norm | INT8 | Can be fused anyway |

Result: 6x compression with 0.3% accuracy loss instead of 3%.
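One way to express "quantize everything except the sensitive layers" is ONNX Runtime's static quantizer with nodes_to_exclude. This is a minimal sketch: the node names and calibration_samples are placeholders you'd pull from your own graph and validation set.

from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class CalibReader(CalibrationDataReader):
    """Feeds real validation samples to the quantizer."""
    def __init__(self, samples, input_name="input"):
        # samples: preprocessed numpy arrays matching the model input shape
        self._iter = iter([{input_name: s} for s in samples])

    def get_next(self):
        return next(self._iter, None)

# Hypothetical node names; grab the real first conv / last FC node names from the ONNX graph
sensitive_nodes = ["Conv_0", "Gemm_last"]

quantize_static(
    "model.onnx",
    "model_int8_mixed.onnx",
    CalibReader(calibration_samples),   # ~1000 real validation samples
    weight_type=QuantType.QInt8,
    activation_type=QuantType.QInt8,
    nodes_to_exclude=sensitive_nodes,   # these layers stay in float precision
)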


3. Framework Conversion is a Minefield

Success rates I observed:

  • PyTorch → TFLite directly: 60% success
  • PyTorch → ONNX → TFLite: 85% success
  • TensorFlow → ONNX → TFLite: 90% success

The trick: Always go through ONNX as an intermediate step.

import torch
import onnxruntime as ort

# Export PyTorch to ONNX (model and dummy_input come from your training code)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=13,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)

# Optimize ONNX graph
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model_optimized.onnx"

session = ort.InferenceSession("model.onnx", sess_options)

Free 15-20% speedup just from graph fusion.


4. Hardware-Specific Optimization is Non-Negotiable

Generic optimization: 2x speedup

Hardware-specific optimization: 5x speedup

Here's what works for each platform:

| Hardware | Best Framework | Key Optimization |
|---|---|---|
| NVIDIA Jetson | TensorRT | FP16 + layer fusion |
| Raspberry Pi | TFLite + XNNPACK | INT8 quantization |
| iOS (iPhone) | CoreML | Neural Engine offload |
| Android | TFLite + NNAPI | GPU delegate |
| Edge TPU | TFLite + Edge TPU compiler | INT8 required |

Don't try to use the same optimized model everywhere. It won't work.
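To make that concrete, here's how differently the same network even gets loaded on a Raspberry Pi versus an Edge TPU (a minimal sketch; file names are placeholders):

import numpy as np
import tensorflow as tf

# Raspberry Pi: plain TFLite interpreter. XNNPACK is the default CPU delegate in
# recent TFLite builds, and num_threads matters a lot on the Pi's 4 cores.
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite", num_threads=4)

# Edge TPU: the model must first go through the Edge TPU compiler, then be loaded
# with the libedgetpu delegate instead:
# interpreter = tf.lite.Interpreter(
#     model_path="model_int8_edgetpu.tflite",
#     experimental_delegates=[tf.lite.experimental.load_delegate("libedgetpu.so.1")],
# )

interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])  # match the quantized input dtype
interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()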


5. Pruning is Overrated (for Most Use Cases)

Everyone talks about pruning. I tried it extensively:

Structured pruning:

  • ✅ 30% size reduction
  • ❌ 5% accuracy loss
  • ❌ Marginal speedup on real hardware

Unstructured pruning:

  • ✅ 50% size reduction
  • ❌ No speedup on real hardware (sparse ops aren't optimized)
  • ❌ Complicated to maintain

My take: Quantization first. Pruning only if you REALLY need that extra 20% size reduction.
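If you still want to experiment, PyTorch ships pruning utilities in torch.nn.utils.prune. Here's a minimal sketch of both flavors on a stock ResNet-18 (not my production setup):

import torch.nn as nn
import torch.nn.utils.prune as prune
import torchvision

model = torchvision.models.resnet18(weights=None)

for module in model.modules():
    if isinstance(module, nn.Conv2d):
        # Structured: zero out 30% of output channels by L2 norm. You only see a real
        # speedup if those channels are then physically removed from the graph.
        prune.ln_structured(module, name="weight", amount=0.3, n=2, dim=0)
        # Unstructured alternative: zero 50% of weights by magnitude. Smaller after
        # compression, but no speedup unless the runtime has sparse kernels.
        # prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the pruning mask into the weights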


The Breaking Point

After manually optimizing 47 models, I hit a wall.

Each model took 4-8 hours. I was burning out. And I still had:

  • ❌ 15 models for the robotics project
  • ❌ 22 models for a computer vision pipeline
  • ❌ 18 models for a client's IoT deployment

I did the math: 220+ hours of manual work remaining.

That's when I decided to automate it.


Building the Solution

I built a pipeline that handles everything I was doing manually:

Step 1: Model Analysis

import numpy as np
import onnx
import tensorflow as tf
import torch

def analyze_model(model_path):
    """Auto-detect framework and analyze architecture"""

    # Detect framework from the file extension and count parameters
    # (parameter counting is framework-specific)
    if model_path.endswith(('.pt', '.pth')):
        framework = 'pytorch'
        model = torch.load(model_path)
        total_params = sum(p.numel() for p in model.parameters())
    elif model_path.endswith('.h5'):
        framework = 'tensorflow'
        model = tf.keras.models.load_model(model_path)
        total_params = model.count_params()
    elif model_path.endswith('.onnx'):
        framework = 'onnx'
        model = onnx.load(model_path)
        total_params = sum(int(np.prod(init.dims)) for init in model.graph.initializer)
    else:
        raise ValueError(f"Unsupported model format: {model_path}")

    # Identify quantization-sensitive layers (first conv, attention, final FC)
    sensitive_layers = identify_sensitive_layers(model)

    return {
        'framework': framework,
        'total_params': total_params,
        'sensitive_layers': sensitive_layers
    }
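identify_sensitive_layers is where the mixed-precision table from section 2 gets applied. Here's a simplified sketch for the PyTorch case (the real version also has to cover attention layers and the other frameworks):

import torch.nn as nn

def identify_sensitive_layers(model):
    """Heuristic: flag the first conv and the last FC layer to keep in higher precision."""
    conv_names = [name for name, m in model.named_modules() if isinstance(m, nn.Conv2d)]
    fc_names = [name for name, m in model.named_modules() if isinstance(m, nn.Linear)]
    sensitive = []
    if conv_names:
        sensitive.append(conv_names[0])   # first conv: sensitive to input variations
    if fc_names:
        sensitive.append(fc_names[-1])    # final FC: output quality matters
    return sensitive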

Step 2: Automated Compression

def compress_model(model, target_hardware, compression_level):
    """Apply mixed precision quantization"""

    # Generate calibration dataset
    calibration_data = generate_calibration_dataset(model)

    # Define quantization config
    if compression_level == 'aggressive':
        default_precision = 'int8'
        sensitive_precision = 'int8'   # quantize everything, accept more accuracy risk
    elif compression_level == 'balanced':
        default_precision = 'int8'
        sensitive_precision = 'fp16'   # sensitive layers (first conv, attention, last FC) stay FP16
    else:  # conservative
        default_precision = 'fp16'
        sensitive_precision = 'fp16'

    # Apply mixed precision
    quantized_model = apply_mixed_precision(
        model,
        default_precision=default_precision,
        sensitive_layers_precision=sensitive_precision,
        calibration_data=calibration_data
    )

    # Hardware-specific compilation
    if target_hardware == 'jetson':
        final_model = compile_tensorrt(quantized_model)
    elif target_hardware == 'raspberry_pi':
        final_model = compile_tflite_xnnpack(quantized_model)
    elif target_hardware == 'mobile':
        final_model = compile_coreml(quantized_model)
    else:
        raise ValueError(f"Unsupported target hardware: {target_hardware}")

    return final_model

Step 3: Validation

def validate_compression(original_model, compressed_model, test_dataset):
    """Benchmark and validate accuracy"""

    # Size comparison
    original_size = get_model_size(original_model)
    compressed_size = get_model_size(compressed_model)
    compression_ratio = original_size / compressed_size

    # Latency benchmark
    original_latency = benchmark_latency(original_model, test_dataset)
    compressed_latency = benchmark_latency(compressed_model, test_dataset)
    speedup = original_latency / compressed_latency

    # Accuracy validation
    original_acc = evaluate_accuracy(original_model, test_dataset)
    compressed_acc = evaluate_accuracy(compressed_model, test_dataset)
    accuracy_loss = original_acc - compressed_acc

    return {
        'compression_ratio': compression_ratio,
        'speedup': speedup,
        'accuracy_loss': accuracy_loss
    }
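For reference, benchmark_latency is nothing fancy. Here's a simplified sketch with warm-up runs and a median instead of a mean:

import time
import numpy as np

def benchmark_latency(model, test_dataset, warmup=10, runs=100):
    """Median wall-clock latency per inference in milliseconds."""
    samples = list(test_dataset)[:warmup + runs]
    timings = []
    for i, sample in enumerate(samples):
        start = time.perf_counter()
        model(sample)                # assumes a callable wrapper around the runtime
        elapsed = time.perf_counter() - start
        if i >= warmup:              # discard warm-up runs (caches, JIT, clock ramp-up)
            timings.append(elapsed)
    return float(np.median(timings)) * 1000.0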

Tech stack:

  • Backend: FastAPI (async job queue with Redis)
  • ML: ONNX Runtime, TFLite, PyTorch, llama.cpp
  • Compute: RunPod GPU instances (70% cheaper than AWS)
  • Storage: S3-compatible object storage

The whole pipeline runs in 3-8 minutes depending on model size.

What took me 4-8 hours manually now takes under 10 minutes.


Real-World Results

Here are some standout cases from the 100+ models I've compressed:

Case 1: YOLOv8-Large (Object Detection)

| Metric | Before | After | Change |
|---|---|---|---|
| Size | 980MB | 245MB | 75% smaller |
| Latency | 85ms | 33ms | 2.5x faster |
| mAP | 98.2% | 97.4% | 0.8% loss |

Deployment: Jetson Nano, real-time drone detection

Case 2: BERT-Base (NLP)

| Metric | Before | After | Change |
|---|---|---|---|
| Size | 440MB | 110MB | 75% smaller |
| Latency | 120ms | 45ms | 2.7x faster |
| Accuracy | 94.3% | 93.8% | 0.5% loss |

Deployment: Raspberry Pi 4, on-device sentiment analysis

Case 3: MobileNetV3 (Image Classification)

| Metric | Before | After | Change |
|---|---|---|---|
| Size | 21MB | 5.2MB | 75% smaller |
| Latency | 18ms | 7ms | 2.6x faster |
| Top-1 Acc | 75.2% | 74.6% | 0.6% loss |

Deployment: Android app, 100M+ users

Pattern: Consistent 4x compression with 2-3x speedup and <1% accuracy loss.


The Bigger Picture

Edge AI is exploding. The market is projected to grow from $24.9B in 2025 to $118.69B by 2033.

But here's the problem: most AI is still trained in the cloud and deployed in the cloud.

The real world needs AI at the edge:

  • 🚁 Drones that detect objects in real-time
  • 🤖 Robots that navigate autonomously
  • 🏥 Medical devices that process data locally (HIPAA compliance)
  • 📷 Smart cameras that work without internet
  • 🔋 IoT sensors that run for years on battery

And getting models onto these devices is still way too hard.


What I Wish Someone Had Told Me 6 Months Ago

1. Start with quantization, not pruning

It's the 80/20 solution. You'll get 4x compression with minimal effort.

2. Always use ONNX as an intermediate format

It saves so much pain with framework conversions.

3. Calibration dataset quality > model architecture

Use 1000+ representative samples from your actual validation set.

4. Hardware-specific optimization is non-optional

Generic models won't cut it. Optimize for your target hardware.

5. Measure everything on real hardware

Latency, throughput, memory, power consumption. Not just in theory.

6. Don't trust the accuracy number from quantization tools

Always validate on your actual test set with your actual metrics.


Key Takeaways

If you're deploying models to edge devices, here's your action plan:

# The compression playbook
def optimize_for_edge(model, target_hardware):
    # 1. Convert to ONNX first (universal format)
    onnx_model = convert_to_onnx(model)

    # 2. Apply INT8 quantization with good calibration
    calibration_data = sample_validation_set(n=1000)
    quantized_model = quantize_int8(onnx_model, calibration_data)

    # 3. Hardware-specific compilation
    if target_hardware == 'jetson':
        final_model = compile_tensorrt(quantized_model)
    elif target_hardware == 'raspberry_pi':
        final_model = compile_tflite(quantized_model)

    # 4. Validate on real hardware
    metrics = benchmark_on_device(final_model, target_hardware)

    return final_model, metrics

Expected results:

  • 4x smaller models
  • 2-3x faster inference
  • <1% accuracy loss

Try It Yourself

If you're facing similar challenges, I built this into a platform that automates the entire process.

🔗 Try it here: https://edge-ai-alpha.vercel.app/

Free tier: 5 compressions/month, no credit card needed.

Supports:

  • ✅ PyTorch, TensorFlow, ONNX models
  • ✅ Export to TFLite, CoreML, TensorRT, ONNX Runtime
  • ✅ Automatic calibration dataset generation
  • ✅ Hardware-specific optimization profiles

What's Next?

I'm working on:

  • 🔧 Automated hyperparameter tuning for compression
  • 🌐 Federated learning support (train on edge, aggregate in cloud)
  • 🎛️ Custom hardware profiles (add your own device specs)
  • 📦 Multi-model ensemble optimization

Let's Discuss!

What's your experience with model deployment?

  • Ever tried quantization? How did it go?
  • What's your biggest pain point with edge AI?
  • Any horror stories to share? 😅

Drop a comment below! I'll try to respond to everyone.


Thanks for reading! If you found this helpful, consider:

  • ❤️ Reacting with a ❤️ or 🦄
  • 💬 Sharing your own edge deployment experiences
  • 🔖 Bookmarking for later reference

Happy optimizing! 🚀

— Patel Darshit
