I Spent 3 Months Compressing AI Models So You Don't Have To – Here's What I Learned

TL;DR: Deployed 100+ AI models to edge devices. Discovered the hard way that manual optimization sucks. Built a tool to automate it. Sharing everything I learned.


The Problem Nobody Talks About

You spend weeks training the perfect computer vision model. 98% accuracy. Beautiful loss curves. Your team is celebrating.

Then someone asks: "Can we run this on a Jetson Nano?"

And suddenly, your 2GB PyTorch masterpiece becomes a 500MB problem.

This was me six months ago. I had a YOLOv8 model that needed to run on edge hardware for a robotics project. The model worked perfectly in the cloud. On a Jetson Nano? 12 FPS. Unusable.

I needed 30+ FPS for real-time detection.


The Manual Optimization Rabbit Hole

Here's what I tried first (spoiler: it was painful):

Attempt 1: TensorFlow Lite Conversion

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

Result: Model size went from 980MB → 780MB. Not enough. Inference time barely improved.

Time wasted: 8 hours fighting compatibility issues between TensorFlow versions.


Attempt 2: Manual INT8 Quantization

import numpy as np

converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

def representative_dataset():
    # Random noise as calibration data, which turned out to be the real problem (see below)
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter.representative_dataset = representative_dataset
tflite_model = converter.convert()

Result: Model crashed on inference. Accuracy dropped to 73% (from 98%). Completely broken.

Time wasted: Another 12 hours debugging why my calibration dataset wasn't working.


Attempt 3: ONNX Runtime + TensorRT

This one actually worked. But here's what it took:

  1. Convert PyTorch → ONNX (3 hours, fighting version conflicts)
  2. Optimize ONNX graph (2 hours, manual layer fusion)
  3. Convert to TensorRT engine (4 hours, hardware-specific tuning)
  4. Profile and fix precision issues (6 hours of trial and error)

Total time: 3 days for ONE model.
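Just the engine build (step 3), for reference, looked roughly like this. It's a sketch with the TensorRT 8.x Python API, not my exact script, and the file names are placeholders:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Parse the ONNX graph into a TensorRT network
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

# Build an FP16 engine (this is the slow, hardware-specific part)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine_bytes = builder.build_serialized_network(network, config)

with open("model.engine", "wb") as f:
    f.write(engine_bytes)

That snippet hides the real pain: the parser errors and precision mismatches in steps 2 and 4 are where the hours actually went.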

Final result:

  • ✅ Size: 245MB (75% reduction)
  • ✅ Latency: 33ms (2.5x faster)
  • ✅ Accuracy: 97.2% (0.8% loss)

It worked! But I had 15 more models to optimize.


The "Oh Crap" Moment

At that rate, optimizing all my models would take 45 days of full-time work.

I started Googling for tools. Found OctoAI. Perfect solution.

Then I read:

"OctoAI acquired by NVIDIA. Platform shutting down October 2024."

Great. 😑

Neural Magic? Enterprise-only. $50K minimum.

Edge Impulse? Microcontroller focus, not for my use case.

There was no affordable, automated solution for regular developers.


What I Learned From 100+ Model Compressions

Over the next 3 months, I compressed over 100 different models. Here's the non-obvious stuff nobody tells you:

1. INT8 Quantization is Magic (When Done Right)

Average results across 100+ models:

  • 📦 Compression: 4x smaller
  • ⚡ Speedup: 2-3x faster
  • 🎯 Accuracy loss: 0.5-1.5%

But here's the catch: calibration dataset matters more than model architecture.

Bad calibration = 10% accuracy loss 😱

Good calibration = 0.5% accuracy loss 🎉

My calibration strategy:

# Use ~1000 representative samples from the real validation set, not random noise
# (assuming validation_set is a tf.data.Dataset of (image, label) pairs)
calibration_data = validation_set.shuffle(10_000, seed=42).take(1000)

def representative_dataset_gen():
    for image, _ in calibration_data:
        yield [tf.expand_dims(tf.cast(image, tf.float32), 0)]

converter.representative_dataset = representative_dataset_gen

Game changer. Accuracy loss went from 3% to 0.5%.


2. Not All Layers Should Be Quantized

I was quantizing everything to INT8. Rookie mistake.

Some layers (especially first conv and last FC layers) are super sensitive to quantization.

Better approach: Mixed precision quantization

| Layer Type | Precision | Reason |
|---|---|---|
| First Conv | FP16 | Sensitive to input variations |
| Middle Conv Layers | INT8 | Biggest size savings |
| Attention Layers | FP16 | Critical for accuracy |
| Final FC Layer | FP16 | Output quality matters |
| Batch Norm | INT8 | Can be fused anyway |

Result: 6x compression with 0.3% accuracy loss instead of 3%.
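One way to express "quantize everything except the sensitive layers" is ONNX Runtime's static quantizer with nodes_to_exclude. This is a minimal sketch: the node names and calibration_samples are placeholders you'd pull from your own graph and validation set.

from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class CalibReader(CalibrationDataReader):
    """Feeds real validation samples to the quantizer."""
    def __init__(self, samples, input_name="input"):
        # samples: preprocessed numpy arrays matching the model input shape
        self._iter = iter([{input_name: s} for s in samples])

    def get_next(self):
        return next(self._iter, None)

# Hypothetical node names; grab the real first conv / last FC node names from the ONNX graph
sensitive_nodes = ["Conv_0", "Gemm_last"]

quantize_static(
    "model.onnx",
    "model_int8_mixed.onnx",
    CalibReader(calibration_samples),   # ~1000 real validation samples
    weight_type=QuantType.QInt8,
    activation_type=QuantType.QInt8,
    nodes_to_exclude=sensitive_nodes,   # these layers stay in float precision
)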


3. Framework Conversion is a Minefield

Success rates I observed:

  • PyTorch → TFLite directly: 60% success
  • PyTorch → ONNX → TFLite: 85% success
  • TensorFlow → ONNX → TFLite: 90% success

The trick: Always go through ONNX as an intermediate step.

import torch
import onnxruntime as ort

# Export PyTorch to ONNX (model and dummy_input come from your training code)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=13,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)

# Optimize ONNX graph
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model_optimized.onnx"

session = ort.InferenceSession("model.onnx", sess_options)

Free 15-20% speedup just from graph fusion.


4. Hardware-Specific Optimization is Non-Negotiable

Generic optimization: 2x speedup

Hardware-specific optimization: 5x speedup

Here's what works for each platform:

| Hardware | Best Framework | Key Optimization |
|---|---|---|
| NVIDIA Jetson | TensorRT | FP16 + layer fusion |
| Raspberry Pi | TFLite + XNNPACK | INT8 quantization |
| iOS (iPhone) | CoreML | Neural Engine offload |
| Android | TFLite + NNAPI | GPU delegate |
| Edge TPU | TFLite + Edge TPU compiler | INT8 required |

Don't try to use the same optimized model everywhere. It won't work.
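To make that concrete, here's how differently the same network even gets loaded on a Raspberry Pi versus an Edge TPU (a minimal sketch; file names are placeholders):

import numpy as np
import tensorflow as tf

# Raspberry Pi: plain TFLite interpreter. XNNPACK is the default CPU delegate in
# recent TFLite builds, and num_threads matters a lot on the Pi's 4 cores.
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite", num_threads=4)

# Edge TPU: the model must first go through the Edge TPU compiler, then be loaded
# with the libedgetpu delegate instead:
# interpreter = tf.lite.Interpreter(
#     model_path="model_int8_edgetpu.tflite",
#     experimental_delegates=[tf.lite.experimental.load_delegate("libedgetpu.so.1")],
# )

interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])  # match the quantized input dtype
interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()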


5. Pruning is Overrated (for Most Use Cases)

Everyone talks about pruning. I tried it extensively:

Structured pruning:

  • ✅ 30% size reduction
  • ❌ 5% accuracy loss
  • ❌ Marginal speedup on real hardware

Unstructured pruning:

  • ✅ 50% size reduction
  • ❌ No speedup on real hardware (sparse ops aren't optimized)
  • ❌ Complicated to maintain

My take: Quantization first. Pruning only if you REALLY need that extra 20% size reduction.
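If you still want to experiment, PyTorch ships pruning utilities in torch.nn.utils.prune. Here's a minimal sketch of both flavors on a stock ResNet-18 (not my production setup):

import torch.nn as nn
import torch.nn.utils.prune as prune
import torchvision

model = torchvision.models.resnet18(weights=None)

for module in model.modules():
    if isinstance(module, nn.Conv2d):
        # Structured: zero out 30% of output channels by L2 norm. You only see a real
        # speedup if those channels are then physically removed from the graph.
        prune.ln_structured(module, name="weight", amount=0.3, n=2, dim=0)
        # Unstructured alternative: zero 50% of weights by magnitude. Smaller after
        # compression, but no speedup unless the runtime has sparse kernels.
        # prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the pruning mask into the weights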


The Breaking Point

After manually optimizing 47 models, I hit a wall.

Each model took 4-8 hours. I was burning out. And I still had:

  • ❌ 15 models for the robotics project
  • ❌ 22 models for a computer vision pipeline
  • ❌ 18 models for a client's IoT deployment

I did the math: 220+ hours of manual work remaining.

That's when I decided to automate it.


Building the Solution

I built a pipeline that handles everything I was doing manually:

Step 1: Model Analysis

import numpy as np
import onnx
import tensorflow as tf
import torch

def analyze_model(model_path):
    """Auto-detect framework and analyze architecture"""

    # Detect framework from the file extension and count parameters
    # (parameter counting is framework-specific)
    if model_path.endswith(('.pt', '.pth')):
        framework = 'pytorch'
        model = torch.load(model_path)
        total_params = sum(p.numel() for p in model.parameters())
    elif model_path.endswith('.h5'):
        framework = 'tensorflow'
        model = tf.keras.models.load_model(model_path)
        total_params = model.count_params()
    elif model_path.endswith('.onnx'):
        framework = 'onnx'
        model = onnx.load(model_path)
        total_params = sum(int(np.prod(init.dims)) for init in model.graph.initializer)
    else:
        raise ValueError(f"Unsupported model format: {model_path}")

    # Identify quantization-sensitive layers (first conv, attention, final FC)
    sensitive_layers = identify_sensitive_layers(model)

    return {
        'framework': framework,
        'total_params': total_params,
        'sensitive_layers': sensitive_layers
    }
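identify_sensitive_layers is where the mixed-precision table from section 2 gets applied. Here's a simplified sketch for the PyTorch case (the real version also has to cover attention layers and the other frameworks):

import torch.nn as nn

def identify_sensitive_layers(model):
    """Heuristic: flag the first conv and the last FC layer to keep in higher precision."""
    conv_names = [name for name, m in model.named_modules() if isinstance(m, nn.Conv2d)]
    fc_names = [name for name, m in model.named_modules() if isinstance(m, nn.Linear)]
    sensitive = []
    if conv_names:
        sensitive.append(conv_names[0])   # first conv: sensitive to input variations
    if fc_names:
        sensitive.append(fc_names[-1])    # final FC: output quality matters
    return sensitive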

Step 2: Automated Compression

def compress_model(model, target_hardware, compression_level):
    """Apply mixed precision quantization"""

    # Generate calibration dataset
    calibration_data = generate_calibration_dataset(model)

    # Define quantization config
    if compression_level == 'aggressive':
        default_precision = 'int8'
        sensitive_precision = 'int8'   # quantize everything, accept more accuracy risk
    elif compression_level == 'balanced':
        default_precision = 'int8'
        sensitive_precision = 'fp16'   # sensitive layers (first conv, attention, last FC) stay FP16
    else:  # conservative
        default_precision = 'fp16'
        sensitive_precision = 'fp16'

    # Apply mixed precision
    quantized_model = apply_mixed_precision(
        model,
        default_precision=default_precision,
        sensitive_layers_precision=sensitive_precision,
        calibration_data=calibration_data
    )

    # Hardware-specific compilation
    if target_hardware == 'jetson':
        final_model = compile_tensorrt(quantized_model)
    elif target_hardware == 'raspberry_pi':
        final_model = compile_tflite_xnnpack(quantized_model)
    elif target_hardware == 'mobile':
        final_model = compile_coreml(quantized_model)
    else:
        raise ValueError(f"Unsupported target hardware: {target_hardware}")

    return final_model

Step 3: Validation

def validate_compression(original_model, compressed_model, test_dataset):
    """Benchmark and validate accuracy"""

    # Size comparison
    original_size = get_model_size(original_model)
    compressed_size = get_model_size(compressed_model)
    compression_ratio = original_size / compressed_size

    # Latency benchmark
    original_latency = benchmark_latency(original_model, test_dataset)
    compressed_latency = benchmark_latency(compressed_model, test_dataset)
    speedup = original_latency / compressed_latency

    # Accuracy validation
    original_acc = evaluate_accuracy(original_model, test_dataset)
    compressed_acc = evaluate_accuracy(compressed_model, test_dataset)
    accuracy_loss = original_acc - compressed_acc

    return {
        'compression_ratio': compression_ratio,
        'speedup': speedup,
        'accuracy_loss': accuracy_loss
    }
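For reference, benchmark_latency is nothing fancy. Here's a simplified sketch with warm-up runs and a median instead of a mean:

import time
import numpy as np

def benchmark_latency(model, test_dataset, warmup=10, runs=100):
    """Median wall-clock latency per inference in milliseconds."""
    samples = list(test_dataset)[:warmup + runs]
    timings = []
    for i, sample in enumerate(samples):
        start = time.perf_counter()
        model(sample)                # assumes a callable wrapper around the runtime
        elapsed = time.perf_counter() - start
        if i >= warmup:              # discard warm-up runs (caches, JIT, clock ramp-up)
            timings.append(elapsed)
    return float(np.median(timings)) * 1000.0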

Tech stack:

  • Backend: FastAPI (async job queue with Redis)
  • ML: ONNX Runtime, TFLite, PyTorch, llama.cpp
  • Compute: RunPod GPU instances (70% cheaper than AWS)
  • Storage: S3-compatible object storage

The whole pipeline runs in 3-8 minutes depending on model size.

What took me 4-8 hours manually now takes under 10 minutes.


Real-World Results

Here are some standout cases from the 100+ models I've compressed:

Case 1: YOLOv8-Large (Object Detection)

| Metric | Before | After | Change |
|---|---|---|---|
| Size | 980MB | 245MB | 75% smaller |
| Latency | 85ms | 33ms | 2.5x faster |
| mAP | 98.2% | 97.4% | 0.8% loss |

Deployment: Jetson Nano, real-time drone detection

Case 2: BERT-Base (NLP)

| Metric | Before | After | Change |
|---|---|---|---|
| Size | 440MB | 110MB | 75% smaller |
| Latency | 120ms | 45ms | 2.7x faster |
| Accuracy | 94.3% | 93.8% | 0.5% loss |

Deployment: Raspberry Pi 4, on-device sentiment analysis

Case 3: MobileNetV3 (Image Classification)

| Metric | Before | After | Change |
|---|---|---|---|
| Size | 21MB | 5.2MB | 75% smaller |
| Latency | 18ms | 7ms | 2.6x faster |
| Top-1 Acc | 75.2% | 74.6% | 0.6% loss |

Deployment: Android app, 100M+ users

Pattern: Consistent 4x compression with 2-3x speedup and <1% accuracy loss.


The Bigger Picture

Edge AI is exploding. The market is projected to grow from $24.9B in 2025 to $118.69B by 2033.

But here's the problem: most AI is still trained in the cloud and deployed in the cloud.

The real world needs AI at the edge:

  • 🚁 Drones that detect objects in real-time
  • 🤖 Robots that navigate autonomously
  • 🏥 Medical devices that process data locally (HIPAA compliance)
  • 📷 Smart cameras that work without internet
  • 🔋 IoT sensors that run for years on battery

And getting models onto these devices is still way too hard.


What I Wish Someone Had Told Me 6 Months Ago

1. Start with quantization, not pruning

It's the 80/20 solution. You'll get 4x compression with minimal effort.

2. Always use ONNX as an intermediate format

It saves so much pain with framework conversions.

3. Calibration dataset quality > model architecture

Use 1000+ representative samples from your actual validation set.

4. Hardware-specific optimization is non-optional

Generic models won't cut it. Optimize for your target hardware.

5. Measure everything on real hardware

Latency, throughput, memory, power consumption. Not just in theory.

6. Don't trust the accuracy number from quantization tools

Always validate on your actual test set with your actual metrics.


Key Takeaways

If you're deploying models to edge devices, here's your action plan:

# The compression playbook
def optimize_for_edge(model, target_hardware):
    # 1. Convert to ONNX first (universal format)
    onnx_model = convert_to_onnx(model)

    # 2. Apply INT8 quantization with good calibration
    calibration_data = sample_validation_set(n=1000)
    quantized_model = quantize_int8(onnx_model, calibration_data)

    # 3. Hardware-specific compilation
    if target_hardware == 'jetson':
        final_model = compile_tensorrt(quantized_model)
    elif target_hardware == 'raspberry_pi':
        final_model = compile_tflite(quantized_model)

    # 4. Validate on real hardware
    metrics = benchmark_on_device(final_model, target_hardware)

    return final_model, metrics

Expected results:

  • 4x smaller models
  • 2-3x faster inference
  • <1% accuracy loss

Try It Yourself

If you're facing similar challenges, I built this into a platform that automates the entire process.

🔗 Try it here: https://edge-ai-alpha.vercel.app/

Free tier: 5 compressions/month, no credit card needed.

Supports:

  • ✅ PyTorch, TensorFlow, ONNX models
  • ✅ Export to TFLite, CoreML, TensorRT, ONNX Runtime
  • ✅ Automatic calibration dataset generation
  • ✅ Hardware-specific optimization profiles

What's Next?

I'm working on:

  • 🔧 Automated hyperparameter tuning for compression
  • 🌐 Federated learning support (train on edge, aggregate in cloud)
  • 🎛️ Custom hardware profiles (add your own device specs)
  • 📦 Multi-model ensemble optimization

Let's Discuss!

What's your experience with model deployment?

  • Ever tried quantization? How did it go?
  • What's your biggest pain point with edge AI?
  • Any horror stories to share? 😅

Drop a comment below! I'll try to respond to everyone.


Thanks for reading! If you found this helpful, consider:

  • ❤️ Reacting with a ❤️ or 🦄
  • 💬 Sharing your own edge deployment experiences
  • 🔖 Bookmarking for later reference

Happy optimizing! 🚀

— Patel Darshit
