Dechun Wang
Stop Fine-Tuning Blindly: When to Fine-Tune—and When Not to Touch Model Weights

Fine-Tuning Is a Knife, Not a Hammer

Fine-tuning has a reputation problem.

Some people treat it like magic: “Just fine-tune and the model will understand our domain.”

Others treat it like a sin: “Never touch weights, it’s all prompt engineering now.”

Both are wrong.

Fine-tuning is a precision tool. Used well, it turns a generic model into a specialist. Used badly, it burns GPU budgets, bakes in bias, and ships a model that performs worse than the base.

This is a field guide: what types of fine-tuning exist, what they cost, how to run them, and the traps that quietly ruin outcomes.


1) The Real Taxonomy of Fine-Tuning

There are multiple ways to classify fine-tuning. The cleanest way to slice it: what you change (training scope), what signal you train on, and what modality you're adapting.

1.1 By training scope: Full FT vs PEFT

Full fine-tuning (Full FT)

Definition: update all model weights so the model fully adapts to the new task.

Traits:

  • Maximum flexibility, maximum cost
  • Requires strong data quality and careful regularisation
  • Risk: catastrophic forgetting (the model “forgets” general abilities)

When it makes sense:

  • You have a stable task and a solid dataset (usually 10k–100k+ high-quality samples)
  • You can afford experiments and regression testing
  • You need deeper behavioural change than PEFT can deliver

Parameter-Efficient Fine-Tuning (PEFT)

Definition: freeze most weights and train small, targeted parameters.

You get most of the gains with a fraction of the cost.

PEFT subtypes you’ll actually see in production:

(A) Adapters

Insert small modules inside transformer blocks; train only those adapter weights. Typically a few percent of the total parameters.

(B) Prompt tuning (soft prompts / prefix tuning)

Train learnable “prompt vectors” (or a prefix) that steer behaviour.

  • Soft prompts: continuous vectors
  • Hard prompts: discrete tokens (rarely “trained” in the same way)

(C) LoRA (Low-Rank Adaptation)

LoRA is the workhorse. It decomposes weight updates into low-rank matrices:

$$
\Delta W = BA, \quad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d,k)
$$

Why it wins:

  • You store only the low-rank factors A and B (tiny compared with the full weight matrix)
  • Easy to swap adapters per task
  • Strong performance per compute
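
To make the decomposition concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer (for intuition only; in practice the peft library injects these modules for you, and the class name LoRALinear is just for this example):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update ΔW = BA."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # freeze W (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # A in R^{r x k}, small random init
        self.B = nn.Parameter(torch.zeros(d, r))          # B in R^{d x r}, zero init so ΔW starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # output = x W^T + x (BA)^T * scale; only A and B receive gradients
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

At inference time the product BA can be folded back into the base weight, which is exactly what peft's merge_and_unload does.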

(D) QLoRA

QLoRA runs LoRA on a quantised base model (often 4-bit), slashing VRAM requirements and making “big-ish” fine-tuning viable on consumer GPUs.


1.2 By learning signal: SFT, RLHF, contrastive (and friends)

Supervised Fine-Tuning (SFT)

Train on labelled input-output pairs. This is the default for:

  • classification
  • extraction
  • instruction following (instruction tuning)
  • style / tone adaptation

Preference optimisation (RLHF / DPO / variants)

Classic RLHF pipeline: SFT → reward model → policy optimisation (e.g., PPO).

In practice, many teams now use direct preference optimisation (DPO)-style training because it’s simpler operationally, but the concept is the same: align the model to preferences.
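
For reference, the standard DPO objective on a preference pair (prompt x, chosen y_w, rejected y_l) pushes the policy directly towards the chosen response relative to a frozen reference model:

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where beta controls how far the policy may drift from the reference. No reward model, no PPO rollouts, which is where the operational simplicity comes from.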

Contrastive fine-tuning

Useful when you care about representations (retrieval, similarity, embedding quality), less common for everyday text generation.
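
The typical objective is an InfoNCE-style contrastive loss: pull an anchor a towards its positive p and push it away from negatives n_i (usually the other samples in the batch):

$$
\mathcal{L} = -\log \frac{\exp\big(\mathrm{sim}(a, p)/\tau\big)}{\exp\big(\mathrm{sim}(a, p)/\tau\big) + \sum_i \exp\big(\mathrm{sim}(a, n_i)/\tau\big)}
$$

where sim is typically cosine similarity and tau is a temperature hyperparameter.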


1.3 By modality: language, vision, multimodal

  • NLP: BERT/GPT/T5-style models; instruction tuning and chain-of-thought-style supervision are common
  • Vision: ResNet/ViT; progressive unfreezing and strong augmentation matter
  • Multimodal: CLIP/BLIP/Flamingo-like; biggest challenge is aligning representations across modalities

2) When Fine-Tuning Actually Pays Off

Fine-tuning shines in three situations:

2.1 Your domain language is not optional

Example: finance risk text. If the base model misreads terms like “short”, “subprime”, “haircut”, it will miss signals no matter how clever the prompt is.

2.2 Your task needs consistent behaviour, not one-off brilliance

A model that produces “sometimes great” answers is a nightmare in production. Fine-tuning can stabilise behaviour and reduce prompt complexity.

2.3 Your deployment requires control

On-prem constraints, latency budgets, data residency: self-hosted models + PEFT are often the only workable path.


3) When You Should NOT Fine-Tune

Here are the expensive mistakes:

  • <100 labelled samples: you’ll overfit or learn noise
  • task changes weekly: your fine-tune becomes technical debt
  • you can solve it with retrieval: if the problem is “missing knowledge,” do RAG first
  • you can’t evaluate properly: if you can’t measure, don’t train

4) The Fine-Tuning Workflow That Survives Production

Forget “train.py and vibes.” A real pipeline has repeatable stages.

4.1 Environment

Core stack:

  • PyTorch
  • Transformers + Datasets
  • Accelerate
  • PEFT
  • Experiment tracking (Weights & Biases or MLflow)

4.2 Data

This is where most projects win or lose.

Minimum checklist:

  • label consistency (do two annotators agree?)
  • balanced distribution (avoid 10:1 class collapse unless you correct for it)
  • no leakage (train/val split must be clean)
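
For the last two points, a cheap sanity check before any training run is to hash the raw texts and measure overlap across splits (a minimal sketch; train_texts and val_texts stand in for your own raw text fields):

import hashlib

def text_hashes(texts):
    # hash normalised text so exact and trivially-edited duplicates surface
    return {hashlib.md5(t.strip().lower().encode()).hexdigest() for t in texts}

overlap = text_hashes(train_texts) & text_hashes(val_texts)
print(f"duplicated samples across splits: {len(overlap)}")  # should be ~0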

4.3 Model config

  • pick base model
  • pick tuning method (LoRA vs QLoRA vs full)
  • decide what gets trained, what stays frozen

4.4 Training loop

  • forward → loss → backward
  • gradient clipping
  • mixed precision when appropriate
  • periodic eval
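
If you ever step outside Trainer, the hand-rolled version of that loop looks roughly like this (a sketch assuming you already have model, optimizer, train_loader, num_epochs and an evaluate_on_validation helper of your own):

import torch

scaler = torch.cuda.amp.GradScaler()                  # mixed-precision loss scaling
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():               # forward in reduced precision
            loss = model(**batch).loss
        scaler.scale(loss).backward()                 # backward
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        scaler.step(optimizer)
        scaler.update()
    evaluate_on_validation(model)                     # periodic eval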

4.5 Evaluation + export

  • validate on held-out set
  • measure robustness and regression
  • export artefacts (base + adapter weights)

5) Practical Code: SFT + LoRA (PEFT) with Transformers

Below is a slightly tweaked version of the standard Hugging Face flow, tuned for clarity and real-world guardrails.

# pip install transformers datasets accelerate peft evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
import evaluate
import numpy as np

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=384)

tokenised = dataset.map(preprocess, batched=True)
tokenised = tokenised.remove_columns(["text"]).rename_column("label", "labels")
tokenised.set_format("torch")

base = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # tweak per model architecture
    bias="none",
    task_type="SEQ_CLS",
)
model = get_peft_model(base, lora_cfg)

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return metric.compute(predictions=preds, references=labels)

args = TrainingArguments(
    output_dir="./ft_out",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    logging_steps=100,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["test"],
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()
model.save_pretrained("./ft_out/lora_adapter")

What’s different (and why it matters):

  • max_length trimmed to 384 to reduce waste
  • LoRA targets are explicit (you should verify for your model)
  • fp16 enabled, batch sizes set for typical GPUs

6) QLoRA in Practice: When VRAM Is Your Bottleneck

QLoRA is the “I don’t have an A100” option.

Use it when:

  • your model is too big to fine-tune in full precision
  • you want LoRA-level results with drastically less memory
  • you accept slightly more complexity in setup

Operational note: QLoRA is sensitive to:

  • quantisation config
  • optimizer choice
  • batch size / gradient accumulation
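
A typical setup looks like this (a sketch: the model id is a placeholder, target_modules must match your architecture, and the hyperparameters are starting points, not recommendations):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NF4 quantisation from the QLoRA paper
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # store in 4-bit, compute in bf16
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder model id
    quantization_config=bnb_cfg,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base) # recasts norms/embeddings for stable k-bit training

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # verify module names for your model
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

From here, training proceeds exactly like the LoRA example in section 5, usually with gradient accumulation to compensate for small per-device batch sizes.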

7) Hardware Planning (The Boring Part That Saves You £££)

A simple rule-of-thumb table (very rough, but directionally useful):

| Model size | Practical approach | GPU class | Why |
| --- | --- | --- | --- |
| <1B | Full FT or LoRA | 24GB consumer GPU | Cheap experiments |
| 1–10B | LoRA / QLoRA | 40–80GB | Stable training & eval |
| >10B | QLoRA or multi-GPU | 80GB+ (multi-card) | Memory + throughput |
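
A back-of-envelope check behind those numbers (a sketch; the bytes-per-parameter figures are rough rules of thumb that ignore activations and framework overhead):

def estimate_vram_gb(params_billion: float, mode: str) -> float:
    bytes_per_param = {
        "full_ft_fp16": 16,   # fp16 weights + fp16 grads + fp32 Adam states
        "lora_fp16": 2.5,     # frozen fp16 base + small adapter/optimizer overhead
        "qlora_4bit": 1.0,    # 4-bit base + adapter + quantisation overhead
    }[mode]
    return params_billion * bytes_per_param

print(estimate_vram_gb(7, "full_ft_fp16"))  # ~112 GB -> multi-GPU territory
print(estimate_vram_gb(7, "qlora_4bit"))    # ~7 GB before activations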

If your goal is a production system, plan for:

  • checkpoints (storage balloons fast)
  • inference latency testing (p50/p95/p99)
  • versioning (base + adapters + configs)
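
Latency percentiles in particular are cheap to measure early (a sketch; run_inference and sample_requests are placeholders for your serving path and a representative slice of real traffic):

import time
import numpy as np

latencies_ms = []
for request in sample_requests:
    start = time.perf_counter()
    run_inference(request)                       # hypothetical: your model call or endpoint
    latencies_ms.append((time.perf_counter() - start) * 1000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.1f}ms  p95={p95:.1f}ms  p99={p99:.1f}ms")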

8) Monitoring: How to Detect Failure Early

Track:

  • train vs val loss divergence (overfitting)
  • task metric (F1/AUC/accuracy) over time
  • gradient norms (explosions or vanishing)
  • GPU utilisation + VRAM (to catch bottlenecks)

Early stopping is not optional in small-data regimes.
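
With the Trainer setup from section 5 it is one extra callback (transformers ships EarlyStoppingCallback; it needs an eval strategy, load_best_model_at_end and metric_for_best_model, all of which are already set above):

from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["test"],
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # stop after 2 evals without improvement
)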


9) The Pitfalls That Kill Fine-Tuning Projects

9.1 Data leakage

Validation looks amazing, test collapses.

Fix:

  • group-aware splits
  • time-based splits for temporal data
  • deduplicate aggressively
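
For the group-aware case, scikit-learn's GroupShuffleSplit does the heavy lifting (a sketch assuming each sample carries a group id such as a user or source-document id, held in a groups array alongside texts and labels):

from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(texts, labels, groups=groups))

# No group appears on both sides, so near-duplicates from the same
# user/document cannot leak from train into validation.
assert {groups[i] for i in train_idx}.isdisjoint({groups[i] for i in val_idx})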

9.2 Class imbalance

Model learns the majority class.

Fix:

  • weighting
  • resampling
  • metric choice (F1 > accuracy in many cases)
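
Class weighting with the Trainer API means overriding compute_loss (a sketch; class_weights would come from your own label counts, e.g. inverse class frequency):

import torch
from transformers import Trainer

class WeightedTrainer(Trainer):
    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights  # e.g. torch.tensor([1.0, 9.0]) for a 9:1 imbalance

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(outputs.logits.device)
        )
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss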

9.3 “Bigger model = better”

On small data, bigger models can overfit harder.

Fix:

  • match model size to data
  • prefer PEFT
  • regularise

9.4 Ignoring deployment constraints

A model that hits 0.96 AUC but misses latency and memory budgets is a demo, not a product.

Fix:

  • benchmark early
  • export-friendly formats (ONNX/TensorRT) if needed
  • distil if latency matters
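
If you take the ONNX path with a LoRA model, merge the adapter into the base weights first and export the merged model (a sketch; shapes, opset and file name are placeholders, and Hugging Face's optimum library is the more battle-tested route for transformer exports):

import torch

merged = model.merge_and_unload()     # fold the LoRA deltas back into the base weights
merged.eval()

dummy = tokenizer("example input", return_tensors="pt")
torch.onnx.export(
    merged,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=17,
)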

10) A Decision Cheat Sheet

Use this quick chooser:

  • Data < 100 → prompt + retrieval + synthetic data
  • 100–1,000 → LoRA / adapters
  • 1,000–10,000 → LoRA or full FT (small LR)
  • 10,000+ → full FT can make sense (if eval + regression are solid)
  • VRAM tight → QLoRA
  • Need preference alignment → DPO/RLHF-style preference training
  • Task changes often → avoid weight updates, design workflows instead

Final Take

Successful fine-tuning isn’t “a training run.”

It’s a loop:
data → training → evaluation → deployment constraints → monitoring → back to data.

If you treat it as an engineering system (not a one-off experiment), PEFT methods like LoRA/QLoRA give you the best tradeoff curve in 2026: strong gains, manageable cost, and deployable artefacts.

And that’s what you want: not a model that’s “smart in a notebook,” but a model that’s reliable in production.
