LoRA and QLoRA: A Practical Guide to Fine-tuning LLMs on a Budget

Full fine-tuning a 70B-parameter model requires gradient storage for every single weight. In bfloat16, that is 140 GB for the weights and another 140 GB for gradients, before optimizer states, which with Adam double or triple the requirement again. That puts full fine-tuning firmly in the "multi-node cluster with InfiniBand" category for serious models. LoRA and QLoRA are the pragmatic answer: they get you 90-95% of the quality at a fraction of the cost, on hardware that actually exists in a normal engineering budget.

This guide is not a theory overview. It covers everything you need to actually run a fine-tuning job and have a reasonable chance of it working: the math intuition behind low-rank adaptation, how QLoRA extends it with 4-bit quantization, how to pick your hyperparameters without guessing, how to structure your data, how to evaluate correctly, and — critically — what failure modes to expect and how to handle them. If you are still deciding whether fine-tuning is the right call for your use case, read our companion article on fine-tuning vs RAG vs prompting first. If you have already decided and just want the code, skip straight to the hyperparameter section.

The stack throughout this guide: HuggingFace PEFT, TRL (SFTTrainer), bitsandbytes for quantization, and optionally Axolotl or Unsloth for production-grade training pipelines.

The Intuition Behind LoRA

The key insight in the Hu et al. 2021 LoRA paper is that the weight updates during fine-tuning have a low intrinsic rank. When you fine-tune a model on a narrow domain task, you are not changing everything — you are nudging the model in a relatively constrained direction in weight space.

Instead of directly updating a weight matrix W (which is large — for a 7B model, a typical projection matrix might be 4096 × 4096 = 16 million parameters), LoRA parameterizes the update as the product of two low-rank matrices:

W' = W + ΔW = W + BA

where:
  B ∈ R^(d × r)   — projects from rank r up to output dimension d
  A ∈ R^(r × k)   — projects from input dimension k down to rank r
  r << min(d, k)  — the rank, typically 4–64

Trainable parameters: r × (d + k)  instead of  d × k

For a 4096×4096 matrix with r=16:
  Full fine-tuning:  16,777,216 params
  LoRA r=16:         131,072 params  (0.78% of original)

At initialization, A is drawn from a Gaussian distribution and B is set to zero, so ΔW = BA = 0 at the start of training — meaning the model begins from the pretrained weights with no perturbation. Only A and B are updated during training. W stays frozen.

The output of an adapted layer becomes:

h = Wx + (α/r) × BAx

where α is the LoRA scaling factor (lora_alpha in the config).
The (α/r) term controls how much the adapter's contribution is
scaled relative to the frozen weights.

In practice, this means adapters trained at r=16 with alpha=32 behave roughly like adapters trained at r=16 with alpha=16 with the learning rate doubled. The scaling and the learning rate interact, which is why most practitioners use the heuristic alpha = 2 × r and then tune the learning rate separately.
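
The whole mechanism fits in a few lines of PyTorch. The sketch below is illustrative, not PEFT's actual implementation (which adds dropout, merged-weight paths, and more), but it shows the three ingredients: the frozen base weight, the zero-initialized B, and the alpha/r scaling.

import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper around a frozen nn.Linear (illustrative only)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # W stays frozen
        # A: Gaussian init, B: zeros -> BA = 0 at the start of training
        self.A = nn.Parameter(torch.randn(r, base.in_features) / math.sqrt(r))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = Wx + (alpha/r) * B(Ax)
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

At initialization the forward pass is exactly the base layer's output; gradients flow only into A and B.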

What QLoRA Adds

The Dettmers et al. 2023 QLoRA paper adds three components on top of LoRA, all implemented in the bitsandbytes library:

4-bit NormalFloat (NF4) quantization. The frozen base model weights are loaded and stored in NF4 format, which is an information-theoretically optimal quantization for normally distributed data (weights of pretrained LLMs are approximately normally distributed). This cuts the memory footprint of the frozen model by roughly 4x compared to bfloat16. The LoRA adapters (B and A matrices) are still stored and computed in bfloat16.

Double quantization. The quantization constants themselves are quantized again, saving roughly 0.37 bits per parameter on average. Small additional savings, but meaningful at scale.

Paged optimizers. bitsandbytes uses NVIDIA unified memory to page optimizer states between GPU and CPU RAM when there are memory spikes during the backward pass. This prevents out-of-memory crashes during gradient accumulation steps without requiring you to reduce batch size so aggressively.
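
In transformers/TRL, the paged optimizer is a one-line switch via the optim field; "paged_adamw_8bit" and "paged_adamw_32bit" are the bitsandbytes-backed options. A minimal sketch:

from trl import SFTConfig

training_args = SFTConfig(
    output_dir="./outputs",
    optim="paged_adamw_8bit",   # bitsandbytes paged AdamW; "paged_adamw_32bit" trades memory for precision
)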

The practical VRAM implications are concrete:

Model          Full fine-tune   LoRA (bf16)    QLoRA (4-bit)   Viable hardware (QLoRA)
Mistral 7B     ~60 GB           ~28 GB         ~8–10 GB        RTX 3090, RTX 4090
Llama 3 8B     ~64 GB           ~30 GB         ~10–12 GB       RTX 3090, A10G
Llama 3 70B    >400 GB          ~140 GB        ~40–48 GB       A100 80GB (single)
Llama 3 405B   Not feasible     Not feasible   ~200–220 GB     4× A100 80GB

The VRAM numbers above assume gradient checkpointing is enabled and batch size is small (1–4). Optimizer states for the LoRA adapters (bfloat16 Adam) add a modest overhead — typically 2–4 GB for a 7B model with r=16 — since you only need optimizer states for the adapter parameters, not for the frozen base.

Lesson learned

On a Llama 3 8B QLoRA run with batch size 4, gradient accumulation 4, and sequence length 2048, we saw peak VRAM usage spike to 14–15 GB on specific batches with long sequences — well above the 10–12 GB baseline estimate. Always leave 15–20% headroom above your estimated peak. On a 24 GB GPU, this is fine; on a 12 GB GPU, you will hit OOM. Add max_seq_length padding or reduce gradient accumulation steps if this happens.
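
A cheap way to know your actual peak, rather than the estimate, is to ask PyTorch directly after a short trial run. A sketch, assuming the trainer from the code section below:

import torch

torch.cuda.reset_peak_memory_stats()
trainer.train()                                      # or a short trial run on a data subset
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_gb:.1f} GB")                # compare against GPU capacity minus 15-20% headroom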

Hyperparameter Selection: Stop Guessing

Most tutorials present LoRA hyperparameters as knobs to tune without giving you a principled starting point. Here is an opinionated set of defaults that work across the majority of domain adaptation tasks, followed by guidance on when to deviate.

LoRA rank (r)

The rank controls the expressiveness of the adapter — how many independent directions the adapter can represent. Higher rank = more capacity = more parameters = slower training and higher VRAM.

  • r=8: use for narrow tasks (classification reformulation, output format adaptation, tone/style changes). Fewer parameters, faster training, lower overfitting risk on small datasets.
  • r=16: the sane default for most domain adaptation tasks. Good balance between capacity and efficiency for datasets of 5K–100K samples.
  • r=32: use when r=16 consistently shows underfitting on eval (validation loss plateaus above training loss early). Typical for complex reasoning tasks or heavy technical jargon.
  • r=64: rarely necessary. Mainly useful for long-context instruction following or tasks that require learning genuinely new factual associations (though even here, RAG often outperforms).

The QLoRA paper demonstrated that when LoRA is applied to all linear layers (not just attention), the rank has surprisingly little impact on final quality above r=8. Target module coverage matters more than rank size. Doubling r while keeping target_modules on attention only is less effective than keeping r=16 and adding MLP projections.

LoRA alpha (lora_alpha)

Alpha is the scaling factor applied to the adapter output: the contribution of the adapter is scaled by alpha/r. The practical implication is that alpha and learning rate interact multiplicatively.

Heuristic: alpha = 2 × r. So for r=16, use alpha=32. This effectively doubles the learning rate for the adapter relative to the baseline, which empirically works well across most tasks. You then tune the actual learning rate separately rather than using alpha as a secondary learning rate dial — which is what most engineers accidentally do when they set alpha equal to r.

An alpha lower than r (e.g., r=16, alpha=8) dampens the adapter's effect. An alpha much higher than r (e.g., r=16, alpha=128) can cause instability. Unless you have a specific reason to deviate, alpha = 2r is where to start.

Target modules

Target modules specify which linear layers receive LoRA adapters. This is the most impactful structural decision in the config.

  • Attention only (q_proj, v_proj): the minimal configuration from the original LoRA paper. Fewer parameters, faster. Often sufficient for style adaptation.
  • Full attention (q_proj, k_proj, v_proj, o_proj): covers all four attention projections. Better for tasks that require changing how the model attends to context.
  • All linear layers (+ gate_proj, up_proj, down_proj in MLP): this is what the QLoRA paper recommends and what matches full fine-tuning quality most closely. For Llama-family models, this is q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj. Use target_modules="all-linear" in PEFT to select all linear layers automatically.

In practice: for tasks where you care about factual accuracy or complex reasoning in a new domain, target all linear layers. The parameter overhead is modest at r=16 and the quality difference is real.
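
If you are unsure what the linear layers in your model are called (names vary across architectures), you can enumerate them directly. A sketch, assuming a model already loaded as in the code section below:

import torch.nn as nn

# Collect the distinct leaf names of all linear modules in the loaded model
linear_names = {name.split(".")[-1] for name, module in model.named_modules()
                if isinstance(module, nn.Linear)}
print(sorted(linear_names))
# On Llama-family models this typically prints: down_proj, gate_proj, k_proj,
# lm_head, o_proj, q_proj, up_proj, v_proj  (leave lm_head out of target_modules)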

Learning rate

The stable range for LoRA/QLoRA is 1e-4 to 3e-4 with a cosine schedule and warmup. For most tasks, 2e-4 is a reliable default. Go lower (5e-5 to 1e-4) if you observe training instability or if your dataset is small (<2K samples). Go higher (3e-4 to 5e-4) only with large datasets and if you have confirmed the model is underfitting.

Use a cosine learning rate schedule with 3–5% warmup steps. Linear decay also works but cosine tends to recover better if there are rough patches in the loss curve mid-training.

The sane defaults table

Hyperparameter   Default       When to go lower                         When to go higher
r (rank)         16            Style/format tasks, <2K samples          Complex reasoning, val loss plateau
lora_alpha       32 (= 2r)     Training instability                     Rarely; tune LR instead
lora_dropout     0.05          Large datasets (>100K), no overfitting   Small datasets, high overfitting
target_modules   all-linear    Extreme memory constraints               N/A (already maximum)
Learning rate    2e-4          <2K samples, instability                 Large dataset + confirmed underfit
Epochs           2–3           Large datasets (>50K)                    Very small datasets (<1K)
LR schedule      cosine        N/A                                      N/A

Opinionated take: for most domain adaptation tasks, r=16, alpha=32, target_modules="all-linear", learning rate 2e-4, 2–3 epochs is a sane default. Start there before tweaking. You will spend more time debugging your dataset than tuning these parameters.

The Code: LoraConfig, BitsAndBytesConfig, SFTTrainer

Here is the canonical setup using PEFT and TRL. This is what a typical QLoRA fine-tuning script looks like before any task-specific customization.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from trl import SFTTrainer, SFTConfig
import torch

# ── QLoRA: load base model in 4-bit NF4 ──────────────────────────────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 — optimal for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,     # double quantization for ~0.37 bits/param savings
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires flash-attn installed
)
model.config.use_cache = False               # disable KV cache during training

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"             # avoid issues with left-padding on causal LMs

# Prepare the quantized model for k-bit training: upcasts layer norms and enables
# input gradients so gradient checkpointing works with the frozen 4-bit base
model = prepare_model_for_kbit_training(model)

# ── LoRA adapter configuration ────────────────────────────────────────────────
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                           # alpha = 2 * r heuristic
    target_modules="all-linear",             # covers q/k/v/o + MLP gate/up/down projections
    lora_dropout=0.05,
    bias="none",                             # "none" is standard; "all" rarely helps
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected output: trainable params: ~40M || all params: ~8.03B || trainable%: ~0.5%

# ── Training configuration ────────────────────────────────────────────────────
training_args = SFTConfig(
    output_dir="./outputs/llama3-8b-lora-v1",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,          # effective batch size = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    bf16=True,
    tf32=True,
    max_seq_length=2048,
    packing=True,                           # pack multiple short samples into one sequence
    gradient_checkpointing=True,
    report_to="wandb",
    dataloader_num_workers=4,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

trainer.train()

A few notes on the config above. Setting packing=True packs multiple short training examples into a single sequence of max_seq_length tokens, maximizing GPU utilization. If your average sample is 200 tokens and your max_seq_length is 2048, you get roughly 10x more useful compute per forward pass. The trade-off is that packing can cause cross-contamination between samples if your chat template uses attention masks improperly — TRL handles this correctly in recent versions, but verify by inspecting packed batch examples.
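
A quick way to do that verification: pull one batch from the trainer's dataloader and decode it, checking that sample boundaries carry the right special tokens. A sketch, assuming the trainer above:

# Decode one packed sequence and eyeball the boundaries between samples
batch = next(iter(trainer.get_train_dataloader()))
print(tokenizer.decode(batch["input_ids"][0]))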

The bias="none" setting is standard. Setting it to "all" (trains bias terms as well) shows negligible improvement in practice and adds minor overhead. Skip it.

Dataset Preparation and Chat Templates

The dataset is where most fine-tuning projects fail quietly. You can have a perfect hyperparameter config and still produce a useless model if the data is wrong.

Instruction format and chat templates

Modern instruction-tuned models (Llama 3 Instruct, Mistral Instruct, Qwen, Phi) expect inputs formatted with a specific chat template. Training on raw text without applying the template will teach the model to generate in the wrong format, and inference will be broken unless you compensate — which creates a fragile dependency between your data preprocessing and your inference code.

Always apply the tokenizer's chat template during dataset preparation:

from datasets import Dataset

# Raw data: list of conversation turns
raw_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding assistant."},
            {"role": "user", "content": "What ICD-10 code covers type 2 diabetes with CKD stage 3?"},
            {"role": "assistant", "content": "The correct code is E11.22 (Type 2 diabetes mellitus with diabetic CKD stage 3a or 3b)."},
        ]
    },
    # ... more examples
]

def apply_chat_template(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,  # False for training — we include the assistant response
    )
    return {"text": text}

dataset = Dataset.from_list(raw_examples)
dataset = dataset.map(apply_chat_template)

# For SFTTrainer, pass dataset_text_field="text"
# TRL will handle tokenization internally

Dataset size and quality

There is a widespread misconception that fine-tuning requires large datasets. For domain adaptation — teaching the model a new format, a specific vocabulary, a domain-specific reasoning pattern — quality matters far more than quantity. 500–2,000 high-quality, diverse examples can produce measurable improvement on narrow tasks. 5,000–20,000 well-curated examples is a strong dataset for most enterprise domain adaptation use cases.

What "high quality" means in practice:

  • Diversity: examples cover the full range of input variation the model will see at inference. A dataset of 10,000 paraphrased versions of the same query teaches nothing new after the first 100.
  • Correct answers: every training example should be something you would be comfortable the model generating in production. Noisy labels are worse than fewer examples.
  • Representative length distribution: if your production queries are 50–200 tokens but your training data is all 1,000-token examples, the model will overfit to the length distribution of training data.
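
The length check in particular is cheap to automate. A sketch using the tokenizer and the formatted dataset from the snippet above:

import numpy as np

# Token-length distribution of training samples vs. what you expect in production
lengths = [len(tokenizer(ex["text"])["input_ids"]) for ex in dataset]
print(f"p50={np.percentile(lengths, 50):.0f}  "
      f"p90={np.percentile(lengths, 90):.0f}  max={max(lengths)}")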

Eval split and leakage

Split your dataset before any processing. Never apply deduplication, filtering, or augmentation to the combined dataset and then split — this creates eval set leakage where transformed versions of training examples end up in the eval set.

A simple temporal or hash-based split is usually sufficient. Keep 5–10% for evaluation (but no fewer than 200 examples — smaller eval sets make it impossible to detect meaningful regressions). Keep a separate held-out test set that is never used during training or hyperparameter selection.
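
A hash-based split is a few lines and is deterministic across reruns. A minimal sketch; the field hashed here (the first user message) is an assumption, and you should key on whatever is stable per example in your data:

import hashlib

def assign_split(example, eval_pct: int = 10):
    # Deterministic split BEFORE any dedup/filtering/augmentation
    user_turn = next(m["content"] for m in example["messages"] if m["role"] == "user")
    bucket = int(hashlib.sha256(user_turn.encode()).hexdigest(), 16) % 100
    return {"split": "eval" if bucket < eval_pct else "train"}

dataset = dataset.map(assign_split)
train_dataset = dataset.filter(lambda ex: ex["split"] == "train")
eval_dataset = dataset.filter(lambda ex: ex["split"] == "eval")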

Lesson learned

On a medical coding fine-tune, we generated a synthetic dataset by prompting GPT-4o with code descriptions and having it produce question/answer pairs. The resulting eval perplexity looked great. In production, the model hallucinated codes that do not exist — because GPT-4o had introduced errors in roughly 8% of the generated answers, and our eval set was generated by the same process so it never caught them. Human review of at least 10–15% of generated training data is non-negotiable if you are using LLM-generated synthetic data.

Training Infrastructure and Cost

The hardware decision is downstream of the model size and whether you use LoRA or QLoRA.

Consumer GPU (RTX 3090/4090, 24 GB): QLoRA on Llama 3 8B or Mistral 7B. This is a fully viable setup for production fine-tuning of 7–8B models. Expect 1–3 hours for 5,000 samples at 3 epochs with packing. Running cost on your own hardware: electricity only.

Single A100 40 GB: LoRA in bf16 on 7–8B models, or QLoRA on 13–34B models. This is the workhorse setup for most fine-tuning projects. On RunPod, an A100 40 GB instance costs approximately $1.50–$2.00/hour. A 50K-sample 7B QLoRA run of 3 epochs takes roughly 4–6 hours: total cost $6–$12.

Single A100 80 GB: QLoRA on 70B models, or LoRA in bf16 on 13–34B. For 70B QLoRA: a 5K-sample run takes 2–4 hours at $2.50–$3.00/hour on RunPod or Lambda Labs — roughly $7–$12 per run. At 50K samples, budget $80–$200.

Multi-GPU (4× A100 40 GB or 4× A100 80 GB): Use DeepSpeed ZeRO-3 or FSDP for models that do not fit on a single card. Axolotl and TRL both support this natively. Necessary for 70B LoRA in bf16 without quantization. Cost: 4–6× the single-GPU cost, but typically 3–4× faster in wall-clock time due to data parallelism.

If you want to reduce iteration cost while doing hyperparameter search, run your experiments on a 7B model first and only scale to 70B once you have a stable config. Hyperparameter sensitivities transfer reasonably well across model sizes within the same model family.

Evaluating Your Fine-tune

Validation loss is a necessary condition for a good model but nowhere near sufficient. A model with a validation loss of 1.2 on your eval set can still be worthless in production if your eval set does not represent the actual inference distribution.

What to measure

Build a task-specific evaluation set with ground truth labels before you start training. If the task has structured outputs (JSON, code, ICD-10 codes, specific formats), write an exact-match or parsing-based evaluator. If the task requires semantic quality (summarization, reasoning), use an LLM-as-judge pipeline: prompt a strong model (GPT-4o, Claude 3.5 Sonnet) with your evaluation criteria and have it score responses from 1–5 on each dimension.

Track these metrics across checkpoints:

  • Task-specific accuracy: exact match, F1, ROUGE, or LLM-as-judge score depending on the task.
  • Format compliance: if the model needs to produce JSON or follow a specific response structure, measure how often the output is parseable and structurally correct (a minimal checker is sketched after this list).
  • General capability retention: run a small benchmark slice (50–100 examples from MMLU or a general instruction set) to confirm you have not destroyed general reasoning. This is the catastrophic forgetting canary.
  • Validation perplexity: useful for monitoring training health and detecting overfitting, but not a standalone quality metric.
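
The format-compliance metric can be as simple as a parse-rate counter. A sketch for JSON output; adapt the validation to your actual schema:

import json

def json_parse_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that parse as valid JSON."""
    parsed = 0
    for text in outputs:
        try:
            json.loads(text)
            parsed += 1
        except json.JSONDecodeError:
            pass
    return parsed / len(outputs) if outputs else 0.0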

When to stop

Stop training when validation loss stops decreasing for 2 consecutive epochs, or when your task-specific accuracy metric plateaus. Do not continue training hoping it will improve: once the model starts memorizing training data, general capability degradation accelerates. In the SFTConfig, set load_best_model_at_end=True with metric_for_best_model pointing at your task metric, and attach an EarlyStoppingCallback to the trainer, as sketched below.
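
A sketch of that wiring, extending the SFTConfig from the code section above. Using eval_loss as the best-model metric is an assumption; point it at your task metric if you log one during evaluation:

from transformers import EarlyStoppingCallback
from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./outputs/llama3-8b-lora-v1",
    num_train_epochs=10,                      # upper bound; early stopping ends the run sooner
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,              # restore the best checkpoint at the end
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)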

Lesson learned

We ran a fine-tune for a legal document classification task. After epoch 3, task accuracy was 91% on the eval set. After epoch 5, it was 93%. After epoch 8, it was 94% — and general capability (MMLU slice) had dropped from 71% to 58%. The additional 3 points of task accuracy cost 13 points of general reasoning. Whether that trade-off is worth it depends entirely on the deployment context. In this case it was not: the production use case required the model to handle edge cases that required general reasoning, and those cases were not in the eval set.

Serving Adapters vs. Merging into the Base Model

After training, you have two deployment options: serve the adapter alongside the frozen base model, or merge the adapter into the base model weights and serve a single model.

Merging

Merging produces a single standard model checkpoint that runs at full inference speed with no overhead. This is the right choice for most deployments where you have one adapter per use case.

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model in full precision for merging (not quantized)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load and merge the LoRA adapter
model = PeftModel.from_pretrained(base_model, "./outputs/llama3-8b-lora-v1/checkpoint-final")
model = model.merge_and_unload()

# Save the merged model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model.save_pretrained("./outputs/llama3-8b-lora-v1-merged")
tokenizer.save_pretrained("./outputs/llama3-8b-lora-v1-merged")

# The merged directory can now be loaded like any standard HuggingFace model
# and served with vLLM, TGI, or Ollama

One important caveat for QLoRA: you cannot merge a 4-bit quantized model directly. You need to load the base model in bfloat16 (or float32) for the merge step, then save. This means the merge step requires enough VRAM to hold the full bfloat16 model — plan accordingly or use a CPU offload approach for very large models.
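
For models too large for your GPU, the merge can run entirely on CPU: slow, but it only needs system RAM, not VRAM. A sketch with placeholder paths:

from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

# Merge on CPU: needs enough system RAM for the bf16 model, zero VRAM
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
model = PeftModel.from_pretrained(base_model, "./outputs/llama3-70b-lora-v1/checkpoint-final")
model = model.merge_and_unload()
model.save_pretrained("./outputs/llama3-70b-lora-v1-merged")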

Serving adapters separately with vLLM

vLLM supports dynamic LoRA adapter loading via the --enable-lora flag. This is the right architecture when you have multiple adapters for different tasks or user segments sharing a single base model deployment. A 7B base model at fp8 uses roughly 8 GB of VRAM; each LoRA adapter adds only a few hundred MB. You can serve 10–15 adapters on a single GPU and swap them per-request. For the broader serving stack — batching, autoscaling, GPU selection — see our deploying LLMs to production guide.

# Launch vLLM with LoRA support
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-lora \
  --max-lora-rank 64 \
  --lora-modules \
    medical-coding=./outputs/llama3-8b-lora-medical/checkpoint-final \
    legal-analysis=./outputs/llama3-8b-lora-legal/checkpoint-final \
  --port 8000

# Request with adapter selection via model parameter
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "medical-coding",
    "messages": [{"role": "user", "content": "What ICD-10 code applies to ..."}]
  }'

Common Failure Modes

Most LoRA fine-tuning failures fall into four categories. None of them are subtle once you know what to look for.

Catastrophic forgetting

The model improves on your task metric but degrades on general reasoning, instruction following, or safety. Causes: high learning rate, too many epochs, dataset that is too narrow (all examples of the same type), or a learning rate schedule without warmup.

Prevention: include a small number of general instruction-following examples in your training set (10–20% of total volume), use early stopping, monitor a general capability metric on every checkpoint. This is why eval on general tasks is not optional — it is your canary for forgetting.
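
Mixing in general data is a few lines with the datasets library. A sketch, assuming both sets are still in the messages format before the chat template is applied; HuggingFaceH4/no_robots is one example of a general instruction set, and the 15% ratio is an assumption to tune:

from datasets import concatenate_datasets, load_dataset

# Blend ~15% general instruction-following data into the domain training set
general = load_dataset("HuggingFaceH4/no_robots", split="train")
general = general.shuffle(seed=42).select(range(int(0.15 * len(train_dataset))))

# Keep only the shared "messages" column so the schemas line up
general = general.select_columns(["messages"])
domain = train_dataset.select_columns(["messages"])
mixed_dataset = concatenate_datasets([domain, general]).shuffle(seed=42)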

Dataset contamination and eval set leakage

This is more common than it should be. The two failure modes: training examples appear in the eval set (leakage), or both train and eval sets are generated from the same synthetic process with the same errors (contamination). Result: eval metrics look excellent, production quality is poor.

Fix: split before any processing, use a test set generated from a fundamentally different source than the training data, and manually review a sample of both training and evaluation examples.

Underfitting from insufficient target module coverage

The model trains without errors, validation loss decreases, but task accuracy is frustratingly low — below what a base model with a good system prompt achieves. The most common cause is applying LoRA only to q_proj and v_proj when the task requires adapting the model's internal computation, not just its attention patterns.

Fix: switch to target_modules="all-linear". Also check that your rank is not too low for the task complexity (try r=32 if r=16 shows persistent underfitting).

Chat template mismatch

Training proceeds normally but inference generates responses that do not follow the expected format, include unexpected special tokens, or repeat the input before generating the answer. Almost always caused by not applying the model's chat template during data preparation, or by a mismatch between the template applied during training and the one applied at inference.

Fix: always use tokenizer.apply_chat_template() on your data before passing it to SFTTrainer. Verify by printing 3–5 examples from the tokenized dataset and confirming they look exactly like what you would pass to the model at inference time.
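
That verification takes seconds; the lesson below is exactly this failure. A sketch over the formatted dataset from the data preparation section:

# Print a handful of formatted examples before launching the run
for example in dataset.select(range(3)):
    print(example["text"])
    print("=" * 80)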

Lesson learned

On a customer support fine-tune, the model started every response with the system prompt repeated verbatim. After two hours of debugging the model architecture, the root cause turned out to be in the data: the dataset had been formatted by plain string concatenation rather than through the tokenizer's chat template mechanism, so the model learned that the "correct" response begins by restating the system context. Printing the first 5 tokenized training examples before starting the run would have caught this in 30 seconds.

A Note on Alternatives and When LoRA Is the Wrong Tool

LoRA and QLoRA are excellent for domain adaptation and instruction tuning. They are the wrong tool in several scenarios that are worth being explicit about.

When the base model lacks the capability entirely. LoRA adapts existing capabilities — it does not add new ones from scratch. If the base model cannot perform a task at all (even with optimal prompting), fine-tuning is unlikely to fix it. You need a larger base model, more pretraining data, or a fundamentally different architecture.

When the knowledge changes frequently. Fine-tuning bakes knowledge into weights at training time. If your domain knowledge updates weekly (regulatory changes, product catalog, live documentation), a RAG system is better suited — it retrieves fresh information at inference time. The decision framework for choosing between fine-tuning, RAG, and prompting is covered in our companion article.

When you need guaranteed output structure. Fine-tuning reduces but does not eliminate format failures. For tasks that require 100% reliable structured output (JSON, SQL, specific schema), constrained decoding (outlines, guidance) or function calling with output validation is more robust than relying on fine-tuning alone.

If you are working on a use case that fits LoRA well — domain vocabulary, task-specific reasoning patterns, tone/style adaptation, instruction following on proprietary formats — the economics are compelling: $15–$200 for a run, 1–6 hours of compute, and a model that is genuinely specialized for your domain. The LLM integration work we do at Tensoria almost always involves a LoRA or QLoRA fine-tuning step alongside the broader system design.

Frequently Asked Questions

What is the difference between LoRA and QLoRA?

LoRA (Low-Rank Adaptation) freezes the base model weights and injects trainable low-rank matrices into selected linear layers, dramatically reducing the number of trainable parameters. QLoRA extends this by loading the frozen base model in 4-bit NormalFloat precision (NF4) using bitsandbytes, cutting VRAM requirements by roughly 4x compared to LoRA in full bfloat16. The adapters themselves are still trained in bfloat16. The trade-off: QLoRA is slower per step due to dequantization overhead, but makes training feasible on consumer GPUs or single-server setups that would otherwise require a multi-GPU cluster.

What rank (r) should I use?

Start at r=16 for domain adaptation tasks — it is the right balance between expressiveness and parameter count for most use cases. Use r=8 if you are memory-constrained or working on narrow style/format tasks. Use r=32 or r=64 when the task requires capturing complex domain-specific patterns or when r=16 consistently underperforms on eval. The QLoRA paper showed very little statistical difference between r=8 and r=256 when LoRA is applied to all linear layers, which means the key lever is target_modules coverage, not rank size.

How much VRAM do I need to fine-tune with QLoRA?

With QLoRA (4-bit base + bfloat16 adapters): Llama 3 8B fits in approximately 10–12 GB, making a single RTX 3090 or 4090 viable. Llama 3 70B requires approximately 40–48 GB, so a single A100 80GB. Mistral 7B fits in about 8–10 GB. With standard LoRA in bfloat16, multiply these figures by roughly 3.5–4x. QLoRA brings 7–8B model fine-tuning into the consumer GPU range and 70B into single-A100 territory, which changes the economics entirely.

What are the most common LoRA fine-tuning failure modes?

The most common failure modes are: (1) Catastrophic forgetting — the adapter overwrites general capabilities when trained too aggressively. Fix: lower learning rate, add general instruction examples to training data, use early stopping. (2) Dataset contamination — training data overlaps with or is generated by the same process as the eval set. Fix: split before any processing, use a held-out test set from a different source. (3) Underfitting from insufficient target module coverage — using only q_proj/v_proj when the task needs deeper adaptation. Fix: switch to target_modules="all-linear". (4) Chat template mismatch — training on raw text instead of the model's expected chat template format. Fix: always use tokenizer.apply_chat_template() during data preparation.

Should I merge the adapter into the base model or serve it separately?

Merge for simplicity and throughput: a merged model loads like any standard checkpoint and runs at full inference speed with no adapter overhead. Use merge_and_unload() from PEFT. Serve separately when you need multiple adapters on one base model — vLLM supports LoRA adapter hot-swapping with --enable-lora. The separate-serving approach saves GPU memory when you have 10+ adapters, since only the active adapter needs to be loaded alongside the single base model.

How long does a fine-tuning run take, and what does it cost?

On a single A100 40GB: a 7–8B model fine-tuned on 5,000 instruction samples for 3 epochs takes roughly 45 minutes to 1.5 hours. At 50,000 samples, expect 4–6 hours. A 70B model with QLoRA on the same 5,000 samples takes approximately 2–4 hours on a single A100 80GB. Cost on RunPod or Lambda Labs: approximately $15–50 for small runs (5K samples, 7B model), $80–200 for larger runs (50K samples, 70B model). These numbers assume packing and gradient checkpointing are enabled.

If your team is running into issues with a fine-tuning project — dataset quality questions, VRAM budgeting, eval methodology, or moving from a prototype to a production-grade training pipeline — our LLM integration service covers exactly this. We have run enough of these to know where the time is actually lost, and it is almost never in the hyperparameters.

Anas Rabhi, Data Scientist & Founder, Tensoria

I am a data scientist specializing in LLM fine-tuning, NLP, and generative AI. I build production-grade fine-tuning pipelines and custom AI systems that integrate into existing workflows — from dataset curation and training to evaluation and deployment. I have fine-tuned models ranging from 7B to 70B parameters for domain adaptation in legal, medical, and enterprise contexts.