I Trained an 8B HTS Classifier at a Coffee Shop with an H100 for $43
A postmortem of fine-tuning an 8B model for U.S. tariff classification on a single rented H100. Sixteen hours, $43, and eight lessons earned the hard way.
I rented an H100 at a coffee shop on a Sunday afternoon and shipped a working 8B-parameter classifier 16 hours later for $43. The model classifies products against the U.S. Harmonized Tariff Schedule (HTS), a regulated 17,000-code system that customs brokers use every day to determine import duties. Most published baselines for this task either used 16x A100 fine-tunes or relied on frontier models at API pricing. I wanted to see what one person with a single rented GPU and a dirty chai could do.
The short answer: a lot, if you account for the parts that go wrong.
This is a postmortem of that run. Here's what worked, what didn't, and what I'd do differently.
Why HTS classification
HTS is a hierarchical product classification system. Every item imported into the U.S. gets mapped to a 10-digit code: the first two digits are the chapter (textiles, vehicles, chemicals), the first four are the heading, the first six are the internationally harmonized subheading, digits seven and eight are the U.S. tariff-rate line, and the last two are the statistical reporting suffix. The hierarchy is strict, the code list is updated annually, and U.S. Customs and Border Protection publishes binding rulings (CROSS rulings) that resolve ambiguous cases.
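To make the hierarchy concrete, here is a minimal sketch that splits a 10-digit code into its nested levels. The `parse_hts` helper and `HTSCode` type are hypothetical names for illustration (they are not part of the pipeline described here), and the sample code is a dummy string, not a real classification.

```python
from typing import NamedTuple


class HTSCode(NamedTuple):
    chapter: str     # first 2 digits
    heading: str     # first 4 digits
    subheading: str  # first 6 digits
    us_suffix: str   # last 4 digits, which are U.S.-specific
    full: str        # all 10 digits with punctuation removed


def parse_hts(code: str) -> HTSCode:
    """Split a 10-digit HTS code such as '1234.56.78.90' into its levels."""
    digits = "".join(ch for ch in code if ch.isdigit())
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}: {code!r}")
    return HTSCode(digits[:2], digits[:4], digits[:6], digits[6:], digits)


# Dummy code used purely to show the slicing.
print(parse_hts("1234.56.78.90"))
```

Because each level is a prefix of the next, the same slicing would let you score partial credit (chapter-level or heading-level matches) alongside exact match.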
A few things make it a good benchmark for a domain fine-tune:
- Strict structured output. You need code, reasoning, and grounding in the actual rules.
- Real ground truth. Decades of CROSS rulings are public.
- Published baselines exist. ATLAS reached 40% exact-match accuracy with a Llama-3.3-70B full fine-tune on 16x A100s. GPT-5-Thinking zero-shot lands around 25%. Commercial tools like Tarifflo hit 89% with a custom retrieval pipeline.
- Misclassification has consequences. Importers pay duties based on the code, and getting it wrong creates customs disputes.
So: a real task, a real yardstick, real difficulty.
The stack
The choices, briefly justified.
Base model: Llama-3.1-Nemotron-Nano-8B-v1. Nvidia's Nemotron variant of Llama 3.1. 8B is the sweet spot for single-GPU work; the Nemotron tuning includes a "detailed thinking off" mode that lets you skip chain-of-thought and emit direct structured output. For a classification task where I want the answer fast and the reasoning short, that matters.
LoRA, not full fine-tune. Rank 32, alpha 64, dropout 0.05, applied to all seven projection layers (q, k, v, o, gate, up, down). That's 83.9M trainable parameters out of 8.1B total, or 1.03% of the base. A full fine-tune would have needed multiple GPUs and at least 10x the budget. LoRA on a single H100 is the right call.
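The trainable-parameter figure can be reproduced by hand. The sketch below assumes the standard Llama 3.1 8B dimensions (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers); those dimensions and the `lora_params` helper are my additions, not something taken from the training code.

```python
def lora_params(rank: int, shapes: dict[str, tuple[int, int]], n_layers: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) for each adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


# Assumed Llama 3.1 8B dimensions (not read from the actual checkpoint).
HIDDEN, KV, FFN, LAYERS = 4096, 1024, 14336, 32

SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}

trainable = lora_params(rank=32, shapes=SHAPES, n_layers=LAYERS)
print(f"{trainable:,}")  # 83,886,080
```

Most of that budget sits in the three MLP projections, so dropping them from the target modules would roughly cut the adapter to a third of its size.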
4-bit NF4 quantization for the base. Bitsandbytes, double quantization, bf16 compute dtype. The base model loads in roughly 5GB instead of 16GB, leaving plenty of room for activations and optimizer state.
Completion-only loss. Loss is masked over the system and user tokens. The model only learns to produce the assistant response. This matters more than people think; without it, you waste capacity teaching the model to predict its own prompts.
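As a rough illustration of the masking, the sketch below sets every prompt position in the labels to -100, the value PyTorch's cross-entropy loss ignores by default. The `mask_prompt_labels` helper and the token ids are made up for the example; the real pipeline presumably does this inside its data collator.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids into labels, masking the first prompt_len positions."""
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len out of range")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Made-up ids: 4 system/user tokens followed by 3 assistant tokens.
input_ids = [11, 12, 13, 14, 21, 22, 23]
labels = mask_prompt_labels(input_ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 21, 22, 23]
```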
Data pipeline: 119,602 train + 14,784 valid + 14,952 test examples. Built from CROSS rulings via a six-stage pipeline (ingest, normalize, build, split, format, audit). Each example is a single-turn conversation: system prompt, user message with product description, assistant message with the structured classification.
GPU: 1x H100 SXM 80GB on RunPod, $2.69/hour. No reservations, no commits. Spun it up when I needed it.
The setup gauntlet
The first two hours weren't training. They were debugging. Every minute was on the meter.
Flash Attention 2 doesn't want to install
After installing flash-attn==2.8.3, the import failed:
undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
The wheel was compiled against a different PyTorch ABI than the one on the pod. Tried a different flash-attn version. Same error. Tried a third. Same error.
Gave up. Switched to SDPA (PyTorch's built-in scaled dot-product attention). It's the default in modern transformers, has zero dependency hell, and is roughly 10–20% slower than Flash Attention on a single GPU. For an 8B model on one H100, that tradeoff is fine.
Lesson. Don't chase Flash Attention 2 for single-GPU 8B work. SDPA is the right default.
Disk space crisis
The pod had a 20GB root filesystem and a 200GB workspace volume. HuggingFace caches models to ~/.cache/huggingface, which on this pod meant /root/.cache on the small disk. After downloading the 16GB Nemotron base, root was at 94%.
Fix, three commands:
mkdir -p /workspace/.cache
mv /root/.cache/huggingface /workspace/.cache/huggingface
ln -s /workspace/.cache/huggingface /root/.cache/huggingface
Root dropped from 94% to 5%.
Lesson. On any rented GPU with separate root and data volumes, move the HF cache to the data volume on day 1. Add these three commands to your pod-setup script. Don't wait until you're out of space.
OOM on first launch
With training launched at batch size 16, the GPU OOMed almost immediately on the cross-entropy step:
CUDA out of memory. Tried to allocate 4.69 GiB.
GPU 0 has a total capacity of 79.18 GiB of which 1.73 GiB is free.
The math, once I sat down with it, is brutal. Llama 3 has a 128,256-token vocabulary. The logits tensor during loss computation is [batch, seq, vocab]:
16 (batch) x 1536 (seq) x 128,256 (vocab) x 4 bytes (fp32) = 12.6 GB
Twelve gigabytes just for the logits, before adding model weights, optimizer state, activations, and gradient checkpointing buffers. The big Llama 3 vocab is the hidden tax of fine-tuning these models.
Reduced max sequence length from 2048 to 1536. That dropped logits from 16.8GB to 12.6GB, and the run fit. Kept batch size 16 with grad accumulation of 2 for an effective batch of 32.
Lesson. For any Llama 3 derivative, do the logits math before launching. batch * seq * 128256 * 4 bytes is a large and easy-to-miss chunk of your memory budget.
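The logits arithmetic is easy to script before a launch. This is a minimal sketch; `logits_bytes` is a hypothetical helper, and it counts only the fp32 logits tensor, not weights, optimizer state, or activations.

```python
def logits_bytes(batch: int, seq_len: int, vocab: int = 128_256, dtype_bytes: int = 4) -> int:
    """Size of a [batch, seq, vocab] logits tensor in bytes."""
    return batch * seq_len * vocab * dtype_bytes


for seq_len in (2048, 1536):
    gb = logits_bytes(batch=16, seq_len=seq_len) / 1e9
    print(f"seq={seq_len}: {gb:.1f} GB")
# seq=2048: 16.8 GB
# seq=1536: 12.6 GB
```

Halving the per-device batch and doubling gradient accumulation would shrink this tensor by the same factor while keeping the effective batch unchanged.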
Smoke test typos (mine)
After all of that, I had two trivial bugs in my smoke test script. Wrote total_mem instead of total_memory. Tried to unpack model, tokenizer = load_base_model(…) when the function returns only the model.
Fixed both by reading the actual function signatures instead of guessing.
Lesson. Read the function signature, even for "obvious" APIs. It's the cheapest debugging you'll ever do.
After the gauntlet, the smoke test finally produced:
Loaded base model: nvidia/Llama-3.1-Nemotron-Nano-8B-v1
LoRA applied: 83,886,080 trainable / 8,114,147,328 total (1.03%)
Time to train.
The actual run
Once it launched, the run itself was uneventful. That's the point.
The interesting work happens at the edges. The middle is just patience.
The launch config that actually shipped:
model: nvidia/Llama-3.1-Nemotron-Nano-8B-v1
torch_dtype: bfloat16
attn_implementation: sdpa
load_in_4bit: true
bnb_4bit_quant_type: nf4
lora:
  r: 32
  lora_alpha: 64
  lora_dropout: 0.05
  target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
training:
  num_train_epochs: 3
  per_device_train_batch_size: 16
  gradient_accumulation_steps: 2  # effective batch = 32
  learning_rate: 2.0e-4
  warmup_ratio: 0.05
  lr_scheduler_type: cosine
  max_seq_length: 1536
  bf16: true
  gradient_checkpointing: true
  optim: paged_adamw_8bit
Launched with three environment-level details that mattered:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
PYTHONUNBUFFERED=1 \
nohup python scripts/run_train.py --config configs/train_h100.yaml \
> train.log 2>&1 &
The expandable_segments flag reduces the memory fragmentation that shows up a few thousand steps in. PYTHONUNBUFFERED=1 makes the training log readable in real time (otherwise Python block-buffers stdout when it is redirected to a file, so log lines arrive in delayed bursts). nohup … & means the process survives if my SSH connection dies, which matters for a 14-hour run from a coffee shop with flaky wifi.
I also wrote a short monitor script that logged status every 30 minutes:
#!/bin/bash
while true; do
  echo "=== $(date '+%H:%M:%S') ==="
  grep -oP "'loss': [0-9.]+" /workspace/train.log | tail -1
  tail -1 /workspace/train.log | grep -oP '\d+/11214'
  nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | awk '{print "VRAM: " $1 " MiB"}'
  echo ''
  sleep 1800
done
This let me check progress from any SSH session with tail monitor.log without interrupting training. Useful when you're moving between coffee shops and your laptop occasionally sleeps.
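If you would rather do the log scraping in Python, the same two patterns translate directly. This is a sketch under the assumption that the log contains Trainer-style dict lines with a 'loss' key and tqdm-style "step/total" counters, which is what the grep patterns in the bash script imply; `latest_status` is a hypothetical helper and the sample log text is invented.

```python
import re

LOSS_RE = re.compile(r"'loss': ([0-9.]+)")  # matches 'loss' but not 'eval_loss'
STEP_RE = re.compile(r"(\d+)/(\d+)")        # tqdm-style "current/total" counter


def latest_status(log_text: str):
    """Return (last training loss, (step, total)) found in the log, or None for each."""
    losses = LOSS_RE.findall(log_text)
    steps = STEP_RE.findall(log_text)
    loss = float(losses[-1]) if losses else None
    step = tuple(int(x) for x in steps[-1]) if steps else None
    return loss, step


# Invented sample in the assumed format.
sample = (
    "{'loss': 0.3301, 'epoch': 1.07}\n"
    "{'eval_loss': 0.279, 'epoch': 1.34}\n"
    "{'loss': 0.2113, 'epoch': 1.34}\n"
    " 45%|####      | 5000/11214 [5:20:11<6:38:22]\n"
)
print(latest_status(sample))  # (0.2113, (5000, 11214))
```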
The loss trajectory:
| Step | Epoch | Train loss | Eval loss |
|---|---|---|---|
| 1000 | 0.27 | 1.05 | 0.371 |
| 2000 | 0.54 | 0.66 | 0.320 |
| 3000 | 0.80 | 0.49 | 0.298 |
| 4000 | 1.07 | 0.33 | 0.289 |
| 5000 | 1.34 | 0.21 | 0.279 |
| 6000 | 1.61 | 0.21 | 0.268 |
| 7000 | 1.87 | 0.20 | 0.260 (best) |
| 8000 | 2.14 | 0.21 | 0.277 |
| 9000 | 2.41 | 0.14 | 0.274 |
| 10000 | 2.68 | 0.13 | 0.274 |
| 11000 | 2.94 | 0.13 | 0.273 |
The shape tells the story. Training loss kept falling. Eval loss bottomed at step 7000 and crept up afterward. Classic mild overfitting in the back half of the run.
Total wall time: 13 hours 44 minutes. 11,214 steps. ~3.85 seconds per step. Eleven full evaluations on the validation set, each taking about 8.5 minutes.
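The step count follows from the dataset size and the batch settings, which makes it a useful sanity check on the config. A small sketch follows; the helper name is hypothetical, and the rounding mirrors how I understand the Trainer to count optimizer steps, so treat that detail as an assumption.

```python
import math


def total_optimizer_steps(n_examples: int, per_device_batch: int, grad_accum: int, epochs: int):
    """Return (optimizer steps per epoch, total optimizer steps)."""
    batches_per_epoch = math.ceil(n_examples / per_device_batch)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total = total_optimizer_steps(119_602, 16, 2, 3)
print(steps_per_epoch, total)            # 3738 11214
print(round(7000 / steps_per_epoch, 2))  # 1.87, the epoch of the best eval loss
```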
What I got wrong
This is the part that matters. Eight lessons, each one earned the hard way.
1. save_total_limit=3 cost me the best checkpoint
With save_total_limit=3, the Trainer keeps only the three most recent checkpoints. (That is not the library default; left unset, it keeps all of them.) By the end of the run, only checkpoints 10000, 11000, and 11214 existed. The actual best checkpoint, step 7000 with eval loss 0.260, had been silently deleted around step 10000 when the rotation kicked in.
The version on HuggingFace is the final-step adapter. Its eval loss is roughly 5% worse than the step-7000 model I briefly had. Not catastrophic, but a real loss, and totally avoidable.
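The rotation is simple enough to simulate, which would have shown the problem before launch. The sketch below assumes a checkpoint every 1,000 steps plus one at the final step, which matches the surviving checkpoints; `simulate_rotation` is a hypothetical helper that models plain oldest-first deletion, not the Trainer's actual code.

```python
def simulate_rotation(save_steps: list[int], limit: int):
    """Keep only the `limit` newest checkpoints; record when each old one is deleted."""
    kept: list[int] = []
    deleted_when: dict[int, int] = {}
    for step in save_steps:
        kept.append(step)
        while len(kept) > limit:
            oldest = kept.pop(0)
            deleted_when[oldest] = step
    return kept, deleted_when


# Assumed save schedule: every 1,000 steps, plus the final step.
save_steps = list(range(1000, 11001, 1000)) + [11214]
kept, deleted_when = simulate_rotation(save_steps, limit=3)
print(kept)                # [10000, 11000, 11214]
print(deleted_when[7000])  # 10000
```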
Fix for next time, four lines:
save_total_limit: 5
load_best_model_at_end: true
metric_for_best_model: eval_loss
greater_is_better: false
These tell the Trainer to keep at most five checkpoints, always including the all-time best, and to load the best one into the model object at the end of training. (load_best_model_at_end requires the save and eval schedules to line up.) The "final" adapter saved to disk becomes the best one, not the last one.
2. Two epochs would have beaten three
Eval loss bottomed at epoch 1.87. Everything after that, essentially the whole third epoch, was memorization, not generalization. The model was learning specific phrasings from specific training examples while losing the general signal.
Next run: num_train_epochs: 2. Saves ~30% of wall time and ~$10 of GPU rental and produces a better model. The intuition that "more epochs is more learning" is wrong past a certain point, and the eval loss curve will tell you exactly where that point is.
3. The 128k vocab is a memory tax
Llama 3 family models have a 128,256-token vocabulary. That makes the logits tensor enormous during loss computation. Plan for it explicitly in your memory budget.
The math, again, because it's important:
batch * seq * vocab * dtype_bytes
16 * 1536 * 128256 * 4 = 12.6 GB
That's the logits alone, before any of the rest of training memory. If you're moving to Mistral (32k or 131k vocab depending on version) or Qwen, this will hit differently. For any Llama 3 derivative, do this math before you launch.
4. Pin your training stack
transformers==4.46.3 peft==0.13.2 accelerate==0.34.2 is a known-good triple for this work. These are 4–6 months old. They're battle-tested.
The latest transformers release typically lags the latest torch release by a week or two in supporting new torch features. The lag is real and predictable. Pin your versions. Don't let the dependency resolver pull latest on you.
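A pin is only useful if the pod actually matches it, so a startup check can help. This is a minimal sketch using the standard library's importlib.metadata; the `find_mismatches` helper is hypothetical, and the pinned versions are the ones listed above.

```python
from importlib import metadata

PINS = {"transformers": "4.46.3", "peft": "0.13.2", "accelerate": "0.34.2"}


def find_mismatches(pins: dict[str, str], get_version=metadata.version) -> dict:
    """Map package name -> (wanted, installed) for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in find_mismatches(PINS).items():
    print(f"{name}: want {wanted}, found {installed}")
```

Running this at the top of the training script, and aborting on any mismatch, turns a silent version drift into an immediate failure.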
5. Move the HF cache on day 1
The disk space crisis above. Three commands at pod setup time save you a panic 30 minutes into your first training run.
mkdir -p /workspace/.cache
mv /root/.cache/huggingface /workspace/.cache/huggingface 2>/dev/null
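ln -s /workspace/.cache/huggingface /root/.cache/huggingface
If /root/.cache/huggingface doesn't exist yet, the mv is a no-op. In that case, make sure /root/.cache exists and create the target directory (mkdir -p /workspace/.cache/huggingface) before making the symlink, so the link isn't dangling and the next HF download lands in the right place.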
6. SDPA is fine
I spent 30 minutes trying to make Flash Attention 2 work. The wheels mismatched the PyTorch ABI. After three failed attempts I switched to SDPA and never looked back.
For single-GPU 8B training, SDPA is the right default. It's roughly 10–20% slower than Flash Attention 2 on H100, has no dependency hell, and is actively maintained as the default attention path in transformers. If you're memory-bound, multi-GPU, or working from a base image where Flash Attention is pre-installed and verified working, then chase it. Otherwise, set attn_implementation: "sdpa" and move on.
7. Verify inference before publishing
After training completed, I built a quick inference test and got garbled output. Random tokens, no structure.
Diagnosis: I'd built the prompt with a generic system message, but the training data uses a Nemotron-specific format.
Pulled the actual training data and discovered two non-obvious quirks:
- System prompt starts with "detailed thinking off\n\n" (Nemotron's switch for skipping chain-of-thought)
- Assistant content starts with "<think>\n</think>\n\n" (an empty thinking block telling the model "I've thought, here's the answer")
Without these prefixes, the model produces garbage because it never saw a prompt without them during training.
Rebuilt the test with the correct format:
messages = [
{"role": "system", "content": "detailed thinking off\n\nYou are an expert in the U.S. Harmonized Tariff Schedule..."},
{"role": "user", "content": "Product: insulated copper electrical wire, 12 AWG\nUse: residential wiring"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\n</think>\n\n"
Output was structured and parseable. The general principle: the inference prompt must match the training prompt format byte-for-byte. The safest way to enforce that is to build prompts with the same apply_chat_template call and the same message structure at both training and inference time, behind a helper function that hides the model-specific quirks.
I almost shipped to HuggingFace without running this test. The model card would have told users the wrong prompt format. Anyone who downloaded it would have gotten garbage and concluded the model didn't work.
Mandatory checklist item: smoke-test inference before publishing. Always.
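A helper of the kind described above might look like the following sketch. The function names are hypothetical; the two prefixes are the ones found in the training data, and `tokenizer` is assumed to be any object exposing the Hugging Face `apply_chat_template` method.

```python
THINKING_OFF = "detailed thinking off\n\n"  # Nemotron switch, prepended to the system prompt
EMPTY_THINK = "<think>\n</think>\n\n"       # empty thinking block that opens the assistant turn


def build_messages(system_text: str, user_text: str) -> list[dict]:
    """Message list shared by training-data formatting and inference."""
    return [
        {"role": "system", "content": THINKING_OFF + system_text},
        {"role": "user", "content": user_text},
    ]


def build_inference_prompt(tokenizer, system_text: str, user_text: str) -> str:
    """Render the chat template, then append the empty thinking block."""
    prompt = tokenizer.apply_chat_template(
        build_messages(system_text, user_text),
        tokenize=False,
        add_generation_prompt=True,
    )
    return prompt + EMPTY_THINK
```

Keeping both quirks in one module means the training formatter and the inference script cannot drift apart without someone editing the same two constants.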
8. Pull configs from artifacts, not memory
For most of the training run, I would have told you the config was batch size 8 with grad accumulation 4 and max sequence length 2048. The actual config (per train.log line 6 and config_snapshot.yaml) was batch size 16 with grad accumulation 2 and max sequence length 1536.
I'd lost track during the OOM debugging cycle. Multiple knobs changed at once, and I'd remembered the wrong combination. The numbers were functionally equivalent in terms of effective batch, but the max sequence length difference is real, and I would have published the wrong values in the model card if I hadn't pulled the actual config files at the end.
The fix is mechanical: when reporting on a run, read the actual artifacts. The source of truth is config_snapshot.yaml on disk, not your recollection of what you typed an hour ago.
Cost breakdown
| Item | Cost |
|---|---|
| H100 SXM 80GB on RunPod, ~16 hrs | $43 |
| Dirty chai | $8 |
| Total | $51 |
The model is at mfbaig35r/hts-nemotron-8b-lora-v1 on HuggingFace.
If I'd known then what I know now (two epochs instead of three, correct checkpoint config, no Flash Attention dead end), it would have been closer to $30. The lessons-learned section is, in dollar terms, about $13 of tuition.
What's next
Two follow-on threads.
The first is a DPO/RLAIF second stage targeting six specific failure clusters I identified in the v1 outputs: parts-versus-finished goods, material-versus-end-use classification, the HTS 2022 subheading splits, knitted-versus-woven apparel, residual "nesoi" overuse, and fabricated citations in reasoning. The pattern is mining v1's wrong predictions, generating preference pairs (chosen = correct, rejected = v1's wrong output), and running DPO on top of the existing adapter. RLAIF rather than classic RLHF because the judge model is Claude, not human annotators.
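The pair-mining step could be sketched roughly as follows. The record fields (`prompt`, `reference`, `prediction`) and the helper name are hypothetical; the prompt/chosen/rejected layout is the common convention for DPO datasets, and the judge-model filtering is left out entirely.

```python
def build_preference_pairs(records: list[dict]) -> list[dict]:
    """Turn v1 mistakes into DPO pairs: chosen = reference, rejected = v1 output."""
    pairs = []
    for rec in records:
        if rec["prediction"].strip() == rec["reference"].strip():
            continue  # v1 was right, so there is nothing to contrast
        pairs.append(
            {
                "prompt": rec["prompt"],
                "chosen": rec["reference"],
                "rejected": rec["prediction"],
            }
        )
    return pairs


# Invented records with placeholder codes, purely to show the shape of the data.
records = [
    {"prompt": "Product: item A", "reference": "code-1", "prediction": "code-1"},
    {"prompt": "Product: item B", "reference": "code-2", "prediction": "code-9"},
]
print(build_preference_pairs(records))
```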
The second is retrieval-augmented training, where each training example carries relevant CROSS rulings and chapter notes in the prompt. v1 is doing memorized pattern matching; v2 should be grounded reasoning over retrieved evidence.
I'll write the follow-up when DPO ships.
Closing
This was the day I proved the pipeline works end-to-end. Sixteen hours, $43, one rented H100, one base 8B model, one CROSS-rulings dataset. The result isn't state of the art, but it's a real working artifact running on commodity infrastructure for less than the cost of dinner.
If you're trying to fine-tune an 8B model on a single GPU and you've been intimidated by the tooling, the takeaway is: it's harder than it looks at the edges, easier than it looks in the middle, and almost entirely composed of small mistakes that someone has already made for you. Including, now, me.
Model: mfbaig35r/hts-nemotron-8b-lora-v1 (private until v2). Find me on LinkedIn or GitHub. Always happy to talk about applied LLM work.