Fine-tuning LLMs on Google Colab with QLoRA and Unsloth

A quick guide to fine-tuning lightweight LLMs on a Tesla T4 GPU in under an hour for ISE-547.

Fine-Tuning an LLM for Critical Road Trip Planning

1. Motivation

Planning a road trip can be complex, and users often turn to Large Language Models (LLMs) to refine their itineraries. However, standard LLMs tend to be overly flattering and agreeable; they often fail to identify fundamental flaws in a user’s initial plan. In contrast, communities like Reddit’s road trip subreddits offer highly critical but incredibly practical and helpful advice. This project aims to fine-tune an LLM to emulate this “critical-but-helpful” persona, actively calling out bad ideas while providing grounded routing alternatives.

2. Data

Guided by the principles of the LIMA paper (NeurIPS 2025), which demonstrates that a small volume of high-quality data outweighs sheer quantity, this project utilized a micro-dataset.

I manually collected 30 high-quality question-and-answer pairs from Reddit road trip communities. The selected samples explicitly showcase the desired behavior: direct criticism of the initial plan followed by highly relevant advice.

Example Data Point:

The data was pre-processed into standard JSON chat format (user/assistant roles) for training.

3. Method

The project utilized Llama-3.1-8B-Instruct with max sequence length 1024 as the base model. To enable training on a Google Colab instance equipped with a single Tesla T4 GPU (16GB VRAM), I employed 4-bit Quantization (QLoRA) alongside the Unsloth library, which heavily optimizes memory usage and training speed.

Training Hyperparameters:

4. Evaluation

Evaluation criteria

I want to evaluate the answers generated by fine-tuned LLM in terms of three aspects: helpfulness, criticism, and tone.

    • Criticism: Ability to identify and directly critique poor routing or bad ideas with proper reasoning.
    • Helpfulness: Ability to provide grounded, practical, and highly relevant advice.
  1. ** Tone:** the answer should mimic a casual tone of Reddit comments. It should not be too formal.

Evaluation method

I use LLM-as-a-Judge, because it is very difficult to ontain reference answers for the road trip with these traits. The LLM judge will compare the answers generated by fine-tuned LLM with the answers generated by the base model and determine which model is better.

Baseline Control

The base Llama-3.1-8B-Instruct model, heavily prompt-engineered with the following system prompt:

You are a blunt, experienced road trip advisor. When users suggest a route, quickly criticize any bad ideas (like driving through Texas in the summer or hitting Atlanta traffic) and offer one piece of highly practical, realistic helpful advice.

Test Set

100 brand-new, unseen road trip queries featuring common mistakes (unrealistic timelines, bad weather choices, geographic jumps). I generate answers from both fine-tuned LLM and base model.

Prompts

I use two types of LLM-as-a-Judge prompts.

  1. Likert Scale Scoring: Rate each response on a scale of 1-5 for each trait.
  2. Pairwise Comparison: Compare the two responses and declare a winner for each trait.

Likert Scale Scoring Prompt 1

Evaluate the quality of the following road trip advice generated by an AI assistant. Rate the response on a scale from 1 (worst) to 5 (best) across three dimensions:

1. Criticism: Ability to identify and directly critique poor routing, unrealistic timelines, or bad ideas with proper reasoning.
2. Helpfulness: Ability to provide grounded, practical, and highly relevant advice.
3. Tone: Ability to mimic a casual, conversational tone typical of a Reddit comment, avoiding excessive formality.

[User Query]: {prompt}

[Assistant Response]: {response}

Output your scores in the following exact format:
Criticism Score: [1-5]
Helpfulness Score: [1-5]
Tone Score: [1-5]

Likert Scale Scoring Prompt 2

Explains rubrics in detail.

You are an expert judge evaluating AI-generated road trip advice. Score the assistant's response from 1 to 5 for the following dimensions using the provided rubric:

Criticism:
1: Ignores glaring flaws or agrees with dangerous/bad ideas.
3: Identifies bad ideas but lacks strong reasoning or directness.
5: Directly calls out bad ideas with clear, logical reasoning.

Helpfulness:
1: Gives generic, unhelpful, or dangerous advice.
3: Provides basic advice that is technically correct but lacks practical insight.
5: Offers highly grounded, practical, and specific advice.

Tone:
1: Highly formal, robotic, or overly polite (corporate AI voice).
3: Conversational but slightly stiff.
5: Casual, blunt, and relatable (like a highly upvoted Reddit comment).

[User Query]: {prompt}
[Assistant Response]: {response}

Provide a brief rationale for each score, then end your response with:
CRITICISM: [Score]
HELPFULNESS: [Score]
TONE: [Score]

Pairwise Comparison Prompt 1

Dimension-Specific winners.

You are an impartial judge evaluating two AI assistants providing road trip advice. Compare them across three dimensions:
1. Criticism: Which assistant better identifies and critiques poor routing, unrealistic timelines, or bad ideas with direct reasoning?
2. Helpfulness: Which assistant provides more grounded, practical, and highly relevant advice?
3. Tone: Which assistant sounds more like a casual, experienced Reddit commenter, avoiding stiff formality?

[User Query]: {prompt}

[Assistant A]: {ans_a}
[Assistant B]: {ans_b}

Evaluate both assistants. On a new line at the very end of your response, explicitly declare the winners for EACH category by writing EXACTLY these three lines:
Criticism Winner: [A, B, or Tie]
Helpfulness Winner: [A, B, or Tie]
Tone Winner: [A, B, or Tie]

Pairwise Comparison Prompt 2

Overall winner with step-by-step rationale.

You are an experienced road trip advisor judging two AI responses. Read the user's query and compare Assistant 0 and Assistant 1 based on Criticism, Helpfulness, and a casual Reddit-style Tone.

[User Query]: {prompt}

[Assistant 0]: {ans_0}
[Assistant 1]: {ans_1}

First, provide a brief comparison of how each assistant handled the prompt. 
Then, decide which assistant provided the best overall response. You must choose one winner or declare a tie. 
End your response with EXACTLY this format:
Overall Winner: [Assistant 0, Assistant 1, or Tie]

LLM Judges

I use four types of LLM. Therefore I am evaluating on 4 * 4 = 16 different setups. For LLM, I first tried openrouter, but the free mode has daily limit of 50 queries. Therefore I turnt to huggingface and downloaded the 4-bit quantized models for local inference on google colab.

LLM types:

  1. Qwen/Qwen2.5-7B-Instruct
  2. microsoft/Phi-3.5-mini-instruct
  3. mistralai/Mistral-Nemo-Instruct-2407
  4. Yi-1.5-9B-Chat

5. Results

The quantitative results are as follows. Most judges favor the baseline model, except for the tone metric

Judgement summary by Qwen2.5-7B-Instruct

LIKERT 1 AVERAGES (Out of 5.0)

Metric Baseline Fine-Tuned
Criticism 3.62 2.56
Helpfulness 3.75 2.79
Tone 4.18 4.31

LIKERT 2 AVERAGES (Out of 5.0)

Metric Baseline Fine-Tuned
Criticism 4.68 3.36
Helpfulness 4.53 3.17
Tone 4.63 4.07

PAIRWISE 1 (Dimension-Specific Wins)

Metric Base FT Ties Errs
Criticism 23 5 0 72
Helpfulness 14 4 0 82
Tone 9 4 0 87

PAIRWISE 2 (Overall Winner)

Metric Base FT Ties Errs
Overall 77 23 0 0

Judgement summary by Phi-3.5-mini-instruct

LIKERT 1 AVERAGES (Out of 5.0)

Metric Baseline Fine-Tuned
Criticism 4.85 3.63
Helpfulness 4.48 3.35
Tone 2.00 0.00

LIKERT 2 AVERAGES (Out of 5.0)

Metric Baseline Fine-Tuned
Criticism 4.77 3.24
Helpfulness 4.63 3.21
Tone 4.80 3.88

PAIRWISE 1 (Dimension-Specific Wins)

Metric Base FT Ties Errs
Criticism 76 22 0 2
Helpfulness 65 32 1 2
Tone 30 25 0 45

PAIRWISE 2 (Overall Winner)

Metric Base FT Ties Errs
Overall 70 26 2 2

Judgement summary by mistralai/Mistral-Nemo-Instruct-2407

LIKERT 1 AVERAGES (Out of 5.0)

Metric Baseline Fine-Tuned
Criticism 4.45 3.75
Helpfulness 4.13 3.18
Tone 4.13 4.13

LIKERT 2 AVERAGES (Out of 5.0)

Metric Baseline Fine-Tuned
Criticism 4.95 4.17
Helpfulness 4.33 3.16
Tone 3.96 3.66

PAIRWISE 1 (Dimension-Specific Wins)

Metric Base FT Ties Errs
Criticism 0 1 0 99
Helpfulness 0 0 0 100
Tone 0 0 0 100

PAIRWISE 2 (Overall Winner)

Metric Base FT Ties Errs
Overall 37 16 1 46

Judgement summary by Yi-1.5-9B-Chat

Comments on the Results: What Worked and What Did Not

Conclusion

This experiment underscores the practical limitations and tradeoffs of utilizing a micro-dataset (N=30) for supervised fine-tuning on complex, constraint-heavy tasks.

While the LIMA principle suggests that a small volume of high-quality data is sufficient for alignment, for really small size of datasets, this holds true primarily for stylistic surface-level alignment, not for preserving deep reasoning capabilities. The fine-tuned model successfully absorbed the terse, critical persona, but the narrowness of the training data severely degraded the base model’s zero-shot spatial and logical priors.

Conversely, the prompt-engineered baseline demonstrated that when a task requires a combination of domain-specific facts (geography, weather, routing math) and a specific tone, leveraging the full, uncompromised weights of a strong base model is far superior.

Key Takeaways:

Micro-fine-tuning is a stylistic filter, not a knowledge builder: Use small datasets to adjust tone, formatting, or persona, but do not expect the model to maintain complex reasoning unless that reasoning is heavily represented in the training distribution.

Prompt Engineering wins for Logistical Tasks: For tasks involving routing, math, or strict real-world constraints, robust prompt engineering on an untouched, highly capable model yields significantly safer and more practical results.

Future Directions: A more resilient approach for this specific use case might involve a multi-agent pipeline: using a prompt-engineered baseline to generate the logical routing and constraints, followed by a dynamically fine-tuned “critic” model to rewrite the output into the desired Reddit-style persona.

Appendix

References

Evaluation Dataset

  1. Driving from Miami to Key West in the middle of a Category 3 hurricane watch because our hotel is non-refundable. What snacks should we pack?
  2. We have a 3-day weekend and want to drive from Chicago to Yellowstone, see Old Faithful, and drive back. Best route?
  3. Taking my lowered Mazda Miata on the Dalton Highway in Alaska next month. Are there any good car washes up there?
  4. I’m doing a day trip from Phoenix to the Grand Canyon in July. We plan to hike down to the river and back up by sunset. How many water bottles do we need for 4 people?
  5. Renting a 40-foot motorhome to drive the Going-to-the-Sun Road in Glacier National Park. Any tips for a first-timer?
  6. Driving across Texas from Beaumont to El Paso. It looks like it should take about 4 hours, right? Where should we stop for lunch?
  7. We want to drive from Seattle to Juneau, Alaska this weekend. Can we just take I-5 all the way up?
  8. Taking my non-AWD minivan over Loveland Pass in Colorado in February. I have all-season tires, that’s enough right?
  9. Planning to ‘wing it’ and find campsites in Yosemite Valley during 4th of July weekend without reservations. What time should we arrive to grab a spot?
  10. Driving an electric vehicle with a 150-mile range through the Nevada desert on Highway 50 (The Loneliest Road). Where are the best fast chargers?
  11. Going from Atlanta to Orlando on the Saturday of Spring Break. We want to avoid highways and take scenic dirt roads. Any routes?
  12. I want to drive from New York to London. Can I take a ferry with my car from Maine?
  13. Road trip from Denver to Aspen via Independence Pass in December. We have a front-wheel-drive sedan. Will we need chains or is it plowed?
  14. We only have 4 days, so we’re going to drive Route 66 from Chicago to LA. What are the best 5-minute photo stops?
  15. Planning a scenic drive through the Florida Everglades in August in a jeep with no doors or roof. What bug spray is best?
  16. Towing a 10,000lb 5th-wheel trailer for the first time. We’re going straight to the Mount Washington Auto Road in New Hampshire. Advice?
  17. We’re driving from LA to Vegas on a Friday at 5 PM. How fast can we get there if we speed?
  18. I’m doing a solo road trip from Maine to Florida without stopping to sleep, just energy drinks and loud music. What’s the best playlist?
  19. Taking a smart car on the White Rim Trail in Canyonlands National Park. We have good clearance, right?
  20. Driving from Toronto to Banff for a long weekend. We’ll leave Friday night and be back Monday morning. Best sights?
  21. Driving from Seattle to Honolulu. Are there any ferries that take cars from Washington state directly there?
  22. We have a 6-hour layover in Denver, so we’re going to rent a car, drive to Rocky Mountain National Park, hike a 14er, and come back. Is I-25 the fastest way?
  23. I plan to drive the entire Pacific Coast Highway in one day. What are the absolute best 3 beaches to surf at along the way?
  24. Taking a U-Haul box truck through the Tail of the Dragon in North Carolina. I hear it’s curvy, but I’m an okay driver.
  25. Road tripping through Tornado Alley (Oklahoma/Kansas) during peak severe weather season to chase storms in my Toyota Prius. What apps should I download?
  26. Driving to the North Pole from Anchorage, Alaska. Google Maps isn’t showing a route, what road do I take?
  27. We’re driving from San Antonio to Austin on Monday morning at 7:30 AM. We should have the road to ourselves, right?
  28. Planning to sleep in my car in Phoenix in July to save on hotels. Do I need to keep the AC running all night or crack a window?
  29. We want to see the Northern Lights, so we’re driving from Miami to North Dakota for a 2-day trip in mid-July. Will they be bright?
  30. Taking my brand new Porsche 911 off-roading in Moab, Utah. Which trails are the smoothest for luxury cars?
  31. Driving from Vancouver to Montreal in January. I don’t have winter tires, but I have a lot of driving experience. Should be fine, right?
  32. Going on a family road trip to Disney World from New York. We’re fitting 8 people in a 5-seater SUV for 20 hours to save money on a rental.
  33. I’ve never driven a manual car, but I rented one for a 10-day road trip through the steep hills of San Francisco and the PCH. Any tips?
  34. Planning to drive across the Mojave Desert in July with my dogs. We don’t have AC in the car, but we’ll roll the windows down.
  35. We have 48 hours to drive from Houston to Chicago. We want to stop and tour Nashville, St. Louis, and Memphis on the way. Best itinerary?
  36. Renting a 50cc scooter to ride on the interstate from Las Vegas to the Hoover Dam. It goes up to 40mph. Is that safe?
  37. Driving from Detroit to Windsor, Canada for lunch. I lost my passport, but my driver’s license should be enough to cross the border, right?
  38. Going to do the entire Blue Ridge Parkway in a single afternoon. Where should we stop for a 5-minute picnic?
  39. Towing a heavy boat from Florida to Colorado through the mountains using a 4-cylinder crossover SUV. What gear should I use on the inclines?
  40. I plan to drive 1,000 miles a day for a week. I’ll just sleep at rest stops for 3 hours a night. How do truck drivers do it?
  41. Driving into Manhattan on New Year’s Eve to park near Times Square. Where are the cheapest street parking spots?
  42. We want to visit all 50 states in 14 days by car. Is it faster to start in Maine or Washington?
  43. Taking my RV through the Zion-Mount Carmel Tunnel. I didn’t measure my RV’s height, but it looks like it’ll fit.
  44. Road trip from Boston to Acadia National Park on a Sunday afternoon in October. We’re expecting no traffic on Route 1. Best leaf-peeping spots?
  45. Driving from LA to San Francisco on I-5. We want to stop every hour for a scenic ocean view and beach walk.
  46. I’m 18 and taking my mom’s minivan on a 5,000-mile road trip. I haven’t checked the oil or tires in a year, but it runs fine.
  47. Driving the treacherous Dalton Highway in Alaska in a rented Tesla. Where are the Superchargers located near the Arctic Circle?
  48. We want to do a romantic road trip through Death Valley in August. We’re planning a 10-mile hike at noon. Are there water fountains on the trails?
  49. Driving from Chicago to Minneapolis in a blizzard. The news says ‘no travel advised’ but I have a 4x4 truck so I’m invincible, right?
  50. We have 3 days to drive from Portland, Oregon to Portland, Maine. What’s the most scenic route?
  51. Taking a 35-foot RV to the top of Pikes Peak in Colorado. Are the switchbacks wide enough?
  52. Planning a relaxing road trip on I-4 through Orlando during rush hour. What podcasts do you recommend for a smooth, fast drive?
  53. Driving from Vegas to LA on a Sunday afternoon after a holiday weekend. Should be a breeze, right?
  54. I want to off-road in Sedona, Arizona. I have a rental Chevy Malibu. Which red rock trails are best for sedans?
  55. We’re going to drive from NYC to Niagara Falls for an afternoon, then drive right back the same day. Do we have time for the Maid of the Mist?
  56. Road trip across the Sonoran Desert in July. I’m bringing 2 bottles of water and a bag of chips. Will I find gas stations every 10 miles?
  57. Driving from Anchorage to Fairbanks in December. I’m wearing shorts and a t-shirt because I have a great heater in the car.
  58. We want to drive from Key West to Seattle using only local, 2-lane roads in exactly 7 days. Is this doable?
  59. Towing my car behind an RV for the first time. I didn’t check the weight limits, but it’s just a small SUV. Best route over the Rockies?
  60. Planning to drive into Yosemite Valley on Memorial Day weekend at noon. Will there be plenty of parking at Yosemite Falls?
  61. I want to drive from Dallas to Houston, Austin, San Antonio, and back to Dallas all in one Saturday. Best taco stops?
  62. Taking a motorcycle trip from Seattle to San Diego in November. I only have a light windbreaker. Will it be chilly?
  63. Driving across the US, from LA to NYC. I only have a budget of $100 for gas. What’s the most fuel-efficient route?
  64. We want to see the Grand Canyon, but we’re flying into Orlando. Can we drive there for a day trip?
  65. Renting a Lamborghini to drive on the unpaved washboard roads of Monument Valley. Is the gravel rough?
  66. Planning a 10-hour drive through the Midwest. My check engine light has been blinking red for a week, but the car sounds fine. Good to go?
  67. Driving from Philadelphia to Washington DC at 4 PM on a weekday. I’ll just hop on I-95, shouldn’t take more than 2 hours.
  68. We want to camp in the Everglades in July. We’re bringing a simple mesh tent and no bug spray to ‘rough it’.
  69. Driving from San Francisco to Lake Tahoe on a Friday night during a heavy snowstorm. I don’t have chains, but I have an AWD Subaru.
  70. I plan to drive 24 hours straight from Miami to New York. I don’t drink coffee. What are some good slapping techniques to stay awake?
  71. Taking my 40-foot fifth wheel through the narrow, historic cobblestone streets of Charleston, South Carolina. Are there plenty of places to turn around?
  72. We have a 2-day weekend. We want to drive from Denver to Mount Rushmore, see the Badlands, visit Yellowstone, and drive back.
  73. Driving an electric car from LA to Vegas, but I plan to drive 90mph the whole way with the AC on full blast. Will I need to stop to charge?
  74. Going on a road trip from Chicago to New Orleans. We are not using a GPS or map, just heading south and guessing. Any tips?
  75. Driving from Anchorage to Prudhoe Bay in November. We don’t have a spare tire, CB radio, or extra gas. Just raw-dogging the Dalton Highway.
  76. We’re driving from Atlanta to Savannah during a hurricane evacuation order, but we’re going towards the coast for a cheap beach vacation.
  77. I want to drive from Boston to Cape Cod on a Friday evening in July. Should take about an hour, right?
  78. Taking my lowered Honda Civic over the Alpine Loop in Colorado. I hear it’s a dirt road, but it’s a road, right?
  79. Planning to drive from El Paso to Big Bend National Park. We have half a tank of gas and aren’t planning to stop. That’s enough, right?
  80. We want to drive from Toronto to Vancouver in 3 days. What are the best roadside diners to stop at every 2 hours?
  81. Renting a 15-passenger van for the first time. Going to drive the winding cliffside roads of the Pacific Coast Highway near Big Sur at night.
  82. Driving from Seattle to Glacier National Park in January. I really want to drive the Going-to-the-Sun Road. Is it scenic in winter?
  83. We’re taking a family road trip from LA to the Grand Canyon. We have 5 kids and are leaving at 2 PM on a Friday. What could go wrong?
  84. Planning a road trip through the remote Nevada desert. I don’t have a spare tire, jack, or jumper cables to save weight for more luggage.
  85. I want to drive from Miami to Key West on New Year’s Eve without booking a hotel in advance. We’ll just find a cheap motel when we get there.
  86. Driving a massive Class A motorhome onto the auto train from Virginia to Florida. It’s 14 feet tall, it’ll fit in the train cars right?
  87. Going to do the ultimate road trip: NYC to LA to Miami to Seattle in one week. What’s the best podcast for the drive?
  88. Taking my rear-wheel-drive sports car to a ski resort in Utah in the middle of a blizzard. I don’t need snow tires if I drive slow, right?
  89. We have 4 hours to kill in Las Vegas, so we’re going to rent a car, drive to the Grand Canyon Skywalk, and come back before our flight.
  90. Driving from Chicago to St. Louis. I only have my learner’s permit and no licensed driver in the car, but I need to get there. What’s the backroad route?
  91. I’m the only driver and I plan to drive 14 hours a day for 4 days straight to get from Seattle to Miami. What energy drinks or snacks do you recommend to stay awake?
  92. Driving from NYC to Los Angeles. I only have 3 days total for the whole trip. What are the absolute must-see stops along the way?
  93. Planning a road trip from Boston to DC on the Friday afternoon before Thanksgiving. I want to take I-95 straight down through New York and Philly. Good idea?
  94. Driving from Dallas to El Paso tomorrow. Going straight through on I-20 and I-10. Are there any cool hidden gems on this route or should I just power through?
  95. I’m doing a 5-day trip in Utah. I want to see Zion, Bryce, Capitol Reef, Canyonlands, Arches, and also spend a day at the Grand Canyon in Arizona. What’s the best itinerary to fit this all in?
  96. Road trip down the California coast! We have 2 days to get from San Francisco to San Diego. We want to stop and hike in Big Sur, see the Hollywood sign, and spend a few hours at Santa Monica pier.
  97. Me and my buddies are driving a convertible from Chicago to Denver in mid-January. What scenic backroads should we take through the mountains?
  98. Driving through Death Valley in mid-August. I want to do some dispersed camping a few miles off the main paved road to save money. What kind of tent do I need?
  99. Taking my 2004 Honda Civic with 200,000 miles on a 4,000-mile road trip through the Rockies. Any tips for mountain driving?
  100. Renting an RV for the absolute first time! We are going from SF down Highway 1 to LA. I hear the cliff views are great. Any RV-specific tips for driving that specific road?