Fine-tuning Llama-2: The Definitive Guide

By Mark Hennings
January 1st, 2024

Llama-2 is an open source large language model (LLM) from Meta, released in 2023 under a custom license that permits commercial use. It was trained on 2 trillion tokens of publicly available data and matches the performance of GPT-3 on a number of metrics. 

As far as open source models go, it’s pretty good.

Here are some quick facts about Llama-2:

  • Release Date: July 18, 2023

  • Pre-training Dataset Size: 2 trillion tokens

  • Available Sizes: 7B, 13B, and 70B, plus a chat version of each

  • Data Cutoff: September 2022 (with some exceptions for tuning data up to July 2023)

  • Context Window: 4,096 tokens

  • Languages: Primarily English

  • Paper: https://arxiv.org/pdf/2307.09288.pdf

  • License: Research and commercial use

Fine-tuning allows you to train Llama-2 on your proprietary dataset to perform better at specific tasks. By learning how to fine-tune Llama-2 properly, you can create incredible tools and automations.

In this guide, we’ll show you how to fine-tune a simple Llama-2 classifier that predicts if a text’s sentiment is positive, neutral, or negative. 

At the end, we’ll download the model weights.

And best of all, we’re going to do it without configuring a GPU or writing a line of code.

Llama-2 Fine-tuning APIs

The reason we can fine-tune Llama-2 without running our own GPU server is that we can leverage a fine-tuning API from a platform that’s already invested in handling infrastructure efficiently.

Gradient and Replicate are two startups that offer Llama-2 fine-tuning and inference via API. They handle the low-level libraries, configuration, and GPU servers needed to run fine-tuning jobs and inference for Llama-2 models.

Here is a quick comparison of Gradient vs Replicate for fine-tuning:

Models Available

  • Gradient: Bloom 560M, Llama-2 7B Chat, Nous-Hermes 2 (13B)

  • Replicate: Llama-2 7B Chat, Llama-2 13B Chat, Llama-2 70B Chat, Llama-2 7B, Llama-2 13B, Llama-2 70B

Pricing

  • Gradient: per 1k tokens

  • Replicate: per second of server time

Max Context Window

  • Gradient: 512 tokens

  • Replicate: 4,096 tokens

Downloadable Model Weights

  • Gradient: No (coming soon)

  • Replicate: Yes

Fine-tuning Methods

  • Gradient: LoRA

  • Replicate: LoRA or QLoRA

View Training Loss

  • Gradient: Yes (via integration with Weights and Biases)

  • Replicate: Yes (in logs)

Submit Validation Data

  • Gradient: No

  • Replicate: Yes

LoRA Training Layers

  • Gradient: All layers ⭐

  • Replicate: Query and Value

Hyperparameters Available

  • Gradient: Learning Rate and LoRA Rank (behind the scenes: Alpha = Rank, Dropout = 0)

  • Replicate: Learning Rate, Number of Epochs, Batch Size, Micro Batch Size, LoRA Rank, LoRA Alpha, LoRA Dropout

API Reliability

  • Gradient: Needs work

  • Replicate: Solid

In this guide, we’ll run our first fine-tune on Replicate, using a parameter-efficient fine-tuning method called Low-Rank Adaptation (LoRA). Don’t worry, you don’t need to understand how LoRA works to follow this guide.

While I'd strongly prefer that Replicate trained all the layers of the model as Gradient does, it’s a reliable API that we can count on. For more information on LoRA and training all layers of the model, see our comprehensive guide on LoRA Fine-tuning.
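If you're curious what LoRA is doing under the hood, here's a minimal NumPy sketch of the core idea (an illustration only, not Replicate's training code): instead of updating the full weight matrix W, you train two small matrices, A and B, whose product forms a low-rank update.

import numpy as np

d, r = 512, 32                    # hidden size (Llama-2 7B uses 4096) and LoRA rank
W = np.random.randn(d, d) * 0.02  # frozen pre-trained weight matrix

# Trainable low-rank factors. B starts at zero so the adapted model
# behaves exactly like the original at the start of training.
A = np.random.randn(r, d) * 0.01
B = np.zeros((d, r))

alpha = 32                                # LoRA scaling hyperparameter
W_adapted = W + (alpha / r) * (B @ A)     # effective weights after fine-tuning

# Only A and B are trained: 2*d*r parameters instead of d*d.
print(f"trainable: {2 * d * r:,} vs full: {d * d:,}")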

To make this all possible without any code, and to avoid getting hung up on any idiosyncrasies of the Replicate API, we will leverage the Entry Point AI fine-tuning platform to import our data, write a prompt template, send off the training job, and evaluate performance.

There are a lot of details to get right when fine-tuning LLMs — by leveraging trusted products like Replicate and Entry Point you can feel confident that your stack is configured correctly. That helps us focus on high-level concepts and achieving results.

Before we run an actual fine-tune, let’s cover the key concepts that will lead us to success, including how much data we really need.

How Much Data is Needed for Fine-tuning?

One of the most common misconceptions about fine-tuning is that you need a massive amount of data to train a model. This idea is inherited from the old days of machine learning (e.g. a couple years ago) when models were smaller and had less pre-training.

Modern LLMs can be fine-tuned to perform a specific task better with just a few dozen examples. 

Let’s look at why.

In the paper LIMA: Less is More for Alignment, the authors instruct-tuned a base model on just 1,000 highly curated examples with remarkable results. In other words, it only took 1,000 high-quality examples to create a pretty good chatbot.

When it came time to incrementally improve the model, the authors showed that an additional 30 examples could make the chatbot perform well at multi-turn dialogue. 

To improve its performance at complex formatting tasks, like summarizing an article into bullet points, as few as 6 additional examples could unlock this behavior, and related behaviors outside the scope of those few examples.

We evaluate LIMA by comparing it to state-of-the-art language models, and find that it outperforms OpenAI’s RLHF-based DaVinci003 and a 65B-parameter reproduction of Alpaca trained on 52,000 examples, and often produces better-or-equal responses than GPT-4. An analysis of LIMA generations finds that 50% of its outputs are considered excellent. The fact that simple fine-tuning over so few examples is enough to compete with the state of the art strongly supports the [hypothesis].

The paper suggests that almost all of the model’s knowledge and capabilities are learned during pre-training. Fine-tuning simply allows us to apply that knowledge to a specific task.

We hypothesize that alignment can be a simple process where the model learns the style or format for interacting with users, to expose the knowledge and capabilities that were already acquired during pretraining.

The authors conclude that example quality and diversity are much more important than the sheer volume of training data.

We observe that, for the purpose of alignment, scaling up input diversity and output quality have measurable positive effects, while scaling up quantity alone might not.

Their findings suggest that if we start with a high-quality chat model, we should be able to fine-tune it on a handful of very different inputs and high-quality responses for our specific task and start to see changes in the output.

I repeatedly find this to be true in my own experience and we’ll demonstrate it with fine-tuning Llama-2.

Now, let’s discuss which model to use.

Select a Llama-2 Model for Fine-tuning

Llama-2 comes in 7B, 13B, and 70B variants. Each one has a base model and a chat-based counterpart. The chat-based ones respond to human instructions, which also comes in handy for fine-tuning, as we’ll show.

Unless you have a special use case or want to train a chatbot from scratch like the authors of the LIMA paper above, it’s easier to start with a chat version.

Classification is a fairly simple task for these models, so we don’t need a very large model. We will fine-tune the Llama-2 7B Chat model in this guide.

Steer the Fine-tune with Prompt Engineering

When it comes to fine-tuning, Llama-2 is more like a wild stallion than a cute wooly animal, and prompt engineering is our lasso.

We’ll write a simple prompt and prepend it to every example we submit for fine-tuning.

When it comes to fine-tuning, there are three key ways that prompt engineering helps:

1. Directly steer behavior

In the context of fine-tuning, the prompt provides instructions for how to arrive at the output shown in each example, which allows the model to learn more clearly. When you include the same prompt for inference, it again helps guide the model to perform the task correctly.

For example, sometimes Llama-2 has an issue where it tends to keep generating tokens for too long, which can be fixed with simple language in the prompt:

  • Only generate one word.

  • Generate a maximum of 3 sentences.

  • Stop generating after _____.

Your prompt will vary based on the task and can include many specific instructions.

2. Indirectly steer behavior with contextual cues

Not only do the actual instructions in your prompt help the model, but the presence of any prompt whatsoever helps to cue the model to perform the right behavior.

This benefit is something I discovered while doing extensive fine-tuning tests with Llama-2.

Let’s consider why this works.

In a transformer model, your prompt is transformed into a set of vectors called embeddings that capture the meaning of each word individually. The model then applies the attention mechanism, multiplying these vectors together in a specific way, to relate the words to each other, so that they can be understood in context.

Every word has an impact on the final interpretation of all the other words.

Starting each prompt with the same sequence of words primes the model to interpret each training example in a similar way to all the others. The internal representations of the examples share commonalities that help cue the learned behavior.

Using the same prompt for inference helps trigger the correct behavior, also.
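To make that concrete, here's a toy scaled dot-product attention in NumPy (a sketch of the mechanism, not Llama-2's actual implementation). Notice that each output row is a weighted mix of every token's vector, which is why a shared prompt prefix nudges the representation of the entire example.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Toy version: queries, keys, and values are all the embeddings themselves.
    scores = x @ x.T / np.sqrt(x.shape[-1])  # how much each token attends to each other token
    return softmax(scores) @ x               # each output mixes information from ALL tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))     # 6 toy token embeddings, dimension 16
out = self_attention(x)
print(out.shape)                 # (6, 16): one context-aware vector per token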

Here’s how I discovered this. In my experiments, I was frustrated that Llama-2 didn’t seem to be learning from my training data in the same way that GPT-3 would. If I added a specific prompt with detailed instructions, I could get results — but how could I be sure it was learning from my data and that the prompt wasn’t doing all the heavy lifting?

So I decided to test with a generic prompt that would not mention any instructions that specifically help the model with its task.

With no prompt at all, the model got every single answer in my eval set wrong on a classification task. The outputs didn’t even resemble the single-word outputs I had in my training examples — they were completely off the rails.

That’s what kept happening, even when I tried to heavily overfit with more epochs and a higher learning rate.

I was becoming discouraged because fine-tuning GPT-3.5 Turbo seemed so much easier and I couldn’t seem to figure out why my training data wouldn’t "take" with Llama-2. This led me down a few rabbit holes, including a deep-dive into LoRA and QLoRA.

But then, I wondered what would happen if I took out the specific instructions from the prompt. There would be a prompt, but there would be no way for the model to know exactly what it’s supposed to do except from its training data.

I ran the fine-tuning job again with this prompt:

You are a fine-tuned AI model that has been trained on a special task. Remember that task and perform only it. Do not perform other tasks or include extraneous outputs. Only perform the task you were trained on in response to the input.

Suddenly the model got 3 out of 100 validation examples correct — in the format I was expecting! While it was still a low score, it clearly showed that my fine-tuning data had an effect on the Llama-2 model.

By moving from a generic prompt to a specific prompt that had details about the task, I was able to jump up to 80% classification accuracy. In other words, 80 out of 100 outputs matched my validation examples exactly. 

I was using a subset of 1,000 records from the sentiment140 dataset, which I’ve since found to be low quality data with a lot of noise in it, so 80% was a good score.

Here is the specific prompt:

You are a fine-tuned classifier model that has been trained to classify the sentiment of a tweet as either "Positive" or "Negative". Remember that task and perform only it. Your output response must be a single word for the classification.

While I can’t say for sure if longer is better, I would expect that the more specific, unique, and expansive your prompt is, the more likely it is to cue your model to follow its learned behavior from your training data. 

On the other hand, it is possible that if you are too specific in the prompt, the model may respond more to your prompt than the training data, because LLMs are very sensitive to anything in the prompt. There is a delicate balance to strike here.

3. Preserve model flexibility

We have found that by including a specific prompt with each training example, models can retain an ability to respond differently when you change the prompt at inference.

For example, if you fine-tune a model to write funny tweets with this system prompt included in every example:

Write a funny tweet.

But if you later find that the model is making inappropriate jokes, you can modify the prompt to correct for it, using the same model:

Write a funny tweet. Don’t be offensive.

The ideal solution might be to go back to your dataset and remove any examples of offensive outputs, or provide counter-examples that decline to write offensive tweets, but in some cases a change to the prompt is sufficient.

Review the Llama-2 Chat Syntax

Now that we understand the importance of prompt engineering in our fine-tuning dataset, let’s look at the correct prompt syntax for Llama-2 chat models.

This is what it looks like:

<s>[INST] <<SYS>>\nHere are your system prompt instructions.\n<</SYS>>\n\nPut user prompt instructions here, e.g. untrusted data.[/INST] Finally, the output goes here.</s>

The \n represents a line break. Here’s what it looks like when you include the line breaks visually instead:

<s>[INST] <<SYS>>
Here are your system prompt instructions.
<</SYS>>

Put user prompt instructions here, e.g. untrusted data.[/INST] Finally, the output goes here.</s>

That’s a bit nicer, but to say this is “ugly” is a bit of an understatement. 

I’ve written plenty of HTML in my life, which shares some similarities, but this syntax is egregious. For example, why is <s> in single <>’s and <<SYS>> in doubles? One is lowercase and the other is capitalized. Then [INST] uses square brackets. I could go on. Why the inconsistency? ChatML is much cleaner.

At any rate, getting this format exactly right is critically important. 

Entry Point AI provides a chat template that handles this syntax for us, so you never have to look at that heap of confusion.

Note: Llama-2 base models do not have any formatting expectations, so using this syntax with them won’t do anything to help.
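Still, if you ever need to produce this format yourself, a small helper makes it less error-prone. This is a sketch that mirrors the syntax shown above, not Entry Point's or Meta's reference code:

def format_llama2_chat(system: str, user: str, output: str = "") -> str:
    """Render a single-turn example in Llama-2 chat syntax."""
    text = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user}[/INST]"
    if output:  # include the completion for training examples; omit at inference
        text += f" {output}</s>"
    return text

print(format_llama2_chat(
    system="Classify the text sentiment as Positive, Negative, or Neutral.",
    user="This is the worst",
    output="Negative",
))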

Let’s Fine-tune

Connect Entry Point to Replicate

To fine-tune Llama-2, we first need to create our Entry Point and Replicate accounts.

  1. Create a free Entry Point account

  2. Create a Replicate account

All Replicate training runs on an 8x Nvidia A40 (Large) GPU instance which, at the time of this article, costs $0.0058 per second.

The training jobs I ran with the dataset we’ll use took about 4 minutes each, which comes to about $1.39 apiece (240 seconds × $0.0058/second ≈ $1.39). If you don’t get any free credits when you sign up, you'll need to enter a credit card to complete this tutorial.

Anyway, if you can swing the $1.39, then create a Replicate API key, give it a relevant name like “Entry Point,” make sure to copy it, and paste it into the Replicate integration page on Entry Point.

[Screenshot: Connect Entry Point to Replicate]

Make sure to press save.

Now, Entry Point can run fine-tuning jobs and predictions for evaluation purposes on your Replicate account.

Create a project in Entry Point

From the app home in Entry Point, press the plus button.

Choose "Blank," and name it "Text Sentiment," and press Create.

[Screenshot: New blank project in Entry Point]

Import the fine-tuning dataset

Earlier I shared the story of how I discovered the importance of prompt engineering for fine-tuning. At the time, I used a 1k subset of the sentiment140 tweet sentiment dataset, which offers a total of 1,600,000 records of tweets with positive or negative sentiment classifications.

Over a million records! Wouldn’t that be a great dataset for a fine-tuning tutorial, then?

Not exactly.

From the Less is More for Alignment discussion above, we learned that data diversity and example output quality are more important than quantity. I have found that there are two major issues with using any subset of the sentiment140 dataset for fine-tuning:

  1. All the inputs are tweets. They are similar to each other in both tone and length, so they lack diversity.

  2. This dataset forces each tweet into positive or negative classifications, with no option for neutral. Some of the tweets don’t lean positive or negative but are still labeled with one of these classes, so we’re teaching our model patterns that don’t exist. That makes it hard for the model to learn properly and for us to gauge performance.

We can do better and along the way, we will also show that less is indeed more for alignment.

For this guide, I hand-curated a dataset of 60 diverse and high-quality sentiment examples that works much better than the sentiment140 dataset for fine-tuning.

There are approximately 20 examples of each class: Positive, Neutral, and Negative. The inputs are very diverse, including article headlines, restaurant and movie reviews, single words, Wikipedia excerpts, website copy, poems, support questions, dictionary definitions, and yes - a few tweets. 

A true smorgasbord.

I labeled them by hand to ensure high quality responses, looking carefully at the nuance of each input and coming to a reasonable conclusion.

Putting together this dataset from scratch took less than an hour. 

Now, you can download it as a CSV and import it into your Entry Point AI project.

Download: Text Sentiment 60 Dataset (CSV)
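If you'd like to sanity-check the dataset first, a couple lines of pandas will confirm the size and class balance (the filename and column name below are assumptions; use whatever appears in the CSV you downloaded):

import pandas as pd

df = pd.read_csv("text-sentiment-60.csv")  # hypothetical filename for the downloaded CSV
print(len(df))                             # expect 60 examples
print(df["Sentiment"].value_counts())      # expect roughly 20 each: Positive, Neutral, Negative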

To import into Entry Point, make sure you have the project you created open. Then click the import button in the sidebar.

[Screenshot: Import button]

Next, select the CSV file you downloaded.

[Screenshot: Choosing the dataset CSV to import]

Now, we’re going to map each column in the CSV to a new field in Entry Point. This helps keep our data organized and provides a better user experience in the playground and when editing examples. 

The two things you need to do are (1) change the Sentiment column to Completion and (2) change the type for it to Predefined Options, which will provide a nice dropdown for our classifications in the UI.

[Screenshot: Setting the Sentiment field to Completion with the Predefined Options type]

You can leave all the other defaults as they are. It should look like the screenshot above when you’re done.

Press "Continue" and Entry Point will ask you how many of the examples to reserve for validation. Our dataset is only 60 examples, but we’re still going to reserve 20% or 12 of them for validation, because we need to demonstrate that our model actually works.

[Screenshot: Reserving 20% of examples for validation]

Validation examples are split out from training examples so that the model never learns from them. We want to be confident that our model can generalize its training on unique, unseen data.
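Entry Point handles the split for you, but conceptually it's just a seeded random partition, something like:

import random

def split_examples(examples, validation_fraction=0.2, seed=42):
    shuffled = examples[:]                 # copy so the caller's list isn't mutated
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]   # (train, validation)

train, validation = split_examples(list(range(60)))
print(len(train), len(validation))  # 48 12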

Wait a moment and Entry Point will automatically update after importing your examples. You will see a number appear in the sidebar next to the Examples tab.

If you don’t see it after a minute, refresh the page.

Write the Fine-tuning Template

Now, navigate to the Template tab in the sidebar.

When importing a CSV into a blank project, Entry Point automatically creates a default Template for you.

It looks like this:

[Screenshot: Default fine-tuning template]

This template uses the Handlebars templating language to insert your fields into templates for each training example. This allows you to format your training examples all-at-once without editing your actual data, so you can test different prompts and formatting with ease.

It’s just like if you were writing a bulk email and wanted to insert someone’s first name from their contact record, but instead of an email it’s a fine-tuning example, and instead of a first name it’s your custom fields.
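Under the hood, rendering an example is simple placeholder substitution. Here's a toy stand-in (Entry Point uses the real Handlebars engine; the field names here are hypothetical and come from your CSV columns):

import re

def render(template: str, fields: dict) -> str:
    # Replace each {{field}} placeholder with the value from the example.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(fields[m.group(1)]), template)

example = {"text": "This is the worst", "sentiment": "Negative"}
print(render("{{text}}", example))       # prompt side of the training example
print(render("{{sentiment}}", example))  # completion side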

Double check that you are using the Chat template type, which handles the Llama-2 syntax behind the scenes that we looked at earlier.

[Screenshot: Chat template type]

Now, let’s add a System prompt:

Classify the text sentiment as Positive, Negative, or Neutral.

Your template should look like this:

[Screenshot: Template with system prompt]

Remember to press "Save".

Start the Fine-tuning Job

Let’s get this thing off the ground, shall we?

Click "Fine-tunes" in the sidebar, and press the + button.

[Screenshot: Creating a new fine-tune]

Give it a name at the top like "Text Sentiment v1" or leave it blank for Entry Point to choose a colorful animal to distinguish this fine-tune.

For the platform, select Replicate. Then choose Llama-2 7B Chat for the base model, if it’s not already selected.

[Screenshot: Replicate with Llama-2 7B Chat selected]

The Base Model Version will be automatically populated with the latest version. To reproduce the results of this guide more precisely, you can manually enter this version ID:

13c3cdee13ee059ab779f0291d29054dab00a47dad8261375654de5540165fb0

Next, we need to enter our model details under Destination so Entry Point knows which model on Replicate to train a new version for. A model on Replicate is like a workspace where you can run fine-tuning jobs.

You can manually create a model on Replicate, but it’s easier to create one on the fly through Entry Point, which we’ll do now.

Under Destination, you need to enter your Replicate username in the Model Owner box. For the model name, you can enter "text-sentiment" or whatever you want to call the model, and a "Create" button will appear. 

[Screenshot: Creating a Replicate model]

Press create, and a green checkbox will appear to indicate that your model now exists on Replicate and is ready for us to add our first version.

[Screenshot: Replicate model is ready]

Note: If you already have a model on Replicate, you can enter it here and the green checkbox will indicate that Entry Point found it successfully. Once you run your first fine-tune on Replicate for a project, these fields will populate for you and you can add new versions without re-entering details.

By default, all models that Entry Point creates for you on Replicate are Private.

Hyperparameters

Next, click advanced. Change the Number of Epochs to 4. This means that the model will learn from our examples 4 times over, which is usually good for training classifiers.

[Screenshot: Hyperparameter settings]

Entry Point sets reasonable default values for the other hyperparameters, so we can leave them as-is. To learn more about these, see our guide to LoRA fine-tuning.

Here is the full set of hyperparameters I would recommend for this dataset:

  • Number of Epochs: 4

  • Learning Rate: 0.0002 or 0.0003

  • Batch Size: 1

  • Micro Batch Size: 1

  • LoRA Rank: 32

  • LoRA Alpha: 32 or 64

  • LoRA Dropout: 0.05

For more information on LoRA hyperparameters, see our guide.
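Entry Point makes the API calls for you, but for reference, kicking off a similar job directly with Replicate's Python client would look roughly like this. The trainings.create call is real; the input parameter names are my best guess at the trainer's schema and the destination is a placeholder, so check Replicate's training docs before relying on them:

import replicate

training = replicate.trainings.create(
    # Base model plus the version ID from this guide (the model slug is an assumption):
    version="meta/llama-2-7b-chat:13c3cdee13ee059ab779f0291d29054dab00a47dad8261375654de5540165fb0",
    destination="your-username/text-sentiment",  # the Replicate model to receive the new version
    input={
        "train_data": "https://example.com/train.jsonl",  # URL to your formatted JSONL file
        "num_train_epochs": 4,
        "learning_rate": 0.0002,
        "train_batch_size": 1,
        "micro_batch_size": 1,
        "lora_rank": 32,
        "lora_alpha": 32,
        "lora_dropout": 0.05,
    },
)
print(training.status)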

Press Start and let’s do this!

Our fine-tune status will say “Preparing,” which means Entry Point is creating a JSONL file to send to Replicate. Then it will change to “Started,” which means Replicate is running the job. At this point, you can watch the logs if you want. Or throw a bag of popcorn in the microwave, but hurry.

In about 4 minutes, our model should be ready.

Evaluating the Model

After our training finishes, Entry Point will run predictions on Replicate with the 12 validation examples you set aside when importing your data. These can take a few minutes.

To view the evaluation results, go to the Fine-tunes tab. Find the fine-tune and click on its Name.

[Screenshot: Clicking the fine-tune name]

This takes you to the fine-tune's detail page, where you can see what template the model was trained on, its hyperparameters, and performance.

[Screenshot: Fine-tune detail page]

Scroll down to the Performance section with a table that shows the model’s completions for our validation examples.

[Screenshot: Performance and evaluation table]

Note: Your validation examples may be different because they are randomly selected.

Any answer that matches exactly automatically gets 5 stars. If the correct answer is Positive but the model responded Negative or vice versa, go ahead and manually give it one star. If the answer is closer, like Neutral vs Positive, or Neutral vs Negative, give it 3 stars. 

[Screenshot: Giving a 3-star evaluation]

Now, we have an overall score for the model. My results are 95.8%.

If we make changes to our template, adjust our hyperparameters, or want to try a different platform or base model, we can run the fine-tune job again and compare the score to see if the new version performs better or worse.

Validation examples are the best way to see if our model works. They’re like the end-to-end tests in software development.

We can also open our fine-tune detail and click the Training link to see the training job on Replicate. From here, we can open the logs and see interesting details like how our training loss decreased over each step and epoch.

[Screenshot: Training loss in the Replicate logs]

Entry Point also passes your validation examples to supported platforms to calculate validation loss during training.

We will publish a more detailed guide for how to interpret training and validation loss in the future.

Of course, we can also experiment in the playground.

Playground

Go to the Playground tab in Entry Point and select the model you want to test. There should only be one, unless you’ve trained extras. 

Set the temperature to 0, because this is a classifier and we don’t want any randomness in the outputs.

Next, enter some text in the prompt, like "This is the worst" and press Generate.

[Screenshot: Classifying negative sentiment in the playground]

Usually, classifiers are blazingly fast, but if it's slow for any reason, check the Replicate status page.

If your model works correctly, it should respond "Negative".
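You can also query the fine-tuned model directly through Replicate's client instead of the playground. Here's a sketch (the model and version are placeholders, the input keys depend on the model's schema, and note that Replicate's API may require a small non-zero temperature):

import replicate

output = replicate.run(
    "your-username/text-sentiment:YOUR_VERSION_ID",  # placeholder model and version
    input={
        # Wrap the text in the same chat syntax and system prompt used for training:
        "prompt": (
            "<s>[INST] <<SYS>>\n"
            "Classify the text sentiment as Positive, Negative, or Neutral.\n"
            "<</SYS>>\n\n"
            "This is the worst[/INST]"
        ),
        "temperature": 0.01,   # effectively deterministic for a classifier
        "max_new_tokens": 5,
    },
)
print("".join(output))  # expect: Negative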

Note: If you set up the OpenAI integration in Entry Point, you can press the Synthesize button to get a quick prompt/completion pair you can use to run through the model.

Now let’s try something a little more tricky. How about "I’m having a bad day, but I think it’s going to turn around really soon!" Your model should be able to understand the nuance of this and respond "Positive," even though it starts with "I’m having a bad day".

[Screenshot: Positive sentiment in the playground]

You can view all your past completions from the Completions tab:

[Screenshot: Completions tab]

Now what?

You can download your model weights to run anywhere by opening the fine-tune in Entry Point, following the link to Training on Replicate, and clicking the Download Weights button.

[Screenshot: Download weights button]

You may also want to share your model by enabling a share link from the fine-tune detail page in Entry Point. This allows you to see how it performs when other people try it out with more diverse and surprising inputs.

I hope you enjoyed fine-tuning on Entry Point, the modern fine-tuning platform that lets you train LLMs across providers.

Since you started with Entry Point, you’re not locked in to any one LLM provider! Next, you could just as easily run your fine-tuning job on Gradient, OpenAI, or AI21 to compare outcomes.

Or export your data and take it anywhere.

Just press the Export button in the sidebar and choose CSV or JSONL:

[Screenshot: Exporting JSONL with Llama-2 chat syntax]

Continue the Journey

If you would like to stay on top of the latest in fine-tuning LLMs, I’d invite you to join our Discord, subscribe to our YouTube channel, or follow Entry Point on LinkedIn.

We also offer a free fine-tuning masterclass that explores many more aspects of training LLMs than we were able to cover in this article.

Happy tuning!