Top 7 Large Language Model Fine-tuning Mistakes to Avoid

By Mark Hennings
July 12th, 2023

Fine-tuning a large language model can seem challenging. The process has a lot of twists and turns, and it's easy to make mistakes.

But don't worry — with a good understanding of these common errors, you can avoid them and make your fine-tuning journey smoother and more successful.

We'll also introduce you to Entry Point, a tool that makes the whole process even easier.

Let's get started on making fine-tuning a breeze.

1. Trying to Teach Specific Facts

When many people hear about fine-tuning and training custom AI, they assume that means they can write some questions with answers about their topic and the AI model will remember those specific facts the next time it’s asked.

Sounds nice, but that’s not exactly how it works.

While fine-tuning does adjust the weights and biases of a large language model (LLM), and can make it more likely to respond with the facts you're trying to teach it, the odds that a single example will override the billions of parameters' worth of information the model has already been trained on are very low.

The key point to understand is this: LLMs don't remember facts; they remember how likely each possible next token (a word or part of a word) is.

That means if you actually want to teach an LLM new information, you need to provide the same fact in many different ways, in response to many different types of prompts. You'd need many variations of the fact in your training data, and you may also have to play with training parameters like the number of epochs and the prompt loss weight. It's possible to do, but not necessarily the best solution.
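For illustration, here is a minimal sketch of what that variation might look like in the OpenAI prompt/completion JSONL format used for fine-tuning. The company and its founding date are made up, and a real dataset would need far more variations, plus the separator and stop-sequence conventions covered in mistakes 2 and 3 below:

```jsonl
{"prompt": "When was Acme Corp founded?", "completion": "Acme Corp was founded in 1987."}
{"prompt": "Tell me about the history of Acme Corp.", "completion": "Acme Corp has been in business since its founding in 1987."}
{"prompt": "How old is Acme Corp?", "completion": "Acme Corp was founded in 1987, making it over 35 years old."}
```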

If you need your model to know a lot of domain-specific information, consider a retrieval approach instead: look up relevant information with semantic search and inject it into the prompt, so the LLM can refer to it and answer the question without having been trained on that information ahead of time.
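Here is a minimal sketch of that retrieval approach, assuming the pre-v1 openai Python library (which reads OPENAI_API_KEY from the environment) and a tiny in-memory corpus. The documents, the question, and the model choices are placeholders, and a production system would use a vector database rather than a Python list:

```python
import numpy as np
import openai  # pre-v1 openai library (openai.Embedding / openai.Completion)

# Hypothetical knowledge base; swap in your own domain documents.
documents = [
    "Acme Corp was founded in 1987 in Portland, Oregon.",
    "Acme Corp's flagship product is the Rocket Skate 3000.",
]

def embed(text):
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

document_vectors = [embed(doc) for doc in documents]

def answer(question):
    query = embed(question)
    # Cosine similarity picks the most relevant document to inject into the prompt.
    scores = [
        np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        for vec in document_vectors
    ]
    context = documents[int(np.argmax(scores))]
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    response = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=100)
    return response["choices"][0]["text"].strip()

print(answer("Where is Acme Corp based?"))
```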

Fine-tuning, on the other hand, is most effective when you leverage the model's existing knowledge of the world and craft your training data to show the pattern, style, and structure of the information you expect to give it and get back in return. LLMs pick up on these cues in your training data very quickly, with as few as 20 examples in many cases.

Think of it like this: a large language model is more like a frontend web app than a database. It can process and interact with information, but you should plan to provide it with any specific data it needs in the prompt.

Even though it’s not a magic way to teach LLMs new information, fine-tuning is useful for a ton of things. 

2. Forgetting the Separator

Is your fine-tuned model repeating your prompt back to you? You probably forgot to include a separator in your training data, in your playground prompt, or both.

A separator is a sequence of characters, like ### or ->, that you append to the end of every prompt and that should not appear anywhere else in your training data. It signals that it's time for the model to write the completion. Without it, the model will try to keep writing your prompt instead, which generally means repeating what you already wrote.
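For example, with \n\n###\n\n as the separator, a training record in the OpenAI prompt/completion JSONL format might look like this (the content is hypothetical; what matters is that every prompt ends with the same separator, both in training and at inference time):

```jsonl
{"prompt": "What color is the sun?\n\n###\n\n", "completion": " Why good sir, the sun is yellow."}
```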

Entry Point handles separators for you so that you never run into this issue by accident.

3. Missing a Stop Sequence

Without a stop sequence, your fine-tuned model is like somebody who just doesn’t know when to stop talking. Sure, you can set a limit on the max number of tokens to output, but that might lead to an abrupt ending.

Make sure to include a stop sequence like \n\n###\n\n at the end of every completion. Otherwise you can get run-on outputs.
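Concretely, that means ending every training completion with the stop sequence and passing the same sequence at inference time. A minimal sketch, assuming the pre-v1 openai Python library; the model name is a placeholder for your fine-tuned model:

```python
import openai  # pre-v1 openai library

# Training completions end with "\n\n###\n\n"; passing the same stop
# sequence at inference time halts generation cleanly at that point.
response = openai.Completion.create(
    model="ft-your-model",  # placeholder fine-tuned model name
    prompt="What color is the sun?\n\n###\n\n",
    stop=["\n\n###\n\n"],
    max_tokens=100,
)
print(response["choices"][0]["text"].strip())
```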

You guessed it — Entry Point handles this for you by default.

4. Testing with a High Temperature

Is your model spitting out nonsense or gibberish? Temperature is a parameter used for completions that tells the model how much risk it should take with its token choices. A higher temperature can mean a more creative output, but the line between creativity and insanity is a fine one.

If your output has no correlation to the examples you trained your model on, check the temperature you're using in the playground and reset it to 0. Then work your way up in increments of 0.1, as in the sketch below.
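In code, that sweep might look like this (a sketch against the pre-v1 openai completions API; the model name and prompt are placeholders):

```python
import openai  # pre-v1 openai library

# Start deterministic at temperature 0, then raise it in 0.1 steps
# to see where outputs stay sensible for your use case.
for temperature in [0, 0.1, 0.2, 0.3]:
    response = openai.Completion.create(
        model="ft-your-model",  # placeholder fine-tuned model name
        prompt="What color is the sun?\n\n###\n\n",
        temperature=temperature,
        stop=["\n\n###\n\n"],
        max_tokens=100,
    )
    print(temperature, response["choices"][0]["text"].strip())
```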

5. Using Too Few Examples

When you're fine-tuning an LLM, it might be tempting to skimp on the number of examples used in your training set. After all, these models are capable of learning from just a few examples, right?

While it's true that an LLM can pick up on patterns fairly quickly, feeding it too few examples can leave your model starving for context. It's like trying to learn a new language by only studying a handful of phrases — you'll have a hard time understanding the nuanced grammar rules, vocabulary variations, or idiomatic expressions.

Here’s a simple example. Let’s say you trained a model on examples that each pair a question prompt with a statement completion, like:

Prompt: “What color is the sun?”
Completion: “Why good sir, the sun is yellow.”

Now, you test it with a statement like, “I want to know what color the sun is.” You may not get the answer you expect, because none of the examples in your dataset included statements, only questions.

If you want your model to respond to questions and statements, you need to have examples with both.
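Extending the sun example, a dataset that covers both phrasings might include records like these (hypothetical, using the separator and stop-sequence conventions from mistakes 2 and 3):

```jsonl
{"prompt": "What color is the sun?\n\n###\n\n", "completion": " Why good sir, the sun is yellow.\n\n###\n\n"}
{"prompt": "I want to know what color the sun is.\n\n###\n\n", "completion": " Why good sir, the sun is yellow.\n\n###\n\n"}
{"prompt": "Tell me the sun's color.\n\n###\n\n", "completion": " Why good sir, the sun is yellow.\n\n###\n\n"}
```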

Don't expect your model to make leaps of logic that it hasn't been trained to make. The fewer examples you use, the less opportunity the model has to understand the full scope of the pattern you're trying to teach it. You need to provide enough data to sufficiently represent the complexity of the problem at hand and handle edge cases.

The sweet spot can take patience to find. Provide too few examples and you get underfitting: the model fails to learn the underlying pattern and performs poorly.

In the end, training an LLM is much like software development: you can launch a product, but you're never truly done. There are always features to add, edge cases to handle, and bugs to fix. In terms of training, this means adding examples or refining your existing dataset.

6. Lack of Variation Between Examples

Now, let's say you have plenty of examples. That's great! But, if your examples are essentially identical to each other, you're falling into another common pitfall — lack of variation between examples.

Imagine trying to learn to paint, but all you're allowed to paint are sunflowers. Sure, you'll become an expert in painting sunflowers, but as soon as someone asks you to paint a landscape or a portrait, you'll be lost. Similarly, if all your examples are nearly the same, your model will struggle when presented with a slightly different prompt.

In order for your model to generalize and respond accurately to a wide range of inputs, it needs to see a wide range of examples. Varying your examples helps the model understand the pattern you're teaching it from every angle.

By having diversity in your examples, you’re teaching your model to be more resilient to new or slightly different prompts. It's a little like a workout routine — the more varied the exercises, the more muscles you work, and the stronger and more adaptable you become.

So, spice up your examples. Challenge your model with different but relevant scenarios. Your model will thank you for it by becoming more reliable and versatile.

7. Not Leveraging the Right Tools

Finally, one of the most common and costly mistakes you can make when fine-tuning a large language model is trying to tackle everything manually. In the age of sophisticated tools and platforms, doing it all by hand is like setting sail across the ocean with a rowboat when you could be cruising on a state-of-the-art yacht.

Take formatting a JSONL file, for instance. It's vital for proper communication with the APIs, but it can be tedious and prone to human error if done manually. Then there's the process of writing prompt/completion templates, keeping your data structured and editable, counting tokens, estimating costs, and handling API calls. Let's not even get started on testing your models, which can feel like shooting in the dark if you're not equipped with the right tools.
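To give a sense of what just one of those chores involves, here is a minimal token-counting sketch using the tiktoken library. The file name is a placeholder, and the price per 1,000 tokens is a made-up figure, so check current rates rather than trusting it:

```python
import json
import tiktoken

encoding = tiktoken.encoding_for_model("davinci")  # fine-tunable base model
PRICE_PER_1K_TOKENS = 0.03  # hypothetical training rate, not a real price quote

total_tokens = 0
with open("training_data.jsonl") as f:  # placeholder file name
    for line in f:
        record = json.loads(line)
        total_tokens += len(encoding.encode(record["prompt"] + record["completion"]))

print(f"{total_tokens} tokens, roughly ${total_tokens / 1000 * PRICE_PER_1K_TOKENS:.2f} per training epoch")
```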

All these tasks require time, precision, and a deep understanding of the intricacies of LLMs. You could spend hours, days, even weeks tangled in these processes, only to end up with a model that's not performing up to your expectations.

But what if there was a way to bypass these hurdles and streamline your fine-tuning process? This is where Entry Point AI comes into the picture.

Entry Point is a comprehensive platform designed to sit as a layer on top of the most popular fine-tuning APIs. It's like your personal AI assistant, taking care of the mundane and complex tasks so you can focus on what matters the most: creating a high-performing, fine-tuned model.

Entry Point handles your JSONL formatting, allows you to write prompt and completion templates in the Handlebars language, and offers a field architecture to keep your data intact and easily editable. It's a vigilant bookkeeper, counting tokens and estimating costs, ensuring you stay within budget. Entry Point smoothly carries out API calls, including uploading the JSONL file and starting the fine-tuning.

And for the crucial task of testing your models, Entry Point provides a playground that makes it easy to evaluate your models. It's like a training field, giving you a safe and accurate environment to understand how your models behave and tweak them for better performance.

By using Entry Point, you can avoid a multitude of common mistakes, saving yourself time, energy, and headaches. It's a wise investment, enhancing your efficiency, accuracy, and ultimately, the performance of your fine-tuned LLMs.

Wrapping It Up

In conclusion, fine-tuning a large language model is both an art and a science. It's full of nuances and complexities that can be overwhelming. But by avoiding common mistakes and leveraging the right tools, like Entry Point, you can navigate the challenges and come out victorious, with a well-tuned model ready to make the most of your data.