How Does a Large Language Model Actually Work?

By Mark Hennings
August 3rd, 2023

The historic rise of ChatGPT, which reached 100 million users just two months after launch, has brought the term “large language model” to the forefront in many startup and business circles.

But what is a large language model and how does it actually work?

Whether you seek to get a job in AI or are simply interested in understanding the new technology that people are talking about, this article is for you. We'll explore the ins and outs of large language models – what they are, how they work, their applications, limitations, and what the future might hold for them. 

Since large language models evolved from the fields of artificial intelligence and machine learning, let’s start there.

What is Artificial Intelligence and Machine Learning?

Artificial intelligence (AI) refers to the broad capability of a machine to imitate “intelligent” human behavior. 

Machine learning (ML), a subset of AI, refers to a suite of algorithms and statistical models that allow computers to learn, make predictions, and adapt to new data over time, without being explicitly programmed to do so. The methodologies within machine learning can range from simple linear regression to complex deep learning models.

Large language models (LLMs) are a type of machine learning model, based on deep learning techniques, that handle human language. They are trained on vast volumes of text to predict the likelihood of a word given previous words in a text, which enables them to generate human-like text, answer questions, summarize texts, and more. 

The advancement of LLMs like GPT-3, GPT-4, and others has significantly pushed forward the field of AI in recent years.

The term “large” in “large language model” not only refers to the amount of data these models are trained on but also the complexity and size of the model in terms of parameters.

What are Parameters?

Let's try to understand parameters with an analogy.

Imagine you're trying to learn a language. Each word, its meaning, the context in which it's used, its relationship with other words, grammatical rules — these are all pieces of information you need to internalize. As you practice the language, your brain gets better at recognizing when and where to apply these rules and when to note exceptions. In this analogy, your brain's ability to apply these rules and exceptions is somewhat similar to the role of parameters in an LLM.

However, unlike human memory, parameters in an LLM are not individual pieces of information or rules. Instead, they are statistical probabilities or “weights” that are adjusted during the training process. They allow the model to recognize complex patterns in the input data, but they do so in a distributed manner, with each parameter contributing to many different patterns rather than being tied to a specific word or rule.

The more parameters a model has, the more capacity it has to recognize complex patterns in the data. For example, GPT-3, developed by OpenAI, has a staggering 175 billion parameters, and GPT-4 is widely reported to have even more — over 1 trillion!

These parameters don't directly store information about language, but by adjusting them during training, the model learns to predict the next word in a sentence based on the previous words. In doing so, it implicitly learns about various aspects of language, from basic rules like which words tend to follow others, to the structure of sentences, to more abstract concepts like tone or style.
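To make the idea of parameters concrete, here is a minimal sketch in Python (using NumPy, with layer sizes invented for illustration) of how the weights and biases of a single network layer add up:

```python
import numpy as np

# One fully connected layer mapping a 512-dimensional input to a
# 1,024-dimensional output. These sizes are made up for illustration;
# layers in a real LLM are far larger, and there are many of them.
weights = np.random.randn(512, 1024)  # 512 * 1024 = 524,288 weights
biases = np.zeros(1024)               # 1,024 biases

print(weights.size + biases.size)     # 525,312 parameters in one layer
```

Stack many layers like this, each much wider, and the totals quickly climb into the billions.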

So how do LLMs use parameters? First, they are adjusted in the training process, then they are used to make predictions. Let’s dive into both.

How Does Training a Large Language Model Work?

When large language models are trained, they are fed enormous amounts of text data to process. This happens in a few steps that are repeated many times.

First, each chunk of training text is divided into smaller units, or tokens, through a process called tokenization. Tokens can represent words, parts of words, or even single characters, depending on the language and the specific model.
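As an example, here is what tokenization looks like using OpenAI's open-source tiktoken library (the exact token boundaries depend on which tokenizer a model uses):

```python
import tiktoken

# Load the tokenizer used by several recent OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Large language models are fascinating!")
print(tokens)              # a list of integer token IDs
print(enc.decode(tokens))  # "Large language models are fascinating!"
```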

These tokens are then converted into mathematical representations known as vectors through a process called embedding. The model learns these embeddings during training, positioning similar words close to each other in a mathematical space.
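A minimal sketch of an embedding lookup, with toy sizes invented for illustration:

```python
import numpy as np

vocab_size, embed_dim = 50_000, 768  # toy sizes for illustration

# The embedding table is itself a learned parameter matrix:
# one row (one vector) per token in the vocabulary.
embedding_table = np.random.randn(vocab_size, embed_dim)

token_ids = [312, 1045, 88]           # output of the tokenizer
vectors = embedding_table[token_ids]  # shape: (3, 768)
```

During training, the rows of this table are adjusted just like any other parameters, which is how words with similar meanings end up near each other in the space.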

The vectors are then passed through a neural network (typically a transformer architecture) to produce a prediction for the next token. The model's parameters, which include the weights and biases within the network, are optimized to minimize the difference between the model's predictions and the actual next tokens in the training data.

This adjustment process is guided by a “loss function,” which calculates the difference between the model's predictions and the actual outcomes, and an optimization algorithm, which iteratively adjusts the parameters to minimize the loss.
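In code, a single training step might look roughly like this PyTorch sketch, where `model` and `batch` are hypothetical stand-ins for a next-token-prediction network and a batch of token IDs:

```python
import torch
import torch.nn.functional as F

# Assumptions: `model` maps token IDs to next-token logits, and
# `batch` is a tensor of token IDs with shape (batch, sequence).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

inputs, targets = batch[:, :-1], batch[:, 1:]  # predict each next token
logits = model(inputs)                         # (batch, seq, vocab_size)

# The loss function: cross-entropy between predictions and actual tokens
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       targets.reshape(-1))

loss.backward()        # compute gradients of the loss for every parameter
optimizer.step()       # nudge parameters in the direction that lowers loss
optimizer.zero_grad()  # reset gradients for the next step
```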

The model continues to tweak its parameters across many loops over the training data, learning to predict the next token in a sequence given the preceding tokens. As it processes a wide variety of text, the model becomes more balanced and capable of handling diverse input data.

Once you have a trained LLM, you can give it a prompt and it will predict the next token, over and over again, until it generates a “completion” or output. This process is called inference.

How Does LLM Inference Work?

When an LLM infers the completion for a prompt, the process begins much like the training process.

Words in the prompt are converted into tokens and then into vector representations. The sequence of vectors is then passed through the model’s neural network, whose layers each contribute to the model’s understanding of the relationships and context among the tokens.

Finally, the model generates a probability distribution over its entire vocabulary for the next token in the sequence. The token with the highest probability is typically chosen as the next token in the sequence, although other strategies such as top-k sampling or temperature scaling can be used to introduce randomness and generate more diverse responses.
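Here is a minimal sketch of temperature scaling and top-k sampling over a model's raw scores (logits), using NumPy; the function name and default values are invented for illustration:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50):
    """Pick the next token ID from raw model scores (logits)."""
    logits = logits / temperature          # <1 sharpens, >1 flattens
    top_ids = np.argsort(logits)[-top_k:]  # keep only the k likeliest tokens
    top_logits = logits[top_ids]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()                   # softmax over the top-k
    return np.random.choice(top_ids, p=probs)
```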

The selected token is then appended to the original input sequence, and the process repeats: the new sequence is again passed through the network, and another token is predicted and appended. This continues until a stop condition is met, such as reaching a maximum length or encountering a specific end token.

Finally, the sequence of predicted tokens is converted (or "detokenized") back into a coherent string of text, which serves as the model's final response.
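Putting the whole inference process together, the loop can be sketched like this, where `model` and `tokenizer` are hypothetical stand-ins:

```python
def generate(model, tokenizer, prompt, max_tokens=100):
    tokens = tokenizer.encode(prompt)          # text -> token IDs
    for _ in range(max_tokens):                # stop condition: max length
        logits = model(tokens)                 # scores for every vocab token
        next_token = int(logits.argmax())      # greedy: pick the likeliest
        if next_token == tokenizer.eos_token:  # stop condition: end token
            break
        tokens.append(next_token)              # append and repeat
    return tokenizer.decode(tokens)            # detokenize back into text
```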

How Does a Large Language Model Become a Chatbot?

While many people might assume that you get a chatbot out of the box from an LLM, that is simply not the case.

When a large language model is first trained on a variety of text data, it doesn’t naturally respond to prompts in a conversational style like ChatGPT. As described in the section above, it simply tries to guess what token should come next.

With a freshly trained LLM, its output might seem like it’s trying to continue your thoughts. For example, if you ask it a question, it may expand on your question with more questions instead of trying to answer it.

It acts like this because although the new LLM has a ton of knowledge, no one has told it what it should do with that knowledge.

In this way, LLMs are like a blank canvas, where “what should come next” depends on the needs of the user. Want a chatbot? Teach it to respond to questions with answers. Want it to generate tags for your content in a comma-delimited list? You can teach it that, instead.

LLMs can be fine-tuned to do very specific, useful things. Fine-tuning means training the model on additional examples—typically far fewer than its original training dataset.
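A fine-tuning dataset is often just a list of prompt/completion pairs. Here is a hypothetical sketch of preparing one in JSONL format, using the tags example above (the file name and examples are invented):

```python
import json

# Hypothetical training examples that teach the model to produce
# comma-delimited tags for a piece of content
examples = [
    {"prompt": "Article: 10 easy weeknight pasta recipes\nTags:",
     "completion": " cooking, pasta, recipes, weeknight meals"},
    {"prompt": "Article: Training for your first marathon\nTags:",
     "completion": " running, fitness, marathon, training"},
]

with open("fine_tune_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```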

ChatGPT itself was fine-tuned to be conversational using a process called Reinforcement Learning from Human Feedback (RLHF), and then fine-tuned further for safety to prevent it from generating text that could be unlawful, unethical, or dangerous.

How Transformer Architecture Makes LLMs Better

Transformer architecture plays a crucial role in improving large language models. Transformers are a machine learning architecture designed for tasks where the order of the data matters, such as sentences in a language. For instance, "The cat chased the dog" tells a different story from "The dog chased the cat," though the words are the same.

Earlier models, like Recurrent Neural Networks (RNNs), processed sentences word by word, which was slow and made it hard to capture relationships between words that were far apart. They often missed important links between words in long sentences.

Transformers, introduced in 2017, changed the game. They can focus on different parts of a sentence at once, decide which words are essential at each step, and can even handle all words in a sentence at the same time. This makes them faster and better at understanding long sentences. 
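At the heart of the transformer is the attention mechanism, which lets the model weigh how relevant each word is to every other word. Here is a minimal NumPy sketch of scaled dot-product attention, with toy sizes invented for illustration:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each position decides how much
    to focus on every other position in the sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ V               # weighted mix of the value vectors

# Toy example: a 5-token sequence with 64-dimensional representations.
# Q, K, V are all set to x here for simplicity; real transformers
# derive them from x through learned projection matrices.
x = np.random.randn(5, 64)
out = attention(x, x, x)
print(out.shape)  # (5, 64)
```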

Essentially, transformers allow computers to interpret and generate human language more effectively and efficiently, since they don't need to process words strictly in sequence. This makes transformers a powerful architecture for tasks such as language modeling.

What Are the Limitations of LLMs?

If you've spent any time using ChatGPT at all, you've probably discovered that it can't count words or characters, will make up false citations for sources, and no matter how hard it tries, can't write a real quote from a famous person—although it may sound like something they're known for saying.

Despite their impressive capabilities, LLMs have several important limitations:

Lack of Understanding and Contextual Awareness: LLMs generate text based on patterns in their training data but do not truly understand the content the way humans do. They lack real-world grounding, consciousness, and beliefs, which can lead to nonsensical or incorrect answers.

Sensitivity to Input: LLM outputs can vary significantly with slight changes in input phrasing or context. Asking the same question with minor differences in wording may yield different responses, raising challenges in ensuring consistency.

Risk of Generating Harmful Content: Without proper control mechanisms, LLMs can generate harmful, offensive, or biased content, as they learn from all aspects of internet text, both good and bad. Fine-tuning an LLM can significantly reduce such risks.

Inability to Ask Clarifying Questions: When faced with ambiguous requests, conversational LLMs often guess the user's intention instead of seeking clarification, which could result in incorrect, irrelevant, or ambiguous outputs. However, ongoing research is working on models that can ask clarifying questions to enhance their responses.

Resource Intensity: Training LLMs and using them for inference requires significant computational resources and energy, which can have environmental implications and limit access to these models.

Researchers are actively exploring ways to mitigate these issues, including fine-tuning models for specific tasks or creating models that can handle ambiguous input more effectively.

How to Fine-tune Your Own LLM

In conclusion, large language models (LLMs) like GPT-3 are powerful AI tools that can understand and generate human-like text. Utilizing billions of parameters and the efficient transformer network architecture, they are capable of performing various language tasks that were once solely the domain of human intelligence.

However, LLMs do face limitations, including their inability to ask clarifying questions and the risk of generating inappropriate content. That's where Entry Point comes in. As a dedicated platform for fine-tuning LLMs, Entry Point offers you the ability to make LLMs more useful and navigate these issues effectively.

Experience the full potential of large language models with Entry Point. Learn how to fine-tune on Entry Point today to unlock new possibilities and take your AI applications to the next level!