import React from 'react';
import '../../styles/subsection.css';
import Header from '../../components/Header';
import Footer from '../../components/Footer';
import { Link } from 'react-router-dom';
import 'katex/dist/katex.min.css';
import { LightAsync as SyntaxHighlighter } from 'react-syntax-highlighter';
import { docco } from 'react-syntax-highlighter/dist/esm/styles/hljs';

function Training() {
    return (
        <div className="subsubsection-container">
            <Header />
            <div className="side-nav-container">
                <aside className="subsubsection-side-nav">
                    <a href="#data">LLM Architecture</a>
                    <a href="#initial">Fine Tuning</a>
                    <a href="#strats">LLM Information</a>
                    <a href="#hyp">Hyperparameter Tuning</a>
                </aside>
            </div>
            
            <main className="subsubsection-content">
                <div className="titles"><h1>LLM Training</h1></div>

                <section id="data" className="code-cleaned">
                <h2>LLM Architecture</h2>
                <p className="subsubsection-paragraph">
                Large language models like ChatGPT use transformers to generate their responses (with reinforcement learning used afterwards to refine them). They use a specific architecture known as the “decoder-only” transformer. Previously,
                 we went over an example of a transformer that followed the encoder-decoder architecture, and I mentioned that you could stop at the end of the encoder step depending on what your
                  purpose is. LLMs are a related scenario: instead of keeping only the encoder, they keep only the decoder.
                </p>

                <p className="subsubsection-paragraph">
                There are a few differences between the encoder-decoder architecture from the original paper, “Attention Is All You Need”, and the architecture behind 
                ChatGPT, mainly in how attention is applied. In the encoder of the transformer we discussed earlier, we applied self-attention: every token looked at 
                all other tokens to get its updated embedding. Decoder-only models use only masked self-attention. In the original architecture, masked attention 
                was limited to the decoder: while generating the output, each position could attend only to the words produced so far (even though, during training, the entire correct
                 output is available). In decoder-only transformers, this approach is applied to the prompt as well as to the generated output.
                </p>

                <p className="subsubsection-paragraph">
                    Let’s work through the steps to understand exactly how a decoder-only transformer would generate a response to a prompt such as “What is your name?”. 

                    <ol>
                        <li><strong>Tokenization: </strong>During this step, the text is split into tokens that are treated separately. In this simplified example, every word and piece of punctuation
                         becomes its own token (real tokenizers usually split text into subword pieces), so “What is your name?” becomes: “what”, “is”, “your”, “name”, “?”</li>
                        <li><strong>Embeddings: </strong>This is very similar to what we have discussed before: we either use a pre-trained model to generate the embedding for each token,
                         or we add a linear (embedding) layer whose weights for each token are trained and act as that token's embedding.</li>
                        <li><strong>Positional Encoding: </strong>During this step, we add a position-dependent vector to each token's embedding. The values come 
                        from a set of functions, one per embedding dimension (the original transformer uses sines and cosines of different frequencies), evaluated at the token's position.
                         Because each position adds a distinct pattern of numbers, the model can keep track of word order.</li>
                        <li><strong>Masked Self-Attention: </strong>Now, we apply attention to each of the tokens. Remember that this is being done in parallel across all of the tokens because no token 
                        requires input from any of the other passes. Let’s define a few things:
                        
                        <ul>
                            <li>Let Q be the query weight matrix (learned through training)</li>
                            <li>Let K be the key weight matrix (learned through training)</li>
                            <li>Let V be the value weight matrix (learned through training)</li>
                        </ul>

                        Let’s say we are at the word “your”. This is what would happen:

                        <ol>
                            <li>First, we multiply the positionally encoded embedding for “your” by Q to get its query vector</li>
                            <li>Then, we multiply the embeddings for the words in the sequence up to and including “your” by K to get their key vectors. This is where masked attention comes into play: we do not 
                                include the embeddings for “name” or “?” in our calculations</li>
                            <li>Then, we compare the query with each key using a dot product (multiply element-wise, then sum); other measures such as the 
                                scaled dot product or cosine similarity can be used as well. This gives us one score for each word in the sequence thus 
                                far, i.e. a = your.what, b = your.is, c = your.your</li>
                            <li>Then, we apply the softmax function to turn a, b, and c into proportions that sum to 1</li>
                            <li>Next, we multiply the embeddings for the same visible words by V to get the value vectors</li>
                            <li>We now take a linear combination of the value vectors weighted by the softmax outputs (the attention weights). This gives us our final output from the attention step</li>
                        </ol>
                        
                        </li>


                        <li><strong>Linear Step: </strong>Remember that we can have many sets of Q, K, and V matrices (attention heads). In this step, we first concatenate the outputs from all of the heads. We then use a linear layer to weight the 
                        concatenated vector and project it onto a space with the same dimensionality as the original embedding. This is important for the next step. The weights in this linear layer are learned during training.</li>
                        <li><strong>Residual Connection: </strong>In this step, we add the original embedding for “your” back to the attention-weighted, projected embedding from the previous
                         step. The reason we do this is to not put too much “pressure” on any particular embedding, i.e. we want the positional embedding to take care of characterizing the
                          word and its position, and we want the output from the attention step to focus on attention.</li>
                        <li><strong>Normalization: </strong>We normalize the embeddings (layer normalization) so that their values stay on a consistent scale from layer to layer.</li>
                        <li><strong>Feed-Forward Layer: </strong>Now, we pass the embedding into a vanilla feed-forward neural network, where it goes through a series of transformations depending on the architecture and 
                        produces an output vector.</li>
                        <li><strong>Repeats: </strong>The attention through feed-forward steps can now be repeated; the number of repeats depends on how complex you want the model to be.</li>
                        <li><strong>Prediction: </strong>The final vector is fed through a linear layer with a softmax activation function, which outputs a probability distribution over the vocabulary. With greedy decoding, the token with the highest probability is chosen as the output for that position.</li>
                    </ol>
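
                    The attention step for “your” can be sketched in a few lines of plain Python. This is a minimal, single-head illustration under stated assumptions: the two-dimensional embeddings and the identity weight matrices are made-up values chosen for readability (not trained weights), the scaled dot product is used as the similarity measure, and the helper names matvec, softmax, and masked_attention are hypothetical.

                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import math

def matvec(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def softmax(scores):
    """Turn raw scores into proportions that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(embeddings, position, W_Q, W_K, W_V):
    """Attention output for one token, looking only at itself and earlier tokens."""
    visible = embeddings[:position + 1]            # the mask: later tokens are dropped
    query = matvec(embeddings[position], W_Q)
    keys = [matvec(e, W_K) for e in visible]
    values = [matvec(e, W_V) for e in visible]
    scale = math.sqrt(len(query))                  # scaled dot product
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    output = [sum(w * value[j] for w, value in zip(weights, values))
              for j in range(len(values[0]))]
    return weights, output

# Toy, made-up embeddings for: what, is, your, name, ?
tokens = ["what", "is", "your", "name", "?"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]               # stand-in for trained weight matrices

weights, output = masked_attention(embeddings, tokens.index("your"),
                                   identity, identity, identity)
print(len(weights))   # 3 weights: one each for what, is, your
`}
                    </SyntaxHighlighter>

                    Slicing the sequence before computing keys and values is one simple way to express the mask for a single token; implementations that process all tokens in parallel typically compute every score and set the masked ones to negative infinity before the softmax.
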


                    Okay, so those are the steps, but we need to note a few things:

                        <ol>
                            <li>This continues until we get to the end of the input sequence. From there, we begin the output with an SOS (start-of-sequence) token, which marks the beginning of the model's response.</li>
                            <li>The SOS token also uses masked attention, but its attention includes the prompt. This is because we want the output to draw on information from the prompt as the model generates its response.</li>
                            <li>These steps are repeated, one token at a time, until an EOS (end-of-sequence) token is output (or we reach the maximum sequence length).</li>
                        </ol>
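
                        The loop described in these notes can be sketched as follows. The function next_token_probs is a hypothetical stand-in for a trained decoder-only model (here just a hand-written lookup table keyed on the last token), and choosing the highest-probability token each time is an assumption made for simplicity.

                        <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`def next_token_probs(sequence):
    """Hypothetical stand-in for a trained decoder-only model.

    A real model would run masked attention and feed-forward layers over the
    whole sequence (prompt included); this toy version only reads the last token.
    """
    table = {
        "<SOS>": {"my": 0.9, "<EOS>": 0.1},
        "my": {"name": 0.8, "<EOS>": 0.2},
        "name": {"is": 0.7, "<EOS>": 0.3},
        "is": {"gpt": 0.6, "<EOS>": 0.4},
        "gpt": {"<EOS>": 1.0},
    }
    return table[sequence[-1]]

def generate(prompt_tokens, max_new_tokens=10):
    """Generate a response one token at a time until EOS or the length limit."""
    sequence = list(prompt_tokens) + ["<SOS>"]    # the response starts after the prompt
    response = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(sequence)
        token = max(probs, key=probs.get)         # pick the most likely token
        if token == "<EOS>":
            break
        sequence.append(token)                    # the new token becomes context
        response.append(token)
    return response

print(generate(["what", "is", "your", "name", "?"]))
`}
                        </SyntaxHighlighter>
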

                        Remember, the reason we use masked attention in the output sequence is that a token should not know about future tokens, since that could influence what the model 
                         says now. At generation time that context does not exist yet, so there is nothing to attend to. This summarizes how the GPT model behind ChatGPT works. There is an 
                         additional component after the initial supervised learning phase: OpenAI uses a technique called Reinforcement Learning from Human Feedback (RLHF) to further refine the model's outputs. 
                         This involves several steps:

                         <ul>
                            <li>The model generates several candidate responses to a variety of prompts.</li>
                            <li>Human AI trainers then rate or rank these responses according to their quality, relevance, and safety.</li>
                            <li>A separate model, often referred to as a "reward model," is trained to predict these human preferences.</li>
                            <li>Finally, reinforcement learning is applied, using the reward model as a guide to adjust the larger model's parameters. This process encourages the model to generate responses 
                            that are more closely aligned with human preferences.</li>
                         </ul>

                         This combination helps generate responses that are more in line with how we speak to each other.
                </p>
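
                <p className="subsubsection-paragraph">
                    To make the reward-model step more concrete, here is a minimal sketch of a pairwise preference loss, one common way to train a reward model from ranked responses. It is an illustration under stated assumptions rather than OpenAI's actual code: the function name pairwise_preference_loss and the example reward values are hypothetical.

                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one, and large when it does not.
    """
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Made-up reward scores for two responses to the same prompt
agrees_with_humans = pairwise_preference_loss(2.0, 0.0)
disagrees_with_humans = pairwise_preference_loss(0.0, 2.0)
print(agrees_with_humans < disagrees_with_humans)
`}
                    </SyntaxHighlighter>
                </p>
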


            </section>

            <section id="initial" className="code-cleaned">
                <h2>Fine Tuning</h2>
                <p className="subsubsection-paragraph">

                Fine-tuning a Large Language Model like BART or GPT involves adjusting the pre-trained model's parameters to perform well on a specific task, such as text classification,
                 question answering, or text generation, with a smaller, task-specific dataset. This process allows the model to transfer the general language understanding it gained during 
                 pre-training to the particular task at hand.

                </p>

                <p className="subsubsection-paragraph">

                    Let's just clarify for completeness: pre-training vs. fine-tuning

                    <ul>
                        <li><strong>Pre-Training: </strong>LLMs like BART and GPT are initially pre-trained on vast amounts of text data. This phase involves learning a general understanding of 
                        language, including syntax, semantics, and common knowledge. For example, GPT models are pre-trained using a language modeling objective, predicting the next word in a sequence 
                        given the previous words. BART is pre-trained as a denoising autoencoder, reconstructing corrupted text.</li>

                        <li><strong>Fine-Tuning: </strong>After pre-training, the model undergoes fine-tuning on a specific dataset related to the target task. This step adjusts the model's
                         weights slightly to adapt its general language capabilities to the nuances of the new task.</li>
                    </ul>

                    The exact process of fine-tuning might go something like this:

                    <ol>

                            <li><strong>Selecting a Pre-trained Model: </strong>Choose a pre-trained model that best fits the nature of your task. For instance, GPT models might be more 
                            suited for generative tasks, while BART's architecture allows it to excel in both generative and discriminative tasks.</li>
                            <li><strong>Preparing the Task-Specific Dataset: </strong>Your dataset should include input-output pairs relevant to your specific task. For a sentiment analysis 
                            task, your dataset would consist of text samples and their corresponding sentiment labels.</li>
                            <li><strong>Adapting the Model for the Task: </strong>This might involve adding a task-specific layer or head to the model. For example, for classification tasks,
                             you would typically add a linear layer on top of the pre-trained model whose outputs are passed through a softmax to give probabilities over the class labels.</li>
                            <li><strong>Fine-Tuning Parameters: </strong>Look up best practices for the model you chose; you will likely need to experiment with the learning rate, batch size, and number of epochs until you find something that works.</li>
                            <li><strong>Training the Model: </strong>The model is then trained (fine-tuned) on the task-specific dataset. The gradients are computed based on the loss between the model's 
                            predictions and the actual task-specific outputs. The model's weights are updated to minimize this loss.</li>
                            <li><strong>Evaluation and Iteration: </strong>Evaluate the fine-tuned model on a separate validation set. Adjust hyperparameters and training strategies based on performance, 
                            and iterate as necessary.</li>

                    </ol>

                </p>

                    
                <p className="subsubsection-paragraph">

                    Here is a commented Python script showing how fine-tuning might work. Treat it as a sketch: it has not been tested here and may need adjusting for your library versions.

                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import torch
from transformers import BartTokenizer, BartForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

# Step 1: Load the dataset
dataset = load_dataset("imdb")
train_dataset = dataset['train']
test_dataset = dataset['test']

# Step 2: Preprocess the dataset
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')

def preprocess_data(examples):
    # Pad every review to the same length so the default collator can batch them
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512)

train_dataset = train_dataset.map(preprocess_data, batched=True)
test_dataset = test_dataset.map(preprocess_data, batched=True)

# Step 3: Load the pre-trained BART model
model = BartForSequenceClassification.from_pretrained('facebook/bart-large', num_labels=2)

# Step 4: Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # Output directory for model checkpoints
    num_train_epochs=3,              # Number of training epochs
    per_device_train_batch_size=8,   # Batch size for training
    per_device_eval_batch_size=8,    # Batch size for evaluation
    warmup_steps=500,                # Number of warmup steps
    weight_decay=0.01,               # Strength of weight decay
    logging_dir='./logs',            # Directory for logs
    logging_steps=10,                # Log every X updates steps
    evaluation_strategy="epoch",     # Evaluate after each epoch (named eval_strategy in newer transformers releases)
    save_strategy="epoch",           # Save the model after each epoch
    load_best_model_at_end=True,     # Load the best model when finished training
)

# Step 5: Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

# Step 6: Fine-tune the model
trainer.train()

# Step 7: Evaluate the model
results = trainer.evaluate()

print("Evaluation results:", results)

# Step 8: Save the fine-tuned model and its tokenizer together
model.save_pretrained('./fine_tuned_bart')
tokenizer.save_pretrained('./fine_tuned_bart')

# You can now use the fine-tuned model for sentiment analysis on new data.
`}
                        </SyntaxHighlighter>

                </p>        




            </section>

            <section id="strats" className="code-cleaned">
                <h2>LLM Information</h2>
                <p className="subsubsection-paragraph">

                    The following is a table containing some information about popular LLMs! <br/><br/>

                    <table className="small_table">
                        <thead>
                            <tr>
                            <th>Name</th>
                            <th>Developer</th>
                            <th>Parameters</th>
                            <th>Release</th>
                            <th>Training Data</th>
                            <th>Features</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                            <td>GPT-3</td>
                            <td>OpenAI</td>
                            <td>175 billion</td>
                            <td>June 2020</td>
                            <td>Common Crawl, WebText2, Books1, Books2, Wikipedia</td>
                            <td>Text generation, translation, few-shot learning</td>
                            </tr>
                            <tr>
                            <td>BERT</td>
                            <td>Google</td>
                            <td>110 million (Base) - 340 million (Large)</td>
                            <td>October 2018</td>
                            <td>Wikipedia, BookCorpus</td>
                            <td>Bidirectional context, masked language modeling</td>
                            </tr>
                            <tr>
                            <td>GPT-2</td>
                            <td>OpenAI</td>
                            <td>1.5 billion</td>
                            <td>February 2019</td>
                            <td>WebText</td>
                            <td>Text generation, zero-shot learning</td>
                            </tr>
                            <tr>
                            <td>T5</td>
                            <td>Google</td>
                            <td>60 million (Small) - 11 billion (11B)</td>
                            <td>October 2019</td>
                            <td>C4 (Colossal Clean Crawled Corpus)</td>
                            <td>Text-to-text approach, multiple NLP tasks</td>
                            </tr>
                            <tr>
                            <td>ELECTRA</td>
                            <td>Google</td>
                            <td>335 million (Large)</td>
                            <td>March 2020</td>
                            <td>Wikipedia, BookCorpus</td>
                            <td>Efficiency in pre-training, replaced token detection</td>
                            </tr>
                            <tr>
                            <td>RoBERTa</td>
                            <td>Facebook AI</td>
                            <td>355 million (Large)</td>
                            <td>July 2019</td>
                            <td>BookCorpus, Wikipedia, CC-News, OpenWebText, Stories</td>
                            <td>Optimized BERT training approach, larger training data</td>
                            </tr>
                            <tr>
                            <td>XLNet</td>
                            <td>Google/CMU</td>
                            <td>340 million (Large)</td>
                            <td>June 2019</td>
                            <td>Wikipedia, BookCorpus, Giga5, ClueWeb, Common Crawl</td>
                            <td>Generalized autoregressive pretraining, permutation-based language modeling</td>
                            </tr>
                            <tr>
                            <td>ALBERT</td>
                            <td>Google</td>
                            <td>12 million (Base) - 235 million (XXLarge)</td>
                            <td>September 2019</td>
                            <td>Wikipedia, BookCorpus</td>
                            <td>Parameter-reduction techniques, cross-layer parameter sharing</td>
                            </tr>
                            <tr>
                            <td>BART</td>
                            <td>Facebook AI</td>
                            <td>139 million (Base) - 406 million (Large)</td>
                            <td>October 2019</td>
                            <td>Books, Wikipedia, Toronto Book Corpus, CC-News</td>
                            <td>Combines bidirectional and auto-regressive transformers, text infilling</td>
                            </tr>
                            <tr>
                            <td>DeBERTa</td>
                            <td>Microsoft</td>
                            <td>48 million (Base) - 1.5 billion (XXLarge)</td>
                            <td>January 2021</td>
                            <td>Wikipedia, BookCorpus, OpenWebText, CC-News</td>
                            <td>Disentangled attention mechanism, enhanced mask decoder</td>
                            </tr>
                        </tbody>
                        </table>


                </p>
            </section>

            <section id="hyp">
                <h2>Hyperparameter Tuning</h2>
                <p className="subsubsection-paragraph">
                    Hyperparameter tuning in LLMs is a critical step to optimize the model's performance. Key hyperparameters include learning rate, batch size, number of training epochs, and 
                    the architecture-specific parameters like the number of layers and attention heads in Transformer models. The tuning process often involves experimentation and iterative 
                    refinement, using techniques like grid search or Bayesian optimization. Effective hyperparameter tuning can lead to significant improvements in model accuracy, training 
                    efficiency, and generalization capabilities.
                </p>
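
                <p className="subsubsection-paragraph">
                    As a rough illustration of grid search, the sketch below tries every combination of two hyperparameters and keeps the one with the lowest validation loss. The function validation_loss is a hypothetical stand-in: in practice it would fine-tune the model with the given settings and measure loss on a held-out validation set. The candidate values in search_space are made up for the example.

                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`from itertools import product

def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for 'fine-tune, then evaluate on a validation set'."""
    return abs(learning_rate - 3e-5) * 1e4 + abs(batch_size - 16) / 16

# Made-up candidate values for illustration
search_space = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "batch_size": [8, 16, 32],
}

def grid_search(space, objective):
    """Evaluate every combination in the space and return the best one."""
    names = list(space)
    best_config, best_score = None, float("inf")
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        score = objective(**config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = grid_search(search_space, validation_loss)
print(best_config)
`}
                    </SyntaxHighlighter>

                    Grid search is simple, but its cost grows multiplicatively with each added hyperparameter, which is one reason methods like Bayesian optimization are used when each training run is expensive.
                </p>
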

                {/* <p className='subsubsection-paragraph'>
                Training LLMs poses several challenges. The most prominent is the computational requirement; training state-of-the-art LLMs often requires extensive computational resources.
                     Another challenge is the potential for overfitting due to the large number of parameters. This risk necessitates careful regularization and validation strategies. Additionally,
                      LLMs can inadvertently learn and amplify biases present in the training data, making the model's fairness and ethical use a significant concern. Ensuring robustness and 
                      generalizability of the model, especially in diverse and real-world settings, is another critical challenge. Addressing these challenges requires a combination of technical
                       solutions, ethical considerations, and ongoing research and experimentation.
                </p> */}
            </section>

                
                
                <div className="subsubsection-navigation">
                    <Link to="/llms/basics">← LLM Basics</Link>
                    <Link to="/llms/distillation">LLM Distillation →</Link>
                </div>
            </main>
            
            <Footer />
        </div>
    );
}

export default Training;
