import React from 'react';
import '../../styles/subsection.css';
import Header from '../../components/Header';
import Footer from '../../components/Footer';
import { Link } from 'react-router-dom';
import 'katex/dist/katex.min.css';
import { InlineMath, BlockMath } from 'react-katex';
import { LightAsync as SyntaxHighlighter } from 'react-syntax-highlighter';
import { docco } from 'react-syntax-highlighter/dist/esm/styles/hljs';

import seq2seq from '../../media/Seq2Seq/seq2seq_arch.png';
import transformer from '../../media/Seq2Seq/transformer_arch.png';


function Seq2Seq() {
    return (
        <div className="subsubsection-container">
            <Header />
            <div className="side-nav-container">
                <aside className="subsubsection-side-nav">
                    <a href="#seq2seq">Seq2Seq</a>
                    <a href="#attention">Attention Mechanisms</a>
                    <a href="#sparse">Transformers</a>
                    <a href="#advattention">Types</a>
                </aside>
            </div>
            
            <main className="subsubsection-content">
                <div className="titles"><h1>Transformers & Attention Mechanisms</h1></div>

                <section id="seq2seq" className="code-cleaned">
                <h2>Seq2Seq Fundamentals</h2>
                <p className="subsubsection-paragraph">
                    Sequence-to-Sequence (Seq2Seq) models are a cornerstone in the field of natural language processing, particularly for tasks that involve generating sequences from other sequences,
                     like machine translation, text summarization, and question answering.
                </p>

                <p className="subsubsection-paragraph">
                <table style={{ width: '100%', borderCollapse: 'collapse', margin: '10px 0' }}>
                    <tbody>
                        <tr>
                            <td style={{ padding: '8px', border: '1px solid #ddd' }}>Use Cases</td>
                            <td style={{ padding: '8px', border: '1px solid #ddd' }}>
                                <span style={{ color: '#333399' }}>Machine translation</span>,
                                <span style={{ color: '#008000' }}> Text summarization</span>,
                                <span style={{ color: '#ff4500' }}> Question answering</span>,
                                <span style={{ color: '#1e90ff' }}> Chatbots</span>
                            </td>
                        </tr>
                        <tr>
                            <td style={{ padding: '8px', border: '1px solid #ddd' }}>Python Libraries</td>
                            <td style={{ padding: '8px', border: '1px solid #ddd' }}>
                                <span style={{ color: '#6a5acd' }}>TensorFlow (tf.keras.Model with the functional API for building Seq2Seq models)</span>,
                                <span style={{ color: '#20b2aa' }}> PyTorch (torch.nn for custom Seq2Seq architectures)</span>
                            </td>
                        </tr>
                        <tr>
                            <td style={{ padding: '8px', border: '1px solid #ddd' }}>O-Complexity (Worst Case)</td>
                            <td style={{ padding: '8px', border: '1px solid #ddd' }}>
                                Depends on the specific architecture (e.g., LSTM, GRU) used for the encoder and decoder; typically <span>O(t*n^2)</span> for each component, where <i>t</i> is the sequence length and <i>n</i> is the number of hidden units
                            </td>
                        </tr>
                        <tr>
                            <td style={{ padding: '8px', border: '1px solid #ddd' }}>Relevant Papers</td>
                            <td style={{ padding: '8px', border: '1px solid #ddd' }}>
                                <span>"Sequence to Sequence Learning with Neural Networks"</span> by Sutskever, Vinyals, and Le, 2014; foundational paper introducing the Seq2Seq framework using deep neural networks
                            </td>
                        </tr>
                    </tbody>
                </table>
            </p>



                <h2>Seq2Seq Foundations</h2>
                <p className="subsubsection-paragraph">
                    The Seq2Seq model comprises two key components: an encoder and a decoder, both of which are typically recurrent neural networks. The encoder processes the input
                     sequence and compresses the information into a context vector, a fixed-size representation that captures the essence of the input. The decoder then uses this context 
                     vector to generate the output sequence. The encoder-decoder architecture is powerful because it allows for variable-length input and output sequences, a common requirement 
                     in language tasks.
                </p>

                <p className="subsubsection-paragraph">
                    Let's work through an example to explain exactly what happens in the encoder-decoder architecture. Imagine we have a phrase like "hockey is awesome" and we 
                    want to translate it into Japanese; machine translation is exactly the kind of task this framework is suited to. Here are the steps:

                    <ol>

                        <li><strong>Input Processing: </strong>The phrase "Hockey is awesome" is input into the encoder. If the encoder is an LSTM, the phrase is processed one word (or token) at a time, 
                        and each word is typically represented as a vector (using techniques like one-hot encoding or learned word embeddings).</li>
                        <li><strong>Initial Encoder State: </strong>Before training, the encoder's weights and biases are initialized (often randomly or using some predefined strategy). The hidden 
                        states are also typically initialized to zeros at the start of processing a new sequence. This step is part of the model setup before any actual training or inference occurs.</li>
                        <li><strong>Context Vector Generation: </strong>The encoder processes the entire input sequence and generates a context vector. In a simple model, this might be the final hidden 
                        state of the encoder. For example, let's just say that the hidden state vector was of size 3 and that there was 1 hidden layer in the LSTM; then, the final hidden state will be a 
                        vector of size 3.</li>
                        <li><strong>Decoder Initialization: </strong>The decoder's first hidden state is initialized with the encoder's context vector. This step bridges the encoder and decoder, providing 
                        the decoder with information about the entire input sequence.</li>
                        <li><strong>Decoding Process Start: </strong>The decoding process begins, often with a special start-of-sequence token like SOS (Start of Sequence)
                             which signifies that we are now going to try to predict the first word of the translated sentence. The decoder then generates the first word of the output sequence
                             by using its initialized states. The LSTM in the decoder produces an output at this time step, which is turned into a word prediction as described in the next step.</li>
                        <li><strong>Decoder's Prediction: </strong>The output from the decoder's LSTM layer is passed through a fully connected layer (or layers), which projects the LSTM output to
                         the size of the output vocabulary. A softmax function is then applied to this projection to generate a probability distribution over all possible output words. The word 
                         with the highest probability is typically chosen as the prediction for the current time step.</li>
                        <li><strong>Teacher Forcing during Training: </strong> In the next time step (<InlineMath math="t = 2" />) during training, the actual correct first word from the target sequence is often fed into the
                         decoder, regardless of the decoder's previous output. This technique is known as "teacher forcing" and is used to speed up training and improve model stability.</li>
                        <li><strong>Sequence Generation: </strong>This process continues, with the decoder generating one word at a time, until an EOS (end of sentence) token is generated or a maximum sequence length is
                             reached. During inference (actual use of the model), the decoder's previous output is used as the next input, without teacher forcing.</li>
                        <li><strong>Training and Backpropagation: </strong>Once the entire output sequence is generated, the model's predictions are compared to the actual target sequence to compute a 
                        loss (e.g., cross-entropy loss). This loss is then backpropagated through the entire model (decoder and encoder), and the weights and biases are updated accordingly via an 
                        optimization algorithm (e.g., SGD, Adam).</li>
                        <li><strong>Iterative Training: </strong>The above process is repeated across multiple epochs (full passes through the training dataset) until the model's performance meets the 
                        desired criteria or no further improvement is observed.</li>

                    </ol>

                    This diagram illustrates the entire process: 

                    

                    <figure className="flex-container-caption">
                        <div className="flex-container"><img src={seq2seq} alt="Seq2Seq encoder-decoder architecture diagram" className="image-medium"/></div>
                        <figcaption>The inputs come in sequentially as per the usual RNN case. The final hidden state vector is then output from the encoder and used to initialize the decoder; <a href="https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346" target="_blank" rel="noopener noreferrer">image source</a>.</figcaption>
                        </figure>

                    <ul>
                    <li><strong>More on EOS: </strong> This is essentially a way for us to tell the model that the sentence has ended. For example, if our input phrase was "hockey is awesome", then 
                    the time steps would be something like:
                    <BlockMath math="t = 1: \text{Hockey}" />
                    <BlockMath math="t = 2: \text{is}" />
                    <BlockMath math="t = 3: \text{awesome}" />
                    <BlockMath math="t = 4: \text{EOS}" />



                    </li>
                        <li><strong>On Multiple Layers: </strong>As you can imagine, we can blow up the complexity of these models. Let's just consider multiple layers first. In this case, 
                        we could have context vectors for each hidden layer of the encoder which would initialize the corresponding hidden layers within the decoder. You have extreme flexibility in 
                        this situation as you can decide where you want these initializations to occur. There is also not a hard rule that the number of layers must be equal between the encoder 
                        and decoder.</li>
                        <li><strong>On Multiple Neurons Per Layer: </strong>When we have multiple neurons per layer in the encoder, we often concatenate their hidden states at the end to form the context vector. 
                        Then, we can initialize however many neurons we want in the decoder with that context vector; for example, concatenating two hidden states of size 3 gives a hidden state with a dimensionality of 6. If we wanted a different hidden state size, 
                        we could connect the encoder's context vector to a simple feed-forward neural network whose output vector has the desired dimensionality. </li>
                    </ul>

                </p>
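                <p className="subsubsection-paragraph">
                    The decoder's prediction step from the walkthrough above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the model's actual implementation: 
                    the hidden state, weight matrix, bias, and four-token vocabulary are hypothetical hand-picked values, and the helper names (softmax, predict_token) are made up for this example.
                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_token(hidden, weights, bias, vocab):
    # Fully connected layer: one logit per vocabulary entry
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    # Softmax turns the logits into a probability distribution
    probs = softmax(logits)
    # Greedy choice: the most probable token at this time step
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs

# Hypothetical values: hidden state of size 3, output vocabulary of 4 tokens
vocab = ["<SOS>", "<EOS>", "token_a", "token_b"]
hidden = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, 0.3],
           [0.0, 0.1, -0.2],
           [1.0, 0.5, 0.5],
           [-0.3, 0.2, 0.1]]
bias = [0.0, 0.0, 0.0, 0.0]

token, probs = predict_token(hidden, weights, bias, vocab)
print(token, [round(p, 3) for p in probs])
`}
                    </SyntaxHighlighter>
                    Picking the most probable token at every step is greedy decoding. During inference the chosen token would be fed back in as the next decoder input, while under teacher forcing the 
                    ground-truth token is fed in instead.
                </p>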

                <h4>Hyperparameters</h4>
                <p className="subsubsection-paragraph">
                <ul>
                    <li>
                    <strong>Number of Layers in Encoder/Decoder:</strong> Determines the depth of the model. More layers can capture more complex patterns but also increase computational complexity.
                    </li>
                    
                    <li>
                    <strong>Number of Neurons per Layer:</strong> Influences the capacity of each layer to process and represent information. More neurons allow for more detailed representations.
                    </li>
                    
                    <li>
                    <strong>Size of Hidden States:</strong> Dictates the dimensionality of the hidden states within LSTM/GRU cells, represented as <InlineMath math="D_h" />. Larger hidden state sizes can 
                    enhance the model's memory but also increase computational requirements.
                    </li>
                    
                    <li>
                    <strong>Input/Output Vocabulary Size:</strong> The size of the vocabulary for the input and output sequences, crucial for defining the model's input and output layers.
                     Represented as <InlineMath math="V_{\text{input}}" /> and <InlineMath math="V_{\text{output}}" />, respectively.
                    </li>
                    
                    <li>
                    <strong>Embedding Dimension:</strong> Size of the word embeddings used to convert tokens to vectors, affecting the representation of words.
                     Denoted by <InlineMath math="D_{\text{embedding}}" />.
                    </li>
                    
                    <li>
                    <strong>Type of RNN Cell:</strong> Options include LSTM, GRU, or basic RNN. Each type has different characteristics that affect memory usage and computational load.
                    </li>
                    
                    <li>
                    <strong>Encoder/Decoder Initialization Method:</strong> The strategy for initializing the weights of the encoder and decoder, which can impact model convergence and performance.
                    </li>
                    
                    <li>
                    <strong>Use of Bidirectional Encoder:</strong> A boolean indicating whether the encoder processes input sequences in both forward and backward directions, enhancing the contextual 
                    understanding of the input sequence.
                    </li>
                    
                    <li>
                    <strong>Dropout Rate:</strong> Applied within RNN layers or between layers to prevent overfitting. Denoted by <InlineMath math="P_{\text{dropout}}" />, it randomly omits a 
                    subset of features during training.
                    </li>
                </ul>
                </p>
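                <p className="subsubsection-paragraph">
                    To get a feel for how these hyperparameters drive model size, here is a rough parameter count for a single-layer LSTM encoder-decoder. It is a sketch under assumptions: the sizes below are 
                    hypothetical, the function names are made up for this example, and the LSTM formula assumes one bias vector per gate (as Keras does; PyTorch keeps two, so its counts come out slightly higher).
                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`def lstm_params(input_dim, hidden_dim):
    # Four gates (input, forget, cell candidate, output), each with
    # input weights, recurrent weights, and one bias vector
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

def dense_params(input_dim, output_dim):
    # Weight matrix plus bias of the output projection layer
    return input_dim * output_dim + output_dim

def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry
    return vocab_size * embedding_dim

# Hypothetical hyperparameter values
v_input, v_output = 5000, 6000   # vocabulary sizes
d_embedding = 128                # embedding dimension
d_h = 256                        # hidden state size

total = (
    embedding_params(v_input, d_embedding)     # encoder embeddings
    + embedding_params(v_output, d_embedding)  # decoder embeddings
    + lstm_params(d_embedding, d_h)            # encoder LSTM
    + lstm_params(d_embedding, d_h)            # decoder LSTM
    + dense_params(d_h, v_output)              # projection to the output vocabulary
)
print(lstm_params(d_embedding, d_h), total)
`}
                    </SyntaxHighlighter>
                    The LSTM term grows quadratically with the hidden state size, in line with the per-step cost given in the table above, while the output projection grows linearly with the output vocabulary size.
                </p>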

                <h4>In Code</h4>
                <p className="subsubsection-paragraph">
                    Implementing a Seq2Seq model in Python can be achieved using deep learning libraries such as TensorFlow. Here's an example that outlines the structure of a basic Seq2Seq 
                    model for a simple translation task:
                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`# To get data
#!wget http://www.manythings.org/anki/fra-eng.zip
#!unzip fra-eng.zip

# Libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# Load the dataset
with open('fra.txt', 'r', encoding='utf-8') as f:
    lines = f.read().split("\\n")

# Extract sentence pairs
input_texts = []
target_texts = []

# Use "\\t" as the start sequence character and "\\n" as the end sequence character
for line in lines[:10000]:  # Using the first 10,000 sentence pairs for simplicity
    input_text, target_text, _ = line.split("\\t")
    target_text = "\\t" + target_text + "\\n"  
    input_texts.append(input_text)
    target_texts.append(target_text)

# Tokenization
tokenizer_in = Tokenizer(char_level=True)
tokenizer_in.fit_on_texts(input_texts)
encoder_input_data = tokenizer_in.texts_to_sequences(input_texts)
encoder_input_data = pad_sequences(encoder_input_data, padding='post')

tokenizer_out = Tokenizer(char_level=True)
tokenizer_out.fit_on_texts(target_texts)
decoder_input_data = tokenizer_out.texts_to_sequences(target_texts)
decoder_input_data = pad_sequences(decoder_input_data, padding='post')
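decoder_target_data = np.roll(decoder_input_data, -1, axis=1)  # Shift decoder input left by one step for target data
decoder_target_data[:, -1] = 0  # np.roll wraps the start token around to the end; reset it to padding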

# Convert to one-hot encoding
encoder_input_data = to_categorical(encoder_input_data)
decoder_input_data = to_categorical(decoder_input_data)
decoder_target_data = to_categorical(decoder_target_data)

# Set model parameters
num_encoder_tokens = encoder_input_data.shape[2]
num_decoder_tokens = decoder_input_data.shape[2]
max_encoder_seq_length = encoder_input_data.shape[1]
max_decoder_seq_length = decoder_input_data.shape[1]

# Split data
encoder_input_train, encoder_input_val, decoder_input_train, decoder_input_val, decoder_target_train, decoder_target_val = train_test_split(encoder_input_data, decoder_input_data, decoder_target_data, test_size=0.2)

# Define the encoder
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(256, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

# Define the decoder
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(256, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the Seq2Seq model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Training
model.fit([encoder_input_train, decoder_input_train], decoder_target_train,
          batch_size=64,
          epochs=30,  # Increase epochs for better results
          validation_data=([encoder_input_val, decoder_input_val], decoder_target_val))

# Define the encoder model for inference
encoder_model = Model(encoder_inputs, encoder_states)

# Define the decoder model for inference
decoder_state_input_h = Input(shape=(256,))
decoder_state_input_c = Input(shape=(256,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

# Create a function to decode sequences
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)

    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, tokenizer_out.word_index['\\t']] = 1.

    decoded_sentence = ''
    # Bound the loop by step count: a padding prediction (index 0) adds no
    # characters, so a length check on the decoded string could loop forever
    for _ in range(max_decoder_seq_length):
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Greedy decoding: take the most probable character at this step
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = tokenizer_out.index_word.get(sampled_token_index, '')

        # Stop once the end-of-sequence character is produced
        if sampled_char == '\\n':
            break
        decoded_sentence += sampled_char

        # Feed the sampled character back in as the next decoder input
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]

    return decoded_sentence

# Predict
for seq_index in range(10):
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)
`}
                        </SyntaxHighlighter>
                    </p>
                </section>

                
                <section id="attention" className="code-cleaned">
                <h2>Attention Mechanisms</h2>
                <p className="subsubsection-paragraph">
                    Attention mechanisms have transformed NLP, offering a more efficient and effective way for models to process and relate different 
                    parts of a sequence. Originally introduced in the context of neural machine translation, attention mechanisms are now ubiquitous in sequence modeling tasks.
                </p>

                <p className="subsubsection-paragraph">
                    <table style={{ width: '100%', borderCollapse: 'collapse', margin: '10px 0' }}>
                        <tbody>
                            <tr>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>Use Cases</td>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>
                                    <span style={{ color: '#333399' }}>Enhancing Seq2Seq models</span>,
                                    <span style={{ color: '#008000' }}> Machine translation</span>,
                                    <span style={{ color: '#ff4500' }}> Document summarization</span>,
                                    <span style={{ color: '#1e90ff' }}> Speech recognition</span>
                                </td>
                            </tr>
                            <tr>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>Python Libraries</td>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>
                                    <span style={{ color: '#6a5acd' }}>TensorFlow (tf.keras.layers.Attention)</span>,
                                    <span style={{ color: '#20b2aa' }}> PyTorch (torch.nn.MultiheadAttention for multi-head attention)</span>
                                </td>
                            </tr>
                            <tr>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>O-Complexity (Worst Case)</td>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>
                                    Typically <span>O(n^2*d)</span>, where <i>n</i> is the sequence length and <i>d</i> is the dimensionality of the model; complexity can vary based on the type of attention and the implementation
                                </td>
                            </tr>
                            <tr>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>Relevant Papers</td>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>
                                    <span>"Neural Machine Translation by Jointly Learning to Align and Translate"</span> by Bahdanau, Cho, and Bengio, 2014; introduced the attention mechanism in the context of neural machine translation
                                </td>
                            </tr>
                        </tbody>
                    </table>
                </p>


                <h4>Attention Mechanics in Encoder-Decoder Networks</h4>
                <p className="subsubsection-paragraph">
                One issue with encoder-decoder models is that they are at the mercy of any shortcomings of the model components used in the architecture. For example, if we used LSTMs,
                 even though they are better than vanilla RNNs, they can still have trouble remembering the relevance or importance of some words in longer sequences. We would ideally like to add
                  some other mechanism that tells the decoder which words it should focus on as it generates outputs. In the vanilla encoder-decoder architecture, we squeeze all of the 
                  information contained in the input sequence into the final hidden state (which is used to initialize the decoder) but, as you can imagine, this can be lossy.
                </p>

                <p className="subsubsection-paragraph">
                Instead, we should look at what other information is available in our encoder that we could use to help guide the decoder. This can be found in the hidden states that exist 
                at each timestep of the encoder. For example, if we had the phrase "hockey is awesome", we would have a hidden state associated with each of the 4 steps in the sequence 
                (the three words plus EOS). At each of its time steps, the decoder compares its current hidden state against these encoder hidden states and weights them accordingly, so that each 
                output word (from the target dictionary) can focus on the most relevant input words. Let's work out an example for "Hockey is awesome":

                <ol>
                    
                    <li>We run through the encoder generating hidden states for each of "hockey", "is", "awesome", EOS and get a final hidden state.</li>
                    <li>We initialize the decoder's hidden state with this final output. </li>
                    <li>We then treat the decoder's initialized state as the state representing SOS and compare it with all of the intermediate hidden states in the encoder.
                    For example, the initialized hidden state might be [2, 2] for the decoder and say the hidden states for the words above are hockey = [5, 5], is = [1, 1], awesome = [1, 2], 
                    EOS = [2, 2]... we can do some kind of comparison between each of the above with the hidden initialized state for the decoder.
                    </li>
                    <li>Here, there are a number of ways to go, but for simplicity we can use the dot product (other choices include cosine similarity, a small neural network, etc.):
                    <BlockMath math="\text{Hockey} = 2 \times 5 + 2 \times 5 = 20" />  
                    <BlockMath math="\text{is} = 2 \times 1 + 2 \times 1 = 4" />  
                    <BlockMath math="\text{awesome} = 2 \times 1 + 2 \times 2 = 6" />  
                    <BlockMath math="\text{EOS} = 2 \times 2 + 2 \times 2 = 8" />  
                    </li>
                    <li>We pass these scores through a softmax function to turn them into weights that sum to 1. Say the weights are:

                    <BlockMath math="\text{Hockey} = 0.5" />  
                    <BlockMath math="\text{is} = 0.1" />  
                    <BlockMath math="\text{awesome} = 0.15" />  
                    <BlockMath math="\text{EOS} = 0.25" />  

                    These are illustrative values rather than the true softmax output; with scores this far apart, the actual softmax would put almost all of the weight on "Hockey".

                    </li>
                    <li>We then get a new weighted vector, which will be used as part of the input vector into the LSTM for SOS; the calculation for the vector would be:
                        <BlockMath math ="0.5\times[5, 5] + 0.1\times[1, 1] + 0.15\times[1, 2] + 0.25\times[2, 2] = [3.25, 3.4]" />
                        and the resulting input vector becomes:
                        <BlockMath math ="\text{Input}_{t = 1} = \{\text{<SOS> Embedding Vector}; 3.25, 3.4\}" />
                    </li>
                    <li>Now, we pass on this input vector as the input into the forward pass in the decoder and get an output!</li>
                    <li>This process is repeated at every decoder time step: the attention calculation is performed first, and the resulting attention vector is concatenated onto the input vector.</li>

                </ol>

                </p>
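                <p className="subsubsection-paragraph">
                    The dot-product attention steps above can be reproduced in a few lines of plain Python. This is a minimal sketch using the toy hidden states from the example; the helper names 
                    (softmax, dot, attend) are made up for this illustration. Unlike the rounded weights in the walkthrough, it computes the true softmax.
                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys):
    # 1. Score each encoder hidden state against the decoder state
    scores = [dot(query, k) for k in keys]
    # 2. Normalize the scores into weights that sum to 1
    weights = softmax(scores)
    # 3. Context vector: weighted sum of the encoder hidden states
    context = [sum(w * k[i] for w, k in zip(weights, keys))
               for i in range(len(query))]
    return scores, weights, context

decoder_state = [2, 2]
encoder_states = [[5, 5], [1, 1], [1, 2], [2, 2]]  # hockey, is, awesome, EOS

scores, weights, context = attend(decoder_state, encoder_states)
print(scores)  # [20, 4, 6, 8]
print([round(w, 6) for w in weights])
print([round(c, 4) for c in context])
`}
                    </SyntaxHighlighter>
                    Because the score for "hockey" is so much larger than the others, the true softmax assigns it nearly all of the weight, and the context vector ends up very close to [5, 5]. 
                    This sensitivity to large scores is one motivation for the scaled dot product variant.
                </p>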

                <p className="subsubsection-paragraph">
                    Various types of attention have been proposed, each with unique characteristics. Key types include:
                    <ul>
                        <li><strong>Additive or Bahdanau Attention:</strong> It uses a feed-forward network to compute the alignment score.</li>
                        <li><strong>Multiplicative or Luong Attention:</strong> It simplifies the scoring function to a dot product between states (what I used in the example above).</li>
                        <li><strong>Self-Attention:</strong> It allows inputs to interact with each other ('self') and is a key component in Transformer models -- more on this soon.</li>
                    </ul>
                    Each type of attention provides different computational advantages and model characteristics, influencing how the model captures dependencies in the data. Now, this is actually a simpler 
                    form of attention and a more general form will be discussed in the next section as we head into transformers but hopefully, this makes sense!
                </p>

                <h4>Hyperparameters</h4>
                <p className="subsubsection-paragraph">
                <ul>
                    <li>
                    <strong>Attention Score Function:</strong> The function used to compute the alignment scores between encoder and decoder states. Common choices include "dot product", "scaled dot product", 
                    and "additive" or "concatenation" based.
                    </li>
                    <li>
                    <strong>Attention Size:</strong> For additive attention, the size of the hidden layer used to compute the attention scores is a tunable hyperparameter.
                    </li>
                    </ul>
                </p>
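                <p className="subsubsection-paragraph">
                    To make the score function choices concrete, here is a small plain-Python comparison of dot product, scaled dot product, and additive scoring for a single decoder/encoder state pair. 
                    It is a sketch under assumptions: the weight matrices and vector v are hypothetical hand-picked values (in a real model they are learned), the function names are made up for this example, 
                    and the additive form shown is the common v . tanh(W_q q + W_k k) formulation, with the number of rows in the matrices playing the role of the attention size.
                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import math

def dot_score(query, key):
    # Multiplicative (dot product) scoring
    return sum(q * k for q, k in zip(query, key))

def scaled_dot_score(query, key):
    # Same as the dot product, divided by sqrt(dimension)
    return dot_score(query, key) / math.sqrt(len(query))

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def additive_score(query, key, w_query, w_key, v):
    # Project both states into a shared hidden layer, apply tanh,
    # then reduce to a scalar with the vector v
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(w_query, query), matvec(w_key, key))]
    return sum(vi * hi for vi, hi in zip(v, hidden))

query, key = [2.0, 2.0], [1.0, 2.0]  # decoder state, encoder state for "awesome"

# Hypothetical parameters with attention size 3 (three rows)
w_query = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
w_key = [[0.2, 0.0], [0.0, 0.2], [0.1, -0.1]]
v = [1.0, 1.0, 1.0]

print(dot_score(query, key))
print(round(scaled_dot_score(query, key), 4))
print(round(additive_score(query, key, w_query, w_key, v), 4))
`}
                    </SyntaxHighlighter>
                    The dot product variants have no parameters of their own but require the two states to have the same dimensionality; the additive variant adds learned parameters and can 
                    compare states of different sizes.
                </p>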

                <h4>In Code</h4>
                <p className="subsubsection-paragraph">
                    Here's a Python example using TensorFlow to implement an attention mechanism in a neural machine translation task:
                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense, Concatenate, Attention
from tensorflow.keras.models import Model

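# Illustrative sizes (assumed here); in practice they come from the data and model design
num_encoder_tokens = 70
num_decoder_tokens = 90
latent_dim = 256

# Sample encoder-decoder model with attention
input_seq = Input(shape=(None, num_encoder_tokens))
# return_sequences=True exposes the hidden state at every encoder time step,
# which the attention layer needs as its value input
encoder = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_outputs, state_h, state_c = encoder(input_seq)
encoder_states = [state_h, state_c]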

decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
attention = Attention()
attention_out = attention([decoder_outputs, encoder_outputs])
decoder_concat = Concatenate(axis=-1)([decoder_outputs, attention_out])
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_concat)

model = Model([input_seq, decoder_inputs], decoder_outputs)

# Model compilation, training, and evaluation code would follow`}
                        </SyntaxHighlighter>
                    </p>
                </section>



                <section id="sparse" className="code-cleaned">
                <h2>Transformers</h2>
                <p className="subsubsection-paragraph">
                    Transformers represent a major advancement in neural network architectures; they significantly improved performance by combining attention blocks 
                    with vanilla feed-forward neural networks, and they bypassed the problems that come with RNN units. They are the backbone of modern large 
                    language models and I will go into a lot of detail about them here (and also, more specifically for LLMs, in a later section).
                </p>

                <p className="subsubsection-paragraph">
                    <table style={{ width: '100%', borderCollapse: 'collapse', margin: '10px 0' }}>
                        <tbody>
                            <tr>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>Use Cases</td>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>
                                    <span style={{ color: '#333399' }}>Language understanding</span>,
                                    <span style={{ color: '#008000' }}> Text generation</span>,
                                    <span style={{ color: '#ff4500' }}> Machine translation</span>,
                                    <span style={{ color: '#1e90ff' }}> Text summarization</span>
                                </td>
                            </tr>
                            <tr>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>Python Libraries</td>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>
                                    <span style={{ color: '#6a5acd' }}>Transformers (Hugging Face's library)</span>,
                                    <span style={{ color: '#20b2aa' }}> TensorFlow (tf.keras.layers.MultiHeadAttention)</span>,
                                    <span style={{ color: '#ff6347' }}> PyTorch (torch.nn.Transformer)</span>
                                </td>
                            </tr>
                            <tr>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>O-Complexity (Worst Case)</td>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>
                                    Typically <span>O(n^2*d)</span>, where <i>n</i> is the sequence length and <i>d</i> is the dimensionality of the model; due to the self-attention mechanism
                                </td>
                            </tr>
                            <tr>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>Relevant Papers</td>
                                <td style={{ padding: '8px', border: '1px solid #ddd' }}>
                                    <span>"Attention is All You Need"</span> by Vaswani et al., 2017; introduced the Transformer model, which has since become foundational in many state-of-the-art NLP systems
                                </td>
                            </tr>
                        </tbody>
                    </table>
                </p>


                <h4>Revisiting Attention</h4>
                <p className="subsubsection-paragraph">
                    Previously, we discussed attention in the context of encoder-decoder architectures with LSTM units; however, we didn't discuss the essence of what attention is doing. Before 
                    jumping into transformers, I want to spend some time on attention, since it is an extremely important part of the transformer architecture.
                </p>

                <p className="subsubsection-paragraph">
                The general idea with attention is that when we see a word in a sentence that is ambiguous by itself (for example, in the sentence "that slapshot was sick", the word "sick" 
                on its own doesn't tell us whether someone is ill or whether something was awesome), we can look at the
                 rest of the sentence to figure out which meaning of "sick" is intended. The way we do this is through linear transformations on the embeddings of the words we are comparing. 
                </p>

                <p className="subsubsection-paragraph">
                Let's work with a simple example to try and shed light on this part thus far. In the sentence above, we can have the following embeddings:
                <BlockMath math="\text{that - }[1, 2, 3]" />
                <BlockMath math="\text{slapshot - }[5, 5, 4]" />
                <BlockMath math="\text{was - }[1, 1, 1]" />
                <BlockMath math="\text{sick - }[4, 5, 4]" />
                    
                    and say that we had an embedding for coughing which was <InlineMath math="[4, 4, 4]" />. Now, say we want to find out the similarity between slapshot and sick; we could use
                     something like the dot product to determine this. This gives us some measure of the relationship between the words, so in the case of sick and slapshot, the value would 
                     be: 4 x 5 + 5 x 5 + 4 x 4 = 61, which is a high number indicating that they are similar, and between sick and coughing, the similarity score would be: 4 x 4 + 5 x 4 + 4 x 4 = 52, which 
                     is also a very high number indicating some relationship between them as well! We could build this similarity matrix for every word in the dictionary; as an aside, this matrix will 
                     have a trace equal to the number of words in the dictionary when the similarity function is cosine similarity, since every word has a similarity of 1 with itself. Now, within a sentence, we can 
                     write each word as a linear combination of all the words that appear in that sentence, using the similarity scores as coefficients. For example, and I am just making up numbers here, but for the word sick, the linear combination could be something like:

                     <BlockMath math="5\times\text{that} + 61\times\text{slapshot} + 8\times\text{was} + 40\times\text{sick}" />

                     Essentially, we made each word a combination of every word in the sentence; similarly, if we had a sentence like “she’s coughing and is sick”, we could get a combination like:

                     <BlockMath math = "4\times\text{she’s} + 52\times\text{coughing} + 2\times\text{and} + 3\times\text{is} + 40\times\text{sick}" />

                     Next, we need to normalize since we have massive coefficients; we can normalize through the softmax function so that the updated weights may look something like: 

                     <BlockMath math = "\text{sentence1}_{sick} = 0.1\times\text{that} + 0.5\times\text{slapshot} + 0.15\times\text{was} + 0.25\times\text{sick}" />
                     <BlockMath math = "\text{sentence2}_{sick} = 0.1\times\text{she’s} + 0.4\times\text{coughing} + 0.1\times\text{and} + 0.1\times\text{is} + 0.3\times\text{sick}" />

                     Of course, they add to 1; now, we can look at the coefficients as proportions. For example, in the first sentence, we can think of that 0.5 as a 50% movement in the direction
                      of slapshot for the word sick when it comes to the word embeddings. This will give us new co-ordinates for the word sick which is a function of all the other words in the sentence. 
                      We will use these co-ordinates when we are working within those sentences. 

                    </p>
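                    <p className="subsubsection-paragraph">
                    To make the arithmetic above concrete, here is a minimal NumPy sketch (an illustration only, not how any particular library implements attention) that computes the dot-product scores for "sick" against every word in "that slapshot was sick" and normalizes them with softmax. The variable names (E, scores, weights) and the softmax helper are my own, made up for this example.
                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import numpy as np

# Toy embeddings copied from the example above
words = ["that", "slapshot", "was", "sick"]
E = np.array([
    [1, 2, 3],   # that
    [5, 5, 4],   # slapshot
    [1, 1, 1],   # was
    [4, 5, 4],   # sick
], dtype=float)

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

sick = E[words.index("sick")]

# Raw dot-product similarity between "sick" and every word (itself included)
scores = E @ sick

# Softmax turns the scores into positive weights that sum to 1
weights = softmax(scores)

# New, context-aware embedding for "sick": a weighted average of all rows
sick_in_context = weights @ E

print(dict(zip(words, scores)))
print(weights.round(3), weights.sum())
print(sick_in_context.round(3))`}
                    </SyntaxHighlighter>
                    With raw dot products the softmax comes out far more peaked than the made-up proportions above, because exponentiation exaggerates large gaps between scores.
                    </p>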


                    <p className="subsubsection-paragraph">
                    Next, we need to introduce the concept of keys (<InlineMath math="K" />), queries (<InlineMath math="Q" />), and values (<InlineMath math="V" />). These are matrices. Let’s focus first on
                     keys and queries. One issue we might have with just applying the weights
                     above directly is that if sick is much closer to coughing, then even if we shift sick closer to slapshot, it might still be too close to coughing for it to matter. What we want
                      to do is transform the embeddings so that this separation can occur, giving us more of a guarantee that the contextual information within a particular token sequence
                       (or sentence) will override the issue. We can do this by using keys and queries. These matrices modify the embeddings of the two words we are comparing (the query matrix for the word doing the looking, the key matrix for the word being looked at) such that 
                       they are more representative of the current context. We then compute the similarity above on the transformed embeddings, because they’ll provide us with a better separation 
                       if needed. 
                    </p>

                    <p className="subsubsection-paragraph">
                    Now, we can bring in the values matrix. The above transformed embeddings are only useful for similarities, as that transformation may not preserve other structural relationships 
                    between the words and is better for looking at similar features. We want another set of embeddings that are better for finding the next word in a sequence and aren’t concerned 
                    with capturing features. Essentially, we have a context embedding and a feature embedding; we use the latter (from the keys and queries) to figure out the attention weights and the former to continue on in 
                    the model path. We get the context embeddings by multiplying the original embeddings by the value matrix. I will show the exact calculations later when we work through an example. 
                    </p>
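                    <p className="subsubsection-paragraph">
                    As a rough sketch of how the three matrices fit together, the snippet below applies randomly initialised stand-ins for the learned query, key, and value matrices to the toy embeddings. The names W_q, W_k, and W_v and the sizes d_k and d_v are assumptions chosen for illustration; in a real model these weights come out of training.
                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings from the running example: that, slapshot, was, sick
E = np.array([[1, 2, 3], [5, 5, 4], [1, 1, 1], [4, 5, 4]], dtype=float)

d_model, d_k, d_v = 3, 3, 2   # d_v differs on purpose: values may live in another space

# Random stand-ins for the learned query, key and value matrices
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

Q = E @ W_q   # what each word is looking for
K = E @ W_k   # what each word offers for comparison
V = E @ W_v   # what each word contributes once it is attended to

def softmax_rows(x):
    # Row-wise softmax with the usual max-subtraction for stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Similarities are measured in the query/key space...
weights = softmax_rows(Q @ K.T)     # shape (4, 4), each row sums to 1

# ...but the vectors that get mixed together are the value vectors
context = weights @ V               # shape (4, 2)

print(weights.round(3))
print(context.round(3))`}
                    </SyntaxHighlighter>
                    Note that the weights are computed only from Q and K, while the output is built only from V; this is the split between the feature embedding and the context embedding described above.
                    </p>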

                    <p className="subsubsection-paragraph">
                    We don’t limit ourselves to just one set of Q, K, and V, however; we use many, and it’ll be the job of the transformer to learn how best to use them for 
                    the task at hand, such as predicting the next word in a sentence (you’ll learn more soon). All of the values in Q, K, and V are learned by the transformer through the training process,
                     but you can think of these “many” matrices as almost like separate neurons (it's not really like this, but it helps to visualize it a bit). This is called multi-head attention. Note that there
                      needs to be the same number of Q, K, and V matrices.
                        </p>

                        <p className="subsubsection-paragraph">
                        So, to summarize: 

                        <ol>
                            <li>We want words in a sentence to take into account the other words within the sentence as contextual information</li>
                            <li>Every word has embeddings and we can shift these embeddings based on the information we can pull from within any particular document (such as a sentence)</li>
                            <li>One way to figure this out is to figure out some kind of similarity score between all of the words within a sentence (including with itself: self-attention)</li>
                            <li>We can figure out these scores by applying some similarity function (e.g. dot product, cosine similarity, etc.)</li>
                            <li>While we can apply this function directly, we first want to figure out another space in which the similarity function will provide us with a better
                                separation from other words in the dictionary. This is done via key and query matrices which transform the word in question and the words it is being compared to 
                                (within the sentence)
                            </li>
                            <li>We then calculate the attention scores through the similarity/softmax approach on the transformed embeddings and apply these scores to the token (word) we are considering.</li>
                            <li>Even though the key-query transformed embeddings give us good attention weights, they might not be in the best space for the task at hand (e.g. machine translation), so we 
                                apply a separate transformation to the original embeddings to get the vectors that are actually combined. This is done through the values matrix. 
                            </li>
                            <li>In the end, we have attention weighted embeddings.</li>
                        </ol>

                        As a last note, the original transformer paper used something called scaled dot-product attention, which is a fundamental component of the Transformer model.
                         It addresses the issue of the dot product growing large in 
                    magnitude by scaling it down, leading to more stable gradients. This type of attention calculates the dot product of the query with all keys, divides it by the square root of the dimension 
                    of the keys, and applies a softmax to obtain the weights on the values:
                    <BlockMath math="\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V" />
                    where <InlineMath math="Q" />, <InlineMath math="K" />, and <InlineMath math="V" /> are the query, key, and value matrices, respectively, and <InlineMath math="d_k" /> is 
                    the dimension of the keys. This scaling factor keeps large dot products from pushing the softmax into regions where its gradients are extremely small.
        
                    </p>
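                    <p className="subsubsection-paragraph">
                    The formula above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration on random inputs, not the implementation used by any library; the function name scaled_dot_product_attention and the choice of 64 for the key dimension are assumptions for the example. It also compares the scaled weights with unscaled ones to show the effect of dividing by the square root of the key dimension.
                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays of shape (tokens, dim)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_k = 64
Q = rng.normal(size=(4, d_k))
K = rng.normal(size=(4, d_k))
V = rng.normal(size=(4, d_k))

out, w_scaled = scaled_dot_product_attention(Q, K, V)

# Same scores without the 1/sqrt(d_k) factor, for comparison
raw = Q @ K.T
raw = raw - raw.max(axis=-1, keepdims=True)
w_unscaled = np.exp(raw) / np.exp(raw).sum(axis=-1, keepdims=True)

# Largest weight in each row: closer to 1 means a more saturated softmax
print(w_scaled.max(axis=-1).round(3))
print(w_unscaled.max(axis=-1).round(3))`}
                    </SyntaxHighlighter>
                    Dividing the scores by a constant larger than 1 can only flatten the softmax, so the unscaled rows are always at least as peaked as the scaled ones.
                    </p>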




                <h4>Transformers Foundations</h4>
                <p className="subsubsection-paragraph">
                    We will now talk about transformers, and what I will do is walk through an entire first pass (and then give some information on a second pass) using the example we have been working with in the 
                    previous sub-section. 

                    

                    <figure className="flex-container-caption">
                        <div className="flex-container"><img src={transformer} alt="Broken" className="image-medium"/></div>
                        <figcaption>The original architecture proposed in "Attention is all you need"; <a href="https://arxiv.org/pdf/1706.03762.pdf" target="_blank" rel="noopener noreferrer">image source</a>.</figcaption>
                        </figure>

                    <ol>

                        <li><strong>Tokenization: </strong> The input sentence is "that slapshot was sick" so once we tokenize, it would look something like: ["that", "slapshot", "was", "sick"] <br/></li>

                        <li><strong>Embeddings: </strong>Assume we have a simple embedding table where each word is mapped to a 3-dimensional vector; in other cases, we could either get embeddings 
                        from a pre-trained model (such as a word2vec model) or learn the embeddings as part of the transformer's training process. In this example, let's go with the following: 
                            <BlockMath math ="\text{that:} [1, 2, 3]" />
                            <BlockMath math ="\text{slapshot:} [4, 5, 6]" />
                            <BlockMath math ="\text{was:}  [7, 8, 9]" />
                            <BlockMath math ="\text{sick:} [2, 4, 6]" />

                        </li>

                        <li><strong>Positional Encoding: </strong>Positional encodings are added to give the model information about the position of each word. For 
                        simplicity, let's say we add a small vector that increases with the position:
                            <ul>
                                <li>Position 1 (that): [+0, +0, +0]</li>
                                <li>Position 2 (slapshot): [+1, +1, +1]</li>
                                <li>Position 3 (was): [+2, +2, +2]</li>
                                <li>Position 4 (sick): [+3, +3, +3]</li>
                            </ul>

                            More generally, since 
                            the self-attention mechanism does not inherently capture sequence order, positional encodings are added to the input embeddings at the bottom of the encoder and 
                            decoder stacks to give the model information about the relative or absolute position of the tokens in the sequence. These encodings can be either learned or fixed, with one popular choice being sinusoidal functions:
                                <BlockMath math="PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})" />
                                <BlockMath math="PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})" />
                                where <InlineMath math="pos" /> is the position, <InlineMath math="i" /> indexes pairs of embedding dimensions, and <InlineMath math="d_{\text{model}}" /> is the dimension of the model. 
                                This allows the model to use the relative position of tokens when processing a sequence. Using the simplified encodings from the list above (not the sinusoidal ones), the embeddings will now look like:

                            <ul>
                                <li>that: [1+0, 2+0, 3+0] = [1, 2, 3]</li>
                                <li>slapshot: [4+1, 5+1, 6+1] = [5, 6, 7]</li>
                                <li>was: [7+2, 8+2, 9+2] = [9, 10, 11]</li>
                                <li>sick: [2+3, 4+3, 6+3] = [5, 7, 9]</li>
                            </ul>

                            <br/>

                        </li>

                        <li><strong>Self-Attention in the Encoder: </strong>For simplicity, let's compute attention from the perspective of just the word "sick". 
                        Compute the dot product of the "query" for "sick" with the "key" of every word (including itself) to determine the attention scores. In a 
                        simplified model, the "query" (say Q1) and "key" (say K1) might just be the embeddings themselves (i.e. the identity transformation). Here's an example of those calculations:

                        <BlockMath math = "\text{sick}_{that} = 5 \times 1 + 7 \times 2 + 9 \times 3 = 46" />
                        <BlockMath math = "\text{sick}_{slapshot} = 5 \times 5 + 7 \times 6 + 9 \times 7 = 130" />
                        <BlockMath math = "\text{sick}_{was} = 5 \times 9 + 7 \times 10 + 9 \times 11 = 214" />
                        <BlockMath math = "\text{sick}_{sick} = 5 \times 5 + 7 \times 7 + 9 \times 9 = 155" />
                        
                        
                        If the query and key matrices were not the identity, the embeddings would be transformed before taking these dot products. Let's assume that after softmax,
                         the weights (just as an example) for "that", "slapshot", "was", and "sick" relative to "sick" are 0.1, 0.2, 0.6, and 0.1, respectively. The new 
                         embedding for "sick" is a weighted sum of all embeddings, based on these attention weights:
                         
                         <BlockMath math="\text{sick}_{new} = 0.1 \times [1, 2, 3] + 0.2 \times [5, 6, 7] + 0.6 \times [9, 10, 11] + 0.1 \times [5, 7, 9] = [7.0, 8.1, 9.2]" />
                         
                         Notice that the vectors being weighted here look the same as the positionally encoded embeddings; strictly speaking, they are those embeddings multiplied by the value matrix V1 mentioned 
                         earlier. In this case, that transformation was the identity for simplicity; however, you could use any linear transformation, even 
                         one that changes the dimensionality of the embeddings. 

                         To summarize:

                         <ol>
                            <li><strong>Transform "sick" with Q1: </strong>Multiply the embeddings of "sick" by the Q1 matrix to get the query vector for "sick". This is Q1_sick.</li>
                            <li><strong>Transform all words with K1: </strong>Multiply the embeddings of each word in the sentence ("that", "slapshot", "was", "sick") by the K1 matrix 
                            to get their key vectors. This gives you K1_that, K1_slapshot, K1_was, and K1_sick.</li>
                            <li><strong>Calculate Attention Scores: </strong>For "sick", compute the dot product of its query vector (Q1_sick) with the key vectors of all words (K1_that, K1_slapshot, K1_was, K1_sick).
                             This results in four scores. Apply the softmax function to these scores to get the attention scores for "sick". These scores will sum up to 1 and represent the amount of "attention" "sick" should 
                             pay to each word in the sentence, including itself.</li>
                            <li><strong>Transform all words with V1: </strong>Multiply the embeddings of each word by the V1 matrix to get their value vectors. This gives you V1_that, V1_slapshot, V1_was, and V1_sick.
                             Note that these value vectors can indeed be of a different dimension than the original word embeddings.</li>
                            <li><strong>Compute Weighted Sum of Value Vectors: </strong>Use the attention scores from step 3 to weight the value vectors obtained in step 4. For "sick", this involves multiplying each value
                             vector (V1_that, V1_slapshot, V1_was, V1_sick) by the corresponding attention score and summing these products to get a single vector. This resulting vector is a new representation of "sick" that
                              incorporates information from all the words in the sentence, weighted by their relevance to "sick" as determined by the attention scores.</li>
                         </ol>
                         </li>

                         <b>Multi-Head Attention:</b> We repeat this process for every set of key, query, and value matrices to get a set of embeddings equal to the number of 
                         these sets of matrices i.e. we calculate all of the attention scores again and apply them to the value vectors (the value matrix transformed embeddings). These are then all 
                         concatenated. So, let's shift away from the previous example a bit and imagine that we had two sets of keys, values, and 
                         queries. Each will output, similar to the above process, a final embedding that represents "sick" as a function of its contextual information. Let's say for 
                         the first set, the embedding was: [6.0, 8.8] and let's say for the second set, it was: [5.8, 4.4]; we would just concatenate these at the end of the attention step into: 
                         [6.0, 8.8, 5.8, 4.4] giving a single representation for the word "sick". 

                         <li><strong>Linear Step:</strong> Let's continue with the "sick" example, where the concatenated output from the attention mechanism for "sick" was a 4-dimensional vector, 
                         [6.0, 8.8, 5.8, 4.4]. We now want to project this back to a 3-dimensional space to match the original embedding size. To achieve this, we use a linear transformation, which is 
                         essentially a matrix multiplication. Let's say our transformation matrix (let's call it W) is a 3x4 matrix, as we're projecting from 4 dimensions back to 3. The values in W are
                          learnable parameters that the model will adjust during training. For illustration, let's make up a simple matrix:
                          
                          <BlockMath math={"W = \\begin{bmatrix} 1 & 0 & -1 & 0 \\\\ 0 & 1 & 0 & -1 \\\\ -1 & 0 & 1 & 0 \\end{bmatrix}"} />

                        To apply the linear transformation, we multiply the matrix W by our 4-dimensional vector. The result is a 3-dimensional vector, given the dimensionality of W and our 
                        embedding vector: the new 3-dimensional embedding for "sick" is [6.0 - 5.8, 8.8 - 4.4, -6.0 + 5.8] = [0.2, 4.4, -0.2]. This vector now serves as the input to the next layer 
                        in the encoder or as part of the final output from the encoder to the decoder in an encoder-decoder architecture, depending on the model's design. This linear step is crucial for 
                        allowing the model to combine information from different attention heads (in the case of multi-head attention) and to adjust the dimensionality of the embeddings as needed for
                         further processing. You will end up with a 3-dimensional embedding for each of the 4 tokens in this sentence. The embedding size after this transformation should 
                         match the original input embedding size, as that will be required in the next step.  
                          
                          </li>

                          <li><strong>Residual Connection: </strong>Often, the output from the linear transformation is added to the original input embeddings
                          via a residual connection. This helps with the flow of gradients during training and can improve performance by allowing the model to more easily learn identity functions. 
                          We add the new embedding vector from the linear transformation to the original embedding vector, element-wise. The positionally encoded embedding for "sick" was [5, 7, 9] 
                          and the new embedding from the linear step is [0.2, 4.4, -0.2]. The result of the sum is: [5.2, 11.4, 8.8].</li>

                          <li><strong>Layer Normalization: </strong>The sum of the linear transformation output and the original input embeddings usually passes through layer normalization, which 
                          helps stabilize the training process. For simplicity, let's say layer normalization scales the values such that the mean of the updated embedding vector is 0 and the standard deviation is 1. The actual layer normalization calculation involves more steps, but for illustration, we might end up with something 
                          like: Normalized embedding for "sick": [-1, 0, 1]  (This is just an illustrative example; actual normalization would depend on the mean and variance of the [5.2, 11.4, 8.8] vector). </li>

                          <li><strong>Feedforward NN: </strong>Each updated embedding vector then goes through a position-wise feedforward neural network. This network is the same for each
                           position but operates independently on each vector. The FFN serves several purposes in the Transformer architecture:
                           
                           <ul>
                            <li><strong>Non-linearity: </strong>The FFN introduces non-linearity into the model, which is essential for the model to learn complex functions and relationships within the data. Without 
                            this non-linearity, the model, regardless of its depth, could be equivalent to a single linear transformation, limiting its expressive power.</li>
                            <li><strong>Increased Model Capacity: </strong>The FFN allows the model to increase its capacity by adding more parameters in the form of weights of the feedforward network. This capacity can
                             help the model to better fit the training data and capture a wider range of linguistic phenomena.</li>
                            <li><strong>Independent Processing: </strong>Each position's (word's) embedding is processed independently by the FFN, allowing for parallel computation. This design aligns with the self-attention
                             mechanism's parallel nature and maintains the Transformer's efficiency.</li>
                           </ul>

                           Continuing with our example, let's say the normalized embedding for "sick" after the residual connection and normalization is [-1, 0, 1]. The FFN might process this as follows:

                           <ul>
                            <li><strong>First Linear Layer: </strong>Expands the dimensionality. For example, it might transform the 3-dimensional vector into a 5-dimensional vector using a learned weight matrix.

Example transformation: [-1, 0, 1] → [2, -1, 3, 0, -2] (using a made-up weight matrix for illustration)</li>
                            <li><strong>Activation Function: </strong>Applies a non-linear activation function like ReLU (Rectified Linear Unit) to each element of the 5-dimensional vector.

Example after ReLU: [2, 0, 3, 0, 0] (ReLU sets negative values to 0)</li>
                            <li><strong>Second Linear Layer: </strong>Projects the dimensionality back to the original embedding size (3-dimensional in this example).

Example back-projection: [2, 0, 3, 0, 0] → [0.5, -1, 1] (using another made-up weight matrix)</li>
                           </ul>

                           The output of the FFN for "sick" would be this final 3-dimensional vector [0.5, -1, 1], which has been transformed by the FFN to capture more complex representations. This output 
                           then typically goes through another residual connection (added to the input of the FFN) and layer normalization before being passed to the next encoder layer or used as the 
                           final encoder output in a single-layer Transformer model. 


                           
                           </li>

                           <li><strong>A Fork: </strong> Now, from here, we could be done if we were talking just about encoder-only architectures. For example, for sentiment analysis, we could pass 
                           the final output into a softmax layer to get class probabilities; however, let's continue on and discuss what would happen in an encoder-decoder architecture like the one in the original paper.
                           
                           Before I continue though, just to not confuse you, the sentiment analysis task is at the sequence level, so we would actually add another token to our sentence (typically at the start) called 
                           a CLS token -- the output associated with this token will be the one used to predict the sentiment. For token-level tasks, such as named-entity recognition, we would 
                           just look at the individual outputs associated with a particular token. </li>

                    </ol>
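                    The encoder steps above can be traced numerically with a short NumPy sketch. This is only an illustration under simplifying assumptions: the attention weights are the illustrative ones from the text rather than a true softmax, the feed-forward weights are random stand-ins for learned parameters, layer normalization omits the learned scale and shift, and the helper names (sinusoidal_positional_encoding, layer_norm) are made up for this example.
                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    # Sinusoidal variant from the formulas above (the walk-through uses a simpler one)
    pos = np.arange(n_positions)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]        # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles[:, : d_model // 2])
    return pe

def layer_norm(x, eps=1e-5):
    # Zero mean and unit variance across the embedding dimension
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# Embeddings plus the simplified "+position" encoding: that, slapshot, was, sick
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [2, 4, 6]], dtype=float)
X = X + np.arange(4)[:, None]
sick = X[3]

# Self-attention with identity Q and K: scores are plain dot products
scores = X @ sick

# Illustrative weights from the text, applied to identity-V value vectors
weights = np.array([0.1, 0.2, 0.6, 0.1])
attended = weights @ X

# Linear step: project the concatenated two-head output back to 3 dimensions
heads = np.array([6.0, 8.8, 5.8, 4.4])
W = np.array([[1, 0, -1, 0], [0, 1, 0, -1], [-1, 0, 1, 0]], dtype=float)
projected = W @ heads                  # approximately [0.2, 4.4, -0.2]

# Residual connection, then layer normalization
residual = sick + projected            # approximately [5.2, 11.4, 8.8]
normed = layer_norm(residual)

# Position-wise FFN: expand to 5 dims, ReLU, project back to 3 (random weights)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)
ffn_out = np.maximum(0.0, normed @ W1 + b1) @ W2 + b2

# Second residual connection and normalization
layer_out = layer_norm(normed + ffn_out)

print(scores, attended)
print(projected.round(3), residual.round(3), normed.round(3))
print(layer_out.round(3))`}
                    </SyntaxHighlighter>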

                    <strong>The Decoder: </strong> Let's assume the final output from the encoder's neural network layer for "sick" is a 3-dimensional vector, say [0.5, -1, 1]. This vector is part of the encoder's 
                    final output, which includes similar vectors for each word in the source sentence ("that", "slapshot", "was", "sick"). The decoder's job is to generate the target sequence one token at a time. It 
                    does this by attending over the encoder's output and its own previous outputs. For this example, let's say we are generating the first word of the translation. The decoder starts with a special
                     start-of-sequence token as its initial input. Let's work through an entire example with 2 forward passes from the decoder. Assume that the encoder output vectors for the other words are: 
                     "that" = [1, 2, 1], "slapshot" = [0.5, 1, 2], "was" = [2, 2, 1]. 

                     <ol>

                        <li><strong>Step 1: Input Embedding for [SOS]: </strong> Let's say the embedding for the [SOS] token is [1,0,0]. We would normally add positional encoding here but 
                        for simplicity, let's skip the details of the positional encoding calculation and assume it doesn't alter the embedding significantly. </li>

                        <li><strong>Step 2: Self-Attention for [SOS]: </strong> Since this is the first token, and we're using the embedding as is, we'll have <InlineMath math="Q_{sos} =
                        K_{sos} = V_{sos} = [1, 0, 0]" />. The attention score of [SOS] with itself is the dot product of [1, 0, 0] with itself, which is just 1, and the softmax over a single score is also 1. The context vector then 
                        just ends up being [1, 0, 0].</li>

                        <li><strong>Step 3: Encoder-Decoder Attention for [SOS]: </strong>The query comes from the previous step, while the keys and values come from the encoder outputs. We can 
                        calculate attention scores in this step for each key as follows: 
                        
                        <BlockMath math="\text{that: } 1\times 1 + 0\times 2 + 0\times 1 = 1" />
                        <BlockMath math="\text{slapshot: } 1\times 0.5 + 0\times 1 + 0\times 2 = 0.5" />
                        <BlockMath math="\text{was: } 1\times 2 + 0\times 2 + 0\times 1 = 2" />
                        <BlockMath math="\text{sick: } 1\times 0.5 + 0\times -1 + 0\times 1 = 0.5" />

                        Apply softmax to [1, 0.5, 2, 0.5] to get the attention weights. The exact values are roughly [0.20, 0.12, 0.55, 0.12]; for simplicity, let's round them to [0.2, 0.1, 0.6, 0.1]. Then, we compute a weighted sum of the 
                        value vectors using these attention weights: 

                        <BlockMath math="0.2\times [1, 2, 1] + 0.1\times[0.5, 1, 2] + 0.6\times[2, 2, 1] + 0.1\times[0.5, -1, 1] = [1.5, 1.6, 1.1]" />


                        
                        </li>

                        <li><strong>Residual Connection + Normalization: </strong>We add the input of the attention sub-layer to its output (the residual connection) and apply layer normalization to the sum. </li>

                        <li><strong>FFN: </strong>We now pass the resulting vector into a feed-forward network (FFN), the output of which once again goes through a residual connection and normalization step. </li>

                        <li><strong>Prediction: </strong>The process above can be repeated for as many layers as you like (continuing to use the encoder outputs as the keys and values)
                        until we are ready to make a prediction. A prediction is made by passing the final vector through a linear layer followed by a softmax function, which gives a probability distribution over the 
                        entire vocabulary. The token with the highest probability becomes the prediction and is appended to the sequence. For example, if the output was "the", then the current 
                        decoder sequence becomes "[SOS] the".</li>

                        <li><strong>Second Pass: </strong>The second pass is very similar, except that the self-attention step is no longer trivial: the current token in the 
                        sequence attends to all tokens predicted so far. This step has its own query, key, and value matrices, which are distinct 
                        from the ones used in encoder-decoder attention. The context vector from self-attention is then passed on to the usual encoder-decoder attention layers 
                        (i.e. keys and values from the encoder), and another prediction is made as above!</li>

                        <li><strong>Parallel Processing and Masking: </strong>One of the big advantages of transformers is that, during training, all positions in the sequence are processed in parallel. This is possible 
                        because we already know the entire correct output sequence for each input (for example, we would know the true translation of "that slapshot was sick" beforehand). 
                        However, we don't want the decoder to use future words in the sequence when it is predicting a word earlier in the sequence. Hence, 
                        transformers use "masking" so that each position's attention scores are computed only over the current position and the positions before it. At inference time, the decoder still generates one token at a time. </li>

                     </ol>
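
                     Steps 2 and 3 of the first pass can be reproduced numerically. The following is a minimal NumPy sketch under the same simplifications as the walkthrough: the encoder outputs are used directly as keys and values, 
                     there are no learned projection matrices, and the scores are not scaled. The names <code>encoder_outputs</code>, <code>q_sos</code>, and <code>softmax</code> are illustrative choices, not part of any library. It prints the context vector 
                     for both the rounded weights and the exact softmax weights so the effect of rounding is visible.
                     <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Encoder outputs from the walkthrough, used directly as keys and values
encoder_outputs = np.array([
    [1.0, 2.0, 1.0],    # "that"
    [0.5, 1.0, 2.0],    # "slapshot"
    [2.0, 2.0, 1.0],    # "was"
    [0.5, -1.0, 1.0],   # "sick"
])

# Step 2: with a single token, self-attention returns the [SOS] embedding unchanged
q_sos = np.array([1.0, 0.0, 0.0])

# Step 3: dot-product score between the decoder query and every encoder key
scores = encoder_outputs @ q_sos                   # [1.0, 0.5, 2.0, 0.5]

exact_weights = softmax(scores)                    # exact attention weights
rounded_weights = np.array([0.2, 0.1, 0.6, 0.1])   # rounded weights used in the text

# Context vector = attention-weighted sum of the value vectors
context_exact = exact_weights @ encoder_outputs
context_rounded = rounded_weights @ encoder_outputs

print("scores:", scores)
print("exact weights:", np.round(exact_weights, 3))
print("context (rounded weights):", np.round(context_rounded, 3))
print("context (exact weights):", np.round(context_exact, 3))
`}
                     </SyntaxHighlighter>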
                     
                     Okay, we're nearing the end. Lastly, there is training. We train everything in this network, including all of the Q, K, and V projection matrices, all of the weights 
                     in every feed-forward network, and all of the other linear transformation matrices. If we don't use embeddings from some other
                     pre-trained model, the embeddings are also updated (they exist as a lookup table that is indexed by the input token). 
                     
                   
                </p>
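
                <p className="subsubsection-paragraph">
                    The masking idea from the walkthrough above can be sketched in a few lines. This is a simplified, single-head version with no learned projections, and the three input embeddings are hypothetical values 
                    chosen only for illustration. The function name <code>causal_self_attention</code> is likewise an assumption. Scores for future positions are set to negative infinity before the softmax, so they receive zero weight.
                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import numpy as np

def causal_self_attention(x):
    """Single-head self-attention with a causal mask (no learned projections)."""
    seq_len, d_k = x.shape
    scores = (x @ x.T) / np.sqrt(d_k)

    # Position i may only attend to positions at or before i:
    # mark the strict upper triangle (the "future") as blocked
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; blocked entries become exactly zero
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

# Hypothetical embeddings for a 3-token decoder input
x = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])

output, weights = causal_self_attention(x)
print(np.round(weights, 3))   # lower-triangular: no weight on future tokens
`}
                    </SyntaxHighlighter>
                    Because the first row can only see itself, its output equals the first embedding, which mirrors Step 2 of the walkthrough. All rows are computed at once, which is what makes parallel training possible.
                </p>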

                <h4>Hyperparameters</h4>
                <p className="subsubsection-paragraph">
                <ul>
                    <li>
                    <strong>Number of Layers:</strong> Both the encoder and decoder are composed of a stack of identical layers. This hyperparameter controls how many layers are in each stack.
                    </li>
                    <li>
                    <strong>Model Dimension (<InlineMath math="d_{\text{model}}" />):</strong> The size of the input token embeddings as well as the output size of each sub-layer in the model, including the FFN.
                    </li>
                    <li>
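                    <strong>FFN Inner-Layer Dimension (<InlineMath math="d_{\text{ff}}" />):</strong> The dimensionality of the hidden layer inside each feed-forward network. Typically, <InlineMath math="d_{\text{ff}}" /> is larger than <InlineMath math="d_{\text{model}}" /> (the original Transformer uses <InlineMath math="d_{\text{ff}} = 2048" /> with <InlineMath math="d_{\text{model}} = 512" />).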
                    </li>
                    <li>
                    <strong>Number of Attention Heads (<InlineMath math="h" />):</strong> The number of heads in the multi-head attention mechanism. Each head attends to different parts of the input sequence.
                    <BlockMath math="h = d_{\text{model}} / d_{\text{k}}" />
                    where <InlineMath math="d_{\text{k}}" /> is the dimension of the key vectors in each head.
                    </li>
                    <li>
                    <strong>Attention Key/Query/Value Dimensions (<InlineMath math="d_{\text{k}}, d_{\text{q}}, d_{\text{v}}" />):</strong> The dimensions of the key, query, and value vectors within each attention head. Queries and keys must have the same dimension so that their dot product is defined, so <InlineMath math="d_{\text{k}} = d_{\text{q}}" />; <InlineMath math="d_{\text{v}}" /> is usually set to the same value. Typically, <InlineMath math="d_{\text{model}}" /> is divisible by the number of attention heads and is split evenly across them:
                    <BlockMath math="d_{\text{k}} = d_{\text{q}} = d_{\text{model}} / h" />
                    </li>
                    <li>
                    <strong>Positional Encoding Size:</strong> Must match the model dimension <InlineMath math="d_{\text{model}}" /> to be added to the embeddings.
                    </li>
                    <li>
                    <strong>Dropout Rate:</strong> The proportion of the output of each sub-layer and embedding that is randomly set to zero during training to prevent overfitting.
                    </li>
                    <li>
                    <strong>Attention Dropout Rate:</strong> Similar to the dropout rate, but specifically for the attention weights to promote more robust learning of attention patterns.
                    </li>
                    <li>
                    <strong>Layer Normalization Epsilon (<InlineMath math="\epsilon" />):</strong> A small constant added to the variance when computing layer normalization, to avoid division by zero.
                    </li>
                </ul>
                </p>
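
                <p className="subsubsection-paragraph">
                    To make the relationships between these hyperparameters concrete, here is a small sketch. The class <code>TransformerConfig</code> and the helpers <code>layer_norm</code> and <code>split_heads</code> are hypothetical names, not a library API. 
                    The defaults for the number of layers, model dimension, FFN dimension, number of heads, and dropout follow the base configuration of the original Transformer; the attention dropout and epsilon values are illustrative placeholders. 
                    The layer normalization shown here omits the learnable scale and shift for brevity.
                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`from dataclasses import dataclass
import numpy as np

@dataclass
class TransformerConfig:
    num_layers: int = 6             # layers in each of the encoder and decoder stacks
    d_model: int = 512              # embedding size and sub-layer output size
    d_ff: int = 2048                # hidden size of the feed-forward network
    num_heads: int = 8              # h
    dropout: float = 0.1            # applied to sub-layer outputs and embeddings
    attention_dropout: float = 0.1  # illustrative placeholder
    layer_norm_eps: float = 1e-6    # illustrative placeholder

    def __post_init__(self):
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")

    @property
    def d_k(self):
        # Per-head dimension: d_k = d_model / h
        return self.d_model // self.num_heads

def layer_norm(x, eps):
    # Normalize over the feature axis; eps keeps the denominator non-zero
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def split_heads(x, num_heads):
    # (seq_len, d_model) -> (num_heads, seq_len, d_k)
    seq_len, d_model = x.shape
    return x.reshape(seq_len, num_heads, d_model // num_heads).transpose(1, 0, 2)

config = TransformerConfig()
x = np.ones((10, config.d_model))        # constant rows have zero variance
heads = split_heads(x, config.num_heads)
normed = layer_norm(x, config.layer_norm_eps)

print("d_k:", config.d_k)                          # 64
print("per-head shape:", heads.shape)              # (8, 10, 64)
print("all finite:", np.isfinite(normed).all())    # eps avoids division by zero
`}
                    </SyntaxHighlighter>
                </p>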

                <h4>In Code</h4>
                <p className="subsubsection-paragraph">
                    Here is an example of a Transformer in code. It fine-tunes a pre-trained model (you'll learn more about this later) rather than building one from scratch, which is fairly involved:
                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`from datasets import load_dataset
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score
import numpy as np

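# 1. Load IMDB and keep a small random subset of each split for quick training.
#    Shuffle before selecting: a leading slice such as 'train[:10%]' may contain only one label.
dataset = load_dataset('imdb', split={'train': 'train', 'test': 'test'})
dataset['train'] = dataset['train'].shuffle(seed=42).select(range(2500))
dataset['test'] = dataset['test'].shuffle(seed=42).select(range(2500))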

# 2. Load the DistilBert tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Function to tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

# Tokenize the entire dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# 3. Load the DistilBert model for sequence classification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

# 4. Training arguments
training_args = TrainingArguments(
    output_dir='./results',         # output directory for model checkpoints
    num_train_epochs=2,             # number of training epochs for demonstration
    per_device_train_batch_size=8,  # batch size for training
    per_device_eval_batch_size=8,   # batch size for evaluation
    logging_dir='./logs',           # directory for storing logs
    logging_steps=10,
    evaluation_strategy="epoch",    # evaluate at the end of each epoch (newer transformers releases name this eval_strategy)
)

# Function to compute accuracy
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {'accuracy': accuracy_score(labels, predictions)}

# 5. Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()

# 6. Perform inference on new sentences
model.eval()  # disable dropout for inference
new_sentences = ["This movie is fantastic!", "I did not like this movie at all."]
# Move the inputs to the same device as the model (the Trainer may have placed it on a GPU)
new_inputs = tokenizer(new_sentences, padding=True, truncation=True, return_tensors="pt").to(model.device)
predictions = model(**new_inputs).logits
predicted_classes = np.argmax(predictions.detach().cpu().numpy(), axis=1)
print("Predictions:", predicted_classes)

`}
                        </SyntaxHighlighter>
                    </p>
                </section>


                <section id="advattention" className="code-cleaned">
                    <h2>Types of Transformers</h2>

                    <h4>Sparse Transformers</h4>
                    <p className="subsubsection-paragraph">
                        Sparse Transformers were introduced to address the quadratic cost of self-attention in standard Transformers. They employ sparse attention patterns that reduce the computational complexity 
                        from <InlineMath math="O(n^2)" /> to <InlineMath math="O(n \sqrt{n})" /> for sequence length <InlineMath math="n" />. By letting each token attend to only a subset of the input tokens, 
                        Sparse Transformers aim to preserve model quality while significantly reducing the resources required for long sequences. This approach is particularly beneficial in tasks requiring 
                        long-range dependencies, such as document summarization and music generation.
                    </p>
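
                    <p className="subsubsection-paragraph">
                        As a rough illustration of where the savings come from, the sketch below builds a boolean attention mask that combines a local window with a strided pattern. It is a simplification loosely modeled on strided sparse attention, 
                        not the exact factorized patterns of any particular implementation; the function name <code>strided_sparse_mask</code> and the choice of a stride near the square root of the sequence length are assumptions. 
                        Each query ends up attending to roughly two times the stride positions, instead of every earlier position.
                        <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import numpy as np

def strided_sparse_mask(n, stride):
    """Boolean (n, n) mask: True where query i may attend to key j."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i                        # never look at future positions
    local = (i - j) < stride               # the most recent positions (including i)
    strided = ((i - j) % stride) == 0      # every stride-th earlier position
    return causal & (local | strided)

n = 1024
stride = int(np.sqrt(n))                   # stride near sqrt(n)
mask = strided_sparse_mask(n, stride)

dense_pairs = n * (n + 1) // 2             # query-key pairs in dense causal attention
sparse_pairs = int(mask.sum())             # pairs kept by the sparse pattern

print("dense pairs: ", dense_pairs)
print("sparse pairs:", sparse_pairs)
print("fraction kept:", round(sparse_pairs / dense_pairs, 3))
`}
                        </SyntaxHighlighter>
                        In a real model, the mask would be applied to the attention scores before the softmax, and an efficient implementation would skip the masked pairs entirely rather than compute and discard them.
                    </p>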

                    <h4>Convolutional Transformers</h4>
                    <p className="subsubsection-paragraph">
                        Convolutional Transformers integrate convolutional layers into the Transformer architecture, aiming to capture local dependencies more effectively. While the self-attention 
                        mechanism in standard Transformers has no built-in preference for nearby tokens, the convolutional layers in Convolutional Transformers enforce a notion of locality,
                         making them well suited to tasks where spatial or neighboring relationships are key, such as certain language understanding tasks and the processing of structured data. One way to write this idea is:
                        <BlockMath math="\text{ConvAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T + C(Q, K)}{\sqrt{d_k}}\right)V" />
                        Here, <InlineMath math="C(Q, K)" /> represents a convolution operation applied to the queries and keys, which enhances the model's ability to capture local context.
                    </p>

                    <h4>Vision Transformers (ViT)</h4>
                    <p className="subsubsection-paragraph">
                        Vision Transformers (ViT) adapt the Transformer architecture for image classification tasks by treating images as sequences of patches. Each patch is flattened, linearly projected, 
                        and then processed in a manner akin to tokens in NLP tasks. This approach allows ViT to leverage the powerful self-attention mechanism to capture complex dependencies between patches,
                         making it highly effective for tasks requiring detailed visual understanding.
                        <BlockMath math="\text{ImageSequence} = [\text{Patch}_1; \text{Patch}_2; \ldots; \text{Patch}_N]" />
                        <BlockMath math="\text{ViTOutput} = \text{Transformer}(\text{ImageSequence})" />
                        ViT represents a significant shift from conventional convolutional neural networks, offering an alternative that benefits from the global receptive field of the self-attention
                         mechanism.
                    </p>
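
                    <p className="subsubsection-paragraph">
                        The patching step can be sketched with plain NumPy reshapes. This is a minimal illustration, assuming a 224 by 224 RGB image and 16 by 16 patches; the function name <code>image_to_patches</code>, the random projection matrix, 
                        and the small embedding size are hypothetical stand-ins for learned parameters. The class token and positional embeddings that a full ViT adds are omitted here.
                        <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
            {`import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    if h % patch_size != 0 or w % patch_size != 0:
        raise ValueError("image height and width must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # (grid_h, P, grid_w, P, C) -> (grid_h, grid_w, P, P, C)
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, each row is the flattened P * P * C pixel values
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # stand-in for a real image
patches = image_to_patches(image, 16)        # shape (196, 768)

d_model = 64                                 # hypothetical embedding size
projection = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
tokens = patches @ projection                # shape (196, 64), the Transformer input

print("patch sequence:", patches.shape)
print("token sequence:", tokens.shape)
`}
                        </SyntaxHighlighter>
                    </p>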
                    </section>

                
                
                <div className="subsubsection-navigation">
                    <Link to="/ml/rnn">← Recurrent Neural Networks</Link>
                    <Link to="/ml/adv">Other Architectures →</Link>
                </div>
            </main>
            
            <Footer />
        </div>
    );
}

export default Seq2Seq;
