import React from 'react';
import '../../styles/subsection.css';
import Header from '../../components/Header';
import Footer from '../../components/Footer';
import { Link } from 'react-router-dom';
import 'katex/dist/katex.min.css';
import { InlineMath, BlockMath } from 'react-katex';

function NeuralNetworks() {
    return (
        <div className="subsubsection-container">
            <Header />
            <div className="side-nav-container">
                <aside className="subsubsection-side-nav">
                    <a href="#foundations">Foundations</a>
                    <a href="#optimization">Optimization</a>
                    <a href="#hyp">Hyperparameters</a>
                    <a href="#regularization">Regularization</a>
                </aside>
            </div>
            
            <main className="subsubsection-content">
                <div className="titles"><h1>Neural Networks</h1></div>

                <section id="foundations" className="code-cleaned">
                    <h2>Foundations</h2>
                    <p className="subsubsection-paragraph">
                        This section covers neural networks: their architecture, their fundamental principles, and the mechanics that underpin how they work. 
                        From the basic building blocks, like neurons and layers, to more advanced concepts such as optimization algorithms, 
                        regularization techniques (some of which you've already read about), and the unique challenges posed by language data, this writeup will hopefully illuminate 
                        the main facets of neural networks, especially in the context of NLP!
                    </p>

                    <p className="subsubsection-paragraph">
                        The utility of neural networks in NLP is rooted in their ability to capture and model the nuances and complexities inherent in language. Through their layered 
                        structure and non-linear processing, neural networks can learn and represent diverse linguistic patterns – from simple syntactic structures to intricate semantic 
                        relationships. As you learn about activation functions, weight initialization, and the pivotal role of hyperparameters, you will see how these elements converge 
                        to make neural networks adept at tasks such as language translation, sentiment analysis, and text classification. The section will also address critical techniques like 
                        dropout and early stopping, elucidating their significance in enhancing model performance and preventing overfitting (re: bias-variance trade-off).
                    </p>

                    <h4>Building Blocks</h4>

                    <p className="subsubsection-paragraph">
                        Traditional neural networks consist of interconnected layers, with each layer containing a set of neurons. The architecture begins with an input layer, followed
                         by one or more hidden layers, and concludes with an output layer. The number of neurons in the <InlineMath math="u" />-th hidden layer is 
                         denoted as <InlineMath math="n_u" />, and each neuron performs a non-linear transformation of the inputs it receives.
                    </p>
                    <p className="subsubsection-paragraph">
                        The operation of each neuron involves aggregating the inputs, which are activations from the previous layer, into a linear combination followed by a non-linear activation 
                        function. For instance, the output from the first hidden layer can be mathematically described as:
                        <BlockMath math="v^{(1)} = g(W^{(1)}x + b^{(1)})" />
                        where <InlineMath math="x" /> represents the input vector of <InlineMath math="J" /> covariates, <InlineMath math="W^{(1)}" /> is the weight 
                        matrix, <InlineMath math="b^{(1)}" /> is the bias vector, and <InlineMath math="g(\cdot)" /> is the activation function. Common choices 
                        for <InlineMath math="g" /> include the rectified linear unit (ReLU) and the sigmoid function; I will talk more about these in the next subsection. 
                    </p>
                    <p className="subsubsection-paragraph">
                        The dimensions of the weight matrix <InlineMath math="W^{(1)}" /> and the bias vector <InlineMath math="b^{(1)}" /> are determined by the number of neurons in the 
                        layers they connect: <InlineMath math="W^{(1)}" /> is <InlineMath math="n_1 \times J" /> and <InlineMath math="b^{(1)}" /> has <InlineMath math="n_1" /> entries. The activation function introduces non-linearity, allowing the network to model complex relationships. The resulting 
                        output vector <InlineMath math="v^{(1)}" /> is <InlineMath math="n_1" />-dimensional and serves as the input to the next layer. This process of linear combination 
                        and non-linear transformation is repeated across subsequent layers, culminating in the output layer which provides the final prediction or classification.
                    </p>
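                    <p className="subsubsection-paragraph">
                        As a small worked sketch of the formula above (the numbers are made up purely for illustration), take <InlineMath math="J = 2" /> inputs, <InlineMath math="n_1 = 2" /> hidden 
                        neurons, and ReLU as the activation. Assume the rows of <InlineMath math="W^{(1)}" /> are <InlineMath math="(1, -1)" /> and <InlineMath math="(0.5, 2)" />, the bias 
                        is <InlineMath math="b^{(1)} = (0, -4)" />, and the input is <InlineMath math="x = (2, 1)" />. Each neuron first forms its linear combination and then applies the activation:
                        <BlockMath math="z_1 = 1 \cdot 2 + (-1) \cdot 1 + 0 = 1, \qquad z_2 = 0.5 \cdot 2 + 2 \cdot 1 - 4 = -1" />
                        <BlockMath math="v^{(1)} = (\max(0, 1), \; \max(0, -1)) = (1, 0)" />
                        The second neuron's negative pre-activation is clipped to zero, which is the non-linear step that a purely linear layer could not perform.
                    </p>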



                    <h4>Activation Functions</h4>
                    <p className="subsubsection-paragraph">
                        Activation functions in neural networks are critical elements that introduce non-linear properties to the model, allowing it to learn and represent more complex
                        patterns beyond what a linear model could. The essence of a neural network's power lies in chaining these non-linear transformations to model the intricate structures 
                        found in real-world data.
                    </p>

                    <p className="subsubsection-paragraph">
                        At its core, a neural network without activation functions collapses to a single linear model, no matter how many layers it has, because a composition of linear maps is 
                        itself linear; it is therefore unable to address non-linear problems. The role of the activation function is to take the linear combination of inputs and weights and 
                        transform it non-linearly, so that stacked layers can represent decision boundaries that are not straight lines or planes. The mathematical representation of an activation 
                        function applied to the linear combination of inputs <InlineMath math="z" /> is typically denoted as:
                        <BlockMath math="a = g(z)" />
                        where <InlineMath math="g" /> is the activation function, and <InlineMath math="a" /> is the output that will serve as an input to the next layer or as a final prediction.
                    </p>

                    <p className="subsubsection-paragraph">
                        There are several widely-used activation functions, each with distinct characteristics and mathematical properties:
                        <ul>
                            <li>
                                <b>Sigmoid:</b> A function that maps any input to a value between 0 and 1, which makes it a common choice for the output of a binary classifier.
                                <BlockMath math="g(z) = \frac{1}{1 + e^{-z}}" />
                            </li>
                            <li>
                                <b>Hyperbolic Tangent (tanh):</b> Similar to the sigmoid but outputs values between -1 and 1, so its outputs are centered around zero.
                                <BlockMath math="g(z) = \tanh(z)" />
                            </li>
                            <li>
                                <b>ReLU (Rectified Linear Unit):</b> Provides a simple, efficient non-linear transformation, which has become the default choice for many types of neural networks.
                                <BlockMath math="g(z) = \max(0, z)" />
                            </li>
                            <li>
                                <b>Leaky ReLU:</b> A variant of ReLU that allows a small, non-zero gradient when the unit is not active, which helps prevent dead neurons during training. Here <InlineMath math="\alpha" /> is a small positive constant, for example 0.01.
                                <BlockMath math="g(z) = \max(\alpha z, z)" />
                            </li>
                        </ul>
                    </p>
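                    <p className="subsubsection-paragraph">
                        To make these definitions concrete, here are a few values worked by hand (a sketch only, with <InlineMath math="\alpha = 0.01" /> assumed for Leaky ReLU):
                        <BlockMath math="\text{sigmoid}(0) = \frac{1}{1 + e^{0}} = 0.5, \qquad \tanh(0) = 0" />
                        <BlockMath math="\text{ReLU}(-2) = \max(0, -2) = 0, \qquad \text{LeakyReLU}(-2) = \max(0.01 \cdot (-2), \; -2) = -0.02" />
                        For positive inputs, ReLU and Leaky ReLU both return the input unchanged, for example <InlineMath math="\max(0, 3) = 3" />.
                    </p>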

                    <p className="subsubsection-paragraph">
                        The interplay between linear algebra and non-linear modeling in neural networks is seen in how activation functions operate on the linear output of neurons. The 
                        linear operations (weighted sums and biases) are simple linear algebraic transformations. When an activation function is applied, it warps the linear space, bending 
                        and stretching the data into non-linear manifolds. This allows neural networks to approximate functions that linear models cannot, such as XOR functions, and to capture 
                        the complex patterns often present in language for NLP tasks.
                    </p>
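                    <p className="subsubsection-paragraph">
                        One way to see this for XOR is a hand-built sketch with two ReLU hidden units. The weights below are chosen by hand for illustration, not learned. For binary 
                        inputs <InlineMath math="x_1" /> and <InlineMath math="x_2" />, define:
                        <BlockMath math="h_1 = \max(0, x_1 + x_2), \qquad h_2 = \max(0, x_1 + x_2 - 1), \qquad \hat{y} = h_1 - 2 h_2" />
                        Checking all four inputs:
                        <BlockMath math="(0,0) \mapsto 0 - 0 = 0, \quad (0,1) \mapsto 1 - 0 = 1, \quad (1,0) \mapsto 1 - 0 = 1, \quad (1,1) \mapsto 2 - 2 \cdot 1 = 0" />
                        which matches the XOR truth table. No model that is linear in <InlineMath math="x_1" /> and <InlineMath math="x_2" /> can reproduce this table; the ReLU in the 
                        hidden layer is what makes it possible.
                    </p>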

                    <p className="subsubsection-paragraph">
                        Activation functions enable neural networks to handle the intricacies of language, which are inherently non-linear and hierarchical. They allow the network to 
                        capture nuances like context, tone, and semantic meaning from sequential data, which are crucial for tasks such as language translation, sentiment analysis, and content 
                        generation. The choice of activation function in an NLP model can significantly influence its performance, as it affects the model's ability to capture and represent the 
                        non-linear dependencies between words and their contextual meanings.
                    </p>

                    <h4>Deep Learning</h4>

                    <p className="subsubsection-paragraph">
                        When a neural network has more than one hidden layer, we call it a deep neural network, and training such networks is what is meant by deep learning.
                    </p>

                </section>

                <section id="optimization" className="code-cleaned">
                    <h2>Optimization</h2>
                    <p className="subsubsection-paragraph"></p>

                    <h4>Loss Functions</h4>

                    <p className="subsubsection-paragraph">
                        A loss function, or cost function, quantifies the difference between the predicted values and the actual 
                        values in a dataset. It's a method to evaluate how well a specific algorithm models the given data. The goal in training a model is to find the parameters that minimize the 
                        loss function.
                    </p>

                    <p className="subsubsection-paragraph">
                        A loss function is typically denoted as <InlineMath math="L(Y, \hat{Y})" />, where <InlineMath math="Y" /> is the true value 
                        and <InlineMath math="\hat{Y}" /> is the predicted value. The exact form of the loss function can vary depending on the specific learning task.
                    </p>

                    <p className="subsubsection-paragraph">
                        Some commonly used loss functions include:
                        <ul>
                            <li>
                                <b>Mean Squared Error (MSE):</b> Used primarily in regression tasks.
                                <BlockMath math="MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2" />
                            </li>
                            <li>
                                <b>Cross-Entropy Loss:</b> Often used in classification problems.
                                <BlockMath math="H(p, q) = - \sum_{x} p(x) \log q(x)" />
                                where <InlineMath math="p" /> is the true distribution, and <InlineMath math="q" /> is the predicted distribution.
                            </li>
                            <li>
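                                <b>Hinge Loss:</b> Commonly applied in Support Vector Machines for classification, where the label <InlineMath math="y_i" /> is either -1 or +1 and <InlineMath math="\hat{y}_i" /> is the raw model score.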
                                <BlockMath math="Hinge = \max(0, 1 - y_i \cdot \hat{y}_i)" />
                            </li>
                        </ul>
                    </p>
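                    <p className="subsubsection-paragraph">
                        As a quick numeric sketch of each loss (all values below are invented for illustration, and the logarithm is the natural log): for MSE, take true values (1, 2) and 
                        predictions (1.5, 1); for cross-entropy, take a one-hot true distribution (0, 1, 0) and predicted probabilities (0.2, 0.7, 0.1); for hinge loss, take a label of +1 and 
                        a raw score of 0.3.
                        <BlockMath math="\text{MSE} = \frac{1}{2}\left[(1 - 1.5)^2 + (2 - 1)^2\right] = \frac{0.25 + 1}{2} = 0.625" />
                        <BlockMath math="H(p, q) = -\left(0 \cdot \log 0.2 + 1 \cdot \log 0.7 + 0 \cdot \log 0.1\right) = -\log 0.7 \approx 0.357" />
                        <BlockMath math="\text{Hinge} = \max(0, \; 1 - (+1) \cdot 0.3) = 0.7" />
                        Note that with a one-hot target, cross-entropy reduces to the negative log-probability assigned to the correct class.
                    </p>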

                    <p className="subsubsection-paragraph">
                        Selecting an appropriate loss function is crucial as it directly impacts the model's ability to understand and generate language. For instance:
                        <ul>
                            <li>In language modeling and text generation tasks, cross-entropy loss is used to compare the predicted probability distribution over the next word with the word that 
                                actually occurs (a one-hot distribution over the vocabulary).</li>
                            <li>For sequence-to-sequence models, as in machine translation, the cross-entropy loss can be calculated at each time step, summing over the sequence.</li>
                            <li>In text classification, such as sentiment analysis or spam detection, cross-entropy loss helps in comparing the predicted class probabilities with the true labels.</li>
                        </ul>
                        Basically, use cross-entropy loss.
                    </p>

                    <h4>Optimizers</h4>
                    <p className="subsubsection-paragraph">
                        We covered this in the previous section, but for completeness: optimizers are algorithms that adjust the parameters of a model to minimize the loss function. The choice of optimizer can significantly 
                        affect the speed and quality of the training process.
                    </p>
                    
                    <p className="subsubsection-paragraph">
                        Several optimization algorithms are commonly used, each with its unique approach to navigating the loss landscape:
                        <ul>
                            <li>
                                <b>Gradient Descent:</b> The simplest optimizer that updates parameters in the direction of the negative gradient of the loss function.
                            </li>
                            <li>
                                <b>Stochastic Gradient Descent (SGD):</b> An extension of gradient descent that estimates the gradient from a single example or a small mini-batch instead of the 
                                full dataset, so parameters are updated more frequently. This can lead to faster convergence, at the cost of noisier updates.
                            </li>
                            <li>
                                <b>Momentum:</b> Builds upon SGD by incorporating a fraction of the previous update into the current update, aiming to accelerate the convergence and reduce 
                                oscillations.
                            </li>
                            <li>
                                <b>Adagrad:</b> Modifies the learning rate for each parameter individually, based on the historical gradients, which can be beneficial for sparse data.
                            </li>
                            <li>
                                <b>RMSprop:</b> An unpublished but popular optimizer that divides the learning rate by an exponentially decaying average of squared gradients.
                            </li>
                            <li>
                                <b>Adam (Adaptive Moment Estimation):</b> Combines ideas from Momentum and RMSprop, keeping an exponentially decaying average of past gradients and squared gradients.
                            </li>
                        </ul>
                    </p>
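                    <p className="subsubsection-paragraph">
                        The descriptions above can be sketched as update rules. This is one common formulation, and the notation is introduced just for this sketch rather than taken from 
                        elsewhere in this writeup: <InlineMath math="\theta_t" /> are the parameters at step <InlineMath math="t" />, <InlineMath math="g_t" /> is the gradient of the loss 
                        at <InlineMath math="\theta_t" /> (computed on the full dataset for gradient descent, or on a mini-batch for SGD), <InlineMath math="\eta" /> is the learning 
                        rate, <InlineMath math="\gamma" />, <InlineMath math="\beta_1" /> and <InlineMath math="\beta_2" /> are decay coefficients, and <InlineMath math="\epsilon" /> is a 
                        small constant for numerical stability.
                        <BlockMath math="\text{(S)GD:} \quad \theta_{t+1} = \theta_t - \eta \, g_t" />
                        <BlockMath math="\text{Momentum:} \quad v_t = \gamma \, v_{t-1} + \eta \, g_t, \qquad \theta_{t+1} = \theta_t - v_t" />
                        <BlockMath math="\text{Adam:} \quad m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t, \qquad s_t = \beta_2 s_{t-1} + (1 - \beta_2) \, g_t^2" />
                        <BlockMath math="\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{s}_t = \frac{s_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{s}_t} + \epsilon}" />
                        In Adam the squaring and the division are applied element-wise, so each parameter effectively gets its own step size.
                    </p>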

                    {/* <p className="subsubsection-paragraph">
                        In NLP, the choice of optimizer is crucial due to the complexity of language data and models. For instance, Adam is often preferred in training deep neural networks for 
                        tasks like machine translation and text classification because of its efficient handling of sparse gradients and adaptive learning rate adjustments. This is particularly
                         useful in dealing with large vocabularies and varying sentence structures common in NLP.

                        For models like Word2Vec, which involve sparse data, optimizers like SGD or Adagrad can be more appropriate due to their update mechanisms. In scenarios involving recurrent
                         neural networks (RNNs), which are susceptible to issues like exploding or vanishing gradients, RMSprop or Adam can help mitigate these issues through their adaptive 
                         learning rates.

                        Ultimately, the choice of optimizer in NLP must consider the specific characteristics of the language task, the architecture of the model, and the nature of the training data. 
                        The right optimizer not only improves training efficiency but also enhances the model's ability to learn and generalize from the linguistic patterns present in the data.
                    </p> */}

                    <h4>Initialization</h4>
                    <p className="subsubsection-paragraph">
                        Weight initialization is a critical step in the training of neural networks that can significantly influence the convergence and performance of the model. Proper initialization 
                        sets the stage for an efficient optimization process by providing a starting point that balances the need to break symmetry between neurons while preventing the gradients 
                        from vanishing or exploding during the initial epochs of training.
                    </p>
                    
                    <p className="subsubsection-paragraph">
                        Several strategies have been developed for initializing the weights of neural networks, each with its theoretical justifications and practical considerations:
                        <ul>
                            <li>
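                                <b>Zeros or Ones Initialization:</b> Setting all weights to the same constant (zeros or ones). This is typically avoided because every neuron in a layer then computes the same output and receives the same gradient, so symmetry is never broken.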
                            </li>
                            <li>
                                <b>Random Initialization:</b> Weights are initialized randomly using a Gaussian or uniform distribution, which helps break symmetry. If the variance is poorly chosen, however, activations and gradients can vanish or explode as the network gets deeper.
                            </li>
                            <li>
                                <b>Xavier/Glorot Initialization:</b> Intended for networks with tanh activations, it initializes weights by drawing from a distribution with zero mean and a specific 
                                variance that depends on the number of input and output neurons.
                                <BlockMath math="\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}" />
                            </li>
                            <li>
                                <b>He Initialization:</b> Similar to Xavier initialization but tailored for ReLU activations, it uses a larger variance to compensate for ReLU setting negative pre-activations to zero.
                                <BlockMath math="\text{Var}(W) = \frac{2}{n_{\text{in}}}" />
                            </li>
                        </ul>
                    </p>
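                    <p className="subsubsection-paragraph">
                        As a worked sketch of the two variance formulas, consider a hypothetical layer with <InlineMath math="n_{\text{in}} = 300" /> inputs 
                        and <InlineMath math="n_{\text{out}} = 100" /> outputs (sizes chosen only for illustration):
                        <BlockMath math="\text{Xavier:} \quad \text{Var}(W) = \frac{2}{300 + 100} = 0.005, \qquad \sigma = \sqrt{0.005} \approx 0.071" />
                        <BlockMath math="\text{He:} \quad \text{Var}(W) = \frac{2}{300} \approx 0.0067, \qquad \sigma \approx 0.082" />
                        For a Gaussian initializer, these are the standard deviations one would use when drawing the weights with zero mean.
                    </p>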

                    <p className="subsubsection-paragraph">
                        Models in NLP often deal with sparse, high-dimensional data, so effective weight initialization becomes even more critical. For example, in models like Word2Vec or 
                        GloVe, where word embeddings are learned, initialization affects how quickly and how well the semantic relationships between words are captured. In some models, especially 
                        encoder-decoder architectures, there are other ways to initialize parameters, such as using the output from the encoder. You will learn about this later.
                        </p>

                </section>

                <section id="hyp" className="code-cleaned">
                    <h2>Hyperparameters</h2>
                    <p className="subsubsection-paragraph"></p>

                    <h4>Model Tuning</h4>
                    <p className="subsubsection-paragraph">
                        Hyperparameters are the configuration settings, chosen by the user, that determine a model's structure and how it is trained. Unlike model parameters that are learned during training, hyperparameters are 
                        set prior to the training process and govern the overall behavior of the model. They play a crucial role in the learning process as they can significantly influence the 
                        performance of the model.
                    </p>
                    
                    <p className="subsubsection-paragraph">
                        Examples of hyperparameters include learning rate, the number of hidden layers and neurons in a neural network, batch size, number of epochs, regularization parameters, 
                        and the choice of optimizer. Each of these hyperparameters can affect the training speed, the model's ability to generalize, or even the feasibility of the training 
                        converging to a solution.
                    </p>

                    <p className="subsubsection-paragraph">
                        Hyperparameter tuning, therefore, becomes a vital step in machine learning to find the optimal combination of hyperparameters that yields the best predictive performance. 
                        Techniques for hyperparameter optimization include grid search, random search, Bayesian optimization, and automated machine learning (AutoML) tools.
                    </p>
{/* 
                    <p className="subsubsection-paragraph">
                        In the context of NLP, hyperparameters must be carefully selected to address the complexities of language data. For example, the size of word 
                        embeddings, the architecture of a neural network (such as the number of layers in a Transformer), or the window size in a model like Word2Vec, are all hyperparameters that 
                        can influence how well a model captures and processes linguistic patterns.
                    </p>

                    <p className="subsubsection-paragraph">
                        The ultimate goal of hyperparameter tuning in NLP is to achieve a model that not only performs well on the training data but also generalizes effectively to new, unseen 
                        textual data, thus maintaining high performance in real-world applications.
                    </p> */}

                    <h4>Search Techniques</h4>
                    <p className="subsubsection-paragraph">
                        Grid search is a brute force method to determine the best combination of hyperparameters. It involves exhaustively searching through a manually specified subset of the 
                        hyperparameter space. For a given model, a grid search algorithm would:
                        <ol>
                            <li>Define a grid over the model's hyperparameter space that specifies the hyperparameter combinations to be evaluated.</li>
                            <li>Evaluate the model's performance for each combination using a cross-validated training approach.</li>
                            <li>Select the combination that provides the best performance metric, typically accuracy or loss.</li>
                        </ol>
                        The process can be represented mathematically as finding the arg max (or arg min for a loss function) over the grid:
                        <div className="custom-math-size"><BlockMath math="\text{arg max}_{(h_1, h_2, \ldots, h_k) \in H} \; \text{Performance}(M(h_1, h_2, \ldots, h_k))" /></div>
                        where <InlineMath math="H" /> is the set of all hyperparameter combinations in the grid, and <InlineMath math="M" /> denotes the model trained with 
                        hyperparameters <InlineMath math="h_1, h_2, \ldots, h_k" />.
                    </p>
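                    <p className="subsubsection-paragraph">
                        As a rough sketch of what this costs, suppose a hypothetical grid with 3 learning rates, 4 batch sizes, and 2 dropout rates, evaluated with 5-fold cross-validation:
                        <BlockMath math="|H| = 3 \cdot 4 \cdot 2 = 24 \text{ combinations}, \qquad 24 \cdot 5 = 120 \text{ training runs}" />
                        Adding one more hyperparameter with 5 candidate values multiplies this to <InlineMath math="120 \cdot 5 = 600" /> runs, which is why exhaustive grids quickly become 
                        impractical as the number of hyperparameters grows.
                    </p>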
                    
                    <p className="subsubsection-paragraph">
                        Random search differs from grid search in that it samples hyperparameter combinations randomly rather than exhaustively. The search space is defined in the same way, 
                        but instead of trying out every single combination, a set number of combinations are randomly selected and evaluated. This can be more efficient than grid search, 
                        especially when dealing with a large number of hyperparameters, and is mathematically represented as:
                        <div className="custom-math-size"><BlockMath math="\text{arg max}_{(h_1, h_2, \ldots, h_k) \sim \mathcal{D}} \; \text{Performance}(M(h_1, h_2, \ldots, h_k))" /></div>
                        where <InlineMath math="\mathcal{D}" /> is the probability distribution over the hyperparameter space from which samples are drawn.
                    </p>
                    
                    <p className="subsubsection-paragraph">
                        Bayesian optimization is a strategy for the optimization of hyperparameters that builds a probabilistic surrogate model mapping hyperparameters to a distribution over 
                        the performance metric. Surrogates such as Gaussian Processes model the objective function, and an acquisition function then chooses the next hyperparameters to evaluate 
                        by balancing exploration and exploitation of the search space.
                        <div className="custom-math-size"><BlockMath math="\text{arg max}_{h \in H} \; \mathbb{E}[\text{Performance}(M(h)) | \text{data}]" /></div>
                        where <InlineMath math="\mathbb{E}" /> denotes the expectation under the probabilistic model, and "data" includes the observations made so far.
                    </p>
                    
                    {/* <p className="subsubsection-paragraph">
                        In NLP, these optimization techniques are used to find the best hyperparameters for complex models like RNNs, LSTMs, and Transformers. Due to the high cost of training 
                        such models, efficient hyperparameter tuning is crucial. Grid search may be practical for smaller models or when the range of possible hyperparameter values is 
                        limited. However, random search and Bayesian optimization are often preferred for larger models due to their efficiency and effectiveness, even when the hyperparameter 
                        space is high-dimensional and complex, as is typical in NLP applications.
                    </p> */}


                </section>
                
                <section id="regularization" className="code-cleaned">
                    <h2>Regularization</h2>
                    <p className="subsubsection-paragraph"></p>

                    <h4>Overfitting</h4>
                    <p className="subsubsection-paragraph">
                        Overfitting occurs when a machine learning model learns the training data too closely, capturing noise and fluctuations that do not generalize to unseen data. While an 
                        overfitted model may achieve low training error, it typically performs poorly on validation or test data, indicating a lack of generalization.
                    </p>
                    
                    <p className="subsubsection-paragraph">
                        The bias-variance trade-off, which we discussed earlier, is a fundamental concept that provides insight into the problem of overfitting. It can be described mathematically 
                        by decomposing the expected mean squared error of a model:
                        <BlockMath math="\text{E}[(Y - \hat{f}(X))^2] = \text{Bias}[\hat{f}(X)]^2 + \text{Var}[\hat{f}(X)] + \sigma^2" />
                        where:
                        <ul>
                            <li><InlineMath math="\text{Bias}[\hat{f}(X)]^2" /> represents the error due to the model's simplifying assumptions.</li>
                            <li><InlineMath math="\text{Var}[\hat{f}(X)]" /> is the error from the model's sensitivity to fluctuations in the training data.</li>
                            <li><InlineMath math="\sigma^2" /> denotes the irreducible error inherent in the data.</li>
                        </ul>
                    </p>
                    
                    <p className="subsubsection-paragraph">
                        Several techniques can mitigate overfitting, aiming to balance bias and variance:
                        <ul>
                            <li><b>Regularization:</b> Introducing regularization terms like L1 (lasso) or L2 (ridge) in the loss function can penalize large weights, effectively reducing variance.</li>
                            <li><b>Cross-Validation:</b> Using cross-validation techniques helps ensure that the model's performance is consistent across different subsets of the data.</li>
                            <li><b>Pruning:</b> In models like decision trees, pruning can remove branches that have little power to classify instances, thus reducing complexity and variance.</li>
                            <li><b>Ensembling:</b> Techniques like bagging and boosting can combine multiple models to reduce variance without substantially increasing bias.</li>
                        </ul>
                    </p>

                    <h4>L1 & L2 Norms</h4>

                    <p className="subsubsection-paragraph">
                        The L1 norm of a vector, also known as the Manhattan norm or ℓ1 norm, is the sum of the absolute values of its components. Mathematically, for a 
                        vector <InlineMath math="\mathbf{w}" />, it is defined as:
                        <BlockMath math="\|\mathbf{w}\|_1 = \sum_{i=1}^{n} |w_i|" />
                        L1 regularization, often referred to as lasso regularization, involves adding the L1 norm of the model's weight vector to the loss function. This can be represented as:
                        <BlockMath math="\text{Loss}_{\text{lasso}}(\mathbf{w}) = \text{Loss}(\mathbf{w}) + \lambda \|\mathbf{w}\|_1" />
                        where <InlineMath math="\lambda" /> is the regularization parameter that controls the strength of the regularization. L1 regularization encourages sparsity in the weight
                         vector, often setting some weights to zero, which can serve as a form of automatic feature selection.
                    </p>
                    
                    <p className="subsubsection-paragraph">
                        The L2 norm, also known as the Euclidean norm or ℓ2 norm, is the square root of the sum of the squares of the vector's components. For 
                        vector <InlineMath math="\mathbf{w}" />, it is:
                        <BlockMath math="\|\mathbf{w}\|_2 = \sqrt{\sum_{i=1}^{n} w_i^2}" />
                        L2 regularization, commonly known as ridge regularization, adds the square of the L2 norm to the loss function:
                        <BlockMath math="\text{Loss}_{\text{ridge}}(\mathbf{w}) = \text{Loss}(\mathbf{w}) + \lambda \|\mathbf{w}\|_2^2" />
                        Because the penalty grows with the square of each weight, L2 regularization penalizes large weights most heavily. It shrinks weights toward zero but, unlike L1, does not force them to be exactly zero.
                    </p>
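                    <p className="subsubsection-paragraph">
                        A small worked sketch, using the made-up weight vector <InlineMath math="\mathbf{w} = (3, -4, 0)" /> and an assumed regularization 
                        strength <InlineMath math="\lambda = 0.01" />:
                        <BlockMath math="\|\mathbf{w}\|_1 = |3| + |-4| + |0| = 7, \qquad \|\mathbf{w}\|_2 = \sqrt{3^2 + (-4)^2 + 0^2} = 5" />
                        <BlockMath math="\lambda \|\mathbf{w}\|_1 = 0.01 \cdot 7 = 0.07, \qquad \lambda \|\mathbf{w}\|_2^2 = 0.01 \cdot 25 = 0.25" />
                        The weight that is already zero contributes nothing to either penalty, while the largest weight dominates the squared L2 term (16 of the 25).
                    </p>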

                    <p className="subsubsection-paragraph">
                        By controlling the complexity of the model, 
                        regularization improves the generalization abilities of the model to new, unseen data. The regularization term's influence is controlled by the 
                        parameter <InlineMath math="\lambda" />, which when tuned correctly, can help find a good balance between bias and variance.
                    </p>

                    <p className="subsubsection-paragraph">
                        In NLP, regularization is particularly important due to the high dimensionality of the data. For example, in text classification tasks, there could be thousands of 
                        features corresponding to the vocabulary of the text corpus. L1 regularization is useful when we believe many features are irrelevant, as it can produce a more 
                        interpretable model with only a subset of features. L2 regularization is typically employed when we expect that many features could be important but with small effects, 
                        which is often the case with text data where the semantic effect of words can be subtle.
                    </p>

                    <h4>Dropout</h4>
                    <p className="subsubsection-paragraph">
                        Dropout is a regularization technique used in neural networks to prevent overfitting. The key idea is to randomly 'drop' units (neurons) and their connections from the 
                        neural network during training. At each training step, every neuron is dropped independently with probability <InlineMath math="p" />. For a 
                        given neuron's output <InlineMath math="x" />, the dropout operation can be represented as:
                        <BlockMath math="x' = x \cdot m, \quad m \sim \text{Bernoulli}(1-p)" />
                        where the mask <InlineMath math="m" /> is 1 with probability <InlineMath math="1-p" /> (the neuron is kept) and 0 with 
                        probability <InlineMath math="p" /> (the dropout probability). This makes the layer's activations sparse during training, so the network cannot rely on any single neuron and is pushed to learn more robust features.
                    </p>

                    <p className="subsubsection-paragraph">
                        Dropout layers are integrated into neural networks at specific points, typically after the fully connected layers. During training, these layers randomly drop a fraction 
                        of the units in the preceding layer, while during inference (testing), dropout is deactivated, and the network uses all neurons.
                    </p>
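
                    <p className="subsubsection-paragraph">
                        Dropping units changes the average size of a layer's output, which is why implementations usually rescale the activations. A short sketch of the reasoning, writing <InlineMath math="m" /> for a random mask (notation assumed here) that is 0 with probability <InlineMath math="p" /> and 1 otherwise:
                        <BlockMath math="\mathbb{E}[x \cdot m] = (1-p) \cdot x + p \cdot 0 = (1-p)\,x" />
                        One common fix, often called inverted dropout, divides the kept activations by <InlineMath math="1-p" /> during training so that the expected output matches the full network used at inference:
                        <BlockMath math="x' = \frac{x \cdot m}{1-p}, \qquad \mathbb{E}[x'] = x" />
                        For example, with <InlineMath math="p = 0.5" /> each surviving activation is doubled during training.
                    </p>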

                    <p className="subsubsection-paragraph">
                        Dropout acts as a form of regularization as it reduces the complex co-adaptations of neurons on the training data. By dropping different sets of neurons, it's akin to 
                        training a large ensemble of networks with shared weights, which improves generalization to new data. The dropout rate <InlineMath math="p" /> is a hyperparameter that can 
                        be tuned, with typical values ranging from 0.2 to 0.5.
                    </p>

                    <p className="subsubsection-paragraph">
                        Dropout is particularly effective in NLP due to the high dimensionality and complexity of language data. For instance, it helps mitigate overfitting in deep learning models such as RNNs, LSTMs, and 
                        Transformers, which are used for tasks like language modeling, text classification, and machine translation.
                    </p>

                    {/* <p className="subsubsection-paragraph">
                        By randomly omitting parts of the neural representations at each training iteration, dropout encourages the model to develop more robust and generalized representations of the 
                        linguistic features. This is crucial in NLP where the model needs to capture the essence of language patterns without overfitting to the peculiarities (e.g., random emojis) of the training dataset.
                    </p> */}

                    <h4>Early Stopping</h4>

                    <p className="subsubsection-paragraph">
                        Early stopping is a regularization technique used during the training of a machine learning model. The method involves monitoring the model's performance on a validation 
                        set during training and stopping the training process once the performance begins to degrade, indicating the onset of overfitting.
                    </p>
                    
                    <p className="subsubsection-paragraph">
                        The training process of a model involves iteratively minimizing a loss function to improve its accuracy. However, after a certain point, further training can lead the model 
                        to start learning noise and patterns specific to the training set, thereby losing its generalization ability. Early stopping intervenes by tracking a performance metric 
                        (like validation loss or accuracy) and halting the training when this metric stops improving or starts worsening. Mathematically, the stopping criterion can be defined as:
                        <BlockMath math="\text{if } \text{ValidationLoss}_{t} > \text{ValidationLoss}_{t-k}, \text{ then stop}" />
                        where <InlineMath math="\text{ValidationLoss}_{t}" /> is the validation loss at epoch <InlineMath math="t" />, and <InlineMath math="k" /> is a patience parameter: 
                        training stops once the validation loss is higher than it was <InlineMath math="k" /> epochs earlier, which tolerates brief fluctuations before halting.
                    </p>
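
                    <p className="subsubsection-paragraph">
                        As a small worked illustration, suppose (with made-up numbers) that the validation loss over five epochs is
                        <BlockMath math="0.90,\; 0.70,\; 0.60,\; 0.62,\; 0.65" />
                        and <InlineMath math="k = 2" />. At epoch 4 the check compares 0.62 against 0.70 (epoch 2), so training continues. At epoch 5 it compares 0.65 against 0.60 (epoch 3):
                        <BlockMath math="\text{ValidationLoss}_{5} = 0.65 > \text{ValidationLoss}_{3} = 0.60" />
                        so training stops. A common practice is to keep the weights saved at the epoch with the lowest validation loss, here epoch 3.
                    </p>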

                    <p className="subsubsection-paragraph">
                        Early stopping acts as a form of regularization by limiting how long the model is optimized, and therefore how closely it can fit the training data. It 
                        balances the model's ability to learn complex patterns against the risk of learning noise from the training data. The key advantage of early stopping is its simplicity: 
                        it doesn't require modifying the underlying model architecture or the learning algorithm.
                    </p>
                    
                </section>
                
                
                <div className="subsubsection-navigation">
                    <Link to="/foundations/probstat">← Probability & Statistics</Link>
                    <Link to="/NLPBasics">NLP Basics →</Link>
                </div>
            </main>
            
            <Footer />
        </div>
    );
}

export default NeuralNetworks;
