import React from 'react';
import '../../styles/subsection.css';
import Header from '../../components/Header';
import Footer from '../../components/Footer';
import { Link } from 'react-router-dom';
import 'katex/dist/katex.min.css';
import { InlineMath, BlockMath } from 'react-katex';

function Probability() {
    return (
        <div className="subsubsection-container">
            <Header />
            <div className="side-nav-container">
                <aside className="subsubsection-side-nav">
                    <a href="#probability">Probability</a>
                    <a href="#bayes">Bayes' Theorem</a>
                    <a href="#rv">Random Variables</a>
                    <a href="#statistics">Statistics</a>
                    <a href="#regression">Regression</a>

                </aside>
            </div>
            
            <main className="subsubsection-content">
                <div className="titles"><h1>Probability & Statistics</h1></div>
                
                <section id="intro" className="code-cleaned">
                <p className="subsubsection-paragraph">
                        Probability and statistics are foundational to NLP (as is the case with most of ML). They provide the framework for dealing with uncertainty and variability 
                        in language, enabling the development of algorithms and models that can understand, interpret, and generate human language.
                    </p>

                    <p className="subsubsection-paragraph">
                        <b>Probability</b> is the study of randomness and uncertainty. It quantifies the likelihood of events and forms the basis of statistical inference.
                    </p>

                    <p className="subsubsection-paragraph">
                        <b>Statistics</b> is the discipline that concerns the collection, analysis, interpretation, presentation, and organization of data.
                    </p>

                    <p className="subsubsection-paragraph">
                        Probabilistic models and statistical methods are employed extensively for various tasks:
                        <ul>
                            <li><b>Language Modeling:</b> Probabilistic language models, such as n-gram models and hidden Markov models, estimate the probability of a sequence of words or characters, 
                            which is fundamental in applications like speech recognition and machine translation.</li>
                            <li><b>Text Classification:</b> Statistical methods are used to develop algorithms for classifying texts into different categories based on features extracted from the 
                            texts. Techniques like Naive Bayes classifiers rely heavily on probability theory.</li>
                            <li><b>Sentiment Analysis:</b> By applying statistical analysis to text data, NLP models can determine the sentiment expressed in the text, which is widely used in customer
                             service, marketing, and social media monitoring.</li>
                            <li><b>Topic Modeling:</b> Algorithms like Latent Dirichlet Allocation use statistical methods to discover abstract topics within a collection of documents.</li>
                        </ul>
                    </p>

                    <p className="subsubsection-paragraph">
                        Moreover, statistical testing and data analysis techniques are crucial for evaluating the performance of NLP models, ensuring their validity and robustness. 
                    </p>

                    </section>

                    <section id="probability" className="code-cleaned">
                    <h2>Core Probability</h2>
                    

                    <p className="subsubsection-paragraph">
                        Probability is a mathematical framework that quantifies the likelihood of events occurring in a random experiment.
                    </p>

                    <p className="subsubsection-paragraph">
                    The probability of an event <InlineMath math="A" /> within a sample space <InlineMath math="S" /> is defined based on the following axioms:
                    <ul>
                        <li><InlineMath math="0 \leq P(A) \leq 1" />: The probability of any event is a non-negative number that does not exceed 1.</li>
                        <li><InlineMath math="P(S) = 1" />: The probability that some outcome in the sample space will occur is 1.</li>
                        <li>If <InlineMath math="A_1, A_2, A_3, \ldots" /> are mutually exclusive events (i.e., no two events have any outcomes in common), 
                        then <InlineMath math="P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)" />, which is the countable additivity property.</li>
                    </ul>
                    For finite sample spaces, if each outcome is equally likely, then the probability of an event <InlineMath math="A" /> can be simplified to:
                    <BlockMath math="P(A) = \frac{|A|}{|S|}" />
                    However, in the general case, the probability of <InlineMath math="A" /> is determined by the measure <InlineMath math="P" /> defined on the subsets 
                    of <InlineMath math="S" />, which satisfies the above axioms.
                </p>

                    <p className="subsubsection-paragraph">
                        As a simple example, the probability of rolling a 4 on a standard six-sided die is <InlineMath math="P(\text{'roll a 4'}) = \frac{1}{6}" />, as there is one favorable outcome 
                        (rolling a 4) out of six possible outcomes.
                    </p>
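
                    <p className="subsubsection-paragraph">
                        As a further worked example of the additivity axiom, assuming the die is fair, the event 'roll an even number' is the union of three mutually exclusive outcomes, so its probability is the sum of theirs:
                        <BlockMath math="P(\text{'even'}) = P(\{2\}) + P(\{4\}) + P(\{6\}) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}" />
                        This agrees with the counting formula, since <InlineMath math="|A| / |S| = 3 / 6" />.
                    </p>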

                    <p className="subsubsection-paragraph">
                        There is not much more to add here; these basics are used throughout everything that follows.
                    </p>

                </section>
                
                <section id="bayes" className="code-cleaned">
                    <h2>Bayesian Probability</h2>
                    <p className="subsubsection-paragraph"></p>

                    <h4>Bayes' Theorem</h4>
                    <p className="subsubsection-paragraph">
                        Bayes' Theorem is a fundamental principle in probability theory that describes how to update the probabilities of hypotheses when given evidence. It forms the basis of 
                        Bayesian probability, a framework for probabilistic modeling that has found extensive applications in NLP.
                    </p>

                    <p className="subsubsection-paragraph">
                        Bayes' Theorem relates the conditional and marginal probabilities of random events. It is expressed as:
                        <BlockMath math="P(A | B) = \frac{P(B | A) \, P(A)}{P(B)}" />
                        where <InlineMath math="P(A | B)" /> is the probability of event <InlineMath math="A" /> given that <InlineMath math="B" /> is true, <InlineMath math="P(B | A)" /> is the 
                        probability of event <InlineMath math="B" /> given that <InlineMath math="A" /> is true, <InlineMath math="P(A)" /> is the probability of event <InlineMath math="A" />, 
                        and <InlineMath math="P(B)" /> is the probability of event <InlineMath math="B" />.
                    </p>

                    <p className="subsubsection-paragraph">
                        For example, if <InlineMath math="A" /> represents a specific class of text (like 'spam' or 'not spam'), and <InlineMath math="B" /> represents the occurrence of 
                        certain words in a text, Bayes' Theorem can be used to compute the probability that the text belongs to a certain class based on the presence of these words.
                    </p>
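
                    <p className="subsubsection-paragraph">
                        To make this concrete, here is a worked example with made-up numbers (they are illustrative assumptions, not measurements). Suppose 20% of emails are spam, the word 'free' appears in 50% of spam emails, and it appears in 5% of non-spam emails. The law of total probability gives the denominator:
                        <BlockMath math="P(\text{'free'}) = 0.5 \times 0.2 + 0.05 \times 0.8 = 0.14" />
                        and Bayes' Theorem then gives:
                        <BlockMath math="P(\text{spam} \mid \text{'free'}) = \frac{0.5 \times 0.2}{0.14} \approx 0.71" />
                        Under these assumed numbers, seeing the word raises the probability of spam from 0.20 to about 0.71.
                    </p>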

                    <p className="subsubsection-paragraph">
                        Bayesian probability provides a statistical approach to NLP that incorporates prior knowledge and evidence. It allows for the creation of models that can learn and adapt 
                        based on new data, making it highly suitable for language tasks where the context and usage of language evolve over time.
                    </p>

                    <p className="subsubsection-paragraph">
                        These kinds of Bayesian models are used in various applications, such as:
                        <ul>
                            <li><b>Text Classification:</b> Bayesian classifiers, like the Naive Bayes classifier, are widely used for categorizing texts into different classes based on word frequencies and their conditional probabilities.</li>
                            <li><b>Spam Filtering:</b> Bayesian spam filtering uses Bayes' Theorem to predict the likelihood that an email is spam, based on the probabilities of certain words appearing in spam and non-spam emails.</li>
                            <li><b>Language Modeling:</b> Bayesian models are employed in building language models that predict the probability of word sequences, which is crucial in applications like speech recognition and machine translation.</li>
                        </ul>
                        These models are particularly valued for their ability to handle uncertainty, incorporate prior linguistic knowledge, and continuously update their understanding based on 
                        new data.
                    </p>

                    <p className="subsubsection-paragraph">
                        Bayesian methods exemplify how probabilistic modeling can effectively capture the complexities and nuances of human language, providing a robust framework for 
                        developing versatile and adaptive NLP applications.
                    </p>

                    <h4>Probabilistic Modeling</h4>
                    <p className="subsubsection-paragraph">
                        Probabilistic modeling is a general statistical approach used extensively in various fields to model uncertainty and make 
                        predictions based on observed data. It involves the use of probability theory to construct models that can predict or infer unknown quantities.
                    </p>

                    <p className="subsubsection-paragraph">
                        A probabilistic model assigns probabilities to different outcomes based on certain input data. For example, consider a simple model for coin flipping:
                        <BlockMath math="P(\text{'Head'}) = 0.5, \quad P(\text{'Tail'}) = 0.5" />
                        This model assumes an equal probability for heads and tails in a fair coin flip.
                    </p>

                    <p className="subsubsection-paragraph">
                        In NLP, a probabilistic model might be used to determine the likelihood of a particular word sequence in a sentence. For instance, a bigram model, 
                        which considers pairs of adjacent words, assigns probabilities to word sequences based on observed frequencies in a language corpus. This kind of modeling 
                        shows up repeatedly, in various forms, throughout NLP.
                    </p>
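
                    <p className="subsubsection-paragraph">
                        A common way to estimate bigram probabilities is by relative frequency (the maximum likelihood estimate). As a sketch, writing <InlineMath math="c(\cdot)" /> for a count in the corpus, a notation introduced here only for illustration:
                        <BlockMath math="P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}" />
                        The probability of a whole sequence is then approximated by chaining these conditional probabilities:
                        <BlockMath math="P(w_1, \ldots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})" />
                        Under this estimate any word pair that never occurs in the corpus gets probability zero, so some form of smoothing is usually applied in practice.
                    </p>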

                </section>


                <section id="rv" className="code-cleaned">
                    <h2>Random Variables</h2>
                    <p className="subsubsection-paragraph"></p>

                    <p className="subsubsection-paragraph">
                        In probability theory, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. Random variables are central to probabilistic 
                        analysis and statistical modeling, serving as a bridge between abstract probability concepts and real-world observations.
                    </p>


                    <p className="subsubsection-paragraph">
                        A random variable is typically denoted by a capital letter, such as <InlineMath math="X" />, <InlineMath math="Y" />, or <InlineMath math="Z" />. It can be classified into 
                        two main types:
                        <ul>
                            <li><b>Discrete Random Variables:</b> These take on a countable number of distinct values. An example is the roll of a die, where <InlineMath math="X" /> can be any 
                            integer from 1 to 6.</li>
                            <li><b>Continuous Random Variables:</b> These can take on any value in a continuous range. An example is the measurement of time or temperature.</li>
                        </ul>
                    </p>

                    <p className="subsubsection-paragraph">
                        The value of a random variable is determined by the outcome of a random process. For instance, if <InlineMath math="X" /> represents the outcome of a fair die roll, 
                        then <InlineMath math="X" /> is a discrete random variable that can take one of the values [1, 2, 3, 4, 5, 6] with equal probability.
                    </p>

                    <p className="subsubsection-paragraph">
                        Random variables have associated probability distributions that describe the likelihood of each of their outcomes. These distributions provide the foundation for defining 
                        important concepts such as expected value (mean), variance, and standard deviation, which are essential in understanding the behavior of random variables.
                    </p>

                    <p className="subsubsection-paragraph">
                        The expected value or mean of a random variable provides a measure of the 'center' of its distribution, while the variance and standard deviation provide measures of the 
                        'spread' or variability around this center.
                    </p>

                    {/* <p className="subsubsection-paragraph">
                        Future sections will expand on how the probability distributions of random variables are characterized and utilized, delving into the concepts of PDFs and CDFs, which are 
                        vital for a comprehensive understanding of statistical analysis in NLP. Again, much of this stuff should just be a review. 
                    </p> */}

                    <h4>Probability Functions</h4>

                    <p className="subsubsection-paragraph">
                        Probability Density Functions (PDFs) are fundamental in statistics and probability theory. 
                        They provide a description of the relative likelihood for a random variable to take on a given value.
                    </p>

                    <p className="subsubsection-paragraph">
                        For a continuous random variable <InlineMath math="X" />, the probability density function <InlineMath math="f(x)" /> describes the probability distribution 
                        of <InlineMath math="X" />. It is defined such that the probability of <InlineMath math="X" /> falling in a particular interval is given by the integral of the PDF over 
                        that interval:
                        <BlockMath math="P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx" />
                        Key characteristics of a PDF include:
                        <ul>
                            <li>The value of the PDF is non-negative for all <InlineMath math="x" />: <InlineMath math="f(x) \geq 0" />.</li>
                            <li>The total area under the PDF curve is 1, representing the total probability: <BlockMath math="\int_{-\infty}^{\infty} f(x) \, dx = 1" /></li>
                        </ul>
                    </p>

                    <p className="subsubsection-paragraph">
                        A classic example of a PDF is the normal distribution, often represented as:
                        <BlockMath math="f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2}" />
                        where <InlineMath math="\mu" /> is the mean and <InlineMath math="\sigma" /> is the standard deviation.
                    </p>

                    <p className="subsubsection-paragraph">
                        PDFs, together with their discrete counterparts, probability mass functions (PMFs), are used to model the distribution of various linguistic features. For example, the length of sentences in a corpus, the frequency of certain words, or the 
                        distribution of word embeddings can all be analyzed this way. Understanding these distributions aids in tasks like anomaly detection (identifying outliers in text data), 
                        content generation, and probabilistic language modeling.
                    </p>

                    <h4>Distributions</h4>

                    <p className="subsubsection-paragraph">
                        Cumulative Distribution Functions (CDFs) are crucial in statistics for summarizing the distribution and probabilities associated with a random variable. They provide a 
                        comprehensive view of the probability structure of a random variable, both discrete and continuous.
                    </p>

                    <p className="subsubsection-paragraph">
                        The CDF of a random variable <InlineMath math="X" />, denoted as <InlineMath math="F(x)" />, is defined as the probability that <InlineMath math="X" /> will take a value 
                        less than or equal to <InlineMath math="x" />:
                        <BlockMath math="F(x) = P(X \leq x)" />
                        For a continuous random variable with a probability density function <InlineMath math="f(x)" />, the CDF is the integral of <InlineMath math="f(x)" />:
                        <BlockMath math="F(x) = \int_{-\infty}^{x} f(t) \, dt" />
                        Key properties of a CDF include:
                        <ul>
                            <li>It is non-decreasing: <InlineMath math="F(x)" /> increases or remains constant as <InlineMath math="x" /> increases.</li>
                            <li>It approaches 0 as <InlineMath math="x" /> approaches negative infinity, and it approaches 1 as <InlineMath math="x" /> approaches positive infinity.</li>
                        </ul>
                    </p>
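
                    <p className="subsubsection-paragraph">
                        As a simple worked example, take <InlineMath math="X" /> to be uniformly distributed on the interval from 0 to 1 (a distribution chosen here only because the arithmetic is easy). Its PDF and CDF on that interval are:
                        <BlockMath math="f(x) = 1, \quad F(x) = \int_{0}^{x} 1 \, dt = x \quad \text{for } 0 \leq x \leq 1" />
                        so the probability of landing in a sub-interval is a difference of CDF values:
                        <BlockMath math="P(0.2 \leq X \leq 0.5) = F(0.5) - F(0.2) = 0.3" />
                    </p>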

                    <p className="subsubsection-paragraph">
                        For instance, in the case of a normal distribution with mean <InlineMath math="\mu" /> and standard deviation <InlineMath math="\sigma" />, the CDF is used to determine the 
                        probability of observing values in a certain range.
                    </p>
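
                    <p className="subsubsection-paragraph">
                        For example, the probability that a normally distributed <InlineMath math="X" /> falls within one standard deviation of its mean can be written as a difference of CDF values, where <InlineMath math="\Phi" /> denotes the standard normal CDF:
                        <BlockMath math="P(\mu - \sigma \leq X \leq \mu + \sigma) = F(\mu + \sigma) - F(\mu - \sigma) = \Phi(1) - \Phi(-1) \approx 0.683" />
                    </p>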

                    {/* <p className="subsubsection-paragraph">
                        In NLP, CDFs are useful for understanding the distribution of various linguistic features, such as word frequencies, sentence lengths, or features derived from text 
                        embeddings. They help in making probabilistic statements about the data, such as determining thresholds for outlier detection or for segmenting texts based on feature values.
                        CDFs are also important in tasks like topic modeling or in algorithms where understanding the cumulative distribution of words and phrases aids in the interpretation and 
                        decision-making process. For example, in sentiment analysis, CDFs can be used to determine threshold values that distinguish between different sentiment classes.
                    </p> */}

                    <h4>Expectation, Variance, and Moments</h4>
                    <p className="subsubsection-paragraph">
                        Expectation, variance, and moments are fundamental concepts in probability and statistics that provide crucial insights into the characteristics of random variables and 
                        their distributions.
                    </p>

                    <p className="subsubsection-paragraph">
                        The expectation or expected value of a random variable is a measure of the central tendency of its distribution. For a random variable <InlineMath math="X" />, it is defined as:
                        <BlockMath math="\text{E}[X] = \sum_{x} x P(X=x)" />
                        for discrete variables, or 
                        <BlockMath math="\text{E}[X] = \int_{-\infty}^{\infty} x f(x) \, dx" />
                        for continuous variables, where <InlineMath math="f(x)" /> is the probability density function of <InlineMath math="X" />.
                    </p>

                    <p className="subsubsection-paragraph">
                        Variance quantifies the spread or variability of a distribution around its mean. The variance of <InlineMath math="X" /> is defined as:
                        <BlockMath math="\text{Var}(X) = \text{E}[(X - \text{E}[X])^2]" />
                        It measures the average squared deviation from the mean, providing a statistic summarizing the distribution's dispersion.
                    </p>
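
                    <p className="subsubsection-paragraph">
                        As a worked example, let <InlineMath math="X" /> be the outcome of a fair six-sided die. Its expectation is:
                        <BlockMath math="\text{E}[X] = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5" />
                        Using the equivalent form <InlineMath math="\text{Var}(X) = \text{E}[X^2] - (\text{E}[X])^2" />, the variance is:
                        <BlockMath math="\text{Var}(X) = \frac{1 + 4 + 9 + 16 + 25 + 36}{6} - 3.5^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.92" />
                    </p>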

                    <p className="subsubsection-paragraph">
                        Moments are a set of parameters that provide a description of the shape of a distribution. The <InlineMath math="n" />-th moment of a random variable about 
                        the origin is given by:
                        <BlockMath math="\mu_n' = \text{E}[X^n]" />
                        The first moment is the mean, and the second central moment (the second moment about the mean) is the variance. Skewness and kurtosis, the standardized third and 
                        fourth central moments, describe the asymmetry and the tail heaviness of the distribution, respectively.
                    </p>

                    <p className="subsubsection-paragraph">
                        In NLP, these statistical measures are essential in various tasks:
                        <ul>
                            <li><b>Text Analysis:</b> The expectation and variance can describe characteristics of word frequencies, sentence lengths, and other linguistic features, aiding in tasks
                             like topic modeling and sentiment analysis.</li>
                            <li><b>Modeling and Prediction:</b> Understanding the moments of distributions is crucial in probabilistic modeling and machine learning applied to NLP, as it influences 
                            decisions about model design and feature selection.</li>
                            <li><b>Data Preprocessing:</b> These concepts are used in preprocessing steps like feature scaling and normalization, which are vital for the effective performance of NLP 
                            algorithms.</li>
                        </ul>
                    </p>

                    <h4>Jointly Distributed RVs</h4>

                    <p className="subsubsection-paragraph">
                        Jointly distributed random variables are used when understanding the relationships and dependencies between different variables in statistical analysis. 
                        When two or more random variables are considered together, their probability distribution is described as joint. For two discrete random 
                        variables <InlineMath math="X" /> and <InlineMath math="Y" />, the joint probability mass function (PMF) is given by:
                        <BlockMath math="P(X = x, Y = y)" />
                        which represents the probability that <InlineMath math="X" /> takes on value <InlineMath math="x" /> and <InlineMath math="Y" /> takes on value <InlineMath math="y" /> simultaneously. 
                        For continuous random variables, the joint probability distribution is described by a joint probability density function.
                    </p>
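
                    <p className="subsubsection-paragraph">
                        Two standard relations follow from the joint PMF. Summing over one variable recovers the marginal distribution of the other:
                        <BlockMath math="P(X = x) = \sum_{y} P(X = x, Y = y)" />
                        and <InlineMath math="X" /> and <InlineMath math="Y" /> are independent exactly when the joint PMF factorizes for every pair of values:
                        <BlockMath math="P(X = x, Y = y) = P(X = x) \, P(Y = y)" />
                        For instance, for two independent fair coin flips, each of the four possible pairs of outcomes has probability <InlineMath math="\frac{1}{2} \times \frac{1}{2} = \frac{1}{4}" />.
                    </p>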

                    {/* <p className="subsubsection-paragraph">
                        The concept of jointly distributed random variables finds extensive applications for NLP tasks, particularly in modeling complex 
                        language based relationships. For instance, the joint probability distributions are instrumental in understanding word co-occurrences, which is a cornerstone in language modeling 
                        and generation, as well as in tasks like topic modeling where the likelihood of words appearing together in texts conveys significant thematic information. Similarly, these 
                        distributions play a pivotal role in syntactic analysis, such as in dependency parsing, where the relationships between different parts of speech are crucial. Analyzing joint distributions also aids in correlation analysis within text datasets, offering insights into how various linguistic features interplay across different contexts or classifications. This deeper understanding of the intricate relationships in language data, facilitated by jointly distributed random variables, is key to advancing the sophistication and accuracy of NLP models and techniques, enhancing their ability to capture the nuances of human language.
                    </p> */}


                </section>

                <section id="statistics" className="code-cleaned">
                    <h2>Core Statistics</h2>
                    <p className="subsubsection-paragraph"></p>

                    <h4>CLT & LLN</h4>
                    <p className="subsubsection-paragraph">
                        The Central Limit Theorem and the Law of Large Numbers are fundamental concepts in probability theory and statistics.
                    </p>

                    <p className="subsubsection-paragraph">
                        The Central Limit Theorem states that the distribution of the suitably standardized sum (or average) of a large number of independent, identically distributed random variables approaches a 
                        normal distribution, regardless of the shape of the original distribution, provided its variance is finite. Mathematically, if <InlineMath math="X_1, X_2, \ldots, X_n" /> are independent, identically distributed random variables with
                         mean <InlineMath math="\mu" /> and finite variance <InlineMath math="\sigma^2" />, then the standardized sum <InlineMath math="(S_n - n\mu) / (\sigma\sqrt{n})" />, where <InlineMath math="S_n = X_1 + X_2 + \cdots + X_n" />, converges in distribution to a 
                         standard normal distribution as <InlineMath math="n" /> becomes large. The theorem is a cornerstone of statistical inference, allowing for the use of normal distribution 
                         assumptions in many real-world situations.
                    </p>

                    <p className="subsubsection-paragraph">
                        The Law of Large Numbers states that as the number of trials in a random experiment increases, the average of the results becomes closer to the expected value. 
                        In other words, the sample mean converges to the population mean as the sample size increases. This law underpins many statistical practices and ensures that empirical 
                        averages of random variables are reliable estimators of their expected values.
                    </p>
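
                    <p className="subsubsection-paragraph">
                        One way to see why this holds, sketched under the assumption that the observations are independent and identically distributed with mean <InlineMath math="\mu" /> and finite variance <InlineMath math="\sigma^2" />: the sample mean <InlineMath math="\bar{X}_n" /> of <InlineMath math="n" /> observations has variance
                        <BlockMath math="\text{Var}(\bar{X}_n) = \frac{\sigma^2}{n}" />
                        and Chebyshev's inequality then bounds the chance of a large deviation:
                        <BlockMath math="P\left(|\bar{X}_n - \mu| \geq \varepsilon\right) \leq \frac{\sigma^2}{n \varepsilon^2} \quad \text{for any } \varepsilon > 0" />
                        The right-hand side shrinks to 0 as <InlineMath math="n" /> grows, which is the weak form of the law.
                    </p>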

                    <h4>Hypothesis Testing</h4>

                    <p className="subsubsection-paragraph">
                        Hypothesis testing is a statistical method used to make inferences or decisions about population parameters based on sample data. 
                        It's a cornerstone of statistical analysis in various fields including ML tasks, where it aids in validating assumptions and models.
                    </p>

                    <p className="subsubsection-paragraph">
                        The process begins by proposing two hypotheses: the null hypothesis <InlineMath math="H_0" /> (a statement of no effect or no difference) and the alternative 
                        hypothesis <InlineMath math="H_1" /> or <InlineMath math="H_a" /> (a statement that indicates the presence of an effect or difference). 

                        The objective is to determine whether there is enough evidence in a sample of data to reject the null hypothesis in favor of the alternative hypothesis. This decision 
                        is made using a test statistic that measures the degree of agreement between the sample data and the null hypothesis. The test statistic is then compared to a critical 
                        value from a probability distribution (like the normal or t-distribution), which defines the rejection region for the null hypothesis.
                    </p>

                    <p className="subsubsection-paragraph">
                        The p-value, which is the probability of observing the test statistic or something more extreme under the assumption that the null hypothesis is true, is a crucial 
                        component of this decision. If the p-value is less than a predetermined significance level (commonly <InlineMath math="\alpha = 0.05" />), the null hypothesis is rejected, 
                        indicating that the results are statistically significant.
                    </p>

                    <p className="subsubsection-paragraph">
                        Suppose we are testing whether a new NLP algorithm improves the accuracy of sentiment analysis compared to an existing algorithm. Here, <InlineMath math="H_0" /> might be 
                        that there is no difference in accuracy, while <InlineMath math="H_a" /> is that the new algorithm has higher accuracy. We calculate the test statistic from the sample 
                        data (accuracy measurements) and compute the p-value from the distribution of that statistic under <InlineMath math="H_0" />. If the p-value is below the chosen significance level (for example, 0.05), we 
                        reject <InlineMath math="H_0" />, concluding that the improvement in accuracy is statistically significant.
                    </p>
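
                    <p className="subsubsection-paragraph">
                        A rough numerical sketch, with made-up numbers and a deliberate simplification: treat the existing algorithm's accuracy as a known constant <InlineMath math="p_0 = 0.80" />, and suppose the new algorithm classifies 850 of <InlineMath math="n = 1000" /> independent test examples correctly, so <InlineMath math="\hat{p} = 0.85" />. A one-sample z-test for a proportion gives:
                        <BlockMath math="z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}} = \frac{0.85 - 0.80}{\sqrt{0.80 \times 0.20 / 1000}} \approx 3.95" />
                        The one-sided critical value at <InlineMath math="\alpha = 0.05" /> is about 1.645, so this statistic falls in the rejection region and the p-value is well below 0.05. A real comparison of two algorithms on the same test set would usually call for a paired test instead.
                    </p>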

                    {/* <p className="subsubsection-paragraph">
                        In NLP, hypothesis testing is used to validate models, compare algorithms, and make informed decisions based on data. It is vital for assessing the effectiveness 
                        of different NLP techniques, such as tokenization methods, feature extraction techniques, or machine learning models. By applying hypothesis testing, NLP practitioners 
                        can determine whether the differences observed in their models' performance are due to chance or are statistically significant, thus guiding the development and 
                        optimization of NLP tools and applications.
                    </p> */}

                    <h4>Confidence Intervals</h4>

                    <p className="subsubsection-paragraph">
                        Confidence intervals (CIs) provide a range of values that are believed, with a certain degree of confidence, to contain the value of an unknown population parameter. 
                        The concept of a confidence interval encompasses the idea that an estimate should come with a measure of its precision.
                    </p>

                    <p className="subsubsection-paragraph">
                        Mathematically, a CI for a population mean, when the population standard deviation is known, is constructed as:
                        <BlockMath math="\bar{x} \pm z \left( \frac{\sigma}{\sqrt{n}} \right)" />
                        Here, <InlineMath math="\bar{x}" /> is the sample mean, <InlineMath math="z" /> is the z-score corresponding to the desired confidence level (e.g., 1.96 for 
                        95% confidence), <InlineMath math="\sigma" /> is the population standard deviation, and <InlineMath math="n" /> is the sample size.
                    </p>
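
                    <p className="subsubsection-paragraph">
                        As a worked example with assumed numbers, suppose a sample of <InlineMath math="n = 64" /> sentences has mean length <InlineMath math="\bar{x} = 20" /> words, and the population standard deviation is taken to be <InlineMath math="\sigma = 8" />. The 95% confidence interval for the mean sentence length is:
                        <BlockMath math="20 \pm 1.96 \left( \frac{8}{\sqrt{64}} \right) = 20 \pm 1.96 = (18.04, \; 21.96)" />
                    </p>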

                    <p className="subsubsection-paragraph">
                        When the population standard deviation is unknown, it's common to use the sample standard deviation <InlineMath math="s" /> in 
                        place of <InlineMath math="\sigma" />, with the t-distribution providing the critical values (for large samples these are close to the corresponding z-scores):
                        <BlockMath math="\bar{x} \pm t \left( \frac{s}{\sqrt{n}} \right)" />
                        where <InlineMath math="t" /> is the t-score from the t-distribution for the desired confidence level and degrees of freedom <InlineMath math="n-1" />.
                    </p>
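
                    <p className="subsubsection-paragraph">
                        To make the t-based interval concrete, here is a small worked example with made-up numbers (purely illustrative assumptions): a sample of 
                        size <InlineMath math="n = 16" /> with mean <InlineMath math="\bar{x} = 20" /> and standard deviation <InlineMath math="s = 4" />. With 15 degrees of 
                        freedom, the 95% critical value is approximately <InlineMath math="t = 2.131" />, giving:
                        <BlockMath math="20 \pm 2.131 \left( \frac{4}{\sqrt{16}} \right) = 20 \pm 2.131" />
                        The interval, roughly 17.87 to 22.13, is wider than the one a z-score of 1.96 would give, reflecting the extra uncertainty from estimating the standard deviation.
                    </p>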

                    <p className="subsubsection-paragraph">
                        Confidence intervals are used to assess the reliability of statistical estimates, such as the accuracy of a classification model or the average 
                        length of sentences in a corpus. For instance, when evaluating a new language model, a CI can provide a range within which we expect the true accuracy of the model to
                         fall, based on a sample test set. This is particularly useful for comparing models and understanding the variability of model performance due to different data samples.
                    </p>
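
                    <p className="subsubsection-paragraph">
                        For accuracy specifically, one common choice (an assumption here, since the text above does not name a method) is the normal-approximation interval for a 
                        proportion, where <InlineMath math="\hat{p}" /> is the accuracy observed on <InlineMath math="n" /> test examples:
                        <BlockMath math="\hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}" />
                        With hypothetical values <InlineMath math="\hat{p} = 0.90" /> and <InlineMath math="n = 1000" />, the 95% interval is roughly:
                        <BlockMath math="0.90 \pm 1.96 \sqrt{\frac{0.90 \times 0.10}{1000}} \approx 0.90 \pm 0.019" />
                        This approximation tends to be unreliable when the test set is small or the accuracy is very close to 0 or 1.
                    </p>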

                    <h4>Estimators</h4>

                    <p className="subsubsection-paragraph">
                        In statistics, an estimator is a rule or method for estimating an unknown parameter of a population based on observed data. The goal is to approximate the 
                        true value of the parameter as closely as possible.
                    </p>

                    <p className="subsubsection-paragraph">
                        A common estimator is the sample mean, used to estimate the population mean.
                        Another important estimator is the sample variance, used to estimate the population variance. It's given by:
                        <BlockMath math="s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2" />
                        Here, <InlineMath math="s^2" /> represents the sample variance, and <InlineMath math="\bar{x}" /> is the sample mean.
                    </p>
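
                    <p className="subsubsection-paragraph">
                        A tiny worked example may help; the four data points 2, 4, 6, 8 are made up for illustration. The sample mean and sample variance are:
                        <BlockMath math="\bar{x} = \frac{2 + 4 + 6 + 8}{4} = 5" />
                        <BlockMath math="s^2 = \frac{(-3)^2 + (-1)^2 + 1^2 + 3^2}{4 - 1} = \frac{20}{3} \approx 6.67" />
                        Dividing by <InlineMath math="n = 4" /> instead would give 5, which is smaller; on average, that version underestimates the population variance.
                    </p>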

                    <p className="subsubsection-paragraph">
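                        Estimators have key properties like unbiasedness, consistency, and efficiency. An estimator is unbiased if its expected value equals the true parameter value. 
                        For example, the sample mean <InlineMath math="\bar{x}" /> is an unbiased estimator of the population mean, and dividing by <InlineMath math="n-1" /> rather 
                        than <InlineMath math="n" /> is what makes the sample variance <InlineMath math="s^2" /> an unbiased estimator of the population variance.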
                    </p>
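
                    <p className="subsubsection-paragraph">
                        The unbiasedness of the sample mean follows in one line from the linearity of expectation. Assuming each 
                        observation <InlineMath math="x_i" /> has mean <InlineMath math="\mu" /> (the symbol used here for the population mean):
                        <BlockMath math="\text{E}[\bar{x}] = \text{E}\left[ \frac{1}{n} \sum_{i=1}^{n} x_i \right] = \frac{1}{n} \sum_{i=1}^{n} \text{E}[x_i] = \frac{1}{n} \cdot n\mu = \mu" />
                    </p>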

                    <p className="subsubsection-paragraph">
                        Consistency implies that as the sample size increases, the estimator converges in probability to the true parameter value. Efficiency refers to the estimator's variance 
                        being as low as possible among all unbiased estimators.
                    </p>

                    {/* <p className="subsubsection-paragraph">
                        In NLP, statistical estimators are fundamental in tasks such as language modeling, sentiment analysis, and text classification. For instance, the frequency of words in a 
                        document can be estimated using sample proportions, which are estimators of the true frequency in the language corpus. 

                        When developing probabilistic models for NLP, such as Naive Bayes classifiers, estimators are used to calculate probabilities based on training data. These probabilities 
                        are then used to make predictions or classify new text samples. These are fundamental objects and show up everywhere in various forms. 
                    </p> */}

                    {/* <h4>Sufficient Statistics</h4>

                    <p className="subsubsection-paragraph">
                        In the theory of statistical estimation, a sufficient statistic for a parameter is a function of the sample data that captures all necessary information about the 
                        parameter contained in the data. Mathematically, a statistic <InlineMath math="T(X)" /> is sufficient for a parameter <InlineMath math="\theta" /> if the conditional 
                        probability distribution of the data <InlineMath math="X" />, given <InlineMath math="T(X)" />, does not depend on <InlineMath math="\theta" />. This is formalized 
                        in the factorization theorem, which states:
                        <BlockMath math="f(x; \theta) = g(T(x); \theta) h(x)" />
                        where <InlineMath math="f(x; \theta)" /> is the probability density or mass function of <InlineMath math="X" />, <InlineMath math="g" /> is a function that depends 
                        on <InlineMath math="X" /> only through <InlineMath math="T(X)" />, and <InlineMath math="h(x)" /> is a function that does not depend on <InlineMath math="\theta" />.
                    </p>

                    <p className="subsubsection-paragraph">
                        A classic example is the sample mean as a sufficient statistic for the population mean in a normal distribution with known variance. The sample mean contains all the 
                        information needed to estimate the population mean, making it sufficient.
                    </p>

                    <p className="subsubsection-paragraph">
                        Sufficient statistics are valuable because they reduce the data to a simpler form without losing information about the parameter being estimated. This simplification is 
                        particularly useful in complex data analysis, as it allows for more efficient and robust statistical inference. 
                    </p> */}

                </section>

                <section id="regression" className="code-cleaned">

                    <h2>Regression</h2>
                    <p className="subsubsection-paragraph"></p>

                    <h4>Underlying Mathematics</h4>

                    <p className="subsubsection-paragraph">
                        Regression analysis is a statistical method used for modeling the relationship between a dependent variable and one or more independent variables. The 
                        simplest form is linear regression, where this relationship is modeled as a linear function.
                    </p>

                    <p className="subsubsection-paragraph">
                        In linear regression, the model assumes a linear relationship between the dependent variable <InlineMath math="Y" /> and independent 
                        variables <InlineMath math="X_1, X_2, ..., X_k" />. The model can be represented as:
                        <BlockMath math="Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k + \epsilon" />
                        Here, <InlineMath math="\beta_0, \beta_1, ..., \beta_k" /> are the coefficients, and <InlineMath math="\epsilon" /> is the error term.
                    </p>

                    <p className="subsubsection-paragraph">
                        The Ordinary Least Squares (OLS) method is commonly used to estimate the coefficients. The goal is to minimize the sum of squared residuals, leading to the 
                        OLS estimators. For a simple linear regression with one independent variable, the OLS estimators <InlineMath math="\hat{\beta}_0" /> and <InlineMath math="\hat{\beta}_1" /> can 
                        be derived as:
                        <BlockMath math="\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}" />
                        <BlockMath math="\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}" />
                        where <InlineMath math="\bar{x}" /> and <InlineMath math="\bar{y}" /> are the sample means of <InlineMath math="X" /> and <InlineMath math="Y" />, respectively.
                    </p>
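
                    <p className="subsubsection-paragraph">
                        As a sketch of how these formulas work in practice, take three made-up points (1, 2), (2, 3), and (3, 5), chosen only for illustration. The 
                        sample means are <InlineMath math="\bar{x} = 2" /> and <InlineMath math="\bar{y} = 10/3" />, so:
                        <BlockMath math="\hat{\beta}_1 = \frac{(-1)(-\frac{4}{3}) + (0)(-\frac{1}{3}) + (1)(\frac{5}{3})}{(-1)^2 + 0^2 + 1^2} = \frac{3}{2}" />
                        <BlockMath math="\hat{\beta}_0 = \frac{10}{3} - \frac{3}{2} \cdot 2 = \frac{1}{3}" />
                        The fitted line is therefore <InlineMath math="\hat{y} = \frac{1}{3} + \frac{3}{2} x" />.
                    </p>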

                    <p className="subsubsection-paragraph">
                        Under the classical (Gauss-Markov) assumptions, namely that the errors have zero mean given the independent variables, constant variance, and no correlation 
                        with one another, the OLS estimators have desirable properties:
                        <ul>
                            <li><b>Unbiasedness:</b> The OLS estimators are unbiased, meaning their expected values equal the true parameter values.</li>
                            <li><b>Efficiency:</b> Within the class of linear unbiased estimators, OLS estimators have the smallest variance.</li>
                            <li><b>Consistency:</b> As the sample size increases, OLS estimators converge in probability to the true parameter values.</li>
                        </ul>
                        These properties are what make the fitted model trustworthy for inference and prediction. The linear regression model is also a microcosm of neural networks: 
                        the weighted sum <InlineMath math="\beta_0 + \beta_1 X_1 + ... + \beta_k X_k" /> looks awfully similar to the form that neurons take on. Look for this 
                        connection in the next section. 
                    </p>
                    
                    <h4>Inference</h4>

                    <p className="subsubsection-paragraph">
                        Statistical inference in regression involves drawing conclusions about the population parameters based on the sample data. This typically includes hypothesis testing and confidence interval estimation for the regression coefficients.
                    </p>

                    <p className="subsubsection-paragraph">
                        For instance, to test the significance of a coefficient <InlineMath math="\beta_i" />, we formulate null and alternative hypotheses:
                        <BlockMath math="H_0: \beta_i = 0" />
                        <BlockMath math="H_a: \beta_i \neq 0" />
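                        Since the value of <InlineMath math="\beta_i" /> under <InlineMath math="H_0" /> is zero, the test statistic is:
                        <BlockMath math="t = \frac{\hat{\beta}_i - 0}{\text{SE}(\hat{\beta}_i)} = \frac{\hat{\beta}_i}{\text{SE}(\hat{\beta}_i)}" />
                        where <InlineMath math="\text{SE}(\hat{\beta}_i)" /> is the standard error of <InlineMath math="\hat{\beta}_i" />. Under <InlineMath math="H_0" /> and normally distributed errors, this statistic follows a t-distribution with <InlineMath math="n - k - 1" /> degrees of freedom, where <InlineMath math="n" /> is the number of observations and <InlineMath math="k" /> is the number of independent variables. In large samples, it is approximately standard normal.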
                    </p>
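
                    <p className="subsubsection-paragraph">
                        A short worked example, using hypothetical numbers chosen only for illustration: suppose a simple regression 
                        with <InlineMath math="n = 30" /> observations and <InlineMath math="k = 1" /> independent variable yields an estimated slope of 1.5 with a standard error of 0.5. Then:
                        <BlockMath math="t = \frac{1.5}{0.5} = 3" />
                        With 28 degrees of freedom, the two-sided 5% critical value is approximately 2.048. Because <InlineMath math="|t| = 3 > 2.048" />, we would 
                        reject <InlineMath math="H_0" /> at the 5% level.
                    </p>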
                    <p className="subsubsection-paragraph">
                        Confidence intervals give a range of plausible values for each coefficient. A <InlineMath math="100(1-\alpha)" />% confidence interval for <InlineMath math="\beta_i" /> (with <InlineMath math="\alpha = 0.05" /> for 95% confidence) is:
                        <BlockMath math="\hat{\beta}_i \pm t_{\alpha/2} \times \text{SE}(\hat{\beta}_i)" />
                        where <InlineMath math="t_{\alpha/2}" /> is the critical value from the t-distribution with <InlineMath math="n - k - 1" /> degrees of freedom.
                    </p>
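
                    <p className="subsubsection-paragraph">
                        To illustrate with hypothetical numbers: for an estimated coefficient of 1.5 with a standard error of 0.5 and 28 degrees of freedom, the 95% critical value is 
                        approximately 2.048, so the interval is:
                        <BlockMath math="1.5 \pm 2.048 \times 0.5 = 1.5 \pm 1.024" />
                        The interval, from about 0.476 to 2.524, excludes zero, which matches what a two-sided test at the 5% level would conclude for the same numbers.
                    </p>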

                    <p className="subsubsection-paragraph">
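                        Consider a regression model predicting house prices (Y) based on square footage (X). After estimating the model, we might find that the coefficient for square footage is 
                        significantly different from zero, which is evidence of a linear relationship between square footage and house price. How strong that relationship is remains a separate 
                        question, answered by the size of the coefficient and the fit of the model rather than by significance alone.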
                    </p>

                    <p className="subsubsection-paragraph">
                        Another example is a model estimating the effect of an advertising campaign on sales. Hypothesis testing can determine whether the campaign had a statistically significant 
                        effect on sales, guiding future marketing strategies.
                    </p>

                    <p className="subsubsection-paragraph">
                        In NLP, inference in regression models is used to understand the relationship between various linguistic features and outcomes. For example, in a model predicting the 
                        sentiment score (Y) from text features like word count, sentence length, or specific word frequencies (X1, X2, ..., Xk), inference can reveal which features significantly 
                        affect sentiment. 
                    </p>

                    <h4>Types of Regression</h4>
                    <p className="subsubsection-paragraph">

                    <ul className="subsubsection-list">

                        <li>
                            <b>Linear Regression:</b>
                            <div className="custom-math-size"><BlockMath math="Y = \beta_0 + \beta_1 X_1 + ... + \beta_k X_k + \epsilon" /></div>
                            Used for predicting a dependent variable based on one or more independent variables. It assumes a linear relationship between the variables.
                        </li><br/>

                        <li>
                            <b>Polynomial Regression:</b>
                            <div className="custom-math-size"><BlockMath math="Y = \beta_0 + \beta_1 X + \beta_2 X^2 + ... + \beta_k X^k + \epsilon" /></div>
                            Extends linear regression by adding polynomial terms, making it suitable for curved relationships. The model is still linear in the coefficients, so it can be fit with OLS.
                        </li><br/>

                        <li>
                            <b>Logistic Regression:</b>
                            <div className="custom-math-size"><BlockMath math="P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + ... + \beta_k X_k)}}" /></div>
                            Used for binary classification problems. The outcome is modeled as a probability that the dependent variable belongs to a particular class.
                        </li><br/>

                        <li>
                            <b>Ridge Regression (L2 Regularization):</b>
                            <div className="custom-math-size"><BlockMath math="\text{Minimize } \left( \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right)" /></div>
                            Useful in dealing with multicollinearity by adding a penalty term to the loss function.
                        </li><br/>

                        <li>
                            <b>Lasso Regression (L1 Regularization):</b>
                            <div className="custom-math-size"> <BlockMath math="\text{Minimize } \left( \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right)" /></div>
                            Similar to ridge regression but can shrink some coefficients to zero, thus performing feature selection.
                        </li><br/>

                        <li>
                            <b>Elastic Net Regression:</b>
                            <div className="custom-math-size"><BlockMath math="\text{Minimize } \left( \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda_1 \sum_{j=1}^{p} \beta_j^2 + \lambda_2 \sum_{j=1}^{p} |\beta_j| \right)" /></div>
                            Combines L1 and L2 regularization, useful when there are multiple features correlated with each other.
                        </li><br/>

                        <li>
                            <b>Quantile Regression:</b>
                            <div className="custom-math-size"><BlockMath math="\text{Minimize } \sum_{i=1}^{n} \rho_\tau (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})" /></div>
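                            Here, <InlineMath math="\rho_\tau" /> is the check function (also called the pinball loss) for the quantile level <InlineMath math="\tau" />. This regression estimates conditional quantiles of the dependent variable, such as the median, rather than its mean.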
                        </li><br/>

                        <li>
                            <b>Nonlinear Regression:</b>
                            <div className="custom-math-size"><BlockMath math="Y = f(X, \beta) + \epsilon" /></div>
                            Here, <InlineMath math="f" /> is a function that is nonlinear in the parameters <InlineMath math="\beta" />. Suitable for data that follows a nonlinear trend.
                        </li>

                    </ul>

                    </p>
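
                    <p className="subsubsection-paragraph">
                        The check function used in quantile regression is not spelled out in the list above. Its standard definition, written here for a 
                        residual <InlineMath math="u" /> (a symbol introduced only for this note) and a quantile level <InlineMath math="\tau" /> between 0 and 1, is:
                        <BlockMath math="\rho_\tau(u) = \max(\tau u, \ (\tau - 1) u)" />
                        Positive residuals are weighted by <InlineMath math="\tau" /> and negative ones by <InlineMath math="1 - \tau" />. 
                        With <InlineMath math="\tau = 0.5" /> the loss becomes <InlineMath math="|u| / 2" />, so minimizing it fits the conditional median.
                    </p>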

                    <h4>Bias & Variance</h4>
                    <p className="subsubsection-paragraph">
                    The bias-variance trade-off is a fundamental concept in statistical learning that describes the tension between two sources of error in an 
                    algorithm's predictions: bias and variance. In the context of supervised learning, consider a predictive model's expected squared error at a point <InlineMath math="x" />:
                    <BlockMath math="\text{E}[(Y - \hat{f}(x))^2]" />
                    Here, <InlineMath math="Y" /> is the observed value at <InlineMath math="x" />, and <InlineMath math="\hat{f}(x)" /> is the model's prediction. This expected error can be decomposed into three parts: 
                    squared bias, variance, and irreducible error:
                    <BlockMath math="\text{E}[(Y - \hat{f}(x))^2] = \text{Bias}^2(\hat{f}(x)) + \text{Var}(\hat{f}(x)) + \sigma^2" />
                    <ul>
                        <li><b>Bias</b>: The difference between the expected (or average) prediction of our model and the correct value. High bias can cause an algorithm to miss relevant 
                        relations (underfitting).</li>
                        <li><b>Variance</b>: The variability of model predictions for a given data point. High variance can cause overfitting, where a model captures random noise instead of the 
                        intended outputs.</li>
                        <li><b>Irreducible Error</b>: The inherent noise in the data, with variance <InlineMath math="\sigma^2" />. No model can remove it.</li>
                    </ul>
                    The trade-off is that reducing one of the first two errors typically increases the other. A good model strikes a balance between bias and variance, minimizing overall error.
                    </p>
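
                    <p className="subsubsection-paragraph">
                        A sketch of where the decomposition comes from, under assumptions not stated above: the data follow <InlineMath math="Y = f(x) + \epsilon" />, 
                        where <InlineMath math="f" /> is the true function and the noise <InlineMath math="\epsilon" /> has mean zero, variance <InlineMath math="\sigma^2" />, and is 
                        independent of <InlineMath math="\hat{f}(x)" />. The prediction error can be split as:
                        <BlockMath math="Y - \hat{f}(x) = \left( f(x) - \text{E}[\hat{f}(x)] \right) + \left( \text{E}[\hat{f}(x)] - \hat{f}(x) \right) + \epsilon" />
                        Squaring and taking expectations, the cross terms vanish because the second and third terms each have mean zero and are independent of one another, leaving:
                        <BlockMath math="\text{E}[(Y - \hat{f}(x))^2] = \left( f(x) - \text{E}[\hat{f}(x)] \right)^2 + \text{E}\left[ \left( \hat{f}(x) - \text{E}[\hat{f}(x)] \right)^2 \right] + \sigma^2" />
                        The three terms are the squared bias, the variance, and the irreducible error, respectively.
                    </p>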

                <p className="subsubsection-paragraph">
                    Consider the task of fitting a regression model to data points. A high-bias/low-variance model (like linear regression) might oversimplify the relationship, failing to capture 
                    important trends (underfitting). On the other hand, a low-bias/high-variance model (like a high-degree polynomial regression) might follow the random noise in the data 
                    too closely (overfitting).
                </p>

                <p className="subsubsection-paragraph">
                    In NLP, the bias-variance trade-off is crucial in developing and selecting models. For instance, when building a model for sentiment analysis:
                    <ul>
                        <li>A high-bias model might only rely on basic indicators like word presence/absence and could miss the nuances in language that convey sentiment.</li>
                        <li>A high-variance model might adapt too closely to the training data, incorporating idiosyncrasies of the training set that don’t generalize well to new, unseen data.</li>
                    </ul>
                    Understanding and navigating this trade-off is key in NLP model selection and tuning. It involves choosing the right complexity for the model, considering factors like the 
                    amount and variability of data, and the application's specific requirements. Regularization techniques are often employed in machine learning to control the balance of bias 
                    and variance. In NLP, this might involve decisions about feature selection, the architecture of neural networks, or the use of techniques like dropout or early stopping during 
                    training. Honestly, this is a really important topic, and the way it shows up in the model-making process can be very subtle. Like most topics here, it 
                    takes much more study to understand deeply.
                </p>

                </section>
                
                
                <div className="subsubsection-navigation">
                    <Link to="/foundations/calc">← Calculus</Link>
                    <Link to="/foundations/nn">Neural Networks →</Link>
                </div>
            </main>
            
            <Footer />
        </div>
    );
}

export default Probability;
