import React from 'react';
import '../../styles/subsection.css';
import Header from '../../components/Header';
import Footer from '../../components/Footer';
import { Link } from 'react-router-dom';
import 'katex/dist/katex.min.css';
import { InlineMath, BlockMath } from 'react-katex';

function Evaluation() {
    return (
        <div className="subsubsection-container">
            <Header />
            <div className="side-nav-container">
                <aside className="subsubsection-side-nav">
                    <a href="#types">Types</a>
                    <a href="#evalmetrics">Evaluation Metrics</a>
                    <a href="#imbalanced">Imbalanced Datasets</a>
                    <a href="#sig">Significance Testing</a>
                    <a href="#human">Human Evaluation</a>
                </aside>
            </div>
            
            <main className="subsubsection-content">
                <div className="titles"><h1>Evaluation in NLP</h1></div>

                <section id="types" className="code-cleaned">
                    <h2>Types</h2>
                    <p className="subsubsection-paragraph"></p>

                    <h4>Intrinsic v. Extrinsic Evaluation</h4>
                    <p className="subsubsection-paragraph">In NLP, evaluating the performance and effectiveness of models and algorithms is crucial. This evaluation typically falls into 
                    two categories: intrinsic and extrinsic. Intrinsic evaluation measures the performance of an NLP system or model on a specific internal task or metric. It 
                    focuses on the correctness and quality of the system's outputs in a controlled setting. On the other hand, extrinsic evaluation assesses the performance of an NLP 
                    system by its effectiveness in a real-world application or downstream task: how well the system contributes to the performance of a larger system. In short, intrinsic evaluation 
                    asks how well the system performs a specific NLP task in isolation, while extrinsic evaluation asks how much the system contributes to 
                    broader objectives.</p>

                    <p className="subsubsection-paragraph">An example of intrinsic evaluation would be assessing a sentiment analysis model by its accuracy in classifying 
                    sentiment correctly. An example of extrinsic evaluation would be measuring the impact of a chatbot on customer satisfaction in a customer service application. These 
                    examples show that intrinsic methods are often used during the development phase of NLP systems to fine-tune and optimize them, whereas extrinsic methods are 
                    used post-deployment to gauge their real-world effectiveness.</p>

                    <h4>Qualitative v. Quantitative Evaluation</h4>
                    <p className="subsubsection-paragraph">Evaluation methods can also be broadly categorized into qualitative and quantitative approaches. Each type provides different insights 
                    into the performance and effectiveness of NLP systems and models.</p>

                    <p className="subsubsection-paragraph">Qualitative evaluation involves assessing the quality of an NLP system's output based on non-numeric criteria. It focuses on the aspects of 
                    the system's performance that aren't easily quantifiable. A couple of examples:
                        <ul>
                            <li>Evaluating the readability and fluency of text generated by a machine translation system.</li>
                            <li>Assessing how well a chatbot understands and responds to different types of user queries in natural conversations.</li>
                        </ul></p>

                        <p className="subsubsection-paragraph">Quantitative evaluation measures the performance of an NLP system using numerical metrics. It's about quantifying the effectiveness of a system 
                        using statistical and mathematical methods. These approaches use objective, measurable criteria such as accuracy, precision, recall, and F1 score. They also allow for a 
                        reproducible assessment of performance and comparison with other systems. We will get into some of these in the sections below. </p>

                </section>
                
                <section id="evalmetrics" className="code-cleaned">
                <h2>Evaluation Metrics</h2>

                <p className="subsubsection-paragraph">Here, I provide only a high-level overview of popular evaluation metrics; they are covered in more detail alongside the 
                specific models they apply to. </p>

                <h4>Classification</h4>
                <p className="subsubsection-paragraph">
                    In classification tasks, common metrics include Accuracy, Precision, Recall, and F1 Score. 
                    <div className="custom-math-size"><BlockMath math="\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}" /></div>
                    Precision (positive predictive value) and recall (sensitivity) are particularly important in imbalanced datasets. 
                    <div className="custom-math-size"><BlockMath math="\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}" /></div>
                    <div className="custom-math-size"><BlockMath math="\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}" /></div>
                    The F1 Score is the harmonic mean of precision and recall, providing a balance between the two.
                    <div className="custom-math-size"><BlockMath math="\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}" /></div>
                    For example, in sentiment analysis, precision is the fraction of reviews predicted positive that are actually positive, while recall is the fraction of actual positive reviews that the model identified.
                </p>
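                {/*
                The four classification formulas above can be sketched in JavaScript as follows. This is a minimal
                illustration, assuming binary labels encoded as 1 (positive) and 0 (negative); the function name
                classificationMetrics and the sample arrays are hypothetical, not part of this site's code.

                ```javascript
                // Compute accuracy, precision, recall, and F1 for binary labels (1 = positive, 0 = negative).
                function classificationMetrics(yTrue, yPred) {
                  let tp = 0, fp = 0, fn = 0, tn = 0;
                  for (let i = 0; i < yTrue.length; i++) {
                    if (yPred[i] === 1 && yTrue[i] === 1) tp++;
                    else if (yPred[i] === 1 && yTrue[i] === 0) fp++;
                    else if (yPred[i] === 0 && yTrue[i] === 1) fn++;
                    else tn++;
                  }
                  const accuracy = (tp + tn) / yTrue.length;
                  // Return 0 by convention when a denominator would be zero.
                  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
                  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
                  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
                  return { accuracy, precision, recall, f1 };
                }

                // Illustrative data only: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
                const yTrue = [1, 0, 1, 1, 0, 0, 1, 0];
                const yPred = [1, 0, 0, 1, 0, 1, 1, 0];
                console.log(classificationMetrics(yTrue, yPred));
                ```

                Returning 0 for an undefined precision or recall is one common convention; other tools raise a
                warning or return NaN instead.
                */}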

                <h4>Regression</h4>
                <p className="subsubsection-paragraph">
                    In regression tasks, Mean Squared Error (MSE) and Mean Absolute Error (MAE) are common.
                    <div className="custom-math-size"><BlockMath math="\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2" /></div>
                    <div className="custom-math-size"><BlockMath math="\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|" /></div>
                    Here, <InlineMath math="y_i" /> is the true value, <InlineMath math="\hat{y}_i" /> is the predicted value, and <InlineMath math="n" /> is the number of examples. MSE squares each error and so gives more weight to larger errors, making it sensitive to outliers, whereas MAE weights every error in proportion to its magnitude. In an NLP context, these metrics are used in tasks like 
                    predicting word embedding values or scores in regression-based semantic tasks.
                </p>

                <h4>Information Retrieval</h4>
                <p className="subsubsection-paragraph">
                    Precision@K, Recall@K, and Mean Average Precision (MAP) are crucial for evaluating information retrieval systems. Precision@K measures the proportion of relevant items in the 
                    top K results, while Recall@K measures the proportion of all relevant items that appear in the top K.
                    <div className="custom-math-size"><BlockMath math="\text{Precision@K} = \frac{\text{Number of Relevant Items in Top K}}{K}" /></div>
                    MAP takes ranking order into account: for each query, Average Precision averages the precision at the rank of each relevant item, and MAP is the mean of these values across all queries.
                    For search engine evaluation, MAP summarizes in a single number how well the engine ranks relevant documents.
                </p>
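                {/*
                As a rough sketch of the retrieval metrics above, the following JavaScript computes Precision@K and
                Mean Average Precision. It assumes each query is an object with a ranked list of document ids and a
                Set of relevant ids; that shape, the function names, and the ids d1..d8 are all hypothetical.

                ```javascript
                // Fraction of the top-k ranked ids that are relevant.
                function precisionAtK(ranked, relevant, k) {
                  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
                  return hits / k;
                }

                // Average of the precision values measured at the rank of each relevant item.
                function averagePrecision(ranked, relevant) {
                  let hits = 0;
                  let sum = 0;
                  ranked.forEach((id, index) => {
                    if (relevant.has(id)) {
                      hits++;
                      sum += hits / (index + 1); // precision at this relevant item's rank
                    }
                  });
                  // Dividing by the total number of relevant items penalizes items never retrieved.
                  return relevant.size === 0 ? 0 : sum / relevant.size;
                }

                // Mean of the per-query average precision values.
                function meanAveragePrecision(queries) {
                  const total = queries.reduce((acc, q) => acc + averagePrecision(q.ranked, q.relevant), 0);
                  return total / queries.length;
                }

                // Illustrative data only.
                const queries = [
                  { ranked: ["d1", "d2", "d3", "d4"], relevant: new Set(["d1", "d3"]) },
                  { ranked: ["d5", "d6", "d7", "d8"], relevant: new Set(["d6"]) },
                ];
                console.log(precisionAtK(queries[0].ranked, queries[0].relevant, 2));
                console.log(meanAveragePrecision(queries));
                ```

                Normalizing Average Precision by the number of relevant items (rather than the number retrieved) is
                the usual choice, but some evaluation toolkits truncate at a cutoff rank.
                */}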

                <h4>Machine Translation</h4>
                <p className="subsubsection-paragraph">
                    BLEU (Bilingual Evaluation Understudy) is a widely used standard metric for evaluating machine translation.
                    <div className="custom-math-size"><BlockMath math="\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^N w_n \log p_n\right)" /></div>
                    Here, <InlineMath math="p_n" /> is the modified (clipped) n-gram precision, <InlineMath math="w_n" /> is the weight for each n-gram order (commonly uniform, <InlineMath math="w_n = 1/N" />), and BP is a brevity penalty that penalizes translations shorter than the reference. 
                    BLEU compares the candidate translation against one or more reference translations to assess quality.
                </p>

                <h4>Text Generation</h4>
                <p className="subsubsection-paragraph">
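                    Perplexity is often used to evaluate language models, including those used for text generation.
                    <div className="custom-math-size"><BlockMath math="\text{Perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i)}" /></div>
                    Here, <InlineMath math="q(x_i)" /> is the probability the model assigns to the <InlineMath math="i" />-th token of a held-out sample of <InlineMath math="N" /> tokens. Lower perplexity indicates better performance, suggesting the model is better at predicting the sample. It’s especially useful in evaluating language models in tasks like 
                    chatbot response generation.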
                </p>
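                {/*
                A small sketch of how perplexity might be computed in practice, assuming the common per-token
                formulation: 2 raised to the negative average base-2 log probability that the model assigned to
                each observed token. The function name and the probabilities below are illustrative assumptions.

                ```javascript
                // tokenProbs: probability the model assigned to each observed token in a held-out sample.
                function perplexity(tokenProbs) {
                  const n = tokenProbs.length;
                  const sumLog2 = tokenProbs.reduce((acc, p) => acc + Math.log2(p), 0);
                  // 2^(-average log2 probability); lower means the sample was less "surprising".
                  return Math.pow(2, -sumLog2 / n);
                }

                // A model that spreads probability evenly over 4 choices at every step.
                console.log(perplexity([0.25, 0.25, 0.25, 0.25]));
                // A more confident model on the same number of tokens.
                console.log(perplexity([0.9, 0.8, 0.95, 0.85]));
                ```

                A uniform guess over 4 options gives a perplexity of 4, which is why perplexity is often read as the
                effective number of choices the model is weighing per token.
                */}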

                <h4>Semantic Tasks</h4>
                <p className="subsubsection-paragraph">
                    For semantic similarity and entailment, metrics like cosine similarity and the Jaccard index are used.
                    <div className="custom-math-size"><BlockMath math="\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \|B\|}" /></div>
                    These metrics measure how close the semantic representation of two texts is, which is pivotal in tasks like paraphrase detection or document clustering.
                </p>
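                {/*
                The two similarity measures named above can be sketched as follows. This assumes texts are already
                represented as equal-length numeric vectors (for cosine similarity) or as arrays of tokens (for the
                Jaccard index, the size of the intersection divided by the size of the union); the function names
                and sample inputs are hypothetical.

                ```javascript
                // Cosine of the angle between two equal-length numeric vectors.
                function cosineSimilarity(a, b) {
                  let dot = 0, normA = 0, normB = 0;
                  for (let i = 0; i < a.length; i++) {
                    dot += a[i] * b[i];
                    normA += a[i] * a[i];
                    normB += b[i] * b[i];
                  }
                  // Treat a zero vector as having no similarity to anything.
                  if (normA === 0 || normB === 0) return 0;
                  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
                }

                // Jaccard index over the sets of distinct tokens in each text.
                function jaccardIndex(tokensA, tokensB) {
                  const setA = new Set(tokensA);
                  const setB = new Set(tokensB);
                  let intersection = 0;
                  for (const token of setA) {
                    if (setB.has(token)) intersection++;
                  }
                  const union = setA.size + setB.size - intersection;
                  return union === 0 ? 1 : intersection / union;
                }

                console.log(cosineSimilarity([1, 2, 0], [2, 4, 1]));
                console.log(jaccardIndex(["the", "cat", "sat"], ["the", "cat", "slept"]));
                ```

                Cosine similarity depends entirely on how the vectors were produced (bag-of-words counts, embeddings,
                and so on), while the Jaccard index only looks at token overlap.
                */}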

                <h4>Conversational AI</h4>
                <p className="subsubsection-paragraph">
                    In conversational AI, metrics like Dialog Success Rate and User Satisfaction Score are used.
                    Dialog Success Rate measures how often a chatbot successfully completes its intended task, while User Satisfaction gauges user happiness with the bot's performance, often 
                    through surveys or feedback mechanisms.
                </p>
            </section>



            <section id="imbalanced" className="code-cleaned">
            <h2>Imbalanced Datasets</h2>
            <p className="subsubsection-paragraph">
                Imbalanced datasets are common in NLP and other data-driven fields, where some classes are significantly more frequent than others. This imbalance can lead to biased models that favor the majority class. Effective handling of imbalanced datasets is crucial for building robust and fair NLP models.
            </p>

            <h4>Oversampling & Undersampling</h4>
            <p className="subsubsection-paragraph">
                Oversampling and undersampling are techniques to adjust the class distribution in a dataset.
                <ul>
                    <li><strong>Oversampling</strong> involves increasing the number of instances in the minority class by replicating them or generating synthetic samples (e.g., SMOTE - Synthetic Minority Over-sampling Technique).</li>
                    <li><strong>Undersampling</strong> reduces the number of instances in the majority class. This can be random or involve more complex strategies to retain information.</li>
                </ul>
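                For example, in a sentiment analysis dataset where positive reviews greatly outnumber negative ones, oversampling the negative reviews or undersampling the positive reviews can help achieve better class balance. Resampling should be applied only to the training split; the evaluation data should keep its original class distribution so that reported metrics reflect real-world performance.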
            </p>

            <h4>Cohen's Kappa</h4>
            <p className="subsubsection-paragraph">
                Cohen's Kappa, originally a measure of agreement between two raters, can be used to evaluate classification models on imbalanced datasets by treating the model's predictions and the true labels as the two raters. It compares the observed accuracy with the accuracy that could be expected by chance.
                <div className="custom-math-size"><BlockMath math="\kappa = \frac{p_o - p_e}{1 - p_e}" /></div>
                Here, <InlineMath math="p_o" /> is the observed agreement, and <InlineMath math="p_e" /> is the expected agreement by chance. A high Kappa value indicates that the model performs well beyond what would be expected by random chance, considering the imbalances in the dataset.
            </p>

            <h4>Matthews Correlation Coefficient</h4>
            <p className="subsubsection-paragraph">
                The Matthews Correlation Coefficient (MCC) is a robust metric used for binary classification tasks, effective even with imbalanced datasets. It takes into account true and false positives and negatives.
                <div className="custom-math-size"><BlockMath math="\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}" /></div>
                Where TP, TN, FP, and FN represent the number of true positives, true negatives, false positives, and false negatives, respectively. MCC ranges from -1 to +1, where +1 indicates perfect prediction, 0 no better than random prediction, and -1 total disagreement between prediction and observation.
            </p>
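            {/*
            To see why these two metrics help with imbalance, here is a minimal JavaScript sketch that computes
            Cohen's Kappa and MCC from binary confusion-matrix counts. The count objects and function names are
            hypothetical, and returning 0 when a denominator is zero is an assumed convention.

            ```javascript
            // counts: { tp, fp, fn, tn } from a binary confusion matrix.
            function cohensKappa({ tp, fp, fn, tn }) {
              const n = tp + fp + fn + tn;
              const po = (tp + tn) / n; // observed agreement (accuracy)
              // Chance agreement from the marginal totals of predictions and true labels.
              const pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n);
              return pe === 1 ? 0 : (po - pe) / (1 - pe);
            }

            function matthewsCorrelation({ tp, fp, fn, tn }) {
              const denom = Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
              return denom === 0 ? 0 : (tp * tn - fp * fn) / denom;
            }

            // A classifier that always predicts the majority (negative) class on a 95/5 split.
            const allNegative = { tp: 0, fp: 0, fn: 5, tn: 95 };
            console.log((allNegative.tp + allNegative.tn) / 100); // accuracy looks high
            console.log(cohensKappa(allNegative));
            console.log(matthewsCorrelation(allNegative));
            ```

            In this example accuracy is 0.95, yet both Kappa and MCC come out as 0, which signals that the
            classifier has learned nothing beyond the class prior.
            */}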
        </section>


        <section id="sig" className="code-cleaned">
        <h2>Significance Testing</h2>
        <p className="subsubsection-paragraph">
            Significance testing is a statistical method used to determine whether an observed result, such as a difference in performance between two models, is unlikely to have arisen by chance alone. In NLP, this is crucial for establishing the reliability of 
            models and improvements over baseline methods.
        </p>

        <p className="subsubsection-paragraph">
            The importance of significance testing in NLP lies in its ability to validate experimental results. It helps in confirming that changes in model performance (e.g., accuracy, F1 score) 
            are statistically significant and not just random variations. This is especially vital when comparing different models or when tweaking hyperparameters to improve a model's performance.
        </p>

        <h4>T-Tests & Chi-squared</h4>
        <p className="subsubsection-paragraph">
            T-tests and Chi-squared tests, both of which we discussed before, are common methods for significance testing in NLP.
            <ul>
                <li><strong>T-Tests</strong> compare the means of two groups to determine whether they differ significantly, which is useful when comparing the performance of two models.
                <div className="custom-math-size"><BlockMath math="t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}}}" /></div>
                Here, <InlineMath math="\bar{X}_1" /> and <InlineMath math="\bar{X}_2" /> are the sample means, <InlineMath math="s^2" /> is the pooled sample variance, and <InlineMath math="n_1" /> and <InlineMath math="n_2" /> are the sample sizes. When both models are scored on the same test items, a paired t-test on the per-item differences is the more appropriate variant.</li>
                <li><strong>Chi-squared Tests</strong> assess whether observed frequencies in one categorical variable match expected frequencies. They are used in NLP for tasks like feature 
                selection or to test the independence of variables.
                <div className="custom-math-size"><BlockMath math="\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}" /></div>
                Where <InlineMath math="O_i" /> are the observed frequencies, and <InlineMath math="E_i" /> are the expected frequencies.</li>
            </ul>
        </p>
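        {/*
        The two test statistics above can be computed with a few lines of JavaScript. This sketch assumes the
        pooled-variance (equal variance) form of the two-sample t statistic and only produces the statistics
        themselves; turning them into p-values requires the t and chi-squared distributions, which are not
        implemented here. All function names and sample numbers are hypothetical.

        ```javascript
        function mean(xs) {
          return xs.reduce((a, b) => a + b, 0) / xs.length;
        }

        // Unbiased sample variance (divides by n - 1).
        function sampleVariance(xs) {
          const m = mean(xs);
          return xs.reduce((acc, x) => acc + (x - m) * (x - m), 0) / (xs.length - 1);
        }

        // Two-sample t statistic using the pooled variance of both groups.
        function pooledTStatistic(a, b) {
          const n1 = a.length;
          const n2 = b.length;
          const pooled = ((n1 - 1) * sampleVariance(a) + (n2 - 1) * sampleVariance(b)) / (n1 + n2 - 2);
          return (mean(a) - mean(b)) / Math.sqrt(pooled / n1 + pooled / n2);
        }

        // Sum over categories of (observed - expected)^2 / expected.
        function chiSquaredStatistic(observed, expected) {
          return observed.reduce((acc, o, i) => acc + ((o - expected[i]) ** 2) / expected[i], 0);
        }

        // Illustrative F1 scores from repeated training runs of two models.
        const modelA = [0.81, 0.79, 0.83, 0.8, 0.82];
        const modelB = [0.77, 0.78, 0.76, 0.79, 0.75];
        console.log(pooledTStatistic(modelA, modelB));
        console.log(chiSquaredStatistic([10, 20, 30], [20, 20, 20]));
        ```

        The t statistic is compared against a t distribution with n1 + n2 - 2 degrees of freedom, and the
        chi-squared statistic against a chi-squared distribution whose degrees of freedom depend on the design.
        */}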

        <h4>Bootstrap Approaches</h4>
        <p className="subsubsection-paragraph">
            Bootstrap methods are non-parametric approaches to statistical significance testing. They involve repeatedly resampling a dataset with replacement and recalculating the statistic of 
            interest. In NLP, bootstrapping is used to estimate the confidence intervals of model metrics, providing a more robust understanding of model performance.
            The essence of the bootstrap approach is to derive numerous datasets (samples) from the original data and compute the statistic (e.g., model accuracy) for each sample to create a 
            distribution. This distribution is then used to calculate confidence intervals or to perform hypothesis testing.
        </p>
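        {/*
        The resampling procedure described above can be sketched as a percentile bootstrap for accuracy. This
        is one simple variant among several, assuming per-example outcomes are stored as 1 (correct) or 0
        (incorrect). The function names, the options object, and the small seeded generator (included only so
        runs are reproducible) are all hypothetical.

        ```javascript
        // Small seeded pseudo-random generator returning values in [0, 1); for reproducibility only.
        function seededRandom(seed) {
          let a = seed >>> 0;
          return function () {
            a = (a + 0x6d2b79f5) >>> 0;
            let t = a;
            t = Math.imul(t ^ (t >>> 15), t | 1);
            t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
            return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
          };
        }

        // correct: array of 1/0 outcomes, one per test example.
        function bootstrapAccuracyCI(correct, { resamples = 1000, alpha = 0.05, rng = Math.random } = {}) {
          const n = correct.length;
          const stats = [];
          for (let b = 0; b < resamples; b++) {
            let sum = 0;
            // Draw n examples with replacement and recompute accuracy on the resample.
            for (let i = 0; i < n; i++) {
              sum += correct[Math.floor(rng() * n)];
            }
            stats.push(sum / n);
          }
          stats.sort((x, y) => x - y);
          // Percentile interval: cut off alpha/2 of the distribution on each side.
          const lower = stats[Math.floor((alpha / 2) * resamples)];
          const upper = stats[Math.min(resamples - 1, Math.floor((1 - alpha / 2) * resamples))];
          return { lower, upper };
        }

        // Illustrative outcomes: 8 of 10 test examples classified correctly.
        const correct = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1];
        const ci = bootstrapAccuracyCI(correct, { resamples: 2000, rng: seededRandom(42) });
        console.log(ci);
        ```

        With only 10 examples the interval will be wide, which is the point: the bootstrap makes the
        uncertainty in a single accuracy number visible.
        */}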
    </section>


                <section id="human" className="code-cleaned">
                <h2>Human Evaluation</h2>
                <p className="subsubsection-paragraph">
                    Human evaluation in NLP involves subjective assessment by human judges to rate or categorize NLP outputs. This type of evaluation is crucial for tasks where human judgment 
                    is the gold standard, such as assessing the quality of machine translation or the naturalness of generated text. Human judgments can also serve as a training signal, as in reinforcement learning from human feedback (RLHF), 
                    which OpenAI used to train InstructGPT and ChatGPT.
                </p>

                <h4>Judgment</h4>
                <p className="subsubsection-paragraph">
                    Judgment in human evaluation typically involves tasks like rating the coherence, fluency, or relevance of text on a numerical scale. It can also include categorizing responses, 
                    deciding whether a translation is accurate, or if a conversation response is appropriate. The reliability of these evaluations often depends on the clarity of the guidelines 
                    provided to the human judges.
                </p>

                <h4>Fleiss’ Kappa</h4>
                <p className="subsubsection-paragraph">
                    Fleiss’ Kappa is a statistical measure used to assess the reliability of agreement among a fixed number of raters who assign categorical ratings to a set of items.
                    <div className="custom-math-size"><BlockMath math="\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}" /></div>
                    Here, <InlineMath math="\bar{P}" /> is the mean proportion of agreement observed between raters, and <InlineMath math="\bar{P}_e" /> is the hypothetical probability of chance 
                    agreement. A higher value of <InlineMath math="\kappa" /> indicates more consistent ratings among different evaluators.
                </p>
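                {/*
                A minimal sketch of the Fleiss' Kappa computation, assuming the ratings are summarized as a count
                matrix in which counts[i][j] is the number of raters who assigned item i to category j, and every
                item was rated by the same number of raters. The function name and sample matrix are hypothetical.

                ```javascript
                // counts[i][j]: number of raters who put item i into category j (rows sum to the rater count).
                function fleissKappa(counts) {
                  const numItems = counts.length;
                  const numRaters = counts[0].reduce((a, b) => a + b, 0);
                  const categoryTotals = new Array(counts[0].length).fill(0);
                  let agreementSum = 0;
                  for (const row of counts) {
                    let sumSquares = 0;
                    row.forEach((c, j) => {
                      categoryTotals[j] += c;
                      sumSquares += c * c;
                    });
                    // Proportion of agreeing rater pairs for this item.
                    agreementSum += (sumSquares - numRaters) / (numRaters * (numRaters - 1));
                  }
                  const pBar = agreementSum / numItems; // mean observed agreement
                  // Chance agreement: sum of squared overall category proportions.
                  const pE = categoryTotals.reduce(
                    (acc, total) => acc + Math.pow(total / (numItems * numRaters), 2),
                    0
                  );
                  if (pE === 1) return NaN; // undefined when every rating falls in one category
                  return (pBar - pE) / (1 - pE);
                }

                // Illustrative data: 4 items, 3 raters, 2 categories.
                const ratings = [
                  [3, 0],
                  [2, 1],
                  [0, 3],
                  [1, 2],
                ];
                console.log(fleissKappa(ratings));
                ```

                The count-matrix input discards which rater gave which rating, which is fine for Fleiss' Kappa
                because it does not assume the same individuals rate every item.
                */}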

                <h4>Krippendorff's Alpha</h4>
                <p className="subsubsection-paragraph">
                    Krippendorff's Alpha is another reliability metric, applicable to any number of raters, levels of measurement, and sample sizes. It assesses the agreement among raters who rate a 
                    set of items.
                    <div className="custom-math-size"><BlockMath math="\alpha = 1 - \frac{\text{Observed Disagreement}}{\text{Expected Disagreement}}" /></div>
                    It accounts for the agreement occurring by chance, providing a robust measure of the reliability of human raters. A high value of <InlineMath math="\alpha" /> indicates a high 
                    level of agreement among raters, validating the reliability of the human evaluation process.
                </p>
            </section>

                
                
                <div className="subsubsection-navigation">
                    <Link to="/nlpbasics/semantic">← Semantic & Sentiment Analysis</Link>
                    <Link to="/ml">Machine Learning →</Link>
                </div>
            </main>
            
            <Footer />
        </div>
    );
}

export default Evaluation;
