import React from 'react';
import '../../styles/subsection.css';
import Header from '../../components/Header';
import Footer from '../../components/Footer';
import { Link } from 'react-router-dom';
import 'katex/dist/katex.min.css';
import { InlineMath, BlockMath } from 'react-katex';
import { LightAsync as SyntaxHighlighter } from 'react-syntax-highlighter';
import { docco } from 'react-syntax-highlighter/dist/esm/styles/hljs';

function NLPData() {
    return (
        <div className="subsubsection-container">
            <Header />
            <div className="side-nav-container">
                <aside className="subsubsection-side-nav">
                    <a href="#corpora">Corpora & Datasets</a>
                    <a href="#tokenization">Tokenization</a>
                    <a href="#embeddings">Embeddings</a>
                    <a href="#bow">Bag-of-Words</a>
                    <a href="#tfidf">TF-IDF</a>
                    <a href="#ngram">N-Grams</a>
                    <a href="#rep">Miscellaneous</a>
                </aside>
            </div>
            
            <main className="subsubsection-content">
                <div className="titles"><h1>NLP Data</h1></div>

                <section id="corpora" className="code-cleaned">
                    <h2>Corpora</h2>

                    <h4>Understanding Corpora</h4>
                    <p className="subsubsection-paragraph">
                        A corpus is a large and structured set of texts. These texts can be anything from books, articles, and essays to transcripts of speeches, tweets, or dialogues from movies. The key is that it's 
                        an accumulation of real-world text data. Unlike a random assortment of texts, a corpus is usually compiled with a specific purpose in mind. For example, a corpus might 
                        be created to represent a language (like English), a dialect (like British English), a genre (like scientific journals), or to support specific research (like sentiment analysis on social media).
                    </p>

                    {/* <p className="subsubsection-paragraph">
                    Think of a corpus like a specialized library. A library contains a vast number of books (texts) organized systematically. Similarly, a corpus contains various texts, often 
                    organized or classified in a way that serves a specific research or linguistic purpose. A corpus is the dataset for NLP models, providing the raw material (language data) upon 
                    which models and analyses are built.
                    </p> */}

                    <p className="subsubsection-paragraph">In a more computational sense, a corpus can be viewed as a large matrix or database. Each document (or text) in the corpus 
                    can be represented as a vector of features. These features could be as simple as word counts or as complex as multi-dimensional embeddings derived from models like Word2Vec. For 
                    instance, in a very simplified model, imagine a corpus with three documents, each containing a few words. We could represent this corpus as a matrix where each row is a 
                    document and each column is a word, with cell values indicating the frequency of that word in the document. Put plainly, a document is analogous to a single 
                    observation in an ordinary data analysis setting, and the collection of documents (observations) is your dataset.</p>

                    <p className="subsubsection-paragraph">
                        Popular corpora like the Penn Treebank or the British National Corpus provide annotated texts that are invaluable for training NLP models. Feel free to run the following 
                        code snippet to get a sense of what a corpus might look like:
                    </p>

                    <p className="subsubsection-paragraph">
    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
        {`# Example of loading a popular dataset using NLTK
import nltk
nltk.download('brown')
from nltk.corpus import brown
print(brown.words())`}
    </SyntaxHighlighter> </p>
                    </section>
                
                <section id="tokenization" className="code-cleaned">
                    <h2>Tokenization</h2>
                    <p className="subsubsection-paragraph">
                    Tokenization is a fundamental process in NLP where a large piece of text is divided into smaller units, called tokens. These tokens can be words, characters, or 
                    subword units like syllables. Imagine a string of pearls. If the string represents a sentence, each pearl represents a token. Just as you can separate pearls from the string, tokenization separates 
                    words or characters from a continuous stream of text.
                    </p>

                    <p className="subsubsection-paragraph">
                        The basic unit in text processing is a token, typically a word. However, tokens can also be phrases, symbols, or other elements, depending on the granularity of the tokenization process. 
                        The tokenization process can be formally represented as a function <InlineMath math="T" /> that maps a string <InlineMath math="S" /> to a list of 
                        tokens <InlineMath math="[t_1, t_2, \ldots, t_n]" />.
                        <div className="custom-math-size"><BlockMath math="T: S \rightarrow [t_1, t_2, \ldots, t_n]" /></div>
                    </p>

                    <p className="subsubsection-paragraph">
                        There are several types of tokenization:
                        <ul>
                            <li><strong>Word Tokenization</strong> - The most common form, where text is split into words. For example, "Hello, world!" becomes ["Hello", ",", "world", "!"].</li>
                            <li><strong>Character Tokenization</strong> - Here, text is split into characters. "Hello" becomes ["H", "e", "l", "l", "o"].</li>
                            <li><strong>Subword Tokenization</strong> - This splits text into subword units, which is useful in languages where words can be very long and for handling unknown words in a language model.</li>
                            <li><strong>Byte Pair Encoding (BPE)</strong> - A subword tokenization technique that represents common sequences of characters as single tokens.</li>
                        </ul>
                    </p>
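
                    <p className="subsubsection-paragraph">
                        The tokenization types listed above can be sketched in plain Python. The snippet below is only an illustration: the helper names word_tokenize_simple and 
                        merge_most_frequent_pair and the toy word list are assumptions made for this example, and the last part shows a single BPE-style merge step rather than a full BPE implementation.
                    </p>

                    <p className="subsubsection-paragraph">
    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
        {`# Sketch of word, character, and one BPE-style merge step (illustrative only)
import string
from collections import Counter

def word_tokenize_simple(text):
    """Split on whitespace and peel punctuation off into its own tokens."""
    tokens = []
    for chunk in text.split():
        word = ""
        for ch in chunk:
            if ch in string.punctuation:
                if word:
                    tokens.append(word)
                    word = ""
                tokens.append(ch)
            else:
                word += ch
        if word:
            tokens.append(word)
    return tokens

def merge_most_frequent_pair(words):
    """One BPE-style step: merge the most frequent adjacent symbol pair."""
    pair_counts = Counter()
    for symbols in words:
        pair_counts.update(zip(symbols, symbols[1:]))
    best = pair_counts.most_common(1)[0][0]
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return best, merged_words

word_tokens = word_tokenize_simple("Hello, world!")
char_tokens = list("Hello")

# Toy vocabulary, split into characters; the pair ('l', 'o') occurs most often
words = [list("low"), list("lot"), list("slow")]
best_pair, merged = merge_most_frequent_pair(words)

print(word_tokens)  # ['Hello', ',', 'world', '!']
print(char_tokens)  # ['H', 'e', 'l', 'l', 'o']
print(best_pair, merged)`}
    </SyntaxHighlighter>
    </p>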

                    <p className="subsubsection-paragraph">
                        Tokenization is a crucial first step in many NLP tasks as it determines the granularity of information that will be processed. The choice of tokenization can affect both the performance 
                        and complexity of subsequent tasks, such as language modeling, parsing, and machine translation. Some challenges include:
                        <ul>
                            <li><strong>Language Variation:</strong> Different languages have different rules for splitting words (like Chinese, which doesn’t use spaces).</li>
                            <li><strong>Complexity:</strong> Punctuation, contractions (like "don't"), and other linguistic nuances add complexity.</li>
                        </ul>
                        You can use the following code snippet to see what an example of tokenization looks like.
                    </p>

                    <p className="subsubsection-paragraph">
    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
        {`# Example of word tokenization using NLTK
import nltk
from nltk.tokenize import word_tokenize
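nltk.download('punkt')      # tokenizer data used by older NLTK releases
nltk.download('punkt_tab')  # newer NLTK releases look for 'punkt_tab' instead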

text = "Natural Language Processing with Python is super awesome"
tokens = word_tokenize(text)
print(tokens)`}
    </SyntaxHighlighter>
    </p>
                    </section>


                <section id="embeddings" className="code-cleaned">
                    <h2>Embeddings</h2>
                    <p className="subsubsection-paragraph">
                        Embeddings represent discrete textual elements like words, phrases, or even entire documents in a continuous vector space. They capture 
                        semantic meaning and relationships in a way that can be processed by machine learning models. Consider word embeddings as the geographic location of cities on a map. Just as 
                        cities that are close to each other often have similar climates or cultural attributes, words that are close in the embedding space often have similar meanings or are used 
                        in similar contexts.
                    </p>

                    <p className="subsubsection-paragraph">
                        More technically, an embedding is a mapping <InlineMath math="\phi" /> from a discrete space, such as a vocabulary <InlineMath math="V" />, to a continuous and dense 
                        vector space <InlineMath math="\mathbb{R}^d" /> where <InlineMath math="d" /> is the dimensionality of the vectors. This can be represented as:
                        <div className="custom-math-size"><BlockMath math="\phi: V \rightarrow \mathbb{R}^d" /></div>
                        The goal of this transformation is to represent textual data in a format that a machine learning model can work with effectively.
                    </p>


                    <p className="subsubsection-paragraph">
                        Embeddings are fundamental in NLP because they translate the categorical data of language into numerical form, where the relationship between items 
                        can be measured by geometric distances. For instance, in a well-constructed embedding space, semantically similar words will be closer to each other than dissimilar ones. 
                        This property is invaluable for tasks like text classification, sentiment analysis, and machine translation. This topic will come up often as you work your 
                        way through these sections. The primary concept to understand is that you have natural language (words, sentences, and so on) and you want to represent it as numbers; 
                        for some approaches, as you will see, learning these embeddings is the sole objective.
                    </p>
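
                    <p className="subsubsection-paragraph">
                        To make the idea of measuring relationships geometrically more concrete, here is a small sketch using cosine similarity, which is one common choice of measure. 
                        The three-dimensional vectors are made-up values chosen for illustration; they are not the output of any trained model.
                    </p>

                    <p className="subsubsection-paragraph">
    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
        {`# Sketch: comparing hypothetical embeddings with cosine similarity
import math

# Hand-picked toy vectors (assumed for illustration, not learned)
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_related = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_unrelated = cosine_similarity(embeddings["king"], embeddings["apple"])

# The related pair should score noticeably higher than the unrelated pair
print(sim_related, sim_unrelated)`}
    </SyntaxHighlighter>
    </p>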

                    <p className="subsubsection-paragraph">
    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
        {`# Example of generating word embeddings using gensim's Word2Vec
from gensim.models import Word2Vec

# Define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# Train a Word2Vec model
model = Word2Vec(sentences, min_count=1)

# Get an embedding for a word
word_embedding = model.wv['sentence']
print(word_embedding)`}
    </SyntaxHighlighter>
    </p>
                    </section>

                    <section id="bow" className="code-cleaned">
    <h2>Bag of Words</h2>
    <p className="subsubsection-paragraph">
        The Bag of Words (BoW) model is a simple yet powerful NLP representation where text is represented as the bag (multiset) of its words, disregarding grammar and word order but keeping 
        multiplicity. Imagine a literal bag containing a bunch of words; each time you pull out a word, you're essentially sampling from this bag. 
    </p>

    <p className="subsubsection-paragraph">
        BoW transforms text into a vector space model, where each unique word in the vocabulary corresponds to a dimension in the vector space. If <InlineMath math="D" /> is a document 
        and <InlineMath math="V = \{w_1, w_2, ..., w_{|V|}\}" /> is the vocabulary (set of unique words across all documents), then the BoW representation 
        of <InlineMath math="D" /> is a vector <InlineMath math="\vec{v} \in \mathbb{N}^{|V|}" />, where each element <InlineMath math="v_i" /> is the frequency of
         word <InlineMath math="w_i" /> in <InlineMath math="D" />. This can be represented as:
         <div className="custom-math-size"><BlockMath math="\vec{v} = (\text{freq}(w_1, D), \text{freq}(w_2, D), ..., \text{freq}(w_{|V|}, D))" /></div>
        These vectors are often sparse because their length equals the total number of unique words across all of your documents: if a 
        document does not contain some word <InlineMath math="w_j" />, then the corresponding entry <InlineMath math="v_j" /> is 0.
    </p>

    <p className="subsubsection-paragraph">
        Consider a simple corpus with the vocabulary: ["apple", "banana", "cherry", "date"]. If a document only contains the words: "apple cherry apple", the corresponding BoW vector would 
        be [2, 0, 1, 0]. This vector indicates that "apple" appears twice, "banana" and "date" do not appear, and "cherry" appears once. 
    </p>
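
    <p className="subsubsection-paragraph">
        The same example can be reproduced by hand with a few lines of plain Python. This is a minimal sketch that assumes whitespace tokenization and a fixed vocabulary order.
    </p>

    <p className="subsubsection-paragraph">
    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
        {`# Sketch: building a BoW vector by hand for a fixed vocabulary
from collections import Counter

vocabulary = ["apple", "banana", "cherry", "date"]
document = "apple cherry apple"

# Count each word, then read the counts off in vocabulary order
counts = Counter(document.split())
bow_vector = [counts[word] for word in vocabulary]

print(bow_vector)  # [2, 0, 1, 0]`}
    </SyntaxHighlighter>
    </p>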

    <p className="subsubsection-paragraph">
        While the BoW model is straightforward and effective for various tasks like document classification and spam filtering, it has limitations. It doesn't capture the order of words, making 
        it inadequate for understanding linguistic nuances like syntax and semantics. Also, the model can lead to high dimensionality if the vocabulary is large, which is a common scenario in
         natural language data. Run the following to see what a BoW matrix looks like:
    </p>

    <p className="subsubsection-paragraph">
    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
        {`# Example of creating a Bag of Words model using scikit-learn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "Hello, how are you?",
    "Winning is not everything, it's the only thing.",
    "Today is a beautiful day."
]

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Fit the model and transform the documents
bow_matrix = vectorizer.fit_transform(documents)

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame for better visualization
import pandas as pd
df = pd.DataFrame(bow_matrix.toarray(), columns=feature_names)
print(df)`}
    </SyntaxHighlighter>
    </p>
                </section>

                <section id="tfidf" className="code-cleaned">
                    <h2>TF-IDF</h2>
                    <p className="subsubsection-paragraph">
                        TF-IDF (term frequency-inverse document frequency) is a numerical statistic used in text mining and information retrieval to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting 
                        factor in searches, text mining, and user modeling. The value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word 
                        in the corpus, which helps to control for the fact that some words are generally more common than others.
                    </p>


    <p className="subsubsection-paragraph">
        The TF-IDF value is computed as follows:
        <ul>
            <li><strong>Term Frequency (TF):</strong> This measures how frequently a term occurs in a document. If <InlineMath math="t" /> is the term and <InlineMath math="d" /> is the document, 
            TF is calculated as:
            <div className="custom-math-size"><BlockMath math="\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}" /></div>
            </li>
            <li><strong>Inverse Document Frequency (IDF):</strong> This measures how rare the term is across a set of documents, so terms that appear in fewer documents receive a higher weight. The IDF of a term is calculated as:
            <div className="custom-math-size"><BlockMath math="\text{IDF}(t, D) = \log \left(\frac{\text{Total number of documents}}{\text{Number of documents with term } t \text{ in it}}\right)" /></div>
                where <InlineMath math="D" /> is the set of documents.
            </li>
        </ul>
        The TF-IDF score is the product of these two measures:
        <div className="custom-math-size"><BlockMath math="\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)" /></div>
    </p>
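
    <p className="subsubsection-paragraph">
        The formulas above can be computed directly, as in the sketch below. The toy documents are invented for illustration, the natural logarithm is an assumption 
        (the formula does not fix a base), and the helper names tf, idf, and tf_idf are hypothetical. Note how a word that appears in every document ends up with a score of 0.
    </p>

    <p className="subsubsection-paragraph">
    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
        {`# Sketch: TF-IDF computed straight from the definitions
import math

# Toy corpus, already tokenized (assumed for illustration)
documents = [
    ["the", "sky", "is", "blue"],
    ["the", "sun", "is", "bright"],
    ["the", "sun", "in", "the", "sky"],
]

def tf(term, doc):
    """Share of the document's tokens that are this term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(total documents / documents containing the term).
    Assumes the term occurs in at least one document."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" is in all three documents, so its IDF (and score) is 0
score_the = tf_idf("the", documents[0], documents)
# "blue" is in only one document, so it is weighted up
score_blue = tf_idf("blue", documents[0], documents)

print(score_the, score_blue)`}
    </SyntaxHighlighter>
    </p>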

    <p className="subsubsection-paragraph">
        TF-IDF is widely used in the field of information retrieval and text mining. It is a key technique in document classification, allowing models to evaluate and rank the relative importance of 
        terms in documents. This is crucial in tasks like keyword extraction, topic modeling, and many types of text classification, such as filtering spam emails.
    </p>

    <p className="subsubsection-paragraph">
    TF-IDF builds directly on the Bag of Words (BoW) model. BoW simply counts how often each word occurs in a document, without considering how the word is used across 
    other documents. TF-IDF keeps that within-document count but scales it by how common or rare the word is in the entire corpus. This mitigates one of 
    the primary limitations of BoW: the overemphasis on frequent but less informative words. Words that occur frequently in a specific document but rarely in the 
    rest of the corpus are more likely to matter to the meaning of that document, so TF-IDF gives a more informative feature space for tasks like information 
    retrieval and text classification.
    </p>

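    <p className="subsubsection-paragraph">Feel free to run the following code example to see TF-IDF in practice. Note that scikit-learn's TfidfVectorizer, with its default settings, uses a smoothed IDF and L2-normalizes each row, so the values will not match the formulas above exactly:</p>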

    <p className="subsubsection-paragraph">
    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
        {`# Example of calculating TF-IDF using scikit-learn's TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun."
]

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit the model and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame for better visualization
import pandas as pd
df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print(df)`}
    </SyntaxHighlighter>
    </p>
</section>


                    <section id="ngram" className="code-cleaned">
                        <h2>N-Grams</h2>
                        <p className="subsubsection-paragraph">
                            N-grams are contiguous sequences of 'n' items from a given sample of text or speech. In the context of NLP, these items are typically words, but they can also be 
                            characters or syllables. N-grams are used to model the probability of each item in a sequence, based on the occurrence of previous items. They are a simple yet effective 
                            tool for capturing context in text data, which is crucial for many linguistic models.
                        </p>

                        <p className="subsubsection-paragraph">
                            The 'n' in N-grams represents the number of items in the sequence. For example:
                            <ul>
                                <li><strong>Unigrams (1-gram):</strong> Each item is considered on its own (e.g., "the", "cat").</li>
                                <li><strong>Bigrams (2-gram):</strong> Sequences of two items (e.g., "the cat", "cat sat").</li>
                                <li><strong>Trigrams (3-gram):</strong> Sequences of three items (e.g., "the cat sat").</li>
                            </ul>
                            The choice of 'n' depends on the specific application and the balance required between capturing sufficient context and managing computational complexity.
                        </p>

                        <p className="subsubsection-paragraph">
                            An N-gram model predicts the probability of an item based on its preceding items. For a sequence of 
                            words <InlineMath math="w_1, w_2, ..., w_m" />, the probability of word <InlineMath math="w_i" /> given the <InlineMath math="N-1" /> preceding words is:
                            <div className="custom-math-size"><BlockMath math="P(w_i | w_{i-1}, w_{i-2}, ..., w_{i-(N-1)})" /></div>
                            In practice, this is approximated using frequency counts from a corpus. The value of N often comes up as a hyperparameter choice in many models. Much of this will 
                            be discussed again as we get into the details of each specific model, but for now, just understand that N sets how many neighboring items we consider together 
                            in a given document (piece of text). A basic analysis could be to count the phrases defined by that neighborhood size across documents and 
                            rank them; the most frequent can be representative of the general topics within that set of documents.
                        </p>
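
                        <p className="subsubsection-paragraph">
                            As a rough sketch of approximating these probabilities with frequency counts, the snippet below builds n-grams with plain Python and estimates a bigram probability as a 
                            ratio of counts. The toy sentence and the helper names ngrams and bigram_probability are assumptions made for this example, and no smoothing is applied.
                        </p>

                        <p className="subsubsection-paragraph">
    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
        {`# Sketch: n-gram extraction and a count-based bigram probability
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token tuples, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy text (assumed for illustration), split on whitespace
tokens = "the cat sat on the mat the cat ran".split()

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

def bigram_probability(prev_word, word):
    """Estimate P(word | prev_word) as count(prev_word, word) / count(prev_word)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[(prev_word,)]

# "the" occurs 3 times and is followed by "cat" twice
p_cat_given_the = bigram_probability("the", "cat")

# Ranking n-grams by count gives a crude view of recurring phrases
top_bigrams = bigram_counts.most_common(1)

print(p_cat_given_the)
print(top_bigrams)`}
    </SyntaxHighlighter>
                    </p>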

                        <p className="subsubsection-paragraph">
                            Below is a Python example of how to create bigrams from text using NLTK:
                        </p>

                        <p className="subsubsection-paragraph">
    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
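        {`# Example of building bigrams with NLTK
# word_tokenize relies on the tokenizer data downloaded in the tokenization example
import nltk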
from nltk.util import bigrams

text = "I need to write a sentence with some words"
tokens = nltk.word_tokenize(text)
bigrams_list = list(bigrams(tokens))

print(bigrams_list)`}
    </SyntaxHighlighter>
                    </p>
                    </section>


                    <section id="rep" className="code-cleaned">
                        <h2>Miscellaneous</h2>
                        <p className="subsubsection-paragraph">
                            This section collects smaller concepts that are worth a brief mention; I will add to it as more come up.
                        </p>

                        <h4>One-Hot Encoding vs. Dense Representations</h4>
                        <p className="subsubsection-paragraph">
                            One-hot encoding represents each word as a sparse vector with a 1 in the position corresponding to the word in the vocabulary and 0s everywhere else, so the vector is as long as the vocabulary. 
                            Dense representations, also known as word embeddings, represent each word as a real-valued vector with far fewer dimensions than the vocabulary size, and most of its entries are nonzero.
                        </p>
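
                        <p className="subsubsection-paragraph">
                            A minimal sketch of one-hot encoding follows, assuming a four-word toy vocabulary and a hypothetical helper named one_hot. It also shows why one-hot vectors 
                            carry no notion of similarity: the dot product of any two different words is always 0.
                        </p>

                        <p className="subsubsection-paragraph">
    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
        {`# Sketch: one-hot vectors over a toy vocabulary
vocabulary = ["apple", "banana", "cherry", "date"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

cherry_one_hot = one_hot("cherry")
apple_one_hot = one_hot("apple")

# Different words never share a nonzero position, so the dot product is 0
dot = sum(a * b for a, b in zip(apple_one_hot, cherry_one_hot))

print(cherry_one_hot)  # [0, 0, 1, 0]
print(dot)             # 0`}
    </SyntaxHighlighter>
                        </p>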

                        <h4>OOV</h4>
                        <p className="subsubsection-paragraph">
                            Out-of-vocabulary (OOV) issues occur when a model encounters a word that is not in the vocabulary built from its training corpus. It is a common challenge in NLP, as new words, names, and misspellings 
                            are continually encountered. Strategies to handle OOV words include using subword units or character-level representations.
                        </p>
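
                        <p className="subsubsection-paragraph">
                            Another simple fallback, not listed above but often seen in practice, is to map every unseen word to a reserved unknown token. The sketch below assumes a toy training 
                            sentence and a hypothetical encode helper; subword and character-level approaches avoid this loss of information.
                        </p>

                        <p className="subsubsection-paragraph">
    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
        {`# Sketch: mapping out-of-vocabulary words to a reserved unknown token
UNK = "<unk>"

# Toy training text (assumed for illustration)
training_tokens = "the cat sat on the mat".split()

# Reserve id 0 for unknown words, then assign ids in order of first appearance
vocab = {UNK: 0}
for token in training_tokens:
    vocab.setdefault(token, len(vocab))

def encode(tokens):
    """Look up each token's id, falling back to the unknown id."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]

# "dog" was never seen during training, so it maps to id 0
ids = encode(["the", "dog", "sat"])

print(ids)  # [1, 0, 3]`}
    </SyntaxHighlighter>
                        </p>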

                    </section>

                
                
                <div className="subsubsection-navigation">
                    <Link to="/NLPBasics">← NLP Basics</Link>
                    <Link to="/NLPBasics/ling">Linguistics →</Link>
                </div>
            </main>
            
            <Footer />
        </div>
    );
}

export default NLPData;
