import React from 'react';
import '../../styles/subsection.css';
import Header from '../../components/Header';
import Footer from '../../components/Footer';
import { Link } from 'react-router-dom';
import 'katex/dist/katex.min.css';
import { InlineMath, BlockMath } from 'react-katex';
import { LightAsync as SyntaxHighlighter } from 'react-syntax-highlighter';
import { docco } from 'react-syntax-highlighter/dist/esm/styles/hljs';

// Image imports
import const_parsing from '../../media/Linguistics/const_parsing.png';
import depends_parsing from '../../media/Linguistics/depends_parsing.png';

function Linguistics() {
    return (
        <div className="subsubsection-container">
            <Header />
            <div className="side-nav-container">
                <aside className="subsubsection-side-nav">
                    <a href="#stem">Stemming & Lemmatization</a>
                    <a href="#tagging">POS Tagging</a>
                    <a href="#ner">NER</a>
                    <a href="#parsing">Parsing</a>
                    <a href="#beyond">Other</a>
                </aside>
            </div>
            
            <main className="subsubsection-content">
                <div className="titles"><h1>Linguistics</h1></div>

                <section id="stem" className="code-cleaned">
                <h2>Stemming & Lemmatization</h2>

                <p className="subsubsection-paragraph">
                    In this section, we'll continue to learn about important concepts and terms within NLP. If their use doesn't seem immediately obvious, don't worry: many of these methods 
                    are tools you can keep in the back of your head, and their utility will become more apparent later on when you are creating your own models.
                </p>

                <h4>Morphology</h4>
                <p className="subsubsection-paragraph">
                    Morphology is the branch of linguistics concerned with the study of the form and structure of words, particularly through morphemes, which are the smallest grammatical units in a
                     language. A morpheme may be as small as a single letter, such as 's' indicating possession, or a complex word element that conveys specific meaning, like "un-" denoting negation.
                </p>

                <p className="subsubsection-paragraph">
                    Some more jargon -- there are two primary types of morphemes:
                    <ul>
                        <li><strong>Stems or Roots:</strong> The central part of the word that holds the basic meaning.</li>
                        <li><strong>Affixes:</strong> These are modifiers such as prefixes, suffixes, infixes, and circumfixes that alter the meaning of the stems.</li>
                    </ul>
                </p>

                <h4>Stemming</h4>
                <p className="subsubsection-paragraph">
                    Stemming is the process of reducing words to their stem, base, or root form, which is not necessarily a valid dictionary word. The idea is to strip affixes (mostly suffixes in English) to get to a 
                    base form of the word. For example, a stemmer might reduce "fishing", "fished", and "fisher" to the stem "fish", though the exact output depends on the algorithm.
                </p>
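                <p className="subsubsection-paragraph">
                    To make the idea of affix removal concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. It is a toy for illustration only: the suffix list 
                    and the minimum stem length are assumptions chosen for this example, and real stemmers such as Porter's use a much richer set of rules.

                    <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
{`# Toy suffix-stripping stemmer (illustrative only; not the Porter algorithm)
SUFFIXES = ("ing", "ed", "er", "s")  # hypothetical suffix list, checked in order
MIN_STEM_LENGTH = 3                  # assumed lower bound so short words survive

def toy_stem(word):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["fishing", "fished", "fisher", "fish"]])`}
                    </SyntaxHighlighter>

                    All four words come out as "fish". A word like "sing" is left alone only because of the minimum length check, which hints at how easily purely mechanical rules can go wrong.
                </p>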
                <p className="subsubsection-paragraph">
                Morphological analysis is fundamental to various NLP applications, from information retrieval and text processing to machine translation and speech recognition. 
                Here's how you might apply a stemmer in Python, using the Porter stemmer from the NLTK library:

                <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
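{`from nltk.stem import PorterStemmer

# The Porter stemmer strips suffixes using a fixed set of rules
ps = PorterStemmer()
words = ["program", "programs", "programmer", "programming", "programmers"]

# Stem each word; the results are stems, not necessarily dictionary words
stems = [ps.stem(word) for word in words]
print(stems)`}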
    </SyntaxHighlighter>
                </p>



                        <h4>Lemmatization</h4>
                        <p className="subsubsection-paragraph">
                            Lemmatization is the algorithmic process of determining the lemma for a given word. It involves linguistic analysis to remove inflectional endings and return the base or 
                            dictionary form of a word, which is known as the lemma.
                        </p>

                        <p className="subsubsection-paragraph">
                            The process requires understanding the word's part of speech, its tense, and its role in the sentence. Lemmatization typically uses a lexical knowledge base like WordNet,
                             along with a set of language-specific rules, to correctly identify the lemma. For instance, the verb 'running' is lemmatized to 'run', and the comparative adjective 'better' is lemmatized to 'good'. This relies on dictionary lookup and 
                            morphological analysis rather than a simple suffix-stripping heuristic. Unlike stemming, which might reduce 'university' to the non-word 'univers', lemmatization would leave 'university' unchanged, recognizing it as a base form. 
                            Lemmatization is therefore more accurate than stemming, but also slower and more computationally intensive.
                        </p>
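                        <p className="subsubsection-paragraph">
                            The role of part of speech can be sketched with a tiny lookup-based lemmatizer. The table below is a hypothetical stand-in for a real resource like WordNet, and the 
                            function name is made up for this example; the point is only that the same surface form can map to different lemmas depending on how it is used.

                            <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
{`# Toy lemmatizer: a lookup keyed on (word, part of speech).
# LEMMA_TABLE is a hypothetical, hand-written stand-in for a lexical database.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("better", "a"): "good",      # comparative adjective
    ("meeting", "v"): "meet",     # "they are meeting today"
    ("meeting", "n"): "meeting",  # "the meeting ran long"
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("better", "a"))
print(toy_lemmatize("meeting", "n"))
print(toy_lemmatize("meeting", "v"))
print(toy_lemmatize("university", "n"))`}
                            </SyntaxHighlighter>

                            Words missing from the table, such as "university", are returned unchanged, which mirrors the behaviour described above for words that are already in their base form.
                        </p>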


                        <p className="subsubsection-paragraph">
                            Accurate lemmatization is essential for tasks that depend on precise word identification, such as semantic reasoning, text analysis, and information retrieval systems 
                            where context and accuracy are critical.
                        </p>

                        <p className="subsubsection-paragraph">
                            Here is an example of lemmatization using the NLTK library:

                            <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
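{`import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # lexical database the lemmatizer relies on

lemmatizer = WordNetLemmatizer()
words = ["am", "are", "is"]
# pos="v" tells the lemmatizer to treat each word as a verb
lemmas = [lemmatizer.lemmatize(word, pos="v") for word in words]
print(lemmas)`}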
    </SyntaxHighlighter>
                        </p>


    <p className="subsubsection-paragraph">
    Note that the lemma for "am," "are," and "is" is "be" because in English grammar, these words are various conjugations of the verb "to be" in the present tense. Lemmatization involves reducing a word to 
    its base or dictionary form, and for verbs, this typically means converting them to the infinitive. </p>
                    </section>

                
                    <section id="tagging" className="code-cleaned">
                        <h2>POS Tagging</h2>
                        <p className="subsubsection-paragraph">
                            Part-of-Speech tagging is a process in NLP where words in a text are marked with their corresponding part of speech. This grammatical classification is crucial for
                             understanding the roles that words play in sentences and for further linguistic processing of text.
                        </p>

                        <p className="subsubsection-paragraph">
                            POS tagging can be rule-based, relying on language-specific grammatical rules, or stochastic, where it utilizes statistical methods based on a corpus of annotated text. 
                            Modern approaches typically involve machine learning algorithms that can handle complex patterns and contextual nuances. As an example, consider the sentence "The quick brown fox jumps over 
                            the lazy dog." -- a POS tagger would annotate each word with its respective part of speech, such as noun (N), verb (V), adjective (ADJ), etc., helping to parse the sentence
                             structure and meaning. In this case, we would have the following:
                             <ul>
                                <li>The (Determiner)</li>
                                <li>quick (Adjective)</li>
                                <li>brown (Adjective)</li>
                                <li>fox (Noun)</li>
                                <li>jumps (Verb)</li>
                                <li>over (Preposition)</li>
                                <li>the (Determiner)</li>
                                <li>lazy (Adjective)</li>
                                <li>dog (Noun)</li>
                             </ul>
                        </p>
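                        <p className="subsubsection-paragraph">
                            As a rough sketch of the rule-based approach, the toy tagger below combines a small hand-written lexicon with a single fallback rule. Both the lexicon and the rule 
                            are assumptions made up for this sentence; they would not generalize to real text.

                            <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
{`# Toy rule-based POS tagger: a hand-written lexicon plus one suffix rule.
# LEXICON and the fallback rule are hypothetical and cover only this example.
LEXICON = {
    "the": "Determiner",
    "quick": "Adjective", "brown": "Adjective", "lazy": "Adjective",
    "fox": "Noun", "dog": "Noun",
    "over": "Preposition",
}

def toy_tag(tokens):
    """Tag each token via the lexicon, guessing from the suffix otherwise."""
    tagged = []
    for token in tokens:
        tag = LEXICON.get(token.lower())
        if tag is None:
            # Fallback rule: unknown words ending in "s" are guessed to be verbs
            tag = "Verb" if token.endswith("s") else "Noun"
        tagged.append((token, tag))
    return tagged

for word, tag in toy_tag("The quick brown fox jumps over the lazy dog".split()):
    print(word, tag)`}
                            </SyntaxHighlighter>

                            Here "jumps" is not in the lexicon, so the fallback rule labels it a verb. The same rule would wrongly tag a plural noun like "cats", which is exactly the kind of 
                            ambiguity that statistical taggers resolve using context.
                        </p>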

                        <p className="subsubsection-paragraph">
                            Accurate POS tagging is vital for syntactic parsing, word sense disambiguation, named entity recognition (covered in the next section), and improving the quality of machine translation.
                        </p>

                        <p className="subsubsection-paragraph">
                            The main challenges in POS tagging are resolving word-class ambiguity (for example, "book" can be a noun or a verb), which requires contextual understanding, and adapting 
                            to the variability across different languages.
                        </p>

                        <p className="subsubsection-paragraph">
                            A simple Python example using the NLTK library for POS tagging is as follows:

                            <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
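{`import nltk
# Download the tagger model and tokenizer data (newer NLTK releases may
# instead ask for 'averaged_perceptron_tagger_eng' and 'punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')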

from nltk import word_tokenize, pos_tag

sentence = "Natural language processing empowers computational systems to understand human language."
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)

print(pos_tags)`}
        </SyntaxHighlighter>
                            </p>


                </section>



                <section id="ner" className="code-cleaned">
                    <h2>Named Entity Recognition (NER)</h2>
                    <p className="subsubsection-paragraph">
                        Named Entity Recognition is a process that identifies named entities in text and classifies them into predefined categories, such as the names of persons, 
                        organizations, and locations, as well as time expressions, quantities, monetary values, and percentages. As an example, in the sentence "George Washington went to Washington," 
                        an NER system would identify "George Washington" as a person and the second "Washington" as a location.
                    </p>

                    <p className="subsubsection-paragraph">
                        Essentially, <InlineMath math="P(\text{Entity Type} | \text{Context of Word})" /> represents the probability of a certain type of entity given the context of the word. NER models 
                        compute this to classify words into entities (categories).
                    </p>

                    <p className="subsubsection-paragraph">
                        There are various types of NER classifiers, such as rule-based, list-based, and advanced machine learning models including Conditional Random Fields (CRFs) and neural 
                        network approaches like LSTM networks. 
                    </p>
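                    <p className="subsubsection-paragraph">
                        The simplest of these, the list-based approach, can be sketched in a few lines. The gazetteer (entity list) below is hypothetical and holds only the two entries needed for 
                        the earlier example; the sketch tries the longest match first so that "George Washington" is found before the shorter "Washington".

                        <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
{`# Toy list-based NER: look up token spans in a gazetteer, longest match first.
# GAZETTEER is a hypothetical entity list made up for this example.
GAZETTEER = {
    ("george", "washington"): "PERSON",
    ("washington",): "LOCATION",
}
MAX_SPAN = max(len(key) for key in GAZETTEER)

def toy_ner(tokens):
    """Return (entity text, label) pairs found by gazetteer lookup."""
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span starting at position i first
        for n in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
            key = tuple(token.lower() for token in tokens[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entity starts here; move to the next token
    return entities

print(toy_ner("George Washington went to Washington".split()))`}
                        </SyntaxHighlighter>

                        A lookup like this cannot tell whether a lone "Washington" is a person, a city, or a state. Statistical and neural models address that by estimating the entity type from the surrounding context.
                    </p>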

                    <p className="subsubsection-paragraph">
                        NER is used in many applications like information retrieval, content classification, and knowledge graph creation. It's particularly useful in areas requiring understanding of 
                        the text content, such as legal document analysis and news aggregation. We will be revisiting some of these as we continue to move through these sections.
                    </p>

                    <h4>An Example</h4>
                    <p className="subsubsection-paragraph">
                        Here's a Python example using the spaCy library to perform NER:

                        <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
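{`import spacy

# Load the small English pipeline (install it first with:
# python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")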

text = "Apple is looking at buying U.K. startup for $1 billion"

doc = nlp(text)

for entity in doc.ents:
    print(entity.text, entity.label_)`}
        </SyntaxHighlighter>
                        </p>

                </section>


                <section id="parsing" className="code-cleaned">
                    <h2>Parsing</h2>

                    <p className="subsubsection-paragraph">
                        Parsing in NLP is the process of analyzing the structure of a sentence by identifying its constituents and the syntactic relations between them. It involves breaking a sentence down into 
                        its component parts according to a grammar. The primary goal of parsing is to understand how the parts of a sentence (like nouns, verbs, adjectives) are 
                        organized to convey meaning.
                    </p>

            
                    <p className="subsubsection-paragraph">
                        The two primary techniques are constituency parsing and dependency parsing, each offering a different view of sentence structure:

                        <ul>
                            <li><strong>Constituency Parsing: </strong> Identifies the constituents (noun phrases, verb phrases, etc.) of a sentence and organizes them into a parse tree, which 
                            reflects the syntax of the sentence according to a formal grammar. The tree shows how different parts of the sentence group together to form phrases and clauses.</li> <br />

                            <div className="flex-container"><img src={const_parsing} alt="Constituency parse tree" className="image-tiny"/></div><br />

                            The labels above are defined as follows: S: Sentence, the root of the tree;
                              NP: Noun Phrase, a phrase that functions as a noun;
                             VP: Verb Phrase, a phrase that functions as a verb;
                              NNP: Proper Noun, singular;
                              VBD: Verb, past tense. <br /> <br />



                            <li><strong>Dependency Parsing:</strong> Focuses on the relationships between words in a sentence. It establishes a dependency tree where the nodes are words, and the edges 
                            represent dependencies between them. These are the grammatical relationships (like subject, object, modifier) between words, indicating how each word depends on or is 
                            connected to others.</li> <br />

                            <div className="flex-container"><img src={depends_parsing} alt="Dependency parse tree" className="image-tiny"/></div> <br />

                            Here, the nodes are read as follows: loves: the main verb and the root of the parse;
                            John: the subject of the verb "loves";
                            Mary: the object of the verb "loves".

                        </ul>
                    </p>
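                    <p className="subsubsection-paragraph">
                        One way to picture a dependency parse is as a list of word-to-head links. The sketch below encodes the "John loves Mary" example by hand; the relation names (root, nsubj, obj) 
                        follow common usage but should be treated as illustrative, and the helper function is made up for this example.

                        <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
{`# Hand-written dependency parse of "John loves Mary".
# Each entry is (word, relation, head); the root has no head.
dependencies = [
    ("loves", "root", None),
    ("John", "nsubj", "loves"),  # subject of "loves"
    ("Mary", "obj", "loves"),    # object of "loves"
]

def children_of(head, deps):
    """Return (word, relation) pairs for every word attached to the head."""
    return [(word, relation) for word, relation, h in deps if h == head]

root = next(word for word, relation, h in dependencies if h is None)
print(root)
print(children_of(root, dependencies))`}
                        </SyntaxHighlighter>

                        Every word except the root has exactly one head, which is what makes the structure a tree.
                    </p>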

                    <p className="subsubsection-paragraph">
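                        Below is an example of chunking (shallow parsing) with the NLTK library. It is a lightweight approximation of constituency parsing that groups tagged tokens into noun phrases: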

                        <SyntaxHighlighter language="python" style={docco}  className="codeStyle_small">
{`import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from nltk import pos_tag, word_tokenize

sentence = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(sentence)
tags = pos_tag(tokens)

# Chunk grammar: a noun phrase (NP) is an optional determiner,
# any number of adjectives, and then a singular noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tags)

print(result)   # text rendering of the chunk tree
result.draw()   # opens a window with the tree (needs a GUI / Tkinter)`}
        </SyntaxHighlighter>
                        </p>


                  
                    <p className="subsubsection-paragraph">
                        As an aside, parsing can be computationally expensive, and it becomes more challenging with complex sentences, ambiguous grammar, and language idiosyncrasies.
                    </p>
                </section>


                <section id="beyond" className="code-cleaned">
                    <h2>Additional Fields</h2>
                    <p className="subsubsection-paragraph">
                        Here, I will summarize some additional sub-fields of work that are related to processing language.
                    </p>

                    <h4>Semantic Role Labeling</h4>
                    <p className="subsubsection-paragraph">
                        Semantic Role Labeling (SRL) is about understanding the "who did what to whom" in a sentence. It assigns labels such as Agent, Patient, Instrument, etc., to parts of a sentence
                        (these are referred to as "semantic roles").
                        In other words, SRL involves identifying verbs and their corresponding arguments (like subjects, objects) and assigning semantic roles to these arguments. For instance, 
                        in the sentence "John gave Mary a book," John is the giver (Agent), Mary is the receiver (Recipient, sometimes labeled Beneficiary), and "a book" is the thing given (Theme). <br/> <br/>

                        As another example, consider the sentence "Alice drove to the supermarket." With SRL, we would look to achieve the following:

                        <ul>
                            <li>"Alice" - Agent (the one who is driving)</li>
                            <li>"drove" - Predicate (the action)</li>
                            <li>"to the supermarket" - Goal (the destination of the driving)</li>
                        </ul>

                        Semantic Role Labeling requires an understanding of both syntax and semantics, and it helps NLP systems move from the grammatical form of a sentence to a representation of 
                        the event it describes. It differs from parsing in that parsing captures grammatical structure (which words are nouns, verbs, and so on, and how they group together), 
                        whereas SRL captures the role each phrase plays in the action being described. 
                    </p>

                    <h4>Coreference Resolution</h4>

                    <p className="subsubsection-paragraph">
                    Coreference resolution is the task of finding all expressions in a text that refer to the same entity, such as a name and the pronouns that later stand in for it. 
                    Resolving these references correctly is part of what lets a system follow a narrative or discourse, since it ties the various parts of the text to the entities they 
                    are about.
                    </p>

                    <p className="subsubsection-paragraph">
                    Consider a simple narrative: "John went to his favorite restaurant. He ordered the usual meal. The waiter knew exactly what he wanted."

                    To resolve the references in this text, an NLP system must understand that "his", "He", and "he" all refer to "John", and that "the usual meal" refers to a specific meal that John regularly orders at 
                    this restaurant, even though the meal itself isn't explicitly described.
                    </p>

                    <p className="subsubsection-paragraph">
                        Coreference resolution is one ingredient of a text's overall coherence. Some key types of coherence:
                    <ul>
                        <li>
                            <strong>Referential Coherence:</strong> This involves managing how entities and concepts are introduced and maintained throughout the text. It includes resolving 
                            pronouns and ensuring entity consistency.
                        </li>
                        <li>
                            <strong>Logical Coherence:</strong> Ensuring the text follows a logical structure, with clear cause-effect relationships and logical progression of ideas.
                        </li>
                        <li>
                            <strong>Temporal Coherence:</strong> Involves maintaining a consistent and logical timeline within the narrative, ensuring chronological order in event descriptions.
                        </li>
                        <li>
                            <strong>Thematic Coherence:</strong> Relates to maintaining consistency in themes and topics, ensuring that all parts of the text contribute meaningfully to the overall theme.
                        </li>
                    </ul></p>

                    <p className="subsubsection-paragraph">
                        By linking pronouns and other mentions back to the entities they refer to, coreference resolution is crucial for tasks like document summarization and question answering. 
                    </p>

                    <h4>Relation Extraction</h4>
                    <p className="subsubsection-paragraph">
                        Relation Extraction involves identifying and classifying semantic relationships between entities within a text, often used to build knowledge graphs.
                        <div className="custom-math-size"><BlockMath math="\text{Entity}_1 \xrightarrow[\text{relationship}]{\text{extract}} \text{Entity}_2" /></div>
                        While this equation is admittedly not very informative, it simply illustrates that 
                        you are looking for two entities that are related in some way (e.g. family members), for example through the use of embeddings. 
                    </p>
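                    <p className="subsubsection-paragraph">
                        As a rough illustration, the sketch below pulls (entity, relationship, entity) triples out of text with a single hand-written pattern. Note that this swaps in a simple 
                        pattern-matching approach rather than embeddings, and the pattern, the list of relationships, and the example sentence are all made up for this example.

                        <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
{`import re

# Hypothetical pattern: "<Name> is the <relationship> of <Name>"
PATTERN = re.compile(
    r"([A-Z][a-z]+) is the (mother|father|sister|brother) of ([A-Z][a-z]+)"
)

def extract_relations(text):
    """Return (entity_1, relationship, entity_2) triples found in the text."""
    return [match.groups() for match in PATTERN.finditer(text)]

text = "Anna is the mother of Ben. Ben is the brother of Cara."
print(extract_relations(text))`}
                        </SyntaxHighlighter>

                        Each triple can be read as an edge in a knowledge graph, with the two entities as nodes and the relationship as the edge label.
                    </p>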

                    <h4>Record Linkage</h4>
                    <p className="subsubsection-paragraph">
                        Record Linkage is the task of linking records from different databases that refer to the same entity, which is fundamental for data cleaning and integration. For example, you could have 
                        a giant database of all businesses in Canada -- this actually exists and is called the Business Register. Often, there are duplicates within this data set due to multiple 
                        filings by different members of the same business; finding these matches would be an example of a record linkage problem. 
                    </p>
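                    <p className="subsubsection-paragraph">
                        A minimal sketch of this idea is to normalize the business names and then compare them with a string-similarity score. The example below uses SequenceMatcher from 
                        Python's standard difflib module; the records, the list of legal suffixes, and the 0.85 threshold are all assumptions chosen for illustration.

                        <SyntaxHighlighter language="python" style={docco} className="codeStyle_small">
{`from difflib import SequenceMatcher

LEGAL_SUFFIXES = {"inc", "ltd", "corp", "co"}  # hypothetical list to ignore
THRESHOLD = 0.85                               # hypothetical similarity cut-off

def normalize(name):
    """Lowercase, drop punctuation, and remove common legal suffixes."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(w for w in cleaned.split() if w not in LEGAL_SUFFIXES)

def similarity(a, b):
    """Similarity between two names, from 0.0 (different) to 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Made-up records; the third contains a typo
records = [
    "Maple Leaf Bakery Inc.",
    "Maple Leaf Bakery",
    "Mapel Leaf Bakery Ltd",
    "Northern Lights Plumbing",
]

# Compare every pair of records and keep the likely duplicates
matches = [
    (a, b)
    for i, a in enumerate(records)
    for b in records[i + 1:]
    if similarity(a, b) >= THRESHOLD
]
for a, b in matches:
    print(a, "~", b)`}
                        </SyntaxHighlighter>

                        Comparing every pair scales quadratically with the number of records, so real systems usually first group records into candidate blocks (for example, by postal code) 
                        and only compare within each block.
                    </p>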
                </section>

                
                
                <div className="subsubsection-navigation">
                    <Link to="/nlpbasics/data">← NLP Data</Link>
                    <Link to="/nlpbasics/semantic">Semantic & Sentiment Analysis →</Link>
                </div>
            </main>
            
            <Footer />
        </div>
    );
}

export default Linguistics;
