
NATURAL LANGUAGE PROCESSING (BAI601)

Course Code: BAI601
CIE Marks: 50
Teaching Hours/Week (L:T:P:S): 3:0:2:0
SEE Marks: 50
Total Hours of Pedagogy: 40 hours Theory + 8-10 Lab slots
Total Marks: 100
Credits: 04
Exam Hours: 03
Examination nature (SEE): Theory



MODULE-1

Introduction: What is Natural Language Processing? Origins of NLP, Language and Knowledge, The Challenges of NLP, Language and Grammar, Processing Indian Languages, NLP Applications.

Language Modeling: Statistical Language Model - N-gram model (unigram, bigram), Paninian Framework, Karaka theory.

Textbook 1: Ch. 1, Ch. 2.



MODULE-2

Word Level Analysis: Regular Expressions, Finite-State Automata, Morphological Parsing, Spelling Error Detection and Correction, Words and Word Classes, Part-of-Speech Tagging.

Syntactic Analysis: Context-Free Grammar, Constituency, Top-down and Bottom-up Parsing, CYK Parsing.

Textbook 1: Ch. 3, Ch. 4.



MODULE-3

Naive Bayes, Text Classification and Sentiment: Naive Bayes Classifiers, Training the Naive Bayes Classifier, Worked Example, Optimizing for Sentiment Analysis, Naive Bayes for Other Text Classification Tasks, Naive Bayes as a Language Model.

Textbook 2: Ch. 4.



MODULE-4

Information Retrieval: Design Features of Information Retrieval Systems, Information Retrieval Models - Classical, Non-classical, Alternative Models of Information Retrieval - Cluster model, Fuzzy model, LSI model, Major Issues in Information Retrieval.

Lexical Resources: WordNet, FrameNet, Stemmers, Part-of-Speech Tagger, Research Corpora.

Textbook 1: Ch. 9, Ch. 12.



MODULE-5

Machine Translation: Language Divergences and Typology, Machine Translation using Encoder-Decoder, Details of the Encoder-Decoder Model, Translating in Low-Resource Situations, MT Evaluation, Bias and Ethical Issues.

Textbook 2: Ch. 13.



PRACTICAL COMPONENT OF IPCC

Experiments

1 Write a Python program for the following preprocessing of text in NLP:

● Tokenization

● Filtration

● Script Validation

● Stop Word Removal

● Stemming
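A minimal pure-Python sketch of this preprocessing pipeline (the stoplist, Latin-script pattern, and suffix-stripping stemmer here are illustrative stand-ins; in the lab, NLTK's stopword list and PorterStemmer would typically be used instead):

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}  # tiny illustrative stoplist

def tokenize(text):
    # Tokenization: split into words, numbers, and punctuation marks
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)

def filtrate(tokens):
    # Filtration: keep only alphabetic tokens (drop punctuation and numbers)
    return [t for t in tokens if t.isalpha()]

def validate_script(tokens, pattern=r"^[A-Za-z]+$"):
    # Script validation: keep tokens written in the expected script (Latin here)
    return [t for t in tokens if re.match(pattern, t)]

def remove_stop_words(tokens):
    # Stop word removal against the stoplist
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def stem(tokens):
    # Naive suffix stripping; a real lab would use a Porter-style stemmer
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]

text = "The cats were chasing the mice in the garden."
processed = stem(remove_stop_words(validate_script(filtrate(tokenize(text)))))
print(processed)  # → ['cat', 'were', 'chas', 'mice', 'garden']
```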

2 Demonstrate N-gram modeling to analyze and establish the probability distribution across sentences. Explore the use of unigrams, bigrams, and trigrams on diverse English sentences to illustrate the impact of varying n-gram order on the calculated probabilities.
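A small sketch of maximum-likelihood bigram estimation on a toy corpus (the three sentences and the `<s>`/`</s>` boundary markers are illustrative assumptions, not from the syllabus):

```python
from collections import Counter

corpus = [
    "<s> i love natural language processing </s>",
    "<s> i love language models </s>",
    "<s> language models love data </s>",
]

# Unigram counts over the whole corpus; bigram counts within each sentence
tokens = [w for sent in corpus for w in sent.split()]
unigrams = Counter(tokens)
bigrams = Counter()
for sent in corpus:
    ws = sent.split()
    bigrams.update(zip(ws, ws[1:]))

def p_bigram(w_prev, w):
    # MLE estimate: P(w | w_prev) = count(w_prev, w) / count(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def sentence_prob(sentence):
    # Chain-rule product of bigram probabilities
    ws = sentence.split()
    p = 1.0
    for prev, w in zip(ws, ws[1:]):
        p *= p_bigram(prev, w)
    return p

print(p_bigram("i", "love"))                       # 1.0 (every "i" is followed by "love")
print(sentence_prob("<s> i love data </s>"))       # 2/3 * 1 * 1/3 * 1 ≈ 0.2222
```

Swapping the bigram counts for trigram counts (windows of three tokens) shows how higher n-gram orders sharpen, but also sparsify, the estimated probabilities.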

3 Investigate the Minimum Edit Distance (MED) algorithm and its application in string comparison. The goal is to understand how the algorithm efficiently computes the minimum number of edit operations required to transform one string into another.

● Test the algorithm on strings with different types of variations (e.g., typos, substitutions, insertions, deletions)

● Evaluate its adaptability to different types of input variations
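A sketch of the standard dynamic-programming solution, with the substitution cost as a parameter so both common variants (cost 1 and cost 2) can be compared:

```python
def min_edit_distance(src, tgt, sub_cost=1):
    # d[i][j] = minimum cost of transforming src[:i] into tgt[:j]
    m, n = len(src), len(tgt)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                          # i deletions
    for j in range(1, n + 1):
        d[0][j] = j                          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if src[i - 1] == tgt[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution or match
    return d[m][n]

print(min_edit_distance("intention", "execution"))     # 5 (substitution cost 1)
print(min_edit_distance("intention", "execution", 2))  # 8 (substitution cost 2)
```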

4 Write a program to implement top-down and bottom-up parser using appropriate context free grammar.
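As one bottom-up option, CYK parsing (covered in Module 2) can be sketched in pure Python for a toy grammar in Chomsky Normal Form; the grammar below is an illustrative assumption, and in the lab NLTK's RecursiveDescentParser (top-down) and ShiftReduceParser (bottom-up) could be used instead:

```python
# Toy CNF grammar: binary rules map to nonterminal pairs, lexical rules to words
grammar = {
    "S":   [("NP", "VP")],
    "VP":  [("V", "NP")],
    "NP":  [("Det", "N")],
    "Det": ["the", "a"],
    "V":   ["saw", "ate"],
    "N":   ["dog", "cat", "man"],
}

def cyk_parse(words):
    n = len(words)
    # table[i][j] = set of nonterminals deriving words[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                 # fill the diagonal from lexical rules
        for lhs, rhss in grammar.items():
            if w in rhss:
                table[i][i].add(lhs)
    for span in range(2, n + 1):                  # combine shorter spans bottom-up
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for lhs, rhss in grammar.items():
                    for rhs in rhss:
                        if (isinstance(rhs, tuple)
                                and rhs[0] in table[i][k]
                                and rhs[1] in table[k + 1][j]):
                            table[i][j].add(lhs)
    return "S" in table[0][n - 1]                 # sentence accepted iff S spans it all

print(cyk_parse("the dog saw a cat".split()))     # True
print(cyk_parse("dog the saw".split()))           # False
```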

5 Given the following short movie reviews, each labeled with a genre, either comedy or action:

● fun, couple, love, love comedy

● fast, furious, shoot action

● couple, fly, fast, fun, fun comedy

● furious, shoot, shoot, fun action

● fly, fast, shoot, love action

and a new document D: fast, couple, shoot, fly.

Compute the most likely class for D. Assume a Naive Bayes classifier and use add-1 smoothing for the likelihoods.
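The hand calculation in this exercise can be checked with a short script that estimates the priors and add-1 smoothed likelihoods directly from the five labeled reviews:

```python
from collections import Counter
import math

train = [
    ("fun couple love love".split(), "comedy"),
    ("fast furious shoot".split(), "action"),
    ("couple fly fast fun fun".split(), "comedy"),
    ("furious shoot shoot fun".split(), "action"),
    ("fly fast shoot love".split(), "action"),
]
doc = "fast couple shoot fly".split()

vocab = {w for words, _ in train for w in words}
classes = {c for _, c in train}

def log_posterior(c):
    docs_c = [words for words, label in train if label == c]
    prior = len(docs_c) / len(train)                  # P(c)
    counts = Counter(w for words in docs_c for w in words)
    total = sum(counts.values())
    score = math.log(prior)
    for w in doc:
        # add-1 (Laplace) smoothed likelihood P(w | c)
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

best = max(classes, key=log_posterior)
print(best)  # → action
```

Working through the numbers: P(D | comedy)P(comedy) = 2/5 * (2 * 3 * 1 * 2)/16^4 while P(D | action)P(action) = 3/5 * (3 * 1 * 5 * 2)/18^4, so the action class wins.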

6 Demonstrate the following using an appropriate programming tool to illustrate the use of information retrieval in NLP:

● Study the various corpora – Brown, Inaugural, Reuters, udhr – with methods like fileids, raw, words, sents, categories

● Create and use your own corpora (plaintext, categorical)

● Study Conditional frequency distributions

● Study tagged corpora with methods like tagged_sents, tagged_words

● Write a program to find the most frequent noun tags

● Map Words to Properties Using Python Dictionaries

● Study Rule based tagger, Unigram Tagger

● Find the different words in a given plain text written without any spaces by comparing it against a given corpus of words, and also find the score of the words.
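The final word-segmentation task can be sketched with dynamic programming; the word list below stands in for the corpus, and counting matched words as the score is an illustrative choice:

```python
# Toy word list standing in for the given corpus of words (an assumption)
corpus_words = {"we", "like", "natural", "language", "processing", "nat", "prop"}

def segment(text):
    # best[i] = (score, segmentation) for the prefix text[:i];
    # score here = number of corpus words matched
    best = {0: (0, [])}
    max_len = max(len(w) for w in corpus_words)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in corpus_words and j in best:
                cand = (best[j][0] + 1, best[j][1] + [word])
                if i not in best or cand[0] > best[i][0]:
                    best[i] = cand
    return best.get(len(text))  # None if the text cannot be fully segmented

score, words = segment("welikenaturallanguageprocessing")
print(words, score)  # → ['we', 'like', 'natural', 'language', 'processing'] 5
```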

7 Write a Python program to find synonyms and antonyms of the word "active" using WordNet.

8 Implement a machine translation application of NLP by training a machine translation model for a language with limited parallel corpora. Investigate and incorporate techniques to improve performance in low-resource scenarios.


Suggested Learning Resources:

Textbook:

1. Tanveer Siddiqui, U.S. Tiwary, “Natural Language Processing and Information Retrieval”, Oxford University Press.

2. Daniel Jurafsky, James H. Martin, “Speech and Language Processing, An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition”, Pearson Education, 2023.


Reference Books:

1. Akshay Kulkarni, Adarsha Shivananda, “Natural Language Processing Recipes - Unlocking Text Data with Machine Learning and Deep Learning using Python”, Apress, 2019.

2. T V Geetha, “Understanding Natural Language Processing – Machine Learning and Deep Learning Perspectives”, Pearson, 2024.

3. Gerald J. Kowalski and Mark T. Maybury, “Information Storage and Retrieval Systems”, Kluwer Academic Publishers.
