
Natural language processing

[Image: AI graphic. Source: Preisler, 2024, p. 6.]


Introduction

Natural language processing (NLP) is a branch of artificial intelligence (AI) that enables computers to understand spoken and written human language; it powers text and speech recognition on everyday devices, for example. NLP combines computational linguistics, machine learning, and computer science to bridge the gap between human communication and machine understanding, allowing systems to process vast amounts of data, extract insights, identify patterns, and respond in ways that mimic human comprehension.

For decades, the US National Library of Medicine has used NLP techniques in automated indexing; see the NLM entry below.

Some common NLP techniques

Natural language processing encompasses various techniques that enable computers to process and understand human language efficiently.

Source: https://datasciencedojo.com/blog/natural-language-processing-applications/.

Some common techniques used in NLP include the following (short, hedged Python sketches of each appear after the list):
  1. Text preprocessing: several sub-techniques that prepare raw text data for analysis by cleaning and organizing it for machine learning algorithms. Text preprocessing can significantly improve the performance of NLP models by reducing noise and ensuring consistency in the data.
  2. Tokenization: breaks text into smaller units such as words or phrases; essential for tasks such as text analysis and language modeling. By converting text into tokens, NLP systems can manage and manipulate the data precisely; tokenization is the foundation for NLP tasks such as part-of-speech tagging and named entity recognition.
  3. Stemming: reduces words to a base or root form by stripping suffixes; "running" and "runs" are both reduced to "run." This normalizes words to a common base, facilitating text analysis and information retrieval. Because stemming is rule-based, it can produce non-dictionary forms (e.g., "studies" becomes "studi") and misses irregular forms such as "ran," but it is computationally efficient and useful in many text-processing applications.
  4. Lemmatization: considers context and converts words to their meaningful base form. For instance, "better" becomes "good." Unlike stemming, lemmatization ensures that the root word is a valid dictionary word, providing more accurate and contextually appropriate results. This technique is particularly useful in applications requiring a deeper understanding of language, such as sentiment analysis and machine translation.
  5. Parsing techniques: analyze the grammatical structure of sentences to understand their syntax and the relationships between words. Parsing is integral to natural language processing because it lets machines comprehend the structure and meaning of human language, enabling more accurate and context-aware interactions.
  6. Semantic analysis: aims to understand the meaning behind words and phrases. By interpreting semantics, machines can comprehend the intent and nuances of human communication, leading to more accurate and meaningful interactions; examples include Named Entity Recognition (NER) and Word Sense Disambiguation (WSD).
  7. Machine learning models: NLP relies on machine learning models (e.g., deep learning, supervised and unsupervised learning) for many tasks, enabling machines to learn from data and perform complex language-processing tasks with high accuracy.
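
The sketches below illustrate these techniques with common open-source Python libraries (NLTK, spaCy, scikit-learn); the example sentences, variable names, and model choices are illustrative assumptions, not part of any particular NLP system. First, a minimal preprocessing-and-tokenization sketch with NLTK (assumes NLTK is installed and can fetch its tokenizer and stopword data):

    import re
    import string

    import nltk

    # One-time data downloads (an environment assumption); newer NLTK
    # versions use "punkt_tab", older ones "punkt", so both are requested.
    for pkg in ("punkt", "punkt_tab", "stopwords"):
        nltk.download(pkg, quiet=True)

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    raw = "The runners were RUNNING quickly -- 3 of them, in fact!"

    # Text preprocessing: lowercase, then strip digits and punctuation
    # to reduce noise and make the text consistent.
    cleaned = raw.lower()
    cleaned = re.sub(r"\d+", "", cleaned)
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))

    # Tokenization: split the cleaned string into word tokens.
    tokens = word_tokenize(cleaned)
    print(tokens)
    # ['the', 'runners', 'were', 'running', 'quickly', 'of', 'them', 'in', 'fact']

    # Further preprocessing: drop English stopwords (frequent function words).
    stop = set(stopwords.words("english"))
    print([t for t in tokens if t not in stop])
    # ['runners', 'running', 'quickly', 'fact']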
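
Next, a sketch contrasting stemming and lemmatization with NLTK, assuming its WordNet data can be downloaded; the word lists are arbitrary examples:

    import nltk

    nltk.download("wordnet", quiet=True)  # lemma dictionary (one-time download)

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Stemming: fast, rule-based suffix stripping; can yield non-dictionary forms.
    for word in ["running", "runs", "studies", "connection"]:
        print(word, "->", stemmer.stem(word))
    # running -> run, runs -> run, studies -> studi, connection -> connect

    # Lemmatization: dictionary-based and context-aware; takes a part-of-speech hint.
    print(lemmatizer.lemmatize("ran", pos="v"))     # run   (irregular verb handled)
    print(lemmatizer.lemmatize("better", pos="a"))  # good  (adjective exception list)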
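
A minimal dependency-parsing sketch using spaCy, assuming the small English model en_core_web_sm has been installed (python -m spacy download en_core_web_sm); the sentence is an arbitrary example:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The librarian indexed the new citations quickly.")

    # Dependency parse: each token's part of speech, grammatical relation,
    # and the syntactic head it attaches to.
    for token in doc:
        print(f"{token.text:10} {token.pos_:6} {token.dep_:10} head={token.head.text}")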
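
A sketch of two semantic-analysis techniques: NER with spaCy, and word sense disambiguation with NLTK's classic Lesk algorithm (one simple WSD approach among many). The sentences are illustrative, and the entity labels printed depend on the model:

    import nltk
    import spacy

    nltk.download("wordnet", quiet=True)  # Lesk consults WordNet definitions

    from nltk.wsd import lesk

    # Named Entity Recognition (NER) with spaCy's small English model.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The National Library of Medicine in Bethesda adopted MetaMap in 2002.")
    for ent in doc.ents:
        print(ent.text, "->", ent.label_)  # labels such as ORG, GPE, DATE vary by model

    # Word Sense Disambiguation (WSD) with the Lesk algorithm: choose the
    # WordNet sense of "bank" whose gloss best overlaps the context words.
    sense = lesk("I deposited my paycheck at the bank".split(), "bank")
    print(sense, "-", sense.definition() if sense else "no sense found")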
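
Finally, a supervised machine-learning sketch with scikit-learn: a TF-IDF bag-of-words representation feeding a logistic regression classifier. The four-document training set and its labels are made up purely for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # A tiny, made-up labelled dataset (illustrative assumption only).
    texts = [
        "loved this paper, clear and useful",
        "excellent results and great writing",
        "terrible methodology, very confusing",
        "poorly argued and hard to follow",
    ]
    labels = ["positive", "positive", "negative", "negative"]

    # Supervised learning: TF-IDF features + logistic regression classifier.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)

    print(model.predict(["a clear and excellent study"]))  # likely ['positive']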

Presentation

[Embedded video presentation]

Note: This video is meant for informational purposes only. Any claims made in the video should be tested for accuracy and verified.

Apps using NLP

Natural language processing (NLP) applications power tools such as virtual assistants, chatbots, and language translation services. These systems analyze syntax, semantics, and context to perform tasks such as sentiment analysis, speech recognition, and text summarization. NLP enables voice-activated devices to respond to spoken commands and search engines to deliver relevant results. In recent years, NLP development has been accelerated by deep learning models, particularly large language models (LLMs) trained on massive datasets, which are better able to grasp nuances of tone, intent, and cultural reference in their responses.
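
As a concrete illustration of applying a pretrained deep learning model, the hedged sketch below runs sentiment analysis with the Hugging Face transformers pipeline; the default checkpoint it downloads is an implementation detail of the library, not a claim about any specific system discussed here:

    from transformers import pipeline

    # Downloads a small pretrained checkpoint on first use (network required).
    # The default model may change between library versions; a specific
    # checkpoint can be pinned via the `model=` argument.
    classifier = pipeline("sentiment-analysis")

    print(classifier("The new voice assistant understood me perfectly."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]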

Challenges with NLP include handling ambiguity, understanding low-resource languages, and mitigating biases embedded in training data. Ethical considerations are critical, as NLP systems can perpetuate stereotypes or misinformation if not carefully calibrated. Advances in transfer learning and fine-tuning have improved NLP's adaptability, allowing models to specialize in domains such as legal or medical text. NLP still struggles with emotional intelligence and contextual depth, as human language is complex and shaped by culture, history, and personal experience. Ongoing research aims to make NLP more inclusive, efficient, and capable of understanding the subtleties of human communication, potentially transforming how we interact with technology and each other in fields ranging from education to healthcare.
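
As a rough sketch of the fine-tuning idea mentioned above, the example below adapts a small pretrained transformer to a hypothetical two-class domain task using the Hugging Face Trainer (which also requires the datasets and accelerate packages); the tiny in-memory dataset, checkpoint choice, and hyperparameters are all assumptions chosen only to keep the example self-contained:

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # A tiny in-memory "domain" dataset, invented for illustration; real
    # fine-tuning needs orders of magnitude more labelled examples.
    data = Dataset.from_dict({
        "text": ["plaintiff filed a motion to dismiss",
                 "the court granted summary judgment",
                 "patient presented with acute chest pain",
                 "mri showed no abnormal findings"],
        "label": [0, 0, 1, 1],  # 0 = legal, 1 = medical (hypothetical labels)
    })

    checkpoint = "distilbert-base-uncased"  # a small general-purpose model
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # Tokenize the raw text into fixed-length model inputs.
    data = data.map(lambda b: tokenizer(b["text"], truncation=True,
                                        padding="max_length", max_length=32),
                    batched=True)

    # One short pass over the data updates the pretrained weights for the new task.
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="tmp_finetune", num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=data,
    )
    trainer.train()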

National Library of Medicine (NLM)'s Use of NLP

NLM created MetaMap, which has been integral to the Medical Text Indexer (MTI) used in automated indexing since 2002. MetaMap employs linguistic knowledge to identify Unified Medical Language System (UMLS) concepts in free text. It uses a minimal-commitment parser, lexicon, and part-of-speech tagger, all developed at NLM; it then retrieves candidate terms from the UMLS Metathesaurus and scores them with an evaluation function. MetaMap also includes a word-sense disambiguation facility, later enhanced with a statistical, context-sensitive method.
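
MetaMap itself is a large production system, but its retrieve-candidates-then-score pattern can be caricatured in a few lines of Python. The mini "metathesaurus" and the word-overlap score below are invented for this sketch and bear no relation to the real UMLS Metathesaurus or MetaMap's actual evaluation function:

    # Toy illustration of MetaMap-style concept mapping: retrieve candidate
    # concepts for a phrase, then keep those scoring above a threshold.
    TOY_METATHESAURUS = {
        "C0027051": "myocardial infarction",
        "C0004057": "aspirin",
        "C0020538": "high blood pressure",
    }

    def score(phrase: str, concept_name: str) -> float:
        """Fraction of concept-name words present in the phrase (toy metric)."""
        phrase_words = set(phrase.lower().split())
        name_words = set(concept_name.split())
        return len(phrase_words & name_words) / len(name_words)

    def map_concepts(phrase: str, threshold: float = 0.5):
        candidates = [(cui, name, score(phrase, name))
                      for cui, name in TOY_METATHESAURUS.items()]
        return [c for c in candidates if c[2] >= threshold]

    print(map_concepts("patient with acute myocardial infarction given aspirin"))
    # [('C0027051', 'myocardial infarction', 1.0), ('C0004057', 'aspirin', 1.0)]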

MetaMap underpins the Medical Text Indexer (MTI), which summarizes text using Medical Subject Headings (MeSH) terminology. MTI was used in production from 2002 for indexing MEDLINE citations and for cataloguing and History of Medicine records. Although superseded by MTIX in 2024, earlier MTI versions processed the titles and abstracts of PubMed records and recommended MeSH terms, which human experts then reviewed, revised, and approved. In February 2011, MTI became the first-line indexer (MTIFL) for a select number of journals on which it had historically performed well; for these journals, the MTIFL indexing is only reviewed and revised by an indexer. As of 2025, MEDLINE indexing is driven by the neural-network technology of MTIX.

Translational NLP in the biomedical domain (BioNLP) is a topic of investigation at NLM; it must draw on the vast body of biomedical knowledge and ontologies available to NLP, and it explores the potential of sublanguage theory for handling the very complex, verb-dominated language of biomolecular text. Another example is SemRep, a tool based on symbolic linguistic principles that extracts the predications needed for biomolecular text mining.


Disclaimer

  • Note: Please use your critical reading skills while reading entries. No warranties, implied or actual, are granted for any health or medical search or AI information obtained while using these pages. Check with your librarian for more contextual, accurate information.