AI Healers: Unleashing the Future of Medicine
October 8, 2023

Natural Language Processing

Natural language processing (NLP) is the field within data science concerned with using computational techniques to understand and generate human language, whether written or spoken, and to interrogate free text.

NLP aims to create structure from unstructured data, such as the clinical free text produced every day. The resulting structured data can then serve as a substrate for training machine learning models. In that sense, NLP is a collection of tasks that bridge the gap between computer and human communication.
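To make this concrete, here is a minimal sketch (not taken from any of the cited systems) that turns one invented clinical sentence into a small structured record using regular expressions; the note text, patterns, and field names are purely illustrative.

```python
import re

# Hypothetical clinical sentence (illustrative only).
note = "Patient started on metformin 500 mg twice daily for type 2 diabetes."

# Toy patterns; a real clinical NLP system would rely on far richer
# dictionaries, grammars, or trained models.
drug_match = re.search(r"\b(metformin|insulin|warfarin)\b", note, re.IGNORECASE)
dose_match = re.search(r"\b(\d+)\s*(mg|mcg|units)\b", note, re.IGNORECASE)

structured = {
    "drug": drug_match.group(1).lower() if drug_match else None,
    "dose": f"{dose_match.group(1)} {dose_match.group(2).lower()}" if dose_match else None,
}

print(structured)  # {'drug': 'metformin', 'dose': '500 mg'}
```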

 

Background

NLP began in the 1950s at the intersection of artificial intelligence and linguistics, distinct from text information retrieval, a highly scalable statistics-based technique for indexing and searching large volumes of text efficiently. NLP started to take shape with the introduction of the Turing test as a criterion of machine intelligence and with Chomsky's theories of syntactic structure and universal grammar. (1)

There are three stages in the development of NLP:

  1. Symbolic NLP: Systems built on handwritten rules.
  2. Statistical NLP: Statistical and machine learning algorithms trained on corpora replace handwritten rules.
  3. Neural NLP: Representation learning and deep neural network-style machine learning methods become widespread in NLP.

 

Tasks in NLP

By utilizing NLP, developers can organize and structure knowledge to perform tasks. Among the most important are (1) (a short code sketch follows the list):

Tokenization: Splitting text into tokens. A token is typically a word, but punctuation symbols such as periods, commas, and dashes are also tokens.

Sentence Splitting: Sentences are more challenging to define, but punctuation symbols typically delimit them.

Part-of-Speech Tagging (POS): Determining the part of speech for each token in a text.

Named Entity Recognition (NER): Detecting named entities such as persons, locations, or dates in free text and classifying them into their appropriate categories.

Relation Extraction: Related to NER; mapping the relations between named entities.
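To see several of these tasks in action, here is a brief sketch using the open-source spaCy library (an illustrative choice, not one referenced above); it assumes spaCy and its small English model en_core_web_sm are installed, and the exact entities returned depend on the model version.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Dr. Smith reviewed the chest CT in Boston on March 3, 2023. No embolism was seen."
doc = nlp(text)

# Sentence splitting
for sent in doc.sents:
    print("SENTENCE:", sent.text)

# Tokenization and part-of-speech tagging
for token in doc:
    print(f"{token.text}\t{token.pos_}")

# Named entity recognition (labels depend on the model)
for ent in doc.ents:
    print(ent.text, ent.label_)
```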

 



Design of an NLP System

NLP systems contain two main components: background knowledge and a framework. (2)

Background knowledge: Domain and clinical knowledge are key elements in building clinical NLP systems. The Unified Medical Language System (UMLS), initiated in 1986, is the most widely used knowledge resource in clinical NLP. It contains vocabularies of biomedical concepts and provides mappings across them. It has three main components: the Metathesaurus (over one million biomedical concepts and five million concept names drawn from more than 150 controlled vocabularies, such as MeSH and SNOMED CT) (2); the Semantic Network (a categorization of all concepts represented in the Metathesaurus, which reduces its complexity); and the SPECIALIST Lexicon, which contains syntactic, morphological, and spelling information for biomedical terms.
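To make the idea of a knowledge resource concrete, here is a minimal sketch (not from the cited sources) of dictionary-style concept mapping against a tiny hand-made table in the spirit of the UMLS Metathesaurus; the concept identifiers are placeholders rather than verified CUIs, and a real system would query the full UMLS.

```python
# Tiny, hand-made concept table in the spirit of the UMLS Metathesaurus.
# Identifiers are illustrative placeholders, not verified UMLS CUIs.
CONCEPTS = {
    "myocardial infarction": ("C0000001", "Disease or Syndrome"),
    "heart attack":          ("C0000001", "Disease or Syndrome"),  # synonym, same concept
    "aspirin":               ("C0000002", "Pharmacologic Substance"),
}

def map_concepts(text: str):
    """Return (surface form, concept id, semantic type) for each known term found."""
    lowered = text.lower()
    return [(term, cui, sem_type)
            for term, (cui, sem_type) in CONCEPTS.items()
            if term in lowered]

note = "Patient with prior heart attack, currently on aspirin."
for term, cui, sem_type in map_concepts(note):
    print(f"{term} -> {cui} ({sem_type})")
```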

 

Framework: There are two main approaches for building NLP tools. The first is rule-based and relies mainly on dictionary lookup and handcrafted rules. The second is a machine learning approach that uses annotated corpora to train learning algorithms. Most contemporary NLP systems are hybrid, built from a combination of rule-based and machine learning methods.
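As a hedged sketch of the machine learning approach (not part of the cited works), the snippet below trains a toy text classifier on a handful of invented "annotated" sentences using scikit-learn; a real clinical system would require a large, expert-annotated corpus and careful validation.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy annotated corpus (invented for illustration): each sentence is labeled
# with whether it asserts or denies a finding.
texts = [
    "No evidence of pulmonary embolism.",
    "Pulmonary embolism is present in the right lower lobe.",
    "There is no acute intracranial hemorrhage.",
    "Findings consistent with acute appendicitis.",
]
labels = ["negated", "affirmed", "negated", "affirmed"]

# Bag-of-words features plus logistic regression: a simple statistical pipeline.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["No pulmonary embolism identified."]))
```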

The framework can be built into the NLP system itself, or the system can adopt an available general-purpose architecture. The two most widely used are GATE and UIMA. The former is written in Java, was originally developed in 1995, and is commonly used in the NLP community; the latter is written in Java/C++, was developed by IBM, and is now part of the Apache Software Foundation's software.

 


Perspectives

Structured responses do not necessarily increase data validity. (3) Choosing answers from a structured list of options may even worsen validity if the options do not allow clinicians to summarize or include contextual information. Moreover, the volume and velocity of the information make thorough and timely human review impossible. NLP has demonstrated potential utility across healthcare data: clinical incident reporting, detection of thrombosis in radiology reports, malignancy diagnoses, prevention of adverse drug events, identification of hypoglycemic episodes, healthcare-associated urinary tract infections, cancer recurrence, and more. (3) It is time to rely on automated data processing to improve clinical outcomes and support clinical decisions.
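To illustrate one of these use cases, here is a minimal rule-based sketch (illustrative only, not a validated clinical tool and not drawn from the cited studies) that screens radiology report text for thrombosis mentions while skipping simple negations.

```python
import re

# Illustrative keyword and negation cues; a deployed system would need a
# validated terminology, richer negation handling, and clinical evaluation.
FINDING = re.compile(r"\b(thrombus|thrombosis|embolism)\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(no|without|negative for|denies)\b", re.IGNORECASE)

def flag_report(report: str) -> bool:
    """Return True if any sentence mentions a finding without a negation cue."""
    for sentence in re.split(r"[.!?]", report):
        if FINDING.search(sentence) and not NEGATION.search(sentence):
            return True
    return False

print(flag_report("No evidence of pulmonary embolism."))           # False
print(flag_report("Acute thrombosis of the left popliteal vein."))  # True
```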
