A brief introduction on data processing techniques in NLP

Nurul Huda
Apr 28, 2021

With the advancement of AI, many challenging tasks can be accomplished with the help of technology in a short span of time. People who speak different languages can text each other using translation apps like Google Translate. There are many apps that can recognize human speech and perform particular tasks; examples include Cortana, Siri, etc.

We might have come across terms such as speech recognition, natural language understanding, and natural language generation. All these terms belong to a field of AI called NLP. NLP stands for Natural Language Processing, a field that deals with the interaction between computers and human language so that computers are able to process and analyze large amounts of natural language data. This interaction results in a computer capable of understanding human language. Its ultimate goal is to read, decipher, understand, and make sense of human languages in a manner that is valuable.

Human and machine interaction using NLP can involve the following steps (a rough code sketch of this loop follows the list)-

  • Human talks to the machine
  • Machine records the audio
  • Conversion of audio into text
  • Text data obtained is processed using NLP techniques
  • Processed text data is converted into audio
  • Machine responds by playing the audio obtained in the above step
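To make this loop concrete, here is a rough sketch of such an interaction pipeline in Python. Every function name below is a hypothetical placeholder, not a real API; an actual assistant would plug a speech-recognition library, an NLP pipeline, and a text-to-speech engine into the corresponding steps.

    # Hypothetical sketch of the interaction loop above; each function is a
    # placeholder for a real speech-recognition, NLP, or text-to-speech component.

    def record_audio() -> bytes:
        """Step 2: capture the user's speech from a microphone."""
        raise NotImplementedError

    def speech_to_text(audio: bytes) -> str:
        """Step 3: convert the recorded audio into raw text."""
        raise NotImplementedError

    def process_text(text: str) -> str:
        """Step 4: apply NLP techniques and produce a text response."""
        raise NotImplementedError

    def text_to_speech(response: str) -> bytes:
        """Step 5: convert the response text back into audio."""
        raise NotImplementedError

    def play_audio(audio: bytes) -> None:
        """Step 6: play the generated audio back to the user."""
        raise NotImplementedError

    def interaction_loop() -> None:
        audio = record_audio()                  # record the user
        text = speech_to_text(audio)            # audio -> text
        response = process_text(text)           # NLP on the text
        play_audio(text_to_speech(response))    # respond with audio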

Now, let’s discuss how NLP works. It uses algorithms to identify and extract natural language rules so that unstructured language data (such as audio) is converted into a form that computers can understand. It then applies algorithms to extract the meaning associated with every sentence and collect the essential data from it.

Some of the techniques used to process data in NLP are given below-

  • Tokenization:

It is the process of segmenting text into sentences and words. Punctuation is often removed at this stage, and each segmented part of the text is called a token. For example, if you apply tokenization to the following text,

“My pet’s name is Ruby.”

the tokenized form of the above sentence would be

My | pet | ’s | name | is | Ruby

Tokenization can be problematic in domains such as biomedical text, which contain lots of hyphens, parentheses, and other punctuation marks.
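As an illustration, here is a minimal tokenization sketch using NLTK (just one library choice among many). Note that NLTK’s default word tokenizer keeps punctuation as tokens and splits the contraction “pet’s” into “pet” and “’s” rather than dropping it; it also expects the “punkt” tokenizer data to be downloaded first.

    # Minimal tokenization sketch using NLTK.
    # Assumes nltk is installed and the "punkt" tokenizer data has been
    # downloaded, e.g. via nltk.download("punkt").
    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "My pet's name is Ruby."

    print(sent_tokenize(text))  # ["My pet's name is Ruby."]
    print(word_tokenize(text))  # ['My', 'pet', "'s", 'name', 'is', 'Ruby', '.']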

  • Stop Words Removal:

It involves removing very common words that carry little information, such as pronouns, articles, prepositions, and conjunctions like “and”, “the”, “a”, “to”, “my”, etc. But removing these stop words may sometimes discard a chunk of significant information from our data.
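As a small illustration, stop-word removal with NLTK’s built-in English stop-word list (assuming the “stopwords” corpus has been downloaded) could look like this:

    # Minimal stop-word removal sketch using NLTK's English stop-word list.
    # Assumes the corpus has been downloaded via nltk.download("stopwords").
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))

    tokens = "my pet is small and she likes to play in the garden".split()
    filtered = [t for t in tokens if t not in stop_words]

    print(filtered)  # ['pet', 'small', 'likes', 'play', 'garden']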

  • Bag of Words:

Here, each word in the text is counted to get its frequency, and these word frequencies are then used as features for training a classifier. The drawbacks are that semantic meaning and context are lost, that frequent stop words add noise to the analysis, and that words are not weighted by how informative they are. To address this, we may use TF-IDF, which improves on the bag of words by penalizing the weights of the most frequent words.
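For illustration, here is a minimal sketch of both ideas using scikit-learn: CountVectorizer builds the raw bag-of-words counts, and TfidfVectorizer applies the TF-IDF weighting that down-weights words appearing in many documents. The three-document corpus is made up for the example.

    # Bag-of-words counts vs. TF-IDF weights with scikit-learn (toy corpus).
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = [
        "my pet ruby likes to play",
        "my pet ruby likes to sleep",
        "stock prices fell sharply today",
    ]

    # Raw term counts (bag of words)
    bow = CountVectorizer()
    counts = bow.fit_transform(docs)
    print(bow.get_feature_names_out())
    print(counts.toarray())

    # TF-IDF re-weights the same counts, penalizing very common terms
    tfidf = TfidfVectorizer()
    weights = tfidf.fit_transform(docs)
    print(weights.toarray().round(2))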

  • Stemming:

It is the process of slicing off the end or the beginning of words with the intention of removing affixes. For example, if we apply stemming to the words “News” and “Newer”, we may get “New” as the result. Because stemming applies simple mechanical rules, the resulting stems are not always valid words, but it is simple to use and runs very fast.
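A minimal sketch with NLTK’s Porter stemmer shows the rule-based suffix stripping in action; because the rules are purely mechanical, some stems (such as “studi”) are not real words.

    # Rule-based stemming with NLTK's Porter stemmer.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["playing", "played", "plays", "studies", "news"]:
        print(word, "->", stemmer.stem(word))
    # e.g. "playing" -> "play", "studies" -> "studi", "news" -> "new"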

  • Lemmatization:

It is the process of reducing a word to its base form and grouping together different forms of the same word. For example, the words “going”, “went” and “gone” can all be reduced to their base form “go”. In short, we can say that it groups words with similar meaning under their root. It takes the context of the word into consideration in order to solve problems like disambiguation, meaning it can discriminate between identical words that have different meanings depending on the specific context.
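Here is a minimal sketch using NLTK’s WordNet lemmatizer; it needs the “wordnet” corpus downloaded, and it works best when told the part of speech (here “v” for verb).

    # Lemmatization with NLTK's WordNet lemmatizer.
    # Assumes the corpus has been downloaded via nltk.download("wordnet").
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    for word in ["going", "went", "gone"]:
        # pos="v" tells the lemmatizer these are verbs; all three map to "go"
        print(word, "->", lemmatizer.lemmatize(word, pos="v"))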

  • Topic Modeling:

It is the process of learning, recognizing, and extracting topics across a collection of documents. It uncovers hidden structures in sets of texts or documents: it clusters texts to discover latent topics based on their contents, processing individual words and assigning them values based on their distribution.
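As a small illustration, here is a sketch of topic modeling with Latent Dirichlet Allocation (LDA) in scikit-learn, fitted on a made-up four-document corpus; real use would involve far more documents and a tuned number of topics.

    # Topic modeling with LDA from scikit-learn (toy corpus).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the cat sat on the mat with another cat",
        "dogs and cats make friendly pets",
        "stock markets fell as investors sold shares",
        "the bank raised interest rates for investors",
    ]

    # Turn the documents into word counts, dropping English stop words
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    # Fit a 2-topic LDA model and inspect the top words per latent topic
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(X)

    terms = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-5:][::-1]]
        print(f"Topic {idx}: {top}")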

Following are some applications of NLP-

  • Question Answering,
  • Spam Detection,
  • Sentiment Analysis,
  • Machine Translation,
  • Spelling correction,
  • Speech Recognition,
  • Chatbot, etc.

In fact, NLP is one of the AI technologies that helps us solve many challenging problems. It enables computers to communicate with humans in human languages. This field of AI is developing at a very fast pace; in future human space exploration programs, there may be AI systems that talk with astronauts in human language using NLP and perform the tasks assigned to them.

