Data wants to change its form to useful information!
The internet and the interconnectivity of devices generate data at a very high rate every day. However, this raw data is not useful by itself, because it does not give any insights directly. For this reason, data scientists try to convert data into useful information in order to analyze past situations and make future decisions. At a basic level, data can be classified into two types:
Structured data comprises clearly defined data types whose pattern makes it easy to understand and search.
Unstructured data comprises data that is not easily searchable, including formats like audio, images, videos and text.
A major challenge for any data scientist is converting unstructured data into structured data in order to get useful information from it. Here, we will discuss text data, text preprocessing steps, the different libraries available in Python and R for handling text data (word and sentence tokenization), extracting POS tags, and converting text data into word embeddings.
Why is text data mining required?
These days, we use applications like Facebook, Twitter, WhatsApp, Amazon, Gmail, tourism websites and several others to share feedback, reviews, etc., which collectively generate huge amounts of data every day. To build automation such as chatbots, email filtering, summarization and classification, we need to process unstructured text data into a structured format.
How much text data?
Today, text data is generated in very high volume because of better smartphones, laptops, etc., and faster internet connections. According to a GSMA Intelligence report, by the end of 2018 around 5.1 billion people in the world subscribed to mobile services, accounting for 67% of the global population. Since people use smartphones for texting, chatting, emailing, communicating with businesses and many other tasks, organizations depend heavily on text data to analyze customers and make decisions. Hence, they need text mining and Natural Language Processing (NLP) to make sense of this data. The amount of text data generated has grown rapidly in the last few years, and with more data we can often improve the accuracy of automated systems.
Text Pre-processing Steps:
Basic text analytics pre-processing steps include:
First, clean the data using regular expressions
Regular expressions (regexes) are commonly used for searching text, replacing text and removing unnecessary text from the original data. Refer to the link below to practice working with regular expressions: https://www.w3schools.com/python/python_regex.asp
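The cleaning step can be sketched with Python's built-in re module. This is a minimal illustration, not a complete cleaning routine: the function name clean_text, the sample review and the choice of what to strip (URLs, HTML tags, anything that is not a letter) are assumptions made for this example, and the right patterns depend on the data at hand.

```python
import re

def clean_text(text):
    """Illustrative cleaner: strip URLs, HTML tags and non-letter characters."""
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags such as <br>
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

# Hypothetical raw review, made up for this example
raw_review = "Great phone!!! <br> See https://example.com for 100% details :)"
cleaned = clean_text(raw_review)
print(cleaned)  # Great phone See for details
```

Note that this sketch also removes digits and emoticons, which may carry meaning in some problems (for example, ratings or sentiment), so the patterns should be adjusted to the problem statement.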
Tokenization
By default, we represent text in the form of String values. These values store a sequence of characters. The first step of an NLP pipeline is, therefore, to split the text into smaller units corresponding to the words of the language we are considering. In the context of NLP, we often refer to these units as tokens, and the process of extracting these units is called tokenization.
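As a rough sketch of sentence and word tokenization, the following uses only the built-in re module. The function names and the splitting rules are assumptions made for illustration; they are deliberately naive (for instance, the sentence splitter breaks on abbreviations such as "Dr."), and dedicated NLP libraries provide more robust tokenizers.

```python
import re

def simple_sentence_tokenize(text):
    # Split after ., ! or ? when followed by whitespace (naive rule)
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    # A token is a run of word characters (optionally with an internal
    # apostrophe, as in "doesn't") or a single punctuation character
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "The product arrived late. It doesn't work!"
sentences = simple_sentence_tokenize(text)
tokens = [simple_word_tokenize(s) for s in sentences]

print(sentences)  # ['The product arrived late.', "It doesn't work!"]
print(tokens)     # [['The', 'product', 'arrived', 'late', '.'], ['It', "doesn't", 'work', '!']]
```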
Converting text into lower case (Required based on the problem statement)
When we want to do text classification, we usually convert the tokens into lower case. However, when we need to identify the start of a sentence, proper nouns or the emphasis placed on a specific word, we also need to keep a copy of the original text. Example: a person writing the review “The product has bad quality” is saying something different from “The product has BAD QUALITY”, where the person is stressing the poor quality of the product.
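One way to lowercase tokens without losing the emphasis information is to keep the original tokens alongside the lowercased ones. The sketch below is an assumption about how this could be done; the helper names and the rule "an all-caps word longer than one letter counts as emphasized" are made up for illustration.

```python
def lowercase_tokens(tokens):
    # Return a new list so the original tokens stay available as a backup
    return [t.lower() for t in tokens]

def emphasized_tokens(tokens):
    # Treat all-caps words longer than one letter as emphasized
    return [t for t in tokens if len(t) > 1 and t.isupper()]

original_tokens = ["The", "product", "has", "BAD", "QUALITY"]
lowered = lowercase_tokens(original_tokens)
shouted = emphasized_tokens(original_tokens)

print(lowered)  # ['the', 'product', 'has', 'bad', 'quality']
print(shouted)  # ['BAD', 'QUALITY']
```

The list of emphasized words could then be used as an extra feature, while the lowercased tokens feed the classifier.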
Stop words removal (Not in all cases)
Stop words usually refer to the most common words in a language, which carry less meaning than keywords. By removing stop words, we can reduce the data size. Depending on the problem statement, we have to decide whether removing stop words helps or hurts.
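Stop word removal can be sketched as a simple filter over the tokens. The stop word set below is a tiny hand-written list assumed for this example; NLP libraries ship much longer lists. The second call shows why the decision depends on the problem: treating "not" as a stop word removes the negation from a review.

```python
# Tiny illustrative stop word set (an assumption, not a standard list)
STOP_WORDS = {"the", "a", "an", "is", "has", "of", "and", "to", "in", "it", "this"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lower case so "The" and "the" are both removed
    return [t for t in tokens if t.lower() not in stop_words]

filtered = remove_stop_words(["The", "product", "has", "bad", "quality"])
print(filtered)  # ['product', 'bad', 'quality']

review = ["this", "is", "not", "good"]
kept_negation = remove_stop_words(review)
lost_negation = remove_stop_words(review, STOP_WORDS | {"not"})
print(kept_negation)  # ['not', 'good']
print(lost_negation)  # ['good']
```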
Text normalization (Stemming and Lemmatization)
Text normalization is the process of transforming text to make it consistent in a way that it might not have been before. Stemming and lemmatization are text normalization techniques (sometimes called word normalization) in the field of Natural Language Processing that are used to prepare text, words and documents for further processing. Both aim to convert word tokens into their root form, but they follow different approaches: stemming strips affixes using rules, so the result is not always a real word, whereas lemmatization maps each word to its dictionary form (lemma). For example, if we lemmatize words like “play”, “playing” and “played”, all three are converted to the root form “play”.
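The difference between the two approaches can be illustrated with a toy sketch. Both functions below are deliberately simplified assumptions written for this example: the stemmer only strips three suffixes and the lemmatizer is a small hand-written lookup table. Real implementations, such as the Porter stemmer or a WordNet-based lemmatizer available in NLTK, are far more complete.

```python
SUFFIXES = ("ing", "ed", "s")  # toy suffix list, checked in this order

def naive_stem(word):
    """Rule-based: chop a known suffix if at least 3 characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy dictionary standing in for a real lexical resource
LEMMA_LOOKUP = {"playing": "play", "played": "play", "went": "go", "studies": "study"}

def naive_lemmatize(word):
    """Dictionary-based: map a word to its lemma, or leave it unchanged."""
    word = word.lower()
    return LEMMA_LOOKUP.get(word, word)

words = ["play", "playing", "played"]
stems = [naive_stem(w) for w in words]
lemmas = [naive_lemmatize(w) for w in words]
print(stems)   # ['play', 'play', 'play']
print(lemmas)  # ['play', 'play', 'play']

# Where the two approaches differ
print(naive_stem("studies"), naive_lemmatize("studies"))  # studie study
print(naive_stem("went"), naive_lemmatize("went"))        # went go
```

The last two lines show the typical trade-off: the stemmer is cheap but can produce non-words such as "studie" and cannot handle irregular forms, while the lemmatizer needs a dictionary but returns valid words.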
Feature Extraction – Extraction of POS (Parts of Speech) tags (if needed)
Take, for example, the word “but”. As a conjunction, it signals that the words that follow oppose the words that came before. As a preposition, on the other hand, it has the same meaning as “except”. It would therefore be useful for our deep learning (DL) system to take these labels (such as “conjunction” or “preposition”) into account for each string, so that it can differentiate between them and perform better. This is precisely what POS tagging and other linguistically based preprocessing tasks do. “POS” stands for “part of speech”, which is how linguists refer to the syntactic behaviour of words, and which you may know as grammatical categories. POS tags are very useful in linguistics and are widely used to distinguish one use of a word from another.
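To show how POS tags can separate the two uses of "but", the sketch below builds simple token_TAG features. It assumes the tags have already been produced by a POS tagger; here the (token, tag) pairs are written by hand for illustration, using labels in the style of the Universal POS tag set (CCONJ for coordinating conjunction, ADP for adposition). The function name pos_features is hypothetical.

```python
from collections import Counter

# Hand-assigned tags for illustration; in practice a tagger would supply them
tagged_sentence_1 = [("I", "PRON"), ("tried", "VERB"), ("but", "CCONJ"),
                     ("it", "PRON"), ("failed", "VERB")]
tagged_sentence_2 = [("Everyone", "PRON"), ("but", "ADP"),
                     ("me", "PRON"), ("left", "VERB")]

def pos_features(tagged_tokens):
    # Join each lowercased token with its tag so the same string with a
    # different grammatical role becomes a different feature
    return Counter(f"{token.lower()}_{tag}" for token, tag in tagged_tokens)

features_1 = pos_features(tagged_sentence_1)
features_2 = pos_features(tagged_sentence_2)

print(features_1["but_CCONJ"], features_1["but_ADP"])  # 1 0
print(features_2["but_CCONJ"], features_2["but_ADP"])  # 0 1
```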
Making data ready for building a model
Create a TF-IDF data frame or word embeddings in order to convert the text data into numerical data that a model can consume. Once the required preprocessing steps are done and the data is ready for model building, we train a model on it and then make predictions.
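The TF-IDF idea can be sketched in plain Python. This is a minimal version under stated assumptions: term frequency is the count divided by the document length, and inverse document frequency is log(N / df) with no smoothing. Libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ from the ones produced here. The function name and the sample documents are made up for this example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already tokenized and lowercased reviews
docs = [
    ["good", "product", "good", "price"],
    ["bad", "product"],
    ["product", "quality", "good"],
]
weights = tf_idf(docs)

# "product" occurs in every document, so its weight is zero everywhere
print(weights[0]["product"])  # 0.0
# "price" occurs in only one document, so it outweighs the more common "good"
print(weights[0]["price"] > weights[0]["good"])  # True
```

Each dictionary can be turned into a row of a numeric table (one column per term), which is the form most models expect as input.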
TEXT CLASSIFICATION EXAMPLE PROBLEM (Reviews data):
Both R and Python are open-source languages with large community support, and each has its own advantages and disadvantages. So, the choice between R and Python mainly depends on your programming preferences and experience. You can use either R or Python based on your requirements and problem statement.
Note that multiple libraries are available in both R and Python. The ‘spacyr’ library in R loads its environment from Python, but if you are familiar with R you can still use it from R.
If you are in a dilemma and have no idea which one to choose, my suggestion is to prefer Python for text processing. This does not mean that R is not good to use; you can use R as well. This suggestion is based only on the libraries explored here. There may be other libraries in both R and Python that are preferable.
Any questions, feedback, suggestions for improvement are most welcome. 🙂