Text Processing and Natural Language Processing: What is the relation between these two terms? Are the two mutually exclusive? Is one a subset of the other? Are they different techniques for solving the same problem? In this post, I share my thoughts on how to think about these two terms. I would love to hear your thoughts too. Do leave a note in the comments section.
Before we delve deeper, a bit of context: my thoughts are shaped by my academic and industrial research work in Artificial Intelligence / Machine Learning (AI/ML), first with speech data (Automatic Speech Recognition as part of my Ph.D. thesis, then Automatic Speech Recognition and Intent Mining on call-center speech data at IBM Research), then with text data (Intent Mining and Flow Analysis of email and chat data in the call-center context at IBM Research), then with video data (Topic Identification, Automatic Segmentation and Summarization, and Ontology Creation for educational videos at Xerox Research), and now in my ongoing work at Envestnet | Yodlee on enriching the text data corresponding to financial transactions (some examples of financial transactions are: ATM withdrawals, online money transfers, and debit/credit-card purchases at a merchant outlet or on an e-portal).
Note that Text Processing is often also referred to as Text Mining or Text Analytics. Throughout this blog, Natural Language Processing refers to processing textual information that was either generated in textual form (like an email) or derived from another modality (like a speech signal transcribed into text).
Just about every blog on this topic seems to agree that Text Processing is the superset that includes Natural Language Processing, but the distinction between the two seems to be based either on the data-science techniques used or on the intended end goal. For example, goals such as Part-of-Speech (PoS) Tagging and Named Entity Recognition (NER) are associated with NLP, whereas shallow analysis like surface-level pattern mining or bigram/trigram analysis is associated with Text Processing.
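As a toy illustration of the "shallow analysis" end of this spectrum (this sketch is mine, not from the original distinction), bigram counting looks only at adjacent surface-level word pairs, with no notion of syntax or meaning:

```python
from collections import Counter

def bigrams(text):
    """Count surface-level word pairs -- a shallow Text Processing step."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

counts = bigrams("I am Om and I am happy to be here")
# ('i', 'am') occurs twice; every other adjacent pair occurs once.
```

Nothing here "understands" that Om is a person; contrast that with NER, whose goal is precisely to label such spans.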
Start with a fun activity
I have a totally different take on how the two terms should be thought about. But before we discuss that, I would like you to do a small activity that won’t take more than a minute: in the link below, type in how you would introduce yourself in a formal setting (for example, imagine you have to introduce yourself to a group of new acquaintances at a professional conference, or at a recruiter’s group-discussion session). Once you complete the activity, you will also receive an AI/ML-powered ‘formal communication score’ for your input, so do try it out for the fun factor too! What aspects of text processing and/or natural language processing do you think were used to generate the score? The answer should become clear as you read through the blog; I have also provided some details towards the end.
Now that you have typed your brief formal introduction, let us reflect on it a little. I am guessing you had a very clear intent in mind: to provide a glimpse of your professional background, highlighting a few of your unique skills and achievements. The automatic score generated by the system is a way of quantifying the ‘informativeness’ as well as the ‘expressiveness’ of your introduction. Now, imagine you had to introduce yourself to a similar group at another recruiting event or group discussion. I bet there is a very tiny chance you would use the exact same set of words, or even the exact same sentence structure, even when the intent and the information to be conveyed remain exactly the same. For example, on the first day I may introduce myself as “I am Om Deshmukh”, and on the second day I may start off by saying “Hello everybody, Om here”. There is an even slimmer chance that every member of the group introduces themselves using the same sentence structure. In fact, if the same construct is used by all (or even a majority), we may call the exchange ‘mechanical’ or ‘lacking expressiveness’. For such data, Named Entity Recognition (NER) seems like one of the reasonable processing goals. NER addresses questions like: who is the person, and which company (or companies) did/does s/he work in?
The social network of computers
Let us look at a slightly different example now: a cluster of computing servers communicating with each other while accessing different databases or other services hosted by them. This communication is typically captured in server logs. Each entry in the log has a very clear intent: to alert when a particular database is not accessible, to report how frequently a particular service is accessed, and so on. The difference from the previous “introduction” example is that the surface manifestation of a log message (i.e., the text) and its intent have a 1-to-1 mapping, whereas the surface manifestation of an “introduction” message and its intent typically have a many-to-1 mapping. For instance, the log messages may indicate a failure in the network or the server through the phrase “ping response failed” alone. Contrast that with messages like “I am Om”, “Om here”, or “My name is Om”, all of which convey the exact same intent in the “introduction” example. However, this is not the only difference between the two examples.
The mapping between the surface manifestation and the underlying intent in the case of a server log is typically a simple set of rules that a human expert (in this case, a system administrator) has formulated. These rules will not undergo any gradual drift; they can, however, change quite drastically. For example, the system administrator may decide to change the naming convention of all the servers, or s/he may realize that 350-character-long log messages are driving up storage costs and so cap the messages at 100 characters. These changes will almost always be overnight and drastic. Also, the system administrator does not have to build any general ‘acceptance’ for these new changes across the computers or other humans before the new messages start flooding the logs. In contrast, I, as an individual, cannot unilaterally and abruptly decide to convey the “hello” intent by saying “bye, you how are” or some such outlandish construct. Acceptance of “hey” or “hi” as a synonym for “hello” happened gradually, and only after a critical mass of speakers started using it. Mathematically speaking, changes to the server logs can be modeled by step functions, whereas changes to the “introduction” message can be modeled by continuous (smooth) functions.
And yet, NER seems like a reasonable goal on the server-log data too. For example, NER on these logs can identify the specific server that is sending an unusual number of requests, or the specific database that is not responding to pings.
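To make the rule-based flavor of this "NER on logs" concrete, here is a minimal sketch; the log format, field order, and server names are invented for illustration, and a real system administrator's rules would of course differ:

```python
import re

# Hypothetical log format: "<timestamp> <server> <message>"
PING_FAILURE = re.compile(r"^\S+\s+(?P<server>\S+)\s+ping response failed$")

def failing_servers(log_lines):
    """Extract the entity of interest (the failing server) via a fixed rule."""
    failed = set()
    for line in log_lines:
        match = PING_FAILURE.match(line)
        if match:
            failed.add(match.group("server"))
    return failed

logs = [
    "2021-01-01T10:00:00 db-east-1 ping response failed",
    "2021-01-01T10:00:05 web-02 request served in 12ms",
]
```

Because the text-to-intent mapping is 1-to-1, this single pattern is exhaustive; when the administrator changes the convention, the pattern must be rewritten wholesale, which is exactly the step-function behavior described above.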
As computers get more expressive…
One may argue that analyzing the server logs using regular expressions alone is good enough because the structure is rigid, and that as the patterns change, the set of regular expressions simply needs to be revisited. This seems like a simple yet elegant solution until we factor in the significant shifts in how such data is generated these days. Given the recent explosive growth in big-data infrastructure and the phenomenal ease with which different computing environments interact with each other, individual logs are getting combined across multiple environments into more complex logs, which in turn need to be consumed by a much wider set of end users and for a much wider set of use cases. This combination leads to dramatic growth in the ‘expressiveness’ of this ‘artificial’ machine data, which can no longer be adequately captured by regular expressions. At the same time, “traditional NER” algorithms cannot be applied as-is to this data, because of the abrupt and instant changes it may witness.
These examples should convince you that letting the end goal alone decide the type of data processing is a suboptimal strategy. Getting boxed into an existing keyword-analysis-based method alone, or an existing long-term-dependency-mining method alone, will also not lead to optimal outcomes.
Let the input data guide you
Instead, I believe that the nature of the input data sources should determine the expected degree of expressiveness of the observed text, and hence whether it is a natural-language (or natural-language-like) phenomenon, a limited set of rigid rule-based text generation, or a combination of the two. Knowing this nature opens up a wide(r) array of business-relevant problems that are worth solving, and also makes a wide(r) set of ML techniques applicable to the data. On the other hand, starting with the end goal or a particular ML technique will restrict the efficiency of the final system. More importantly, it will inherently impose structural expectations on the data, which may or may not hold in reality.
Going back to the little exercise you did at the beginning of the blog: given that the context in which the input was asked for is quite bounded, certain surface-level constructs are much more likely than others. Hence, simple bigram/trigram techniques will work. But certain constructs can be ambiguous too. For example, in “I am Om” vs. “I am tired”, there is an extremely low chance that someone is named “Tired” but a much higher chance that the word indicates the state the person is in. Hence, a deeper understanding of the syntax and semantics of the construct, which comes from knowing your data/domain, is just as valuable. Our underlying scoring mechanism uses a combination of both these techniques.
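One highly simplified way such disambiguation might look is sketched below; the gazetteer of first names and the rules are invented for illustration, and the actual scoring mechanism behind the exercise is not disclosed in this post:

```python
# Hypothetical gazetteer of first names (a real system would use a much
# larger list, and PoS/semantic features rather than a lookup alone).
KNOWN_NAMES = {"om", "alice", "bob"}

def classify_i_am(sentence):
    """Disambiguate 'I am X': is X a person name (NER hit) or a state?"""
    tokens = sentence.lower().rstrip(".!").split()
    if len(tokens) >= 3 and tokens[:2] == ["i", "am"]:
        return "name" if tokens[2] in KNOWN_NAMES else "state"
    return "other"
```

The bigram "I am" alone cannot separate the two readings; it is the knowledge about the third token (here, a crude name list standing in for deeper syntax/semantics) that resolves the ambiguity.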
Why should you care?
But you may ask: why is all this relevant in today’s context? Today we still have a clear distinction between human-generated and machine-generated language (again, restricting ourselves to text), but every indicator points to a future where machines will get more and more expressive (admittedly, to begin with, this change will be more pronounced in some business domains than in others). Hence, the set of techniques that was traditionally associated with a domain may not necessarily continue to be the optimal one. Understanding the nature of the data will help identify the optimal machine-learning machinery quickly.
In summary, whether you call a particular situation a text processing problem or a natural language processing problem should be influenced by the nature of the input data rather than the particular end goal or the particular machine learning technique.