They spend roughly $10 billion per year on machinery-related parts. Past purchases hold a lot of knowledge that can help them minimize duplicate orders and understand purchase patterns to plan procurement better.
They had a standard machine learning model (a term-document matrix with a classifier on top). The model was not very accurate (less than 75% on validation) and was extremely heavy to train (it took 40+ hours even for logistic regression).
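To make the baseline concrete, here is a minimal sketch of the term-document matrix step only, written in plain Python. The part descriptions and the function name `build_term_document_matrix` are made up for illustration and are not the client's data or code. A classifier such as logistic regression would then be trained on the rows of this matrix.

```python
from collections import Counter


def build_term_document_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    # Naive whitespace tokenization; a real pipeline would likely do more cleaning.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    column_of = {term: j for j, term in enumerate(vocabulary)}

    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(tokens).items():
            row[column_of[term]] = count
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical part descriptions, purely illustrative.
docs = [
    "hex bolt steel m8",
    "hex nut steel m8",
    "ball bearing steel 20mm",
]
vocabulary, matrix = build_term_document_matrix(docs)

for doc, row in zip(docs, matrix):
    print(doc, "->", row)
```

Every distinct token becomes a column, so with hundreds of thousands of free-text records the matrix grows very wide and sparse. That may be part of why such a baseline gets slow to train, although the original write-up does not say so explicitly.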
What does Data Science Solve?
The client wanted to convert the unstructured, free-flowing descriptions into well-structured, categorized information: extract classes, subclasses and attributes from the free-flowing text.
Extracting the attributes is, overall, a standard entity extraction problem. However, just assigning a text to a class can itself be treated as a classification problem with a large number of classes (1000+).
The client spent an enormous amount of time tagging the data and gave us more than a quarter-million records with the corresponding classes!
What did we do?
This is one of the best data problems we have ever received. Again, profuse thanks to the client.
Jeremy Howard's deep learning model (ULMFiT) solved the problem at the very first attempt. In less than 40 minutes of run time, we were able to get 91.5% accuracy.
We then studied the errors deeply. We eyeballed them and plotted them to understand where we were going right and where we were failing. The error analysis taught us some really important pre-processing steps. It also showed us how to handle uncertain cases and alert the user in production. With our learnings (relatively simple once we consolidated them), we were able to move up to 96% accuracy with almost no additional processing time.
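The kind of error analysis described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the code we used: the labels are made up, and `summarize_errors`, `needs_review` and the 0.6 threshold are hypothetical names and values. The confidence-based alert in particular is just one plausible way to warn a user in production.

```python
from collections import Counter


def summarize_errors(y_true, y_pred, top_n=3):
    """Return per-class accuracy and the most frequent (true, predicted) confusions."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    confusions = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    per_class_accuracy = {label: correct[label] / totals[label] for label in totals}
    return per_class_accuracy, confusions.most_common(top_n)


def needs_review(class_probabilities, threshold=0.6):
    """Assumed alert rule: flag a prediction when its top probability is below threshold."""
    return max(class_probabilities.values()) < threshold


# Hypothetical validation labels and predictions, purely illustrative.
y_true = ["bearing", "bearing", "bolt", "bolt", "nut", "nut"]
y_pred = ["bearing", "bolt", "bolt", "bolt", "bolt", "nut"]

per_class_accuracy, top_confusions = summarize_errors(y_true, y_pred)
print(per_class_accuracy)
print(top_confusions)
print(needs_review({"bolt": 0.55, "nut": 0.45}))
```

Sorting classes by accuracy and reading the records behind the top confusion pairs is usually where pre-processing ideas come from, for example two classes that differ only in a unit or an abbreviation.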
Moral of the story
Actually, there are two!
Knowing the state-of-the-art solution for a problem class is important.
I see this issue with many practitioners. They are 5 to 10 years behind the times, and in data science that can mean a huge penalty. Pick your favorite blogs and follow them for the latest research and innovations. Your initial architecture need not be 5 years old.
Often I see young data scientists jump into enormous hyperparameter tuning exercises as soon as the first model is built. It perhaps pays to travel back 40 years in your imagination and assume that you need to wait 3 days to resubmit your punch cards to the computer guy for processing your updated code. You can then use those 3 days to study what went wrong with the current model by squeezing information out of every type of error it made.
Alas, near-zero compute cost and easily editable libraries with hundreds of options often drive data scientists down the path of experimentation without a plan.