According to François Chollet, the creator of Keras and someone who, at his best, placed 17th on Kaggle’s cumulative leaderboard, it is not having the best idea(s) from the start, but iterating over ideas often and quickly, that allows people to win competitions and publish papers.
It is therefore important to optimize your workflow and tools so that you can experiment rapidly and iterate over ideas faster.
The goal of this series of blog posts is to explore the coding patterns and design principles for speeding up the machine learning workflow and enabling rapid experimentation.
There are several key components in a regular machine learning workflow; some of them are listed below:
Exploratory Data Analysis
Feature Transformation / Engineering
Model Tuning / Ensembling
In this first post, we will look at writing code for feature transformations, creating validation strategies, and versioning data. You can also code along with this blog post by following along with the code on Kaggle.
Functions are the natural unit of code for feature transformations and reusability. When written with the goal of creating reusable, portable, repeatable, and chainable data transformations, they work very well for machine learning workflows.
Let us understand how to tackle these design goals:
Naming and Documenting functions
Firstly, it is important to name your functions appropriately and document them following standards. I generally like to use a custom version of the sklearn docstring template, which can be found at the following link
The must-have categories in a docstring are:
A function description
Parameters (with expected types)
Returns (with the object type)
Based on the complexity of the function, it can either have a single line description or a multi-line description.
I like adding another section to the documentation: Suggested Imports. This makes the function easier to copy across notebooks and scripts in various projects, without the worry of having to move scripts around just to import a couple of functions into a notebook, and without missing any important import statements. It also indirectly tells me what libraries the function depends on. I would still recommend putting all imports at the top of a script, in line with PEP 8.
Also, it may be beneficial to note down Example Usage patterns of the function, in the rare cases where the function has complex use cases.
Let us now look at how to document a function:
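Below is a minimal sketch of this template in action; example_fn and its body are placeholders chosen purely for illustration:

```python
import pandas as pd


def example_fn(df, column):
    """
    Count the number of missing values in a column of a dataframe.

    Suggested Imports
    -----------------
    import pandas as pd

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe to inspect.
    column : str
        Name of the column whose missing values should be counted.

    Returns
    -------
    int
        The number of missing values in the given column.
    """
    return df[column].isna().sum()
```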
This now enables us to access the docstring from anywhere we import, define, or call this example_fn function.
Standardizing the UX
One of the key ingredients that makes a function familiar and easy to use is standardizing its inputs and outputs across the board for a specific use case, thereby ensuring that the user or developer experience is seamless.
The scikit-learn API has standardized .fit(), .transform(), and .predict() methods across its library, and it is a fantastic example of how a standardized UX can lead to high developer productivity and an extremely low barrier to entry.
For functions that are meant to perform feature engineering, it is important to always take a dataframe (along with some auxiliary arguments) as input and return a dataframe, as can be seen in the sketch below.
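A sketch of this dataframe-in, dataframe-out contract, where the argument names and body are purely illustrative:

```python
def transform_features(df, drop_duplicates=True):
    """Dataframe in, dataframe out: the standardized contract."""
    df = df.copy()
    if drop_duplicates:
        df = df.drop_duplicates()
    # ... further feature engineering logic operating on df ...
    return df
```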
This design is independent of pandas and can be ported to other data processing frameworks such as Spark. Any logical code written using pandas can then be ported over to run on a Spark-based parallel processing system using the Koalas API, while following the same design pattern.
Pure functions have two important properties:
If given the same arguments, the function must return the same value
When the function is evaluated, there are no side effects (no I/O streams, no mutation of static and non-local variables)
When writing functions for data transformations, we cannot always write pure functions, especially considering the limitations of system memory: avoiding the mutation of a non-local variable requires creating an entirely new copy of the dataframe.
Therefore, it is important to have a boolean argument, inplace, which lets the developer decide whether or not to mutate the dataframe as the situation requires.
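A minimal sketch of this pattern, with a placeholder transformation:

```python
def fill_missing(df, column, fill_value=0, inplace=False):
    """Fill missing values in a column, optionally mutating df in place."""
    if not inplace:
        # pure-function behaviour: leave the caller's dataframe untouched
        df = df.copy()
    df[column] = df[column].fillna(fill_value)
    return df
```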
Type hinting was introduced in PEP 484 and Python 3.5. I therefore do not recommend adopting it wholesale for now unless you are sure that all of the libraries in your workflow are compatible with Python 3.5 and above.
The basic structure of type hinting in Python is as follows:
Once the function’s argument list is complete, use the -> symbol to indicate the return type; it could be int, dict, or any other Python data type.
Every argument in the function is followed by a : and the data type of that argument.
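For example:

```python
def add_numbers(a: int, b: int) -> int:
    return a + b
```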
You can also represent more complex cases, such as nested data types and optional data types, using the typing module in Python.
Below, you can find an example where I use the typing module for type hinting in Python:
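The function here is an illustrative sketch; the point is the Optional and List annotations:

```python
from typing import List, Optional

import pandas as pd


def select_columns(df: pd.DataFrame, columns: Optional[List[str]] = None) -> pd.DataFrame:
    """Return a copy of df restricted to the given columns (all columns if None)."""
    if columns is None:
        return df.copy()
    return df[columns].copy()
```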
Not all of the principles stated above are necessary, but they are important to consider when designing functions for feature transformation/engineering.
Let us now use these principles to design a function that allows us to engineer date-based features. We will be using the Rossmann Store Sales dataset.
Reading in the dataset
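Assuming the competition files have been downloaded to the working directory (the file paths here are an assumption), we can read and join them with pandas:

```python
import pandas as pd

# Rossmann Store Sales files, as provided by the Kaggle competition
train = pd.read_csv("train.csv", parse_dates=["Date"], low_memory=False)
store = pd.read_csv("store.csv")

# attach the store-level metadata to each daily sales record
train = train.merge(store, on="Store", how="left")
```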
Putting the principles in practice
We will now write a function that allows us to engineer date-based features, which can be used in downstream machine learning training tasks and are especially well suited to tree-based models.
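A sketch of such a function, applying the principles discussed above (the exact set of features generated is a design choice):

```python
import pandas as pd


def generate_date_features(
    df: pd.DataFrame, date_column: str = "Date", inplace: bool = False
) -> pd.DataFrame:
    """
    Engineer date-based features from a datetime column.

    Suggested Imports
    -----------------
    import pandas as pd

    Parameters
    ----------
    df : pandas.DataFrame
        Dataframe containing the date column.
    date_column : str
        Name of the column holding datetime values.
    inplace : bool
        If False (default), operate on a copy of the dataframe.

    Returns
    -------
    pandas.DataFrame
        Dataframe with the new date features appended.
    """
    if not inplace:
        df = df.copy()
    dates = pd.to_datetime(df[date_column])
    df[f"{date_column}_year"] = dates.dt.year
    df[f"{date_column}_month"] = dates.dt.month
    df[f"{date_column}_day"] = dates.dt.day
    df[f"{date_column}_dayofweek"] = dates.dt.dayofweek
    df[f"{date_column}_quarter"] = dates.dt.quarter
    df[f"{date_column}_is_month_end"] = dates.dt.is_month_end.astype(int)
    return df


# apply the transformation to the training data read in earlier
train = generate_date_features(train, date_column="Date")
```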
As can be seen above, the generate_date_features function is portable, reusable, flexible, and can work across various data transformation pipelines.
You can design directed acyclic graphs (DAGs) to execute specific Python functions with pre- and post-dependencies to generate your final transformed dataset.
This feature engineering pipeline can also be continually regenerated from new raw data by such a DAG. I would definitely recommend checking out the package Airflow, which allows us to write flexible DAGs to manage ETL workloads easily.
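A minimal sketch of what such a DAG could look like (the task callables are placeholders, and the import paths assume Airflow 2.x):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_raw_data():
    ...  # read the raw files from their source


def build_features():
    ...  # apply the feature engineering functions, e.g. generate_date_features


def save_dataset():
    ...  # persist the final transformed dataset


with DAG(
    dag_id="feature_engineering",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract = PythonOperator(task_id="extract_raw_data", python_callable=extract_raw_data)
    features = PythonOperator(task_id="build_features", python_callable=build_features)
    save = PythonOperator(task_id="save_dataset", python_callable=save_dataset)

    # pre and post dependencies between tasks
    extract >> features >> save
```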
It is important to keep track of data that is generated from raw sources so that it becomes easier to reproduce results, machine learning models, bugs, or any anomalies found during the machine learning pipeline.
There are several ways to keep track of data. Two such ways are:
Saving copies of the modified datasets
Creating new columns with a standardized naming scheme to track validation sets, modified and engineered features
In this particular kernel, I will discuss the latter, which I believe is a strategy more suited to a Data Scientist than to a machine learning engineer.
If you go to Scikit-Learn’s documentation for the KFold class, you will see a pattern which most Data Scientists/ML Engineers use when performing validation. This pattern can be found below:
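This is essentially the usage example from the scikit-learn documentation:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])

kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```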
I recommend using a solution based on this approach, where we create a new column that tracks the fold number to which each record belongs. This enables us not only to train models on different folds in parallel, but also to train them on machines that have no network connection between them; every team member could build a model for one fold. This approach to validation unshackles the Data Scientist and increases their productivity.
For this particular problem, we can just use a single time period of 48 days.
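A sketch of tagging that period as a validation fold, continuing with the train dataframe from earlier (the 48-day cutoff mirrors the length of the competition’s test period):

```python
# records in the last 48 days of history form validation fold 0;
# fold == -1 means the record is always available for training
cutoff = train["Date"].max() - pd.Timedelta(days=48)
train["fold"] = -1
train.loc[train["Date"] > cutoff, "fold"] = 0
```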
To validate a model on a fold number k, you can extract your train and validation sets using the code below:
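Continuing with the fold column created above:

```python
k = 0  # the fold to validate on

train_df = train[train["fold"] != k].copy()
valid_df = train[train["fold"] == k].copy()
```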