Coding principles for rapid machine learning experimentation


According to François Chollet, the creator of Keras, who at his best placed 17th on Kaggle’s cumulative leaderboard, winning competitions and publishing papers is not about having the best ideas from the start but about iterating over ideas often and quickly.

Fig. 1 François Chollet’s Twitter Feed  Source

It is therefore important to optimize your workflow and tools so that you can experiment and iterate over ideas rapidly.

Fig. 2 François Chollet’s Cycle of Experimentation  Source

The goal of this series of blog posts is to explore the coding patterns and design principles for speeding up the machine learning workflow and enabling rapid experimentation.

There are several key components in a typical machine learning workflow. Some of them are listed below:

  • Exploratory Data Analysis
  • Feature Transformation / Engineering
  • Validation Strategy
  • Model Building
  • Experiment Tracking
  • Model Tuning / Ensembling

In this first post, we will look at writing code for feature transformations, creating validation strategies, and data versioning. You can also code along with this blog post by following the code on Kaggle.

Feature Engineering

Functions are the unit of code best suited to feature transformations and reusability. When written with the goal of creating reusable, portable, repeatable, and chainable sets of data transformations, they work very well for machine learning workflows.

Let us understand how to tackle these design goals:

Naming and Documenting functions

Firstly, it is important to name your functions appropriately and document them following standards. I generally like to use a custom version of the sklearn docstring template, which can be found at the following link

The must-have categories in a docstring are:

  • A function description
  • Parameters (with expected types)
  • Returns (with the object type)

Based on the complexity of the function, it can either have a single line description or a multi-line description.

I like adding another section to the documentation: Suggested Imports. This makes the function easier to copy across notebooks and scripts in a project, without having to move scripts around just to import a couple of functions into a notebook, and without missing any important import statements. It also indirectly tells me which libraries the function depends on. I would still recommend putting all imports at the top of a script, as PEP 8 recommends.

Also, it may be beneficial to note down Example Usage patterns of the function, in the rare cases where the function has complex use cases.

Let us now look at how to document a function
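As a minimal sketch of such a docstring, following the sections described above: the function body, the column naming, and the sample data are hypothetical illustrations, and the layout is only one possible variant of a scikit-learn style template.

```python
import pandas as pd


def example_fn(df, col):
    """Return a copy of `df` with an upper-cased version of a text column.

    Parameters
    ----------
    df : pandas.DataFrame
        Input data.
    col : str
        Name of the text column to transform.

    Returns
    -------
    pandas.DataFrame
        Copy of `df` with an extra column named `<col>_upper`.

    Suggested Imports
    -----------------
    import pandas as pd

    Example Usage
    -------------
    >>> example_fn(pd.DataFrame({"city": ["berlin"]}), "city")
    """
    out = df.copy()
    out[col + "_upper"] = out[col].str.upper()
    return out


# The docstring travels with the function object, so it can be read
# wherever the function is available (also via help(example_fn)).
print(example_fn.__doc__)
```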

  • This now enables us to access the docstring from anywhere we import, define, or call the example_fn function.

Standardizing the UX

One of the key ingredients in making a function familiar and easy to use is standardizing its inputs and outputs across the board for a specific use case. This keeps the user or developer experience seamless.

The scikit-learn API has standardized .fit(), .transform(), and .predict() methods across its library, and it is a fantastic example of how a standardized UX can lead to high developer productivity and an extremely low barrier to entry.

For functions that are meant to perform feature engineering, it is important to always take a dataframe and some auxiliary arguments as input and return a dataframe, as can be seen in the figure below.

Fig. 3 Every transformation function should take a dataframe as input and return a dataframe
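A minimal sketch of this dataframe-in, dataframe-out convention is shown here. The functions add_revenue and add_high_value_flag and the sample columns are hypothetical; the point is only that a shared signature lets transformations be chained with pandas' DataFrame.pipe.

```python
import pandas as pd


def add_revenue(df, price_col, qty_col):
    """Return a copy of `df` with a `revenue` column (price * quantity)."""
    out = df.copy()
    out["revenue"] = out[price_col] * out[qty_col]
    return out


def add_high_value_flag(df, threshold):
    """Return a copy of `df` flagging rows whose revenue exceeds `threshold`."""
    out = df.copy()
    out["is_high_value"] = out["revenue"] > threshold
    return out


raw = pd.DataFrame({"price": [2.0, 5.0], "qty": [3, 10]})

# Because every step takes and returns a dataframe, steps chain naturally.
features = (
    raw.pipe(add_revenue, "price", "qty")
       .pipe(add_high_value_flag, threshold=20.0)
)
print(features)
```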

This design is independent of pandas and can be ported to other data processing frameworks such as Spark. Logic written using pandas can then be ported to run on a Spark-based parallel processing system using the Koalas API, while following a similar design pattern.

Pure Functions

Pure functions have two important properties:

  • If given the same arguments, the function must return the same value
  • When the function is evaluated, there are no side effects (no I/O streams, no mutation of static and non-local variables)

Fig. 4 A visual representation of pure functions  Source

When writing functions for data transformations, we cannot always write pure functions, especially given the limits of system memory: avoiding mutation of a non-local dataframe requires creating an entirely new copy of it.

Therefore, it is useful to have a boolean argument, inplace, which lets the developer decide whether or not to mutate the dataframe, as the situation requires.
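One way such an inplace switch could look is sketched below. The function add_log_feature and the Sales sample are hypothetical. Note one design assumption: unlike pandas' own inplace methods, which return None, this sketch always returns the dataframe so that calls stay chainable.

```python
import numpy as np
import pandas as pd


def add_log_feature(df, col, inplace=False):
    """Add a log1p-transformed version of `col` as `<col>_log1p`."""
    # Mutate the caller's dataframe only when explicitly asked to;
    # otherwise pay the memory cost of a copy and stay side-effect free.
    out = df if inplace else df.copy()
    out[col + "_log1p"] = np.log1p(out[col])
    return out


sales = pd.DataFrame({"Sales": [0, 9, 99]})

pure_result = add_log_feature(sales, "Sales")           # `sales` is untouched
columns_after_pure_call = list(sales.columns)

mutated = add_log_feature(sales, "Sales", inplace=True)  # `sales` is modified
columns_after_inplace_call = list(sales.columns)
```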

Type Hinting

Type hinting was introduced by PEP 484 in Python 3.5. I therefore do not recommend relying on it fully unless you are sure that all of the libraries in your workflow are compatible with Python 3.5 and above.

The basic structure of type hinting in python is as follows:

def fn_name(arg_name: arg_type) -> return_type:
    pass

  • Every argument in the function is followed by a : and the data type of the argument.
  • After the argument list, use the -> symbol to indicate the return type; it could be int, dict, or any other Python data type.
  • You can also represent nested data types, optional data types, etc. using the typing module in Python.

Below, you can find an example where I use the typing module and use type hinting in python
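A minimal sketch of hints built with the typing module follows. The function count_tokens is a hypothetical illustration; it shows a nested container type (List[str]), an optional argument (Optional[...]), and a typed return value (Dict[str, int]).

```python
from typing import Dict, List, Optional


def count_tokens(texts: List[str],
                 vocabulary: Optional[List[str]] = None) -> Dict[str, int]:
    """Count whitespace-separated tokens, optionally restricted to a vocabulary."""
    counts = {}
    for text in texts:
        for token in text.split():
            # With no vocabulary given, every token is counted.
            if vocabulary is None or token in vocabulary:
                counts[token] = counts.get(token, 0) + 1
    return counts


all_counts = count_tokens(["a b a", "b c"])
restricted = count_tokens(["a b a", "b c"], vocabulary=["b"])
```

The hints are not enforced at runtime; they document intent and can be checked by external tools such as a static type checker.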

Not all of the principles stated above are necessary, but they are important to consider when designing functions for feature transformation/engineering.

Let us now use these principles to design a function that allows us to engineer date-based features. We will be using the Rossmann Store Sales dataset.

Reading in the dataset

Putting the principles in practice

We will now write a function that allows us to engineer date-based features, which can be used in downstream machine learning training tasks and are especially suited to tree-based models.
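A sketch of what such a function could look like, combining the principles above (docstring, dataframe in and out, inplace switch, type hints). The set of supported features, the output column naming, and the two-row sample are assumptions made for illustration, not the exact implementation used with the dataset.

```python
from typing import List, Optional

import pandas as pd


def generate_date_features(df: pd.DataFrame,
                           date_col: str,
                           features: Optional[List[str]] = None,
                           inplace: bool = False) -> pd.DataFrame:
    """Add calendar features derived from a date column.

    Parameters
    ----------
    df : pandas.DataFrame
        Input data.
    date_col : str
        Column holding dates (strings or datetimes).
    features : list of str, optional
        Subset of the supported feature names; all are used if None.
    inplace : bool, default False
        If True, add the columns to `df` itself instead of a copy.

    Returns
    -------
    pandas.DataFrame
        Dataframe with one `<date_col>_<feature>` column per feature.
    """
    supported = {
        "year": lambda s: s.dt.year,
        "month": lambda s: s.dt.month,
        "day": lambda s: s.dt.day,
        "dayofweek": lambda s: s.dt.dayofweek,  # Monday == 0
        "dayofyear": lambda s: s.dt.dayofyear,
        "is_month_end": lambda s: s.dt.is_month_end.astype(int),
    }
    names = list(supported) if features is None else features
    unknown = set(names) - set(supported)
    if unknown:
        raise ValueError("Unsupported features: {}".format(sorted(unknown)))

    out = df if inplace else df.copy()
    dates = pd.to_datetime(out[date_col])
    for name in names:
        out["{}_{}".format(date_col, name)] = supported[name](dates)
    return out


# Tiny made-up sample with a Date column, standing in for the real data.
sample = pd.DataFrame({"Store": [1, 1], "Date": ["2015-07-31", "2015-08-01"]})
dated = generate_date_features(
    sample, "Date", features=["year", "month", "dayofweek", "is_month_end"]
)
```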

As can be seen from above, the generate_date_features function is portable, reusable, flexible, and can work across various data transformation pipelines.

You can design directed acyclic graphs (DAGs) to execute specific Python functions with upstream and downstream dependencies to generate your final transformed dataset.

This feature engineering pipeline can also be regenerated from new raw data whenever needed by rerunning such a DAG. I would definitely recommend checking out Airflow, a package that lets us write flexible DAGs to manage ETL workloads easily.
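Before reaching for a scheduler, the dependency idea can be sketched with the standard library alone. This is not Airflow code: it assumes Python 3.9+ for graphlib, and the step functions and the dependencies mapping are hypothetical. Each step is a dataframe-in, dataframe-out function, and the DAG only decides the order in which steps run.

```python
from graphlib import TopologicalSorter

import pandas as pd


def clean_dates(df):
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    return out


def add_month(df):
    out = df.copy()
    out["month"] = out["Date"].dt.month
    return out


def add_is_december(df):
    out = df.copy()
    out["is_december"] = (out["month"] == 12).astype(int)
    return out


steps = {
    "clean_dates": clean_dates,
    "add_month": add_month,
    "add_is_december": add_is_december,
}

# Each step maps to the set of steps that must run before it.
dependencies = {
    "clean_dates": set(),
    "add_month": {"clean_dates"},
    "add_is_december": {"add_month"},
}


def run_pipeline(df):
    """Apply every step in an order that respects the dependencies."""
    for name in TopologicalSorter(dependencies).static_order():
        df = steps[name](df)
    return df


result = run_pipeline(pd.DataFrame({"Date": ["2015-12-24", "2015-07-31"]}))
```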

Data Versioning

It is important to keep track of data that is generated from raw sources, so that it becomes easier to reproduce results, machine learning models, bugs, or any anomalies found in the machine learning pipeline.

There are several ways to keep track of data. Two such ways are:

  • Saving copies of the modified datasets
  • Creating new columns with a standardized naming scheme to track validation sets, modified and engineered features

In this particular kernel, I will discuss the latter, which I believe is better suited to a data scientist than to a machine learning engineer.

Validation Strategy

If you go to Scikit-Learn’s documentation for the KFold class, you will see a pattern which most Data Scientists/ML Engineers use when performing validation. This pattern can be found below:

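import numpy as np
from sklearn.model_selection import KFold

# Toy data so that the snippet runs on its own
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

kf = KFold(n_splits=2)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]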

  • When you start using target-based features such as target encoding in your model, your code must go inside this loop.
  • Every time you build your model on a fold, your code must go inside this loop

This restricts the freedom of a Data Scientist and makes the results/code harder to keep track of.

This is where a solution comes in that I first came across while reading Approaching (Almost) Any Machine Learning Problem, the book by 4x Kaggle Grandmaster Abhishek Thakur.

I recommend using a solution based on his approach, where we create a new column that tracks the fold number to which each record belongs. This enables me to train models on different folds in parallel, even on machines that have no network connection between them. Every team member could build a model for one fold. This approach to validation unshackles the Data Scientist and increases their productivity.
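A minimal sketch of the fold-column idea for a generic, randomly shuffled split is shown here. It is not the book's code: the helper add_kfold_column, the seed, and the balanced assignment via a shuffled modulo pattern are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd


def add_kfold_column(df, n_splits=5, seed=42):
    """Return a copy of `df` with a `kfold` column holding fold numbers."""
    out = df.copy()
    rng = np.random.default_rng(seed)
    # 0, 1, ..., n_splits-1 repeated, then shuffled: folds stay balanced.
    out["kfold"] = rng.permutation(np.arange(len(out)) % n_splits)
    return out


data = pd.DataFrame({"x": range(10)})
kfold_demo = add_kfold_column(data, n_splits=5)
```

Once the column is written out alongside the data, every machine or team member reads the same fold assignment instead of regenerating it.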

For this particular problem, we can just use one time period of 48 days
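One reading of this, sketched under assumptions, is to hold out the final 48 days as a single validation fold: rows in the last 48 days get fold number 1 and everything earlier gets fold number 0. The helper add_time_holdout_fold, the Date column name, and the synthetic daily data are hypothetical.

```python
import pandas as pd


def add_time_holdout_fold(df, date_col="Date", holdout_days=48):
    """Return a copy of `df` with `kfold` == 1 for the last `holdout_days` days."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    # Dates strictly after the cutoff span exactly `holdout_days` calendar days.
    cutoff = dates.max() - pd.Timedelta(days=holdout_days)
    out["kfold"] = (dates > cutoff).astype(int)
    return out


# Synthetic daily data: 100 consecutive days, one row per day.
daily = pd.DataFrame({"Date": pd.date_range("2015-04-23", periods=100)})
time_folds = add_time_holdout_fold(daily, "Date", holdout_days=48)
```

With this numbering, earlier folds always precede later ones in time, so training on folds below k and validating on fold k never looks into the future.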

To validate a model on fold number k, you can extract your train and validation sets using the code below:

train_kfold = kfold_data[kfold_data.kfold < k]

val_kfold = kfold_data[kfold_data.kfold == k]

This now allows us to build and validate models on different, disconnected systems without worrying about how each processor architecture generates random numbers from a seed.

We can fully reproduce our results on each of the validation sets.
