Optimal Decision Trees

The new frontier of decision trees: non-greedy training to obtain small and performant decision trees

Marco Trincavelli
8 min read · Mar 22, 2021

Decision trees are one of the most popular machine learning algorithms and constitute the main building block of the most successful ensemble methods, namely random forests and gradient boosting machines. Interpretability is often cited as the main reason for their popularity. However, decision trees are interpretable only if they are small, so that the number of splits and variables involved can be easily grasped by the human brain. When trees grow too large and the depth of their leaves exceeds 3 or 4, or when they are aggregated in ensembles with dozens of trees, most of the interpretability value is lost. So, the big question is: can I train small decision trees that provide state-of-the-art performance? Answer: probably yes :)

A Bit of History: greedy training

Decision trees are traditionally trained in a greedy fashion, split after split, until a termination criterion is met. Once made, splits are never reconsidered, and this is where the greediness comes from. A pruning phase may be performed afterwards to control the complexity of the model and to limit overfitting. In short, training is performed in two phases: a first one in which the tree is (greedily) grown, and a second one in which the tree is (greedily) shrunk.

Greedy training of a decision tree: first the tree is grown split after split until a termination criterion is met, and afterwards the tree is pruned to avoid overly complex models.
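As a minimal sketch of this two-phase workflow, here is how it looks with scikit-learn, where cost-complexity pruning (the ccp_alpha parameter) plays the role of the pruning phase. scikit-learn, the dataset, and the ccp_alpha value are my choices for illustration, not anything prescribed by the papers discussed below.

```python
# Grow-then-prune sketch with scikit-learn (illustrative choices throughout).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Phase 1: greedily grow a (potentially large) tree, split after split.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Phase 2: shrink it with minimal cost-complexity pruning; a larger
# ccp_alpha prunes more aggressively.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
print(full_tree.score(X_test, y_test), pruned_tree.score(X_test, y_test))
```

Both decisions (where to split, what to prune) are taken locally, one node at a time, which is exactly the greediness discussed here.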

The main, and arguably only, reason for training decision trees greedily is computational cost. It is well known that training an optimal decision tree is an NP-hard problem [1]. However, in recent years, big advances in solver technology and increases in computational power have opened up several possibilities to improve decision tree training and move away from purely greedy approaches.

Moving away from greedy training towards global optimisation methods that construct the nodes of the tree all at once, optimising one or more overall criteria, could bring several advantages:

  • Interpretability: optimal trees tend to be smaller, and therefore more interpretable.
  • Global Constraints: given the mathematical optimisation nature of the problem formulation, global constraints can be imposed to enhance interpretability, fairness or other performance measures.
  • Generalisation: several works in the scientific literature [2,3,4,5,6,7,8] show that optimal trees tend to generalise better than greedily trained trees, which are known to be weak learners. It is however unclear whether optimal trees generalise better than ensembles of greedy trees like Random Forests or Gradient Boosting methods. In other words, can optimal decision trees be considered strong classifiers?
  • Small data: optimisation-based machine learning algorithms (e.g. Support Vector Machines) tend to obtain state-of-the-art performance on small-data problems. However, they tend to struggle on large datasets, since they usually scale at least linearly (and often worse!) with the number of samples in the dataset.

The Next Big Thing: non-greedy training

Alright, there are good reasons to attempt to train a decision tree in a non-greedy fashion, but how do I practically do it? This article provides a taxonomy of the approaches proposed in the literature for fitting Optimal Decision Trees (ODTs), organised around the various optimisation paradigms, namely Dynamic Programming, Constraint Programming, Mixed Integer Programming, and Continuous Optimisation. A word of caution: this article is not meant to be an exhaustive review of the research on optimal decision trees, but rather a starting point for the interested researcher/practitioner.

Optimal training of a decision tree: a constrained optimisation problem is solved, and the decision tree is obtained as its solution. Loss function image taken from here.

Dynamic Programming

Dynamic programming is an optimisation technique that solves complicated problems by recursively breaking them down into simpler subproblems. A classic problem that can be efficiently solved with dynamic programming is the knapsack problem, sketched below.
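The sketch below shows the standard 0/1 knapsack recurrence in Python, purely to illustrate the "break into subproblems and reuse their solutions" idea; the item values and weights are made up for illustration.

```python
def knapsack(values, weights, capacity):
    """0/1 knapsack via dynamic programming.

    best[c] holds the best total value achievable with capacity c using the
    items processed so far; each subproblem is solved once and reused instead
    of being recomputed.
    """
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Iterate capacities downwards so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]

# Toy instance (numbers are purely illustrative).
print(knapsack(values=[60, 100, 120], weights=[10, 20, 30], capacity=50))  # -> 220
```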

There are a few examples of ODTs based on dynamic programming in the literature, among which [2]. The name of the game in dynamic programming is to avoid repeated computations. This is achieved in [2], and in other dynamic-programming-based optimal decision tree papers, through ad hoc algorithms characterised by rather involved mathematics, whose explanation falls outside the scope of this short article.

To my knowledge, the ODT formulations based on dynamic programming are rather specific, often limited to classification problems with binary or categorical features.

Constraint Programming

Constraint programming is an optimisation paradigm that puts the focus on the constraints rather than on an objective function, to the point that one can formulate problems without an objective function at all (satisfaction problems). Constraint programs are usually solved using a mix of constraint propagation, local and backtracking search, and dynamic programming. Moreover, major performance boosts can be obtained by encoding special combinatorial structures in global constraints, which are solved by ad-hoc algorithms (the big classic of global constraints is the alldifferent constraint).
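To make the alldifferent idea concrete, here is a tiny satisfaction problem (no objective at all) written with Google OR-Tools' CP-SAT solver; OR-Tools is my tool choice for the sketch, not the solver used in [3].

```python
# A tiny constraint-satisfaction sketch featuring the alldifferent global constraint.
from ortools.sat.python import cp_model

model = cp_model.CpModel()
x = model.NewIntVar(1, 3, "x")
y = model.NewIntVar(1, 3, "y")
z = model.NewIntVar(1, 3, "z")

# Global constraint: all three variables must take pairwise different values.
model.AddAllDifferent([x, y, z])
# An ordinary constraint, just to narrow the search.
model.Add(x < y)

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print(solver.Value(x), solver.Value(y), solver.Value(z))
```

There is no objective function here: the solver only has to find an assignment satisfying all constraints, which is exactly the "satisfaction problem" setting mentioned above.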

The ODT formulation proposed in [3] seems to be very efficient and leverages a new global constraint (CoverSize) and a specific search strategy (AND/OR search). However, the presented model can only tackle classification problems, and it is unclear how it could be extended to regression. Moreover, features need to be binarised before being fed to the model, which limits its applicability.

Mathematical Programming — Mixed Integer Optimisation

Mixed Integer Optimisation is a paradigm where we minimise an objective function subject to a set of constraints, with decision variables that are both continuous and discrete. Mixed integer optimisation problems are usually solved by relaxing the integrality constraints on the discrete variables and solving the resulting continuous problem. Then, through branch & bound and cutting planes/heuristics, the integrality constraints are restored and a feasible solution is found. Mixed Integer Optimisation is known to be NP-hard.
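As a minimal sketch of what "mixed" means, here is a toy problem with one continuous and one binary variable, written with PuLP and its default open-source CBC backend (my convenience choice; [4] below uses Gurobi). The solver handles the relaxation and branch & bound behind the scenes.

```python
# Toy mixed integer program: one continuous and one binary decision variable.
from pulp import LpProblem, LpVariable, LpMaximize, value

prob = LpProblem("toy_mip", LpMaximize)
x = LpVariable("x", lowBound=0)        # continuous variable
y = LpVariable("y", cat="Binary")      # discrete (0/1) variable

prob += 3 * x + 5 * y                  # objective to maximise
prob += x + 4 * y <= 6                 # capacity-style constraint
prob += x <= 4

# CBC relaxes integrality, then restores it via branch & bound and cuts.
prob.solve()
print(value(x), value(y), value(prob.objective))
```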

ODTs based on mixed integer optimisation are by far the most common in the literature, which makes it hard to select one paper to cite here. Well… I'll go with two! The ODT proposed in [4] is arguably the most complete I have stumbled upon. It can tackle both classification and regression problems, can work with categorical as well as continuous features, and can even incorporate fairness constraints to make sure that the learned tree is non-discriminative with respect to sensitive features. The model is solved using Gurobi; no ad-hoc optimisation algorithm is presented. The number of variables in this formulation scales linearly with the size of the dataset. Unfortunately, looking at the training times, the formulation proposed in [4] does not seem to scale very well. This aspect is addressed by [5], where, together with a novel ODT formulation inspired by the well-known 1-norm Support Vector Machine, a data selection method is presented that allows the approach to scale to large datasets (a dataset with 245k samples is presented in the experiments, while all other approaches hardly go beyond datasets with 5k samples). Moreover, the authors present an ad-hoc cutting method that tightens the continuous relaxation and improves the run time of the model.

Mathematical Programming — Continuous Optimisation

As we touched upon in the previous section, the hardest part of solving mixed integer optimisation problems is the integrality constraints that come with the discrete variables. It is therefore attractive to try to formulate an ODT avoiding discrete variables altogether. That is exactly what the formulations presented in this section do. One of the main attractive aspects of this family of methods, from the practitioner's perspective, is that, since these models are continuous, they can be optimised using variations of gradient-based methods that are readily available in frameworks like Tensorflow or Pytorch. A lot of functionality, including GPU acceleration, may come for free!

The formulation presented in [6] is a model based on an upper bound of a loss function that would otherwise be very hard to optimise. This upper bound can be optimised using continuous optimisation techniques like mini-batch SGD. The model can perform both classification (binary and multi-class) and regression and can cope with all kinds of features. The main drawback is that, the loss being non-convex, the approach does not provide any optimality guarantee. Moreover, it may be hard to impose additional constraints, but on the other hand the objective function can be very flexible given the continuous nature of the optimisation.

More recently, [7], apart from proposing a very interesting and simple formulation of non-greedy decision trees based on a probabilistic interpretation of the tree nodes, took the first steps towards an analysis of feature importance for ODTs.
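To give a feel for this family of methods, here is a generic "soft" decision tree sketch in PyTorch: every internal node is a sigmoid-gated linear split, so the whole tree is differentiable and can be trained with mini-batch gradient descent. This is my own illustration of the continuous-optimisation idea, not the exact formulation of [6] or [7]; the architecture, data, and hyperparameters are arbitrary.

```python
import torch
import torch.nn as nn


class SoftDecisionTree(nn.Module):
    """Fixed-depth soft binary tree: sigmoid routing at internal nodes,
    class-logit vectors at the leaves, predictions as a probability-weighted
    mixture of leaves. Everything is differentiable."""

    def __init__(self, n_features, n_classes, depth=2):
        super().__init__()
        self.depth = depth
        n_internal = 2 ** depth - 1                        # split nodes
        n_leaves = 2 ** depth
        self.splits = nn.Linear(n_features, n_internal)    # one linear split per node
        self.leaves = nn.Parameter(torch.randn(n_leaves, n_classes) * 0.1)

    def forward(self, x):
        gate = torch.sigmoid(self.splits(x))               # prob. of going right at each node
        path = torch.ones(x.shape[0], 1, device=x.device)  # prob. of being at the root
        for d in range(self.depth):
            offset = 2 ** d - 1
            g = gate[:, offset:offset + 2 ** d]            # gates of level d
            # Split each node's probability mass between its two children.
            path = torch.stack([path * (1 - g), path * g], dim=-1).flatten(1)
        return path @ self.leaves                          # mixture of leaf logits


# Toy usage: random data, standard cross-entropy, Adam optimiser.
X = torch.randn(256, 4)
y = torch.randint(0, 3, (256,))
model = SoftDecisionTree(n_features=4, n_classes=3, depth=2)
optimiser = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
for _ in range(100):
    optimiser.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimiser.step()
```

Because the routing probabilities are smooth, gradients flow through every node simultaneously, which is what allows the whole tree to be optimised at once rather than split by split.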

Conclusions

The publication years of the papers cited in this short article show that research on ODTs is flourishing right now. Many aspects of how ODTs compare to greedy decision trees and ensembles of trees are still unclear. In my opinion, the Holy Grail of ODTs would be to produce decision trees that are on par with ensembles of trees (random forests, gradient boosting machines) with respect to performance, but much more compact and therefore interpretable. Moreover, ODTs should be computationally feasible and scale to large datasets, which is not the case for most of the currently proposed formulations.

In general, the fields of machine learning and discrete optimisation are coming closer and closer, opening up new horizons. I can recommend [8] as an excellent book on the topic, covering several of the areas where machine learning and discrete optimisation meet, among which optimal decision trees, prescriptive analytics, experiment design, and much more.

References

[1] Hyafil L., Rivest R. L. (1976). Constructing Optimal Binary Decision Trees is NP-complete. Information Processing Letters, 5(1), 15–17.

[2] Lin J., Zhong C., Hu D., Rudin C., Seltzer M. (2020). Generalized and Scalable Optimal Sparse Decision Trees. Proceedings of the 37th International Conference on Machine Learning, 119:6150–6160. Open source code available here.

[3] Verhaeghe H., Nijssen S., Pesant G., Quimper C.G., Schaus P. (2019). Learning Optimal Decision Trees using Constraint Programming. Proceedings of the 25th International Conference on Principles and Practice of Constraint Programming. Open source code available here.

[4] Aghaei S., Azizi M.J., Vayanos P. (2019). Learning Optimal and Fair Decision Trees for Non-Discriminative Decision-Making. Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019.

[5] Zhu H., Murali P., Phan D., Nguyen L., Kalagnanam J. (2020). A Scalable MIP-based Method for Learning Optimal Multivariate Decision Trees. Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

[6] Norouzi M., Collins M.D., Johnson M., Fleet D.J., Kohli P. (2015). Efficient Non-greedy Optimization of Decision Trees. Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS 2015).

[7] Blanquero R., Carrizosa E., Molero-Río C., Romero Morales D. (2021). Optimal Randomized Classification Trees. Computers and Operations Research.

[8] Bertsimas D., Dunn J. (2019). Machine Learning Under a Modern Optimization Lens. Dynamic Ideas LLC.
