A new way to understand Random Forest models that allows scientists to uncover their decision-making process in detail, going beyond classical approaches.
Complex supervised machine learning methods, like Random Forest (RF) models, are often considered ‘black boxes’. Such models can make highly accurate predictions, but their complexity makes parts of their decision-making process hard to understand. To use such models in the real world, it is essential not only to make accurate predictions but also to understand the logic behind them. Only by understanding the model’s decision-making process can we identify the key features influencing its outputs and ensure that the model produces valuable insights.
That’s why data scientists use explainability methods: approaches for understanding how decisions are made inside a machine learning model. A common example is permutation feature importance, which pinpoints the individual contribution of each feature to the model’s performance. However, such methods assume that all features contributing to the model are independent of each other, so they can misrepresent the role of correlated features in the model’s decision-making process. Moreover, because of how they work and the output they provide, they can hardly uncover feature interactions. For RF models, this is an important limitation, as capturing non-linear relationships between features is one of their key strengths.
We addressed those problems by developing the Forest-Guided Clustering algorithm. The main difference between this new algorithm and previous approaches is that feature importance is not calculated from how a feature affects the model’s performance, but from the decision paths followed within the RF model. The algorithm computes feature importance based on subgroups of instances that follow similar decision paths, providing a new metric that overcomes the limitations of permutation feature importance.
With our approach, the importance of each feature can be analyzed on a global scale, giving an overview of the features that drive the underlying patterns in the data, and on a local scale, measuring the relevance of each feature for a specific subgroup. The pattern-driven importance metric avoids misleading interpretations of correlated features, enables the detection of feature interactions, and gives a sense of how general the identified patterns actually are.
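As a rough intuition for such a subgroup-level view (a hedged sketch, not the package’s actual metric), one can score each feature by how much of its total variance lies between the identified subgroups rather than within them:

```python
import numpy as np

# Hypothetical helper for illustration only: a score near 1 means the
# feature's values separate the subgroups; near 0 means it behaves like
# noise with respect to the clustering.
def feature_separation(X, labels):
    total_var = X.var(axis=0)
    within = np.zeros(X.shape[1])
    for c in np.unique(labels):
        members = X[labels == c]
        within += members.var(axis=0) * len(members)
    within /= len(X)
    return 1.0 - within / total_var  # between-cluster variance ratio

# Toy data: feature 0 defines two groups, feature 1 is pure noise.
rng = np.random.default_rng(0)
X = np.column_stack([
    np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)]),
    rng.normal(0, 1, 100),
])
labels = np.repeat([0, 1], 50)
scores = feature_separation(X, labels)
```

Here the group-defining feature scores much higher than the noise feature, which is the kind of contrast a local, subgroup-based importance analysis makes visible.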
This project is a work in progress! Try it out, we’d love to hear what you think!
- GitHub: https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering
- Documentation: https://forest-guided-clustering.readthedocs.io/en/latest/
If Forest-Guided Clustering is useful for your research, consider citing the package via DOI: 10.5281/zenodo.6445529