Shell: Evaluating the performance of machine learning models used in the energy sector

Case study from Shell.

Background & Description

This project uses deep learning (DL) for computer vision tasks, specifically semantic segmentation in a specialised application domain. The project had about 15 DL models in active deployment. The models are applied in a cascaded fashion to generate predictions, which then feed into a series of downstream tasks to produce the final output; that output is the input to a manual interpretation task. AI assurance through model performance evaluation is therefore critical to ensure robust and explainable AI outcomes. Three types of model evaluation tests were designed and implemented in the DL inference pipeline:

  1. Regression tests (unit test per DL model),
  2. Integration tests (test cascaded pipelines), and
  3. Statistical tests (stress tests to understand the operating limits of the model conditional on test data quality).
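
As an illustration of the first test type, a per-model regression test for a segmentation model might compare predictions on a fixed test set against a recorded baseline score. The sketch below is a minimal example under stated assumptions: the case study does not say which metric or thresholds were used, so intersection-over-union (IoU), the function names and the tolerance value are all hypothetical choices.

```python
from typing import Sequence

# A binary segmentation mask as nested sequences; 1 marks the target class.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, ref: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            if p and r:
                inter += 1
            if p or r:
                union += 1
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else inter / union


def regression_test(pred: Mask, ref: Mask,
                    baseline_iou: float, tolerance: float = 0.02) -> bool:
    """Pass if the score has not dropped more than `tolerance` below baseline.

    `baseline_iou` and `tolerance` are assumed values that a team would
    record per model; they are not taken from the case study.
    """
    return iou(pred, ref) >= baseline_iou - tolerance


if __name__ == "__main__":
    prediction = [[1, 1], [0, 0]]
    reference = [[1, 0], [0, 0]]
    print(iou(prediction, reference))
    print(regression_test(prediction, reference, baseline_iou=0.5))
```

In practice one such test would exist per deployed model, each with its own reference data and baseline.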

How this technique applies to the AI White Paper Regulatory Principles

More information on the AI White Paper Regulatory Principles.

Safety, Security & Robustness

The regression and integration tests form the backbone and provide model interpretability against a set of test data. During model development, they provide a baseline for interpreting whether model performance is improving or degrading, conditional on the model training data and parameters. During the model deployment phase, these tests also provide an early indication of concept drift.
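
An integration test differs from a per-model regression test in that it exercises several stages chained together and checks only the final output. The following is a minimal sketch, assuming each stage can be treated as a function from a list of values to a list of values; the `denoise` and `threshold` stages are toy stand-ins, not the project's models, and all names are hypothetical.

```python
from typing import Callable, List, Sequence

# A pipeline stage: takes a list of values and returns a list of values.
Stage = Callable[[List[float]], List[float]]


def run_cascade(stages: Sequence[Stage], data: List[float]) -> List[float]:
    """Feed the output of each stage into the next one."""
    for stage in stages:
        data = stage(data)
    return data


def integration_test(stages: Sequence[Stage], test_input: List[float],
                     expected: List[float], max_abs_error: float = 1e-6) -> bool:
    """Pass if the end-to-end output matches the stored expected output."""
    output = run_cascade(stages, test_input)
    if len(output) != len(expected):
        return False
    return all(abs(o - e) <= max_abs_error for o, e in zip(output, expected))


# Toy stand-ins for cascaded models (hypothetical).
def denoise(values: List[float]) -> List[float]:
    """Zero out small values, imitating a noise-reduction step."""
    return [v if abs(v) >= 0.1 else 0.0 for v in values]


def threshold(values: List[float]) -> List[float]:
    """Turn scores into 0/1 detections."""
    return [1.0 if v > 0.5 else 0.0 for v in values]


if __name__ == "__main__":
    ok = integration_test([denoise, threshold], [0.05, 0.7, 0.3], [0.0, 1.0, 0.0])
    print("integration test passed:", ok)
```

Because the check is end to end, a failure here does not say which stage regressed; that is what the per-model regression tests are for.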

Statistical tests are designed more to predict model performance given the statistics of the test data, hence providing a mechanism to detect data drift as models are deployed. They also indicate how robust DL model performance is to statistical variations in the test data.
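
One way to flag data drift is to compare the distribution of a summary statistic (for example, a per-sample intensity or quality score) between reference data and incoming data. The case study does not name the statistic it used; the sketch below swaps in a two-sample Kolmogorov-Smirnov statistic as one common choice, and the function names and the `threshold` value are assumptions for illustration only.

```python
from bisect import bisect_right
from typing import List, Sequence


def ks_statistic(reference: Sequence[float], incoming: Sequence[float]) -> float:
    """Largest absolute gap between the two empirical CDFs."""
    ref: List[float] = sorted(reference)
    inc: List[float] = sorted(incoming)
    largest_gap = 0.0
    # The maximum gap between two step functions occurs at a sample point.
    for x in ref + inc:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_inc = bisect_right(inc, x) / len(inc)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_inc))
    return largest_gap


def drift_detected(reference: Sequence[float], incoming: Sequence[float],
                   threshold: float = 0.3) -> bool:
    """Flag drift when the KS statistic exceeds an assumed threshold."""
    return ks_statistic(reference, incoming) > threshold


if __name__ == "__main__":
    reference_scores = [1.0, 2.0, 3.0, 4.0]
    shifted_scores = [5.0, 6.0, 7.0, 8.0]
    print(ks_statistic(reference_scores, shifted_scores))
    print(drift_detected(reference_scores, shifted_scores))
```

A fixed threshold is the simplest possible decision rule; a production test would more likely derive it from a significance level and the sample sizes.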

Appropriate Transparency and Explainability

The output of this AI assurance technique is communicated to AI developers and product owners so that they can monitor potential deviation from expected DL model performance. If performance deviates, these teams can operationalise appropriate mitigation measures.

It is also communicated to frontline users and business stakeholders so that they can maintain a high degree of trust in the outcomes of the DL models.

Accountability and Governance

AI developers are responsible for designing and running the model evaluation tests to strengthen the performance testing. Product owners are responsible for leveraging these tests as a first line of defence before new model deployments. The project team works together to adapt the tests to tackle data and concept drift during deployment.

Why we took this approach

In this project, the predictions of the DL models ultimately generate inputs for a manual interpretation task. This task is complicated, time-consuming and effort-intensive, so it is crucial that the starting point (in this case, the DL model predictions) be of high quality in terms of accuracy and detection coverage, with very low noise. Furthermore, the outcome of the manual interpretation feeds into a high-impact decision-making process.

The quality and robustness of the DL models' predictions are thus of paramount importance. The most important measure of the models' prediction performance is human-in-the-loop quality control. However, to automate performance testing as a first line of defence, the model evaluation test suite technique was adopted. Data version control and the creation of implicit ML experiment pipelines were adopted mainly to ensure that the models could be reproduced end to end (data, code and model performance) within an acceptable margin of error.
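
The phrase "within an acceptable margin of error" can be made concrete as a comparison between the metrics recorded for a model and those obtained when the pipeline is re-run from versioned data and code. This is a minimal sketch under assumptions: the metric names, the relative tolerance and the function name are hypothetical and not taken from the project.

```python
import math
from typing import Dict, Optional, Tuple


def metrics_outside_margin(
    recorded: Dict[str, float],
    rerun: Dict[str, float],
    rel_tol: float = 0.01,
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return the metrics whose re-run value is missing or outside the margin.

    An empty result means the model was reproduced within `rel_tol`
    (an assumed relative tolerance) on every recorded metric.
    """
    failures: Dict[str, Tuple[float, Optional[float]]] = {}
    for name, old_value in recorded.items():
        new_value = rerun.get(name)
        if new_value is None or not math.isclose(old_value, new_value, rel_tol=rel_tol):
            failures[name] = (old_value, new_value)
    return failures


if __name__ == "__main__":
    recorded_metrics = {"iou": 0.80, "recall": 0.90}
    rerun_metrics = {"iou": 0.803, "recall": 0.85}
    print(metrics_outside_margin(recorded_metrics, rerun_metrics))
```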

Benefits to the organisation

  1. Provides a first line of defence through automated DL performance testing for QA.

  2. Tests model robustness and improves the interpretability of DL model performance.

  3. Gives AI developers and end users a robust explanation of DL model performance.

  4. Builds trust in DL models and workflows within the user community.

  5. Enables model monitoring by establishing a mechanism to detect concept drift.

  6. Provides MLOps hooks for enabling CI/CD during model deployment.

Limitations of the approach

  1. A large number of DL models with very different tasks: detection, classification, noise reduction.

  2. Complexity and variability of the problem being addressed by DL make designing KPIs difficult.

  3. Lack of high-quality, representative data that could be used to design the model evaluation.

  4. Lack of clear metrics/thresholds to design regression, integration, and statistical tests.

  5. Lack of a stable model evaluation library.

Further AI Assurance Information

Published 6 June 2023