Controlling and Validating (Generative) Artificial Intelligence for Oliver Wyman’s NewsTrack Pipeline

Oliver Wyman's AI validation framework for NewsTrack demonstrates how rigorous validation of (generative) AI can be executed efficiently in practice.

Background & Description

NewsTrack is a unique text analytics pipeline developed by Oliver Wyman. It leverages large language models (LLMs) and generative AI (gen AI) to produce real-time predictions of adverse events for entities (e.g. downgrades, default/bankruptcy, fraud) from a variety of news feeds. An explainability module applied to NewsTrack generates an understanding of the drivers that underpin the model’s predictions: upon a significant signal, the user is pointed not only to the inputs (news items) that triggered the signal but also to the themes that the model considered significant drivers.

Being able to understand and control the output generated by LLMs and (gen) AI is a prerequisite not only from a supervisory perspective but also for practical use in any business decision. Hence, Oliver Wyman developed an AI validation framework to demonstrate how rigorous validation can be executed efficiently in practice. This requires both the right methodologies and approaches for the tests and an efficient platform to support their execution.

How this technique applies to the AI White Paper Regulatory Principles

Safety, Security & Robustness

As for most analytical models, the ability to test, measure and evidence the statistical power of the model is essential. The right performance metrics will depend on the use case and, ultimately, on the model structure used.

For NewsTrack, the validation framework covers testing on unseen data to detect whether the model is over-fitted or too simplistic to capture underlying patterns (e.g. out-of-time and out-of-sample tests, cross-validation, perturbation of inputs, and masking or swapping of entities).
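Two of the tests named above, an out-of-time split and entity masking, can be illustrated with a minimal Python sketch. This is not NewsTrack's implementation: the record layout, the function names `out_of_time_split` and `mask_entity`, the `[ENTITY]` placeholder and the sample news items are all hypothetical and chosen only for illustration.

```python
import re
from datetime import date


def out_of_time_split(records, cutoff):
    """Train on items dated before the cutoff, test on items dated on or after it."""
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test


def mask_entity(text, entity, placeholder="[ENTITY]"):
    """Replace every whole-word, case-insensitive mention of the entity name."""
    pattern = re.compile(r"\b" + re.escape(entity) + r"\b", flags=re.IGNORECASE)
    return pattern.sub(placeholder, text)


# Hypothetical news items; a real pipeline would read these from news feeds.
records = [
    {"date": date(2023, 1, 10), "text": "Acme Corp reports record profit."},
    {"date": date(2023, 6, 2), "text": "Regulator opens fraud probe into Acme Corp."},
    {"date": date(2024, 2, 15), "text": "Acme Corp downgraded after missed payment."},
]

train, test = out_of_time_split(records, cutoff=date(2024, 1, 1))
masked = mask_entity(test[0]["text"], "Acme Corp")
print(masked)
```

If a model's prediction changes materially once the entity name is masked, that may indicate it has learned name-specific artefacts rather than the content of the news item.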

Appropriate Transparency & Explainability

The validation framework includes a module for model logic stability, interpretability and plausibility. This module evaluates the appropriateness of the themes identified by the explainability module, including an assessment of how well the set of model drivers has been identified.

To assess the stability of the generated driver descriptions, a sensitivity analysis is performed against the decoding sampling density (a parameter of the generative approach used to describe the drivers). In addition, verbosity tests of representative driver descriptions are performed to check whether the generative description of drivers is stable.

Accountability & Governance

Perturbation stability tests challenge whether the model is based on real content understanding or is picking up on artefacts. In particular, this facilitates checking whether information that is irrelevant for a specific use case is ignored by the model, so that hallucination can be identified accordingly.

The validation framework for NewsTrack supports the execution of the tests under (meaning/content-invariant) perturbation of inputs and/or train/test sets. A dedicated library supports efficient execution of text perturbation, e.g. via synonym replacement, (re-)translation, keyword replacement, adding random noise etc.
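The dedicated perturbation library is not described in detail here, so the following Python sketch only illustrates what two such perturbations (synonym replacement and random noise) might look like. The function names, the small `SYNONYMS` table and the choice of adjacent-character swaps as "noise" are assumptions made for this example, not the library's actual API.

```python
import random
import re

# Hypothetical synonym table; a real library would use a much larger resource.
SYNONYMS = {"probe": "investigation", "profit": "earnings", "downgraded": "lowered"}


def replace_synonyms(text, synonyms=SYNONYMS):
    """Swap each alphabetic word that has an entry in the synonym table."""
    def swap(match):
        word = match.group(0)
        return synonyms.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, text)


def add_typo_noise(text, n_swaps=1, seed=0):
    """Swap adjacent characters at random positions (seeded for repeatability)."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_swaps):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


original = "Regulator opens fraud probe into Acme Corp."
print(replace_synonyms(original))
print(add_typo_noise(original, n_swaps=2, seed=42))
```

Both perturbations are intended to leave the meaning of the news item unchanged, so a robust model would be expected to produce a similar prediction for the original and the perturbed text.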

Why we took this approach

The value of the themes identified by NewsTrack rests on their being identified by a dedicated algorithm in an automated, consistent and generative (i.e. human-readable) fashion. Hence, the validation tests:

  • Stability of drivers identified: the model predictions can be considered reliable and meaningful only if the set of themes identified is stable under (content-invariant) perturbation of the inputs and of the training data sets themselves

  • Comprehensiveness of model drivers: this can be assessed via sensitivity analyses against the granularity of the themes identified. If the driver identification is too granular, the themes will no longer be meaningful; if it is too aggregated, too many themes will be associated with a single driver
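One possible way to quantify the stability of the identified themes is to compare the theme set from the unperturbed run with the theme sets from perturbed runs. The Python sketch below uses Jaccard similarity for this. The metric is an assumption made for illustration, not necessarily the one used in the framework, and the theme labels are invented.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def theme_stability(baseline_themes, perturbed_theme_sets):
    """Mean Jaccard similarity between the baseline themes and each perturbed run."""
    scores = [jaccard(baseline_themes, themes) for themes in perturbed_theme_sets]
    return sum(scores) / len(scores)


# Hypothetical theme labels from one baseline run and two perturbed runs.
baseline = {"liquidity stress", "fraud investigation", "rating action"}
perturbed_runs = [
    {"liquidity stress", "fraud investigation", "rating action"},
    {"liquidity stress", "fraud investigation", "management change"},
]

score = theme_stability(baseline, perturbed_runs)
print(f"theme stability: {score:.2f}")
```

A score close to 1 would suggest that the same themes are recovered under perturbation. Exact string matching of labels is a simplification: generated theme descriptions that differ in wording but not in meaning would need a softer comparison.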

Benefits to the organisation using the technique

The AI validation framework for NewsTrack is overall similar to a traditional validation framework. Given that gen AI is expected to be applied across a large spectrum of use cases and to be developed and used by an extensive set of (non-expert) users, particular focus is on governance, standards, scope of application etc. A particular challenge often arises around procedures and tests; thus the focus of this validation framework is on the more quantitative aspects.

Limitations of the approach

Validation tests used for the text analytics pipeline rely on the specific use case of the pipeline’s output. As a result, these tests may not be directly applicable to other (gen) AI models unless adjustments are made. Additionally, while the incorporated transparency module facilitates the interpretability of model outputs, it does not completely resolve the black-box nature of AI models in general.

Further AI Assurance Information

Published 9 April 2024