Armilla: Armilla Guaranteed - AI Product Verification and Guarantee

Case study from Armilla.

Background & Description

Armilla Guaranteed, Armilla’s warranty for AI-powered products, empowers enterprises to assure the quality and reliability of their AI solutions with a performance guarantee backed by leading insurance companies Swiss Re, Chaucer and Greenlight Re. Armilla Guaranteed leverages Armilla’s proprietary model evaluation technology to independently validate the performance, fairness and robustness of a given AI solution in light of its use case and business requirements. If an AI product that has been guaranteed by Armilla fails to meet its key performance indicators in production, Armilla provides financial compensation to the customer.

How this technique applies to the AI White Paper Regulatory Principles

More information on the AI White Paper Regulatory Principles.

Prior to providing a warranty on a customer's AI solution, Armilla conducts a holistic evaluation and verification of the following dimensions, which are critical to assuring and, ultimately, guaranteeing product performance.

Safety, Security & Robustness

Robustness testing is critical to ensuring that an AI model can be relied upon to produce accurate and reliable results in a variety of real-world scenarios, and that it is not overly sensitive to small variations or errors in the input data. Some of the approaches we employ to evaluate robustness include:

  • Adversarial attacks: We have evaluated performance under various types of adversarial attacks, including data poisoning attacks and model evasion attacks.
  • Data perturbation: We have tested the model’s robustness to different types of data perturbation, including adding noise, flipping pixels, and changing the intensity of the input data (a minimal sketch of this kind of test follows this list).
  • Hyperparameter tuning: We have evaluated the model’s performance under different hyperparameter settings, including learning rate, batch size, and regularisation strength.
  • Distributional shifts: We have tested the model’s performance under different data distributions, including different classes, domains, and environments.
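
To make the data perturbation testing above concrete, the following is a minimal sketch of a noise-robustness check using scikit-learn on synthetic data. It is illustrative only and does not represent Armilla's proprietary evaluation tooling; the noise levels and model choice are arbitrary placeholders.

```python
# Minimal sketch: measure how a fitted classifier's accuracy degrades as
# Gaussian noise of increasing magnitude is added to the test inputs.
# Illustrative only; not Armilla's proprietary evaluation pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = accuracy_score(y_test, model.predict(X_test))

rng = np.random.default_rng(0)
for noise_level in [0.0, 0.1, 0.5, 1.0]:  # standard deviation of added noise
    X_noisy = X_test + rng.normal(scale=noise_level, size=X_test.shape)
    acc = accuracy_score(y_test, model.predict(X_noisy))
    print(f"noise std {noise_level:.1f}: accuracy {acc:.3f} "
          f"(drop {baseline - acc:.3f})")
```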

Appropriate Transparency & Explainability

We rely on a variety of techniques and approaches to evaluate and increase the transparency and explainability of machine learning models, including feature importance analysis, visualisation tools, and model interpretability algorithms (illustrative sketches of several of these techniques follow the list below):

  • Permutation-based feature importance: This model inspection technique can be used for any fitted estimator when the data is tabular. This is especially useful for non-linear or opaque estimators. The permutation feature importance is defined to be the decrease in a model score when a single feature value is randomly shuffled. It measures the increase in the prediction error of the model after we permute the feature’s values, which breaks the relationship between the feature and the true outcome.

  • PDP and 2D PDP testing: A Partial Dependence Plot (PDP) is a visualisation that shows the impact of a feature on the predicted outcome. It shows the dependence between the target response and a set of input features of interest, marginalising the values of all other input features. It can be used to analyse the interaction between the target response and a set of input features.

  • Model sensitivity: We use a variety of sensitivity analysis approaches (including SOBOL, FAST and RBD FAST). These global sensitivity analyses work within a probabilistic framework, decomposing the variance of the output of a model or system into fractions which can be attributed to inputs or sets of inputs. These methods are often used to identify key parameters that drive model output.

  • Fingerprint comparability: An approach that reverse-engineers the model into an inherently explainable surrogate. Simpler models are reverse-engineered more successfully, and regions where the two models disagree indicate complexity risk, reflecting a lack of understanding of those regions.
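
The first two techniques above have standard open-source implementations in scikit-learn's sklearn.inspection module. The sketch below is a generic illustration of permutation importance and (2D) partial dependence on a synthetic regression task, not a reproduction of Armilla's own tooling.

```python
# Sketch of permutation feature importance and partial dependence using
# scikit-learn's sklearn.inspection utilities (generic, illustrative example).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Permutation importance: drop in the model score when one feature is shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")

# PDP and 2D PDP: marginal effect of feature 0, and of features 0 and 1 jointly
# (plotting requires matplotlib).
PartialDependenceDisplay.from_estimator(model, X_test, features=[0, (0, 1)])
```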
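
For the global sensitivity analyses mentioned above, the open-source SALib package provides Sobol, FAST and RBD-FAST estimators. The sketch below runs a Sobol analysis on a simple stand-in function; the problem definition and the function itself are hypothetical, and the exact sampler module name can vary between SALib versions.

```python
# Sketch of a Sobol global sensitivity analysis with the open-source SALib
# package (generic illustration of variance-based sensitivity analysis).
from SALib.analyze import sobol
from SALib.sample import saltelli

# Hypothetical problem definition: three inputs, each on [0, 1].
problem = {
    "num_vars": 3,
    "names": ["x1", "x2", "x3"],
    "bounds": [[0.0, 1.0]] * 3,
}

param_values = saltelli.sample(problem, 1024)  # Saltelli sampling scheme

# Stand-in for the model under evaluation: any function mapping inputs to a
# scalar output can be analysed in the same way.
Y = param_values[:, 0] + 2.0 * param_values[:, 1] ** 2 + 0.1 * param_values[:, 2]

Si = sobol.analyze(problem, Y)
print("First-order indices:", Si["S1"])  # share of output variance per input
print("Total-order indices:", Si["ST"])  # including interaction effects
```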

We also use overlapping methods to understand the complexity of the model, including:

  • Contextual feature elimination: Feature elimination identifies unimportant features and can improve model performance by weeding out redundant features and features that provide little insight. We utilise recursive feature elimination, which works by successively eliminating the least important features, to determine feature importance (see the sketch after this list). Contextual feature elimination increments features in descending order of feature importance and attempts to fit a surrogate model on the subset of features and the target variable. Estimation bias and variance are computed for each surrogate model and then compared with the performance of the base model. An optimal model is suggested with a subset of the features when the increment in model performance between successive models is no more than 0.05%.

  • Fingerprint complexity analysis: This approach interrogates the reverse-engineered model to determine the number of features and interactions required to create a well-behaved proxy to the original model.
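
Recursive feature elimination, the building block referenced above, is available in scikit-learn as RFE/RFECV. The sketch below is a generic illustration; it does not implement the surrogate-model comparison or the 0.05% stopping rule of our contextual feature elimination.

```python
# Sketch of recursive feature elimination with cross-validation (RFECV) using
# scikit-learn. Generic illustration only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5,
                 scoring="accuracy")
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)
print("Feature ranking (1 = selected):", selector.ranking_)
```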

Fairness

Achieving model fairness can be challenging, particularly in cases where the data used to train the model contains biases or reflects existing inequalities. Fairness is not always a clear-cut or objective concept, and different stakeholders may have different opinions or definitions of what constitutes fairness in a particular context. Therefore, achieving model fairness may require careful consideration and engagement with a variety of stakeholders, including those who may be affected by the model’s predictions or recommendations.

There are a variety of approaches and techniques that we use to assess the fairness of a machine learning model, including counterfactual analysis, fairness constraints, and fairness metrics such as demographic parity, equality of odds, disparate impact, the four-fifths rule and group data bias, drawing on over 40 types of fairness tests.

1) Disparate impact

Disparate impact refers to a situation where a machine learning model has a disproportionate impact on different groups of individuals, based on certain protected characteristics such as race, gender, age, or disability status.

Disparate impact can occur even if the model is not intentionally discriminatory, as it may produce results that have the effect of unfairly disadvantaging certain groups. For example, a model that is trained to predict creditworthiness may inadvertently result in lower credit scores for certain groups, such as racial minorities, which can result in them being denied access to credit or being charged higher interest rates.

Identifying and addressing disparate impact is an important aspect of ensuring model fairness and reducing bias. Various techniques can be used to mitigate disparate impact, such as adjusting the model’s inputs or outputs, modifying the training data, or applying post-processing techniques to adjust the model’s predictions. A thorough evaluation of a machine learning model’s fairness should consider the potential for disparate impact and take steps to address it if necessary.
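
As a simplified illustration of the post-processing idea mentioned above, the sketch below tunes a separate decision threshold per group so that selection rates are brought to a common target. It is a generic, hypothetical example rather than a recommended or complete mitigation, and any real intervention of this kind requires careful legal and contextual review.

```python
# Simplified illustration of post-processing mitigation: choose a per-group
# decision threshold so that each group's selection rate matches a target rate.
# Generic sketch only; not a recommended or complete mitigation strategy.
import numpy as np

def group_thresholds(scores, groups, target_rate):
    """Return a decision threshold per group giving roughly `target_rate`
    positive decisions within each group."""
    thresholds = {}
    for g in np.unique(groups):
        group_scores = scores[groups == g]
        # The (1 - target_rate) quantile of the scores acts as the cut-off.
        thresholds[g] = np.quantile(group_scores, 1.0 - target_rate)
    return thresholds

rng = np.random.default_rng(0)
scores = rng.uniform(size=1000)             # hypothetical model scores
groups = rng.choice(["A", "B"], size=1000)  # hypothetical group labels

thresholds = group_thresholds(scores, groups, target_rate=0.3)
decisions = np.array([scores[i] >= thresholds[groups[i]]
                      for i in range(len(scores))])
for g in ["A", "B"]:
    print(g, "selection rate:", decisions[groups == g].mean().round(3))
```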

2) The Four-fifths Rule

The “four-fifths rule” is a commonly used criterion for evaluating potential adverse impact or disparate impact in employment practices, including those that involve the use of machine learning models. It states that if the selection rate for any particular group (e.g., based on race, gender, age, etc.) is less than four-fifths (80%) of the selection rate for the group with the highest selection rate, there may be evidence of adverse impact or disparate impact against that group.

For example, if the selection rate for men was 60% and the selection rate for women was 40%, then the selection rate for women would be considered to be 67% (40/60) of the selection rate for men. Since 67% is less than 80%, this would suggest that the model may be exhibiting adverse or disparate impact against women.

The four-fifths rule is often used as a guideline for evaluating whether a machine learning model is biased against certain groups. However, it should be noted that it is only one of several possible methods for assessing fairness, and should not be used in isolation. We empower our clients to leverage the vast array of tools and methods at hand to exceed this baseline. A comprehensive evaluation of model fairness should consider multiple factors and metrics, and take into account the specific context and application of the model.
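
The four-fifths check itself is straightforward to compute. The sketch below reproduces the worked example above (a 60% selection rate for men and 40% for women) and flags any group whose impact ratio falls below the 80% threshold; it is a generic illustration only.

```python
# Four-fifths (80%) rule check: flag any group whose selection rate is below
# 80% of the highest group's selection rate. Reproduces the worked example
# above (men 60%, women 40%).
selection_rates = {"men": 0.60, "women": 0.40}

highest = max(selection_rates.values())
for group, rate in selection_rates.items():
    impact_ratio = rate / highest
    flag = "potential adverse impact" if impact_ratio < 0.8 else "passes"
    print(f"{group}: selection rate {rate:.0%}, "
          f"impact ratio {impact_ratio:.0%} -> {flag}")
# Output: women have an impact ratio of 67%, below the 80% threshold.
```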

3) Selection Rate and Impact Ratios

In certain jurisdictions, new rules for AI systems have prescribed requirements for specific selection rates and impact ratios for protected groups and subgroups. This is the case in the context of New York City Council Local Law 144 which requires automated employment decision-making tools to be independently audited for bias and disparate impact with respect to gender, race and ethnicity. The law includes specific selection rates and impact ratios that set a higher threshold than the four-fifths rule.

The selection rate is defined as the ratio of the number of individuals with positive outcomes within a subgroup to the cardinality (size) of that subgroup. The impact ratio is defined as the ratio of the selection rate for a subgroup to the highest selection rate across all subgroups.

Adverse impact is defined as follows: “a selection rate for any race, sex or ethnic group which is less than four-fifths of the rate for the group with the highest rate will generally be regarded by the Federal enforcement agencies as evidence of adverse impact.”
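
Selection rates and impact ratios as defined above can be computed per subgroup directly from decision data. The sketch below uses pandas on a small hypothetical dataset and is illustrative only; it is not a bias audit conforming to Local Law 144.

```python
# Sketch: selection rate and impact ratio per subgroup, computed from
# hypothetical decision data with pandas. Illustrative only.
import pandas as pd

df = pd.DataFrame({
    "sex":      ["F", "F", "M", "M", "F", "M", "F", "M"],
    "selected": [1,    0,   1,   1,   0,   1,   1,   0],
})

# Selection rate: positive outcomes within the subgroup / subgroup size.
selection_rates = df.groupby("sex")["selected"].mean()

# Impact ratio: subgroup selection rate / highest selection rate.
impact_ratios = selection_rates / selection_rates.max()

print(pd.DataFrame({"selection_rate": selection_rates,
                    "impact_ratio": impact_ratios}))
```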

4) Group Data Bias

We measure the distance between subgroups in the distribution of favourable versus unfavourable labels using the following metrics: JS divergence, KL divergence, KS distance, Lp norm and Total Variation Distance.
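
All of these distance metrics are available in, or easily computed with, standard scientific Python libraries. The sketch below compares the favourable/unfavourable label distributions of two hypothetical subgroups; note that scipy's jensenshannon returns the JS distance (the square root of the JS divergence).

```python
# Sketch: distance metrics between favourable/unfavourable label distributions
# of two hypothetical subgroups (generic illustration).
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

# Probability of [favourable, unfavourable] labels in each subgroup.
p = np.array([0.70, 0.30])   # subgroup A
q = np.array([0.55, 0.45])   # subgroup B

js_divergence = jensenshannon(p, q) ** 2   # scipy returns the JS distance
kl_divergence = entropy(p, q)              # KL(p || q)
ks_distance = np.max(np.abs(np.cumsum(p) - np.cumsum(q)))
lp_norm = np.linalg.norm(p - q, ord=2)     # L2 norm as an example Lp norm
total_variation = 0.5 * np.sum(np.abs(p - q))

print(js_divergence, kl_divergence, ks_distance, lp_norm, total_variation)
```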

Accountability & Governance

Some examples of governance practices that we follow to support Responsible AI include stakeholder engagement, independent assessment frameworks, and oversight and accountability mechanisms.

  • Stakeholder engagement helps us ensure that the perspectives and concerns of diverse stakeholders, including affected communities, are taken into account during the assessment process.
  • Independent assessment frameworks. We leverage leading independent assessment frameworks for AI, such as the Responsible AI Institute’s certification framework, to help us evaluate the potential implications of an organisation’s AI systems, and make informed recommendations about their development and deployment.
  • Oversight and accountability mechanisms, such as audits and impact assessments, can help organisations ensure that their AI systems are transparent and accountable and that they are being used in a way that aligns with ethical principles and respects human rights.

Why we took this approach

As AI capabilities continue to accelerate, so do the risks for organisations. Our third-party verification and guarantee for AI products provide organisations with the confidence they need to unlock the potential of AI while mitigating risk.

Benefits to the organisation using the technique

Our AI warranty, which is an insurance-backed performance guarantee on the quality and reliability of the AI solution, has proven to be a powerful quality assurance and risk mitigation tool for both vendors of AI solutions and enterprises that procure them.

For vendors of AI-powered products:

  • Increase AI product value by guaranteeing trustworthiness to customers
  • Reduce sales cycles, drive adoption and market share
  • Lock in revenue and secure long-term customer relationships

For enterprises procuring third-party AI products:

  • Protect against the risk of AI underperformance and drift
  • Mitigate risk of downstream AI-related damages and liability
  • Protect ROI for your AI investments

Limitations of the approach

Based on the results of the assessment, we provide expected limitations of the model which are reflected in the guarantee (i.e. we limit when the guarantee would apply based on the results of the assessment).

Further AI Assurance Information

Published 19 September 2023