Standards and Testing Agency: Colossyan AI for voiceovers and videos
A voiceover AI that narrates scripts for the Standards and Testing Agency helpline and training videos.
Tier 1 Information
1 - Name
Colossyan AI for voiceovers and videos.
2 - Description
This is an AI voiceover tool that STA uses to support the department's interactive voice recordings for the helpline, aiming to reduce staff workload.
User interaction with Colossyan takes place when callers contact the helpline and hear the options.
Staff in the department can prepare a script for their interactive voice recordings and have the AI narrate the script and provide an MP3 file, which can be used for the helpline.
3 - Website URL
4 - Contact email
Tier 2 - Owner and Responsibility
1.1 - Organisation or department
Standards and Testing Agency
1.2 - Team
Teacher Assessment and Moderation Team
1.3 - Senior responsible owner
Deputy Director, Assessment Operations and Services
1.4 - External supplier involvement
No
Tier 2 - Description and Rationale
2.1 - Detailed description
Colossyan AI offers a number of voices, varying by gender, tone, accent and other characteristics. Someone creating a voiceover can select a single voice or multiple voices, depending on the type of message they want to deliver to their audience. The programme also has video functionality, which can be used in many different ways. It includes pre-built templates covering a range of presentation settings, such as sales, marketing, teaching and project management. It also provides avatars which can be used to replicate a real human delivering a presentation; as with the voiceover, these vary by gender, tone, accent and other characteristics. There is also the option to import a document, such as a PowerPoint or PDF, and convert it into a template for a video. This is especially useful for recurring PowerPoint presentations that take time and effort to deliver. The programme also has a branding kit, which is especially useful for logos, as well as a phonetic voice setting which ensures that words the AI voice needs to narrate phonetically can be stored as a record for future use.
2.2 - Scope
The tool is designed to generate voiceover recordings, training videos and induction videos, along with a range of other resources for users. Call volumes fluctuate by month, ranging from 251 to 9,465 calls. All callers to the helpline hear the recorded voiceover and then choose the relevant option. Although we have primarily used it for voiceovers so far, there is scope for it to support video creation to complement some training materials made available by the department for schools and local authorities.
2.3 - Benefit
The tool is designed to save users time when preparing interactive voice recordings. The specific benefits are the speed at which Colossyan can narrate a full script in one consistent voice, and the ability to amend words phonetically for the AI voice. This would normally take a significant amount of time for someone to record using their own voice, ensuring there is no background noise, errors or retakes. The tool also ensures that edits can be done swiftly without the need for the same person to be available, saving significant amounts of time and resource.
2.4 - Previous process
A member of staff would need to record every call option with their own voice. This required a reliable microphone, good acoustics and time. If any amendments were needed, the same member of staff had to be available, often at very short notice.
2.5 - Alternatives considered
STA searched the market for a number of AI options, but Colossyan provided the best voiceover options to support this work.
Tier 2 - Decision making Process
3.1 - Process integration
This tool is a support mechanism designed to make tasks such as voice recordings and video creation easier. Using the text-to-speech function allows for quicker updates to the interactive voice recording and a smoother journey for schools, as all voice recordings now use the same Colossyan voice and tone. From a helpline perspective it is used as a voice recording tool only; it has no ability to direct callers, which happens only when the caller chooses an option on the Interactive Voice Response (IVR) system. The tool removes the manual effort of a human recording the voiceover and allows quicker updates to be made, for example relaying a message on the IVR via the Colossyan voice recording when the helpline is busy.
3.2 - Provided information
Colossyan does have the functionality to create a video based on a PowerPoint or PDF. By uploading either document, the AI functionality can offer suggested video ideas. The recommendations provided by the AI tool only concern potential video ideas or templates and do not influence any decisions that would have a significant material impact on, or otherwise affect, the public.
3.3 - Frequency and scale of usage
For voice recordings, use has been more frequent due to the tool's simplicity. Changes to the voice recordings happen regularly to reflect any changes to the options available to callers and to highlight any key themes. Call volumes fluctuate by month, ranging from 251 to 9,465 calls. For video creation, use has been more sporadic and on an ad hoc basis.
3.4 - Human decisions and review
The tool ensures that humans remain involved at every stage of the creation process. Although it can generate suggested ideas, it does not act on them automatically. Instead, it takes input from the user to create voiceovers or videos before generating a final version. This final version is reviewed and approved by a member of STA staff.
3.5 - Required training
The product is very user friendly and requires minimal training. Training videos are available on the Colossyan YouTube channel, and users have access to a series of webinar demonstrations by the company.
3.6 - Appeals and review
N/A
Tier 2 - Tool Specification
4.1.1 - System architecture
Colossyan is built on a microservice architecture and operates on AWS, utilising virtual servers on AWS EC2. The system employs DynamoDB and Postgres as its databases. The web application is developed using NextJS, and it is accessible through web browsers.
4.1.2 - Phase
Production
4.1.3 - Maintenance
The tool is in active development and is continually maintained and reviewed. This process includes reviewing user feedback, making technical improvements, fixing bugs, and adding new features. User feedback is captured through a live chat functionality, which is used to help iterate and improve the tool.
4.1.4 - Models
Colossyan employs a combination of proprietary AI models and open-source technologies to power its text-to-video platform.
While specific details about the underlying models are not publicly disclosed, the platform utilises:
Text-to-Speech (TTS) Engine: converts input scripts into spoken audio, supporting over 70 languages.
AI Avatars: generate virtual human actors that deliver the scripted content with realistic lip-syncing and expressions.
GPT-4 Integration: assists in script generation and optimisation through natural language processing and uploaded document processing.
Translation services: help translate text into different languages.
AI image generation (DALL-E): optionally generates embedded media files.
Tier 2 - Model Specification: Text-to-Speech (TTS) Engine (1)
4.2.1 - Model name
Text-to-Speech (TTS) Engine
4.2.2 - Model version
Latest Software as a Service (SaaS) version
4.2.3 - Model task
Converts input scripts into spoken audio, supporting over 70 languages.
4.2.4 - Model input
Text based scripts
4.2.5 - Model output
Spoken audio generated from the input script (for example, an MP3 file that can be used for the helpline).
4.2.6 - Model architecture
Text-to-Speech (TTS) model architecture:
1. Acoustic model (text-to-spectrogram), for example Tacotron 2, a sequence-to-sequence model with attention:
- Encoder: converts text (characters or phonemes) into a sequence of embeddings.
- Attention mechanism: aligns each input text token with output audio frames to handle timing and prosody.
- Decoder: generates mel-spectrogram frames step by step, predicting acoustic features over time.
Key features: learns pronunciation, rhythm and intonation; handles variable-length sequences of text and speech; produces mel-spectrograms that represent speech sounds visually.
2. Vocoder (spectrogram-to-waveform): converts the predicted mel-spectrogram into raw audio waveforms.
Common vocoder models: WaveNet, an autoregressive model generating one audio sample at a time, known for very high-quality audio; and HiFi-GAN, a GAN-based vocoder offering high fidelity with real-time synthesis.
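For illustration only, the minimal Python/PyTorch sketch below mirrors the acoustic-model flow described above: text tokens are embedded, encoded, aligned with output frames by attention, and decoded step by step into mel-spectrogram frames that a vocoder (such as WaveNet or HiFi-GAN) would then render as audio. The class name, layer sizes and decoding loop are simplified assumptions for explanation and do not represent Colossyan's actual implementation.

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Toy Tacotron-2-style acoustic model: text tokens -> mel-spectrogram frames (illustrative only)."""

    def __init__(self, vocab_size=256, emb_dim=128, mel_dim=80):
        super().__init__()
        self.mel_dim = mel_dim
        self.embed = nn.Embedding(vocab_size, emb_dim)          # character/phoneme embeddings
        self.encoder = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.attention = nn.MultiheadAttention(emb_dim, num_heads=4, batch_first=True)
        self.decoder_cell = nn.GRUCell(emb_dim + mel_dim, emb_dim)
        self.mel_proj = nn.Linear(emb_dim, mel_dim)             # predicts one mel frame per step

    def forward(self, token_ids, n_frames=100):
        enc_out, _ = self.encoder(self.embed(token_ids))        # (batch, text_len, emb_dim)
        hidden = enc_out.new_zeros(token_ids.size(0), enc_out.size(-1))
        prev_mel = enc_out.new_zeros(token_ids.size(0), self.mel_dim)
        frames = []
        for _ in range(n_frames):                               # autoregressive, frame-by-frame decoding
            query = hidden.unsqueeze(1)
            # The attention step aligns the current decoder state with the encoded text tokens
            context, _ = self.attention(query, enc_out, enc_out)
            hidden = self.decoder_cell(torch.cat([context.squeeze(1), prev_mel], dim=-1), hidden)
            prev_mel = self.mel_proj(hidden)
            frames.append(prev_mel)
        return torch.stack(frames, dim=1)                       # (batch, n_frames, mel_dim)

if __name__ == "__main__":
    model = TinyAcousticModel()
    script_tokens = torch.randint(0, 256, (1, 40))              # a placeholder 40-character script
    mel = model(script_tokens, n_frames=120)
    print(mel.shape)  # torch.Size([1, 120, 80]); a vocoder would turn these frames into audio
```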
4.2.7 - Model performance
LLM (AI assistant): OpenAI GPT-4
Translation: DeepL, Microsoft
Voices: ElevenLabs, WellSaid
Image generation: OpenAI DALL-E
Lip-syncing technology: in-house development
From an STA perspective, videos that are made go through a qualitative review process in which the Teacher Assessment and Moderation team reviews them before a further review by the STA Communications team. The review looks at both the narration and the contents of the video.
4.2.8 - Datasets
For the proprietary technology (lip-syncing technology), the data that is collected is entirely private. They have their own studio into which they invite actors, whom they record with their agreement.
The other AI solutions are third-parties integrated into Colossyan (Elevenlabs, Wellsaid, DeepL, Microsoft Azure, OpenAI).
4.2.9 - Dataset purposes
These datasets help models learn pronunciation, intonation and timing across different voices and languages, which are critical for Colossyan's multilingual avatar voiceovers.
Tier 2 - Model Specification: AI Avatar Model (2)
4.2.1 - Model name
AI Avatar Model
4.2.2 - Model version
Latest SaaS version
4.2.3 - Model task
To generate realistic virtual humans with synchronised lip movements and facial expressions
4.2.4 - Model input
Text or Audio Input: The script or speech audio that the avatar will speak.
4.2.5 - Model output
Modified video frames where lip movements are synchronised with audio.
4.2.6 - Model architecture
AI avatar model architecture overview:
1. Speech feature extraction: extracts audio features such as mel-spectrograms, MFCCs or phonemes from the input speech/audio. These features drive lip-sync and expression generation.
2. Lip-sync module: neural networks such as Wav2Lip (or similar models) take audio features and an image/video frame as inputs, using convolutional neural networks (CNNs) combined with recurrent layers or transformers to generate realistic mouth movements matching the speech.
3. Facial expression and emotion module: models such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) generate or modify facial expressions to make the avatar look natural and expressive. These models learn a latent space representing facial features and emotions; conditioning on emotion vectors or other signals helps produce expressions (smiling, blinking, eyebrow movement).
4. 3D modelling / neural rendering: advanced avatars use 3D morphable models (3DMM) or neural rendering techniques to convert 2D images into 3D face representations and use differentiable rendering to generate photorealistic video frames. This adds depth, lighting and realistic movement beyond 2D manipulation.
5. Video synthesis and output: combines the lip-sync frames and expression modifications into a seamless video. Post-processing might include smoothing transitions and improving resolution.
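For illustration only, the minimal Python sketch below lays out the pipeline stages described above as placeholder functions (feature extraction, lip-sync, expression modification, rendering). All function names and bodies are stand-ins invented for explanation; the real components (for example a Wav2Lip-style lip-sync network or GAN/VAE expression models) are far more complex and are not publicly available.

```python
import numpy as np

def extract_audio_features(waveform, n_frames, n_mels=80):
    # Stand-in for mel-spectrogram / MFCC extraction: one feature vector per video frame.
    chunks = np.array_split(waveform, n_frames)
    return np.stack([np.resize(np.abs(np.fft.rfft(chunk))[:n_mels], n_mels) for chunk in chunks])

def lip_sync(reference_face, audio_features):
    # Stand-in for a Wav2Lip-style module: one mouth-adjusted frame per audio feature vector.
    return np.stack([reference_face.copy() for _ in audio_features])

def apply_expressions(frames, emotion_vector):
    # Stand-in for a GAN/VAE expression module conditioned on an emotion vector
    # (smiling, blinking, eyebrow movement); returns frames unchanged in this sketch.
    return frames

def render_video(frames):
    # Stand-in for neural rendering and post-processing (smoothing transitions, upscaling).
    return frames.astype(np.uint8)

if __name__ == "__main__":
    waveform = np.random.randn(16000 * 5)                        # 5 seconds of placeholder audio at 16 kHz
    reference_face = np.zeros((256, 256, 3))                     # a single reference image of the presenter
    features = extract_audio_features(waveform, n_frames=125)    # 25 fps x 5 s of video frames
    frames = lip_sync(reference_face, features)
    frames = apply_expressions(frames, emotion_vector=np.zeros(8))
    video = render_video(frames)
    print(video.shape)                                           # (125, 256, 256, 3)
```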
4.2.7 - Model performance
Traditional LLM based metrics (bias and accuracy) are not applicable in the space of AI avatars. Colossyan do measure their output quality, but these are confidential and are only used internally to benchmark new release candidates against previous releases.
The avatar functionality has not yet been used by STA.
4.2.8 - Datasets
For the proprietary technology (lip-syncing technology), the data that is collected is entirely private. They have their own studio into which they invite actors, whom they record with their agreement.
The other AI solutions are third-parties integrated into Colossyan (Elevenlabs, Wellsaid, DeepL, Microsoft Azure, OpenAI).
4.2.9 - Dataset purposes
Colossyan internally collected video footage to compute the quality metrics of the AI models. The training data is collected in their recording studio in Budapest. They use those recordings to train their generative AI models. All customer data is kept private and is not used for training any AI model except the one used to train the custom avatar for that customer.
Tier 2 - Model Specification: Generative Pretrained Transformer (3)
4.2.1 - Model name
Generative Pre-trained Transformer
4.2.2 - Model version
OpenAI GPT 3.0
4.2.3 - Model task
Understands the semantic, text-based request submitted by the Colossyan user.
4.2.4 - Model input
A text-based request for voiceover or video generation.
4.2.5 - Model output
Output scripts that can be sent to the text-to-speech model.
4.2.6 - Model architecture
GPT-3 is based on the Transformer architecture, characterised by:
Multi-head self-attention layers that model relationships between words.
Feed-forward neural networks between attention layers.
Layer normalisation and residual connections for training stability.
Decoder-only model trained to predict the next word in a sequence.
Input tokens → Embedding layer → Stacked Transformer decoder layers → Output probabilities over vocabulary → Text generation
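For illustration only, the minimal Python/PyTorch sketch below follows the decoder-only flow listed above: token and positional embeddings, stacked blocks of masked multi-head self-attention and feed-forward layers with residual connections and layer normalisation, and a final projection to scores over the vocabulary. The class names and sizes are tiny, arbitrary assumptions and do not represent OpenAI's implementation.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask so each position attends only to earlier tokens (next-word prediction).
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
        a, _ = self.attn(self.ln1(x), self.ln1(x), self.ln1(x), attn_mask=mask)
        x = x + a                        # residual connection around self-attention
        x = x + self.ff(self.ln2(x))     # residual connection around feed-forward network
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, n_layers=2, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)       # token embedding layer
        self.pos = nn.Embedding(max_len, d_model)          # learned positional embeddings
        self.blocks = nn.Sequential(*[DecoderBlock(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size)         # scores over the vocabulary

    def forward(self, ids):
        x = self.tok(ids) + self.pos(torch.arange(ids.size(1), device=ids.device))
        return self.head(self.blocks(x))

if __name__ == "__main__":
    logits = TinyGPT()(torch.randint(0, 1000, (1, 16)))
    print(logits.shape)  # torch.Size([1, 16, 1000]) — next-token scores at each position
```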
4.2.7 - Model performance
STA tested the performance of the tool using an indicative metric: STA can generate 1 minute of video within 3 minutes.
More information about the technology can be found here - https://www.colossyan.com/posts/tech-stack-behind-ai-video-generation
4.2.8 - Datasets
Large language models have been trained on numerous datasets. Examples include: OpenWebText, web-scraped content similar to GPT's training data; BooksCorpus, a dataset of thousands of novels used for long-form text generation; and Wikipedia + CC-News, public factual data for high-quality script writing.
4.2.9 - Dataset purposes
To train these large language models.
Tier 2 - Risks, Mitigations and Impact Assessments
5.1 - Impact assessment
N/A
5.2 - Risks and mitigations
We ensure no sensitive information is used in this programme. All the data inputted into the programme is information that is accessible in the public domain.
Model limitations and quality risk: some avatars may appear unnatural or "uncanny", especially in edge cases or non-standard scripts. Lip-sync or pronunciation may fail for:
- technical jargon
- accented speech
Mitigation: human review and testing before use at scale.