Standards and Testing Agency: Colossyan AI for voiceovers and videos
A voiceover AI that narrates scripts for the Standards and Testing Agency helpline and training videos.
Tier 1 Information
1 - Name
Colossyan AI for voiceovers and videos.
2 - Description
This is an AI voiceover tool that STA uses to support the department's interactive voice recordings for the helpline, aiming to reduce staff workload.
User interaction with Colossyan takes place when callers contact the helpline and hear the options.
Staff in the department can prepare a script for their interactive voice recordings and have the AI narrate the script and provide an MP3 file, which can be used for the helpline.
3 - Website URL
4 - Contact email
Tier 2 - Owner and Responsibility
1.1 - Organisation or department
Standards and Testing Agency
1.2 - Team
Teacher Assessment and Moderation Team
1.3 - Senior responsible owner
Deputy Director, Assessment Operations and Services
1.4 - External supplier involvement
No
Tier 2 - Description and Rationale
2.1 - Detailed description
Colossyan AI offers a number of voices, varying by gender, tone, accent and other characteristics. Someone creating a voiceover can select a single voice or multiple voices, depending on the type of message they want to deliver to their audience. The programme also has video functionality, which can be used in many different ways. It includes pre-built templates covering a range of presentation settings, such as sales, marketing, teaching and project management. It also provides avatars which can be used to replicate a real human delivering a presentation; as with the voiceover, these vary by gender, tone, accent and other characteristics. There is also the option to import a document, such as a PowerPoint or PDF, and convert it into a template for a video. This is especially useful for recurring PowerPoint presentations that take time and effort to deliver. The programme also has a branding kit, which is especially useful for logos, as well as a phonetic voice setting which ensures that words the AI voice needs to narrate phonetically can be stored as a record for future use.
2.2 - Scope
The tool is designed to generate voiceover recordings, training videos and induction videos, along with a range of other resources for users. Call volumes fluctuate by month, ranging from 251 to 9,465 calls. All callers to the helpline hear the recorded voiceover and then choose the relevant option. Although we have primarily used it for voiceovers so far, there is scope for it to support video creation to complement some training materials made available by the department for schools and local authorities.
2.3 - Benefit
The tool is designed to save users time when preparing interactive voice recordings. The specific benefits are the speed at which Colossyan can narrate a full script in one consistent voice, and the ability to amend words phonetically for the AI voice. This would normally take a significant amount of time for someone to record using their own voice, ensuring there is no background noise, errors or retakes. The tool also ensures that edits can be done swiftly without the need for the same person to be available, saving significant amounts of time and resource.
2.4 - Previous process
A member of staff would need to record every call option with their own voice. This required a reliable microphone, good acoustics and time. If any amendments were needed, the same member of staff had to be available, often at very short notice.
2.5 - Alternatives considered
STA searched the market for a number of AI options, but Colossyan provided the best voiceover options to support this work.
Tier 2 - Decision making Process
3.1 - Process integration
This tool is a support mechanism designed to make tasks such as voice recordings and video creation easier. Using the text-to-speech function allows for quicker updates to the interactive voice recording and a smoother journey for schools, as all voice recordings now use the same Colossyan voice and tone. From a helpline perspective it is used as a voice recording tool only; it has no ability to direct callers, which happens only when the caller chooses an option on the Interactive Voice Response (IVR) system. The tool removes the manual effort of a human recording the voiceover and allows quicker updates to be made, for example relaying a message on the IVR via the Colossyan voice recording when the helpline is busy.
3.2 - Provided information
Colossyan does have the functionality to create a video based on a PowerPoint or PDF. By uploading either document, the AI functionality can offer suggested video ideas. The recommendations provided by the AI tool only concern potential video ideas or templates and do not influence any decisions that would have a significant material impact on, or otherwise affect, the public.
3.3 - Frequency and scale of usage
For voice recordings, use has been more frequent due to the tool's simplicity. Changes to the voice recordings happen regularly to reflect any changes to the options available to callers and to highlight any key themes. Call volumes fluctuate by month, ranging from 251 to 9,465 calls. For video creation, use has been more sporadic and on an ad hoc basis.
3.4 - Human decisions and review
The tool ensures that humans remain involved at every stage of the creation process. Although it can generate suggested ideas, it does not act on them automatically. Instead, it takes input from the user to create voiceovers or videos before generating a final version. This final version is reviewed and approved by a member of STA staff.
3.5 - Required training
The product is very user friendly and requires minimal training. Training videos are available on the Colossyan YouTube channel, and users have access to a series of webinar demonstrations by the company.
3.6 - Appeals and review
N/A
Tier 2 - Tool Specification
4.1.1 - System architecture
Colossyan is built on a microservice architecture and operates on AWS, utilising virtual servers on AWS EC2. The system employs DynamoDB and Postgres as its databases. The web application is developed using NextJS, and it is accessible through web browsers.
4.1.2 - Phase
Production
4.1.3 - Maintenance
The tool is in active development and is continually maintained and reviewed. This process includes reviewing user feedback, making technical improvements, fixing bugs, and adding new features. User feedback is captured through a live chat functionality, which is used to help iterate and improve the tool.
4.1.4 - Models
Colossyan employs a combination of proprietary AI models and open-source technologies to power its text-to-video platform.
While specific details about the underlying models are not publicly disclosed, the platform utilises:
Text-to-Speech (TTS) Engine: converts input scripts into spoken audio, supporting over 70 languages.
AI Avatars: generate virtual human actors that deliver the scripted content with realistic lip-syncing and expressions.
GPT-4 Integration: assists in script generation and optimisation through natural language processing and uploaded document processing.
Translation services: help translate text into different languages.
AI image generation (DALL-E): optionally generates embedded media files.
Tier 2 - Model Specification: Text-to-Speech (TTS) Engine (1)
4.2.1 - Model name
Text-to-Speech (TTS) Engine
4.2.2 - Model version
Latest Software as a Service (SaaS) version
4.2.3 - Model task
Converts input scripts into spoken audio, supporting over 70 languages.
4.2.4 - Model input
Text based scripts
4.2.5 - Model output
Spoken audio generated from the input script (for example, an MP3 file that can be used for the helpline).
4.2.6 - Model architecture
Text-to-Speech (TTS) model architecture:
1. Acoustic model (text-to-spectrogram), for example Tacotron 2, a sequence-to-sequence model with attention:
- Encoder: converts text (characters or phonemes) into a sequence of embeddings.
- Attention mechanism: aligns each input text token with output audio frames to handle timing and prosody.
- Decoder: generates mel-spectrogram frames step by step, predicting acoustic features over time.
Key features: learns pronunciation, rhythm and intonation; handles variable-length sequences of text and speech; produces mel-spectrograms that represent speech sounds visually.
2. Vocoder (spectrogram-to-waveform): converts the predicted mel-spectrogram into raw audio waveforms.
Common vocoder models: WaveNet, an autoregressive model generating one audio sample at a time, known for very high-quality audio; and HiFi-GAN, a GAN-based vocoder offering high fidelity with real-time synthesis.
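For illustration only, the minimal Python/PyTorch sketch below mirrors the acoustic-model flow described above: text tokens are embedded, encoded, aligned with output frames by attention, and decoded step by step into mel-spectrogram frames that a vocoder (such as WaveNet or HiFi-GAN) would then render as audio. The class name, layer sizes and decoding loop are simplified assumptions for explanation and do not represent Colossyan's actual implementation.

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Toy Tacotron-2-style acoustic model: text tokens -> mel-spectrogram frames (illustrative only)."""

    def __init__(self, vocab_size=256, emb_dim=128, mel_dim=80):
        super().__init__()
        self.mel_dim = mel_dim
        self.embed = nn.Embedding(vocab_size, emb_dim)          # character/phoneme embeddings
        self.encoder = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.attention = nn.MultiheadAttention(emb_dim, num_heads=4, batch_first=True)
        self.decoder_cell = nn.GRUCell(emb_dim + mel_dim, emb_dim)
        self.mel_proj = nn.Linear(emb_dim, mel_dim)             # predicts one mel frame per step

    def forward(self, token_ids, n_frames=100):
        enc_out, _ = self.encoder(self.embed(token_ids))        # (batch, text_len, emb_dim)
        hidden = enc_out.new_zeros(token_ids.size(0), enc_out.size(-1))
        prev_mel = enc_out.new_zeros(token_ids.size(0), self.mel_dim)
        frames = []
        for _ in range(n_frames):                               # autoregressive, frame-by-frame decoding
            query = hidden.unsqueeze(1)
            # The attention step aligns the current decoder state with the encoded text tokens
            context, _ = self.attention(query, enc_out, enc_out)
            hidden = self.decoder_cell(torch.cat([context.squeeze(1), prev_mel], dim=-1), hidden)
            prev_mel = self.mel_proj(hidden)
            frames.append(prev_mel)
        return torch.stack(frames, dim=1)                       # (batch, n_frames, mel_dim)

if __name__ == "__main__":
    model = TinyAcousticModel()
    script_tokens = torch.randint(0, 256, (1, 40))              # a placeholder 40-character script
    mel = model(script_tokens, n_frames=120)
    print(mel.shape)  # torch.Size([1, 120, 80]); a vocoder would turn these frames into audio
```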
4.2.7 - Model performance
LLM (AI assistant): OpenAI GPT-4
Translation: DeepL, Microsoft
Voices: ElevenLabs, WellSaid
Image generation: OpenAI DALL-E
Lip-syncing technology: in-house development
From an STA perspective, videos that are made go through a qualitative review process in which the Teacher Assessment and Moderation team reviews them before a further review by the STA Communications team. The review looks at both the narration and the contents of the video.
4.2.8 - Datasets
For the proprietary technology (lip-syncing technology), the data that is collected is entirely private. They have their own studio into which they invite actors, whom they record with their agreement.
The other AI solutions are third-parties integrated into Colossyan (Elevenlabs, Wellsaid, DeepL, Microsoft Azure, OpenAI).
4.2.9 - Dataset purposes
These datasets help models learn pronunciation, intonation and timing across different voices and languages, which are critical for Colossyan's multilingual avatar voiceovers.
Tier 2 - Model Specification: AI Avatar Model (2)
4.2.1 - Model name
AI Avatar Model
4.2.2 - Model version
Latest SaaS version
4.2.3 - Model task
To generate realistic virtual humans with synchronised lip movements and facial expressions
4.2.4 - Model input
Text or Audio Input: The script or speech audio that the avatar will speak.
4.2.5 - Model output
Modified video frames where lip movements are synchronised with audio.
4.2.6 - Model architecture
AI avatar model architecture overview:
1. Speech feature extraction: extracts audio features such as mel-spectrograms, MFCCs or phonemes from the input speech/audio. These features drive lip-sync and expression generation.
2. Lip-sync module: neural networks such as Wav2Lip (or similar models) take audio features and an image/video frame as inputs, using convolutional neural networks (CNNs) combined with recurrent layers or transformers to generate realistic mouth movements matching the speech.
3. Facial expression and emotion module: models such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) generate or modify facial expressions to make the avatar look natural and expressive. These models learn a latent space representing facial features and emotions; conditioning on emotion vectors or other signals helps produce expressions (smiling, blinking, eyebrow movement).
4. 3D modelling / neural rendering: advanced avatars use 3D morphable models (3DMM) or neural rendering techniques to convert 2D images into 3D face representations and use differentiable rendering to generate photorealistic video frames. This adds depth, lighting and realistic movement beyond 2D manipulation.
5. Video synthesis and output: combines the lip-sync frames and expression modifications into a seamless video. Post-processing might include smoothing transitions and improving resolution.
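For illustration only, the minimal Python sketch below lays out the pipeline stages described above as placeholder functions (feature extraction, lip-sync, expression modification, rendering). All function names and bodies are stand-ins invented for explanation; the real components (for example a Wav2Lip-style lip-sync network or GAN/VAE expression models) are far more complex and are not publicly available.

```python
import numpy as np

def extract_audio_features(waveform, n_frames, n_mels=80):
    # Stand-in for mel-spectrogram / MFCC extraction: one feature vector per video frame.
    chunks = np.array_split(waveform, n_frames)
    return np.stack([np.resize(np.abs(np.fft.rfft(chunk))[:n_mels], n_mels) for chunk in chunks])

def lip_sync(reference_face, audio_features):
    # Stand-in for a Wav2Lip-style module: one mouth-adjusted frame per audio feature vector.
    return np.stack([reference_face.copy() for _ in audio_features])

def apply_expressions(frames, emotion_vector):
    # Stand-in for a GAN/VAE expression module conditioned on an emotion vector
    # (smiling, blinking, eyebrow movement); returns frames unchanged in this sketch.
    return frames

def render_video(frames):
    # Stand-in for neural rendering and post-processing (smoothing transitions, upscaling).
    return frames.astype(np.uint8)

if __name__ == "__main__":
    waveform = np.random.randn(16000 * 5)                        # 5 seconds of placeholder audio at 16 kHz
    reference_face = np.zeros((256, 256, 3))                     # a single reference image of the presenter
    features = extract_audio_features(waveform, n_frames=125)    # 25 fps x 5 s of video frames
    frames = lip_sync(reference_face, features)
    frames = apply_expressions(frames, emotion_vector=np.zeros(8))
    video = render_video(frames)
    print(video.shape)                                           # (125, 256, 256, 3)
```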
4.2.7 - Model performance
Traditional LLM based metrics (bias and accuracy) are not applicable in the space of AI avatars. Colossyan do measure their output quality, but these are confidential and are only used internally to benchmark new release candidates against previous releases.
The avatar functionality has not yet been used by STA.
4.2.8 - Datasets
For the proprietary technology (lip-syncing technology), the data that is collected is entirely private. They have their own studio into which they invite actors, whom they record with their agreement.
The other AI solutions are third-parties integrated into Colossyan (Elevenlabs, Wellsaid, DeepL, Microsoft Azure, OpenAI).
4.2.9 - Dataset purposes
Colossyan internally collected video footage to compute the quality metrics of the AI models. The training data is collected in their recording studio in Budapest. They use those recordings to train their generative AI models. All customer data is kept private and is not used for training any AI model except the one used to train the custom avatar for that customer.
Tier 2 - Model Specification: Generative Pretrained Transformer (3)
4.2.1 - Model name
Generative Pre-trained Transformer
4.2.2 - Model version
OpenAI GPT 3.0
4.2.3 - Model task
Understands the semantic, text-based request submitted by the Colossyan user.
4.2.4 - Model input
A text-based request for voiceover or video generation.
4.2.5 - Model output
Output scripts that can be sent to the text-to-speech model.
4.2.6 - Model architecture
GPT-3 is based on the Transformer architecture, characterised by:
Multi-head self-attention layers that model relationships between words.
Feed-forward neural networks between attention layers.
Layer normalisation and residual connections for training stability.
Decoder-only model trained to predict the next word in a sequence.
Input tokens → Embedding layer → Stacked Transformer decoder layers → Output probabilities over vocabulary → Text generation
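For illustration only, the minimal Python/PyTorch sketch below follows the decoder-only flow listed above: token and positional embeddings, stacked blocks of masked multi-head self-attention and feed-forward layers with residual connections and layer normalisation, and a final projection to scores over the vocabulary. The class names and sizes are tiny, arbitrary assumptions and do not represent OpenAI's implementation.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask so each position attends only to earlier tokens (next-word prediction).
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
        a, _ = self.attn(self.ln1(x), self.ln1(x), self.ln1(x), attn_mask=mask)
        x = x + a                        # residual connection around self-attention
        x = x + self.ff(self.ln2(x))     # residual connection around feed-forward network
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, n_layers=2, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)       # token embedding layer
        self.pos = nn.Embedding(max_len, d_model)          # learned positional embeddings
        self.blocks = nn.Sequential(*[DecoderBlock(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size)         # scores over the vocabulary

    def forward(self, ids):
        x = self.tok(ids) + self.pos(torch.arange(ids.size(1), device=ids.device))
        return self.head(self.blocks(x))

if __name__ == "__main__":
    logits = TinyGPT()(torch.randint(0, 1000, (1, 16)))
    print(logits.shape)  # torch.Size([1, 16, 1000]) — next-token scores at each position
```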
4.2.7 - Model performance
STA tested the performance of the tool using an indicative metric: STA can generate 1 minute of video within 3 minutes.
More information about the technology can be found here - https://www.colossyan.com/posts/tech-stack-behind-ai-video-generation
4.2.8 - Datasets
Large language models have been trained on numerous datasets. Examples include: OpenWebText, web-scraped content similar to GPT's training data; BooksCorpus, a dataset of thousands of novels used for long-form text generation; and Wikipedia + CC-News, public factual data for high-quality script writing.
4.2.9 - Dataset purposes
To train these large language models.
Tier 2 - Risks, Mitigations and Impact Assessments
5.1 - Impact assessment
N/A
5.2 - Risks and mitigations
We ensure no sensitive information is used in this programme. All the data inputted into the programme is information that is accessible in the public domain.
Model limitations and quality risk: some avatars may appear unnatural or "uncanny", especially in edge cases or non-standard scripts. Lip-sync or pronunciation may fail for:
- technical jargon
- accented speech
Mitigation: human review and testing before use at scale.