© Crown copyright 2018
This publication is licensed under the terms of the Open Government Licence v3.0 except where otherwise stated. To view this licence, visit nationalarchives.gov.uk/doc/open-government-licence/version/3 or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: firstname.lastname@example.org.
Where we have identified any third party copyright information you will need to obtain permission from the copyright holders concerned.
This publication is available at https://www.gov.uk/government/publications/quality-assurance-of-administrative-data-in-the-uk-house-price-index/acorn-consumer-classification-caci
This is the Quality assurance of administrative data (QAAD) report for the Acorn consumer classification data from CACI, which is used in the production of the UK House Price Index (UK HPI) and Northern Ireland House Price Index (NI HPI).
The UK House Price Index (UK HPI) measures the change in the price paid to purchase residential property in the United Kingdom. A number of different administrative datasets are used in the production of the monthly UK HPI using a technique known as hedonic regression. In simple terms, hedonic regression is a technique which accounts for the changing quality of property transacted each period to isolate only pure price change, so that the change in price is not distorted by differences in the composition of property sold (for example, you cannot directly compare the price of a one bedroom property sold in one period with a three bedroom property sold in another).
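The idea of separating pure price change from changes in the mix of property sold can be illustrated with a toy example. The sketch below is not the UK HPI model: it assumes a deliberately simple specification (log price explained by number of bedrooms and a period dummy), invented sales data and a hand-written least squares solver, all of which are hypothetical.

```python
import math

def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)

# Invented sales: (period, bedrooms). Prices are generated so that the true
# pure price change between period 0 and period 1 is 5%.
mix = [(0, 1), (0, 1), (0, 1), (0, 3), (1, 1), (1, 3), (1, 3), (1, 3)]
sales = [(p, beds, 100_000 * 1.5 ** (beds - 1) * 1.05 ** p) for p, beds in mix]

# Regress log(price) on an intercept, bedrooms and a period dummy.
X = [[1.0, float(beds), float(period)] for period, beds, _ in sales]
y = [math.log(price) for _, _, price in sales]
intercept, bedroom_effect, period_effect = ols(X, y)

def mean_price(period):
    prices = [price for p, _, price in sales if p == period]
    return sum(prices) / len(prices)

raw_change = mean_price(1) / mean_price(0) - 1   # distorted by the mix of sales
pure_change = math.exp(period_effect) - 1        # mix held constant by the model

print(f"raw change in average price: {raw_change:.1%}")
print(f"hedonic (pure) price change: {pure_change:.1%}")
```

In this invented data, period 1 contains mostly three bedroom properties, so the raw average rises far more than the underlying 5% price change that the period coefficient recovers.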
The hedonic regression approach requires detailed information on the characteristics of property sold, covering both the physical attributes of the property (such as size or floor space) and its location (such as the type of neighbourhood or where in the country it is). For the production of the UK HPI this data is obtained from a variety of administrative data sources that cover the price paid for transacted property (such as the Price Paid Dataset collected by HM Land Registry for England and Wales), the attributes of a property (such as the Council Tax Valuation List maintained by the Valuation Office Agency) and characteristics related to the location of the property (such as the type of neighbourhood where the property is situated, defined by the Acorn classification from Consolidated Analysis Centers, Inc. (CACI)).
This document will focus on the Acorn classification, which is produced by CACI. Acorn is a segmentation tool which categorises the UK’s population into demographic types. Acorn provides a general understanding of the attributes of a neighbourhood by classifying postcodes into a category, group or type. For the purpose of the UK HPI, the Acorn group is used to classify property according to the postcode where it is situated. For example, a property could be classified (based on its postcode) in any Acorn group from ‘lavish lifestyles’ through to ‘difficult circumstances’.
The reason for using such a classification (and its importance) is that the location of a property should influence the price people are willing to pay. Location is therefore an important price-determining characteristic that should be accounted for when modelling house prices.
2. Summary of process
An updated version of the Acorn classification is provided by CACI on an annual basis for use in the UK HPI, to ensure the classification used remains representative of changes in neighbourhoods and to capture new postcodes that become available as new property is built. Each month, the latest version of the Acorn classification is matched to the latest set of property transactions (using the postcode variable as the match key) with the resultant data then used in the production of the latest month’s house price index.
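The monthly matching step can be pictured as a simple lookup keyed on postcode. The following is a minimal sketch only: the postcodes, prices, field names and the whitespace and case normalisation are all assumptions made for illustration, and it does not describe the production system.

```python
def normalise_postcode(postcode):
    """Remove whitespace and upper-case so that 'ab1 2cd' and 'AB12CD' match."""
    return "".join(postcode.split()).upper()

def attach_acorn(transactions, acorn_lookup):
    """Return copies of the transactions with an 'acorn_group' field added.

    Transactions whose postcode is not in the lookup get acorn_group=None.
    """
    matched = []
    for transaction in transactions:
        key = normalise_postcode(transaction["postcode"])
        matched.append({**transaction, "acorn_group": acorn_lookup.get(key)})
    return matched

# Hypothetical annual Acorn file, keyed on normalised postcode.
acorn_lookup = {
    "AB12CD": "successful suburbs",
    "EF34GH": "student life",
}

# Hypothetical transactions for the latest month.
transactions = [
    {"postcode": "ab1 2cd", "price": 250_000},
    {"postcode": "EF3 4GH", "price": 180_000},
    {"postcode": "ZZ9 9ZZ", "price": 320_000},  # not in the Acorn file
]

matched = attach_acorn(transactions, acorn_lookup)
for row in matched:
    print(row["postcode"], row["acorn_group"])
```

Unmatched records are kept rather than dropped, so that later steps can decide how to treat a missing Acorn group.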
3. Assessment of the CACI Acorn data using the Administrative Data Quality Assurance Toolkit
The production and publication of house price data can be considered as medium profile, in that there is wider user and media interest in the results that are published, with moderate economic or political sensitivity.
The data quality concern attached to the Acorn classification is considered a medium quality concern. While the CACI Acorn data is constructed based on a number of data sources, the Office for National Statistics (ONS) has been in discussion with CACI’s head statistician to get a better understanding of the processing and quality assurance processes applied to the data. We consider the data of sufficient statistical quality for the purpose it is being used. Furthermore, it should be noted that the Acorn classification is one of a number of characteristics (from various administrative data sources) that are used in the modelling of house price data. The floor area and type of property (detached, semi-detached for example) are generally found to be the most important variables in explaining a house price, followed by the Acorn variable (out of 6 variables).
When taking into consideration the public profile of house price statistics, its potential impact and the level of quality concern from the provider, the level of assurance attached to the use of Acorn data in the production of the UK House Price Index has been assessed as A2: Enhanced assurance.
3.1 Practice area 1: operational context and administrative data collection
Acorn is a geo-demographic segmentation of residential neighbourhoods in the UK. It classifies each postcode in the country into one of 62 types. The 62 types aggregate into 18 Acorn groups which lie within 6 Acorn categories at the top level. The 18 Acorn groups are used as explanatory variables within the UK HPI regression. These are:
- lavish lifestyle
- executive wealth
- mature money
- city sophisticates
- career climbers
- countryside communities
- successful suburbs
- steady neighbourhoods
- comfortable seniors
- starting out
- student life
- modest means
- striving families
- poorer pensioners
- young hardship
- struggling estates
- difficult circumstances
- not private household
Acorn is essentially a segmentation of people and their characteristics (rather than a characteristic of the property). It is used as an explanatory variable within the UK HPI given it was found to be a variable which explains some of the price of a property.
Acorn draws on a wide range of commercial and public sector data sources, including Open Data and administrative data. These include HM Land Registry, Registers of Scotland, commercial sources of information on age of residents, ethnicity profiles, benefits data, population density and data on social housing and other rental property. In addition, CACI has created proprietary databases, including location of prisons, traveller sites, age-restricted housing, care homes, high-rise buildings and student accommodation. Traditional data sources such as the Census of Population and large-volume lifestyle surveys are also used. More information is available on the CACI Acorn website.
The type allocated to a postcode is predominantly determined by an algorithm that uses the data sources mentioned above. In some instances, a manual allocation of Acorn type is applied. Examples of this are traveller sites, student halls of residence and prisons.
Further detail is available on the CACI website, which includes:
- a comprehensive overview of the methods used in the production of Acorn: technical guide (PDF, 783KB, 22 pages)
- a comprehensive Acorn user guide (PDF, 4.8MB, 108 pages)
In the absence of any other information, traditional data sources such as the census are predominantly used. Census data is published at output area level (each output area has around 150 households) and is used to allocate each postcode within the output area to the appropriate type. While this allocation of type may not be exact at the postcode level, the impact on the UK HPI is minimal, given that the higher-level Acorn groups are used and data is published at local authority level.
The Acorn dataset is provided to ONS on an annual basis. As such, newly created postcodes will not be incorporated into the Acorn dataset until its next annual update. The scale of this was tested by matching the Acorn 2016 data, by postcode, against HM Land Registry price paid data for transactions since that update. For existing properties a 99% postcode match rate is achieved; for new properties, however, this falls to 40%. This percentage is broadly consistent across geographies and property types. The impact of this is that greater reliance is placed on the matched subset of new build postcodes (the 40%), which may lead to increased volatility in the resulting new build estimate.
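A match rate of this kind can be computed by counting, separately for existing and new properties, how many transactions have a postcode present in the Acorn file. The sketch below uses invented postcodes and a hypothetical `new_build` flag; the resulting figures are illustrative and do not reproduce the ONS test.

```python
def match_rates(transactions, acorn_postcodes):
    """Share of transactions with a postcode in the Acorn file,
    split by whether the property is a new build."""
    counts = {True: [0, 0], False: [0, 0]}  # new_build -> [matched, total]
    for transaction in transactions:
        tally = counts[transaction["new_build"]]
        tally[1] += 1
        if transaction["postcode"] in acorn_postcodes:
            tally[0] += 1
    return {
        ("new" if is_new else "existing"): matched / total
        for is_new, (matched, total) in counts.items()
        if total
    }

# Hypothetical Acorn file: two new build postcodes were already included.
acorn_postcodes = {
    "AA1 1AA", "AA1 1AB", "AA1 1AC", "AA1 1AD", "NB1 1AA", "NB1 1AB",
}

# Hypothetical transactions since the Acorn update.
transactions = (
    [{"postcode": p, "new_build": False}
     for p in ["AA1 1AA", "AA1 1AB", "AA1 1AC", "AA1 1AD"]]
    + [{"postcode": p, "new_build": True}
       for p in ["NB1 1AA", "NB1 1AB", "NB2 1AA", "NB2 1AB", "NB2 1AC"]]
)

rates = match_rates(transactions, acorn_postcodes)
for kind, rate in sorted(rates.items()):
    print(f"{kind}: {rate:.0%} of transactions matched")
```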
In discussions with CACI Acorn it was confirmed that the use of the data as part of a model (as it is used in the UK HPI) is an appropriate use of its data.
3.2 Practice area 2: communication with data supply partners
The Office for National Statistics (ONS) has a rolling three-year licence with CACI for the use of the Acorn classification in the calculation of house price indices. The licence was last updated in 2015 and will be renewed at the end of 2017.
Along with having access to the dataset, the licence also provides ONS with access to an account manager and technical support to answer any data queries that occur during the annual update. This provides a clear and established point of contact to discuss any issues or quality concerns regarding the annual provision of Acorn data. No regular meetings are scheduled with CACI, although meetings are arranged when needed and usually coincide with the renewal of the licence agreement. To ensure ONS understands the processing and quality assurance processes applied to the CACI Acorn data, ONS has been in discussion with CACI’s head statistician and incorporated their recommendations into this document.
The latest delivery of Acorn data is initiated each year by the designated account manager at CACI. The account manager advises ONS of the impending update and provides the username and password that ONS needs to access the file transfer system on CACI’s secure server and download the data. This is then followed up to ensure the data has been accessed successfully and to resolve any questions relating to the latest delivery of data (particularly once ONS has quality assured the data).
3.3 Practice area 3: quality assurance principles, standards and checks applied to data supplies
CACI Acorn has many methods of controlling, ensuring and maintaining the quality of the input data to their models. These include:
- evaluating quality information published alongside the data sources used
- cross-checks against other data sources at the record and aggregate level
- manual checks (internet searches, maps, Zoopla)
- internal consistency checks
The published technical guide (and supporting material) provides comprehensive information about the methodology and source data used in the production of Acorn.
In evaluating the resulting model and output, the main method used to evaluate and monitor the segmentation of postcodes into classification types is based around the calculation of gains scores – specifically GINI scores – which measure the effectiveness of Acorn in discriminating across a wide range of variables. Further information on this method can be found in the Acorn technical manual.
An Acorn Knowledge Matrix (Excel, 5.6MB) is also made available which describes the characteristics of the Acorn categories, groups and types by comparing the penetration of each segment with that of the UK.
3.4 Practice area 4: producers’ quality assurance investigation and documentation
The Acorn dataset is one of a set of characteristics (about a property) that is used as an input in the production of modelled house prices, which in turn are used in the production of the monthly UK House Price Index (HPI). Details on the UK HPI production methodology are available.
Acorn is updated annually for UK House Price Index (UK HPI) purposes and then subsequently used monthly in the UK HPI production process by matching the Acorn data to the latest property transactions data using the postcode attribute.
Internal (within ONS) quality assurance takes place initially on the annual update of Acorn, to assess the latest delivery of Acorn data in comparison to previous versions. The main quality assurance at this stage is to assess the distribution of postcodes within each category. This distribution is compared with previous years and any substantial changes are investigated with the account manager at CACI for clarification. A further series of spot checks are carried out on postcodes that have disappeared from the latest delivery (meaning they were in the previous year’s Acorn but are missing from the current). These cases usually relate to changes to postcode boundaries.
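The two annual checks described above (comparing the distribution of postcodes across classes between deliveries, and listing postcodes that have disappeared) can be sketched as follows. The postcodes, the classes assigned to them and the 5 percentage point threshold are all hypothetical; the source does not state what counts as a substantial change.

```python
from collections import Counter

def class_shares(acorn):
    """Share of postcodes falling in each Acorn class for one delivery."""
    counts = Counter(acorn.values())
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

def flag_shifts(previous, current, threshold=0.05):
    """Classes whose share of postcodes moved by more than `threshold`
    (an assumed tolerance) between two deliveries."""
    prev, curr = class_shares(previous), class_shares(current)
    shifts = {}
    for name in sorted(set(prev) | set(curr)):
        change = curr.get(name, 0.0) - prev.get(name, 0.0)
        if abs(change) > threshold:
            shifts[name] = change
    return shifts

def disappeared(previous, current):
    """Postcodes present in the previous delivery but missing from the current one."""
    return sorted(set(previous) - set(current))

# Hypothetical deliveries: postcode -> Acorn class.
previous_year = {
    "AA1 1AA": "student life", "AA1 1AB": "student life",
    "AA1 1AC": "modest means", "AA1 1AD": "modest means",
}
current_year = {
    "AA1 1AA": "student life", "AA1 1AB": "modest means",
    "AA1 1AC": "modest means", "AA1 1AE": "modest means",
}

shifts = flag_shifts(previous_year, current_year)
missing = disappeared(previous_year, current_year)
print("shifts to query with the supplier:", shifts)
print("postcodes to spot check:", missing)
```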
Further quality assurance then takes place on a monthly basis to ensure the matching of Acorn to property data takes place successfully. Further checks are then done within the modelling process.
The modelling process used in the production of house price data includes an automated assurance process that assesses modelled house prices for property with a certain set of attributes against the price for a similar property. If the modelled price is substantially different (meaning it exceeds a predefined tolerance) then the price is excluded from being used in the final house price estimate. Around 50 transactions a month (out of ultimately around 70,000 transactions) are removed as part of this process. These transactions still contribute towards the volume of transactions published.
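A tolerance check of this kind might look like the sketch below. The field names, the use of a relative difference and the tolerance of 0.5 are assumptions made for illustration; the source does not give the actual tolerance or the exact comparison used.

```python
def split_by_tolerance(records, tolerance=0.5):
    """Separate records whose modelled price differs from the comparison
    price by more than `tolerance` (as a proportion of the comparison price)."""
    kept, excluded = [], []
    for record in records:
        difference = abs(record["modelled_price"] - record["comparison_price"])
        relative_difference = difference / record["comparison_price"]
        if relative_difference > tolerance:
            excluded.append(record)
        else:
            kept.append(record)
    return kept, excluded

# Hypothetical records for one month.
records = [
    {"id": 1, "modelled_price": 210_000, "comparison_price": 200_000},
    {"id": 2, "modelled_price": 900_000, "comparison_price": 300_000},
    {"id": 3, "modelled_price": 150_000, "comparison_price": 160_000},
]

kept, excluded = split_by_tolerance(records)

# Excluded records drop out of the price estimate but still count
# towards the published volume of transactions.
transaction_volume = len(records)
print("used in price estimate:", [r["id"] for r in kept])
print("excluded from price estimate:", [r["id"] for r in excluded])
print("published transaction volume:", transaction_volume)
```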
The House Price Index (HPI) modelling process used can also account for those records where a match cannot be made between the CACI Acorn data and price data provided by HM Land Registry. Each attribute used in the hedonic regression model is given a weight that represents the relative importance of that attribute in explaining house prices. If a record being used in the model has a missing attribute, then the weight of that record is adjusted downwards to represent how important the missing attribute is. This process allows the use of all property transaction data in the calculation of average house prices each month, even though some attribute data could be missing.
For example, a record with no missing attributes would receive a weight of 1, while a record which has all its attributes except rooms will receive a weight of 0.77, and so will contribute slightly less to the final modelled estimate. These weights are calculated from multiple models which are run on an annual basis to determine the importance of each variable.
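The weighting can be illustrated with a small sketch. It assumes, purely for illustration, that a record's weight is 1 minus the summed importance of its missing attributes. The importance of 0.23 for rooms is chosen only so that the example reproduces the 0.77 quoted above, and the other importance values and field names are invented.

```python
# Hypothetical attribute importances. Only the rooms value is tied to the
# text (1 - 0.23 = 0.77); the others are placeholders.
ATTRIBUTE_IMPORTANCE = {
    "rooms": 0.23,
    "floor_area": 0.30,
    "acorn_group": 0.10,
}

def record_weight(record):
    """Weight of a record: 1 for a complete record, reduced by the assumed
    importance of each missing attribute, and never below 0."""
    missing = [name for name in ATTRIBUTE_IMPORTANCE if record.get(name) is None]
    return max(0.0, 1.0 - sum(ATTRIBUTE_IMPORTANCE[name] for name in missing))

complete = {"rooms": 5, "floor_area": 92.0, "acorn_group": "steady neighbourhoods"}
no_rooms = {"rooms": None, "floor_area": 92.0, "acorn_group": "steady neighbourhoods"}
no_acorn = {"rooms": 5, "floor_area": 92.0, "acorn_group": None}

for label, record in [("complete", complete), ("no rooms", no_rooms), ("no Acorn", no_acorn)]:
    print(f"{label}: weight {record_weight(record):.2f}")
```

Under this assumed rule, a transaction with no matching Acorn group would still contribute to the estimate, but with a reduced weight.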
Following the running of the model, test statistics are analysed to ensure the model has run correctly and fit successfully. This includes analysing the R squared of the model (model fit) and significance of the explanatory variables. An R squared of around 0.8 is achieved. This means that 80% of the variation in price is captured by the explanatory variables. An R squared of 0.8 is high. The old ONS HPI had an R squared of around 0.7.
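For reference, R squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it for a handful of invented actual and predicted values; the numbers are not taken from the UK HPI model.

```python
def r_squared(actual, predicted):
    """Proportion of the variation in `actual` captured by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    residual_ss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - residual_ss / total_ss

# Invented values, for example log prices and their modelled counterparts.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

fit = r_squared(actual, predicted)
print(f"R squared: {fit:.3f}")
```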
The data is then aggregated with the resulting series analysed by various breakdowns, over time, and against other published sources of house price growth. Any unexpected movements within the series are explored through the record level data. Monthly curiosity meetings are held to review the new data and discuss any long term trends in the data and its drivers. A high level representation of this process is in Annex A.
4. Strengths and limitations of data
Coverage is comprehensive, with data available at postcode level for matching. Acorn is found to explain some of the price of a property, which is why it is included in the model. While it is acknowledged that the classification for every postcode may not be exact, this is not a requirement given how the data is used within the model. As the Acorn dataset is an annual release, new postcodes are not added until the next release. This may contribute to the volatility of the new builds estimate in the UK HPI.
Overall, this data source is judged to be of adequate quality for the use to which it is being put in the UK House Price Index.