Technical note - English Private Landlord Survey: Segmenting the business models of private landlords
Published 4 December 2025
Applies to England
Background
Segmentation analysis breaks down a large and potentially complex population into smaller, more manageable groups. By examining these smaller groups individually, distinct groups with multiple similarities can be compared against each other to uncover patterns and trends that might be obscured in aggregate data.
k-medoids clustering is a clustering algorithm that is particularly useful for handling outliers and providing meaningful groupings. In this technical annex, ‘cluster’ is used to represent a data-driven group of similar points, which may evolve throughout the analysis process, whereas ‘segment’ refers to the final outcome; conceptually the two are equivalent and may be read as synonyms.
k-medoids clustering is a partitioning algorithm that aims to minimise the dissimilarity between points within a cluster while choosing actual data points as centres, known as medoids. Please see more details on the algorithm in the scikit-learn-extra 0.3.0 documentation. A significant advantage of k-medoids is that it selects representative points from the dataset, making it less sensitive to outliers and noise than other methodologies, such as k-means.
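As an illustration only, the sketch below shows a single k-medoids run using the KMedoids class from the scikit-learn-extra package referenced above; the data, number of clusters and settings are placeholders rather than those used in the EPLS analysis.

```python
# Minimal sketch of a single k-medoids run with scikit-learn-extra.
# The data and settings are placeholders, not the EPLS analysis itself.
import numpy as np
from sklearn_extra.cluster import KMedoids

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))      # placeholder feature matrix

model = KMedoids(
    n_clusters=5,                  # k: the predefined number of target clusters
    metric="manhattan",            # distance metric used to assign points to medoids
    method="pam",                  # classic swap-based partitioning around medoids
    init="k-medoids++",            # strategy for choosing the initial medoids
    random_state=0,
)
model.fit(X)

print(model.medoid_indices_)       # indices of the data points chosen as medoids
print(model.labels_[:10])          # cluster assignments for the first 10 points
```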
Application of a single k-medoids cluster process
The process begins with selecting k initial medoids randomly from the dataset, where k is a predefined number of target clusters. Each data point is then assigned to the nearest medoid based on a chosen distance metric. After the initial assignment, the algorithm iteratively updates the medoids by examining potential replacements among the data points to reduce the overall dissimilarity within the clusters. This involves calculating the total ‘cost’ of swapping a medoid with another point in the cluster and choosing the swap that results in the greatest reduction in cost.
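The swap step can be illustrated with a small, self-contained sketch: here the total ‘cost’ of a configuration is taken as the sum of Manhattan distances from each point to its nearest medoid, and a candidate swap is kept only if it lowers that cost. All names, indices and values below are hypothetical.

```python
# Illustrative sketch of the swap step: a configuration's total 'cost' is the
# sum of distances from each point to its nearest medoid, and a swap is kept
# only if it reduces that cost. Names and indices are hypothetical.
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of Manhattan distances from each point to its nearest medoid."""
    d = np.abs(X[:, None, :] - X[medoid_idx][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))              # placeholder data
medoids = [3, 40, 77]                      # current medoid indices (illustrative)

current = total_cost(X, medoids)
candidate = [3, 40, 55]                    # propose swapping medoid 77 for point 55
if total_cost(X, candidate) < current:     # accept the swap only if the cost falls
    medoids = candidate
```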
The iterative process continues until there is no change in the medoids, indicating that the optimal clusters have been found. The final clusters are then evaluated for their cohesion and separation to ensure meaningful segmentation.
Iterative approach to finding the optimal clusters
In identifying clusters for the study, the above clustering process was run hundreds of times with variations in settings. These variations were of several types:
Number of clusters
As the number of natural clusters was not presupposed from existing knowledge, a typical approach is to experiment with varying the number of starting medoids, and hence the number of resulting clusters. Between 5 and 7 clusters were tested, balancing meaningful distinction between groups against a manageable number of cluster profiles to interpret.
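A minimal sketch of this experiment, reusing the placeholder feature matrix X from the earlier sketch, might loop over the candidate cluster counts and record each fitted model for comparison; the settings shown are illustrative.

```python
# Sketch of varying the number of starting medoids (and hence clusters).
# X is the placeholder feature matrix from the earlier sketch.
from sklearn_extra.cluster import KMedoids

results = {}
for k in range(5, 8):                       # test 5, 6 and 7 clusters
    km = KMedoids(n_clusters=k, metric="manhattan",
                  method="pam", random_state=0).fit(X)
    results[k] = km.inertia_                # total within-cluster dissimilarity
```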
Corporate landlord handling
The EPLS captures information about landlord type within the responses. Consideration was given as to whether to segment this group manually, based on whether the landlord identified themselves as “part of a company”, or to let the segmentation model make this distinction in combination with other factors. This group was treated separately: in the report it was included as a segment, but the distinction was based on survey response rather than the k-medoids segmentation process, which was applied to non-corporate landlords only. Respondents who answered “both as an individual/group of individuals and as part of a company” were also included in the corporate segment.
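A hedged sketch of this manual split is shown below; the file name, column name and response labels are assumptions for illustration and do not reflect the actual EPLS variable coding.

```python
# Hypothetical sketch of separating corporate landlords before clustering.
# The file name, 'landlord_type' column and response labels are assumptions.
import pandas as pd

df = pd.read_csv("epls_2024.csv")                                # hypothetical file

corporate_labels = [
    "part of a company",
    "both as an individual/group of individuals and as part of a company",
]
is_corporate = df["landlord_type"].isin(corporate_labels)

corporate = df[is_corporate]        # reported as its own segment, not clustered
non_corporate = df[~is_corporate]   # passed to the k-medoids segmentation
```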
Variables included in the model
Observations in the same cluster are those that share similarities across many variables, so the clusters were heavily affected by the choice of variables in the model. For example, if too many variables representing very similar concepts were chosen, this would weight the analysis towards an interpretation of those variables and reduce the impact of other, less related variables or concepts. For this reason, variable selection was carried out to ensure balanced inclusion across multiple factors, and that every included variable was meaningful and something that should be considered for differential analysis.
Nevertheless, even well-considered variables can sometimes cause difficulty for clustering, such as if they add noise to the model and make clusters less well-defined. For this reason, we experimented with Leave One Feature Out approaches: excluding one variable from the model and observing the impact on the clusters. Correlation was also calculated between each pair of variables to identify those which were potentially redundant. Additionally, for some variables that were known to be similar, comparison runs were made to observe the impact of including one versus the other.
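The sketch below outlines one way a Leave One Feature Out check and a correlation screen could look, assuming a hypothetical, numerically encoded DataFrame called features and using the silhouette score described later in this note; it is not the exact procedure applied to the EPLS data.

```python
# Sketch of a Leave One Feature Out check and a pairwise correlation screen.
# 'features' is a hypothetical, numerically encoded DataFrame of candidate variables.
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score

lofo_scores = {}
for col in features.columns:
    subset = features.drop(columns=col)
    labels = KMedoids(n_clusters=5, metric="manhattan",
                      random_state=0).fit_predict(subset)
    # A higher score without the variable suggests it may be adding noise
    lofo_scores[col] = silhouette_score(subset, labels, metric="manhattan")

corr = features.corr()   # flag highly correlated, potentially redundant variable pairs
```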
Algorithm parameters
Experiments were run comparing cluster performance using two different distance metrics that underlie the calculation of similarity: Manhattan and Gower distances. Please see more details on the Gower distance calculation on the Python Package Index page. Manhattan distance is an efficient, simple metric, while Gower distance has advantages for categorical variables. Other techniques to account for the variety of variables, including standardisation, were also employed where appropriate. Gower distance ultimately performed better and was selected for later runs.
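A sketch of how a Gower-based run could be set up is shown below, assuming the gower package from the Python Package Index and a mixed-type DataFrame called features; the settings are illustrative.

```python
# Sketch of clustering on a precomputed Gower distance matrix.
# Assumes the 'gower' package and a mixed-type DataFrame called 'features'.
import gower
from sklearn_extra.cluster import KMedoids

D = gower.gower_matrix(features)                   # pairwise Gower dissimilarities
km = KMedoids(n_clusters=5, metric="precomputed",  # cluster on the distance matrix itself
              method="pam", random_state=0).fit(D)
labels = km.labels_
```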
With hundreds of model outputs, cluster cohesion was evaluated by calculating the Silhouette Score, which averages, over all points, a comparison of each point’s similarity to its own cluster against its dissimilarity to the next nearest cluster. Please see more details on the Silhouette score in the scikit-learn documentation. A higher score means more well-defined clusters, allowing easy comparison between large numbers of models without having to examine clusters manually. The scoring metric enabled high-performing models to be identified and approaches to be refined.
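As an illustration of how large numbers of candidate models could be compared without manual inspection, the sketch below scores several cluster counts against the precomputed Gower distance matrix D from the previous sketch; the grid of models shown is illustrative only.

```python
# Sketch of comparing candidate models by silhouette score on a precomputed
# distance matrix D. The range of cluster counts is illustrative.
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

scores = {}
for k in range(5, 8):
    labels_k = KMedoids(n_clusters=k, metric="precomputed",
                        method="pam", random_state=0).fit_predict(D)
    scores[k] = silhouette_score(D, labels_k, metric="precomputed")

best_k = max(scores, key=scores.get)   # higher score = more well-defined clusters
```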
k-medoids, like other clustering techniques, can be sensitive to the initial starting medoids. In some cases, this can result in optimal clusters not being found, or worse, models which appear to score well but are not well optimised. As well as manual review, potentially performant models were subjected to repeat runs using different initialisation parameters, and the variation in scores compared. This was to ensure model robustness: that similar scores were found despite different starting points, and hence that the clusters were representative of the data rather than an artefact of chance.
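A sketch of this robustness check is given below: a shortlisted configuration is re-run with different random initialisations and the spread of silhouette scores is examined. The number of repeats and the reuse of the precomputed matrix D are assumptions for illustration.

```python
# Sketch of the robustness check: repeat a shortlisted model with different
# initialisation seeds and compare the spread of silhouette scores.
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

seed_scores = []
for seed in range(20):                                   # number of repeats is illustrative
    labels_s = KMedoids(n_clusters=5, metric="precomputed",
                        method="pam", init="random",
                        random_state=seed).fit_predict(D)
    seed_scores.append(silhouette_score(D, labels_s, metric="precomputed"))

print(np.mean(seed_scores), np.std(seed_scores))         # a small spread suggests a stable model
```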
Model selection
Ultimately, a shortlist of high-performing models was reviewed manually to ensure the clusters were interpretable in the context of the chosen variables. A final selection from the shortlist was made in consultation with MHCLG.
Models which showed promising results were considered, including the best-performing 5-, 6- and 7-cluster models, as well as one 5-cluster model with the addition of a sixth corporate landlord segment. Characteristics of each cluster were presented in terms of their answers to specific survey questions: for example, the proportions of respondents in each segment who considered their portfolio an investment. Specific summaries were made of the characteristics of each segment to ease comparison.
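A sketch of this kind of profiling is shown below, assuming the cluster labels from the earlier Gower sketch align row-for-row with the hypothetical non_corporate DataFrame and that a yes/no survey variable has the illustrative name considers_portfolio_investment.

```python
# Hypothetical sketch of profiling segments: the proportion of respondents in
# each cluster giving a particular answer to a survey question.
# Column names are assumptions for illustration.
profiled = non_corporate.assign(segment=labels)          # attach cluster labels

profile = (
    profiled
    .groupby("segment")["considers_portfolio_investment"]
    .apply(lambda s: (s == "yes").mean())                # share answering 'yes' per segment
)
print(profile)
```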
The chosen model was the 5-cluster model with corporate landlords identified separately. This model used 15 variables to profile landlord characteristics across multiple dimensions. The model incorporates property portfolio metrics (number of properties, HMOs, and buy-to-let status), landlord role perceptions (investment versus business orientation), employment characteristics (self-employed and retired status), financial indicators (rental income proportion and loan-to-value ratio), operational practices (agent use for management), tenant targeting (students, retired people, and housing support recipients), and professional engagement (membership in landlord organisations). Key variables:
Portfolio: numprp, hmonum, btl,
Role/Employment: llrole_final001, llrole_final002, llrole_final003, llrole_final004, llrole_final005, llrole_final007, llempl_final003, llempl_final005,
Financial/Management: rtprinc, ltv_imp, agtusea1, agtusea2,
Tenant Types: tentypallb01, tentypallb02, tentypallb03, tentypallb04,
Professional: llorg_final001, llorg_final002.
For more information on the questionnaire and variables included, please refer to the EPLS 2024 questionnaire documentation.
Weighting
The EPLS data reported here is weighted to adjust for non-response to the survey. The calibration scheme applied to the landlord weights used for the tables and data underlying this report adjusts for non-response associated with the variables used in calibration (TDP, portfolio size and region). To the extent that these variables are correlated with landlord types and behaviours, there will be some reduction in bias. For more information on the EPLS 2024 weighting, please refer to the technical report.
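As a hedged illustration of how a weighted estimate for a segment could be produced, the sketch below computes a weighted proportion using the profiled DataFrame from the previous sketch and a hypothetical weight column called landlord_weight; the actual EPLS weighting and calibration are described in the technical report.

```python
# Sketch of a weighted proportion for one segment, assuming a hypothetical
# calibration weight column called 'landlord_weight'.
import numpy as np

seg = profiled[profiled["segment"] == 0]                 # one illustrative segment
weighted_share = np.average(
    (seg["considers_portfolio_investment"] == "yes").astype(float),
    weights=seg["landlord_weight"],
)
```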