Data science

Data science is a systematic approach to extracting insights and making predictions from data; that, at least, is a simplified way of looking at it.

Once upon a time this section might have been titled ‘Big Data’ or ‘Data Mining’, but times and names move on. None of these terms is entirely equivalent, but data science now effectively covers them all.

Stripped down to the core, data science is about the use of statistical methods encased in a blanket of data preparation and visualisation techniques. Exactly what is, and isn’t, data science is of course somewhat debatable and about as well-defined as artificial intelligence (AI). Similarly, a data scientist can be anything from an expert in statistical methods, machine learning (ML) and visualisation techniques at one extreme, to anyone with a degree in maths or computer science looking for a better paid job at the other (changing your job title to data scientist significantly increases employability).

Perhaps the best way to think about data science is as a process that takes data and generates insight and prediction from it.

A simple picture of this process runs from data, through discovery and wrangling, to analytics that generate insight and prediction.

It should be stressed that this is a simplification and there are many variations on the process, but it highlights the main elements. It should also be noted that ML is not the only way to perform analytics, and that ML can exist outside of data science too.

Data

The one thing that can be said about most data is that it’s hard. If you are in a situation where the data is easy you are either lucky or mistaken.

 Some common reasons data can be hard:

  • difficult to actually get hold of, for commercial, legal, or other, sometimes less rational, reasons (‘it’s mine, you can’t have it’)

  • poorly documented

  • poor quality (very common)

  • not enough of what you need (even if there are large volumes of data)

  • ethically difficult to use (for example, it contains personal information)

  • difficult to combine with other data

Data types

Data comes in different flavours, which may be one reason why some people are luckier than others. The main ones are:

  • Structured data

  • Text data

  • Digital signal data

  • Image data

Structured data

Most data held in most databases is structured: typically it is stored and represented as a table, or more likely a series of inter-related tables. Common formats when output from a database include the following (a short loading sketch follows the list):

  • Comma separated values (.csv)

  • Excel files (.xlsx)

  • XML files (.xml, particularly useful if you have hierarchies in your data)
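
As an illustration, the sketch below shows how each of these formats might be read into a table for analysis. It is a minimal sketch assuming Python with the pandas library (plus openpyxl for Excel and lxml for XML), and the file names are placeholders rather than real datasets.

```python
# A minimal sketch of loading common structured formats with pandas.
# The file names are placeholders, not real datasets.
import pandas as pd

csv_data = pd.read_csv("orders.csv")       # comma separated values
excel_data = pd.read_excel("orders.xlsx")  # Excel workbook (requires openpyxl)
xml_data = pd.read_xml("orders.xml")       # XML (requires lxml)

print(csv_data.head())                     # quick look at the first few rows
```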

One of the problems with much of this sort of data is that its quality may be quite poor, especially if people have been allowed to enter it. (As a general rule of thumb people who enter data into databases are usually more creative than those that design the database in the first place – generally this is not good for data quality.)

Text data

There is a lot of information held in text (that is, after all, what it’s for) and such information is becoming more and more accessible to computers through Natural Language Processing (NLP).
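
As a very crude flavour of this, the sketch below (plain Python, with an invented sentence) turns free text into word counts; this is only a stand-in for the far richer structure that real NLP tools such as tokenisers, taggers and parsers can extract.

```python
# A minimal, illustrative sketch: extracting simple structure (word counts)
# from free text. Real NLP pipelines go much further than this.
import re
from collections import Counter

text = "Data science takes data and turns that data into insight."
tokens = re.findall(r"[a-z']+", text.lower())  # naive tokenisation
print(Counter(tokens).most_common(3))          # e.g. [('data', 3), ...]
```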

Image data

This includes images and videos. In one sense it is highly structured, a picture being represented by a grid of pixels, but the content of the image is unstructured and difficult for machines to interpret.
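
The sketch below (assuming Python with Pillow and NumPy, and a placeholder file name) shows the structured side of this: an image really is just a grid of numbers, even though what those numbers depict is hard for a machine to interpret.

```python
# A minimal sketch: an image as a grid of pixel values.
# "scene.png" is a placeholder file name.
import numpy as np
from PIL import Image

image = Image.open("scene.png")
pixels = np.asarray(image)            # shape is (height, width, channels)
print(pixels.shape, pixels.dtype)     # the grid the machine actually sees
```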

Digital signal data

Digital signal data is data in which time and amplitude take discrete values. It is obtained by sampling and quantising a physical signal. Speech data is a special case of digital signal data, in which the sound recorded is of one or more humans speaking.
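
The sketch below illustrates sampling and quantisation using NumPy, with a synthetic 5 Hz sine wave standing in for the physical signal.

```python
# A minimal sketch of sampling and quantising a signal.
import numpy as np

sample_rate = 100                                      # samples per second
t = np.arange(0, 1, 1 / sample_rate)                   # discrete time steps
analogue = np.sin(2 * np.pi * 5 * t)                   # continuous-valued 5 Hz wave
levels = 16                                            # 4-bit quantisation
digital = np.round((analogue + 1) / 2 * (levels - 1))  # discrete amplitude values
print(digital[:10])
```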

Data discovery

Analysis and visualisation

Once you are lucky enough to actually get hold of some data, it will be necessary to understand what you have as it may not be exactly what it said on the tin.

Exactly how you undertake this voyage of discovery will be very dependent on the nature of the data. This could involve statistical measures, quality assessment, and perhaps visualisation techniques such as boxplots, histograms and scatter plots.
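
A minimal first-look sketch, assuming Python with pandas and matplotlib and a placeholder CSV file, might look like this:

```python
# A minimal sketch of first-look data discovery.
# "sensor_readings.csv" is a placeholder file name.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sensor_readings.csv")

print(df.describe())      # basic statistical measures
print(df.isna().mean())   # fraction missing per column: a quick quality check

df.hist()                 # histograms of the numeric columns
plt.figure()
df.boxplot()              # boxplots to spot outliers
plt.show()
```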

For some data it may be important to understand its nuances, including how it was collected and the ways in which this may bias it. For example, image data will be biased by the nature of the sensor used in the imaging: near infra-red sensors are very good at highlighting vegetation, so vegetation will be detected with ease whilst other objects may be less prominent. Such nuances also include what the data actually represents; it may not be quite what you think. Even if the data is recording vehicles, the collector’s definition of a vehicle may be different from yours. This whole process may lead to some degree of disappointment, but it will also indicate the work you may have to do to improve things.

Data wrangling

Having discovered just how far short of your optimistic expectations your data falls, all is not lost: there are things that can be done to improve matters. Even if your data is of good quality, it may still be necessary to process it so that it is suitable for the analysis tools you have, and perhaps also to combine it with other data, a process known as conflation. These pre-processing stages are often called data wrangling.

Although tools and techniques exist to help this process, it is one where each data set will need to be approached in its own way. It has often been described as a cottage industry, although artisan may be a better description–if nothing else it should give an indication of cost.

It is often said, but never attributed, that wrangling can take up 80% of the time of a data science project. Whatever the precise percentage, for most projects it’s a lot. Some may be tempted to cut corners because this process is expensive, time consuming, and gets in the way of the fun bit, which is doing the actual analysis.

Our advice is to take this stage seriously: it’s the only way to get good results; we’d only advise shortcuts in this stage for projects you don’t want to succeed. This stage should also not be treated independently from the discovery phase; in many cases they are closely coupled: discovery highlights issues, fixing those issues uncovers different issues, and so on.

Data wrangling methods

Data cleaning

Data cleaning deals with duplicated or missing data, outliers, or noisy data (that is, data containing random values–think of the static you can hear on a radio). Duplicated data, where the same information occurs twice or more times, will need to be removed.

A bigger problem is missing data. If too much data related to some item or measurement is missing it may be best to delete it. If not much is missing it may be possible to infer the missing values or use means or medians–these are not perfect solutions and may affect the results in some way, so care needs to be taken.
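
The sketch below, assuming Python with pandas and a small invented table, shows these two steps in their simplest form; median imputation is used purely to illustrate the trade-offs mentioned above.

```python
# A minimal sketch of removing duplicates and imputing missing values.
import pandas as pd

df = pd.DataFrame({"site": ["A", "A", "B", "C"],
                   "reading": [1.2, 1.2, None, 3.4]})

df = df.drop_duplicates()                     # remove repeated rows
median = df["reading"].median()               # an imperfect but simple fill value
df["reading"] = df["reading"].fillna(median)  # record that you imputed, and how
print(df)
```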

Outliers are instances in your dataset that are far from all the others. They normally occur when there is an error in the measurements. However, they can also indicate that you have a very skewed dataset (biased in some way). If the outlier is an error, you can delete the instance; however, if it is a genuine sample from skewed data, deleting it will mean ignoring part of your data.

Some data such as signal data may contain noise and it is often possible to apply noise reduction techniques. Other sorts of data may have inconsistencies which again can be systematically removed. An example is a dataset which may refer to the company IBM as ‘IBM’, ‘I.B.M.’ or ‘International Business Machines’.
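
The sketch below (pandas, invented values) illustrates two of these fixes: harmonising inconsistent names, and flagging, rather than silently deleting, values that fall well outside the interquartile range.

```python
# A minimal sketch of harmonising names and flagging outliers.
import pandas as pd

# harmonise known variants of the same name
companies = pd.Series(["IBM", "I.B.M.", "International Business Machines"])
companies = companies.replace({"I.B.M.": "IBM",
                               "International Business Machines": "IBM"})
print(companies.unique())                        # ['IBM']

# flag values far outside the interquartile range: inspect before deleting
readings = pd.Series([10.1, 9.8, 10.3, 10.0, 97.0, 9.9])
q1, q3 = readings.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (readings < q1 - 1.5 * iqr) | (readings > q3 + 1.5 * iqr)
print(readings[outliers])                        # the suspicious 97.0
```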

Data transformation and data reduction

These are methods that you may want to apply to statistical data, used either to transform the data so that it is all in a comparable form (this is known as normalising) or to remove redundant or unnecessary information (reduction).
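
A minimal sketch of both steps, using pandas and an invented table in which weight appears twice in different units, might look like this:

```python
# A minimal sketch of normalising columns and removing a redundant one.
import pandas as pd

df = pd.DataFrame({"height_cm": [150, 160, 170, 180],
                   "weight_kg": [55, 65, 70, 90],
                   "weight_lb": [121, 143, 154, 198]})

# min-max normalisation: every column ends up on a comparable 0 to 1 scale
numeric = df[["height_cm", "weight_kg"]]
normalised = (numeric - numeric.min()) / (numeric.max() - numeric.min())

# reduction: weight_lb just repeats weight_kg in other units, so drop it
reduced = df.drop(columns=["weight_lb"])

print(normalised)
print(reduced.columns.tolist())
```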

Conflation

As if the process were not complicated enough already, it is often necessary to merge components of 2 or more different datasets, a process known as conflation. To add interest, an element present in 2 datasets may have 2 very different identifiers applied to it, making it necessary to correlate the 2 using attribute data. Such processes may introduce errors through incorrect correlation and missed correlation, so care must be taken.
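
The sketch below (pandas, invented data) conflates two small datasets that describe the same sites under different identifiers, linking them on a shared attribute; real conflation is rarely this clean, which is where correlation errors creep in.

```python
# A minimal sketch of conflating two datasets via a shared attribute.
import pandas as pd

sales = pd.DataFrame({"store_id": ["S1", "S2"],
                      "postcode": ["AB1 2CD", "EF3 4GH"],
                      "revenue": [120, 340]})
survey = pd.DataFrame({"site_ref": [901, 902],
                       "postcode": ["AB1 2CD", "EF3 4GH"],
                       "footfall": [1500, 2300]})

# an inner join keeps only rows that match; check what gets dropped, since
# missed and incorrect matches both introduce error
conflated = sales.merge(survey, on="postcode", how="inner")
print(conflated)
```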

Analytics

Having obtained, discovered, visualised and wrangled your data, the time has finally come to perform analytics. Currently this is almost the same as saying the time has come for some ML, but this is not the only method of analytics that can be performed.

Before covering ML in the next section it’s worth reminding ourselves that analytics were performed before the onset of ML.

Classic statistical analysis is still alive and well, and specialised data such as geographic information is still healthily analysed using Geographic Information Systems (GIS). Such methods have advantages over ML, especially when the amount of data is limited, and so are still worth considering.
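
As a reminder of what classic statistical analysis can look like, the sketch below fits an ordinary least squares line with SciPy; the numbers are invented purely for illustration.

```python
# A minimal sketch of a classic statistical method: least squares regression.
from scipy import stats

rainfall = [10, 20, 30, 40, 50]          # mm per month (invented)
river_level = [1.2, 1.9, 3.1, 3.8, 5.0]  # metres (invented)

result = stats.linregress(rainfall, river_level)
print(f"slope = {result.slope:.3f}, r squared = {result.rvalue ** 2:.3f}")
```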