Linkage and processing of address data

Last modified: 05 Jul 2024

Overview

UK LLC links to geospatial measures using the same Trusted Third Party, Digital Health and Care Wales (DHCW), that is used for health data linkage. Where permissions allow, DHCW sends participants’ address information to the University of Leicester (UoL), which has been commissioned by UK LLC to model environmental exposure estimates. Before data are sent to UoL, UK LLC prepares a batch of ‘masking’ addresses. See Figure 1 for an overview of the data flow.

../../_images/geo_basic_data_flow.jpg

Figure 1 An overview of the flow of LPS participants’ geo/environmental data into the UK LLC TRE

Masking addresses

The masking addresses are true addresses, but they do not necessarily belong to Longitudinal Population Study (LPS) participants. UK LLC generates them at a ratio of 3:1 (masking:real) to minimise the disclosure risk associated with location-based information, and they are appended to the real addresses at DHCW before the combined set is supplied to UoL. The masking addresses are selected at random from Ordnance Survey (OS) AddressBase Plus, in proportion to key attributes of the LPS that have participants with permission to link. These attributes include the number of participants, the age of the cohort and spatial buffers aligned with each LPS’s catchment area. This allows the masking addresses to represent the UK LLC population as a whole more accurately. As a result, LPS with geographically constrained populations, e.g. EXCEED (Leicester), drive the generation of more masking addresses around the Leicester area.
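As a rough illustration of the 3:1 ratio and the proportional selection described above, the sketch below draws masking addresses at random from a candidate pool for each LPS. The function name, the data layout and the use of participant counts as the only weighting attribute are assumptions made for illustration. The actual UK LLC procedure also takes cohort age and spatial buffers into account and is not reproduced here.

```python
import random

MASKING_RATIO = 3  # masking addresses per real address (3:1, as described above)


def select_masking_addresses(candidates_by_lps, participants_by_lps, seed=None):
    """Draw masking addresses at random, in proportion to each LPS's participant count.

    candidates_by_lps: hypothetical dict of LPS name -> list of candidate
        addresses (e.g. addresses that fall within that LPS's catchment buffer).
    participants_by_lps: hypothetical dict of LPS name -> number of
        participants with permission to link.
    """
    rng = random.Random(seed)
    selected = []
    for lps, n_participants in participants_by_lps.items():
        n_needed = MASKING_RATIO * n_participants
        pool = candidates_by_lps[lps]
        # Sample without replacement so no masking address is repeated within an LPS.
        selected.extend(rng.sample(pool, min(n_needed, len(pool))))
    return selected
```

Because the number of masking addresses scales with the participant count, an LPS with a tight catchment area contributes a dense cluster of masking addresses around that area.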

Geocoding permissions

There are currently eight LPS that allow participants to be linked to geospatial measures generated by UoL. These permissions can be configured to allow linkage at the household level, using the Unique Property Reference Number (UPRN), or at the postcode level. When geocoding at the household level, the central property coordinate is used. Where postcode level is selected, the household (UPRN) nearest to the postcode centroid is geocoded, so the locations for these participants are less precise than those geocoded at the household level.
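The postcode-level rule can be sketched as a nearest-neighbour search. The example below assumes planar coordinates in metres (for example British National Grid eastings and northings) and straight-line distance. The function name and the tuple layout are hypothetical and are not taken from the UoL workflow.

```python
import math


def nearest_uprn(centroid, properties):
    """Return the UPRN whose coordinate lies closest to the postcode centroid.

    centroid: (easting, northing) in metres.
    properties: hypothetical list of (uprn, easting, northing) tuples for
        the properties within the postcode.
    """
    cx, cy = centroid
    # Straight-line distance is an assumption that suits a planar grid.
    uprn, _, _ = min(properties, key=lambda p: math.hypot(p[1] - cx, p[2] - cy))
    return uprn
```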

What is geocoding?

Geocoding is the assigning of geographical coordinates to a location. The following address data are provided by LPS:

  • Address line 1 (Premise level)

  • Address line 2 (Street name)

  • Address line 3 (Locality name)

  • Address line 4 (Town)

  • Address line 5 (Administrative area)

  • Postcode

These data are then matched using a database lookup to convert the physical address into geographical coordinates, where permissions allow full address to flow. Where permissions are set to postcode only, only the postcode is used in the geocoding process.
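One way to picture the address record and the effect of the permission level is the sketch below. The class name, the field names and the permission labels ("full_address", "postcode_only") are hypothetical. They mirror the list above rather than any actual UK LLC or UoL schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AddressRecord:
    line1: Optional[str]  # premise level
    line2: Optional[str]  # street name
    line3: Optional[str]  # locality name
    line4: Optional[str]  # town
    line5: Optional[str]  # administrative area
    postcode: str


def geocoding_input(record, permission):
    """Return the fields passed to the geocoder for a given permission level.

    permission: hypothetical label, either "full_address" or "postcode_only".
    """
    if permission == "postcode_only":
        return [record.postcode]
    if permission == "full_address":
        lines = [record.line1, record.line2, record.line3, record.line4, record.line5]
        # Skip empty lines so the geocoder only sees populated fields.
        return [line for line in lines if line] + [record.postcode]
    raise ValueError(f"unknown permission: {permission!r}")
```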

Geocoding using Experian

Overview

Addresses are verified and geocoded to one-metre accuracy using the Experian QAS Batch API software (formerly QAS QuickAddress Batch API Software). In summary, the QAS Batch API software geocodes address records by verifying them against the official postal addresses using OS AddressBase Premium. Cleaned records are then assigned a match result based on the accuracy of the original address. The Experian QAS geocoding process follows five main stages: External pre-processing; Match Country; Match street, PO box or organisation; Match Premises; and Select best match (see Figure 2) (Experian, 2019).

../../_images/experian_process.jpg

Figure 2 The QAS Batch API process

Unmatched addresses

If no match is achieved, the output address is returned and a ‘partial address found’ match code is assigned to the address (see Figure 3). If an address has been fully verified at premises level, it is assigned a ‘quality score’ depending on whether the address was partially matched or has multiple matches (e.g. multiple addresses identified for ‘High Street’). Lastly, a match confidence level (0 - low, 5 - intermediate, 9 - high) is allocated to each address depending on how confident the QAS Batch API is about the match it has returned. A low confidence level indicates that essential matching rules were not satisfied, while an intermediate confidence level shows that less important rules were not satisfied or that another check failed, e.g. the input address is not in the expected order (Experian, 2019). For documentation on how to interpret the Experian match code, see the Experian documentation.
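The three confidence levels can be treated as a small lookup, as in the sketch below. Only the values 0, 5 and 9 and their labels come from the description above. The function name is hypothetical, and the sketch does not attempt to parse a full Experian match code, whose layout is defined in the Experian documentation.

```python
CONFIDENCE_LABELS = {0: "low", 5: "intermediate", 9: "high"}


def describe_confidence(level):
    """Map a match confidence level (0, 5 or 9) to its label."""
    try:
        return CONFIDENCE_LABELS[level]
    except KeyError:
        # Any other value is rejected rather than guessed at.
        raise ValueError(f"unexpected confidence level: {level!r}") from None
```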

Once any interactive cleaning has been applied to the returned addresses, the full input address record and the filtered address record are exported for further post-processing checks, according to the following match success ratings: ‘Verified and good full matches’, ‘Verified and good premise matches’ and ‘Tentative and poor full matches’.

../../_images/experian_match_codes.jpg

Figure 3 Returned match code indicators (Experian, 2019)

Post-processing

Post-processing checks are undertaken to ensure that the output addresses are correctly matched and returned with the relevant grid information. First, the ‘full’ returned address data are imported into ArcGIS Pro 3.0 to convert the file into a SpatialPointDataFrame. This process removes any addresses with no returned coordinates. The spatial address file is then intersected with a UK Census Geography file to add the relevant Output Area (OA) and Lower Layer Super Output Area (LSOA) information.
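The two post-processing steps (dropping records without coordinates, then intersecting points with census areas) can be sketched in plain Python. This is a stand-in for illustration only: the workflow described above uses ArcGIS Pro, whereas the sketch uses a simple ray-casting point-in-polygon test. The function names and dictionary keys are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon (list of (x, y) vertices)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edges crossed by a horizontal ray from the point towards +x.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def attach_census_codes(addresses, areas):
    """Drop addresses without coordinates and tag the rest with OA and LSOA codes.

    addresses: hypothetical list of dicts with "x" and "y" keys (None if not geocoded).
    areas: hypothetical list of dicts with "oa", "lsoa" and "polygon" keys.
    """
    tagged = []
    for addr in addresses:
        if addr.get("x") is None or addr.get("y") is None:
            continue  # no returned coordinates, so the record is removed
        match = next(
            (a for a in areas if point_in_polygon(addr["x"], addr["y"], a["polygon"])),
            None,
        )
        tagged.append({
            **addr,
            "oa": match["oa"] if match else None,
            "lsoa": match["lsoa"] if match else None,
        })
    return tagged
```

Points that fall outside every supplied area keep their coordinates but receive no OA or LSOA code, which makes them easy to flag for manual checking.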

Linkage of address geocodes to environmental exposure datasets

Once geocoding has been completed, UoL stores the geocoded information to be linked to geo/environmental data, e.g. air and noise pollution, greenness and greenspace. These datasets are currently being built/modelled and will be documented as they become available.