Data, data everywhere – a need to stop and think!

The raw material for our work in the Digital Environment programme is, of course, DATA, and, as ever, we need improved means of sourcing and accessing appropriate datasets to support our modelling activities. NERC is certainly not short of data! There is a series of data centres associated with each of the environments; for the details of these see https://nerc.ukri.org/research/sites/data/

There are a number of challenges for data:

Challenge one – ‘Discovery’ – What data is out there, and where is it?

Where are those collections, archives and catalogues? There are a number of cross-disciplinary challenges in drawing together the data resources we require.

Searching across resources – we need to find data that may sit across a number of originating sources, so a single search needs to identify all the data required. Federated searches can aid this process (Z39.50 is a common standard used to support them).

Metadata is, and will remain, the key enabler. There are many metadata standards that can be used, at both summary and detailed levels: for example, DCAT 2; ISO 19115; Dublin Core; UK GEMINI; and CSDGM. Many readers will be familiar with the data.gov.uk service and portal; dataset descriptions can be placed into this resource to aid their discovery.
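To make this concrete, a summary-level record in the spirit of the Dublin Core element set can be sketched as a simple mapping. The dataset, creator and identifier below are invented for illustration:

```python
# A minimal Dublin Core-style metadata record for a hypothetical dataset.
# Field names follow the Dublin Core element set; all values are invented.
record = {
    "title": "River Flow Observations, Example Catchment",
    "creator": "Example Data Centre",
    "subject": ["hydrology", "river flow"],
    "description": "Daily mean river flow in cubic metres per second.",
    "date": "2020-01-01",
    "format": "text/csv",
    "identifier": "https://example.org/dataset/river-flow-001",
    "language": "en",
}

def find_by_subject(records, term):
    """Toy discovery: return records whose subject list contains the term."""
    return [r for r in records if term in r.get("subject", [])]

matches = find_by_subject([record], "hydrology")
```

Even this toy example shows why consistent metadata matters for discovery: a catalogue can only match a search term against fields that have been filled in consistently.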

A strength of metadata is the way it enables a shared description of the data and the data services that are available: the form of the data resource, any assumptions embedded in the data, any issues with confidence levels and uncertainty, and the data’s provenance and prior usage.

To address this first challenge, then, there should ideally be co-ordinated means and mechanisms to discover data resources to support our environmental science activities. This is something that has emerged clearly from the recent COVID-19 Hackathon series we have run.

Challenge two – ‘Access’ – Which data forms should I use?

We use data to hold representations of the world around us. There are of course many forms that data can take to represent ‘attributes’ of interest. The trick is in knowing which kind of data representation best suits the phenomena we seek to address. We can think in terms of quantitative data (which includes continuous and discrete data) and qualitative data (which includes binomial (binary/yes-no), nominal (unordered) and ordinal (ordered) data, for example).
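To illustrate these kinds, the snippet below pairs each with a hypothetical environmental attribute, and includes a small helper showing what distinguishes ordinal data: its classes support ordered comparison.

```python
# Illustrative (invented) attribute values and the data kinds that suit them.
continuous = 12.37    # quantitative, continuous: e.g. water temperature in degC
discrete = 42         # quantitative, discrete: e.g. a count of sensors
binomial = True       # qualitative, binomial: e.g. site flooded, yes/no
nominal = "clay"      # qualitative, nominal: an unordered soil texture class
ordinal = "medium"    # qualitative, ordinal: an ordered risk class

def risk_at_least(level, threshold, scale=("low", "medium", "high")):
    """Ordinal comparison: is `level` at or above `threshold` on the scale?"""
    return scale.index(level) >= scale.index(threshold)
```

Note that the same comparison would be meaningless for the nominal class: there is no sense in which "clay" is greater than "loam".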

These days, data can be accessed in many ways – from a straight file download, to a link to an online web-based data service. Web Service standards that can be used for this include WMS (for maps); WFS (for geometric features); WCS (for coverages/TIN/DTM); and WMTS (for tiled data). Other forms of environmental data may originate from integrated suites of sensors – the Internet of Things (IoT), and here standards such as MQTT can be used to provide access to these resources. REST (Representational State Transfer) is the most common contemporary means to access URI-based data resources and ‘end points’.
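As a sketch of what such a web-service request looks like in practice, the snippet below builds a WFS 2.0 GetFeature request as a key-value URL; the endpoint and feature type name are hypothetical, and a real deployment may differ in its supported parameters:

```python
from urllib.parse import urlencode

# Hypothetical WFS endpoint used purely for illustration.
BASE_URL = "https://data.example.org/geoserver/wfs"

def wfs_getfeature_url(type_name, max_features=10):
    """Build a WFS 2.0 GetFeature request URL asking for GeoJSON output."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,
        "count": max_features,
        "outputFormat": "application/json",
    }
    return BASE_URL + "?" + urlencode(params)

# The feature type name here is invented.
url = wfs_getfeature_url("env:river_catchments")
```

The same key-value pattern (service, version, request, plus operation-specific parameters) carries across WMS, WCS and WMTS, which is part of what makes these OGC standards convenient to consume.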

In terms of deciding which data to use, we can consider other issues such as the scale of representation, and whether data can be scaled up (aggregation) or scaled down (disaggregation). Data is also very likely to require some form of transformation or processing to enable its onward use in modelling applications. Examples in a given workflow might include conversion, classification, normalisation and geometric operations.
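For illustration, two of these transformation steps – min-max normalisation and a simple threshold classification – can be sketched in plain Python; the values and class boundaries below are illustrative, not drawn from any real dataset:

```python
def min_max_normalise(values):
    """Rescale a list of values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # guard against a constant series
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def classify(value, bounds=(0.33, 0.66), labels=("low", "medium", "high")):
    """Assign a normalised value in [0, 1] to an ordered class."""
    if value < bounds[0]:
        return labels[0]
    if value < bounds[1]:
        return labels[1]
    return labels[2]

normalised = min_max_normalise([2.0, 5.0, 8.0])
```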

To address this second challenge then, there should ideally be a widespread uptake and adherence to data and data access standards.

Challenge three – ‘Interoperability’ – how we use and interact with the data

In practice, interoperability means that one can take a representation of an environmental phenomenon, let’s say ‘soil characteristics’, from one provider and swap it out in a model for data from a different provider. If both data sources follow the same data specification, then the standardisation of the data form and format will permit the interchange. The INSPIRE Directive was all about this important aspect of data usage. Particular data ‘schemas’ exist for domains of knowledge, and these schemas can be used to represent data from many potential sources. Examples of such schemas are SoTerML for soils and terrain; GeoSciML for the geosciences; and CityGML for representing the built environment – there are many more! Schemas form a part of an ‘ontology’, where we create models of shared understanding about the phenomena we seek to represent over time. There is a lot of interest in the development of ‘semantic ontologies’ as libraries of how best to represent and characterise such phenomena.
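A minimal sketch of this provider-swapping idea, assuming a shared target schema and two hypothetical providers with different field names (the schema here is invented for illustration and is not drawn from SoTerML):

```python
from dataclasses import dataclass

@dataclass
class SoilRecord:
    """A shared target schema for soil records (illustrative fields)."""
    site_id: str
    ph: float
    texture: str  # e.g. "clay", "loam", "sand"

def from_provider_a(raw):
    """Adapter for (hypothetical) Provider A: keys 'id', 'soil_ph', 'tex'."""
    return SoilRecord(site_id=raw["id"], ph=raw["soil_ph"], texture=raw["tex"])

def from_provider_b(raw):
    """Adapter for (hypothetical) Provider B: measurements nested under 'props'."""
    p = raw["props"]
    return SoilRecord(site_id=raw["site"], ph=p["pH"], texture=p["texture"])

# Both adapters emit the same schema, so a model consuming SoilRecord
# can swap one provider for the other without any change to the model.
a = from_provider_a({"id": "S1", "soil_ph": 6.8, "tex": "loam"})
b = from_provider_b({"site": "S2", "props": {"pH": 5.9, "texture": "clay"}})
```

A community schema such as SoTerML plays the role of `SoilRecord` here, but agreed across providers rather than defined locally, which is what removes the need for per-provider adapters in the first place.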

Allied to this is the norm of drawing together different datasets in a consistent manner to achieve an expressed aim. Linked-data ontologies help us understand meaningful combinations of data themes.

A lot of the challenge of interoperability lies in consistency and standards. Sometimes the wording used can be contradictory and confusing; therefore, vocabularies of reserved descriptive terms can be developed, for example GEMET.

When we seek data for use, say as a ‘snap-in service’ in a model, we may wish to link to data sources in the cloud – on the Internet. In this case we need to consider the means by which the data is transported, and which standards and formats are used. Raster (bitmap) and vector (geometry) spatial data, and aspatial attribute data, can be held in many, many formats: for example TIFF, JPG and PNG for rasters; the Shapefile (SHP), JSON and its spatial variant GeoJSON, and GeoPackage for vectors; and CSV for attribute data. The formats are numerous!
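As a small illustration of one of these formats, a minimal GeoJSON feature (with invented coordinates and properties) can be parsed with nothing more than the standard library; real workflows would typically use a dedicated geospatial library on top of this:

```python
import json

# A minimal GeoJSON Feature; the coordinates and properties are invented.
geojson_text = """{
  "type": "Feature",
  "geometry": {"type": "Point", "coordinates": [-1.55, 53.80]},
  "properties": {"name": "Example monitoring site"}
}"""

feature = json.loads(geojson_text)
lon, lat = feature["geometry"]["coordinates"]
site_name = feature["properties"]["name"]
```

Because GeoJSON is plain JSON, any tool in the web ecosystem can read it, which is much of the reason it has become a common interchange format for vector data.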

Research Data as a Service (RDaaS) is an emergent concept of great interest, whereby pre-packaged, standardised forms of data can be made available online as a data service for use (consumption), say in a model.

Web services are not just about data either – there can be times when we want to use data processing tools in the cloud, where we upload some datasets and want to receive back a processed result. Standards are emerging to support this – for example WPS, the Web Processing Service for geospatial processing; and WCPS, the Web Coverage Processing Service for coverage data.
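As a sketch, a WPS 1.0.0 Execute request can also be expressed in key-value form, much like the data-access requests above; the endpoint, process identifier and input names below are hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical WPS endpoint used purely for illustration.
WPS_URL = "https://processing.example.org/wps"

def wps_execute_url(process_id, inputs):
    """Build a WPS 1.0.0 Execute URL; `inputs` maps input names to values."""
    # WPS key-value encoding joins inputs as "name=value" pairs with ";".
    data_inputs = ";".join(f"{k}={v}" for k, v in inputs.items())
    params = {
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "identifier": process_id,
        "datainputs": data_inputs,
    }
    return WPS_URL + "?" + urlencode(params)

# The process identifier and inputs here are invented.
url = wps_execute_url("example:aggregate", {"attribute": "flow", "function": "mean"})
```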

All of these interoperability issues need to be considered when building applications – from simulation models to GIS applications, to augmented and virtual reality models, for example.

To address this third challenge, then, we should ideally have considerable streamlining of data themes, so as to take advantage of the tremendous future research opportunities in data science and AI (specifically machine learning, deep learning, machine vision, soundscape ecology, data visualisation and AR/VR).

Epilogue

We need to advance the field of Environmental Informatics towards a digitally enabled environment: one able to develop and take full advantage of emergent technological themes around all the different sources of data of interest – static data sources and integrated sensor networks – with embedded data conditioning, data transport, data management, analysis and visualisation for assessing, monitoring and forecasting the state of the natural and built environment, at higher spatial resolutions and finer temporal scales than have previously been possible.

A challenge for NERC is to develop the Environmental Data Service to be able to offer a coordinated, coherent and accessible foundation for building the exciting data science models of tomorrow – for example Environmental Digital Twins able to support scientific enquiry and speculation/scenarios.