Posted on 29 Sep 23 by Kiernan McColl
Find out how KZN Group helped PLUS transform geospatial data management by building a cutting-edge geospatial data lake, redefining how they harness location-based information.
Planning and Land Use Services (PLUS) is a division within the South Australian Government’s Department for Trade and Investment. They are custodians of a vast array of datasets which are relied upon by a wide range of applications, government agencies and external stakeholders.
One of PLUS’ crucial responsibilities is the ongoing monitoring of flood plains, assessing the risk of flooding for locations throughout the state. The resulting survey data is vital for planning new developments and future land use.
However, managing and integrating this complex survey data, often delivered in a variety of proprietary formats, presents a formidable challenge. Multiple contractors conduct these surveys, producing data in differing formats and structures. Data easily becomes duplicated across applications and storage repositories spread between on-premises and cloud environments, making it difficult to consistently identify a reliable source for critical data.
Planning and Land Use Services recognised the need to construct a Geospatial Data Lake to serve as the definitive source for geographic data.
PLUS engaged with KZN and AWS to design and implement their Geospatial Data Lake. This engagement involved:
Analysis and Discovery: We began by collaborating closely with subject matter experts to thoroughly analyse their existing manual data preparation workflows. This involved identifying each data source and understanding the transformation steps required to achieve the desired outcomes.
Geospatial Translation Proof-of-Concept: To validate our approach, we implemented a small-scale PoC covering the solution’s most critical and risky parts. This established the feasibility of our solution by demonstrating that we could package the GDAL geospatial library to run within the serverless data integration environment provided by AWS Glue. We leveraged GDAL’s extensive suite of drivers, testing the translation of sample data from a variety of formats including ESRI’s FileGDB. We also used GDAL’s virtual file system capabilities to directly query data residing within compressed files stored in Amazon S3, and ensured we understood any trade-offs and limitations that might constrain how we should plan to use it.
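For illustration, here’s a minimal sketch of the kind of translation this PoC validated, using GDAL’s Python bindings to read an ESRI FileGDB from a zipped archive in S3 via the /vsizip/ and /vsis3/ virtual file systems and translate it to an open format. It assumes GDAL has been packaged for the Glue Python environment; the bucket and file names are placeholders, not PLUS’s actual data.

```python
from osgeo import gdal, ogr

gdal.UseExceptions()

# GDAL's virtual file systems let drivers read straight out of a zip
# archive in S3, with no download or extraction step. Credentials come
# from the Glue job's IAM role via the default AWS credential chain.
src_path = "/vsizip//vsis3/example-raw-bucket/surveys/flood_survey.gdb.zip"

# The OpenFileGDB driver reads ESRI FileGDB without any ESRI SDK.
src = ogr.Open(src_path)
for i in range(src.GetLayerCount()):
    layer = src.GetLayer(i)
    print(layer.GetName(), layer.GetFeatureCount())

# Translate the source into GeoJSON, writing directly back to S3.
gdal.VectorTranslate(
    "/vsis3/example-curated-bucket/surveys/flood_survey.geojson",
    src_path,
    format="GeoJSON",
)
```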
Data Ingestion Pipeline: With our basic solution validated, we established a data ingestion pipeline designed for reliability and scalability. It is data-driven: file specifications and pipeline control data are stored within Amazon DynamoDB, so a new dataset can be onboarded simply by inserting a new database record (see the sketch below). The pipeline’s main functions were decoupled through the use of AWS Step Functions, allowing it to be extended for future needs. We demonstrated this by adding optional processing steps to validate geospatial data files (e.g. GeoJSON) and catalogue their data quality, using AWS Glue and the AWS SDK for Pandas (awswrangler) to handle very large files that would not be suitable for processing in AWS Lambda functions. Throughout this phase, we continually tested the pipeline with manually acquired sample data to avoid any data integration surprises that might impact the project’s delivery schedule.
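To illustrate the data-driven pattern, onboarding a new dataset might look like the following sketch, which writes a file-specification record with boto3. The table name and attribute schema are hypothetical stand-ins for the production control data model.

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Hypothetical control table; the real pipeline's item schema may differ.
specs = dynamodb.Table("geo-datalake-file-specs")

# Inserting this one record is all that's needed for the pipeline to
# start picking up and translating files for the new dataset.
specs.put_item(
    Item={
        "dataset_id": "flood-survey-2023",      # partition key
        "source_prefix": "raw/surveys/2023/",   # where new files land in S3
        "file_format": "FileGDB",               # selects the GDAL translation
        "target_format": "parquet",             # curated output format
        "validate": True,                       # optional data quality step
    }
)
```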
Dataset Onboarding: We next onboarded the datasets central to our initial use cases, while streamlining the process to reduce the effort required to bring on more data in future. This involved creating data acquisition jobs to reliably export or fetch data from file-based repositories or PostGIS databases, and configuring file processing records for them within the data ingestion pipeline. These acquisition jobs were delivered alongside a library of reusable AWS CDK constructs, so they can be redeployed with minimal effort and configured to meet the scheduling, partitioning, and naming requirements of future datasets.
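As a sketch of what such a construct might look like (the names, parameters, and runtime here are illustrative assumptions, not the actual library), a scheduled acquisition job can be packaged as a single reusable CDK construct:

```python
from aws_cdk import Duration, aws_events as events, \
    aws_events_targets as targets, aws_lambda as lambda_
from constructs import Construct


class ScheduledAcquisitionJob(Construct):
    """Reusable construct bundling an acquisition Lambda with a schedule."""

    def __init__(self, scope: Construct, construct_id: str, *,
                 handler_asset: str, schedule: events.Schedule,
                 environment: dict) -> None:
        super().__init__(scope, construct_id)

        # Lambda that exports or fetches a dataset and drops it into
        # the raw zone of the data lake bucket.
        handler = lambda_.Function(
            self, "Handler",
            runtime=lambda_.Runtime.PYTHON_3_11,
            handler="acquire.main",
            code=lambda_.Code.from_asset(handler_asset),
            timeout=Duration.minutes(15),
            environment=environment,
        )

        # Per-dataset schedule via an EventBridge rule.
        events.Rule(
            self, "Schedule",
            schedule=schedule,
            targets=[targets.LambdaFunction(handler)],
        )
```

Each additional dataset then becomes a short, declarative instantiation of the construct with its own schedule and naming configuration.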
Data Curation and Processing Workflows: With the necessary raw data now available in the Geospatial Data Lake, our final core task was to set up data curation and processing workflows. These transformed raw and proprietary data formats into curated, standardised, open formats, all while maintaining data quality. Once data was available in open formats such as Parquet, we could query it directly from S3 via Amazon Athena, giving us immediate feedback while we iteratively developed the geospatial SQL queries needed for each specialised use case. Adapting the final queries into AWS Glue jobs for regular processing then required minimal effort, especially once we had established a reusable AWS CDK construct library codifying templated patterns for defining new workflows.
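A sketch of that interactive loop, using the AWS SDK for Pandas to run a geospatial query through Athena, is shown below. The database, table, and column names are hypothetical, and it assumes geometries in the curated Parquet are stored as WKT, which Athena’s ST_GeometryFromText can parse.

```python
import awswrangler as wr

# Runs in Athena against curated Parquet registered in the Glue Data
# Catalog; results come back as a Pandas DataFrame for quick iteration.
df = wr.athena.read_sql_query(
    sql="""
        SELECT survey_id,
               ST_Area(ST_GeometryFromText(geometry_wkt)) AS flood_area
        FROM flood_surveys
        WHERE survey_year = 2023
    """,
    database="geo_datalake_curated",
)
print(df.head())
```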
CI/CD Pipeline Implementation: To ensure consistency and enable efficient delivery of changes to infrastructure and related software, we defined components via infrastructure-as-code (IaC) and created CI/CD pipelines to automate the deployment of new and updated data processing workflows. We worked closely with the PLUS team to ensure this process was secure, consistent, and repeatable, in line with their existing practices for managing cloud-based infrastructure.
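A minimal sketch of such a pipeline, using CDK Pipelines in Python, might look like the following; the repository, branch, and connection ARN are placeholders rather than PLUS’s actual configuration.

```python
from aws_cdk import Stack, pipelines
from constructs import Construct


class DataLakePipelineStack(Stack):
    """Self-mutating CDK pipeline that redeploys workflows on each push."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        pipelines.CodePipeline(
            self, "Pipeline",
            synth=pipelines.ShellStep(
                "Synth",
                # Placeholder repository and CodeStar connection ARN.
                input=pipelines.CodePipelineSource.connection(
                    "example-org/geo-datalake", "main",
                    connection_arn="arn:aws:codestar-connections:REGION:ACCOUNT:connection/PLACEHOLDER",
                ),
                commands=[
                    "pip install -r requirements.txt",
                    "npx cdk synth",
                ],
            ),
        )
```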
By leveraging AWS data analytics services and incorporating best practices in data engineering to build in quality, we were able to create a scalable, efficient, and reliable Data Lake that supports the current and future geospatial requirements of the Planning and Land Use Services Division.
The project yielded a multitude of outcomes, each unlocking specific benefits for the business:
Automation and Workflow Reliability: Enhancing the reliability, efficiency, and reproducibility of workflows yields greater confidence in the accuracy and completeness of data, driving further adoption of these datasets.
Data Standardisation and Accessibility: Storing data in standardised, open formats has facilitated ad-hoc exploration and processing, unlocking the expanded use of data that was previously challenging to work with.
Efficient Data Quality Assurance: Early feedback on data quality issues results in faster resolution, ensuring the integrity of data and limiting the impact on critical downstream processes.
Centralised Data Repository: Ensuring data integrity, accessibility, and availability provides an unambiguous single source of truth for curated data and services.
Data Compliance and Security: Alignment with organisational standards and statutory and regulatory obligations ensures the authoritative status of DIT & PLUS data is upheld, maintaining the trust necessary to support regulatory processes.
Query Performance Optimisation: Using data structures optimised for query performance saves analysts valuable time and reduces the cost of data processing.
Efficient Data Onboarding: A library of reusable, highly configurable AWS CDK constructs continues to simplify and expedite the onboarding of new data sources.
Accelerated Workflow Development: Rapid creation of transformation jobs from reusable, templated patterns, combined with deployment of data processing workflows via CI/CD pipelines, provides the consistency and reproducibility needed to protect data integrity while reducing overall development time.
Data Security and Access Control: Simplifying user access provisioning and de-provisioning through SSO, and enabling row- and column-based access control, reduces the friction in getting the right people access to the right data in a timely fashion.
Audit Trails for Data Transparency: Tamper-proof audit trails provide transparency over data access, protect the integrity of the data, and support root cause analysis when troubleshooting issues.
With the successful implementation of the Geospatial Data Lake, KZN is strategically positioned to embark on future projects involving intricate data management, automation, and cloud integration. Our experience with geospatial solutions, AWS and cloud-native architecture, and data quality assurance can be leveraged to address the unique data needs of organisations facing similar challenges. Contact us to explore how we can contribute to your data-driven success. Learn more about how KZN can assist you with your Data and Analytics ambitions here.