Cloud Native OGC SensorThings API: GSoC Mid Term Blog

Introduction

Hello everyone! Welcome back to the blog series on my GSoC 2024 project – Cloud Native OGC SensorThings API.

I have been working on adding a new module (sta-dao-cloudnative) to 52° North’s sensorweb-server-sta project. This module will store and retrieve IoT sensor data in a cloud native file format using the OGC SensorThings API (STA) standard. The past 7 weeks have been quite productive, as I was able to complete the tasks that I had planned.

The sta-dao-cloudnative module has a lot in common with the existing sta-dao-postgres module. Essentially, it is a migration of the sta-dao-postgres module from the Java Persistence API (commonly called JPA) to the jOOQ Object Oriented Querying library (popularly known as jOOQ).

dao-cloudnative class hierarchy

Progress so far

Due to the experimental nature of the project, a lot of time went into deciding the dependencies. Many of the technologies I am going to be working with are still in their early stages and have only recently released their first stable versions (DuckDB 1.0, GeoParquet 1.1). Here is a weekly breakdown of the work that kept me busy since the coding period officially started (May 27th, 2024).

Week 1

The week kicked off with the publication of my first blog post, briefly describing my project along with some necessary background information. I did some debugging to understand the code flow and gauge the scope of my planned tasks, as we had not yet finalized the libraries to use for the project. Unfortunately, the week wrapped up rather early for me as I got sick from all the traveling that I did in the days prior.

Week 2 and 3

I spent the majority of my time evaluating the feasibility of several libraries and frameworks for implementing my project. Since both GeoParquet and DuckDB are in their early stages, the ecosystem of tools that supports them is still developing. I shared my findings with my mentor.

Data Streaming Platform

Apache Kafka	Apache Pulsar	AWS Firehose
High throughput and low latency streaming platform. Offers horizontal scaling	High throughput and low latency streaming and queuing platform. Offers geo-replication for optimizing performance at scale	The service is fully managed by AWS but offers some throughput/latency parameters that can be configured
Requires custom connector to be developed for converting to GeoParquet	Requires custom connector for GeoParquet	Requires custom AWS Lambda function to convert to GeoParquet
Open source, mature ecosystem and no vendor lock-in. Cost depends on the deployment model	Open source, growing ecosystem and no vendor lock-in. Cost depends on the deployment model	Tied to AWS ecosystem and support offered through AWS. Cost depends on number of topics and messaging throughput
Complex to set up and maintain	Complex to set up and maintain	Easy to set up and maintain

Query Engine

Apache Sedona	DuckDB
Highly efficient and distributed open source query engine. Optimized for large scale geospatial data to support analytical queries	Not really a ‘query engine’. Native columnar storage and open source database. With indexing support, allows fast reads/writes and complex SQL querying
Read and write support for GeoParquet from object storage. Supports read/write for the Iceberg (and Havasu spatial) table format.	No read/write support for GeoParquet, but can read Parquet files directly from object storage using the S3 plugin (Parquet files are easy to convert to GeoParquet with open source converter tools). Supports reading the Iceberg table format (no write support yet).
Must be deployed as a standalone service	Works in process as an embedded database
Only JDBC drivers supported, no SQL builder libraries	JDBC drivers and jOOQ support available
Complex to set up and maintain	Easy to set up and maintain

Persistence Library

jOOQ	Custom Builder
Type safe, object oriented querying with support for a large number of databases and SQL dialects	Since JPA does not have support for DuckDB or Sedona, only a custom builder is an alternative to jOOQ
Large feature set, customizable, open source community support	A custom builder cannot support diverse dialects and extensible features out of the box, as it would be specific to the storage technology being used
Easy to set up and integrate	Challenging to develop a custom and robust builder within GSoC timelines

After discussing the pros and cons of each, we decided to use:

AWS Firehose as the streaming platform
AWS S3 as the object store
DuckDB for transient storage during read operations
jOOQ as the querying interface with Java
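
To make the role of DuckDB in this stack a bit more concrete, here is a minimal sketch of the kind of SQL an embedded DuckDB instance could run to read Parquet files straight from S3. The class name, bucket name and prefix are made up for the example, and the module itself may wire this up differently; `httpfs` and `read_parquet` are DuckDB's own extension and table function.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the module): builds the DuckDB
// statements needed to scan Parquet files stored in S3.
final class ParquetScanSketch {

    // Statements that install and load DuckDB's httpfs extension,
    // which provides access to s3:// paths.
    static List<String> setupStatements() {
        List<String> sql = new ArrayList<>();
        sql.add("INSTALL httpfs");
        sql.add("LOAD httpfs");
        return sql;
    }

    // SELECT over all Parquet files under a prefix.
    // The bucket and prefix are assumed names for illustration only.
    static String scanQuery(String bucket, String prefix) {
        String path = "s3://" + bucket + "/" + prefix + "/*.parquet";
        // Double any single quotes so the path stays a valid SQL string literal.
        return "SELECT * FROM read_parquet('" + path.replace("'", "''") + "')";
    }

    public static void main(String[] args) {
        setupStatements().forEach(System.out::println);
        System.out.println(scanQuery("sta-data", "observations"));
    }
}
```

Each statement would be executed over a DuckDB JDBC connection, so nothing needs to be imported into a long-lived database before querying.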

Week 4

I finally started writing code this week and organized the tasks to work on. The sta-dao-cloudnative module consists of the following submodules:

service layer (dao): responsible for interfacing with the data store and performing R/W operations
query layer (conditions): responsible for generating custom SQL predicates

I planned to complete the query layer submodule before the mid-term evaluation and work on the service layer in the second half of GSoC. The query layer needed 8 STA specific entity classes and 2 supporting classes. Each entity class and its Hibernate mapping had to be translated to equivalent SQL code. I set up the PersistenceConfig class, which is responsible for setting up a connection pool and configuring jOOQ and DuckDB to work with spatial data. I also set up a Spring Boot profile using the CloudNativeDaoLoader class, which makes it easier to load only the required modules and dependencies. In addition, I implemented the EntityQueryConditions class, which acts as the base class for all the STA entity specific query predicate builder classes.
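
The relationship between a base class and an entity specific predicate builder can be sketched roughly as follows. This is a plain Java illustration that assembles SQL strings by hand; the actual module builds its predicates with jOOQ, and the class names ending in `Sketch`, the table and column names, and the method names are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a base class: shared logic for building SQL predicates.
abstract class EntityQueryConditionsSketch {

    // Table that the concrete entity maps to (assumed name).
    abstract String table();

    // Columns that may be filtered on; anything else is rejected.
    abstract Set<String> filterableColumns();

    // Builds "table.column <op> 'value'" for an allowed column and operator.
    String compare(String column, String operator, String value) {
        if (!filterableColumns().contains(column)) {
            throw new IllegalArgumentException("Unknown property: " + column);
        }
        if (!Arrays.asList("=", "<>", "<", "<=", ">", ">=").contains(operator)) {
            throw new IllegalArgumentException("Unsupported operator: " + operator);
        }
        return table() + "." + column + " " + operator + " " + quote(value);
    }

    // Predicate matching a single entity by its identifier.
    String withId(String id) {
        return table() + ".id = " + quote(id);
    }

    // Wraps a value in single quotes, doubling any embedded quotes.
    private static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }
}

// Hypothetical entity specific subclass for the STA Thing entity.
final class ThingConditionsSketch extends EntityQueryConditionsSketch {
    @Override
    String table() {
        return "thing";
    }

    @Override
    Set<String> filterableColumns() {
        return new HashSet<>(Arrays.asList("id", "name", "description"));
    }
}
```

With this shape, each new entity only has to declare its table and filterable columns, while validation and quoting live in one place. In jOOQ based code, bind values would normally replace the manual quoting shown here.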

Week 5

During this week, I implemented the following classes:

ThingQueryConditions
HistoricalLocationQueryConditions
ObservedPropertyQueryConditions
SensorQueryConditions

Each of these classes corresponds to an STA entity and generates custom SQL predicates for filtering that entity based on the OData queries in incoming HTTP requests.

Week 6

I implemented the following classes this week:

DatastreamQueryConditions
LocationQueryConditions
FeatureQueryConditions
ObservationQueryConditions
QueryConditionRepository
SpatialQueryConditions

All the classes except the last two correspond directly to entities defined in the STA data model. Of the last two, QueryConditionRepository acts as a supporting class for interfacing with other layers, and SpatialQueryConditions enables geospatial querying. I also implemented the FilterExprVisitor class from the service layer, which handles the OData filter predicates and invokes the methods of the entity specific query condition objects.
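
To illustrate how a filter visitor could turn OData predicates into SQL, here is a small self-contained sketch. The expression types, the visitor interface and the property names are hypothetical and far simpler than the real FilterExprVisitor; `ST_Intersects` and `ST_GeomFromText` are functions provided by DuckDB's spatial extension.

```java
// Tiny expression tree standing in for a parsed OData $filter (hypothetical types).
interface FilterExpr {
    <T> T accept(FilterVisitorSketch<T> visitor);
}

// One visit method per kind of expression node.
interface FilterVisitorSketch<T> {
    T visitComparison(String property, String operator, String literal);
    T visitAnd(FilterExpr left, FilterExpr right);
    T visitIntersects(String property, String wkt);
}

final class ComparisonExpr implements FilterExpr {
    final String property, operator, literal;
    ComparisonExpr(String property, String operator, String literal) {
        this.property = property; this.operator = operator; this.literal = literal;
    }
    public <T> T accept(FilterVisitorSketch<T> v) {
        return v.visitComparison(property, operator, literal);
    }
}

final class AndExpr implements FilterExpr {
    final FilterExpr left, right;
    AndExpr(FilterExpr left, FilterExpr right) { this.left = left; this.right = right; }
    public <T> T accept(FilterVisitorSketch<T> v) { return v.visitAnd(left, right); }
}

final class IntersectsExpr implements FilterExpr {
    final String property, wkt;
    IntersectsExpr(String property, String wkt) { this.property = property; this.wkt = wkt; }
    public <T> T accept(FilterVisitorSketch<T> v) { return v.visitIntersects(property, wkt); }
}

// Visitor that renders the tree as a SQL predicate string.
// Property names are assumed to be validated before they reach this class.
final class SqlFilterVisitorSketch implements FilterVisitorSketch<String> {

    // Maps OData comparison operators to their SQL counterparts.
    private static String sqlOperator(String odataOperator) {
        switch (odataOperator) {
            case "eq": return "=";
            case "ne": return "<>";
            case "gt": return ">";
            case "ge": return ">=";
            case "lt": return "<";
            case "le": return "<=";
            default: throw new IllegalArgumentException("Unsupported operator: " + odataOperator);
        }
    }

    // Wraps a literal in single quotes, doubling any embedded quotes.
    private static String quote(String literal) {
        return "'" + literal.replace("'", "''") + "'";
    }

    public String visitComparison(String property, String operator, String literal) {
        return property + " " + sqlOperator(operator) + " " + quote(literal);
    }

    public String visitAnd(FilterExpr left, FilterExpr right) {
        return "(" + left.accept(this) + " AND " + right.accept(this) + ")";
    }

    public String visitIntersects(String property, String wkt) {
        return "ST_Intersects(" + property + ", ST_GeomFromText(" + quote(wkt) + "))";
    }
}
```

A filter such as `name eq 'Station A' and st_intersects(location, geography'POINT (7.6 51.9)')` would first be parsed into an `AndExpr` of a `ComparisonExpr` and an `IntersectsExpr`; calling `accept` with the visitor then produces the SQL predicate in a single pass over the tree.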

Week 7

I refactored the code to fix checkstyle plugin errors and improved the code formatting. The open source edition of jOOQ requires Java 17 for DuckDB support, and Benjamin helped me resolve the build issues related to dependency convergence that followed. Once that was done, I committed all the code that I had been working on and published this blog post!

code commit

Closing Notes

Overall, I enjoyed evaluating various tech stacks and writing a lot of code. My next tasks are as follows:

Implement the service layer
Set up AWS Firehose, schema configuration and Lambda functions
Block and Unit Testing
Performance Testing

See y’all in my next blog post!