Introduction
Hello everyone! Welcome back to the blog series on my GSoC 2024 project – Cloud Native OGC SensorThings API.
I have been working on implementing a new module (sta-dao-cloudnative) into 52° North’s sensorweb-server-sta project. This module will store and retrieve IoT sensor data in a cloud native file format using the OGC SensorThings API (STA) standard. The past 7 weeks have been quite productive as I was able to complete the tasks that I planned for.
The sta-dao-cloudnative module shares many similarities with the existing sta-dao-postgres module. Essentially, it is a migration of the sta-dao-postgres module from the Java Persistence API (JPA) to jOOQ (Java Object Oriented Querying).
Progress so far
Due to the experimental nature of the project, a lot of time went into deciding on the dependencies. Many of the technologies I am working with are still in their early stages and have only recently published stable releases (DuckDB 1.0, GeoParquet 1.1). Here is a weekly breakdown of the work that has kept me busy since the coding period officially started on May 27th, 2024.
Week 1
The week kicked off with the publication of my first blog post, which briefly described my project along with some necessary background information. Since we had not yet finalized the libraries to use for the project, I did some debugging to understand the code flow and gauge the scope of my planned tasks. Unfortunately, the week wrapped up rather early for me as I got sick from all the traveling that I did in the days prior.
Week 2 and 3
I spent the majority of my time evaluating the feasibility of several libraries and frameworks for implementing my project. Since both GeoParquet and DuckDB are in their early stages, the ecosystem that supports these technologies is still in development. I shared the following findings with my mentor.
Data Streaming Platform
Apache Kafka | Apache Pulsar | AWS Firehose |
---|---|---|
High throughput and low latency streaming platform. Offers horizontal scaling | High throughput and low latency streaming and queuing platform. Offers geo-replication for optimizing performance at scale | The service is fully managed by AWS but offers some throughput/latency parameters that can be configured |
Requires custom connector to be developed for converting to GeoParquet | Requires custom connector for GeoParquet | Requires custom AWS Lambda function to convert to GeoParquet |
Open source, mature ecosystem and no vendor lock-in. Cost depends on the deployment model | Open source, growing ecosystem and no vendor lock-in. Cost depends on the deployment model | Tied to AWS ecosystem and support offered through AWS. Cost depends on number of topics and messaging throughput |
Complex to set-up and maintain | Complex to set-up and maintain | Easy to set-up and maintain |
Query Engine
Apache Sedona | DuckDB |
---|---|
Highly efficient and distributed open source query engine. Optimized for large scale geospatial data to support analytical queries | Not really a ‘query engine’. Native columnar storage and open source database. With indexing support, allows fast reads/writes and complex SQL querying |
Read and Write support for GeoParquet from object storage. Supports read/write for Iceberg (and Havasu spatial) table format. | No read/write support for GeoParquet, but offers functionality to read Parquet files directly from object storage using S3 plugin. (Easy to convert Parquet files to GeoParquet using open source converter tools) Supports read for Iceberg table format (no write support yet). |
Must be deployed as a standalone service | Runs in-process as an embedded database |
Only JDBC drivers supported, no SQL builder libraries | JDBC drivers and jOOQ support available |
Complex to set-up and maintain | Easy to set-up and maintain |
Persistence Library
jOOQ | Custom Builder |
---|---|
Type safe, object oriented querying with support for a large number of databases and SQL dialects | Since JPA does not have support for DuckDB or Sedona, only a custom builder is an alternative to jOOQ |
Large feature set, customizable, open source community support | A custom builder cannot support diverse dialects and extensible features out of the box as it would be specific to the storage technology being used |
Easy to set up and integrate | Challenging to develop a custom and robust builder within GSoC timelines |
After discussing the pros and cons of each, we decided to use:
- AWS Firehose as the streaming platform
- AWS S3 as the object store
- DuckDB for transient storage during read operations
- jOOQ as the querying interface with Java
Week 4
I finally started writing code this week and organized the tasks to work on. The sta-dao-cloudnative module consists of the following submodules:
- service layer (dao): responsible for interfacing with the data store and performing R/W operations
- query layer (conditions): responsible for generating custom SQL predicates
I planned to complete the query layer submodule before the mid-term evaluation and work on the service layer in the second half of GSoC. The query layer required 8 STA-specific entity classes and 2 supporting classes. Each of the entity classes and their Hibernate mappings had to be translated to equivalent SQL code. I set up the PersistenceConfig class, which is responsible for setting up a connection pool and configuring jOOQ and DuckDB to work with spatial data. I also set up a Spring Boot profile using the CloudNativeDaoLoader class, which makes it easier to load only the required modules and dependencies. In addition, I implemented the EntityQueryConditions class, which acts as the base class for all other STA entity-specific query predicate builder classes.
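As a rough illustration of what preparing DuckDB for spatial data can involve, the sketch below runs DuckDB's `INSTALL` and `LOAD` statements for the `spatial` and `httpfs` extensions over a plain JDBC connection. The class and method names are hypothetical and are not taken from the actual PersistenceConfig class; the connection pool and the jOOQ configuration that the real class handles are left out.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Hypothetical helper for illustration only, not the project's PersistenceConfig.
final class DuckDbSpatialSetup {

    // DuckDB extension statements:
    // "spatial" adds geometry types and ST_* functions,
    // "httpfs" adds reading files from S3-compatible object storage.
    static List<String> initStatements() {
        return List.of(
            "INSTALL spatial",
            "LOAD spatial",
            "INSTALL httpfs",
            "LOAD httpfs");
    }

    // Runs the setup statements, in order, on an already opened connection.
    static void configure(Connection connection) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            for (String sql : initStatements()) {
                statement.execute(sql);
            }
        }
    }
}
```

With a DuckDB JDBC driver on the classpath, `configure` would be called once per new connection, since extensions are loaded per database instance.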
Week 5
During this week, I implemented the following classes:
- ThingQueryConditions
- HistoricalLocationQueryConditions
- ObservedPropertyQueryConditions
- SensorQueryConditions
Each of these classes corresponds to an STA entity and generates custom SQL predicates for filtering that entity, based on the OData queries in incoming HTTP requests.
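To make the idea of an entity-specific predicate builder more concrete, here is a minimal sketch in plain Java. It is not the actual ThingQueryConditions class, which is built on jOOQ; the class name, the column names, and the use of raw SQL strings with `?` placeholders are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for an entity-specific conditions class.
final class ThingConditionsSketch {

    // Allowed STA property names mapped to column names (column names are assumed).
    private static final Map<String, String> COLUMNS = Map.of(
        "name", "name",
        "description", "description");

    // OData comparison operators mapped to their SQL counterparts.
    private static final Map<String, String> OPERATORS = Map.of(
        "eq", "=", "ne", "<>",
        "gt", ">", "ge", ">=",
        "lt", "<", "le", "<=");

    // Values are collected separately and bound later, never concatenated into SQL.
    final List<Object> bindValues = new ArrayList<>();

    // Builds one SQL predicate such as "name = ?" for an OData comparison.
    String comparison(String property, String odataOperator, Object value) {
        String column = COLUMNS.get(property);
        String operator = OPERATORS.get(odataOperator);
        if (column == null || operator == null) {
            throw new IllegalArgumentException(
                "Unsupported filter: " + property + " " + odataOperator);
        }
        bindValues.add(value);
        return column + " " + operator + " ?";
    }
}
```

For the OData filter `name eq 'Oven'`, calling `comparison("name", "eq", "Oven")` returns `name = ?` and records `Oven` as the bind value. Looking the property up in a fixed map means unknown or malicious property names are rejected instead of reaching the SQL text.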
Week 6
I implemented the following classes this week:
- DatastreamQueryConditions
- LocationQueryConditions
- FeatureQueryConditions
- ObservationQueryConditions
- QueryConditionRepository
- SpatialQueryConditions
All of these classes, except the last two, correspond directly to entities defined in the STA data model. QueryConditionRepository acts as a supporting class for interfacing with other layers, and SpatialQueryConditions enables geospatial querying. I also implemented the FilterExprVisitor class from the service layer, which handles the OData filter predicates and invokes the methods of the entity-specific query condition objects.
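The visitor approach mentioned above can be sketched in a few lines of plain Java. This is only an illustration of the pattern, not the real FilterExprVisitor: the expression types, the `SqlVisitor` class, and the string-based SQL output are hypothetical, and the actual implementation delegates to the query condition classes and jOOQ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a visitor that turns a parsed OData filter into SQL.
final class FilterVisitorSketch {

    interface Expr {
        <T> T accept(Visitor<T> visitor);
    }

    interface Visitor<T> {
        T visitComparison(Comparison comparison);
        T visitLogical(Logical logical);
    }

    // Leaf node, e.g. name eq 'Oven'
    static final class Comparison implements Expr {
        final String property;
        final String operator;
        final Object value;

        Comparison(String property, String operator, Object value) {
            this.property = property;
            this.operator = operator;
            this.value = value;
        }

        @Override
        public <T> T accept(Visitor<T> visitor) {
            return visitor.visitComparison(this);
        }
    }

    // Inner node combining two sub-expressions with AND or OR.
    static final class Logical implements Expr {
        final String sqlKeyword;
        final Expr left;
        final Expr right;

        private Logical(String sqlKeyword, Expr left, Expr right) {
            this.sqlKeyword = sqlKeyword;
            this.left = left;
            this.right = right;
        }

        static Logical and(Expr left, Expr right) {
            return new Logical("AND", left, right);
        }

        static Logical or(Expr left, Expr right) {
            return new Logical("OR", left, right);
        }

        @Override
        public <T> T accept(Visitor<T> visitor) {
            return visitor.visitLogical(this);
        }
    }

    // Walks the tree depth-first and emits a parameterized SQL predicate.
    static final class SqlVisitor implements Visitor<String> {
        final List<Object> bindValues = new ArrayList<>();

        @Override
        public String visitComparison(Comparison c) {
            // A real implementation would map c.property to a known column first.
            String sql = c.property + " " + sqlOperator(c.operator) + " ?";
            bindValues.add(c.value);
            return sql;
        }

        @Override
        public String visitLogical(Logical l) {
            String left = l.left.accept(this);
            String right = l.right.accept(this);
            return "(" + left + " " + l.sqlKeyword + " " + right + ")";
        }

        private static String sqlOperator(String odataOperator) {
            switch (odataOperator) {
                case "eq": return "=";
                case "ne": return "<>";
                case "gt": return ">";
                case "ge": return ">=";
                case "lt": return "<";
                case "le": return "<=";
                default:
                    throw new IllegalArgumentException("Unsupported operator: " + odataOperator);
            }
        }
    }
}
```

Under these assumptions, the filter `name eq 'Oven' and id gt 5` is represented as `Logical.and(new Comparison("name", "eq", "Oven"), new Comparison("id", "gt", 5))`, and visiting it yields `(name = ? AND id > ?)` with the bind values `Oven` and `5`.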
Week 7
I refactored the code to fix the errors reported by the checkstyle plugin and improved the code formatting. Because the open source edition of jOOQ requires Java 17 for DuckDB support, the build ran into dependency convergence issues, which Benjamin helped me resolve. Once that was done, I committed all the code that I had been working on and published this blog post! 🙂
Closing Notes
Overall, I enjoyed my time evaluating various tech stacks and writing a lot of code. My next tasks are:
- Implement the service layer
- Set up AWS Firehose, schema configuration and Lambda functions
- Block and Unit Testing
- Performance Testing
See y’all in my next blog post 🚀