Introduction
Hello everyone,
It’s been a long while since my last blog post. I am excited to share that my GSoC 2024 project of developing an open, interoperable and cloud native OGC SensorThings API has finally seen the light of day! It’s been an incredible journey of 20 weeks filled with ups and downs, but as the saying goes, all’s well that ends well.
Continuing from my mid-term blog post, I had ‘evaluated’ the libraries, tools and frameworks to use for implementing the end-to-end cloud native persistence layer. However, as I proceeded with using some of these tools, it became apparent that I would have to change gears and look for alternative options. More on that below.
Challenges
Perhaps the most challenging part of the project was working on the service package, as this package acts as the interface between external HTTP requests and the internal dao and conditions packages. It was essentially the coming together of all the moving parts of the project. The following section describes the challenges I encountered and how I tackled them.
- Unique Id per entity: Since my architecture does not use any database software, there was no auto-increment feature available right off the bat. So I had to decide on a 64-bit unique identifier for every entity inserted into the data store. It had to be stateless, quick to generate, and guarantee uniqueness with high probability. It ultimately boiled down to two choices:
a. UUID (alphanumeric)
b. Timestamp (numeric)
I decided to use a numeric timestamp instead of UUID for the following reasons:
- Joining, scanning and filtering tables on string-based identifiers is significantly slower than on numeric ones.
- A numeric timestamp is stateless, quick to calculate from system time, and offers up to 1 million unique values per second. Since the highest frequency of records will almost always be generated for the Observation entity, an atomic long variable seemed able to handle this fairly well (a minimal sketch follows this list). It also naturally orders the records.
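To make the idea concrete, here is a minimal sketch of such a generator, assuming microsecond-resolution timestamps and a single atomic counter per JVM; the class and method names are illustrative and not the project’s actual implementation:

import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch (hypothetical class name): returns a strictly increasing
// 64-bit ID derived from the current time. Each millisecond provides 1,000
// microsecond "slots", i.e. up to ~1 million unique IDs per second.
public final class TimestampIdGenerator {

    private static final AtomicLong LAST_ID = new AtomicLong(Long.MIN_VALUE);

    public static long nextId() {
        long candidate = System.currentTimeMillis() * 1_000L; // microsecond slot
        // If several calls land on the same slot, bump past the previous value
        // so uniqueness and natural ordering are preserved.
        return LAST_ID.updateAndGet(previous -> Math.max(previous + 1, candidate));
    }
}

Because the value is derived from wall-clock time, the generated IDs also sort in roughly insertion order, which matches the natural-ordering benefit mentioned above.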
- Mapping fetched records to POJOs: The jOOQ library offers autogenerated POJO classes, which are quite helpful for record mapping; however, the unique aspect of 52°North’s STA implementation is its OGC SOS interoperability. This means that the database schema and the OGC STA data model are not the same.
For example: unitOfMeasurement is a nested attribute of the Datastream entity as per the OGC STA v1.1 data model, but as per 52°North’s schema it is defined as a separate table of its own and must be joined with the Dataset (Datastream) table when fetching a Datastream. sta-dao-postgres handles this by defining JPA entities with the relevant XML-based mappings to the joined table fields. Every STA entity is parsed into its corresponding StaDTO class and then converted into one of these JPA entities before being processed for inserts/updates/deletes. The autogenerated POJO classes, at best, allow fetching such nested records via SQL’s multiset operator, which unfortunately is not supported by all SQL dialects. Using the StaDTO classes alone was not an issue, but wherever I needed to access fields unique to 52°North’s database schema, I had to map the record to the autogenerated POJO class first in order to reach them.
Finally, I went ahead with using both the autogenerated POJO classes and the StaDTO entity classes together in my code wherever I needed access to all the fields. This makes the code not so ‘clean’, but it was the quickest solution in the interest of the GSoC timeline. Another consequence of mapping fetched records to autogenerated POJO classes was that I had to alias fields sharing the same name when joining two or more tables. This was trivial, as my code delegates all SQL query creation to a common function shared across all DAO objects (illustrated in the jOOQ sketch further below).
- Poor Iceberg support in DuckDB: I initially decided to use DuckDB because of its active open source community and support for extensions. It offered support for Iceberg tables, and with the latest release (v1.1) it also supports native reads/writes of the GeoParquet format. With the OGC SensorThings API, however, we are dealing with a large volume of writes, and (Geo)Parquet by itself is not suitable as a storage format for handling writes/updates/deletes. This meant using a table format such as Apache Iceberg, which provides an abstraction on top of the .parquet files and offers transactional support for creates, reads, updates and deletes. There is no native support for a Geometry type in Iceberg yet, although the open source Iceberg community is actively working to include it in the next release of the Iceberg spec (v3). Firstly, the JDBC driver for DuckDB throws an exception when reading WKB. I bypassed this by using
ST_AsText(ST_GeomFromWKB(geom_column))
functions. However, this would impact performance. DuckDB’s support for Iceberg, though much publicized, was not pleasing. It expects the table metadata to be written to a file named version-hint, and if its Iceberg scanner is unable to find it, it fails to read the table. Fortunately, there is an easy workaround: fetching the table metadata via an S3 API call. Nevertheless, this certainly impacts performance, as every table read requires at least one extra HTTP request.
The bigger problem, however, was the lack of support for reading tables that have been updated or deleted via the Merge-on-Read (MoR) strategy. This means that if a table’s records are updated or deleted, DuckDB simply fails to scan the table and fetch its records. I created an issue on their GitHub to track this, as it was surprisingly not documented already (GitHub #67).
For all these reasons, I decided to switch to AWS Athena. It is a managed Trino offering by AWS that does not require manual cluster management, provides all the geospatial functions, and is serverless. Though it does not scale past its usage limits, which restrict users to only 20 reads/second, it is a decent tool for prototyping the proposed cloud native architecture. It can also be seamlessly replaced with a self-hosted Trino cluster with very few changes to the code.
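To illustrate the shared query-building approach and the WKB workaround mentioned above, here is a minimal jOOQ sketch; the table and column names (dataset, unit, geom, fk_unit_id) are hypothetical stand-ins rather than the project’s actual schema or DAO code:

import static org.jooq.impl.DSL.*;

import org.jooq.DSLContext;
import org.jooq.Field;
import org.jooq.Record;
import org.jooq.Result;

public class DatastreamQuerySketch {

    // Fetch Datastreams joined with their unit of measurement.
    public Result<? extends Record> fetchDatastreams(DSLContext ctx) {
        // Alias columns that clash across the joined tables ("id" exists in both).
        Field<Long> datasetId = field(name("dataset", "id"), Long.class).as("dataset_id");
        Field<Long> unitId = field(name("unit", "id"), Long.class).as("unit_id");

        // Plain SQL templating wraps the WKB geometry so the JDBC driver
        // receives WKT text instead of raw WKB bytes.
        Field<String> observedAreaWkt =
                field("ST_AsText(ST_GeomFromWKB({0}))", String.class, field(name("dataset", "geom")))
                        .as("observed_area_wkt");

        return ctx.select(datasetId, unitId, field(name("unit", "symbol")), observedAreaWkt)
                  .from(table(name("dataset")))
                  .join(table(name("unit")))
                  .on(field(name("dataset", "fk_unit_id"), Long.class)
                          .eq(field(name("unit", "id"), Long.class)))
                  .fetch();
    }
}

Keeping this construction in one shared helper means every DAO gets consistent aliasing and geometry handling without repeating the boilerplate.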
- Java 17: The sta-dao-cloudnative module required dependencies that mandate at least Java 11. The entire codebase, however, was still using Java 8. This meant that my code could not be compiled into a jar binary unless the entire codebase was migrated to Java 11+. I am very grateful to my mentor Benjamin Pross for chiming in and offering to handle this complex migration, taking a lot of the workload off me. He upgraded the entire source code to Java 17, which not only helped me but also served as a great contribution to the codebase itself.
Results
With all those issues tackled, I was finally able to put together a working prototype of my project using the following stack:
- AWS Firehose – streaming ingestion to Iceberg tables (create/update/delete)
- AWS S3 – BLOB storage of Iceberg tables
- AWS Athena – managed query engine to read Iceberg tables and manage table compaction
- jOOQ – incremental SQL query builder
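To give a rough idea of how the read path fits together, here is a minimal sketch that wires jOOQ to Athena over JDBC; the connection URL, properties and table/column names are placeholders (and an Athena JDBC driver must be on the classpath), so treat this as an assumption-laden illustration rather than the project’s actual configuration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

import org.jooq.DSLContext;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;

public class AthenaReadSketch {

    public static void main(String[] args) throws Exception {
        // Placeholder connection details; the exact URL format and properties
        // depend on the Athena JDBC driver version, AWS region and workgroup.
        String url = "jdbc:awsathena://AwsRegion=eu-central-1";
        Properties props = new Properties();
        props.put("S3OutputLocation", "s3://my-athena-query-results/");

        try (Connection connection = DriverManager.getConnection(url, props)) {
            // Athena speaks a Trino/Presto-flavoured SQL dialect, so jOOQ's
            // TRINO dialect is a reasonable approximation for query rendering.
            DSLContext ctx = DSL.using(connection, SQLDialect.TRINO);

            // Hypothetical Iceberg-backed table holding Observation records.
            ctx.select(DSL.field("result_time"), DSL.field("result_value"))
               .from(DSL.table(DSL.name("observation")))
               .limit(10)
               .fetch()
               .forEach(System.out::println);
        }
    }
}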
Future work
While a lot has been achieved over the past 20 weeks, there is still a list of issues to be resolved and many knobs that can be fine-tuned. The ones I feel are most important are listed below:
- Support for MQTT protocol
- Modular code architecture that allows combining query engines/cloud services into a truly vendor-neutral cloud native storage architecture
- Code refactoring to completely remove the existing OGC SOS interoperable schema in favour of a dedicated STA schema, eliminating unnecessary join operations and optimizing performance
- Upgrade to Iceberg v3 with native support for Geo!
- Performance testing with large volumes of data (at least a few Terabytes)
- Partitioning and Clustering Iceberg tables for better query performance
- Comprehensive test suite and rigorous testing of OData $filter operations
Closing notes
I am filled with gratitude as GSoC officially comes to an end. Participating in it has been one of the things I have wanted to do since my undergrad days, and I am glad that I got the opportunity to be a part of it during my Master’s. I have learned a lot about OGC standards, lakehouse formats and cloud native development, and have upskilled both personally and professionally. I will take these learnings with me and continue being active in open source communities.
Finally, I wouldn’t have been able to make it without the support of my mentor Benjamin Pross, who always motivated me and supported me in my pursuits, as well as Jan Speckamp for helping me understand the codebase.
Feel free to connect with me on LinkedIn.