My experience with the GSoC 2017 project “Simple Features for Protobuf and others” came to an end with the presentation of my results at the Geospatial Sensor Webs Conference 2017. It was pleasing to see that a majority of the audience was genuinely interested in the project outcome and eager to see it progress in the future.
We had to deviate from the original project deliverables because we prioritized presenting the project at the Geospatial Sensor Webs Conference. The major outcomes of the project are as follows.
- API for serialization and deserialization of vector and raster data
- Schemas to represent vector data by following the OGC Simple Feature Specification
- Schemas to represent raster data by following the ISO Coverage Model
- Benchmarks of performance in terms of time and storage space, to identify comparison metrics
If you haven’t been following the project thus far, please refer to the following blog posts for more technical details:
The Final Phase
The core tasks and development of the last five weeks of the project period can be summarized as follows:
- Extend Avro serialization/deserialization to support more simple features
- Avro serialization/deserialization support for raster/coverage type data
- Benchmark tests and performance improvements
I will give a more detailed description of the project’s status below. A more technical description of the changes in the project can be found on the project wiki page and its corresponding GitHub repository.
Extend Avro serialization/deserialization to support more simple features
This is a continuation of the task “Avro serialization support for simple access features”. I have added serialization and deserialization support for the MultiPoint, MultiLineString, MultiPolygon, Line, LinearRing and Triangle models. Unit tests with sample data have also been added.
Another subtask was adding support for serializing and deserializing JTS models using both the vividsolutions and LocationTech JTS libraries. When serializing, users can provide either vividsolutions or LocationTech JTS object models, according to their preference. When deserializing, they can specify which of the two object models they need.
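As an illustration of what schema-driven serialization of a simple feature involves, the following sketch encodes a MultiPoint by hand using the rules of Avro's binary encoding (zigzag varint longs, little-endian doubles, arrays as counted blocks closed by a zero count). It is written in Python with only the standard library for brevity. The schema and the function names (`MULTIPOINT_SCHEMA`, `encode_multipoint`, `decode_multipoint`) are assumptions made for this example and are not the schemas or API of the project.

```python
import struct
from typing import List, Tuple

# Hypothetical schema for this example only: an array of (x, y) records.
MULTIPOINT_SCHEMA = {
    "type": "array",
    "items": {
        "type": "record",
        "name": "Point",
        "fields": [
            {"name": "x", "type": "double"},
            {"name": "y", "type": "double"},
        ],
    },
}


def encode_long(n: int) -> bytes:
    """Encode a 64-bit integer as an Avro long (zigzag, then base-128 varint)."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small codes
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an Avro long starting at pos; return (value, next position)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_multipoint(points: List[Tuple[float, float]]) -> bytes:
    """Encode points as one array block followed by the zero-count terminator."""
    out = bytearray()
    if points:
        out += encode_long(len(points))
        for x, y in points:
            # A record is the concatenation of its fields; doubles are
            # 8 bytes, little-endian.
            out += struct.pack("<dd", x, y)
    out += encode_long(0)  # a block with count zero ends the array
    return bytes(out)


def decode_multipoint(buf: bytes) -> List[Tuple[float, float]]:
    """Decode the output of encode_multipoint back into a list of points."""
    points: List[Tuple[float, float]] = []
    pos = 0
    while True:
        count, pos = decode_long(buf, pos)
        if count == 0:
            break
        if count < 0:
            # A negative count is followed by the block size in bytes.
            count = -count
            _, pos = decode_long(buf, pos)
        for _ in range(count):
            points.append(struct.unpack_from("<dd", buf, pos))
            pos += 16
    return points
```

Two points take 34 bytes in this encoding: one count byte, 32 bytes of coordinates and one terminating byte. Field names live in the schema rather than in the payload, which is where much of the saving over text formats comes from.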
Avro serialization/deserialization support for raster/coverage type data
GeoTIFF is an open file format based on the TIFF format. It is used as an interchange format for georeferenced raster data. Some metadata is needed to map the image's grid positions to real-world coordinates. This metadata is stored either in a separate world file or in the header of the image file itself. I have implemented serialization of coverage data for both options.
Figure 1 – Component view of Avro serialization of Raster/Coverage type data
When deserializing, grid rows and columns are transformed into real-world coordinates on the fly using the serialized data. One of the future tasks will be to provide the option of deserializing them into a preferred Coordinate Reference System.
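The grid-to-world transformation can be sketched as follows for the world file case. A world file conventionally holds six numbers, one per line, in the order A, D, B, E, C, F, which define an affine transform from pixel column and row to map coordinates. This is a minimal Python sketch using that convention; the names `WorldFile`, `parse_world_file` and `grid_to_world` are hypothetical and do not come from the project code.

```python
from typing import NamedTuple, Tuple


class WorldFile(NamedTuple):
    """The six world file parameters, in the order they appear in the file."""
    a: float  # pixel size in the x direction
    d: float  # rotation term contributing to y
    b: float  # rotation term contributing to x
    e: float  # pixel size in the y direction (usually negative)
    c: float  # x coordinate of the centre of the upper-left pixel
    f: float  # y coordinate of the centre of the upper-left pixel


def parse_world_file(text: str) -> WorldFile:
    """Parse the contents of a world file into its six parameters."""
    values = [float(token) for token in text.split()]
    if len(values) != 6:
        raise ValueError(f"expected 6 values, got {len(values)}")
    return WorldFile(*values)


def grid_to_world(wf: WorldFile, col: float, row: float) -> Tuple[float, float]:
    """Map a grid position (column, row) to real-world (x, y) coordinates."""
    x = wf.a * col + wf.b * row + wf.c
    y = wf.d * col + wf.e * row + wf.f
    return x, y


if __name__ == "__main__":
    # Example values only: 10 m pixels, no rotation, north-up image.
    wf = parse_world_file("10.0\n0.0\n0.0\n-10.0\n500005.0\n4999995.0\n")
    print(grid_to_world(wf, col=3, row=2))
```

Because the transform is only six numbers, it can travel with the serialized grid and be applied lazily when a cell is read, rather than storing a coordinate pair per cell.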
Benchmark tests and performance improvements
I have carried out a series of benchmark tests to evaluate the efficacy of Protobuf and Avro serialization/deserialization against existing geospatial data exchange formats such as GeoJSON and GML. I used the shapefile data sets of Belgium, the Netherlands and Spain given here, and used the GeoTools GML parser and GeoJSON plugin to generate the respective GML and GeoJSON files.
The hardware configuration used to execute the benchmark tests is as follows:
- Processor – Intel Core i5
- Memory – 8 GB
- Operating System – Linux Mint 18 64-bit.
I carried out benchmark tests to capture serialization/deserialization time and output storage space. The following graphs show the outcome.
Figure 2 – Comparing file sizes of different techniques for representing the same vector data set
As you can observe, the Avro and Protobuf files are much smaller than the GML and GeoJSON files, but slightly larger than the Shapefile. The reason is that Protobuf and Avro do not compress data efficiently internally; they are better understood as efficient serialization techniques. However, libraries such as Google Snappy can be used to compress their output further.
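The relationship between a text encoding, a compact binary encoding and an additional compression step can be illustrated with a small Python sketch. Everything in it is a stand-in chosen so that the example runs with the standard library alone: `json` stands in for GeoJSON, packed doubles stand in for a binary serialization, and `zlib` stands in for Snappy. The coordinates are synthetic, so the sizes it reports say nothing about the benchmark results above.

```python
import json
import struct
import zlib
from typing import Dict, List, Tuple

Coords = List[Tuple[float, float]]


def sample_coordinates(n: int = 1000) -> Coords:
    """Deterministic synthetic points on a regular lattice (not real data)."""
    return [(4.0 + (i % 64) / 1024, 50.0 + (i // 64) / 1024) for i in range(n)]


def as_text(coords: Coords) -> bytes:
    """Text encoding: a GeoJSON-like MultiPoint geometry."""
    doc = {"type": "MultiPoint", "coordinates": coords}
    return json.dumps(doc).encode("utf-8")


def as_binary(coords: Coords) -> bytes:
    """Binary encoding: every coordinate as an 8-byte little-endian double."""
    flat = [value for point in coords for value in point]
    return struct.pack(f"<{len(flat)}d", *flat)


def sizes(coords: Coords) -> Dict[str, int]:
    """Return the encoded size in bytes for each representation."""
    binary = as_binary(coords)
    return {
        "text": len(as_text(coords)),
        "binary": len(binary),
        "binary+zlib": len(zlib.compress(binary)),  # stand-in for Snappy
    }


if __name__ == "__main__":
    for name, size in sizes(sample_coordinates()).items():
        print(f"{name}: {size} bytes")
```

The binary form has a fixed cost of 16 bytes per point regardless of how many digits the coordinates have, and the general-purpose compressor then removes redundancy that the serialization format itself leaves in place.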
Figure 3 – Comparing serialize/deserialize time of Protobuf and Avro vs different vector data sets
For this comparison, I had to exclude the GML processing time: at 26193 ms for a 2.9 MB sample data set, it is far outside the range of the other techniques. The graph shows that the serialization/deserialization time for GeoJSON is much higher than for Avro and Protobuf, and that the average time Protobuf takes to both serialize and deserialize is slightly lower than Avro's.
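A measurement of this kind can be sketched as a small harness that times the serialize and deserialize steps of each format separately and records the output size. The Python sketch below is not the benchmark code used for the figures. The names `time_ms`, `benchmark` and `CODECS` are hypothetical, and the two codecs (JSON text and packed doubles) are stand-ins for the formats compared above.

```python
import json
import struct
import time
from statistics import median
from typing import Callable, Dict, List, Tuple

Coords = List[Tuple[float, float]]


def time_ms(fn: Callable[[], object], repeats: int = 5) -> float:
    """Run fn several times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return median(samples)


def pack_points(coords: Coords) -> bytes:
    """Stand-in binary codec: each coordinate as a little-endian double."""
    flat = [value for point in coords for value in point]
    return struct.pack(f"<{len(flat)}d", *flat)


def unpack_points(buf: bytes) -> Coords:
    flat = struct.unpack(f"<{len(buf) // 8}d", buf)
    return list(zip(flat[0::2], flat[1::2]))


# name -> (serialize, deserialize); both codecs are illustrative stand-ins.
CODECS: Dict[str, Tuple[Callable, Callable]] = {
    "json": (
        lambda coords: json.dumps(coords).encode("utf-8"),
        lambda buf: [tuple(point) for point in json.loads(buf)],
    ),
    "packed": (pack_points, unpack_points),
}


def benchmark(codecs, data: Coords, repeats: int = 5) -> Dict[str, dict]:
    """Time each codec on the same data and record the payload size."""
    results = {}
    for name, (encode, decode) in codecs.items():
        payload = encode(data)
        if decode(payload) != data:  # only compare codecs that round-trip
            raise ValueError(f"{name} does not round-trip the data")
        results[name] = {
            "serialize_ms": time_ms(lambda: encode(data), repeats),
            "deserialize_ms": time_ms(lambda: decode(payload), repeats),
            "bytes": len(payload),
        }
    return results


if __name__ == "__main__":
    points = [(4.0 + i * 0.001, 50.0 + i * 0.002) for i in range(50_000)]
    for name, row in benchmark(CODECS, points).items():
        print(name, row)
```

Taking the median of several runs reduces the influence of one-off effects such as warm-up or a busy machine, and the round-trip check guards against comparing a codec that silently loses data.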
The following are my conclusions from the benchmark results:
- Avro and Protobuf have the lowest data processing times, and further performance gains can be achieved by using streaming techniques.
- The average serialization/deserialization time of Protobuf is lower than that of Avro, which makes it useful for data exchange between servers.
- Avro uses less space, and Avro data files are splittable and block-compressible, which makes it useful for archiving spatial data. Avro is used in Apache Hadoop as a data storage option for this very use case.
- All these frameworks have built-in RPC support, but because they are binary formats they may not be a complete replacement for JSON or XML, especially for services that are consumed directly by a web browser. They will be highly useful in data streaming and real-time processing use cases and can be integrated directly into data analytics and streaming frameworks such as Apache Hadoop, Apache Kafka and Apache Spark.
Future tasks
The following are tentative future tasks and improvements:
- Define schemas for other serialization frameworks such as MessagePack, FlatBuffers and Thrift.
- Profile other performance metrics such as network latency, memory and CPU usage.
- Use these schemas to implement a serialization API that serializes geospatial data to preferred destination formats (i.e. JTS or Coverage Model).
- Integrate the serialization API into 52°North core components.
Some final words
This project was a great experience, in both its technical and non-technical aspects. I would be happy to follow it and see how it progresses in the future. I am also sure that it will be a great addition to the geospatial technology space.
I am grateful to my two mentors, Arne de Wall and Dr. Christoph Stasch, for the valuable guidance given throughout the project, and to Ann Hitchcock for all the help.