As you may know if you have been following my recent blog posts or status reports, I’m currently a Google Summer of Code 2012 student for 52°North. My project primarily deals with the problem of providing bindings for OpenStreetMap data so that they can be used within the Web Processing Service (WPS). Yet here I am talking about xmlcodegen, a component that is primarily part of Geotools… why?
It turns out that the WPS relies a lot on Geotools for handling various map data types. This makes sense, as Geotools itself is intended to provide an easy way to interact with various geospatial data formats. By making use of Geotools, WPS can tap into its features and capabilities, and thus read and manipulate all these data formats without having to reinvent the wheel for each file format. The same goes for any other projects or packages that make use of Geotools.
As such, when the task came to implement bindings in WPS for OpenStreetMap data, I naturally looked into the capabilities of Geotools to see if such functionality was already provided. It turned out that it was not. However, mechanisms existed to easily implement such functionality, and this was helped by the fact that OpenStreetMap data is primarily in XML format.
This blog post is mostly dedicated to my experiences in generating bindings for Geotools from XML-based file formats. It is meant to supplement Geotools’ articles in terms of some of the problems I encountered, and the solutions I found for them. I also will give a brief overview as to the progress of the project thus far.
Disclaimer: I don’t in any way claim to know all there is to know about Geotools’ xmlcodegen process, nor do I claim to be the authoritative source for solutions to the issues I address.
What is XML? Why is knowing this important in the context of OpenStreetMap?
XML is a markup language. In the context of OpenStreetMap data, it is primarily used as a means to transport and convey map data in a format that is meaningful to both humans and machines. As such, it tends to be considerably more verbose than other file formats (consider a 303GB uncompressed OSM XML planet file, compared to its 21GB OSM PBF relative). However, this verbosity is mitigated by the fact that several mature libraries and tools already exist to handle this file format. Consider, for example, all of the ways in Java to interact with XML:
…just to name a few. Some of these, like JAXB, will automatically marshall (serialize) your Java objects into XML, and unmarshall the XML representation back into Java objects again. This is incredibly useful if you decide that you want to store your data structures in XML for portability across libraries and implementations. The converse is also true: it is just as useful if you want an XML format to be converted into meaningful objects and structures inside your library or program.
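To make the unmarshalling idea concrete, here is a minimal sketch that does by hand what a binding library such as JAXB automates. It deliberately uses StAX (javax.xml.stream, bundled with the JDK) rather than JAXB itself, so that it runs without extra dependencies. The OsmNodeReader and OsmNode names and the sample document are purely illustrative assumptions, not classes from Geotools or WPS.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class OsmNodeReader {

    /** Hypothetical value class for one OSM <node>; a binding generator would produce something similar. */
    static final class OsmNode {
        final long id;
        final double lat;
        final double lon;
        final Map<String, String> tags = new LinkedHashMap<>();

        OsmNode(long id, double lat, double lon) {
            this.id = id;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Reads every <node> element (and its <tag> children) from an OSM XML string. */
    static List<OsmNode> readNodes(String xml) {
        List<OsmNode> nodes = new ArrayList<>();
        OsmNode current = null; // the <node> we are currently inside, if any
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("node".equals(name)) {
                        current = new OsmNode(
                                Long.parseLong(reader.getAttributeValue(null, "id")),
                                Double.parseDouble(reader.getAttributeValue(null, "lat")),
                                Double.parseDouble(reader.getAttributeValue(null, "lon")));
                        nodes.add(current);
                    } else if ("tag".equals(name) && current != null) {
                        // Tags outside a <node> (for example on ways) are ignored in this sketch.
                        current.tags.put(reader.getAttributeValue(null, "k"),
                                reader.getAttributeValue(null, "v"));
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "node".equals(reader.getLocalName())) {
                    current = null;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Made-up sample data in the shape of an OSM XML extract.
        String xml = "<osm version=\"0.6\">"
                + "<node id=\"1\" lat=\"51.5\" lon=\"7.25\"><tag k=\"amenity\" v=\"cafe\"/></node>"
                + "<node id=\"2\" lat=\"51.75\" lon=\"7.5\"/>"
                + "</osm>";
        for (OsmNode node : readNodes(xml)) {
            System.out.println(node.id + " " + node.lat + "," + node.lon + " " + node.tags);
        }
    }
}
```

With a generated binding, the loop above disappears: the library reads the schema-derived classes and fills them in for you, which is the main reason to prefer it once a schema is available.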
So how does this relate to Geotools?
It turns out that the above approach is exactly what Geotools does with all of its XML-based map format representations, such as GML, KML, etc. Geotools takes these XML files, automatically unmarshalls them into Java objects using JAXB, then does some other fancy magic using its xmlcodegen package to convert the internal representation to something that is standardized across the toolset. This internal representation in Geotools (which coincidentally is also used by WPS) is a FeatureCollection object. A FeatureCollection is essentially a collection of features reflective of maps and geospatial data.
Therefore the proper way to handle XML-based OpenStreetMap data is to:
- Get JAXB to unmarshall it into corresponding objects
- Provide the necessary bindings as an extension to Geotools; this is done using Geotools’ xmlcodegen package
- Add extra code in the unmarshalling process to transform the data into a FeatureCollection object
- Integrate the extension into WPS so the new data is now directly usable “out-of-the-box”.
This is the approach I take in processing OSM data for WPS: essentially, I create a Geotools extension to do all the dirty work of the actual parsing.
What does this mean? The benefit of this approach is that, hypothetically you can take any XML-based map data, and provided you have the schema definition for it, you can generate your own custom Geotools bindings. It means that it is now much easier for other communities (such as the Geoprocessing community of 52°North) to handle new formats in their WPS package, thus contributing to better support and faster deployment.
To generate Java bindings for our XML file, we require a schema definition of some sort that tells the JAXB generator what fields it should expect of all incoming XML files. As the schema for OSM data is not too well defined, I used a combination of schema inference tools (such as xmlbeans’ inst2xsd), reading documentation of the format, and manual editing to generate my own schema definition for OSM data. To make life easier, I’ve created a maven pom.xml that simplifies the process slightly, reducing the process to 3 commands:
mvn jaxb2:generate
mvn install
mvn org.geotools.maven:xmlcodegen:generate
However, when it comes to actual class generation, this can produce bafflingly cryptic error messages that take some time to resolve.
Here I document some of the issues that held me back initially and the solutions I found for them with the purpose of aiding the next person who may encounter these issues.
- Configuration of Parser fails due to NullPointerException
Generally this results from the original schema file not being found. By default, the Configuration class will search for the schema file within one of several predefined resource locations. In Eclipse/Maven projects, this means the schema has to be located in src/main/resources/<package name>.
- Unresolved dependency errors
Like the name suggests, this results from the inability of the system to resolve dependencies. Usually this means that the system was able to find the initial schema file, but not any of the files it depends on. One solution to this is to just make one schema file; the alternative is to ensure that all the dependencies are in one of the predefined resource locations as mentioned above.
- Parser returns a HashMap instead of your appropriate type
This usually results from the schema namespace being configured incorrectly. Ensuring you have the xmlns field configured the same as in the JAXB-generated bindings will fix this issue.
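Two of the issues above can be checked up front with plain JDK calls before Geotools is involved at all. The following is a rough sketch under my own assumptions: SchemaChecks, requireSchema and rootNamespace are hypothetical helper names, not part of the Geotools API, and the namespace and resource path in main are placeholders.

```java
import java.io.StringReader;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SchemaChecks {

    /**
     * Looks up a schema on the classpath and fails with a readable message
     * if it is missing, instead of letting a null travel further downstream.
     */
    static URL requireSchema(Class<?> anchor, String resourceName) {
        URL url = anchor.getResource(resourceName); // null when the resource is not found
        if (url == null) {
            throw new IllegalStateException("Schema not found on classpath: " + resourceName);
        }
        return url;
    }

    /** Returns the namespace URI of the root element, or null if it has none. */
    static String rootNamespace(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this, namespaces are not reported
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder namespace: compare this value with the one your bindings expect.
        System.out.println(rootNamespace("<osm xmlns=\"urn:example:osm\" version=\"0.6\"/>"));
        // Placeholder path: adjust it to wherever your schema lives under src/main/resources.
        System.out.println(requireSchema(SchemaChecks.class, "/example/osm.xsd"));
    }
}
```

If rootNamespace returns something different from the namespace declared in your schema and bindings, that mismatch is a likely suspect when the parser hands back a generic map rather than your own type.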
What about the rest of the project?
As to the rest of the project: I have most of the pre-processing component set up, but not yet as a polished package. I was a little late in meeting my originally intended milestone, mostly because of schema problems that I had to resolve. Work has started on the UI component of the project, but regrettably I do not yet have any interesting screenshots of the system in action.
Additionally, after some talk with my mentor, we have decided that the best way to package the project overall is to deploy it as an additional backend supported by WPS (like the GRASS, Sextante, etc. backends). This provides support for interaction with the Overpass API (a more interactive means of getting OSM data), in addition to providing the capability to perform additional tasks that are not currently supported by Overpass. Some of these tasks may include simple feature extraction, such as extracting buildings of a particular type, extracting cycle roads, etc. The Overpass API already has some built-in task capability, but most of those capabilities are simple queries/commands; more powerful functionality can be created by chaining the simple tasks into more complex ones.
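As a rough illustration of what chaining simple tasks could look like, here is a small plain-Java sketch. It is not the WPS backend or the Overpass API: TaskChain, withTag and hasKey are hypothetical names, and each OSM element is reduced to a bare map of its tags for the sake of the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class TaskChain {

    /** Builds a tag map from alternating key/value arguments (illustrative helper). */
    static Map<String, String> tags(String... keyValues) {
        Map<String, String> tags = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            tags.put(keyValues[i], keyValues[i + 1]);
        }
        return tags;
    }

    /** Simple task: keep only elements carrying the given key=value tag. */
    static Function<List<Map<String, String>>, List<Map<String, String>>> withTag(
            String key, String value) {
        return elements -> elements.stream()
                .filter(t -> value.equals(t.get(key)))
                .collect(Collectors.toList());
    }

    /** Simple task: keep only elements that have the given key at all. */
    static Function<List<Map<String, String>>, List<Map<String, String>>> hasKey(String key) {
        return elements -> elements.stream()
                .filter(t -> t.containsKey(key))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up elements using common OSM tag keys.
        List<Map<String, String>> elements = Arrays.asList(
                tags("building", "school", "name", "Example School"),
                tags("building", "school"),
                tags("highway", "cycleway"));

        // Two simple tasks chained into a more specific one: named school buildings.
        Function<List<Map<String, String>>, List<Map<String, String>>> namedSchools =
                withTag("building", "school").andThen(hasKey("name"));

        System.out.println(namedSchools.apply(elements));
        System.out.println(withTag("highway", "cycleway").apply(elements));
    }
}
```

Because every task has the same input and output type, any number of them can be composed with andThen, which is the property that makes chaining attractive here.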
Martin says
Dear all,
I read your article and find it very useful for our project. The article is very well written.
If it is not too much to ask I would like to get this file from you.
https://svn.52north.org/svn/geoprocessing/incubator/osmtransform/trunk/xsd/pom.xml
I don’t have a login.
Your help will be much appreciated.
Best regards
Martin
Eike Hinderk Juerrens says
This issue is already solved via e-mail.
Damiano says
Hi, I’m also diving into the subject of processing OSM XML data with Java.
Do you have an update on this blog post?
Could I get access to the files in Subversion?
Thanks!