Development of an LLM-based Assistant for Searching for Geodata

Searching for relevant data is a fundamental task in geospatial and research data infrastructures (SDI/RDI) that demands precise and accurate results. Traditional search systems, which rely primarily on metadata, have notable limitations: they require specialized knowledge and cannot comprehend the contextual meaning of user queries. As a result, non-expert users often struggle to find the most relevant datasets efficiently. Furthermore, traditional metadata-based searches typically rely on exact keyword matching, which fails to capture the full semantic intent behind a query.

To address these limitations, this blog post explores a novel approach to geospatial data search based on a Large Language Model (LLM) framework.

Challenges of Traditional Search Systems

Traditional search systems, such as those implemented in metadata catalogue services, offer a full-text search to identify relevant metadata records, which can be supplemented by search facets and filters [1]. This full-text search is based on a lexical approach that relies on exact matches of words or phrases between search queries and metadata records [2].

Key Limitations of Lexical Search

Searchers must know the specific terminology of the metadata and use the correct search terms to obtain relevant results. This specialist knowledge cannot be taken for granted, especially in interdisciplinary portals.
Linguistic variations in the search queries – such as typing errors, abbreviations, synonyms or acronyms – can significantly impair the quality of the results (cf. “vocabulary mismatch problem” [3]).
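
The following minimal sketch illustrates the vocabulary mismatch problem described above. It is not taken from any catalogue implementation: the records, identifiers and function names are invented for illustration, and the matching rule (every query token must appear verbatim in the record) is a deliberately simplified stand-in for a lexical full-text search.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_search(query: str, records: dict[str, str]) -> list[str]:
    """Return the IDs of records that contain every query token verbatim."""
    query_tokens = tokenize(query)
    return [rid for rid, text in records.items() if query_tokens <= tokenize(text)]

# Hypothetical metadata records, invented for this example.
RECORDS = {
    "rec-1": "Digital elevation model of Saxony, 10 m resolution",
    "rec-2": "Building footprints for the city of Dresden",
}

print(lexical_search("elevation model", RECORDS))  # exact terms from the metadata
print(lexical_search("DEM", RECORDS))              # acronym
print(lexical_search("elevaton model", RECORDS))   # typing error
print(lexical_search("terrain height", RECORDS))   # synonyms
```

Only the first query finds the elevation dataset. The acronym, the typing error and the synonyms all return an empty list, although a human reader would consider the same record relevant in each case.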

Existing Solutions and Their Limitations

Various solutions exist to overcome these challenges. For example, integrating controlled vocabularies or ontologies makes it possible to suggest related search terms such as synonyms as part of the full-text search [4]. In addition, current developments in the field of neural networks and language models open new possibilities for capturing the semantics of search queries and results and determining their relevance more precisely [5]. Corresponding applications in the context of (geo)metadata search have already been developed [6].
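
As a rough illustration of the vocabulary-based idea, the sketch below expands a query with related terms before matching. The tiny thesaurus, the records and all function names are assumptions made for this example; a real system would draw the related terms from a controlled vocabulary or ontology and would rank the results instead of returning every partial match.

```python
import re

# Hypothetical mini-thesaurus standing in for a controlled vocabulary.
SYNONYMS = {
    "dem": {"elevation", "terrain"},
    "houses": {"building", "buildings"},
}

# Hypothetical metadata records, invented for this example.
RECORDS = {
    "rec-1": "Digital elevation model of Saxony, 10 m resolution",
    "rec-2": "Building footprints for the city of Dresden",
}

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def expand_query(query: str) -> set[str]:
    """Add the related terms of every query token to the set of search terms."""
    tokens = tokenize(query)
    expanded = set(tokens)
    for token in tokens:
        expanded |= SYNONYMS.get(token, set())
    return expanded

def search_with_expansion(query: str, records: dict[str, str]) -> list[str]:
    """Return the IDs of records that share at least one term with the expanded query."""
    terms = expand_query(query)
    return [rid for rid, text in records.items() if terms & tokenize(text)]

print(search_with_expansion("DEM", RECORDS))
print(search_with_expansion("houses", RECORDS))
```

The expansion only helps for terms that someone has entered into the vocabulary beforehand, which is one reason why embedding-based approaches are attractive as a complement.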

However, even the successful identification of relevant metadata records does not guarantee that the associated datasets meet the requirements of the searcher. Possible reasons for this are:

Insufficient representation of the dataset content and properties in the metadata
Incomplete or inaccurate metadata
Specific user requirements that are not captured in the available metadata

An LLM-based Framework as an Alternative Approach

To address the remaining challenges, a Large Language Model (LLM)-based framework was developed that implements a dialog-oriented search approach. The system is deliberately designed to integrate a variety of language models: in addition to proprietary models, open-source LLMs such as Llama or Mixtral can be used.

How the Framework Works

The system combines a chatbot for interaction in natural language with a semantic search index for geodata and its metadata. The framework is encapsulated within an easily configurable web server that exposes an OpenAPI-based REST API. This API provides multiple endpoints for chatbot inferences and for managing connected (meta)data resources. See Figure 1 for a high-level architecture overview.

Figure 1: High-level overview of the proposed LLM framework

Unlike conventional search systems, this approach makes it possible to capture user requirements precisely through interactive dialog. The chatbot can ask targeted questions to clarify the search intention and pin down specific requirements. Figure 2 shows a demo client implementing the proposed LLM framework. A sample search for “buildings in Dresden” demonstrates the LLM’s ability to handle natural language queries, as well as the possibility of using the LLM to interpret the retrieved results.

Figure 2: Chatbot client for testing the search assistant.

By integrating a special search index optimized for semantic indexing (vector database), both the actual geodata and its metadata can be stored and searched. This enables a semantic search for the information contained in the data.
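
To make the idea of indexing both metadata and data content more tangible, here is a small in-memory sketch. It does not reflect the framework's actual implementation. The `embed` function is a toy stand-in that counts character trigrams, so it captures spelling overlap and not real semantics; a production system would use a neural embedding model and a vector database instead. All dataset names, texts and class names are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: counts of character trigrams."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class VectorIndex:
    """Minimal in-memory index holding metadata texts and data attribute texts."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, str, Counter]] = []

    def add(self, dataset_id: str, source: str, text: str) -> None:
        """Store one text chunk together with its dataset ID and its origin."""
        self._entries.append((dataset_id, source, embed(text)))

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, str, str]]:
        """Return the top_k entries as (score, dataset_id, source), best first."""
        query_vec = embed(query)
        scored = [(cosine(query_vec, vec), dataset_id, source)
                  for dataset_id, source, vec in self._entries]
        return sorted(scored, reverse=True)[:top_k]

index = VectorIndex()
# Hypothetical entries: one dataset is indexed twice, via its metadata and via its attributes.
index.add("osm-poi", "metadata", "Points of interest in Saxony extracted from OpenStreetMap")
index.add("osm-poi", "data", "healthcare=hospital emergency=yes name=Example Clinic")
index.add("dresden-buildings", "metadata", "Building footprints for the city of Dresden")

for score, dataset_id, source in index.search("hospital with emergency room"):
    print(f"{score:.2f}  {dataset_id}  ({source})")
```

In this toy setup the best hit comes from the attribute text of the dataset and not from its metadata description, which never mentions hospitals. That is the effect the combined index is meant to achieve.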

A Concrete Example

Consider a search for “hospitals with an emergency room”. A traditional metadata-based search would retrieve only datasets that explicitly contain these keywords in their metadata. In contrast, the proposed LLM-based system can also use relevant data attributes within the geospatial datasets, such as:

“emergency=yes”
“healthcare=hospital”

Additionally, the chatbot can ask specifically for further requirements, such as whether certain specialist departments are needed. This information is then compared with the attributes available in the data. This makes it possible to recognize implicit or complex correlations between search queries and available data records and thus minimize the discrepancy between user requirements and data availability.
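
The comparison between collected requirements and data attributes could look roughly like the sketch below. It assumes that the dialog has already been turned into key-value requirements; how the framework does this is not described here. The features, the key `healthcare:speciality` and the function names are used purely for illustration.

```python
# Hypothetical features carrying OpenStreetMap-style attributes.
FEATURES = [
    {"name": "Clinic A", "healthcare": "hospital", "emergency": "yes"},
    {"name": "Clinic B", "healthcare": "hospital", "emergency": "no"},
    {"name": "Practice C", "healthcare": "doctor"},
]

def filter_features(features: list[dict], required: dict[str, str]) -> list[dict]:
    """Keep the features whose attributes satisfy every required key-value pair."""
    return [f for f in features
            if all(f.get(key) == value for key, value in required.items())]

def unavailable_keys(features: list[dict], required: dict[str, str]) -> list[str]:
    """List required attribute keys that no feature in the dataset carries at all."""
    return [key for key in required if not any(key in f for f in features)]

# Requirements as they might result from the dialog (assumed format).
requirements = {"healthcare": "hospital", "emergency": "yes"}
print([f["name"] for f in filter_features(FEATURES, requirements)])

# A follow-up requirement that the data cannot answer.
extended = {**requirements, "healthcare:speciality": "cardiology"}
print(unavailable_keys(FEATURES, extended))
```

The second check is what allows an assistant to tell the user that a requirement cannot be verified with the available data, instead of silently returning an empty result.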

Conclusion

Our LLM-based framework offers a novel approach to geospatial data search that overcomes the limitations of traditional search systems. By combining a chatbot with a semantic search index, we can provide precise and relevant results that meet the requirements of searchers. We believe that this approach has the potential to revolutionize the way we search for geospatial data and make it more accessible to a wider range of users.

Outlook

The concept and technology behind the framework will be presented at the FOSSGIS conference on March 26, 2025 in Münster (conference details).

Simeon conducted this work as part of the 52°North Innovation Prize Challenge. Contact him at:

Simeon Wetzel
Technische Universität Dresden
Chair of Geoinformatics
Helmholtzstraße 10, 01069 Dresden
simeon.wetzel@tu-dresden.de

References

[1] Hervey, T., Lafia, S., & Kuhn, W. (2020). Search Facets and Ranking in Geospatial Dataset Search. Leibniz International Proceedings in Informatics, LIPIcs, 177(5), 1–5. https://doi.org/10.4230/LIPIcs.GIScience.2021.I.5

[2] Formal, T., Piwowarski, B., & Clinchant, S. (2022). Match Your Words! A Study of Lexical Matching in Neural Information Retrieval. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 13186 LNCS, 120–127. https://doi.org/10.1007/978-3-030-99739-7_14

[3] Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1987). The Vocabulary Problem in Human-System Communication. Communications of the ACM, 30(11), 964–971. https://doi.org/10.1145/32206.32212

[4] Jiang, S., Hagelien, T. F., Natvig, M., & Li, J. (2019). Ontology-Based Semantic Search for Open Government Data. Proceedings – 13th IEEE International Conference on Semantic Computing, ICSC 2019, 7–15. https://doi.org/10.1109/ICOSC.2019.8665522

[5] Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS. http://arxiv.org/abs/2104.08663

[6] Wetzel, S., & Mäs, S. (2024). Context-Aware Search for Environmental Data Using Dense Retrieval. ISPRS International Journal of Geo-Information, 13(11), 380. https://doi.org/10.3390/ijgi13110380
