25/10/2025
A few months ago, my friend and I discovered that the Spanish government, in particular the State Secretariat for Digitization and Artificial Intelligence, was organising a hackathon to put their new Alia models to use. As part of the National Program of Green Algorithms (PNAV), the contest was oriented towards creating an LLM-based solution to help with some biodiversity issue. The hackathon would run for a couple of months, giving participants plenty of time to develop their projects.
As my friend was especially interested in biodiversity and I had a decent amount of experience using LLMs, we decided to gather a few more companions and form a team with the aim of winning the event.
The competition ran from mid-February to early May (although it was later extended until the end of June) and consisted of several phases:
18th February - 5th March: A first phase to identify and define at least three possible problems whose resolution would benefit from Generative AI.
7th March - 28th May: A second phase to select at least one of them and implement it.
6th June - 24th June: A final phase to prepare a presentation that teams would deliver at an official event on the 25th.
Once each round was over, a panel of judges analysed each submitted solution to determine the subset of teams that would move on to the next round.
The hackathon was held online, except for the final in-person event, which suited our team well, as we were distributed across Spain.
During the first stage of the hackathon, our main objective was to identify three problems related to biodiversity that could benefit from the application of Generative AI. According to the organizers’ guide, one of these use cases needed to benefit a public administration, and each proposal would later be judged on creativity and impact, technological soundness, the appropriateness of the AI models used, and the quality and relevance of the datasets.
We also learned that the jury would pay special attention to projects making innovative use of the ALIA family of models, particularly Salamandra 7B, developed by the BSC (Barcelona Supercomputing Center).
We developed the following three use cases:
Enriching species habitat information from unstructured data: we discovered that much of the habitat information for species in the Inventario Español del Patrimonio Natural y de la Biodiversidad (IEPNB) was incomplete or scattered across unstructured sources such as reports, tables, or PDFs. Our idea was to use LLMs like Salamandra 7B to extract and harmonize this information automatically, integrating it into a unified database model. Once structured, the data could feed a chatbot for querying biodiversity data (e.g., “Which animals live in wetlands?”), an automatic report generator producing detailed, user-friendly species or habitat summaries, or both. This would facilitate research, policy decision-making, and public access to biodiversity knowledge.
Simplifying and adapting invasive-species management strategies for the public: the second proposal addressed the complexity of official documents on invasive species management. These strategies, though crucial, are often technical and hard for citizens to interpret. We proposed adapting Salamandra 7B through fine-tuning and prompt engineering to generate simplified, accessible summaries of these documents. The aim was to bridge the gap between scientific/administrative language and public understanding, empowering citizens to participate more actively in conservation efforts.
Generating sustainable tourism itineraries linked to biodiversity data: our third idea explored the intersection of ecotourism and AI. We envisioned an intelligent assistant capable of creating personalized, sustainable travel itineraries in Spain. By combining biodiversity datasets (e.g., GBIF, IEPNB), meteorological data, and regional regulations, the system could generate routes including species guides, environmental rules, and weather forecasts—downloadable as multilingual PDFs. This would encourage responsible tourism while highlighting Spain’s natural heritage.
These three proposals covered different angles such as data enrichment, communication and education, and sustainable tourism, but they all shared the same spirit of using LLMs to make biodiversity information more accessible, useful, and actionable.
After our three initial proposals were evaluated, our team advanced to the second phase on all three ideas. We decided to focus on the third one: creating sustainable tourism itineraries enriched with biodiversity data. We named the project BioSenda (combination of biodiversity and senda, the Spanish word for trail), and centered it on developing a platform capable of recommending eco-friendly routes across Spain while offering customisable, scientifically accurate descriptions of routes and local species.
According to the competition’s development guide, this phase required not only a working proof of concept but also the quantification of the carbon footprint of our solution using CodeCarbon, and the integration of Salamandra 7B. Judges would later assess us on four main criteria: the quality of the generated dataset, the technical complexity, the robustness of the solution, and its energy efficiency and sustainability, each contributing 20% of the score.
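For context, measuring the footprint with CodeCarbon mostly comes down to wrapping each workload with its tracker. Here is a minimal sketch of how this looks; the run_batch_job function is a placeholder rather than actual BioSenda code:

```python
from codecarbon import EmissionsTracker

def run_batch_job():
    # Placeholder for one of our workloads (scraping, fine-tuning, inference...).
    ...

# Track the estimated emissions of the job; results are also written to emissions.csv.
tracker = EmissionsTracker(project_name="biosenda", output_dir=".")
tracker.start()
try:
    run_batch_job()
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kg of CO2-eq
    print(f"Estimated footprint: {emissions_kg:.6f} kg CO2-eq")
```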
Our solution underwent significant changes as the competition progressed. Our plan was not to spend too much time gathering and processing the data our application was going to use, maybe a couple of weeks. However, we severely underestimated the complexity of this task, especially considering our application was meant to generate accessible species descriptions that anyone would be able to understand.
Thus, the project began with a massive effort to collect, clean, and unify biodiversity data from multiple open Spanish sources. We focused on six main categories:
Species names and taxonomy, obtained through the EIDOS API, covering more than 53,000 species and their common names in multiple languages.
Species descriptions, compiled from five repositories, including SEO BirdLife, Fungipedia, Vertebrados Ibéricos, and the IEET Atlas, using custom web-scraping scripts and manual curation. This was a crucial part of the process, as we had to make sure the data was descriptive enough to identify the species while also being easy to understand.
Conservation status, filtered from EIDOS and aligned with the IUCN classification.
Legal status, obtained from the Catálogo Español de Especies Exóticas Invasoras (CEEEI).
Geographical distribution for each species, integrating over 1 million unique locations from IEET, EIDOS, and iNaturalist, standardized into 10x10 km MGRS grid cells across Spain.
Routes and trails, gathered from Caminos Naturales, FEDME, Vías Verdes, and National Parks datasets, cleaned and spatially joined with biodiversity grids to create a unified database of over 4,700 routes.
Each dataset required a custom pipeline, mostly using Python and some manual scraping. The iNaturalist dataset was especially heavy to process, so we had to use Azure Machine Learning Studio for that one.
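As an illustration of the grid standardization mentioned above, here is a minimal sketch of mapping occurrence coordinates to 10x10 km MGRS cells, assuming the mgrs Python package and a hypothetical occurrences.csv file (the real pipelines were considerably more involved):

```python
import csv
import mgrs

converter = mgrs.MGRS()

def to_grid_cell(lat: float, lon: float) -> str:
    # MGRSPrecision=1 truncates the reference down to a 10 km x 10 km cell.
    return converter.toMGRS(lat, lon, MGRSPrecision=1)

# Deduplicate raw observations into (species, grid cell) pairs.
pairs = set()
with open("occurrences.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        cell = to_grid_cell(float(row["latitude"]), float(row["longitude"]))
        pairs.add((row["species_id"], cell))

print(f"{len(pairs)} unique (species, cell) pairs")
```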
One of the main challenges of this part was matching the species names/IDs from the different sources. All EIDOS-related sources used the same ID, so we decided to adopt it for our project. However, many species had multiple common and scientific names, which appeared interchangeably across the various sources, so one of our main efforts was to normalize and match them so that we could have unified information about each species.
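To give a flavour of that matching, here is a simplified sketch of the normalization and lookup we relied on; the EIDOS identifier shown is made up, and the real process also involved plenty of manual review:

```python
import unicodedata

def normalize(name: str) -> str:
    # Lowercase, strip accents and collapse whitespace so that variants like
    # "Sapo Partero Ibérico" and "sapo partero iberico" end up identical.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.lower().split())

def build_lookup(records):
    # records: iterable of (eidos_id, [scientific and common name variants])
    lookup = {}
    for eidos_id, variants in records:
        for variant in variants:
            lookup.setdefault(normalize(variant), eidos_id)
    return lookup

# Hypothetical entry; the real table covered 53,000+ species.
lookup = build_lookup([("EIDOS-0001", ["Alytes cisternasii", "Sapo partero ibérico"])])
print(lookup[normalize("SAPO PARTERO IBERICO")])  # -> EIDOS-0001
```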
The other significant effort was deciding what made a good description. In the end, we chose to discard descriptions that were too short and to generate completely new ones for all species. We used Gemini 2.0 Flash to produce synthesised versions that gathered all the relevant information from our different sources while remaining easily understandable by a general audience. For instance, compare these two descriptions of the Iberian midwife toad (only for Spanish readers):
Sapo de tamaño pequeño (menor de 5 cm). Aspecto muy rechoncho, cabeza grande y hocico redondeado, ojos prominentes, pupila vertical e iris dorado con vetas negras. Tímpano visible y más pequeño que el ojo. Pliegue gular patente. Glándulas parotídeas poco aparentes. Piel granulosa con verrugas pequeñas, frecuentemente agrupadas en los costados en dos hileras dorsolaterales. Miembros cortos y fuertes, los anteriores con dos tubérculos metacarpianos, y los posteriores con membrana interdigital muy reducida y con un tubérculo metatarsiano pequeño. Los dos tubérculos metatarsianos han sido a menudo utilizados como carácter discriminante con respecto a A. obstetricans (que tiene tres generalmente) pero hay excepciones en esta última especie. Coloración dorsal parda con pequeñas manchas más oscuras y poco marcadas, a menudo coronadas por vistosos puntos naranja. Zona ventral clara.
Este sapo es pequeño, no suele superar los 5 cm, y tiene un aspecto robusto. Su cabeza es grande, con un hocico redondeado y ojos que resaltan, con un iris dorado y líneas negras. Su piel es rugosa, cubierta de pequeñas verrugas, especialmente en los costados, donde forman dos líneas. Sus patas son cortas y fuertes, con membranas muy pequeñas entre los dedos de las patas traseras. Su color es marrón en la parte superior, con pequeñas manchas oscuras y puntos naranja. La parte de abajo es de color claro.
As you can see, the new description contains mostly the same information, but it uses simpler vocabulary and condenses it into something users can rely on to identify the toad. We generated three versions of these descriptions for each species, with the aim of offering a different experience to each user visiting the page.
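The synthesis itself was mostly a matter of careful prompting. As a rough sketch using the google-generativeai client (the prompt here is a condensed stand-in, not the one we actually used), it could look like this:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

PROMPT = """You are a science communicator. Using the source descriptions below for
the species {species}, write a single short description in Spanish that a general
audience can understand and that is still useful to identify the species in the field.

Source descriptions:
{sources}
"""

def synthesise_description(species: str, sources: list[str]) -> str:
    # Merge every source description into one accessible summary.
    response = model.generate_content(
        PROMPT.format(species=species, sources="\n---\n".join(sources))
    )
    return response.text
```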
Our efforts were split between the aforementioned description generation and the general organisation of the data into a database, the development of a user interface that rendered and displayed PDF files we designed, and the creation of fine-tuned LLMs that would allow users to customise the species descriptions.
Once the datasets were validated, we built a relational SQL database that could serve both the AI training process and the future BioSenda interface. The database linked species, routes, and locations through shared grid identifiers (MGRS) and species IDs, which allowed us to retrieve all species observable along a given route, together with their images, conservation status, invasive classification, and public-friendly descriptions. We made all of the data public, including a Huggingface dataset of images that lets users locate each route within their region.
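To make the structure concrete, here is a simplified sketch of the core tables and of the route-to-species query, shown with sqlite3 for brevity; the table and field names are illustrative rather than our exact schema:

```python
import sqlite3

# Simplified sketch of the core tables; the real schema had more fields and tables.
SCHEMA = """
CREATE TABLE species (
    species_id      INTEGER PRIMARY KEY,  -- EIDOS identifier
    scientific_name TEXT NOT NULL,
    common_name     TEXT,
    conservation    TEXT,                 -- IUCN category
    legal_status    TEXT,                 -- e.g. invasive (CEEEI)
    description     TEXT                  -- public-friendly description
);
CREATE TABLE routes (
    route_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
-- Species and routes are both linked to 10x10 km MGRS cells.
CREATE TABLE species_cells (species_id INTEGER, mgrs_cell TEXT);
CREATE TABLE route_cells  (route_id   INTEGER, mgrs_cell TEXT);
"""

# All species observable along a given route, via the shared grid cells.
SPECIES_ON_ROUTE = """
SELECT DISTINCT s.*
FROM species s
JOIN species_cells sc ON sc.species_id = s.species_id
JOIN route_cells  rc ON rc.mgrs_cell   = sc.mgrs_cell
WHERE rc.route_id = ?;
"""

conn = sqlite3.connect("biosenda_demo.db")
conn.executescript(SCHEMA)
print(conn.execute(SPECIES_ON_ROUTE, (1,)).fetchall())  # empty in this toy database
```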
At first, the fine-tuned customisation models were something "nice to have" rather than a strict requirement. In the end, since the hackathon deadline for this phase was extended by a whole month, we decided to go for them, reasoning that they would increase the complexity of our project and thus earn us extra points. The underlying motivation was to let users adapt the descriptions we served so that they better fit their needs. For instance, a father who knows his son is a big Pokemon fan could generate descriptions in which similarities with Pokemon species are highlighted. Likewise, a scientist who wanted more accurate descriptions with scientific terminology would be able to generate them as well.
To do this, the structure we wanted to follow was:
The user selects a route.
The platform generates a PDF file with sample species found along that route.
The user describes, through a text box, how they would like to customise the species descriptions.
One of our LLMs generates and executes a SQL query against our database to retrieve the relevant information needed to fulfil the user's prompt.
Another LLM uses that information to generate new descriptions, which are served in a new PDF (a stripped-down sketch of these last two steps follows the list).
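Here is that stripped-down sketch of the last two steps. The ask_llm helper is hypothetical shorthand for whichever backend serves the models, and the prompts are far shorter than the real ones:

```python
import sqlite3

def ask_llm(model: str, prompt: str) -> str:
    # Hypothetical helper wrapping whichever backend serves the model
    # (KoboldCPP locally, or any ChatCompletions-compatible API).
    raise NotImplementedError

def customise_descriptions(route_id: int, user_request: str) -> list[str]:
    # First, the text2sql model writes the query that fetches the data the request needs.
    sql = ask_llm(
        "salamandra-7b-instruct-text2sql",
        f"Write a SQL query retrieving the species data needed for {user_request!r} "
        f"on route {route_id}.",
    )
    rows = sqlite3.connect("biosenda.db").execute(sql).fetchall()

    # Then, the description model rewrites each species entry following the request.
    return [
        ask_llm(
            "salamandra-7b-instruct-description",
            f"Rewrite this species information as requested ({user_request}): {row}",
        )
        for row in rows
    ]
```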
We selected Salamandra-7B-instruct as our base model, as it had better Spanish understanding than similarly sized models (thus better serving our purposes) and because this choice was valued by the judges. We then built two datasets to fine-tune our models: one to train the generation of SQL queries, and one to train the generation of the customised descriptions. For both of them, we used Gemini 2.0 Flash as a teacher, following a knowledge distillation approach.
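Conceptually, each training example was a (request, teacher output) pair. A simplified sketch of how the text2sql records could be assembled into a chat-style JSONL file follows; the field names and sample content are illustrative, not our exact format:

```python
import json

def build_text2sql_record(user_request: str, teacher_sql: str) -> dict:
    # Chat-style record: the student model learns to reproduce the teacher's SQL.
    return {
        "messages": [
            {"role": "system", "content": "Translate user requests into SQL over the BioSenda schema."},
            {"role": "user", "content": user_request},
            {"role": "assistant", "content": teacher_sql},
        ]
    }

with open("text2sql_train.jsonl", "w", encoding="utf-8") as f:
    record = build_text2sql_record(
        "Descriptions highlighting similarities with Pokemon for the species on route 42",
        "SELECT scientific_name, common_name, description FROM species ...",  # teacher (Gemini) output
    )
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```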
We fine-tuned our models using CPUs instead of GPUs, as the latter were not available to us under the Microsoft Founders program that we were using. With this constraint, we decided to use LoRA instead of full fine-tuning. The results were the salamandra-7b-instruct-text2sql and salamandra-7b-instruct-description models, which we made publicly available on Huggingface. We also created quantized versions of the models using the GGUF My Repo tool to make them easier to host.
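For reference, the LoRA setup boils down to something like the following sketch with Hugging Face transformers and peft; the hyperparameters shown are illustrative rather than the exact values we used:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "BSC-LT/salamandra-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# LoRA trains small low-rank adapters instead of all 7B parameters,
# which is what made CPU-only fine-tuning feasible for us.
lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```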
Finally, our user interface was developed using Flask and Python (with an HTML frontend). We connected it to our LLMs through KoboldCPP, an easy-to-use server for running quantized models, and we made sure it could also work with any other ChatCompletions-compatible API. You can find a screenshot of the resulting user interface below, in which the user can select which Spanish regions and taxonomic groups they are interested in:
BioSenda is publicly available through biosenda.es. This cloud version uses an Azure-hosted GPT-4.1 instead of our custom LLMs, as it is a more affordable and faster option, while the platform itself is hosted on Google Cloud Platform's Cloud Run.
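That swap is painless precisely because every backend speaks the same ChatCompletions dialect. Here is a minimal sketch of the client side pointing at a local KoboldCPP instance; the URL, key, and model name are placeholders:

```python
from openai import OpenAI

# The same client can target a local KoboldCPP server or a hosted deployment,
# as both expose a ChatCompletions-compatible endpoint. URL/key/model are placeholders.
client = OpenAI(base_url="http://localhost:5001/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="salamandra-7b-instruct-description",
    messages=[
        {"role": "system", "content": "You rewrite species descriptions for BioSenda users."},
        {"role": "user", "content": "Describe the Iberian midwife toad for a young Pokemon fan."},
    ],
)
print(response.choices[0].message.content)
```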
This ends the development section of this post; we will now focus on the final phase of the competition.
The last stage of the hackathon focused on transforming our technical solution into a viable business proposal. During this phase, we had to prepare a comprehensive business case showcasing not only the technical innovation behind our AI solution but also its potential for real-world impact and scalability. We had to develop a 10-slide presentation summarizing our solution, its market viability, and the results achieved throughout the competition. For this, we received support from our mentor from the first phase, who helped us refine our strategy.
The phase culminated in a live event held in Madrid on the 24th of June, 2025, where, as one of the three finalist teams, we presented and defended our business case before a jury. The event was live-streamed and later uploaded to YouTube, where all three presentations can be watched. Here is the direct link to the one from our team:
In the end, we finished in a close second place, just one point behind a similar solution by a team of NTT Data employees. Despite this result, we felt proud of our work, and felt that either of the two projects could have won.
Overall, the competition was a positive experience; we had to work hard over many months, but we were able to build an impressive platform that we hope will one day be used by someone. To our knowledge, ours is the only project that is still maintained and cloud-hosted for everyone to try, so we have everything set up for that to happen!