Large Language Models (LLMs) combined with visual foundation models have progressed substantially, showing capabilities that approach human-level performance on some tasks. This study evaluates several Multimodal LLMs (MLLMs), including Multimodal-GPT, GPT-4 Vision, Gemini, and LLaVA, specifically within the hydrology domain. Hydrology is a critical application area for AI, with uses in flood management, water level monitoring, agricultural water discharge, and water pollution control.
We evaluated these models on hydrology-specific studies, analyzing their ability to generate meaningful responses and assessing their feasibility for real-time applications. We chose complex real-world scenarios to explore how well MLLMs can address hydrological challenges. Specially crafted prompts improved the models’ visual inference, helping them interpret image data within hydrological contexts.
Among the MLLMs tested, GPT-4 Vision performed best, demonstrating strong proficiency in visual data inference and highlighting the potential of multimodal models to support human-computer interaction in hydrological decision-making. The results show how advanced AI models that integrate textual and visual data can be applied to complex problems in hydrological environments and can support real-world hydrological inference systems.
Related Articles
- Kadiyala, L. A., Mermer, O., Samuel, D. J., Sermet, Y., & Demir, I. (2024). A Comprehensive Comparison of Multimodal Large Language Models (MLLMs) in Hydrology. https://doi.org/10.31223/X5TQ37

The workflow for MLLM benchmarking in hydrological tasks.
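
As a rough illustration of what such a benchmarking workflow might look like in code, the sketch below sends the same image and prompt pairs to several models and collects their responses for side-by-side review. It is a minimal sketch, not the authors' implementation: the `Task` structure, the `run_benchmark` function, the `(prompt, image_path) -> str` adapter signature, and the stub model are all hypothetical names introduced here. A real adapter would wrap each model's own API, and the responses would still need to be assessed by a reviewer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: every model is wrapped in an adapter with this signature,
# taking (prompt, image_path) and returning the model's text response.
ModelFn = Callable[[str, str], str]


@dataclass
class Task:
    """One hydrology scenario: an image plus a crafted prompt (hypothetical structure)."""
    name: str
    image_path: str
    prompt: str


def run_benchmark(models: Dict[str, ModelFn], tasks: List[Task]) -> Dict[str, Dict[str, str]]:
    """Send every task to every model and collect the raw responses.

    Returns a nested dict: results[model_name][task_name] -> response text.
    Failures are recorded as text so one failing model does not stop the run.
    """
    results: Dict[str, Dict[str, str]] = {}
    for model_name, ask in models.items():
        results[model_name] = {}
        for task in tasks:
            try:
                results[model_name][task.name] = ask(task.prompt, task.image_path)
            except Exception as exc:
                results[model_name][task.name] = f"ERROR: {exc}"
    return results


def stub_model(prompt: str, image_path: str) -> str:
    """Placeholder adapter used only to make the sketch runnable offline."""
    return f"[stub] {prompt} ({image_path})"


if __name__ == "__main__":
    # Illustrative task; the file name and prompt are made up for this example.
    tasks = [
        Task(
            name="flood_extent",
            image_path="flood_scene.jpg",
            prompt="Describe the visible flooding and estimate its severity.",
        )
    ]
    responses = run_benchmark({"stub": stub_model}, tasks)
    for model_name, per_task in responses.items():
        for task_name, text in per_task.items():
            print(f"{model_name} / {task_name}: {text}")
```

Keeping every model behind the same adapter signature means the prompts and images stay identical across models, so differences in the collected responses reflect the models rather than the harness.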