Abstract
This research investigates, documents, and implements a system for recognizing geographical areas in satellite imagery. The primary objective is to explore the potential of artificial intelligence (AI) image recognition techniques for describing and identifying the areas depicted in satellite images. By leveraging AI technologies, particularly large-scale language models and computer vision algorithms, the research aims to develop a search/recommendation system capable of matching a user-given prompt with corresponding satellite images in a dataset. Experiments with pre-trained models reveal that a common open-source dataset combined with a pre-trained CLIP model matches captions to recommended images fairly well. To adopt longer, descriptive captions (e.g., from Wikipedia) for photos, a model must be fine-tuned on a custom dataset; in our case, we decided to gather descriptions of Lithuanian churches. The dataset had to be constructed manually, since there was no way to automatically extract data specifically about the visual appearance of a church and its surroundings from Wikipedia descriptions. The images were retrieved via the Google Maps API; thanks to "Genčių genealogija", it was possible to obtain the geographic coordinates of the churches and extract the exact images at those points. The document also describes different ways of addressing the limitations of working with long descriptive data and presents two solutions for handling long descriptions: token pooling and extended context lengths. The experiments demonstrated that fine-tuning CLIP with an increased context length outperformed token pooling, achieving higher cosine similarity scores (80% on the custom dataset and 86% on RSICD) and improved attention precision for complex queries.
Token pooling, while computationally efficient, produced dispersed attention and struggled with nuanced architectural descriptions, achieving only 61% cosine similarity. Dataset augmentation further enhanced generalization, with a smaller batch size of 32 yielding the best results (83% cosine similarity). Overall, the increased-context-length approach provided better alignment and understanding, highlighting the importance of longer textual inputs and dataset quality, as well as of larger training sets for fine-tuning.
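As an illustration of the matching step described above, the search/recommendation system can be viewed as ranking candidate satellite images by the cosine similarity between a CLIP text embedding of the prompt and precomputed CLIP image embeddings. The sketch below uses placeholder NumPy vectors in place of real CLIP embeddings (which in this work come from the fine-tuned model); the function names are illustrative, not part of the system's actual code:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_images(text_emb: np.ndarray, image_embs: list[np.ndarray]) -> list[int]:
    """Return image indices ordered from best to worst match to the prompt."""
    scores = [cosine_similarity(text_emb, e) for e in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: -scores[i])

# Placeholder embeddings standing in for CLIP outputs:
prompt = np.array([1.0, 0.0])
images = [np.array([1.0, 0.0]),   # strong match
          np.array([0.0, 1.0]),   # unrelated
          np.array([0.7, 0.7])]   # partial match
print(rank_images(prompt, images))  # best-to-worst order of image indices
```

In the actual system, the same ranking is applied to embeddings produced by the fine-tuned CLIP text and image encoders, which is where the reported cosine-similarity scores come from.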