A distributed RAG-based framework for automated extraction of information from multiple types of resources

El Gemayel, Charbel and El Gemayel, Joseph and Constantin, Joseph (2026) A distributed RAG-based framework for automated extraction of information from multiple types of resources. In: 2025 IEEE/ACS 22nd International Conference on Computer Systems and Applications (AICCSA). IEEE, QAT. ISBN 979-8-3315-5693-8

Text. Filename: El-Gemayel-etal-2025-A-distributed-RAG-based-framework-for-automated-extraction.pdf
Accepted Author Manuscript
License: Creative Commons Attribution 4.0


Abstract

Accessing authoritative information in areas such as healthcare, cybersecurity, and artificial intelligence remains a challenge due to the heterogeneity of data sources and the varying credibility of content. With the increasing integration of advanced technologies into daily life, there is an urgent need for systems that can streamline the retrieval of information and extraction of knowledge from different formats. In this paper, we present a distributed, Retrieval-Augmented Generation (RAG)-based framework that aims to automate the extraction and structuring of information from multimodal resources, such as websites, PDFs, images, audio, and video. The framework supports real-time data processing and is optimized for the creation of open data sets in any subject area. To validate our approach, we applied it to cigars and beverages, using content from online articles, reviews, and posts. Our results show the framework’s potential to simplify data integration, improve usability, and enable scalable, contextual knowledge generation.

ORCID iDs

El Gemayel, Charbel, El Gemayel, Joseph ORCID: https://orcid.org/0009-0004-4518-3071 and Constantin, Joseph