PRODUCTION · GOVERNMENT

Arabic Government QA System

End-to-end Arabic NLP question-answering system. Semantic retrieval over 1,200+ official documents with <2s response latency.

System Architecture & Overview

Built during an internship at the Ministry of Higher Education and Scientific Research, this QA system delivers semantic search and answer generation over 1,200+ official cabinet circulars, executive decisions, and ministry documents.

Standard search engines inside government intranets rely heavily on exact keyword matches, leading to poor document discovery rates for employees who phrase queries naturally.

We built an end-to-end bilingual Arabic question-answering system that uses deep learning representations to understand user intent, look up the corresponding source paragraphs, and summarize answers grounded entirely in verified government text.

Key Deliverables & Capabilities

Semantic Intent Matching: Interprets complex Arabic queries by meaning rather than by exact word-for-word string matching.
Arabic Preprocessing Pipeline: Integrates customized stemmers and root-based analyzers to strip noise from queries.
Direct Document Referencing: Generates grounded answers that explicitly cite the source document, page, and article.
Admin Dashboard: Empowers ministry staff to upload new PDFs, triggering automatic text extraction and vector index updates.
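
One way the direct document referencing described above could be represented is sketched below. The `SourcePassage` type, its field names, and the citation layout are hypothetical and chosen only for illustration; the original system's data model is not documented here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SourcePassage:
    """A retrieved passage plus the metadata needed to cite it (hypothetical schema)."""
    document_title: str
    page: int
    article: str
    text: str


def format_citation(passage: SourcePassage) -> str:
    """Render document, page, and article as a single citation string."""
    return f"{passage.document_title}, p. {passage.page}, Article {passage.article}"


def build_grounded_answer(answer: str, passages: List[SourcePassage]) -> str:
    """Attach a numbered source list to an answer.

    Raises ValueError when no passages are supplied, so an answer is never
    returned without at least one supporting source.
    """
    if not passages:
        raise ValueError("an answer must be backed by at least one source passage")

    citations: List[str] = []
    for passage in passages:
        citation = format_citation(passage)
        if citation not in citations:  # skip duplicate citations, keep order
            citations.append(citation)

    lines = [answer, "", "Sources:"]
    lines += [f"[{n}] {c}" for n, c in enumerate(citations, start=1)]
    return "\n".join(lines)
```

Refusing to build an answer when the passage list is empty is a design choice in this sketch: it keeps every response tied to at least one citable document.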

Critical Challenge & Pivot

Arabic is a morphologically rich language with extensive clitics and affixes, and this surface variation severely degrades vector search quality when using off-the-shelf multilingual models. We addressed this by pairing Arabic-specific AraBERT embeddings with strict morphological normalization using CAMeL Tools.
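
To illustrate why normalization matters for retrieval, here is a minimal standard-library sketch of orthographic normalization applied before embedding. It is a stand-in, not the project's actual pipeline: the function name `normalize_arabic` and the specific rule set are assumptions. CAMeL Tools provides comparable helpers in its `camel_tools.utils.normalize` and `camel_tools.utils.dediac` modules. Clitic segmentation, which needs a morphological analyzer, is not covered here.

```python
import re
import unicodedata

# Fathatan through sukun (U+064B..U+0652) plus superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Alef with hamza above/below, alef with madda, alef wasla.
_ALEF_VARIANTS = re.compile("[\u0623\u0625\u0622\u0671]")
_TATWEEL = "\u0640"  # elongation character used for justification


def normalize_arabic(text: str) -> str:
    """Collapse common spelling variants so equivalent words share one form."""
    # Fold presentation forms and compatibility characters to base letters.
    text = unicodedata.normalize("NFKC", text)
    # Remove short vowels and other diacritics.
    text = _DIACRITICS.sub("", text)
    # Remove tatweel.
    text = text.replace(_TATWEEL, "")
    # Unify alef variants to bare alef (U+0627).
    text = _ALEF_VARIANTS.sub("\u0627", text)
    # Alef maksura (U+0649) -> yeh (U+064A).
    text = text.replace("\u0649", "\u064A")
    # Teh marbuta (U+0629) -> heh (U+0647).
    text = text.replace("\u0629", "\u0647")
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

Applying the same function to both documents and queries means a query typed without hamza or diacritics maps to the same string as the fully vocalized form in an official document.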

System Benchmarks & Outcomes

Deployed in hybrid production environments, indexing 1,200+ highly sensitive cabinet documents. Achieved an average end-to-end latency (search plus answer generation) of <2 seconds across 5+ government sub-agencies.

Engineering Stack

AraBERT

Deployed as the core transformer embedding model, optimized for capturing fine-grained Arabic semantic relationships.

CAMeL Tools

Leveraged for morphological analysis, tokenization, orthographic normalization, and Arabic NLP preprocessing.

FAISS

Configured for fast nearest-neighbor retrieval over dense embeddings of the document text.
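
The retrieval step can be pictured with the following pure-Python sketch of exact inner-product search over L2-normalized vectors, which is what a flat inner-product index in FAISS computes (far more efficiently). The class `FlatInnerProductIndex` is a hypothetical illustration, not the FAISS API, and the index type used in the deployed system is not stated in this write-up.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class FlatInnerProductIndex:
    """Brute-force stand-in for an exact dense vector index (illustrative only)."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self._vectors: List[List[float]] = []

    def add(self, vectors: Sequence[Sequence[float]]) -> None:
        """Store normalized copies of the given document embeddings."""
        for vec in vectors:
            if len(vec) != self.dim:
                raise ValueError(f"expected dimension {self.dim}, got {len(vec)}")
            self._vectors.append(l2_normalize(vec))

    def search(self, query: Sequence[float], k: int) -> List[Tuple[float, int]]:
        """Return up to k (score, position) pairs, highest score first."""
        q = l2_normalize(query)
        scored = (
            (sum(a * b for a, b in zip(q, vec)), position)
            for position, vec in enumerate(self._vectors)
        )
        return heapq.nlargest(k, scored)
```

The returned positions would then be mapped back to passage records (document, page, article) so the answer generator can cite its sources.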

FastAPI

Managed asynchronous service requests, delivering high concurrency under peak internal ministry workloads.

Specifications

Deployment Stage: Production Ready

Access Level: Open Source / MIT

Testing Coverage: > 90% Pass