Arabic Government QA System
End-to-end Arabic NLP question-answering system. Semantic retrieval over 1,200+ official documents with <2s response latency.
System Architecture & Overview
Built during an internship at the Ministry of Higher Education and Scientific Research, this QA system delivers semantic search and answer generation over 1,200+ official cabinet circulars, executive decisions, and ministry documents.
Standard search engines inside government intranets rely heavily on exact keyword matches, leading to poor document discovery rates for employees who phrase queries naturally.
We built an end-to-end bilingual Arabic question-answering system that uses deep learning representations to understand user intent, look up the corresponding source paragraphs, and summarize answers grounded entirely in verified government text.
Key Deliverables & Capabilities
- Semantic Intent Matching: Matches complex Arabic inquiries by meaning rather than by rigid word-for-word string matching.
- Arabic Preprocessing Pipeline: Integrates customized stemmers and root-based analyzers to reduce noise in queries.
- Direct Document Referencing: Generates grounded answers that explicitly cite the source document, page, and article.
- Admin Dashboard: Lets ministry staff upload new PDFs, which triggers automatic text extraction and vector index updates.
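As a rough illustration of how semantic matching, cited answers, and index updates on upload could fit together, the sketch below keeps an in-memory index of passages along with their source metadata. Everything in it is an assumption made for illustration: the `Passage` and `PassageIndex` names, the `cite` helper, the document titles, and the hand-written two-dimensional vectors that stand in for the real embedding model. The source does not describe the actual index or data model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    """One indexed paragraph plus the metadata needed to cite it."""
    text: str
    document: str
    page: int
    article: str


class PassageIndex:
    """Toy in-memory vector index (hypothetical; not the deployed index)."""

    def __init__(self, embed):
        self._embed = embed      # callable: str -> list[float]
        self._entries = []       # list of (unit vector, Passage)

    @staticmethod
    def _unit(vec):
        # Scale to unit length so a dot product equals cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else list(vec)

    def add(self, passage):
        """Embed and store a passage, e.g. after a new PDF is uploaded."""
        self._entries.append((self._unit(self._embed(passage.text)), passage))

    def search(self, query, k=3):
        """Return up to k (score, passage) pairs, best cosine match first."""
        q = self._unit(self._embed(query))
        scored = [(sum(a * b for a, b in zip(q, v)), p) for v, p in self._entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def cite(passage):
    """Format the source reference attached to a generated answer."""
    return f"{passage.document}, p. {passage.page}, {passage.article}"


# Hypothetical stand-in for the embedding model: hand-written 2-d vectors.
LEAVE_TEXT = "Employees are entitled to thirty days of annual leave."
BUDGET_TEXT = "Laboratory budgets are allocated each fiscal year."
QUESTION = "How much vacation do I get?"
FAKE_VECTORS = {
    LEAVE_TEXT: [0.9, 0.1],
    BUDGET_TEXT: [0.1, 0.9],
    QUESTION: [0.8, 0.2],
}

index = PassageIndex(FAKE_VECTORS.__getitem__)
index.add(Passage(LEAVE_TEXT, "Circular 12", 3, "Article 4"))
index.add(Passage(BUDGET_TEXT, "Decision 7", 1, "Article 2"))

score, best = index.search(QUESTION, k=1)[0]
print(cite(best))  # Circular 12, p. 3, Article 4
```

The question shares no keyword with the matching passage; the match comes only from vector proximity, which is the behavior that exact keyword search cannot provide. Keeping the document, page, and article next to each vector is one simple way to make every retrieved passage citable.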
Critical Challenge & Pivot
Arabic is a morphologically rich language with extensive clitics and affixes: a single written word can bundle a conjunction, a preposition, a noun, and a possessive pronoun. Off-the-shelf multilingual models handle these surface variations poorly, which severely degrades vector search quality. We solved this by pairing specialized AraBERT encodings with strict morphological normalization using CAMeL Tools.
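To make the normalization step more concrete, here is a minimal sketch of orthographic normalization in plain Python. It is a stand-in written for illustration, not the CAMeL Tools API and not the project's actual rule set, which the source does not list. The function name `normalize_arabic` is hypothetical, and the specific mappings (unifying alef variants, alef maksura to yeh, teh marbuta to heh) are common but lossy conventions assumed here. Clitic segmentation, which needs a morphological analyzer, is not covered.

```python
import re

# Short vowels, tanween, shadda, sukun (U+064B..U+0652) and dagger alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # elongation character used for visual stretching only

# Assumed letter unifications; each one discards a spelling distinction.
_CHAR_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maksura          -> yeh
    "\u0629": "\u0647",  # teh marbuta           -> heh
})


def normalize_arabic(text):
    """Reduce common spelling variation so equivalent words compare equal."""
    text = _DIACRITICS.sub("", text)      # drop optional vowel marks
    text = text.replace(_TATWEEL, "")     # drop elongation
    text = text.translate(_CHAR_MAP)      # unify letter variants
    return " ".join(text.split())         # collapse runs of whitespace


# Two spellings of the same name: with and without hamza on the alef.
with_hamza = "\u0623\u062D\u0645\u062F"
without_hamza = "\u0627\u062D\u0645\u062F"
print(normalize_arabic(with_hamza) == normalize_arabic(without_hamza))  # True
```

Applying the same function to both documents at indexing time and queries at search time matters more than the exact rule set: the two sides only need to agree.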
System Benchmarks & Outcomes
Deployed in hybrid production environments, indexing 1,200+ highly sensitive cabinet documents. Achieved an average search-and-generation response latency of under 2 seconds across 5+ government sub-agencies.
Engineering Stack
- AraBERT: Deployed as the core transformer embedding model, optimized for capturing fine-grained Arabic semantic relationships.
- CAMeL Tools: Leveraged for morphological analysis, tokenization, orthographic normalization, and Arabic NLP preprocessing.
- Vector index: Configured for fast retrieval over dense document text encodings.
- Asynchronous service layer: Handles service requests asynchronously to sustain high concurrency under peak internal ministry workloads.
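The asynchronous request handling can be pictured with the standard-library sketch below. The source does not name the web framework or describe its concurrency settings, so all names here (`answer_blocking`, `handle_request`, `serve_batch`) and the limit of four concurrent inferences are hypothetical. The idea shown is to keep the event loop free by running the blocking retrieve-and-generate step in a worker thread, while a semaphore caps how many such steps run at once.

```python
import asyncio
import time

MAX_CONCURRENT_INFERENCES = 4  # hypothetical cap, not a value from the project


def answer_blocking(question):
    """Placeholder for the blocking retrieve-and-generate step."""
    time.sleep(0.05)  # simulate model and index latency
    return f"answer to: {question}"


async def handle_request(question, limiter):
    """Serve one request without blocking the event loop."""
    async with limiter:
        # Run the blocking call in a worker thread (Python 3.9+).
        return await asyncio.to_thread(answer_blocking, question)


async def serve_batch(questions):
    """Handle many requests concurrently, bounded by the semaphore."""
    limiter = asyncio.Semaphore(MAX_CONCURRENT_INFERENCES)
    tasks = (handle_request(q, limiter) for q in questions)
    return await asyncio.gather(*tasks)  # results keep the input order


if __name__ == "__main__":
    print(asyncio.run(serve_batch(["q1", "q2", "q3"])))
```

Bounding concurrency this way trades a little queueing delay for predictable memory and compute use under peak load, which is one plausible way to keep average latency stable.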