Extracting data from PDF templates remains a substantial challenge because the documents are semi-structured and their layouts vary. Large pre-trained models achieve high accuracy but require extensive computational resources and labeled datasets, which makes them impractical in resource-constrained environments. Rule-based approaches, conversely, are efficient but rigid. This research addresses the gap with an adaptive learning system that combines rule-based extraction with Conditional Random Fields (CRF) in a hybrid framework designed for data-scarce scenarios. The system runs extraction strategies in parallel, selects among their outputs by confidence, and uses Human-in-the-Loop (HITL) feedback for incremental learning: pattern learning updates the rule-based strategies, while the CRF models are retrained incrementally. Evaluated on synthetically generated documents across diverse template types, the system achieves 98.61% accuracy with minimal training data and a 7% user correction rate, demonstrating high learning efficiency (1.88 corrections per percentage point). The improvement is statistically significant (paired t-test, p < 0.001, Cohen’s d = 8.95). The system operates on CPU-only hardware with a 50-100 MB footprint and a processing time of 0.1-0.5 seconds. This work fills a practical gap in document extraction by providing a middle-ground solution that balances high accuracy, minimal data requirements, low resource consumption, and real-time adaptability, making it suitable for small organizations and for rapid deployment where large models are impractical. The evaluation uses synthetic data to ensure reproducibility and controlled assessment; validation on real-world documents would strengthen the case for practical applicability.
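The confidence-based selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the system's implementation: the `Candidate` type, both extractor functions, the fixed confidence values, and the `review_threshold` of 0.8 are all hypothetical, the CRF is replaced by a simple stand-in function, and the strategies are called one after another rather than truly in parallel.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    """One strategy's answer for a field (hypothetical structure)."""
    value: Optional[str]
    confidence: float  # assumed to lie in [0, 1]
    strategy: str


def rule_based_invoice_number(text: str) -> Candidate:
    # Hypothetical rule: an explicitly labelled invoice number.
    match = re.search(r"Invoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)", text)
    if match:
        return Candidate(match.group(1), 0.95, "rule")
    return Candidate(None, 0.0, "rule")


def stub_crf_invoice_number(text: str) -> Candidate:
    # Stand-in for a trained CRF tagger (assumption): returns the first
    # token shaped like an identifier, with a fixed, lower confidence.
    for token in text.split():
        if re.fullmatch(r"[A-Z]{2,}-\d+", token):
            return Candidate(token, 0.70, "crf_stub")
    return Candidate(None, 0.0, "crf_stub")


def select_best(
    text: str,
    strategies: List[Callable[[str], Candidate]],
    review_threshold: float = 0.8,  # assumed cut-off for human review
) -> Tuple[Candidate, bool]:
    """Run every strategy on the same text and keep the most confident answer.

    The boolean flags results that would be routed to a human reviewer.
    """
    candidates = [strategy(text) for strategy in strategies]
    best = max(candidates, key=lambda c: c.confidence)
    needs_review = best.value is None or best.confidence < review_threshold
    return best, needs_review


STRATEGIES = [rule_based_invoice_number, stub_crf_invoice_number]
labelled, labelled_review = select_best("Invoice No: INV-1042 Date 2026-01-05", STRATEGIES)
unlabelled, unlabelled_review = select_best("Ref INV-2000 total 10", STRATEGIES)
```

In this sketch a low-confidence or missing result is flagged rather than silently accepted; in a HITL setting, the correction supplied for such a flagged field is what would feed pattern learning and incremental CRF retraining.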
Copyrights © 2026