Policy and compliance systems increasingly express rules in natural language, yet enforcement requires deterministic decisions and auditable explanations. This paper studies a practical pipeline that converts natural-language facts and rules into a verifiable knowledge base, answers queries with three-valued semantics (True/False/Unknown), and produces machine-checkable proofs. The contribution is system-level rather than a new reasoning formalism: we integrate controlled-language parsing, symbolic proof extraction, independent proof checking, and proof-based supervision in a single auditable framework. We evaluate the pipeline on two natural-language rule-reasoning benchmarks: (i) a balanced subset of ProofWriter’s open-world-assumption tasks (360 train, 360 test), and (ii) a RuleTaker-style dataset generated from its grammar and label semantics (1800 train, 900 test), both balanced across reasoning depths 0–5. We compare a text-only logistic regression baseline, a retrieval-based “proof” baseline, a symbolic forward-chaining reasoner with proof extraction, and a proof-trained classifier using generated proofs. To ensure fairness, LR-text and LR-proof share the same TF-IDF/logistic-regression setup, and the retrieval baseline uses the same representation with a fixed top-4 configuration. On ProofWriter-Balanced, the symbolic reasoner achieves 0.803 accuracy (0.808 macro-F1), while proof-trained classification reaches 0.825 accuracy (0.825 macro-F1). On RuleTaker-Rep, both methods achieve 1.000 accuracy. Proof verifiability clearly separates faithful from post-hoc explanations: symbolic proofs are verifiable for all predictions, whereas retrieval-based proofs are verifiable for only 31.4%. Sensitivity analyses varying reasoning depth, distractors, and proof corruption show that proof-based methods remain robust to noise but depend on proof integrity. These findings demonstrate the feasibility of auditable natural-language policy reasoning in controlled settings, while highlighting limitations in parser coverage and benchmark regularity.