Designing effective fuel subsidy policies is a major challenge for governments seeking to balance energy affordability, fiscal sustainability, and environmental goals. This study introduces an adaptive simulation framework that combines Deep Q-Learning with a multi-armed bandit algorithm to model fuel consumption behavior and optimize subsidy distribution strategies. The framework simulates a dual-agent system in which a DQN-based consumer interacts with a bandit-driven government that selects among three subsidy policies: universal, quota-based, and targeted. By simulating consumer responses to these policies over 1,000 episodes, the framework demonstrates how policy can adapt in real time to maximize social welfare and reduce inefficient spending. The simulation tracks both consumer and government rewards across scenarios, capturing the trade-off between satisfaction and fiscal burden. Results show that while universal subsidies often deliver the highest consumer satisfaction, they incur significant fiscal costs, whereas quota-based and targeted approaches yield more balanced trade-offs; targeted subsidies, though less popular, often produce more sustainable outcomes. Because the agent-based approach adjusts policy decisions dynamically in response to feedback, the system captures the evolving nature of economic behavior. These findings underscore the potential of reinforcement learning to support adaptive policymaking, and of AI-driven simulations as decision-support tools for designing responsive, cost-efficient public policies.
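The dual-agent loop described above can be sketched in compact form. The sketch below is illustrative only: the per-policy satisfaction and fiscal-cost parameters are invented for demonstration, the government is modeled as an epsilon-greedy bandit, and a tabular Q-learner stands in for the paper's DQN consumer; none of these values or class names come from the study itself.

```python
import random

# Hypothetical per-policy parameters (NOT from the paper): mean consumer
# satisfaction and mean fiscal cost per unit of subsidized consumption.
POLICIES = ["universal", "quota", "targeted"]
SATISFACTION = {"universal": 1.0, "quota": 0.7, "targeted": 0.6}
FISCAL_COST = {"universal": 0.9, "quota": 0.5, "targeted": 0.3}

class EpsilonGreedyBandit:
    """Government agent: keeps a running mean reward per policy arm."""
    def __init__(self, arms, epsilon=0.1):
        self.arms = arms
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(self.arms)  # explore
        return max(self.arms, key=lambda a: self.values[a])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental running-mean update of the arm's estimated reward.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

class TabularConsumer:
    """Stand-in for the DQN consumer: Q-learning over (policy, action)."""
    ACTIONS = [0, 1, 2]  # low / medium / high fuel consumption

    def __init__(self, alpha=0.2, epsilon=0.1):
        self.q = {(p, a): 0.0 for p in POLICIES for a in self.ACTIONS}
        self.alpha, self.epsilon = alpha, epsilon

    def act(self, policy):
        if random.random() < self.epsilon:
            return random.choice(self.ACTIONS)
        return max(self.ACTIONS, key=lambda a: self.q[(policy, a)])

    def learn(self, policy, action, reward):
        key = (policy, action)
        self.q[key] += self.alpha * (reward - self.q[key])

def run(episodes=1000, seed=0):
    random.seed(seed)
    gov, consumer = EpsilonGreedyBandit(POLICIES), TabularConsumer()
    for _ in range(episodes):
        policy = gov.select()                 # government picks a subsidy arm
        action = consumer.act(policy)         # consumer picks consumption level
        # Consumer satisfaction rises with consumption under generous subsidies.
        consumer_reward = SATISFACTION[policy] * (1 + action) + random.gauss(0, 0.1)
        # Government welfare: satisfaction minus fiscal cost of consumption.
        gov_reward = consumer_reward - FISCAL_COST[policy] * (1 + action)
        consumer.learn(policy, action, consumer_reward)
        gov.update(policy, gov_reward)
    return gov.values

print(run())
```

Under these invented parameters the consumer's reward is maximized by the universal arm, while the government's net reward favors the targeted arm, mirroring the satisfaction-versus-fiscal-burden trade-off the abstract describes.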