This study explores text data compression as an epistemological paradigm through a comparative analysis of three fundamental approaches: traditional methods (Huffman coding + LZW), bit-based methods (arithmetic coding), and machine learning approaches (neural language models). Using a Project Gutenberg dataset of 15,000 classical literary works totaling 8.5 GB and 2.1 billion word tokens, the approaches are evaluated on compression ratio, execution time, and memory usage. The results reveal fundamental trade-offs among the paradigms. Traditional methods are the fastest and lightest (8.3 seconds/GB, 482 MB/s throughput, 52 MB memory) with a compression ratio of 3.2:1. Arithmetic coding attains near-optimal efficiency (99.5% of the Shannon bound) with a compression ratio of 3.8:1. Neural language models yield the highest compression ratio, 4.6:1, but require substantially more execution time and memory. The epistemological analysis highlights the distinct conceptions of information underlying the three paradigms (mechanistic, mathematically optimal, and semantically aware) and provides a conceptual framework for developing adaptive compression systems.
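As a concrete illustration of the evaluation metrics, the minimal Python sketch below (not the study's benchmark harness) computes a compression ratio, a throughput figure, and an order-0 Shannon-bound efficiency for a single text file. Python's zlib is used here only as a stand-in compressor, and the file name `pg_sample.txt` is a hypothetical placeholder for any Project Gutenberg text.

```python
# Sketch of the three reported metrics: compression ratio, throughput,
# and efficiency relative to an order-0 (memoryless) Shannon estimate.
# zlib stands in for the compressors compared in the study.
import math
import time
import zlib
from collections import Counter


def order0_shannon_bytes(data: bytes) -> float:
    """Order-0 Shannon estimate of compressed size in bytes (byte-level model)."""
    counts = Counter(data)
    n = len(data)
    entropy_bits = -sum(c / n * math.log2(c / n) for c in counts.values())
    return entropy_bits * n / 8


def evaluate(data: bytes) -> dict:
    start = time.perf_counter()
    compressed = zlib.compress(data, level=9)
    elapsed = time.perf_counter() - start
    bound = order0_shannon_bytes(data)
    return {
        "compression_ratio": len(data) / len(compressed),   # e.g. 3.2 means 3.2:1
        "throughput_MBps": len(data) / elapsed / 1e6,        # compression speed
        "shannon_efficiency": bound / len(compressed),       # 1.0 = order-0 optimum
    }


if __name__ == "__main__":
    sample = open("pg_sample.txt", "rb").read()  # hypothetical sample file
    print(evaluate(sample))
```

Context-modeling compressors (including zlib, arithmetic coders with higher-order models, and neural models) can exceed the order-0 estimate, so the efficiency figure above is illustrative rather than a strict bound.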