Purpose – Research on cryptocurrency price forecasting with machine learning (ML) and deep learning (DL) has produced 48 Scopus-indexed journal articles since 2018, yet the same LSTM architecture applied to Bitcoin daily closing prices yields mean absolute percentage errors ranging from 1.7% to 4.8% across papers in this corpus. This review examines why the literature fails to accumulate knowledge despite growing output and identifies the evaluation practices responsible for that failure.
Design/methodology/approach – A PRISMA 2020-compliant search of Scopus retrieved 48 peer-reviewed English-language articles on ML and DL applications to cryptocurrency price prediction published between 2018 and 2025. All 48 articles were retained after dual-reviewer screening (κ = 0.86) and quality appraisal with the Mixed Methods Appraisal Tool at the ≥10/16 threshold. Structured data extraction covered architecture type, target coin, forecast horizon, evaluation metric, and train/test split specification.
Findings – Five evaluation failure modes affect 39 of 48 articles: calendar concealment (47.9%), split inconsistency (37.5%), normalisation silence (33.3%), baseline heterogeneity (25.0%), and single-regime evaluation (100%). CNN-LSTM hybrids outperform standalone LSTM in 9 of the 12 studies that test both, yet neither this finding nor the 6× growth ratio for Transformer models can be verified across studies because evaluation conditions are not shared.
Originality/value – This is the first PRISMA 2020-compliant systematic review of cryptocurrency ML forecasting. It introduces a five-mode taxonomy of evaluation failures and proposes a regime-stratified evaluation design that prescribes three mandatory calendar-anchored test periods (the 2021 bull run, the 2022 FTX collapse, and the 2024 institutional entry period) as the minimum standard for deployment-relevant performance claims.
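As an illustration of the regime-stratified evaluation design proposed above, the following minimal Python sketch reports one mean absolute percentage error (MAPE) per calendar-anchored test period rather than a single pooled figure. It is a sketch under stated assumptions, not the review's protocol: the abstract names the three regimes but gives no exact dates, so the calendar-year windows in `REGIMES` are placeholders, and the function names `mape` and `regime_stratified_mape` are hypothetical.

```python
from datetime import date

# ASSUMPTION: placeholder calendar-year windows. The abstract names the three
# regimes but does not state their start and end dates.
REGIMES = {
    "2021 bull run": (date(2021, 1, 1), date(2021, 12, 31)),
    "2022 FTX collapse": (date(2022, 1, 1), date(2022, 12, 31)),
    "2024 institutional entry": (date(2024, 1, 1), date(2024, 12, 31)),
}


def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty series of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def regime_stratified_mape(rows, regimes=REGIMES):
    """Compute one MAPE per regime window.

    rows: iterable of (day, actual_price, predicted_price) tuples.
    Returns a dict mapping regime name to MAPE, or to None when the
    regime window contains no observations.
    """
    rows = list(rows)
    report = {}
    for name, (start, end) in regimes.items():
        # Keep only observations whose date falls inside this window.
        window = [(a, p) for d, a, p in rows if start <= d <= end]
        if not window:
            report[name] = None
            continue
        actual, predicted = zip(*window)
        report[name] = mape(actual, predicted)
    return report
```

Reporting the three values side by side, with the window dates stated explicitly, is what would let a reader compare results across papers. A `None` entry makes a missing regime visible instead of silently averaging it away.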