Journal of Information and Technology
Vol 14 No 01 (2026): Journal of Information and Technology

Self-Supervised Customer Representation Learning for Segmentation and Next-Purchase Prediction on UCI Online Retail

Xin, Qi (Unknown)



Article Info

Publish Date
08 Apr 2026

Abstract

Customer analytics in financial retail, payments, and bank marketing frequently relies on segmentation and propensity prediction, but transactional logs are sparse, high-dimensional, and only weakly labeled. This paper presents a fast, reproducible self-supervised learning pipeline that converts raw e-commerce transactions into customer representations and evaluates them on two downstream tasks: customer segmentation and next-purchase prediction. We conduct a full experimental evaluation on the UCI Online Retail dataset (541,909 invoice-line transactions from 2010-12-01 to 2011-12-09). After deterministic cleaning (removing cancellations and rows with non-positive prices or quantities), 397,884 valid line items remain, spanning 4,338 customers, 18,532 invoices, 3,665 products, and 37 countries. For each customer, we construct an ordered invoice sequence and define a canonical item per invoice (the item with the largest aggregated quantity). For each invoice transition, we build a dual-view customer state vector that concatenates a lifetime purchase-count view with a recent-window (30-day) view, then learn embeddings via TF-IDF reweighting and truncated SVD. To increase robustness, we introduce a denoising ridge projection (DRP) objective: a linear denoising model trained, without labels, to map corrupted TF-IDF state vectors back to clean SVD embeddings, which yields denoised customer embeddings for downstream models. Our main contribution is an applied, computationally light integration of TF-IDF+SVD embeddings with a denoising linear projection for reuse across segmentation and next-purchase prediction, rather than a fundamentally new learning paradigm. In next-purchase prediction restricted to the 200 most frequent target items, a multinomial logistic model trained on DualDRP embeddings achieves Hit@20=0.587, outperforming MostPopular (Hit@20=0.327) and Markov (Hit@20=0.291).
In segmentation, we apply k-means clustering and analyze cluster-level RFM (recency, frequency, monetary) statistics and dominant products, showing that the learned embeddings recover actionable segments such as high-value frequent buyers and low-activity long-tail customers. All results, tables, and figures are generated with fixed random seeds and are reproducible in this environment.
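
The embedding steps described in the abstract (dual-view state vector, TF-IDF reweighting, truncated SVD, denoising ridge projection) can be illustrated with a minimal NumPy sketch. This is not the authors' code. The abstract does not specify the corruption scheme, the regularization strength, or the embedding dimension, so the random feature dropout, `drop_prob`, `alpha`, `k`, and the toy Poisson counts below are all assumptions chosen for illustration, as are the function names.

```python
import numpy as np


def tfidf(counts):
    """Reweight a (states x features) count matrix with a smoothed IDF
    and L2-normalise each row."""
    n = counts.shape[0]
    df = (counts > 0).sum(axis=0)                # document frequency per feature
    idf = np.log((1.0 + n) / (1.0 + df)) + 1.0   # smoothed IDF, always finite
    x = counts * idf
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)          # guard against all-zero rows


def truncated_svd(x, k):
    """Return rank-k embeddings U_k * S_k from a dense SVD."""
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :k] * s[:k]


def fit_drp(x_clean, z_clean, drop_prob=0.2, alpha=1.0, seed=0):
    """Fit a ridge map W so that corrupted inputs approximate clean embeddings.

    Corruption here is random feature dropout (an assumption); the closed-form
    ridge solution is W = (Xn^T Xn + alpha I)^-1 Xn^T Z. No labels are used.
    """
    rng = np.random.default_rng(seed)
    keep = rng.random(x_clean.shape) >= drop_prob
    x_noisy = x_clean * keep
    d = x_noisy.shape[1]
    gram = x_noisy.T @ x_noisy + alpha * np.eye(d)
    return np.linalg.solve(gram, x_noisy.T @ z_clean)


# Toy data standing in for per-transition purchase counts (hypothetical).
rng = np.random.default_rng(0)
older = rng.poisson(1.0, size=(50, 12)).astype(float)
recent = rng.poisson(0.3, size=(50, 12)).astype(float)   # e.g. last 30 days
lifetime = older + recent                                # recent is part of lifetime

state = np.hstack([lifetime, recent])   # dual-view state vector
x = tfidf(state)                        # TF-IDF reweighting
z = truncated_svd(x, k=4)               # clean SVD embeddings (targets)
w = fit_drp(x, z)                       # denoising ridge projection
z_denoised = x @ w                      # embeddings passed to downstream models
```

In this sketch the ridge penalty keeps the projection well conditioned even when many features are dropped, and the whole fit is one linear solve, which is consistent with the abstract's emphasis on a computationally light pipeline. The resulting `z_denoised` rows would then be fed to k-means or a multinomial logistic model.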

Copyrights © 2026






Journal Info

Abbrev

J-INTECH

Publisher

Subject

Computer Science & IT

Description

Journal of Information and Technology is a journal published by Bhinneka Nusantara University, Malang. The scope of this journal includes IT Governance, IS Strategic Planning, IS Theory and Practices, Management Information System, IT Project Management, Distance Learning, E-Government, Information ...