Xin, Qi
Unknown Affiliation

Journal : Journal of Information Technology

Self-Supervised Customer Representation Learning for Segmentation and Next-Purchase Prediction on UCI Online Retail
Xin, Qi
J-INTECH (Journal of Information and Technology) Vol 14 No 01 (2026): Journal of Information and Technology
Publisher : LPPM Universitas Bhinneka Nusantara

DOI: 10.32664/j-intech.v14i01.2229

Abstract

Customer analytics in financial retail, payments, and bank marketing frequently relies on segmentation and propensity prediction, but transactional logs are sparse, high-dimensional, and only weakly labeled. This paper presents a fast and reproducible self-supervised learning pipeline that converts raw e-commerce transactions into customer representations and evaluates them on two downstream tasks: customer segmentation and next-purchase prediction. We conduct a full experimental evaluation on the UCI Online Retail dataset (541,909 invoice-line transactions from 2010-12-01 to 2011-12-09). After deterministic cleaning (removing cancellations and non-positive prices/quantities), 397,884 valid line items remain, spanning 4,338 customers, 18,532 invoices, 3,665 products, and 37 countries. For each customer, we construct an ordered invoice sequence and define a canonical item per invoice (the item with the largest aggregated quantity). For each invoice transition, we build a dual-view customer state vector that concatenates a lifetime purchase-count view and a recent-window view (30 days), then learn embeddings via TF-IDF reweighting and truncated SVD. To increase robustness, we introduce a denoising ridge projection (DRP) objective: a linear denoising model trained to map corrupted TF-IDF state vectors back to clean SVD embeddings without using labels, which yields denoised customer embeddings for downstream models. Our main contribution is an applied, computationally light integration of TF-IDF+SVD embeddings with a denoising linear projection for reuse across segmentation and next-purchase prediction, rather than a fundamentally new learning paradigm. In next-purchase prediction restricted to the 200 most frequent target items, a multinomial logistic model trained on DualDRP embeddings achieves Hit@20=0.587, outperforming MostPopular (Hit@20=0.327) and Markov (Hit@20=0.291).
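The embedding steps described above can be illustrated with a minimal NumPy sketch. It is not the author's implementation: the toy count matrices, the smoothed IDF formula, the random-dropout corruption, the ridge strength, and all function names (`dual_view`, `tfidf`, `fit_drp`, `hit_at_k`) are assumptions made for illustration only. The sketch concatenates a lifetime view and a recent-window view, applies TF-IDF, takes a truncated SVD, fits a closed-form ridge map from corrupted TF-IDF vectors to the clean SVD embeddings, and computes Hit@k.

```python
import numpy as np

def dual_view(lifetime_counts, recent_counts):
    """Concatenate lifetime and recent-window item counts into one state vector per row."""
    return np.hstack([lifetime_counts, recent_counts])

def tfidf(counts):
    """L2-normalised TF-IDF of a (states x features) count matrix (smoothed IDF, an assumption)."""
    n_states = counts.shape[0]
    df = (counts > 0).sum(axis=0)
    idf = np.log((1.0 + n_states) / (1.0 + df)) + 1.0
    x = counts * idf
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

def fit_drp(counts, dim=2, drop_prob=0.3, ridge=1.0, seed=0):
    """Fit a linear map from corrupted TF-IDF vectors to clean truncated-SVD embeddings."""
    rng = np.random.default_rng(seed)
    x = tfidf(counts)
    # Truncated SVD: project onto the top `dim` right singular vectors.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    z_clean = x @ vt[:dim].T
    # Corruption model (assumed): zero out each entry independently with prob. drop_prob.
    x_noisy = x * (rng.random(x.shape) >= drop_prob)
    # Closed-form ridge regression: W = (X~^T X~ + lambda I)^{-1} X~^T Z.
    gram = x_noisy.T @ x_noisy + ridge * np.eye(x.shape[1])
    w = np.linalg.solve(gram, x_noisy.T @ z_clean)
    return w, z_clean

def hit_at_k(scores, targets, k):
    """Fraction of rows whose true target index is among the k highest-scoring items."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float((topk == targets[:, None]).any(axis=1).mean())

# Toy data: 6 customer states, 4 items per view (values are made up).
lifetime = np.array([[5, 0, 1, 0], [4, 1, 0, 0], [6, 0, 2, 0],
                     [0, 3, 0, 4], [0, 2, 1, 5], [1, 4, 0, 3]], dtype=float)
recent = np.array([[1, 0, 0, 0], [1, 0, 0, 0], [2, 0, 1, 0],
                   [0, 1, 0, 1], [0, 0, 0, 2], [0, 1, 0, 1]], dtype=float)

states = dual_view(lifetime, recent)
w, z_clean = fit_drp(states, dim=2)
z_denoised = tfidf(states) @ w  # denoised embeddings for downstream models
```

In this sketch the denoised embeddings `z_denoised` would be the inputs to a downstream classifier such as the multinomial logistic model mentioned above; no labels are used when fitting `w`.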
For segmentation, we apply k-means clustering and analyze cluster-level RFM statistics and dominant products, showing that the learned embeddings recover actionable segments such as high-value frequent buyers and low-activity long-tail customers. All results, tables, and figures are generated with fixed random seeds and are reproducible in this environment.
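The segmentation step can be sketched in the same spirit. The example below is an assumption-laden illustration, not the paper's code: it runs a plain Lloyd-style k-means with a fixed seed on made-up two-dimensional embeddings and then averages hypothetical RFM values (recency in days, frequency in invoices, monetary value) per cluster. The function names `kmeans` and `rfm_by_cluster` and all numbers are invented for the example.

```python
import numpy as np

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with a fixed seed; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            x[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def rfm_by_cluster(rfm, labels):
    """Mean (recency, frequency, monetary) per cluster label."""
    return {int(j): rfm[labels == j].mean(axis=0) for j in np.unique(labels)}

# Toy customer embeddings: two well-separated groups (values are made up).
embeddings = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
# Hypothetical RFM per customer: recency (days), frequency (invoices), monetary.
rfm = np.array([[200.0, 1.0, 40.0], [180.0, 2.0, 55.0], [220.0, 1.0, 30.0],
                [5.0, 20.0, 3000.0], [3.0, 25.0, 4200.0], [8.0, 18.0, 2800.0]])

labels, centers = kmeans(embeddings, k=2)
summary = rfm_by_cluster(rfm, labels)
```

Reading `summary` per cluster is how segments such as frequent high-value buyers versus low-activity customers would be characterised in this toy setting.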