Table of Links
Abstract and 1 Introduction
- Literature Review
- Model
- Experiments
- Deployment Journey
- Future Directions and References
4 EXPERIMENTS
In this section, we conduct experiments to answer the following questions: Q1: What is the effectiveness of the proposed method? Does it outperform state-of-the-art NBR/BIA methods? Q2: How well does the method scale to generate recommendations for millions of users? Q3: How is model performance impacted by the input features? Q4: How do the training and testing date ranges change the performance of the model?
4.1 Experimental Settings
4.1.1 Datasets. We use four publicly available datasets, shown in Table 1, to compare the performance of the proposed method with existing methods in the literature: ValuedShopper[2], Instacart[3], Dunnhumby[4], and TaFeng[5]. We also evaluate on an internal dataset consisting of the sales history of users at a large retailer, with around 100M users and 3M products.
4.1.2 Evaluation Protocol. We use recall@K and NDCG@K metrics to evaluate and compare our methods. Recall@K measures the fraction of ground-truth items (those the customer bought in their last trip) that are rightly ranked among the top-K items, averaged over all testing sessions. NDCG is a ranking-based measure that takes into account the order of purchased items in the recommendations and produces a score between 0 and 1. We use the past baskets of a given customer to predict their last basket. We use 80% of customers' data to train the model and the remaining 20% to test, with 5-fold cross validation. We reserve 10% of the training data as a validation set for hyper-parameter tuning in all methods.
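For concreteness, here is a minimal sketch of the two metrics as we use them, with binary relevance against the held-out last basket; the function names and data layout are our own illustration, not the exact evaluation code:

```python
import numpy as np

def recall_at_k(recommended, ground_truth, k):
    """Fraction of the user's last-basket items that appear in the top-k recommendations."""
    hits = len(set(recommended[:k]) & set(ground_truth))
    return hits / len(ground_truth)

def ndcg_at_k(recommended, ground_truth, k):
    """Binary-relevance NDCG: discounted gain of hits, normalized by the ideal ordering."""
    truth = set(ground_truth)
    dcg = sum(1.0 / np.log2(i + 2) for i, item in enumerate(recommended[:k]) if item in truth)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(truth), k)))
    return dcg / ideal

# Example: the ground truth is the customer's last basket.
print(recall_at_k(["milk", "eggs", "soap"], {"milk", "bread"}, k=3))  # 0.5
print(ndcg_at_k(["milk", "eggs", "soap"], {"milk", "bread"}, k=3))    # ~0.61
```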
4.1.3 Baselines.
(1) TopSell: It uses the most frequently purchased items across all users as the recommendations for every user.
(2) FBought: It uses the items most frequently purchased by a user as the recommendations to that user (see the sketch after this list).
(3) userKNN [16]: It uses classical collaborative filtering based on kNN. All the items in the historical baskets of a user are merged into a set of items.
(4) RepeatNet [18]: An RNN-based model for session-based recommendation that captures the repeat purchase behavior of users using GRUs and attention. To apply this method, user baskets are flattened into a sequence of items.
(5) FPMC [19]: Matrix factorization uses all data to learn the general taste of the user, whereas Markov chains capture sequence effects over time. FPMC combines both for the next basket recommendation problem.
(6) DREAM [21]: The Dynamic REcurrent bAsket Model (DREAM) not only learns a dynamic representation of a user but also captures global sequential features across baskets.
(7) SHAN [20]: A deep model based on hierarchical attention networks. It partitions the historical baskets into long-term and short-term parts and attentively learns long-term and short-term preferences from the corresponding items.
(8) Sets2Sets [12]: A state-of-the-art end-to-end RNN-based method for predicting the following multiple baskets. Repeat purchase patterns are also integrated into the method.
(9) RCP [2]: Repeat Customer Probability (RCP) estimates the repeat probability of an item and identifies repeat items based on it.
(10) ATD [2]: The Aggregate Time Distribution model fits a time distribution to capture the probability distribution and time characteristics of repeat items.
(11) PG [2]: A Poisson-Gamma distribution fitted to predict aggregate purchasing behavior.
(12) MPG [2]: A modified PG distribution that makes the results time dependent and integrates repeat customer probability.
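As a concrete reference point, a minimal sketch of the FBought baseline follows; the function name and basket representation are our own assumptions:

```python
from collections import Counter

def fbought(user_baskets, k=10):
    """Rank a user's previously purchased items by their personal purchase frequency."""
    counts = Counter(item for basket in user_baskets for item in basket)
    return [item for item, _ in counts.most_common(k)]

# Example: the user's own history drives the recommendation.
history = [["milk", "eggs"], ["milk", "bread"], ["milk", "eggs", "soap"]]
print(fbought(history, k=3))  # ['milk', 'eggs', 'bread']
```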
We use grid search to tune the hyper-parameters of the compared methods. For userKNN, the number of nearest neighbors is searched in the range (100, 1300). For FPMC, the factor dimension is searched over the values [16, 32, 64, 128]. For RepeatNet, DREAM, SHAN, and Sets2Sets, the embedding size is searched over the values [16, 32, 64, 128]. For the PCIC model, the ARIMA model is auto-fitted with (p, d, q) orders up to (3, 3, 0).
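A sketch of how such a per-series ARIMA auto-fit can be done, assuming "(3, 3, 0)" denotes the maximum orders searched; the helper below is illustrative, not the exact production code:

```python
import itertools
import warnings
from statsmodels.tsa.arima.model import ARIMA

def auto_fit_arima(series, max_p=3, max_d=3, max_q=0):
    """Grid-search (p, d, q) up to the given maxima and keep the lowest-AIC fit."""
    best_aic, best_fit = float("inf"), None
    for p, d, q in itertools.product(range(max_p + 1), range(max_d + 1), range(max_q + 1)):
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                fit = ARIMA(series, order=(p, d, q)).fit()
            if fit.aic < best_aic:
                best_aic, best_fit = fit.aic, fit
        except Exception:
            continue  # some orders fail to converge on short series; skip them
    return best_fit
```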
4.2 Performance Comparison (Q1)
Table 2 gives the performance comparison of the PCIC model with the existing baselines. Several observations can be made from the table.
First, we observe that the PCIC model has the highest recall and NDCG values in most cases on the ValuedShopper, Instacart, and Dunnhumby datasets. Surprisingly, the RCP model performs well on the TaFeng dataset.
TIFUKNN also performs well. Since that model is built for the next basket recommendation task (its results appear under TIFUKNN (NBR) in Table 2), we modified the code to run on the BIA task only, i.e., generate user embedding vectors, compute scores from neighbor embeddings, and then filter out recommendations the user has not purchased before; these results appear under TIFUKNN (BIA). We see that this leads to a slight dip in its performance. Just as our model captures personalized category frequency, TIFUKNN tries to explicitly capture personalized item frequency, and it uses a nearest-neighbor approach to collaborative filtering to learn repurchasing patterns from other users. In the PCIC model, the survival analysis features capture user repurchasing patterns at the category level.
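Our BIA modification amounts to a post-filter on the TIFUKNN scores. A minimal sketch, with names and data layout of our own choosing:

```python
def bia_filter(scores, purchased_items, k=10):
    """Keep only items the user has bought before, then take the top-k by score.

    scores: dict mapping item -> score from neighbor-weighted user embeddings (TIFUKNN).
    purchased_items: set of items in the user's own purchase history.
    """
    repeat_scores = {i: s for i, s in scores.items() if i in purchased_items}
    return sorted(repeat_scores, key=repeat_scores.get, reverse=True)[:k]

# Example: 'soap' is dropped because the user never bought it before.
print(bia_filter({"milk": 0.9, "soap": 0.8, "eggs": 0.4}, {"milk", "eggs"}, k=2))
```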
Sets2Sets captures personalized item frequency explicitly but subsequently learns coefficients through an RNN. The RCP, ATD, PG, and MPG models do not use personalized item frequency, but they model repeat purchase patterns using a Poisson-Gamma or modified Poisson-Gamma distribution. Hence, these methods perform better than existing methods that do not capture item or category frequency, such as RepeatNet and userKNN.
FBought is a very simple baseline: it ranks the items most frequently bought by a user, in that order. It is trivial to implement, yet it surprisingly outperforms many of the baselines here.
We wanted to select the best baselines and compare performance on a much larger, real-world internal dataset. The challenges in scaling these models to score on large datasets are discussed next.
4.3 Scaling up (Q2)
We attempted to train the top-performing models above on a much larger (100M-user) dataset. TIFUKNN uses a user embedding the size of the entire product catalog, which made it impossible to scale to this dataset. Similarly, Sets2Sets uses GRU layers with attention, and training on this dataset would have taken weeks. As a result, we subsampled the larger dataset, creating a representative sample of 1M users, and compared TIFUKNN and Sets2Sets to PCIC on it. We observed a 30-35% reduction in NDCG and recall metrics for TIFUKNN and Sets2Sets relative to PCIC. As a result, we did not put further effort into scaling either algorithm.
PCIC was implemented on a distributed Hadoop cluster using Apache Spark; training and testing the model for 100M users takes around 6-8 hours. The main time-consuming step is determining the ARIMA hyper-parameters for each user-category pair and generating those features. FBought is straightforward to implement and runs in a few minutes. We also implemented the MPG model on the distributed cluster using the mathematics described in the paper. Table 3 shows the performance comparison of the FBought, MPG, and PCIC models. Although PCIC performs well in terms of NDCG, its recall is slightly lower than MPG's. Next, we calculated the MPG parameters at the category level instead of the original item level and fed them as features to PC. The integrated PCIC(+MPG) outperforms both PCIC and MPG.
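The expensive step parallelizes naturally in Spark, since each (user, category) purchase series can be fitted independently. A hedged sketch of the pattern we mean, where the input dataframe, its columns, and the fixed ARIMA order are assumptions (in practice, orders are auto-fitted as in Section 4.1.3):

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from statsmodels.tsa.arima.model import ARIMA

spark = SparkSession.builder.appName("pcic-arima-features").getOrCreate()

def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit an ARIMA on one (user, category) inter-purchase series; emit a one-row forecast."""
    series = pdf.sort_values("purchase_date")["days_between_purchases"].to_numpy()
    try:
        fit = ARIMA(series, order=(1, 1, 0)).fit()  # illustrative fixed order
        forecast = float(np.asarray(fit.forecast(1))[0])
    except Exception:
        forecast = float("nan")  # short or degenerate series
    return pd.DataFrame({"user_id": [pdf["user_id"].iloc[0]],
                         "category": [pdf["category"].iloc[0]],
                         "forecast_days_to_next": [forecast]})

# purchases_df (assumed columns: user_id, category, purchase_date, days_between_purchases)
features = (purchases_df
            .groupBy("user_id", "category")
            .applyInPandas(fit_group,
                           schema="user_id string, category string, forecast_days_to_next double"))
```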
4.4 Feature Importance (Q3)
To obtain feature importances, we replaced the original neural layer with a Gradient Boosting Tree classifier. The values are plotted in Figure 2. We observe that the ARIMA forecasts have a very high impact on the output of the model, particularly the forecast that predicts the next purchase from the user's individual rate of consumption of an item. The survival features have a smaller impact on prediction quality, meaning other users' purchases play a smaller role in a user's repurchase than their own characteristics. This may be one reason why approaches such as userKNN or TIFUKNN, which focus on collaborative user behavior, do not perform as well as PCIC. MPG does capture rate of consumption with a statistical model, and it comes close to PCIC. Features such as the number of days since the last purchase and explicit category frequency (num purchases) also have high importance. Looking at the top three features, we can predict whether a user will purchase an item today based on how many times they have purchased it before, how many days have passed since their last purchase, how much they purchased last time, and how long it will last.
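A minimal sketch of this swap, using scikit-learn's gradient boosting in place of the neural layer; the feature names, and the training matrices X_train and y_train, are illustrative assumptions:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative subset of the per-(user, category) features discussed above.
feature_names = ["arima_forecast", "days_since_last_purchase", "num_purchases",
                 "last_purchase_qty", "survival_score"]

# X_train: feature matrix with the columns above; y_train: repurchase labels (Section 4.5).
gbt = GradientBoostingClassifier(n_estimators=200, max_depth=3)
gbt.fit(X_train, y_train)

# Rank features by impurity-based importance, highest first.
for name, imp in sorted(zip(feature_names, gbt.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(f"{name}: {imp:.3f}")
```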
4.5 Impact of train and test data selection (Q4)
We held out the most recent week of customer purchases from this dataset for testing and used one year of purchases made prior to that week for training. A customer-product purchase was considered a repeat purchase in the test period only if the customer purchased the product in the training period (y years before the test period, y = 1.5) and also purchased the same product sometime in the test period. The (user, category) pairs purchased in this duration are labeled 1, and the categories purchased in training but not repurchased in the test period are labeled 0.
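In pandas terms, the labeling we describe looks roughly like this; the dataframe and column names are assumptions:

```python
import pandas as pd

def build_labels(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    """Label 1 if a (user, category) pair bought in training is bought again in the test week."""
    train_pairs = train_df[["user_id", "category"]].drop_duplicates().copy()
    test_pairs = set(map(tuple, test_df[["user_id", "category"]].drop_duplicates().values))
    train_pairs["label"] = [int((u, c) in test_pairs)
                            for u, c in zip(train_pairs["user_id"], train_pairs["category"])]
    return train_pairs
```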
As the pandemic drove increased adoption of the app and website, users started shopping online more frequently. Based on initial feedback, we observed that the BIA list was not updating, particularly for highly engaged users. We hypothesized two possible reasons: (1) a model trained on all users may not capture the signals and behavior of highly engaged users; (2) the labels are derived from the last one week of purchases, but highly engaged users shop much more often, so their labels are less accurate. We experimented with scoring the model daily on one day of user purchases. We also experimented with training the model only on the most engaged users, defined as users who have made purchases in more than 25 categories.
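The engaged-user cohort is a simple filter on category breadth; a sketch under the same assumed column names as above:

```python
# Users who purchased in more than 25 distinct categories during training.
cats_per_user = train_df.groupby("user_id")["category"].nunique()
engaged_users = set(cats_per_user[cats_per_user > 25].index)
engaged_train_df = train_df[train_df["user_id"].isin(engaged_users)]
```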
Table 4 shows the improvement in the NDCG metric for the PC model as we change the test time frame and train only on the most engaged users. Reducing the test time frame significantly improved the performance of the model. The most engaged users had lower NDCG than all users when the test window was 7 days. We also observed that training the model only on the most engaged users improves NDCG for all users too, while also saving training time: the time taken to generate the features and train the model on all users is 2.5x that for the highly engaged users.
Authors:
(1) Amit Pande, Data Sciences, Target Corporation, Brooklyn Park, Minnesota, USA ([email protected]);
(2) Kunal Ghosh, Data Sciences, Target Corporation, Brooklyn Park, Minnesota, USA ([email protected]);
(3) Rankyung Park, Data Sciences, Target Corporation, Brooklyn Park, Minnesota, USA ([email protected]).
[2] https://www.kaggle.com/c/acquire-valued-shoppers-challenge/overview
[3] https://www.kaggle.com/c/instacart-market-basket-analysis
[4] https://www.dunnhumby.com/careers/engineering/sourcefiles
[5] https://www.kaggle.com/chiranjivdas09/ta-feng-grocery-dataset