Table of Links
- Abstract and 1 Introduction
- Methodology
- Experiments
- Conclusion and References
3. EXPERIMENTS
In this section, we conduct extensive offline experiments[1] and online A/B testing to demonstrate DML's effectiveness.
3.1 Experimental Settings for Public Data
3.1.1 Datasets. We evaluate our methods on two public datasets.
• MovieLens-1M[8]: One of the publicly released MovieLens datasets, containing 1 million movie ratings from 6,040 users on 3,416 movies.
• Amazon[14]: A series of datasets consisting of product reviews from Amazon.com. We use the “Electronics” sub-category, which includes 1.7 million reviews from 192,403 users on 63,001 items.
For ML-1M, we introduce the binary classification task of positive rating prediction (rating ≥ 4) and the regression task of rating estimation. These two tasks are strictly correlated. For Electronics, following [20], we first augment the dataset by randomly sampling un-rated items for every user, ensuring that each user has as many un-rated items as rated ones. We then introduce two binary classification tasks: rating prediction (whether a rating exists) and positive rating prediction. Compared with the tasks of ML-1M, negative transfer is more likely to occur here because the task relationship is more complex (the Pearson correlation coefficient [21] of the two labels is around 0.7). Both datasets are randomly split into training, validation, and test sets with a ratio of 8:1:1.
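To make the augmentation step concrete, a minimal sketch follows. It reflects our reading of the procedure above rather than the paper’s actual preprocessing code, and the column names (`user_id`, `item_id`, `rating`) are assumptions:

```python
import numpy as np
import pandas as pd

def augment_with_unrated(reviews: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """For each user, sample as many un-rated items as rated ones.
    rating=0 marks 'no rating exists' for the rating-prediction task."""
    rng = np.random.default_rng(seed)
    all_items = reviews["item_id"].unique()
    augmented = [reviews]
    for user_id, rated in reviews.groupby("user_id"):
        unrated_pool = np.setdiff1d(all_items, rated["item_id"].values)
        n = min(len(rated), len(unrated_pool))
        sampled = rng.choice(unrated_pool, size=n, replace=False)
        augmented.append(pd.DataFrame(
            {"user_id": user_id, "item_id": sampled, "rating": 0}))
    return pd.concat(augmented, ignore_index=True)

# Labels for the two binary tasks on Electronics:
#   rating prediction:          y1 = 1 if a rating exists (rating > 0)
#   positive rating prediction: y2 = 1 if rating >= 4
```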
3.1.2 Evaluation Metrics. The merge function Φ in Equation (2) assumes that the model can estimate accurate interaction probabilities for binary classification tasks (e.g., clicking) and accurate absolute values for regression tasks (e.g., watch time). Therefore, instead of ranking metrics such as NDCG [19] and MRR [22], we use AUC [7] for classification tasks and Mean Squared Error (MSE) [23] for regression tasks. Note that much other recommendation literature, such as [5, 17, 26], uses similar metrics. For AUC, a bigger value indicates better performance, while for MSE, smaller is better.
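Both metrics are standard; for reference, a minimal sketch with scikit-learn (the toy arrays are purely illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, mean_squared_error

# Toy predictions for illustration only.
y_true_cls, y_prob = np.array([1, 0, 1, 0]), np.array([0.9, 0.2, 0.6, 0.4])
y_true_reg, y_pred = np.array([5.0, 3.0, 4.0]), np.array([4.5, 3.2, 4.1])

auc = roc_auc_score(y_true_cls, y_prob)       # bigger is better
mse = mean_squared_error(y_true_reg, y_pred)  # smaller is better
```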
3.1.3 Models. As soft parameter sharing methods need extra resources to store and train multiple sets of parameters, they are not widely applied in recommender systems. Thus, we select base models covering the other three categories: Shared-Bottom (SB) [3], MSSM [5], MMOE [12], and PLE [17]. MSSM is a recent method in the Customized Routing category and achieves better results than SNR [11] and Cross-Stitch [15]. Although MMOE and PLE both belong to the Dynamic Gating category, we test both owing to their popularity. For each base model, we verify whether DML can achieve additional gains. For reference, we also provide the performance of single-task models.
3.1.4 Implementation Details. For each feature, we use an embedding size of 8. As suggested by the original papers, we use one level of bottom sub-networks for MMOE, MSSM, and SB, and two levels for PLE. For SB, a one-layer sub-network with 128 output dimensions is shared by the tasks. For the other multi-task models, each bottom level includes three sub-networks with the same aforementioned architecture. For MSSM and PLE, task-specific and shared sub-networks are designated. For the multi-task models, each task tower is a three-layer MLP (128, 80, 1), and each task is assigned equal loss weight. For the single-task model, each task utilizes a four-layer MLP (128, 128, 80, 1). For the first two MLPs in Figure 2(b), we utilize a one-layer structure with an output dimension of 80. For the last MLP in Figure 2(b), a one-layer structure with an output size of 1 is used. Unless explicitly specified otherwise, ReLU [2] is used as the default activation function. All models are implemented with TensorFlow [1] and optimized using the Adam [10] optimizer with a learning rate of 0.001 and a mini-batch size of 512. We run each test 20 times and report the results.
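A minimal Keras sketch of the tower shapes described above; the dimensions come from the text, while the function names and everything else are illustrative assumptions:

```python
import tensorflow as tf

EMB_SIZE = 8  # per-feature embedding size

def make_tower(units=(128, 80, 1)):
    """Task tower of the multi-task models: a (128, 80, 1) MLP with ReLU."""
    layers = [tf.keras.layers.Dense(u, activation="relu") for u in units[:-1]]
    # Linear output; a sigmoid (classification) or identity (regression)
    # would be applied in the loss / prediction head.
    layers.append(tf.keras.layers.Dense(units[-1]))
    return tf.keras.Sequential(layers)

def make_single_task_mlp():
    """Single-task baseline: a (128, 128, 80, 1) MLP."""
    return make_tower(units=(128, 128, 80, 1))

def make_bottom_subnetwork():
    """One-layer bottom sub-network with 128 output dimensions."""
    return tf.keras.layers.Dense(128, activation="relu")

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)  # batch size 512
```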
3.2 Overall Performance for Public Data
Please refer to Table 1 for the overall results. First, DML achieves significant gains across all tested multi-task models on both public datasets, which shows DML’s wide compatibility. Second, DML-enhanced PLE and MMOE achieve the best performance on MovieLens and Electronics, respectively. Given their wide application in recommender systems, these results are as expected. Third, the multi-task models outperform the single-task models thanks to knowledge transfer between tasks.
Beyond AUC and MSE, DML should also foster task consistency through CTFM and GKD. As the tasks of MovieLens are strictly correlated, we verify whether DML truly enhances task consistency on this dataset. First, we construct pairs of samples with different rating scores and count the number of such pairs. Second, we count the pairs for which the prediction scores of both tasks are in the same order as the rating scores; improving this pair-order consistency between the two prediction scores and the rating score should contribute positively to performance. The ratio of the two counts gives the “Consistency Ratio” metric. The data listed in Table 1 agree with our expectation. (For Shared-Bottom, we also observe more pairs for which the predictions of both tasks are in the reverse order of the rating scores, which can explain its worse performance despite the better consistency ratio.)
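A sketch of how such a Consistency Ratio can be computed from the description above (the function and variable names are our own):

```python
import itertools
import numpy as np

def consistency_ratio(ratings, pred_task1, pred_task2):
    """Fraction of sample pairs with different ratings for which BOTH tasks'
    prediction scores are ordered the same way as the rating scores."""
    total, consistent = 0, 0
    for i, j in itertools.combinations(range(len(ratings)), 2):
        if ratings[i] == ratings[j]:
            continue  # only pairs with different rating scores are counted
        total += 1
        order = np.sign(ratings[i] - ratings[j])
        if (np.sign(pred_task1[i] - pred_task1[j]) == order and
                np.sign(pred_task2[i] - pred_task2[j]) == order):
            consistent += 1
    return consistent / max(total, 1)
```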
3.3 Further Analysis on Public Data
We select the two latest algorithms, PLE [17] and MSSM [5], to appraise the value of DML’s components, namely CTFM and GKD. Without the stop-gradient operation, CTFM would be very similar to the common attention mechanism. To prove the benefit of CTFM’s design, we also assess DMLv0, which retains the design of GKD while removing the gradient-blocking operation of CTFM. Please refer to Table 2 for the evaluation results. First, CTFM and GKD both contribute considerable gains over the base model. Second, as the integrated model, DML enhances the performance further. Third, DMLv0 is consistently worse than DML, which corroborates the value of preserving task awareness. Compared with CTFM and GKD, DMLv0 performs better on MovieLens but much worse on Electronics. The task relationship of Electronics is more complex, and negative transfer across tasks usually has a more severe impact due to task conflicts. In this case, compared with vanilla attention, CTFM obtains substantial gains.
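To make the DMLv0 ablation concrete: the distinction is whether gradients flow back through the other task’s features. A simplified sketch, assuming a basic gated fusion in place of the paper’s exact CTFM attention:

```python
import tensorflow as tf

def cross_task_fusion(own_feat, other_feat, block_gradient=True):
    """Fuse the other task's representation into this task's input.
    block_gradient=True  -> CTFM-style: the other task's features act as a
                            frozen reference, preserving task awareness.
    block_gradient=False -> DMLv0 / vanilla attention: this task's loss also
                            updates the other task's branch, which can
                            aggravate negative transfer under task conflicts."""
    if block_gradient:
        other_feat = tf.stop_gradient(other_feat)
    # Illustrative gate; the paper's CTFM attention may differ in detail.
    attn_logits = tf.keras.layers.Dense(1)(
        tf.concat([own_feat, other_feat], axis=-1))
    gate = tf.nn.sigmoid(attn_logits)
    return own_feat + gate * other_feat
```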
3.4 Online A/B Testing
DML is applied to the ranking stage [4] of an industrial large-scale news recommender system, with PLE [17] as the base model. The main prediction tasks are the binary classification task of Click-Through Rate (CTR) and the regression task of item watch time. First, after the model converges by training with billions of samples, the AUC metric for CTR consistently increases by 0.12% and the MSE metric for watch time decreases by 0.14%. Moreover, the most important online metrics include effective PV (the count of page views with watch time exceeding a threshold) and total watch time. We randomly distributed online users into two buckets, serving the base PLE model and the PLE+DML model respectively, and evaluated performance for two weeks. DML achieves significant (p < 0.05) gains over the base model: 1.22% for effective PV and 0.61% for total watch time. Based on these results, DML has been deployed in our online environment.
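The paper does not state which statistical test was used; as an illustration only, significance of bucket-level lifts like these is often checked with a two-sample test over per-day metric values (the numbers below are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-day effective-PV totals for each bucket over two weeks.
control = rng.normal(loc=100.0, scale=2.0, size=14)
treatment = rng.normal(loc=101.2, scale=2.0, size=14)  # ~1.2% lift

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)  # Welch's t-test
lift = treatment.mean() / control.mean() - 1.0
print(f"lift={lift:.2%}, p={p_value:.3f}")  # significant if p < 0.05
```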
Authors:
(1) Yi Ren, Tencent, Beijing, China ([email protected]);
(2) Ying Du, Tencent, Beijing, China ([email protected]);
(3) Bin Wang, Tencent, Beijing, China ([email protected]);
(4) Shenzheng Zhang, Tencent, Beijing, China ([email protected]).
[1] The code can be found at: https://github.com/renyi533/mtl-consistency/tree/main.