Authors:
(1) Yi Ren, Tencent, Beijing, China ([email protected]);
(2) Ying Du, Tencent, Beijing, China ([email protected]);
(3) Bin Wang, Tencent, Beijing, China ([email protected]);
(4) Shenzheng Zhang, Tencent, Beijing, China ([email protected]).
Table of Links
- Abstract and 1 Introduction
- Methodology
- Experiments
- Conclusion and References
ABSTRACT
Recommender systems usually leverage multi-task learning methods to simultaneously optimize several objectives because user behavior data is multi-faceted. The typical way of conducting multi-task learning (MTL) is to establish appropriate parameter sharing across multiple tasks at the lower layers while reserving a separate task tower for each task at the upper layers. With such a design, the lower layers aim to capture the structure of task relationships and mine valuable information for the task towers to produce accurate predictions.
Since the task towers exert a direct impact on the prediction results, we argue that the architecture of standalone task towers is suboptimal for promoting positive knowledge sharing. First, for each task, attending to the input information of the other task towers is beneficial; for instance, information useful for predicting the “like” task is also valuable for the “buy” task. Furthermore, because different tasks are inter-related, the training labels of multiple tasks should obey a joint distribution, and it is undesirable for the prediction results of these tasks to fall into the low-density areas. Accordingly, we propose the framework of Deep Mutual Learning across task towers (DML), which is compatible with various backbone multi-task networks. At the entry layer of the task towers, a shared Cross Task Feature Mining (CTFM) component is introduced to transfer input information across the task towers while still ensuring that one task’s loss will not impact the inputs of other task towers. Moreover, for each task, a dedicated network component called Global Knowledge Distillation (GKD) is utilized to distill valuable knowledge from the global results of the upper-layer task towers and enhance prediction consistency. Extensive offline experiments and online A/B tests are conducted to evaluate and verify the proposed approach’s effectiveness.
1. INTRODUCTION
Recently, we have seen the widespread application of recommender systems, which involve different types of user feedback signals such as clicking, rating, and commenting. No single feedback signal can accurately reflect user satisfaction; for example, over-concentrating on clicks may aggravate the click-bait issue. Therefore, it is highly desirable to effectively learn and estimate multiple types of user behaviors at the same time, and multi-task learning is a promising technique to address this challenge. Given several related learning tasks, the goal of multi-task learning is to enhance the overall performance of the different tasks by leveraging knowledge transfer among them. With the Multi-Task Learning (MTL) paradigm [3, 25], multiple tasks are learned simultaneously in a single model. Compared with single-task solutions, MTL requires far fewer machine resources and improves learning efficiency, since only a single model needs to be trained and deployed. Moreover, MTL can usually deliver better recommendation performance through appropriate parameter sharing.
Most existing multi-task learning (MTL) methods appropriately share parameters across multiple tasks at the lower layers while keeping separate task towers at the upper layers. These methods can be roughly classified into four categories. The first category comprises the methods of hard parameter sharing, among which embedding sharing is the most intuitive structure for sharing information. For instance, the ESMM model [13] shares embedding parameters between the CTR (Click-Through Rate) and CVR (Conversion Rate) tasks to improve the prediction performance of the sparse CVR task. Beyond the embedding parameters, the Shared-Bottom structure [3] shares the parameters of lower-layer MLPs among tasks. However, these methods are severely plagued by task conflicts and the negative transfer issue. Second, in the methods of soft parameter sharing, each task owns separate parameters, which are regularized during training to keep the corresponding parameters of different tasks close; L2-constrained [6] is a typical algorithm in this category. Third, the methods of customized routing learn customized routing weights for each task to combine and fuse information from lower-layer networks and thereby counteract negative transfer. The cross-stitch network [15] and the sluice network [16] learn separate linear weights for each task to selectively merge representations from different lower-level branches. SNR [11] modularizes the shared low-level layers into parallel sub-networks and uses a transformation matrix multiplied by a scalar coding variable to learn their connections to the upper-level layers, alleviating task conflicts and negative transfer. MSSM [5] learns differing combinations of feature fields for each expert and designs finer-grained sharing patterns among tasks through a set of coding variables that selectively choose which cells to route for a given task. However, the learned routing parameters of these methods are static across all samples, which can hardly guarantee optimal performance. Finally, the methods of dynamic gating learn optimized weights for each task based on the input sample to effectively combine the outputs of lower-level networks, and they have achieved success in industrial applications. The MMoE model [12] adapts the Mixture-of-Experts (MoE) [9] structure to multi-task learning by sharing the expert sub-networks across all tasks while maintaining a separate gating network optimized for each task, and Zhao et al. [26] extend MMoE to learn multiple ranking objectives in the YouTube video recommender system. Moreover, PLE [17] achieves superior performance for news recommendation by combining expert sub-networks shared among tasks with dedicated expert sub-networks for each task.
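As a concrete reference point for the dynamic-gating category, the following is a minimal PyTorch-style sketch of an MMoE-like layer: shared expert sub-networks whose outputs are combined by a per-task softmax gate, in the spirit of [12]. The module name, layer sizes, and single-layer experts are illustrative assumptions, not the exact configurations used in the cited works.

```python
import torch
import torch.nn as nn


class MMoE(nn.Module):
    """Minimal Mixture-of-Experts layer with one softmax gate per task (sketch)."""

    def __init__(self, input_dim, expert_dim, num_experts, num_tasks):
        super().__init__()
        # Shared expert sub-networks used by all tasks.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU())
             for _ in range(num_experts)]
        )
        # One gating network per task, producing weights over the experts.
        self.gates = nn.ModuleList(
            [nn.Linear(input_dim, num_experts) for _ in range(num_tasks)]
        )

    def forward(self, x):
        # expert_out: (batch, num_experts, expert_dim)
        expert_out = torch.stack([expert(x) for expert in self.experts], dim=1)
        task_inputs = []
        for gate in self.gates:
            # weights: (batch, num_experts, 1), normalized per sample.
            weights = torch.softmax(gate(x), dim=-1).unsqueeze(-1)
            # Weighted sum of expert outputs feeds this task's tower.
            task_inputs.append((weights * expert_out).sum(dim=1))
        return task_inputs  # one representation per task tower
```

Because the gate weights depend on the input sample, each task can emphasize different experts for different samples, which is what distinguishes this category from static routing.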
AITM [24] is the method most similar to ours, as it also augments the architecture of the task towers. Nevertheless, as a concrete implementation, it has not been validated to enhance the performance of various multi-task models. Moreover, it only works for tasks with sequential dependence relations.
Admittedly, the aforementioned methods achieve impressive performance. However, as the task towers exert a direct effect on the prediction results, standalone task towers tend not to be the most effective design for promoting positive knowledge transfer by exploiting the task relationships. First, for each task, the information selected by the relevant tasks is extremely valuable. Accordingly, we introduce the shared component of Cross Task Feature Mining (CTFM), which utilizes delicate attention mechanisms to extract relevant information from other tasks at the entry layer of each task tower. With common attention mechanisms, the explicit task-specific information distilled by the lower-level networks is mingled together, and one task’s loss undesirably affects the inputs of other task towers; we refer to this as the task-awareness-missing problem, and it can hinder the learning of the lower-level networks. In contrast to the usual attention mechanisms, our design ensures appropriate information separation. We argue that preserving explicit task-specific knowledge has a positive effect on performance, which is validated in the experimental section. Second, because the tasks in recommender systems are related, the training labels of multiple tasks should obey a joint distribution, and the prediction results for these tasks should not fall densely into the low-density areas. Therefore, a dedicated network named Global Knowledge Distillation (GKD) is introduced for each task to distill valuable global knowledge from the results of the upper-layer task towers. For each task, the distilled global information helps to ensure predictions consistent with the other tasks. We summarize our main contributions below, followed by an illustrative sketch of the two components.
• We propose the framework of Deep Mutual Learning across task towers (DML), which is compatible with various backbone multi-task models.
• The proposed novel sharing structure promotes effective knowledge transfer across different tasks.
• We conduct offline experiments and online A/B testing to evaluate and understand the effectiveness of our method.
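To make the two proposed components more concrete, here is a minimal PyTorch-style sketch of one possible reading of CTFM and GKD as described above: each tower attends over the detached inputs of the other towers, so another task’s loss cannot flow back into this tower’s input, and a dedicated per-task network reads the outputs of all towers to refine each prediction for consistency. All module names, dimensions, and the use of detach in GKD are illustrative assumptions rather than the paper’s exact formulation, which is specified in the Methodology section.

```python
import torch
import torch.nn as nn


class CrossTaskFeatureMining(nn.Module):
    """Sketch of CTFM: each task tower attends over the detached inputs of all
    towers, borrowing cross-task information while other tasks' losses cannot
    reach its own input (hypothetical reading of the design)."""

    def __init__(self, dim, num_tasks):
        super().__init__()
        self.query = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_tasks)])
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, tower_inputs):
        # tower_inputs: list of (batch, dim) tensors, one per task tower.
        stacked = torch.stack(tower_inputs, dim=1).detach()   # (batch, T, dim), stop-gradient
        keys, values = self.key(stacked), self.value(stacked)
        enriched = []
        for t, x in enumerate(tower_inputs):
            q = self.query[t](x).unsqueeze(1)                 # (batch, 1, dim)
            attn = torch.softmax(
                q @ keys.transpose(1, 2) / keys.size(-1) ** 0.5, dim=-1
            )                                                 # (batch, 1, T)
            cross = (attn @ values).squeeze(1)                # (batch, dim)
            # The tower's own (non-detached) input stays in the forward path,
            # so its gradient still shapes the lower-level networks.
            enriched.append(x + cross)
        return enriched


class GlobalKnowledgeDistillation(nn.Module):
    """Sketch of GKD: a dedicated per-task network reads the outputs of all
    task towers and refines each task's prediction toward global consistency."""

    def __init__(self, num_tasks):
        super().__init__()
        self.refine = nn.ModuleList([nn.Linear(num_tasks, 1) for _ in range(num_tasks)])

    def forward(self, tower_logits):
        # tower_logits: list of (batch, 1) raw tower outputs, one per task.
        global_view = torch.cat(tower_logits, dim=1).detach()  # (batch, T); detach is an assumption
        return [
            logit + self.refine[t](global_view)                # refined logit per task
            for t, logit in enumerate(tower_logits)
        ]
```

In this sketch, the residual `x + cross` is what preserves explicit task-specific knowledge while still borrowing information from the other tasks, and the per-task refinement over the stacked tower outputs is one simple way of encouraging the joint predictions to stay out of low-density regions of the label distribution.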