Authors:
(1) Kinjal Basu, IBM Research;
(2) Keerthiram Murugesan, IBM Research;
(3) Subhajit Chaudhury, IBM Research;
(4) Murray Campbell, IBM Research;
(5) Kartik Talamadupula, Symbl.ai;
(6) Tim Klinger, IBM Research.
Table of Links
Abstract and 1 Introduction
2 Background
3 Symbolic Policy Learner
3.1 Learning Symbolic Policy using ILP
3.2 Exception Learning
4 Rule Generalization
4.1 Dynamic Rule Generalization
5 Experiments and Results
5.1 Dataset
5.2 Experiments
5.3 Results
6 Related Work
7 Future Work and Conclusion, Limitations, Ethics Statement, and References
5.3 Results
In our experiments, we evaluated the models on the number of steps taken by the agent, #steps (lower is better), and the normalized score, n. score (higher is better). For TWC, Table 1 compares all four settings against the baseline model (the Text-Only agent). We compared our agents on two test sets: (i) IN distribution, which contains the same entities (i.e., objects and locations) as the training set, and (ii) OUT distribution, which contains new entities that do not appear in the training set. Table 2 reports the results on TW-Cooking. TW-Cooking serves only to verify our approach, so we used EXPLORER-w/o-GEN as its neuro-symbolic setting.
For each result in Tables 1 and 2, #steps and n. score should be read together when judging which agent performs better. If we focus only on #steps, the agents can be given a very low step budget per game; they will then look good on #steps (lower is better), but the n. score (higher is better) will be very low because most games are left unfinished. Conversely, if we focus only on n. score, the agents can be given a very high step budget and will achieve a very high n. score. Since either reading is misleading in isolation, we judge an agent's performance on #steps and n. score jointly.
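A minimal sketch of this joint evaluation is shown below; the episode field names ('score', 'max_score', 'steps') are illustrative assumptions, not the paper's evaluation code.

```python
# Minimal sketch of reporting both metrics together; field names are assumptions.
from statistics import mean

def evaluate(episodes):
    """Aggregate normalized score (higher is better) and #steps (lower is better)."""
    n_score = mean(ep["score"] / ep["max_score"] for ep in episodes)
    steps = mean(ep["steps"] for ep in episodes)
    return {"n_score": n_score, "steps": steps}

# An agent should be preferred only if it is competitive on both metrics:
# a low step count paired with a low normalized score usually means unfinished games.
```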
Qualitative Studies: On our verification dataset, the TW-Cooking domain games, EXPLORER performs well and beats the Text-Only and GATA agents in terms of #steps and normalized score at all levels. In level 1, which only requires collecting the ingredients, EXPLORER does slightly better than the Text-Only agent; neural models are already strong on easier games (with shorter action sequences), so there is less room for improvement. However, as the difficulty increases in levels 2 and 3, which require collecting the ingredients, processing and cooking them to prepare the meal, and finally eating the meal to complete the game, the combination of neural-based exploration and symbolic-based exploitation becomes important. In level 4, which includes navigation, the neural module helps EXPLORER navigate to the kitchen from another part of the house (exploration), and the symbolic module then applies its rules to choose the best action (exploitation).
In TWC, EXPLORER with IG-based generalization (hypernym level 3) performs best on the easy and medium games, whereas exhaustive generalization works better on the hard games. This suggests that exhaustive generalization has a slight edge in environments with more entities and rooms, which demand more exploration, while IG-based generalization is more efficient when the agent's main task is to place different objects in their appropriate locations. On the easy and medium games, EXPLORER-w/o-GEN performs worse than the baseline model, which indicates that learning rules without generalization in simple environments leads to poor action selection, especially for unseen entities. The out-of-distribution results for the medium games are also weaker than expected. Further analysis shows that this happens when the OOD games contain different but similar locations (e.g., clothes-line vs. clothes-drier) along with different objects in the environment. Generalizing over locations gives very noisy results (more false-positive cases) because locations already sit at a higher level of the WordNet ontology. One way to mitigate this is to use a stronger neural model for exploration, which leads to better learned rules (see the Comparative Studies paragraph later in Section 5.3); another is to incorporate commonsense into the agent in a different way, which we discuss further in the future work section.
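To illustrate why generalizing over locations is noisier than generalizing over objects, the following sketch (assuming NLTK's WordNet interface; this is not EXPLORER's implementation) walks a fixed number of levels up the hypernym chain. Object nouns such as 'shirt' sit deep in the ontology, so a level-3 hypernym is still informative, whereas location nouns are already close to the root.

```python
# Minimal sketch, assuming NLTK's WordNet corpus is installed
# (nltk.download('wordnet')); not EXPLORER's actual generalization code.
from nltk.corpus import wordnet as wn

def hypernym_chain(word, depth=3):
    """Return up to `depth` hypernyms of the first noun sense of `word`."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return []
    chain, current = [], synsets[0]
    for _ in range(depth):
        parents = current.hypernyms()
        if not parents:
            break
        current = parents[0]
        chain.append(current.name())
    return chain

print(hypernym_chain("shirt"))  # e.g. ['garment.n.01', 'clothing.n.01', 'consumer_goods.n.01']
```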
While learning rules, we found that the symbolic agent had difficulty choosing among multiple symbolic actions recommended by the ASP solver. It therefore becomes important to attach a confidence score to each rule, which we compute from its accuracy: the number of times the rule yields a positive reward divided by the total number of times the rule has been used. For generalized rules, we add a second component to the confidence that measures how close the two words are in WordNet; this component is inversely proportional to the distance between the two nodes (the entity and its hypernym). Thus the maximum score is 1 for non-generalized rules and 2 for generalized rules (due to the additional component). Figure 6 shows a snippet of a learned set of generalized rules with different confidence scores.
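A minimal sketch of this confidence computation follows; the exact scaling of the WordNet-closeness component is our assumption, since the paper only states that it is inversely proportional to the entity-hypernym distance.

```python
# Minimal sketch of the rule-confidence idea described above; not EXPLORER's code.
def rule_confidence(positive_uses, total_uses, wordnet_distance=None):
    """Accuracy component, plus a closeness component for generalized rules."""
    accuracy = positive_uses / total_uses if total_uses else 0.0   # in [0, 1]
    if wordnet_distance is None:                                    # non-generalized rule
        return accuracy                                             # max score 1
    # Closeness is assumed to be 1/distance, so a direct hypernym (distance 1)
    # contributes 1 and the overall maximum score is 2.
    return accuracy + 1.0 / max(1, wordnet_distance)
```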
Example: Figure 7 illustrates how EXPLORER plays the TWC games. The right-hand side of the figure shows a snippet of the symbolic rules the agent learns using ILP and ILP + WordNet-based rule generalization for TWC. To generate an action with the symbolic module, the agent first extracts information about the entities from the observation and the inventory. This information is then represented as ASP facts, together with the hypernyms of the objects. Next, the agent runs the ASP solver, the s(CASP) engine, to obtain a set of candidate actions and selects one based on the confidence scores. The top-left section of Figure 7 shows how a symbolic action is selected by matching the object ('clean checked shirt') against the rule set (highlighted on the right): the solver places the 'shirt' in the 'wardrobe' because 'clothing' and 'wearable' are hypernyms of 'shirt'. In EXPLORER, when the symbolic agent fails to produce a good action, the neural agent serves as a fallback. In other words, EXPLORER gives priority to symbolic actions based on their confidence scores and falls back to neural actions when either of the following situations occurs: (i) there is no policy for the given state, because of the early exploration stage or non-rewarded actions, or (ii) a non-admissible symbolic action is generated due to rule generalization. The bottom-left of Figure 7 shows that the location of 'sugar' is not covered by any rule, so the neural agent selects the action. EXPLORER learns rules online after each episode, so it will add a rule for sugar after the current episode, and in the next episode that action will become a symbolic one. In this way, the two modules work in tandem: the neural module facilitates better exploration and the symbolic module enables better exploitation in EXPLORER.
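The decision loop just described can be summarized as follows. This is a hedged sketch: the helpers passed in (extract_entities, hypernyms, run_scasp) are hypothetical placeholders for EXPLORER's actual components and its s(CASP) interface.

```python
# Hedged sketch of EXPLORER's symbolic-first action selection with neural fallback.
def select_action(observation, inventory, admissible_actions,
                  rules, neural_agent, extract_entities, hypernyms, run_scasp):
    entities = extract_entities(observation, inventory)           # objects and locations
    facts = [f"hypernym({e},{h})." for e in entities               # ASP facts, including
             for h in hypernyms(e)]                                 # the objects' hypernyms
    candidates = run_scasp(rules, facts)                            # [(action, confidence), ...]
    symbolic = [(a, c) for a, c in candidates if a in admissible_actions]
    if symbolic:                                                    # exploit: best symbolic action
        return max(symbolic, key=lambda ac: ac[1])[0]
    return neural_agent.act(observation)                            # fallback: neural exploration
```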
Comparative Studies: One key advantage of EXPLORER is that it is scalable: any neural model can serve as its base. To demonstrate this, we take BiKE (Murugesan et al., 2021b), a neural model designed for textual reinforcement learning, and train EXPLORER (with rule generalization at hypernym level 3) using BiKE as its neural module (instead of LSTM-A2C), yielding the EXPLORER w. BiKE agent. For comparison, we also take the neuro-symbolic SOTA baseline on TWC, the Case-Based Reasoning (CBR) model (Atzeni et al., 2022), and train it with BiKE as its neural module, yielding the CBR w. BiKE agent. We test these three models on TWC games at the easy, medium, and hard levels. The performance comparison is shown as bar plots in Figure 8, which clearly show that EXPLORER w. BiKE performs much better at all levels in terms of #steps (lower is better) and normalized score (higher is better).
EXPLORER w. BiKE also outperforms the others by a large margin in the out-of-distribution cases, which highlights the importance of policy generalization: it lets the EXPLORER w. BiKE agent use commonsense knowledge to reason over unknown entities. In the easy games the performance differences are small, as the environment contains only one to three objects in a single room, which is easy for the neural agent alone. As the difficulty increases, however, the benefit of the EXPLORER agent becomes clear.
6 Related Work
Text-based Reinforcement Learning: TBGs have recently emerged as promising environments for studying grounded language understanding and have drawn significant research interest. Zahavy et al. (2018) introduced the Action-Elimination Deep Q-Network (AE-DQN), which learns to predict invalid actions in the text-adventure game Zork. Côté et al. (2018) designed TextWorld, a sandbox learning environment for training and evaluating RL agents on text-based games. Building on this, Murugesan et al. (2021a) introduced TWC, a set of games requiring agents with commonsense knowledge. The LeDeepChef system (Adolphs and Hofmann, 2019) achieved good results on the First TextWorld Problems competition (Trischler et al., 2019) by supervising the model with entities from FreeBase, allowing the agent to generalize to unseen objects. A recent line of work learns symbolic (typically graph-structured) representations of the agent's belief; notably, Ammanabrolu and Riedl (2019) proposed KG-DQN and Adhikari et al. (2020b) proposed GATA. The instruction-following work for TBGs (Tuli et al., 2022), also focused on the TW-Cooking domain, assumes a great deal about the game environment and provides many manual instructions to the agent, whereas EXPLORER learns its rules automatically in an online manner.
Symbolic Rule Learning Approaches: Learning symbolic rules using inductive logic programming has a long history. Following the success of ASP, many systems capable of learning non-monotonic logic programs have emerged, such as FOLD (Shakerin et al., 2017), ILASP (Law et al., 2014), XHAIL (Ray, 2009), and ASPAL (Corapi et al., 2011). However, few of these efforts lift the learned rules to a generalized version and then learn exceptions, and they do not perform well on noisy data. To address this, ILP has been combined with differentiable programming (Evans and Grefenstette, 2018; Rocktäschel and Riedel, 2017), but such approaches require large amounts of training data. In our work, we use a simple information-gain-based inductive learning approach, as EXPLORER learns its rules after each episode from a very small number of examples (sometimes with zero negative examples).
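For concreteness, a FOIL-style information-gain criterion over positive and negative examples looks roughly as follows; the exact formulation used by EXPLORER may differ.

```python
import math

# FOIL-style information-gain sketch for scoring a candidate literal;
# assumes pos > 0 for the rule before the literal is added.
def info(pos, neg):
    """Bits needed to signal that a covered example is positive."""
    return -math.log2(pos / (pos + neg))

def gain(pos, neg, pos_after, neg_after):
    """Gain of a literal that narrows coverage from (pos, neg) to (pos_after, neg_after)."""
    if pos_after == 0:
        return 0.0
    return pos_after * (info(pos, neg) - info(pos_after, neg_after))
```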
7 Future Work and Conclusion
In this paper, we propose EXPLORER, a neuro-symbolic agent that demonstrates how symbolic and neural modules can collaborate in a text-based RL environment, and we present a novel information-gain-based rule generalization algorithm. Our approach not only achieves promising results on the TW-Cooking and TWC games but also produces interpretable and transferable policies. Our current research shows that excessive reliance on the symbolic module and heavy generalization are not always beneficial, so our next objective is to develop an optimal strategy for switching between the neural and symbolic modules to further enhance performance.
Limitations
One limitation of the EXPLORER model is its computation time, which is longer than that of a purely neural agent: EXPLORER uses an ASP solver and symbolic rules, which involve multiple file-processing tasks. However, the neuro-symbolic agent converges faster during training, reducing the total number of steps needed and thereby narrowing the computation-time gap between the neural and neuro-symbolic agents.
Ethics Statement
In this paper, we propose a neuro-symbolic approach for text-based games that generates interpretable symbolic policies, allowing for transparent analysis of the model’s outputs. Unlike deep neural models, which can exhibit language biases and generate harmful content such as hate speech or racial biases, neuro-symbolic approaches like ours are more effective at identifying and mitigating unethical outputs. The outputs of our model are limited to a list of permissible actions based on a peer-reviewed and publicly available dataset, and we use WordNet, a widely recognized and officially maintained knowledge base for NLP, as our external knowledge source. As a result, the ethical risks associated with our approach are low.
References
Ashutosh Adhikari, Xingdi Yuan, Marc-Alexandre Côté, Mikuláš Zelinka, Marc-Antoine Rondeau, Romain Laroche, Pascal Poupart, Jian Tang, Adam Trischler, and Will Hamilton. 2020a. Learning dynamic belief graphs to generalize on text-based games. Advances in Neural Information Processing Systems, 33:3045–3057.
Ashutosh Adhikari, Xingdi Yuan, Marc-Alexandre Côté, Mikulás Zelinka, Marc-Antoine Rondeau, Romain Laroche, Pascal Poupart, Jian Tang, Adam Trischler, and William L Hamilton. 2020b. Learning dynamic knowledge graphs to generalize on text-based games.
Leonard Adolphs and Thomas Hofmann. 2019. LeDeepChef: Deep reinforcement learning agent for families of text-based games. ArXiv, abs/1909.01646.
Prithviraj Ammanabrolu and Mark Riedl. 2019. Playing text-adventure games with graph-based deep reinforcement learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3557–3565.
Joaquín Arias, Manuel Carro, Zhuo Chen, and Gopal Gupta. 2020. Justifications for goal-directed constraint answer set programming. arXiv preprint arXiv:2009.10238.
Joaquin Arias, Manuel Carro, Elmer Salazar, Kyle Marple, and Gopal Gupta. 2018. Constraint answer set programming without grounding. TPLP, 18(3-4):337–354.
Mattia Atzeni, Shehzaad Zuzar Dhuliawala, Keerthiram Murugesan, and Mrinmaya Sachan. 2022. Case-based reasoning for better generalization in textual reinforcement learning. In International Conference on Learning Representations.
Kinjal Basu, Keerthiram Murugesan, Mattia Atzeni, Pavan Kapanipathi, Kartik Talamadupula, Tim Klinger, Murray Campbell, Mrinmaya Sachan, and Gopal Gupta. 2022a. A hybrid neuro-symbolic approach for text-based games using inductive logic programming. In Combining Learning and Reasoning: Programming Languages, Formalisms, and Representations.
Kinjal Basu, Elmer Salazar, Huaduo Wang, Joaquín Arias, Parth Padalkar, and Gopal Gupta. 2022b. Symbolic reinforcement learning framework with incremental learning of rule-based policy. Proceedings of ICLP GDE, 22.
Kinjal Basu, Farhad Shakerin, and Gopal Gupta. 2020. AQuA: ASP-based visual question answering. In International Symposium on Practical Aspects of Declarative Languages, pages 57–72. Springer.
Kinjal Basu, Sarat Chandra Varanasi, Farhad Shakerin, Joaquin Arias, and Gopal Gupta. 2021. Knowledge-driven natural language understanding of English text and its applications. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12554–12563.
Subhajit Chaudhury, Prithviraj Sen, Masaki Ono, Daiki Kimura, Michiaki Tatsubori, and Asim Munawar. 2021. Neuro-symbolic approaches for text-based policy learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3073–3078.
Subhajit Chaudhury, Sarathkrishna Swaminathan, Daiki Kimura, Prithviraj Sen, Keerthiram Murugesan, Rosario Uceda-Sosa, Michiaki Tatsubori, Achille Fokoue, Pavan Kapanipathi, Asim Munawar, et al. 2023. Learning symbolic rules over abstract meaning representations for textual reinforcement learning. arXiv preprint arXiv:2307.02689.
Domenico Corapi, Alessandra Russo, and Emil Lupu. 2011. Inductive logic programming in answer set programming. In International conference on inductive logic programming, pages 91–97. Springer.
Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. 2018. Textworld: A learning environment for text-based games. CoRR, abs/1806.11532.
Richard Evans and Edward Grefenstette. 2018. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64.
Michael Gelfond and Yulia Kahl. 2014. Knowledge representation, reasoning, and the design of intelligent agents: The answer-set programming approach. Cambridge University Press.
Michael Gelfond and Vladimir Lifschitz. 1988. The stable model semantics for logic programming. In ICLP/SLP, volume 88, pages 1070–1080.
Gopal Gupta, Yankai Zeng, Abhiraman Rajasekaran, Parth Padalkar, Keegan Kimbrell, Kinjal Basu, Farhad Shakerin, Elmer Salazar, and Joaquín Arias. 2023. Building intelligent systems by combining machine learning and automated commonsense reasoning. In Proceedings of the AAAI Symposium Series, volume 2, pages 272–276.
Matthew Hausknecht, Ricky Loynd, Greg Yang, Adith Swaminathan, and Jason D Williams. 2019. Nail: A general interactive fiction agent. arXiv preprint arXiv:1902.04259.
Daiki Kimura, Masaki Ono, Subhajit Chaudhury, Ryosuke Kohita, Akifumi Wachi, Don Joven Agravante, Michiaki Tatsubori, Asim Munawar, and Alexander Gray. 2021. Neuro-symbolic reinforcement learning with first-order logic. arXiv preprint arXiv:2110.10963.
Suraj Kothawade, Vinaya Khandelwal, Kinjal Basu, Huaduo Wang, and Gopal Gupta. 2021. Autodiscern: autonomous driving using common sense reasoning. arXiv preprint arXiv:2110.13606.
Mark Law, Alessandra Russo, and Krysia Broda. 2014. Inductive learning of answer set programs. In European Workshop on Logics in Artificial Intelligence, pages 311–325. Springer.
Vladimir Lifschitz. 2019. Answer set programming. Springer Heidelberg.
Daoming Lyu, Fangkai Yang, Bo Liu, and Steven Gustafson. 2019. SDRL: Interpretable and data-efficient deep reinforcement learning leveraging symbolic planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2970–2977.
George A Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.
Tom Mitchell. 1997. Machine learning. McGraw Hill series in computer science. McGraw-Hill.
Arindam Mitra and Chitta Baral. 2015. Learning to automatically solve logic grid puzzles. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1023–1033.
Stephen Muggleton and Luc De Raedt. 1994. Inductive logic programming: Theory and methods. The Journal of Logic Programming, 19:629–679.
Keerthiram Murugesan, Mattia Atzeni, Pavan Kapanipathi, Pushkar Shukla, Sadhana Kumaravel, Gerald Tesauro, Kartik Talamadupula, Mrinmaya Sachan, and Murray Campbell. 2021a. Text-based rl agents with commonsense knowledge: New challenges, environments and baselines. In Thirty Fifth AAAI Conference on Artificial Intelligence.
Keerthiram Murugesan, Mattia Atzeni, Pavan Kapanipathi, Kartik Talamadupula, Mrinmaya Sachan, and Murray Campbell. 2021b. Efficient text-based reinforcement learning by jointly leveraging state and commonsense graph representations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, volume 2, pages 719–725. Association for Computational Linguistics.
Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941.
Dhruva Pendharkar, Kinjal Basu, Farhad Shakerin, and Gopal Gupta. 2022. An asp-based approach to answering natural language questions for texts. Theory and Practice of Logic Programming, 22(3):419–443.
Oliver Ray. 2009. Nonmonotonic abductive inductive learning. Journal of Applied Logic, 7(3):329–340.
Raymond Reiter. 1988. Nonmonotonic reasoning. In Exploring artificial intelligence, pages 439–481. Elsevier.
Tim Rocktäschel and Sebastian Riedel. 2017. End-to-end differentiable proving. arXiv preprint arXiv:1705.11040.
Farhad Shakerin, Elmer Salazar, and Gopal Gupta. 2017. A new algorithm to automate inductive learning of default theories. Theory and Practice of Logic Programming, 17(5-6):1010–1026.
Mohan Sridharan, Ben Meadows, and Rocio Gomez. 2017. What can i not do? towards an architecture for reasoning about and learning affordances. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 27, pages 461–469.
Adam Trischler, Marc-Alexandre Côté, and Pedro Lima. 2019. First TextWorld Problems, the competition: Using text-based games to advance capabilities of AI agents.
Mathieu Tuli, Andrew C Li, Pashootan Vaezipoor, Toryn Q Klassen, Scott Sanner, and Sheila A McIlraith. 2022. Learning to follow instructions in text-based games. arXiv preprint arXiv:2211.04591.
Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. 2022. Behavior cloned transformers are neurosymbolic reasoners. arXiv preprint arXiv:2210.07382.
Fangkai Yang, Daoming Lyu, Bo Liu, and Steven Gustafson. 2018. Peorl: Integrating symbolic planning and hierarchical reinforcement learning for robust decision-making. arXiv preprint arXiv:1804.07779.
Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J Mankowitz, and Shie Mannor. 2018. Learn what not to learn: Action elimination with deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3562–3573.
Yankai Zeng, Abhiramon Rajasekharan, Parth Padalkar, Kinjal Basu, Joaquín Arias, and Gopal Gupta. 2024. Automated interactive domain-specific conversational agents that understand human dialogs. In International Symposium on Practical Aspects of Declarative Languages, pages 204–222. Springer.