Magma is based on LLM technology, the kind of model that consumes tokens fed into a neural network, but it differs from conventional language and vision models by going beyond what is known as verbal intelligence to also include spatial intelligence, which allows it both to plan and to execute actions. This, together with training that mixes images, videos, robotics data and user interface interactions, makes it a true multimodal agent.

The functions that allow Microsoft Magma to control robots

This model also features two distinctive technical components. The first is Set-of-Mark, which identifies objects that can be manipulated by assigning numerical labels to interactive elements, such as clickable buttons in a user interface or objects that can be picked up and grasped in a robotic workspace. The second is Trace-of-Mark, which learns movement patterns from video data.
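The Set-of-Mark idea can be pictured as overlaying numbered labels on detected interactive elements so the model can refer to them by number when choosing an action. The following is a minimal, purely illustrative sketch, not Microsoft's actual code; the class and function names are assumptions introduced for the example:

```python
# Illustrative sketch of Set-of-Mark-style labeling (names are hypothetical,
# not Microsoft's API): detected interactive elements each get a numeric
# mark that a model can reference in its action output.

from dataclasses import dataclass

@dataclass
class Element:
    name: str           # e.g. "Submit button" or "red block"
    bbox: tuple         # (x, y, width, height) in image pixels

def set_of_mark(elements):
    """Map numeric labels to elements, as if overlaid on the input image."""
    return {i + 1: el for i, el in enumerate(elements)}

# Usage: the model could then emit an action like "click mark 2".
ui = [Element("Search box", (10, 5, 200, 30)),
      Element("Submit button", (220, 5, 80, 30))]
marks = set_of_mark(ui)
print(marks[2].name)  # Submit button
```

Trace-of-Mark extends the same idea over time: instead of a single labeled frame, the marks are tracked across video frames so the model can learn the trajectory of a movement.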

According to Microsoft, these two functions allow the model to perform tasks such as navigating user interfaces or directing a robot to grab objects. As for its variants, the company says that Magma-8B has achieved quite good results on several benchmarks covering these kinds of tasks, even surpassing vision-language-action models such as OpenVLA in various robotic manipulation tasks.

Of course, as with all AI models, Magma is not perfect: it still has technical limitations in complex decision making that must be carried out step by step and that unfolds over multiple steps in time. Microsoft says it continues to work on improving these capabilities and is researching how to achieve it.

Meanwhile, the Microsoft Magma inference and training code is now available on GitHub, which will allow external researchers to build on it. If Magma lives up to the promises Microsoft has made for it, it could take the company's assistants beyond limited text interactions, allowing them to operate software autonomously and to execute real-world tasks through robotics.