Microsoft Research has introduced a new foundation AI model that combines language and image processing to control software interfaces and robotic systems. In other words, Microsoft Magma, as the model is called, is capable of controlling robots.
At least, that is what the company's research labs claim after testing the model internally, emphasizing that it could represent an advance in the field of multimodal AI systems that can operate interactively in both physical and digital spaces.
Redmond points out that Microsoft Magma is the first AI model that can not only process multimodal data (text, images and video), but also act natively on it, whether by manipulating physical objects or navigating a user interface.
The Magma development project is not exclusive to Microsoft; it is a collaboration between a group of its researchers and the University of Maryland, the University of Wisconsin-Madison, the University of Washington, and KAIST (Korea Advanced Institute of Science and Technology).
Unlike multimodal AI systems for robots such as PaLM-E, RT-2 or ChatGPT for Robotics, which use large language models as an interface and require separate models for perception and control, Magma integrates all of these capabilities into a single foundation model.
As for its purpose, Microsoft describes it as a step towards agentic AI, that is, towards a system that can autonomously devise plans and carry out multi-step tasks on behalf of a human, rather than simply answering questions about what the robot sees or perceives.
Thus, according to the researchers who developed and tested it, when Microsoft Magma is given a goal to achieve, it is able to formulate plans and execute actions to reach it. «By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal, spatial and temporal intelligence, and uses it to address complex tasks and settings.»
Magma is built on LLM technology of the kind that feeds on tokens passed to a neural network, and it differs from conventional vision-language models by going beyond what is known as verbal intelligence to also include spatial intelligence, which enables it to both plan and execute actions. This, together with training on a mix of images, videos, robotics data and user interface interactions, makes it a genuine multimodal agent.
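To give a rough idea of what "acting through tokens" can look like, here is a minimal, purely illustrative sketch in Python. It follows the scheme popularized by vision-language-action models such as RT-2, in which part of the vocabulary is reserved for discretized action values; Magma's actual tokenization may differ, and the vocabulary sizes below are assumptions, not figures from Microsoft.

```python
# Hedged illustration, not Magma's real tokenizer: one way a single token-based
# model can emit both language and actions is to reserve a slice of the
# vocabulary for discretized action values, as RT-2-style VLA models do.

TEXT_VOCAB_SIZE = 32000   # ordinary word-piece tokens (assumed size)
ACTION_BINS = 256         # discretized action values appended after them

def decode_token(token_id: int):
    """Route a generated token to either the text stream or the action stream."""
    if token_id < TEXT_VOCAB_SIZE:
        return ("text", token_id)                       # decode with the normal tokenizer
    bin_index = token_id - TEXT_VOCAB_SIZE              # 0..255
    value = -1.0 + 2.0 * bin_index / (ACTION_BINS - 1)  # map the bin back to [-1, 1]
    return ("action", value)                            # e.g. a joint or cursor delta

print(decode_token(1234))    # ('text', 1234)
print(decode_token(32000))   # ('action', -1.0)
print(decode_token(32255))   # ('action', 1.0)
```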
The functions that allow Microsoft Magma to control robots
The model, moreover, has two distinctive technical components. The first is Set-of-Mark, which identifies objects that can be manipulated by assigning numerical labels to interactive elements, such as clickable buttons in a user interface or objects that can be picked up and grasped in a robotic workspace. The second is Trace-of-Mark, which learns movement patterns from video data.
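A simplified sketch helps to picture the Set-of-Mark idea: candidate interactive regions (for example, boxes from a UI parser or an object detector) are given numeric marks, so the model can answer "click mark 3" rather than predicting raw pixel coordinates. The function below is a hypothetical illustration, not code from the Magma repository; Trace-of-Mark extends the same idea by tracking how those marks move across video frames.

```python
# Minimal sketch of the Set-of-Mark idea (hedged: not Magma's actual code).
# Detected interactive regions are numbered so the model can refer to them
# by mark instead of by raw coordinates.

def set_of_mark(regions):
    """regions: list of (x, y, w, h) boxes for clickable or graspable elements."""
    marks = {}
    for i, (x, y, w, h) in enumerate(regions, start=1):
        # Anchor each mark at the region centre; in practice the number would
        # also be drawn onto the image before it is shown to the model.
        marks[i] = (x + w // 2, y + h // 2)
    return marks

# Example: three buttons detected on a screenshot.
buttons = [(10, 10, 80, 30), (10, 50, 80, 30), (120, 10, 60, 30)]
print(set_of_mark(buttons))  # {1: (50, 25), 2: (50, 65), 3: (150, 25)}
```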
According to Microsoft, these two features allow the model to perform tasks such as navigating user interfaces or directing robots to grasp objects. As for its variants, the company states that the results Magma-8B has obtained on several benchmarks for these types of tasks are quite good, even surpassing those of vision-language-action models such as OpenVLA in various robotic manipulation tasks.
Of course, as with all AI models, Magma is not perfect, and it still has technical limitations in complex decision-making that must unfold step by step over time. Microsoft says it continues to work on improving these capabilities and is researching how to achieve it.
Meanwhile, Microsoft Magma's inference and training code is now available on GitHub, which will allow external researchers to build on it. If Magma lives up to the promises Microsoft has made for it, it could take the company's assistants beyond limited text interactions, allowing them to operate software autonomously and carry out real-world tasks through robotics.
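For those who want to experiment with the release, the checkpoint is expected to follow the standard Hugging Face pattern for models shipping custom code. The snippet below is only an orientation sketch, assuming the 8B model is published as microsoft/Magma-8B; the repository's README has the authoritative, up-to-date usage.

```python
# Hedged sketch: loading the released checkpoint with Hugging Face transformers.
# The model id "microsoft/Magma-8B" and the need for trust_remote_code are
# assumptions about how such releases are usually published; verify against
# the official README before use.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Magma-8B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# From here, a screenshot or camera frame plus a text instruction would be
# prepared with the processor and passed to model.generate(), as with any
# multimodal transformers model.
```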