What the IEEE Spatial Web Standard Means for Embodied AI

(Image by Gerd Altmann from Pixabay)

Today, the tools and processes for creating embodied AI (EAI) systems that learn by interacting with their environment are comparable to the state of natural language processing (NLP) in 2017. That was the year Google researchers introduced the transformer architecture, whose attention mechanism acts as a sort of glue between different aspects of AI development. This paved the way for large language models (LLMs) and applications like ChatGPT, which sparked the recent boom in generative AI.
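To make the analogy concrete, scaled dot-product attention lets every token weigh every other token when building its representation. The NumPy snippet below is a minimal sketch of that calculation, not production transformer code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ V                                  # weighted sum of values

# Toy example: 3 tokens with 4-dimensional embeddings
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V))
```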

However, these existing tools are trained on what people say about the world rather than on how AI systems can learn by interacting with the world. The problem is that there are many ways to describe the world depending on the use case, including 4D (space and time) models, physics, chemistry, different modes of perception, and different representations of the world in business systems.

The new IEEE P2874 draft standard for Spatial Web Protocol, Architecture, and Governance introduces Hyperspace Transaction Protocol (HSTP) and Hyperspace Modeling Language (HSML) to provide some order and interoperability between different representations. At a high level, this is akin to the vector embeddings that underlie LLMs for representing words, entities, and concepts, but for describing real-world interactions. Dan Mapes, founder and president of Verses, who worked on the standard, explains:

This is very parallel to the World Wide Web which has a protocol layer, HTTP and HTML, and is not owned by anyone. It’s just an open protocol. You can read the HTML yourself. And I think we now have 3.5 billion websites. There’s no way AOL could have built this on their own. This is the power of decentralization. You let people from all over the world build websites in Kenya and India and everywhere, but they all run on one protocol, so they can all be linked together, and now we’re moving from 2D web pages to 3D web spaces.

The Challenges of Embodied AI

Before delving deeper, it is worth clarifying the distinction between embodied AI systems, which learn from their interactions, and NLP systems, which learn from the words we write. Researchers have been exploring both disciplines since the early days of AI, and much of the innovation has been aimed at improving individual components of working systems. In NLP, for example, this included separate work on symbolic reasoning, vector encodings of words, and hand-built chatbot logic. Transformers provided a unified framework that helped industrialize LLM development on extremely large datasets across all of these processes.

EAI is generally seen as a way to build better robots and self-driving cars. But on a broader level, it's about finding better ways to train AI systems to perceive, act, remember and learn. For example, researchers have suggested that smart routers and recommendation engines on social networks, which learn from their interactions, constitute another type of EAI. The same could apply to supply chain recommendation systems, and industrial automation systems that learn from human feedback and sensors are also embodied in this sense. Mapes says:

The new embodied AI model can see the world. It observes it through satellites. It watches it through street cameras. It collects temperature data from sensors everywhere and reads data from buoys in the ocean to understand what is happening in ocean currents. We have a complete holographic model of the world. It updates in real time. We just never had a way to integrate it into a world model that an AI could operate on.

The most common approach to creating them today uses deep reinforcement learning (DRL) algorithms, which require human experts to craft policies specifying particular goals. Verses and other researchers are developing a new approach based on active inference, which allows agents to reduce their "free energy," a sophisticated way of describing how they learn to represent the world or act more effectively. Others are exploring intrinsic motivation algorithms that can teach agents curiosity. Underlying all of these approaches is the need to represent how different agents perceive, act and model the world in a more coherent manner.
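For intuition, active inference treats perception as minimizing variational free energy, which can be written as the KL divergence between the agent's beliefs and its prior, minus the expected log-likelihood of what it observes. The toy calculation below is a rough sketch of that idea with made-up numbers; it is not how Verses or any particular framework implements it.

```python
import numpy as np

# Two hidden states, e.g. "coffee on the table" vs. "coffee in the kitchen" (toy numbers)
prior = np.array([0.7, 0.3])             # p(s): prior belief over hidden states
likelihood = np.array([[0.9, 0.1],       # p(o|s): rows = observations, cols = states
                       [0.1, 0.9]])
q = np.array([0.5, 0.5])                 # q(s): current approximate posterior
observation = 1                          # the agent sees evidence for the second state

def free_energy(q, prior, lik_row):
    """Variational free energy: KL(q || prior) - E_q[log p(o|s)]."""
    kl = np.sum(q * np.log(q / prior))
    expected_log_lik = np.sum(q * np.log(lik_row))
    return kl - expected_log_lik

# Updating beliefs toward the exact posterior reduces free energy for this observation
posterior = prior * likelihood[observation]
posterior /= posterior.sum()

print("F before belief update:", free_energy(q, prior, likelihood[observation]))
print("F after belief update: ", free_energy(posterior, prior, likelihood[observation]))
```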

A crowded field

It is important to understand that HSML and HSTP enter a crowded field of standards, specifications and protocols for representing the world. The Universal Scene Description (USD) format improves the interoperability of various 3D representations. WebGL simplifies the delivery of 3D content in browsers, much as GIF and JPEG did for images. The Digital Twin Consortium develops open source libraries to improve the interoperability of digital twins across industries. The 3D Tiles open standard improves the interoperability of 3D representations in geospatial applications.

So why do we need a new standard, protocol or specification? The problem is that existing approaches were designed to streamline data exchange for known use cases. They may struggle to unify different representations across different types of agents and workflows, and it is that kind of unification that could lead to discovering better representations.

Take something as simple as a 3D model for a new plane or factory. USD can be translated into various tools to model its mechanical and electrical performance, supply requirements, manufacturing operations, financial aspects, customer experience and security. However, this does not allow an EAI system to easily develop a coherent representation of the who, what, why, when, and how related to its actions over time.

This is where HSML and HSTP come in. They create a basis for the EAI equivalent of the vector embeddings that helped LLMs take off, but applied to the loop of perception, action, memory and learning for EAI agents.
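As a rough illustration of that loop, the skeleton below shows an agent that perceives, acts, remembers and learns against a simple world model. All class and method names here are placeholders of my own; nothing in it is defined by the standard.

```python
from dataclasses import dataclass, field

@dataclass
class EmbodiedAgent:
    """Illustrative perception-action-memory-learning loop (names are placeholders)."""
    beliefs: dict = field(default_factory=dict)   # the agent's current world model
    memory: list = field(default_factory=list)    # history of observations and actions

    def perceive(self, observation: dict) -> None:
        self.beliefs.update(observation)          # fold new evidence into the world model

    def act(self) -> str:
        # Trivial decision rule, just to close the loop
        return "fetch_coffee" if self.beliefs.get("coffee") == "kitchen" else "wait"

    def remember(self, observation: dict, action: str) -> None:
        self.memory.append((observation, action))

    def learn(self) -> None:
        pass  # update model parameters from memory (omitted in this sketch)

agent = EmbodiedAgent()
obs = {"coffee": "kitchen"}
agent.perceive(obs)
action = agent.act()
agent.remember(obs, action)
agent.learn()
print(action)  # -> fetch_coffee
```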

HSML provides a way to describe the world. HSTP provides a framework for creating semantically compatible connections between connected objects and software systems. It specifies a spatial range query format; a semantic data ontology for describing objects, relationships and actions; a credentialing system; and a human-readable language that enables the understanding, expression and automated execution of legal, financial and physical activities.
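To make that more concrete, the snippet below sketches what a semantically described entity and a spatial range query might look like. The field names and structure are hypothetical illustrations of the ideas above, not the actual schema from the IEEE P2874 draft.

```python
# Hypothetical illustration only: these field names are not taken from the P2874 draft.
import json

entity = {
    "id": "urn:example:forklift-17",
    "type": "Vehicle",                              # what it is
    "owner": "urn:example:acme-warehouse",          # who is responsible for it
    "location": {"lat": 48.8566, "lon": 2.3522, "alt": 35.0},
    "credentials": ["urn:example:warehouse-operator-license"],
    "activities": [
        {"action": "move_pallet",                   # what it is doing
         "target": "urn:example:pallet-204",
         "when": "2024-05-01T09:30:00Z",            # when it happened
         "why": "restock-order-991"}                # why it was authorized
    ],
}

spatial_query = {
    "type": "range_query",
    "center": {"lat": 48.8566, "lon": 2.3522},
    "radius_m": 500,                                # entities within 500 meters
    "filter": {"type": "Vehicle"},
}

print(json.dumps(spatial_query, indent=2))
```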

Mapes explains:

The Hyperspace Transaction Protocol and Hyperspace Modeling Language are quite similar to how we build computer games. It is a modeling language. Now anyone can build a 3D world, including a metaverse, a digital copy of a hospital, or a smart city. And now there is a protocol to connect them all together. So instead of a network of pages, we can now have a network of spaces. And just like I can go from Facebook to Amazon to Google to whatever, I can now go from Cedar Sinai Hospital to Gucci to Abu Dhabi. They are all linked together and all operate on a single global protocol.

Mapes says that today, AI does not see the world; it looks at its training data. It's all about content, but that's not how humans do it:

So instead of looking at a massive brain with billions of parameters to answer your question, I just look at the entire world. Where is the coffee? Oh, it’s over here. I look with my eyes. I look at the world and I understand. I can look around me. I see the whole room I’m sitting in here and I have a model of the world. When I pick up the coffee and move it, I update the model of the world. Now I know coffee is off the table. It’s over in the kitchen. I just updated everything in real time. So I update all these vectors immediately.

The long-term vision is to facilitate the connection of EAIs that talk to each other, learn and grow, creating a new type of collective intelligence that could be greater than any single LLM. Mapes envisions:

We’re now going to have a global space network, all done locally, but they all communicate with each other. Just like there is no single intelligent person who knows everything, no, it’s a conversation between thousands of scientists who all write papers, give lectures, specialize in different things. You want people to build things from the ground up, and from there the collective wisdom of the world is born. So we need to build things on a protocol that makes it easy for anyone, anywhere in the world to build AI applications.

My opinion

Mapes paints a compelling vision for how industrial-scale AI could evolve beyond processing language to processing experiences. It’s reasonable to imagine that people will find ways to develop global models that incorporate different types of experiences for various use cases. Active inference approaches based on scientific research on human and animal brains suggest a promising path forward.

Some of Mapes’ earlier predictions were off by a few years. When I met him in 1992, he predicted that virtual reality would become widespread in 1994 with the launch of the Intel 486 computers. It took a few more years for the development of GPUs to allow 3D games to become widespread. Industrial digital twins took another twenty years but are now starting to prove their usefulness.

However, Verses is making significant progress with this new approach. The company has partnered with NASA JPL to improve spatial data representations and with the city of Abu Dhabi to implement a smart city.

Prominent neuroscientist Karl Friston, professor of imaging neuroscience at University College London, has also signed on as chief scientist at Verses. He conducted pioneering research on active inference and led work on renormalized generative models that achieved 99.8% accuracy on some tasks while using only 10% of the training data that LLMs require.

It's also important to understand that major breakthroughs like this take time. Five years passed between the development of transformers in 2017 and ChatGPT in 2022, which sparked the fastest app adoption in history. Things could move faster with active inference or other EAI approaches that can take advantage of the rapid growth of AI infrastructure. If disillusionment with LLMs creates overcapacity, these alternatives could grow more quickly, particularly if they prove less computationally demanding.