OpenAI's new model GPT-4o allows users to interact using voice, video, or text in a single unified system
GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction
OpenAI officially launched its latest AI model, GPT-4o, aiming to radically enhance how people communicate with artificially intelligent systems. At a live demonstration event, company leadership offered a glimpse of GPT-4o's integrated design, which enables real-time, multimodal conversations.
"We're entering a new era of collaboration between humans and machines," said CTO Mira Murati during the unveiling. GPT-4o merges capabilities like voice, video and text that previously required separate models into a single "omnimodel". This integrated approach was showcased through seamless transitions between mediums, with the model maintaining fluent discussions no matter the input method.
Two researchers tested GPT-4o's abilities via a video call. It effortlessly assisted with an algebra problem by observing the live video feed and guiding the user step by step without providing direct answers, much like a tutor. When asked to shift to a dramatic or robotic voice for a brief excerpt, it adapted its tone immediately, demonstrating impressive linguistic range.
For text interactions on websites and applications, the model delivered answers at a rapid pace while still working through nuanced replies. It was also able to switch fluidly to a phone-call conversation without missing a beat in the dialogue. According to OpenAI, merging previously separate functions into a unified structure enables faster response times and smoother task transitions.
Where ChatGPT was previously limited to text-based discussions, it can now understand images as well. For instance, one can take a photo of a menu in a foreign language and ChatGPT will provide an on-the-spot translation. It can also provide background on the cultural significance of dishes and advice on what to order.
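As a rough sketch of how such an image query could look programmatically, the helper below builds a Chat Completions request payload for a menu-translation prompt using the public `gpt-4o` model identifier. This is an illustrative assumption about usage, not part of the demo itself; the function only constructs the request body, and actually sending it would require an API key and an HTTP client such as the official `openai` package.

```python
import base64


def build_menu_translation_request(image_bytes: bytes,
                                   target_language: str = "English") -> dict:
    """Build a Chat Completions payload asking GPT-4o to translate a menu photo.

    The image is embedded as a base64 data URL, the format the API accepts
    for inline image content alongside text.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                # A single user turn can mix text and image content parts.
                "content": [
                    {
                        "type": "text",
                        "text": (
                            f"Translate this menu into {target_language}, "
                            "explain any culturally significant dishes, "
                            "and suggest what to order."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{encoded}"
                        },
                    },
                ],
            }
        ],
    }
```

In practice this dictionary would be passed to a client call such as `client.chat.completions.create(**payload)`; separating payload construction from transport keeps the multimodal message shape easy to inspect and test.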
Looking ahead, OpenAI plans to launch a new "Voice Mode" in beta soon. This will allow more natural back-and-forth discussions with ChatGPT via voice. Users might share a live sports broadcast, for example, and ask the AI assistant to explain the rules. The company aims to deliver this new conversational experience through both web and mobile apps.
Accessibility is another focus. ChatGPT's language capabilities have been improved in both speed and quality, and it now supports users in more than 50 languages across the full experience, from account creation to settings.
While live demonstrations come with inherent issues, GPT-4o recovered gracefully from minor inconsistencies. The event highlighted how far AI has progressed towards human-like interactions across multiple mediums.
In a noteworthy shift, OpenAI will now offer GPT-4o's full suite of voice, video, and text capabilities to all users for free through official channels. This move positions the company as a leader in dismantling barriers to conversational AI access, with paid subscription tiers retaining expanded capabilities and higher usage limits.
When using GPT-4o, ChatGPT Free users will now have access to features such as:
Experience GPT-4 level intelligence
Get responses from both the model and the web
Analyze data and create charts
Chat about photos you take
Upload files for assistance summarizing, writing or analyzing
Discover and use GPTs and the GPT Store
Build a more helpful experience with Memory