OpenAI Introduces GPT-4o

OpenAI live-streamed its Spring Update the day before Google’s annual I/O conference, whetting the appetites of the media and the online public. So what is new in GPT-4o, and has OpenAI kept its dominant position in the field of large models? Why did netizens exclaim, “You’re still the master” after watching the demo? I spent a day reading through coverage from media outlets, bloggers, and organizations, along with OpenAI’s own promotional videos, explanations, analyses, and hands-on tests. This article aims to make sense of the OpenAI Spring Update, which has been nicknamed the “Little Spring Festival Gala of the Science and Technology Sector”.

On April 30th, a large model called gpt2-chatbot quietly appeared on LMSYS, the large-model evaluation arena. Although no benchmark scores were made public, real-world tests by netizens suggested that it outperformed all large models then on the market. Prompt-based probing and token-based classifier research indicated that gpt2-chatbot most likely came from OpenAI and was an improved version of GPT-4. Testers found its abilities in logic, code, and math unmatched by any other large model at the time.

A widely circulated online analysis of the mysterious model states, “It is likely that the mysterious model is actually GPT-4.5, released as an instance of an ‘incremental’ model update”. According to that analysis, the model’s structured responses appear to be strongly influenced by techniques such as modified CoT (Chain of Thought), and the overall quality of the output, especially its formatting, structure, and comprehension, is excellent. Multiple people experienced in LLM prompting and chatbots (in both public and private settings) have noted that the quality of the output is surprisingly good.

As various media outlets picked up on the story, LMSYS quietly removed the model and updated its usage policy. The new policy makes clear that commercial companies can make new models available for public testing on the LMSYS platform on an “anonymous release” basis, that LMSYS will provide feedback and partial samples to the model provider, and that the model provider has the right to withdraw the model at any time. Shortly thereafter, LMSYS relaunched two slightly different variants of the mysterious model under the names im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot.

Meanwhile, OpenAI CEO Sam Altman stated publicly in a talk at Harvard that gpt2-chatbot is not OpenAI’s “next-generation large model” (i.e., the rumored GPT-5). In May, OpenAI updated its official website, and users found more than 50 new subdomains under OpenAI’s domain name. Media reports indicated that OpenAI had been working on a web search product, partially powered by Bing, a move analysts read as intensifying its competition with Google. One report cited a source, Jimmy Apples, who said the company planned to hold an event that month, tentatively scheduled for May 9, 2024, at 10 a.m. Sources also said OpenAI had been hiring an events team since early January to organize internal events, and on May 10, Reuters reported that OpenAI might schedule a search product launch for the day before Google’s annual I/O conference.

On May 11, OpenAI announced that it would present the latest ChatGPT and GPT-4 related updates live on its official website on May 13 at 10 a.m. U.S. time. At the same time, OpenAI CEO Sam Altman refuted the Reuters report that OpenAI would launch a search product the following Monday. In a post on X, Altman said that while OpenAI was scheduled to make an announcement on Monday morning, “it’s not GPT-5, it’s not a search engine,” but whatever it was, he said it “feels like magic.” The only detail provided in OpenAI’s official post was that the release would be an update to ChatGPT and its newest model, GPT-4.

It was then claimed that the so-called “search product” was bait thrown out by OpenAI to uncover internal leaks, and that the leaker, who had often passed information to Jimmy Apples and Flowers, had been fired by OpenAI. The Information reported that OpenAI was working on a full-fledged AI voice assistant, which it was expected to show off the following week. The new technology would be able to communicate with people through voice and text, recognizing different people’s intonations and tones, as well as objects and images.

It wasn’t until after the launch that we learned from tweets by OpenAI officials that im-also-a-good-gpt2-chatbot was in fact GPT-4o. They claimed that “not only is it the best model in the world, but it’s available for free in ChatGPT, which is unprecedented among cutting-edge models.” The model is rumored to be partly a product of applying Q-learning and A* search (Q*). In addition, LMSYS confirmed that all the gpt2-chatbots came from OpenAI and that they topped its internal leaderboards with very similar Arena Elo ratings, confidence intervals, coding results, win rates, and so on.

The launch showed the straightforwardness of OpenAI’s engineering staff: the main point of the event was put directly on the slides, which centered on the GPT-4o model, a model that is “accessible to everyone”. GPT-4o is the new foundation model OpenAI has launched after GPT-4. The “o” stands for “omni”, reflecting the model’s comprehensive upgrade in multimodality and other areas.

The biggest highlight of GPT-4o is that it supports multimodal input and output and can accept and generate any combination of text, speech, and images, making human-computer interaction more natural and smooth. Voice response speed is dramatically improved: according to OpenAI, the model can respond to audio input in as little as 232 milliseconds (about 320 milliseconds on average), which is close to human response time in conversation. In terms of performance, GPT-4o is comparable to GPT-4 Turbo on English text and programming, but significantly better on non-English text, vision, and speech understanding. At the same time, inference is faster and the API price is 50% lower.

In several benchmarks, GPT-4o has set new best scores compared with previous models. According to the official data released by OpenAI, GPT-4o scored 88.7 on MMLU (Massive Multitask Language Understanding), the highest score for a general-purpose model; scored 27.5% higher than Claude 3 Opus on the MATH mathematical reasoning benchmark; and scored 90.2 on the HumanEval programming test, also the highest.

At the launch event, GPT-4o’s multimodal capabilities were the focus of the demonstration. The voice assistant and video call functions were especially impressive: the “next decade” of voice assistants really does seem to be arriving.

According to OpenAI’s official description, GPT-4o is a truly multimodal, end-to-end model that accepts text, visual (picture/video), and auditory (audio) inputs and outputs any combination of the three. In other words, the voice assistant function that used to require Whisper (OpenAI’s speech-to-text model), GPT, and a text-to-speech (TTS) model chained in sequence can now be handled by a single model, which even supports video input.

End-to-end multimodal models are not new. The Gemini model introduced by Google once provided us with an end-to-end multimodal example.

You may remember that when Gemini was released, its demo video showed the model analyzing and responding to video input, which excited many netizens at the time.

However, Gemini ultimately failed to make much of a splash, and its demo video was called into question for being full of flaws. Google had to admit that the video had been sped up and spliced, and that Gemini even needed human prompts to make the right judgments based on the video inputs.

OpenAI has clearly learned from its predecessor. The GPT-4o release page is specifically labeled “All videos on this page are at 1x real time”, which also shows OpenAI’s confidence in its model’s capabilities.

Although we can’t try GPT-4o’s voice and video dialog functions right now, judging from the live official demo, GPT-4o’s multimodal performance is already impressive enough.

First of all, GPT-4o is an end-to-end multimodal model, which eliminates the voice-to-text conversion step. Compared with a traditional text generation model, it can directly capture information in audio and video that is difficult to express in text, such as facial expression, tone of voice, ambient sound, and the identity of the speaker.

Previously, in ChatGPT voice conversations, the app waited for the user to pause and then sent the audio to the Whisper model for recognition. Whisper simply converts the audio into subtitle-like text. Even when all of Whisper’s capabilities are called upon, it can only roughly distinguish speakers and recognize sound events such as singing and applause.
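The difference between the two designs can be illustrated with a toy sketch. Everything here is hypothetical: the AudioClip structure and the function names are invented for illustration and are not OpenAI APIs. The sketch only shows which pieces of information survive a speech-to-text stage and which do not.

```python
from dataclasses import dataclass


@dataclass
class AudioClip:
    """Toy stand-in for a recorded utterance (hypothetical structure)."""
    words: str
    tone: str        # e.g. "shaky", "excited"
    speaker: str     # who is talking
    background: str  # ambient sound


def transcribe(clip: AudioClip) -> str:
    """Stand-in for a speech-to-text stage: only the words survive."""
    return clip.words


def cascaded_inputs(clip: AudioClip) -> dict:
    """What a text-only model sees after the transcription stage."""
    return {"words": transcribe(clip)}


def end_to_end_inputs(clip: AudioClip) -> dict:
    """What an end-to-end audio model could, in principle, attend to."""
    return {
        "words": clip.words,
        "tone": clip.tone,
        "speaker": clip.speaker,
        "background": clip.background,
    }


clip = AudioClip("I'm fine.", tone="shaky", speaker="user",
                 background="heavy breathing")

# Fields that never reach the language model in the cascaded design.
lost = sorted(set(end_to_end_inputs(clip)) - set(cascaded_inputs(clip)))
print(lost)
```

In this toy model the words “I’m fine.” reach both designs, but the shaky tone that contradicts them reaches only the end-to-end one.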

In terms of output, traditional TTS models produced fixed-sounding speech. The model (or program) itself could not understand the text it was reading and had no way to analyze its emotion. With the advent of SSML (Speech Synthesis Markup Language), people (or large models) could guide speech synthesis programs to generate voices with different “emotions” by adding markers for intonation and pauses to the input text. But this is still essentially the result of pre-programming; no TTS model can understand the emotion of what it is reading aloud without markup cues. This explains why OpenAI’s TTS model was praised by netizens when it was released last year for its imitation of human intonation, hesitations, and other subtle details.
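As a rough illustration of what such markup looks like, the sketch below builds a small SSML document with Python’s standard library. The speak, prosody, break, and emphasis elements come from the W3C SSML specification; the specific sentence, attribute values, and the build_ssml helper are made up for this example, and which attributes a given synthesizer honors varies by engine.

```python
import xml.etree.ElementTree as ET


def build_ssml(sentence: str, pause_ms: int, emphasized: str) -> str:
    """Build a small SSML document (hypothetical helper for illustration)."""
    speak = ET.Element("speak")

    # Slow the first sentence down and raise its pitch slightly.
    prosody = ET.SubElement(speak, "prosody", rate="slow", pitch="+10%")
    prosody.text = sentence

    # Insert an explicit pause before the punch line.
    ET.SubElement(speak, "break", time=f"{pause_ms}ms")

    # Ask the synthesizer to stress the final phrase.
    emphasis = ET.SubElement(speak, "emphasis", level="strong")
    emphasis.text = emphasized

    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("I have some news.", 400, "We won!")
print(ssml)
```

The “emotion” here lives entirely in the markup chosen by whoever wrote it; the synthesizer only follows instructions, which is the limitation described above.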

GPT-4o’s emotional capabilities, on the other hand, are on full display at both the input and output ends. In addition to capturing information in audio and video that is difficult to express in text, GPT-4o’s voice output is no longer just cold text combined with a fixed emotional tone; every byte of the output is left to the large model itself. For example, during a voice conversation, GPT-4o actually performs audio-to-audio generation without an intermediate step of converting to text, so the large model gains the emotional capability of both listening and speaking.

What is even more impressive is that GPT-4o is a multimodal model that supports three input types. At the launch event, we saw that when GPT-4o “saw” the text “I ❤️ ChatGPT” written by a human on a piece of paper, it responded in a voice that sounded genuinely touched. What is involved here is a multimodal emotional capability that goes from audio and video input to audio output.

The launch of GPT-4o was accompanied by an update to the tokenizer vocabulary that significantly improves multilingual processing while dramatically reducing token usage.

According to OpenAI, the new vocabulary performs very well across multiple languages. In OpenAI’s example sentences, the number of tokens for Gujarati is reduced by a factor of 4.4, from 145 to 33; Telugu is reduced by a factor of 3.5, from 159 to 45; and even for Chinese, a relatively complex language, the number of tokens is reduced by a factor of 1.4 (about 29%), from 34 to 24.
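The factors above follow directly from the before and after token counts. A minimal sketch of the arithmetic, using only the counts quoted in this article (the compression helper is just for illustration):

```python
# (tokens before, tokens after) for OpenAI's example sentences,
# as quoted in this article.
samples = {
    "Gujarati": (145, 33),
    "Telugu": (159, 45),
    "Chinese": (34, 24),
}


def compression(before: int, after: int) -> tuple[float, int]:
    """Return (reduction factor, percent of tokens saved)."""
    factor = before / after
    saved_percent = (1 - after / before) * 100
    return round(factor, 1), round(saved_percent)


results = {lang: compression(b, a) for lang, (b, a) in samples.items()}
for lang, (factor, saved) in results.items():
    print(f"{lang}: {factor}x fewer tokens, about {saved}% saved")
```

Note that a reduction “by a factor of 1.4” and a reduction “of 40%” are not the same thing: going from 34 to 24 tokens saves roughly 29% of the tokens.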

According to current analysis, the new tokenizer, named “o200k_base”, has a larger vocabulary (roughly 200,000 entries, about twice that of the previous “cl100k_base”), which significantly reduces the number of tokens needed for different languages.

The vocabulary update is also one reason why GPT-4o generates faster. Even if computing power and model size stay the same, reducing the number of tokens (for example, by letting one token cover more characters, such as a Chinese idiom or common saying) gives users a clearly noticeable increase in generation speed. What’s more, in current experience with the API, the number of tokens GPT-4o generates per second has also improved significantly.
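The speed argument can be put as simple arithmetic: if the model emits tokens at a fixed rate, the time to produce a given sentence scales with its token count. In the sketch below, the rate of 50 tokens per second is a made-up figure for illustration, not a measured GPT-4o number; the token counts are the Chinese example quoted above.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a constant decoding rate."""
    return num_tokens / tokens_per_second


RATE = 50.0  # hypothetical tokens per second, held constant on purpose

old_time = generation_seconds(34, RATE)  # same sentence, old tokenizer
new_time = generation_seconds(24, RATE)  # same sentence, new tokenizer

# The speedup depends only on the token ratio, not on the assumed rate.
speedup = old_time / new_time
print(f"{old_time:.2f}s -> {new_time:.2f}s ({speedup:.2f}x faster)")
```

Any real increase in tokens per second then multiplies on top of this tokenizer-only gain.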

From the appearance of the mysterious model gpt2-chatbot at the end of April to the official release of GPT-4o in mid-May, OpenAI’s spring update has undoubtedly stirred up discussion about large models, and about OpenAI itself.

However, according to the market’s general prediction, OpenAI’s “big move” this year goes well beyond this. Its “next-generation” model, GPT-5, has reportedly more or less completed training and recently entered the red-team safety testing phase, and it is expected to be officially released in the middle of this year at the earliest.

My guess is that this OpenAI Spring Update is a “half-generation upgrade before the big show”, a routine upgrade meant to regain the focus of public opinion and industry discourse. As for the next-generation large model “GPT-5” that people are waiting for, let’s wait and see.
