by Larry Magid
This has been a big week for generative AI (GAI), the technology that can answer questions or create content based on a user’s prompt.
Both OpenAI and Google announced major updates to their GAI products. OpenAI has released its new ChatGPT 4o (the o stands for “omni”) while Google has announced several updates to its AI products, including its Gemini chatbot that the company is integrating into the ubiquitous Google search engine.
ChatGPT 4o offers multiple modalities for text, vision and audio and is being upgraded to not only understand the human voice in multiple languages but also engage in a conversation. You can even interrupt it with follow-up questions or comments while it’s speaking or displaying text.
In a demonstration video, OpenAI’s Mira Murati and colleagues showed how the new version not only speaks but understands and expresses emotions. The demo had a very natural sounding female voice that was expressive, emotive and even humorous at times. It would be easy to think you’re talking with a real person if you didn’t know it was a computer. It’s so lifelike that I worry that some people may forget it’s a machine and not a living person.
In addition to answering questions, ChatGPT is now able to analyze what it’s looking at. In the video, a person pointed his phone’s camera at a computer screen displaying computer code and got ChatGPT to describe the code and what it can do. In another example, the person asked ChatGPT to help him with math by giving him steps, not answers. Using its friendly voice, ChatGPT was amazingly patient and helpful, making me think it has a future as a teacher or tutor. Of course, this was a demonstration, not a hands-on review, so as impressive as it was, the real proof will come when people have a chance to try it in the real world.
In addition to new capabilities, ChatGPT 4o is much faster and is available for free, though there will continue to be a paid version for those who want more features or are heavy users. The company is also releasing an application programming interface (API) that will enable independent developers to incorporate ChatGPT 4o into other products. Those apps and services, in the long run, may be more powerful and widely used than whatever OpenAI offers directly to the public.
Google puts AI into search and other products
At its Google I/O developers conference, Google announced enhancements to its Gemini AI chatbot as well as further integration of generative AI into Google search. Search, which has historically been a way for Google to direct people toward helpful websites, is morphing into a tool that will provide answers directly rather than just links.
In her presentation at Google I/O, Google’s VP and head of Search, Liz Reid, described what the company calls AI Overviews as “Whatever’s on your mind, whatever you need to get done, just ask, and Google will do the googling for you.” The process will enable users to “ask your most complex questions, with all the nuances and caveats you have in mind, all in one go,” rather than having to break a question into multiple searches, according to a Google blog post.
Not everyone is happy about this. Danielle Coffey, the chief executive of the News/Media Alliance, told CNN that it “will be catastrophic to our traffic.” Some publishers fear that Google and other generative AI companies will deliver summary information, perhaps drawn from the news organizations themselves, that would otherwise require readers to visit news sites, which are supported by advertisers and/or subscription fees.
Google is also integrating its Gemini AI into its other products, including Google Photos, which can now, for example, tell you your license plate number if it has a picture of your plate. It can also help evaluate how your kids are doing with their swimming lessons, based on photos and videos of them in pools, lakes and ocean waters.
In one demonstration, a Google executive used her phone to video a record player arm skipping across a record, then asked Google “why will this not stay in place” to get results on what might be wrong with the device. In another demo, a Google employee scanned a room with her phone and said “tell me when you see something that makes sound” and heard “I see a speaker, which makes sound.” She then asked it to describe the parts of the speaker to learn about tweeters and woofers.
Where are my glasses?
My favorite demonstration was when that same Google employee asked “where did I put my glasses” and got an answer from the phone, which had recorded her movement in the room and told her where to find her spectacles. Although I worry a little about the privacy implications of that powerful tool, I fully appreciate how useful it can be, especially for those of us who have a habit of misplacing things.
As I’ve said before, in my four decades as a technology columnist, I put generative AI high on the list of major developments, akin to the launch of broadband, the World Wide Web and the iPhone. But what’s even more amazing is how quickly GAI is evolving and improving.
This post first appeared in the Mercury News