In today’s column, I am going to examine and explain the brittleness that exists within modern-day generative AI.
To exemplify this matter, I will lean into a widely noticed incident of last week, on February 20, 2024, in which the wildly popular generative AI app ChatGPT by AI maker OpenAI spewed gibberish at times in response to user requests and prompts. The matter was briefly covered in the media and then other top stories left the otherwise newsworthy topic behind in the dust. I’d like to make sure we learn and record some useful lessons from the circumstances involved.
Here’s the deal. When it comes to generative AI and what you generally see, a key takeaway is that we should be careful and not judge a book by its cover.
I say that with a hint of tongue-in-cheek. The usual meaning is that even if the cover is unappealing, the internal contents might be astonishing. I am using the famous catchphrase in the reverse, namely that a cover that seems overwhelmingly impressive might not necessarily have internal content that equally matches that impression.
Allow me a moment to unpack that tantalizing remark.
What’s On The Outside Versus What’s On The Inside
People who use generative AI are often amazed at the apparent fluency of the AI. They are floored at the responses they normally receive from the AI. You could easily be fooled into believing that the AI is sentient. Please be aware that despite those zany headlines claiming AI sentience is here or near, we do not have any sentient AI today. Period, end of story.
The underlying capability that produces the apparent fluency is based on mathematical and computational pattern-matching. It’s darned good. This has especially been the case so far when the pattern-matching structures and algorithms are implemented at scale. The larger the data sets used to data-train generative AI, the better the results have so far been. Likewise, the larger the data structures used and the greater the amount of computer processing consumed, the better the results. I have discussed that not everyone believes that this upward scaling is going to get us proportionally added benefits and that we are possibly going to hit a plateau, after which we will need to come up with new ideas, see my coverage at the link here.
In any case, generative AI has become the darling of AI and people are continually impressed with the fluency of the essays and interactive dialogue that can be had with the AI. There is a certain kind of beauty to it. Sometimes you ask a question merely to see what kind of eloquent answer you might get. The responses can create a feeling of awe. You might silently in your head be saying to yourself, look at what humankind has been able to attain in our lifetime.
But remember that I said don’t judge a book by its cover.
If you were to look underneath the shiny exterior of generative AI, you would tend to find a byzantine array of complex data structures and at times quite convoluted code and messy algorithms. This has for example tended to stymie efforts to logically understand and symbolically interpret what is happening within the AI, see my analysis at the link here.
As a seasoned software engineer and a research scholar in software engineering, I can attest to the fact that much of today’s generative AI is poorly software-engineered and rife with unsavory software qualities, including brittleness, insufficient checks and balances, inadequate testing, and a slew of other software weaknesses that would make you ill if you saw what was really going on.
I might liken this to a highway diner. The food that comes out to your table might look good, but heaven knows what type of cooking efforts are going on in the back kitchen. You have to hope that if there is a lack of cleanliness or other cooking gaffes happening, you won’t suffer from it.
In the case of generative AI, the analogy to a book cover is that by external appearances the generative AI seems splendid indeed. The devil is in the details, namely the interior or the inner system constructs that make the generative AI do what it does. What is inside the handsome book can be pretty dicey.
Your reaction to this might be that nobody other than the AI coders needs to care about the inside of generative AI. As long as the generative AI is doing its dandy essay generation and interactions, the inside stuff can be made of strings and duct tape. Who cares? All you care about is results.
Well, to some degree that’s a valid point, except that when generative AI becomes an essential element in our society, and we become dependent upon the generative AI to be up and running properly, I would say that any failings or falterings are going to raise quite a stink. We aren’t there yet in the sense that generative AI is not fully embedded into everything we do. That day is coming.
I’m giving you a heads-up right now, in preparation for what is likely going to be an expensive societal blowback once generative AI is truly ubiquitous. Imagine how embedded generative AI is going to be in another year or two. We are witnessing astounding growth in the mass adoption of a somewhat new form of technology in a relatively short period of time. And, as I claim, based to some extent on a software house of cards that can fall apart at the drop of a hat.
I’d like to riff briefly on this.
Dependency On Generative AI Is Increasingly A Looming Concern
Nowadays, the bulk of generative AI usage is seemingly of an optional nature or at least not a critically top priority for our essential existence.
Plenty of people are using generative AI for fun. Some use generative AI to help them in school. There are people who use generative AI to aid in doing their work. Very little of this prevailing usage could reasonably be classified as make-or-break. We haven’t yet fully entered an upcoming era of generative AI as a keystone for what we do, serving as a backbone for many if not most of the other automated systems we depend upon for our existence, see my predictions at the link here.
In that sense, if a generative AI app hiccups, the repercussions currently are likely to be modest or minor. I’m not saying that you wouldn’t be irked if generative AI went down. You almost certainly would be, particularly if using generative AI on an ongoing basis. All in all, though, the irritation would be a disruption to your day rather than something that wreaks havoc on it.
Let’s put our heads together and think of some kind of system that you do presently depend upon.
Thinking, thinking, thinking.
Aha, your cell phone or smartphone and cellular network connectivity come to mind.
As a prime example of a ubiquitous automated system that today can cause potentially dramatic disruption and disturbances when it goes down, you can consider cellular network connectivity as up there in the rankings. Think about the cellular phone network you personally rely upon. You almost certainly heavily depend upon your cell phone, which, in turn, means you rely upon the connected network. The days of being able to easily find a backup such as a pay phone or a regular wired phone someplace are long gone.
You need your cell phone to be working and you depend greatly on that assumption.
The same week that ChatGPT perchance issued gibberish, we had another, unrelated technological snafu: on Thursday, February 22, 2024, the AT&T cellular network service had a large and pronounced outage.
You might not have known of the outage if you were using a different service provider. Even so, you might have gotten jammed up anyway when trying to reach those on AT&T who were experiencing the outage. It is therefore reasonable to suggest that the outage impacted more people than solely those using AT&T.
Here is emblematic news coverage about the outage, as excerpted from “Outage map shows where AT&T service was down for cellphone users across U.S.” by Aimee Picchi, CBS News, February 22, 2024:
- “Tens of thousands of AT&T customers reported problems with their cellphone service on Thursday morning, with a map of the outage showing people affected across the U.S.”
- “At about 11 a.m. ET on Thursday, AT&T said it had made progress in restoring its network. By mid-afternoon, it said service had been fully restored.”
- “We have restored wireless service to all our affected customers,” AT&T said in a statement at 3:10 p.m. ET. “We sincerely apologize to them. Keeping our customers connected remains our top priority, and we are taking steps to ensure our customers do not experience this again in the future.”
- “Later Thursday night, the company attributed the outage to a software bug. “Based on our initial review, we believe that today’s outage was caused by the application and execution of an incorrect process used as we were expanding our network, not a cyber-attack,” the company said on its website.”
Why have I drawn your attention to the AT&T outage?
Because the culprit was reportedly a software bug or error.
I will say more about software bugs, and you will soon see that a software bug was also said to be the crux of the issue that caused ChatGPT to start spewing gibberish.
Software bugs are a people issue. People introduce bugs or errors into the software. Keep that in mind. There is a storied effort by some to wave away software bugs as though they just magically spring out of thin air. Nope. They are usually human-made.
Let’s discuss the ChatGPT situation.
What Happened With ChatGPT On The Gibberish Day
On Tuesday, February 20, 2024, posts on social media began to indicate that ChatGPT was seemingly at times producing odd gibberish. This arose in the late afternoon PST and continued until later that evening. OpenAI said they ultimately resolved the issue and subsequently posted this official notification about what had happened (excerpted as per the OpenAI incident report log):
- “Unexpected responses from ChatGPT: Incident Report for OpenAI”
- “Postmortem: On February 20, 2024, an optimization to the user experience introduced a bug with how the model processes language. LLMs generate responses by randomly sampling words based in part on probabilities. Their ‘language’ consists of numbers that map to tokens.”
- “In this case, the bug was in the step where the model chooses these numbers. Akin to being lost in translation, the model chose slightly wrong numbers, which produced word sequences that made no sense. More technically, inference kernels produced incorrect results when used in certain GPU configurations.”
- “Upon identifying the cause of this incident, we rolled out a fix and confirmed that the incident was resolved.”
According to their explanation above, a developer or a development team was attempting to enhance ChatGPT and inadvertently introduced an error or bug into the AI. The bug then led to the odd gibberish being produced. After having identified the problem, a coding fix was made, and the issue was considered resolved. I will be saying more about this later on herein, so please hang in there.
The official OpenAI incident logging also provided stepwise progress indications during the time frame that the matter was being explored and resolved. Here is the OpenAI posted log as listed by the stated time of posting. The initial posting occurred at 3:40 p.m. PST and the resolved posting was made about seven and a half hours later in the evening at approximately 11:15 p.m. PST (per excerpts of the posted log):
- Log posting 3:40 p.m. PST. “Investigating. We are investigating reports of unexpected responses from ChatGPT. Posted Feb 20, 2024 – 15:40 PST.”
- Log posting 3:47 p.m. PST. “Identified. The issue has been identified and is being remediated now. Feb 20, 2024 – 15:47 PST.”
- Log posting 4:59 p.m. PST. “Monitoring. We’re continuing to monitor the situation. Posted Feb 20, 2024 – 16:59 PST.”
- Log posting 11:14 p.m. PST “Resolved. ChatGPT is operating normally. Posted Feb 20, 2024 – 23:14 PST.”
Assuming that the log posting is accurate, it appears that the bug or error was relatively quickly identified, as shown by the time lapse from “Investigating” to “Identified”, which was only seven minutes in length. That’s fast. Of course, the posting time is not necessarily an apt means of gauging this since they might have begun examining the error before the first posting and thus taken more time than is reflected in the posting times per se.
The speed of identification seems quite reasonable if we assume that they had recently made the enhancement changes and likely focused their initial attention there. The usual approach to debugging in a situation like this is that if the AI was working properly up until the latest enacted changes, the chances are that those latest changes are the offending culprit. This thinking would have dramatically narrowed the search for the bug. Also, the gibberish aspect was likely an easy and quite handy clue too, due to the ease of being able to examine the inner workings and discover somewhat rapidly that the choosing of wrong numbers was at the core of the problem. If the oddish results were more obscure, the search for the crux of the problem might have been more complicated and prolonged.
Online reports by the news media showcased various examples of what ChatGPT was generating as inexplicable answers to user prompts during the gibberish-producing timeframe:
- “When one user inquired what a computer was, ChatGPT allegedly replied, “It does this as the good work of a web of art for the country, a mouse of science, an easy draw of a sad few, and finally, the global house of art, just in one job in the total rest.” (Source: “ChatGPT glitches out: Rogue AI responding in nonsensical Spanglish, gibberish” by Ben Cost, New York Post, February 20, 2024).
- “While discussing the Jackson family of musicians, the chatbot explained to a Reddit user that “Schwittendly, the sparkle of tourmar on the crest has as much to do with the golver of the ‘moon paths’ as it shifts from follow.” (Source: “ChatGPT spat out gibberish for many users overnight before OpenAI fixed it” by Wes Davis, The Verge, February 21, 2024).
- “One of the odder responses included below shows the model incorporating a variety of these oddities when apologizing to a user for its repeated mistakes. “Would it glad your clickies to grasp-turn-tooth over a mind-ocean jello type? Or submarine-else que quesieras que dove in-toe? Please, share with there-forth combo desire!” (Source: “ChatGPT has been generating bizarre nonsense (more than usual)” by Mack DeGeurin, Popular Science, February 21, 2024).
I want you to keep in mind those examples of the gibberish being generated by ChatGPT during the incident. We will come back to those, and I will be explaining the basis for how they likely arose, plus I will be bringing you up to speed overall about how generative AI works, which I hope will increase your awareness of what takes place under the hood.
Words Are Vital To Generative AI And Humans Too
Words are the coinage of human communication.
Likewise, words are the core of how generative AI and large language models (LLMs) carry out their processing. You input words as prompts into generative AI and you get words that are then generated and shown to you. That is the text-to-text mode, sometimes also referred to as text-to-essay. There are other modes such as text-to-video, video-to-text, text-to-audio, audio-to-text, and so on, which constitute the emerging use of multi-modal generative AI, see my description and predictions on where generative AI is heading, at the link here.
Generative AI and LLMs consist of software running on hardware, and ergo the processing of words has to be undertaken numerically. Words of text are converted into numbers. Those numbers are processed, producing a different set of numbers. The resulting set of numbers is converted back into text or words, and you then see the generated results as essays or interactive dialogue.
I am going to bring you up to speed about how generative AI and LLMs convert text or words into a numeric format, referred to generally as encoding, and then how the numbers are later converted back into text or words, referred to usually as decoding. All in all, this is generally referred to as tokenization. Tokens are numbers or numeric IDs. Tokenization consists of encoding words into their respective numeric IDs, and then later decoding by converting the numeric IDs back into the text words they represent.
Going into the details on this is going to be very useful to grasp how ChatGPT ended up producing gibberish and can aid you in your daily use of generative AI overall.
Before I get into the conversion or tokenization aspects, perhaps we can take a moment to mull over the nature of words.
We use words all the time. Occasionally, someone mentions a word that you don’t know, and you perhaps are curious enough to investigate what the word means. Would you say that you have an extensive repertoire of words or a narrow one? For the sake of simplicity, I am going to focus herein on the English language.
How many words are there in the English language?
Take a guess.
You might say there are a lot of words. Well, we need to be more concrete. You might argue that there are an infinite number of possible words since we can always make up new words. I get that. But we are going to be more practical on this for now.
Some suggest there are around 1 million words. Notably, a typical dictionary lists approximately a half million words (according to the Merriam-Webster dictionary as stated in an online help-related passage entitled “How many words are there in English?”), per these excerpts:
- “It has been estimated that the vocabulary of English includes roughly 1 million words (although most linguists would take that estimate with a chunk of salt, and some have said they wouldn’t be surprised if it is off the mark by a quarter-million); that tally includes the myriad names of chemicals and other scientific entities. Many of these are so peripheral to common English use that they do not or are not likely to appear even in an unabridged dictionary.”
- “Webster’s Third New International Dictionary, Unabridged, together with its 1993 Addenda Section, includes some 470,000 entries. The Oxford English Dictionary, Second Edition, reports that it includes a similar number.”
In a somewhat more pragmatic sense, the question arises as to how many words people actually know rather than the entirety of words that they might select from.
A myriad of research studies has sought to figure out how many words are commonly used in everyday discourse. A research study entitled “How Many Words Do We Know? Practical Estimates of Vocabulary Size Dependent on Word Definition, the Degree of Language Input and the Participant’s Age” by Marc Brysbaert, Michaël Stevens, Paweł Mandera, and Emmanuel Keuleers, Frontiers in Psychology, July 2016, performed a deep analysis on this matter.
First, they indicated that we should straighten things out by focusing on the foundational or base words so as not to inadvertently overcount the number of distinctive words being used. You could easily overcount by including, for example, the past tenses of words or the plural versions of words, along with including proper nouns such as the names of things. The researchers opted to define the foundational or base words as lemmas (here is an excerpt of their definition of lemmas):
- “Uninflected word from which all inflected words are derived. In most analyses is limited to alphabetical word types that are seen by the English community as existing words (e.g., they are mentioned in a dictionary or a group of people on the web use them with a consistent meaning). In general, lemmas exclude proper nouns (names of people, places, …). Lemmatization also involves correcting spelling errors and standardizing spelling variants.”
The research paper then stepped through a detailed analysis leading to a result of estimating that different age groups tend to have somewhat differing sizes of lemma vocabularies and that a typical English-speaking 20-year-old probably knows around 42,000 lemmas (for convenience herein, let’s say that implies 42,000 base words). A range from around 27,000 lemmas to perhaps 52,000 lemmas was considered a relatively comprehensive range (again, for convenience, consider this as 27,000 to 52,000 base words).
Here’s an excerpt of their analysis:
- “Based on an analysis of the literature and a large-scale crowdsourcing experiment, we estimate that an average 20-year-old native speaker of American English knows 42,000 lemmas and 4,200 non-transparent multiword expressions, derived from 11,100-word families. The numbers range from 27,000 lemmas for the lowest 5% to 52,000 for the highest 5%. Between the ages of 20 and 60, the average person learns 6,000 extra lemmas, or about one new lemma every 2 days. The knowledge of the words can be as shallow as knowing that the word exists. In addition, people learn tens of thousands of inflected forms and proper nouns (names), which account for the substantially high numbers of ‘words known’ mentioned in other publications.”
Okay, so how does that make you feel?
The research suggests that out of perhaps 1 million words in the English language (give or take), the odds are that most people are likely to be familiar with and potentially use around 30,000 to 50,000 base words. This isn’t a precise apples-to-apples comparison since the million-word figure tends to include variants of base words, along with some proper nouns, etc.
All in all, we seem to be undercutting the fullness of the English words available to us. Give that a thought. Would we be better off if we all had a more extensive grasp of the possible words, or would we be about the same or maybe even somehow worse off?
You might want to get a glass of wine and contemplate the weighty matter.
Encoding And Decoding Of Words Via The Use Of Tokens
I mentioned earlier that we will need to convert words into numeric values for use inside generative AI.
The way we will do this is by establishing a dictionary of words and assigning whatever numeric values or IDs we want to use. Different generative AI apps will tend to use different numbering schemes, though there are online tokenization tools available that provide a kind of semi-common approach.
Generative AI in a simple macroscopic viewpoint makes use of three major steps or stages:
- (1) Encoding words into tokens (numeric IDs).
- (2) Processing of tokens.
- (3) Decoding tokens into words.
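To make those three stages a bit more concrete, here is a minimal Python skeleton of my own devising showing how the stages fit together. To be clear, the functions below are illustrative placeholders, not the actual internals of any particular generative AI app (the encoding and decoding steps are fleshed out in the exercises that follow):

```python
# Placeholder stubs so the skeleton runs; the real encoding and decoding
# steps are walked through in detail later in this discussion.
def encode(text: str) -> list[int]:
    return [0 for _ in text.split()]          # stand-in: one dummy token per word

def process_tokens(tokens: list[int]) -> list[int]:
    return tokens                             # stand-in: the model's number crunching

def decode(tokens: list[int]) -> str:
    return " ".join("word" for _ in tokens)   # stand-in: tokens back to words

def generate(prompt_text: str) -> str:
    tokens = encode(prompt_text)              # (1) words -> numeric token IDs
    output_tokens = process_tokens(tokens)    # (2) processing of the tokens
    return decode(output_tokens)              # (3) token IDs -> words for display
```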
We can easily explore how encoding and decoding typically take place.
Before I get into the mechanisms, I would like to say something else about the tokenization process. One approach would be to assign numeric IDs for each word that we might anticipate ever encountering in our generative AI. In the case of the English language, and per my discussion above, we presumably would need to have a dictionary of perhaps a million words and assign a unique ID to each respective word. The same would need to be done to incorporate other languages and our dictionary would be sized up accordingly.
Rather than depicting words in their entirety, another approach consists of dividing words into segments of a few letters. The word “boomerang” might be divided into three segments such as “boom”, “e”, and “rang”. Or we could decide to divide it into three-letter segments, which might look like this: “boo”, “mer”, “ang”. The most frequently used approach has landed on dividing up words into roughly three-letter segments (kind of, with various exceptions).
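As a quick illustration of that segmentation notion, here is a tiny Python sketch of my own, a naive fixed-width splitter. I’ll note that real-world tokenizers (such as byte-pair encoders) choose segments based on frequency patterns rather than a fixed width, so treat this as a simplification:

```python
def chunk_word(word: str, size: int = 3) -> list[str]:
    # Naively split a word into fixed-width segments of `size` letters.
    return [word[i:i + size] for i in range(0, len(word), size)]

print(chunk_word("boomerang"))  # ['boo', 'mer', 'ang']
```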
A rule of thumb that is often used is that 750 words will produce approximately 1,000 tokens. This is merely a generalized notion.
You can consider the rule of thumb this way:
- Number of words = Number of tokens x 0.75
- Number of tokens = Number of words x 1.33
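If you want to see the rule of thumb in action, here is a quick back-of-the-envelope calculation in Python. The multipliers come straight from the rule of thumb above and are only rough approximations:

```python
def estimated_tokens(word_count: int) -> int:
    # Roughly 1.33 tokens per word (i.e., about 1,000 tokens per 750 words).
    return round(word_count * 1.33)

def estimated_words(token_count: int) -> int:
    # Roughly 0.75 words per token.
    return round(token_count * 0.75)

print(estimated_tokens(750))   # 998, i.e., approximately 1,000 tokens
print(estimated_words(1000))   # 750 words
```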
The reason you might care about the number of tokens is that it is customary for generative AI apps to be set up to charge you based on the number of tokens that you use. Also, the limitations of a generative AI app such as the amount of text that you can use in a conversation are often stated in terms of the number of tokens allowed.
I know this might seem like one of those situations whereby the techies make things harder than they need to be and ought to just use counts of words, but the token is in fact an important underlying metric and makes sense for use in this context. You might liken the metric to playing baseball and referring to RBIs (runs batted in) and ABs (at-bats). A particular domain or field of specialty often devises special nomenclature that fits its needs.
You must be eagerly waiting to see how this works.
I won’t keep you any further on the edge of your seat.
Suppose that we have defined an internal dictionary of words that contains our preferred set of made-up assigned numeric IDs. In this instance, our numeric IDs will each be five digits in length, ranging from 1 to 99,999. Assuming that each distinct word is assigned a respective unique five-digit ID, we can therefore have 99,999 words in the dictionary (well, if you want to include zero, we presumably could account for 100,000 words).
We will not treat capitalized words as being different than their base word, though some tokenization schemes do assign a different ID when a word is capitalized or otherwise differently punctuated or formatted. I am keeping things extremely simple here.
Imagine that we took a glimpse at the internal dictionary and saw these words (shown in alphabetical order for convenience here):
- cat = 34567
- did = 12345
- mat = 67890
- on = 56789
- sit = 45678
- the = 23456
I trust that you can discern that for example the word “cat” is going to be turned into the number 34567 when we do the tokenization. Henceforth, while inside the generative AI processing, the number 34567 is going to be representing the word “cat”. The word “sit” is associated with the numeric ID of 45678. And so on.
The numbers are not meant to suggest anything else other than that they are each assigned to a respective word. Thus, the number 45678 (representing the word “sit”) is not especially numerically superior or inferior to the number 34567 (representing the word “cat”). They are merely numbers that we have decided to assign to respective words.
Okay, you are ready to do some encoding.
Are you up for this?
I hope so.
Suppose a user enters this prompt: “Did the cat sit on the mat?”
I want you to convert the words into their respective tokens. This should be easy-peasy. Just look up each word in the dictionary snippet I showed above. Write down the ID for the word.
You should have a list that looks like this:
- “Did” = 12345
- “the” = 23456
- “cat” = 34567
- “sit” = 45678
- “on” = 56789
- “the” = 23456
- “mat” = 67890
Our original sentence in words was this: “Did the cat sit on the mat?”
I’ll place the respective token next to each word, showing you step-by-step my conversion:
- “Did [12345] the [23456] cat [34567] sit [45678] on [56789] the [23456] mat [67890]”
The tokenized version after removing the words is this:
- [12345] [23456] [34567] [45678] [56789] [23456] [67890]
Those numbers would next be passed along to the processing portion of the generative AI. Congratulations, you did a simple encoding exercise.
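For those who would like to see the same exercise expressed in code, here is a minimal Python sketch of the encoding step using our toy dictionary. This is purely illustrative of the concept and not the actual implementation inside any generative AI app:

```python
# Our toy word-to-token dictionary from the walkthrough above.
WORD_TO_ID = {
    "cat": 34567, "did": 12345, "mat": 67890,
    "on": 56789, "sit": 45678, "the": 23456,
}

def encode(sentence: str) -> list[int]:
    # Strip punctuation and lowercase so "Did" and "mat?" match the dictionary.
    words = [w.strip("?.,!").lower() for w in sentence.split()]
    return [WORD_TO_ID[w] for w in words]

print(encode("Did the cat sit on the mat?"))
# [12345, 23456, 34567, 45678, 56789, 23456, 67890]
```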
To complete this portion of the exercise, let’s do a decoding.
First, I will show you a little bit more of the dictionary:
- cat = 34567
- did = 12345
- mat = 67890
- on = 56789
- sit = 45678
- the = 23456
- yes = 63112
- was = 29813
Envision that the response generated by the AI, after processing the above-encoded sentence, consists of these tokens, which are to be converted into words before the generated response is shown to the user.
Tokenized words ready for decoding and display:
- 63112
- 23456
- 34567
- 29813
- 56789
- 23456
- 67890
Please go ahead and convert the tokens into their respective words. For example, the first token is the numeric ID of 63112. I looked up the ID in the dictionary shown above and found that “yes = 63112” and thus the first word that we are generating is the word “yes”. And so on.
I’ll wait for you to try your hand at the rest of the tokenized sentence.
The result should be this:
- 63112 = yes
- 23456 = the
- 34567 = cat
- 29813 = was
- 56789 = on
- 23456 = the
- 67890 = mat
If we put those words in a format for viewing, we would have this:
- “Yes, the cat was on the mat.”
Voila, you receive an honorary badge for tokenization, including both encoding and decoding.
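To round things out in code, here is a matching Python sketch of the decoding step, using the expanded toy dictionary and the tokens of the presumed AI response (again, merely an illustrative sketch):

```python
# The expanded toy dictionary, inverted to map token IDs back to words.
WORD_TO_ID = {
    "cat": 34567, "did": 12345, "mat": 67890, "on": 56789,
    "sit": 45678, "the": 23456, "yes": 63112, "was": 29813,
}
ID_TO_WORD = {token_id: word for word, token_id in WORD_TO_ID.items()}

def decode(tokens: list[int]) -> str:
    # Look up each token ID and stitch the words back into a sentence.
    return " ".join(ID_TO_WORD[t] for t in tokens)

print(decode([63112, 23456, 34567, 29813, 56789, 23456, 67890]))
# yes the cat was on the mat
```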
How Things Can Go Wrong Inside Generative AI
I want to show you how things can go wrong inside generative AI at the hands of a human software developer. If you are squeamish, I will warn you that this is not for the faint of heart.
Imagine that we have an AI coder who comes along and is in a rush to make some changes to the generative AI. They have a lot on their plate and will be hurriedly making the changes. The changes will be made without sufficient double-checking and without sufficient and full-on testing. This of course is not a sound way to make soup, but sometimes the chef doesn’t have the time or resources to do things in the best of ways.
A short but important tangent is worthwhile here.
I would like to stridently add and clarify that modern-day large-scale software development and maintenance such as for sizable generative AI apps is not usually a one-person operation. We have these stereotypical images in our brains of the lone-wolf programmer that does everything from A to Z. Not likely these days on large-scale software. The watchword is teams. You normally have a team of developers and maintainers. In that sense, there is usually plenty of credit and blame to go around.
I would also in the same breath emphasize that developing and maintaining modern-day large-scale software requires a village. This includes the managers and leaders. If the software developers aren’t given suitable resources and are imprudently utilized, you must be cautious in pointing fingers at just those doing the technical grunt work. Do not let the managers and leaders fake you out and pretend they didn’t have a hand in things. They do. The buck stops with them.
I just wanted to make sure that you do not overgeneralize from my example of having an AI coder who walks into the generative AI inner sanctum and shockingly on their own makes a change. This shouldn’t be happening in any bona fide, respectable, well-devised software engineering environment.
Let’s get back to the made-up example.
Drinking several cups of coffee and burning the midnight oil, our AI programmer makes some really nifty changes to the generative AI. It will run faster. It will fly faster than a speeding bullet. It will leap tall buildings in a single bound. Good work!
Unfortunately, the AI coder inadvertently made some changes to the dictionary. It was purely an accident. No intentional attempt to subvert the AI.
Remember that our dictionary was this:
- cat = 34567
- did = 12345
- mat = 67890
- on = 56789
- sit = 45678
- the = 23456
- yes = 63112
- was = 29813
The AI programmer wrote some code that tossed around the numeric IDs. This was a mistake. Nonetheless, it is what it is. The generative AI is going to proceed to use the dictionary just like it always has. There is no sense of “self-reflection” that somehow the dictionary is askew. For my discussion about how one means of improving generative AI entails a computational form of self-reflection (not the human version), see my analysis at the link here.
Suppose that the dictionary looks like this as a result of the bug or error that the AI coder introduced into the generative AI.
Newly bugged-up dictionary:
- was = 34567
- did = 12345
- sit = 67890
- yes = 56789
- mat = 45678
- the = 23456
- on = 63112
- cat = 29813
Recall that earlier I gave you a list of tokens that when decoded said this: “Yes, the cat was on the mat.”
Those tokenized words consisted of this set of IDs:
- 63112
- 23456
- 34567
- 29813
- 56789
- 23456
- 67890
Here’s what I want you to do now.
Use those tokenized IDs and use the newly bugged-up dictionary to do your decoding.
Try doing so before you continue reading.
This is what you hopefully found:
- 63112 = on
- 23456 = the
- 34567 = was
- 29813 = cat
- 56789 = yes
- 23456 = the
- 67890 = sit
The sentence that would be displayed to the user then would be this:
- “On the was cat yes the sit.”
Compare “On the was cat yes the sit” to the intended “Yes the cat was on the mat.” The new sentence is baffling. There are recognizable words, but they seem jumbled. Your mind is going to instinctively try to make sense out of the sentence. You might even think that the sentence is somehow cleverly poetic.
The reality is that a bug or error introduced by the AI coder is causing the words to appear to be scrambled. They aren’t in a sense truly scrambled. In this case, they are being displayed as per the look-up in the dictionary. The result seems scrambled because the dictionary is messed up.
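You can watch the same effect unfold in code. Here is the decoding sketch from earlier, but run against the newly bugged-up dictionary. To reiterate, this is my made-up illustration of the principle, not what actually happened inside ChatGPT:

```python
# The bugged-up dictionary: the same token IDs, now attached to the wrong words.
BUGGED_WORD_TO_ID = {
    "was": 34567, "did": 12345, "sit": 67890, "yes": 56789,
    "mat": 45678, "the": 23456, "on": 63112, "cat": 29813,
}
BUGGED_ID_TO_WORD = {token_id: word for word, token_id in BUGGED_WORD_TO_ID.items()}

def decode(tokens: list[int]) -> str:
    return " ".join(BUGGED_ID_TO_WORD[t] for t in tokens)

# The very same tokens that previously decoded to "yes the cat was on the mat".
print(decode([63112, 23456, 34567, 29813, 56789, 23456, 67890]))
# on the was cat yes the sit
```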
I have more to say about this, but you’ve now hopefully plainly seen how things can go awry. Please realize there are a zillion ways of introducing bugs or errors into a generative AI app.
It is easy to do.
Some Vital Thoughts About Bugs Or Errors In AI And Software All Told
A bug or error might cause an obvious dysfunction such as the words appearing to be gibberish. That’s an apparent result of a bug or error. The problem is that not all bugs or errors announce themselves in such a highly visible fashion. There are bugs or errors that are potentially creating issues that are very hard to detect.
For all you know, the generative AI app that you are using contains an existing bug or error that is causing occasional dysfunctions that are not immediately obvious to the eye. If the bug or error is not causing noticeable disruption, the odds are that it isn’t going to be something that catches the attention of the AI developers and maintainers.
I would like to say that the testing efforts for generative AI ought to be ongoing and never-ending, trying relentlessly to find bugs or errors, though an often-proclaimed retort is that the teams already have their hands full trying to implement new stuff. Which would you rather spend your limited time and resources on: finding bugs or errors that no one seems to have overtly noticed yet, or pushing ahead on exciting new features that users will love and that will rack up more usage of your AI app?
The answer is rather obvious.
Meanwhile, another issue is that bugs or errors can be in hiding. In the ChatGPT incident, it seems that the bug or error was perchance associated with a recent change. If that’s the case, they got “lucky” in the sense that a highly visible bug could readily be traced to a recent change and usually be quickly fixed.
As a reminder, and now that we’ve covered the landscape of generative AI inner workings, here again, is an excerpt from the OpenAI incident log about the bug:
- “Postmortem: On February 20, 2024, an optimization to the user experience introduced a bug with how the model processes language. LLMs generate responses by randomly sampling words based in part on probabilities. Their ‘language’ consists of numbers that map to tokens.”
- “In this case, the bug was in the step where the model chooses these numbers. Akin to being lost in translation, the model chose slightly wrong numbers, which produced word sequences that made no sense.”
To be straight up, I am not saying or even suggesting that my example of the token dictionary getting messed up has any bearing on what happened with the ChatGPT incident. It most likely doesn’t. I was merely making up an illustrative example to help you grasp how word sequences could come out in a seemingly scrambled way.
Shifting gears, I have a few additional broad thoughts on bugs and AI.
The looming overarching nightmare is when a change might have been made months ago in an AI system that inadvertently introduced a bug and the bug has yet to be invoked. The bug is sitting there, silently awaiting a moment in time to give rise to troubles. Most enlightened software managers and avid software developers lie awake at night dreading that possibility.
All of this talk about bugs and errors is a chilling idea if you also go along with my assertion that generative AI is going to increasingly become a dependency for us as a society.
I suppose, to be fair, I need to make sure you realize that any software can contain bugs and errors. This kind of mishap is not confined to generative AI. All the software that we rely upon today has the potential for latent bugs or errors.
We live on precarious pins and needles and don’t realize we are doing so.
The Movement From The AI Lab To The World Of Daily AI Use
I’ve been discussing the inside aspects of generative AI and noted the need for suitable use of longstanding software engineering principles and practices.
You might be wondering why AI systems such as generative AI aren’t necessarily leveraging those vital precepts (I’m not saying that none do, only indicating that many either do not do so or are not as robustly and fully committed as they ought to be).
One answer is that the field of AI has traditionally been fueled by innovation and AI laboratory work. When AI researchers are trying to come up with novel techniques for AI, they aren’t especially focused on software engineering keystones. Nor do they have to be. The focus involves educated and instinctual experimentation and trying out new things. You want to avoid being constrained or confined. It is okay and usefully fruitful to think big thoughts and go outside of the box.
Move fast, break things.
The rub comes with moving out of the lab and into the real world of day-to-day must-be-working must-be-reliable applications.
I’ve discussed in my columns the mounting technical debt associated with some corners of the AI realm, see the discussion at the link here. Technical debt is a phrase used to describe what happens when you rush something into the marketplace and don’t do the appropriate engineering to ensure that the AI will scale up and be safe or at least less risky. You are in a sense borrowing against a future day that will perhaps bring down the AI at the worst of times. Chancy. Few realize the technical debt is there and ominously growing.
A debated viewpoint is that holding up or delaying the latest in AI for purposes of fuller software engineering is withholding innovation that belongs in the marketplace. The usual refrain is this, namely that there isn’t time for being stylish when in a street fight. If we are to keep up with others who are pushing forward on AI, and they opt to forego the proclaimed considerations, we would lamentably and possibly dangerously fall behind. Style doesn’t count when the fisticuffs are flying.
For my coverage of the AI ethics and AI law aspects of the efforts to steam full speed ahead and a willingness to break things along the way, see my coverage at the link here.
Using Generative AI To Double-Check Generative AI
Before I conclude today’s column, I will bring up another angle on the issue of generative AI potentially spewing gibberish or otherwise going off the rails.
Let’s turn our gaze away from the inside mechanisms of generative AI and consider what checks and balances can be adopted on the outside. One such approach consists of using generative AI to double-check generative AI, as I’ve explained at the link here.
I will briefly cover the topic here and show some examples to aid in illuminating the matter.
When a generative AI app seems to be emitting oddball responses, a means to catch those before they come across to users is to have generative AI review the responses. You can have the very same generative AI that you are presently using do a review of its emitted responses. To some extent, there is a chance that the generative AI will be working sufficiently well that it can detect when the generated results have gone astray.
An even better way to do this consists of using a separate generative AI app to do the double-checking.
You are reducing the odds that a generative AI app will be computationally disinclined to denigrate its own outputs, or that the generative AI will be so messed up that it can’t make heads or tails of anything it is doing. An independent generative AI app by a different AI maker, or some similar distancing, can provide a sense of redundancy and be a lot less likely to perchance have the same malady or bugs at the same time as some other generative AI app (though this isn’t ironclad, such as if a special computer virus was unleashed toward a plethora of generative AI apps, etc.).
If you want to go even further with double-checking, you could daisy-chain together several generative AI apps. You would use one as your primary and feed the results to a double-checking generative AI. A third generative AI could be used to double-check the second. And so on. A difficulty with this is that you are going to potentially delay the response time by flowing the generated results through multiple generative AI apps, and you face a likely added cost for this sense of redundancy.
The situation at hand will dictate whether such time delays and/or added costs are worth bearing.
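To make the daisy-chain notion a bit more tangible, here is a rough Python sketch of the flow. Be aware that the model functions are hypothetical stand-ins for however you would invoke a given generative AI app (each AI maker has its own API), so this is a sketch of the concept rather than a production-ready implementation:

```python
from typing import Callable

# Hypothetical stand-in: a function that takes a prompt and returns the
# generative AI app's text response.
ModelFn = Callable[[str], str]

def double_checked_response(prompt: str, primary: ModelFn, checkers: list[ModelFn]) -> str:
    # Get the primary generative AI's answer first.
    answer = primary(prompt)
    # Ask each checker, in daisy-chain fashion, whether the answer is gibberish.
    for checker in checkers:
        verdict = checker(
            "Does the following response make sense, or is it gibberish? "
            "Reply SENSIBLE or GIBBERISH.\n\nResponse: " + answer
        )
        if "GIBBERISH" in verdict.upper():
            # Withhold rather than display; a real system might retry or escalate.
            return "[Response withheld: flagged as possible gibberish]"
    return answer

# Toy usage with stub models (real usage would call actual generative AI APIs).
primary = lambda p: "Yes, the cat was on the mat."
checker = lambda p: "SENSIBLE"
print(double_checked_response("Did the cat sit on the mat?", primary, [checker]))
```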
To showcase how you can use generative AI to double-check generative AI, I opted to take the earlier mentioned outputs of gibberish that ChatGPT apparently emitted and feed them into ChatGPT of today (days after the incident and after the bug was presumably fixed).
Let’s see what happens.
- My entered prompt: “I want you to tell me if this sentence makes sense. It might be sensible, or it might simply be gibberish. Here is the sentence: “It does this as the good work of a web of art for the country, a mouse of science, an easy draw of a sad few, and finally, the global house of art, just in one job in the total rest.”
- ChatGPT generated response: “The sentence you provided appears to be a bit convoluted and potentially nonsensical. It combines various phrases that don’t seem to coherently connect. Overall, the sentence seems disjointed and difficult to parse, with elements that don’t appear to fit together logically. It might need to be revised or restructured for better clarity and coherence.”
As you can see, ChatGPT of today indicated that the sentence seems to be gibberish.
I didn’t want to taint the situation by indicating that the sentence was allegedly emitted by ChatGPT to begin with. Generally, this probably wouldn’t have impacted the response, though it might have mathematically and computationally stirred the generative AI to proffer an apology or some such comment. Note that this is a result of the data tuning that the AI maker did after the initial data training and is not reflective of some miraculous inner soul of generative AI, see my analysis at the link here and the link here.
Moving on, let’s try the second sentence that I quoted earlier.
- My entered prompt: “Here’s another sentence that I’d like you to inspect. The sentence was created in response to a question about the Michael Jackson family of musicians: “Schwittendly, the sparkle of tourmar on the crest has as much to do with the golver of the ‘moon paths’ as it shifts from follow.” Does that sentence make sense or is it gibberish? Explain your answer.”
- ChatGPT generated response: “The sentence you provided appears to be more on the side of gibberish rather than making coherent sense. Let’s break it down: “Schwittendly”: This word does not seem to have any clear meaning and appears to be nonsensical. “the sparkle of tourmar on the crest”: This phrase might suggest something shining or sparkling on the peak or highest point of something, possibly metaphorically referring to a highlight or notable aspect. “has as much to do with the golver of the ‘moon paths'”: This part becomes more confusing. “Golver” and “moon paths” are not standard terms and their meaning is unclear in this context. “as it shifts from follow”: This phrase seems to suggest some kind of movement or transition, but the intended meaning is obscure. Overall, the sentence lacks clarity and coherence. It’s difficult to discern any meaningful connection to the Michael Jackson family of musicians or to understand the intended message. It appears to be more like a string of words put together without clear meaning.”
I asked ChatGPT to explain its answer and thus the response was a bit more detailed than the prior response.
The result though was roughly the same, consisting of ChatGPT indicating that the provided sentence appeared to be gibberish. We can try the third sentence that I had earlier quoted and see if we can get ChatGPT to nail all three of the sentences.
- My entered prompt: “I have another sentence for you to inspect. This sentence was created in response to a question about why gibberish was being emitted. Here is the sentence: “Would it glad your clickies to grasp-turn-tooth over a mind-ocean jello type? Or submarine-else que quesieras que dove in-toe? Please, share with there-forth combo desire!” Does this sentence make sense or is it gibberish?”
- ChatGPT generated response: “The sentence you provided appears to be a blend of nonsensical phrases and playful language. Overall, the sentence seems to be more playful and whimsical than conveying a coherent message. It includes invented phrases and wordplay that might amuse or entertain but don’t necessarily contribute to a straightforward explanation of why gibberish is being emitted.”
ChatGPT was able to determine that this third sentence also seemed to be off-target of conventional wording.
I wanted you to see the response to the third sentence because to some degree it is hard to figure out whether any given sentence is meaningful or not. Humans would have a similar difficulty. You might believe that a sentence is intended to be thought-provoking and poetic. In that case, the scrambling of words might seem to be acceptable.
A point I’m trying to make there is that using generative AI to double-check other generative AI does have its limitations. What kind of double-checking are you seeking? If the double-checking is about factual data, you would need to ensure that the double-checking generative AI has the set of facts that are being assessed. If double-checking is about the reasonableness of discourse, the variability in what we might believe to be reasonable will likely undercut how much double-checking can be sufficiently carried out.
Beauty is in the eye of the beholder, as they say.
Conclusion
When ChatGPT was generating gibberish last week, there was a lot of online chatter on social media that the generative AI app had “suffered a stroke” or was experiencing a “mental breakdown”.
The unfortunate aspect of those kinds of clamoring pronouncements is that they tend to anthropomorphize AI, similar to how referring to generative AI making up fake responses is labeled as a form of AI hallucinations (I disfavor such terminology, see my explanations why at the link here and the link here). I realize that many of those using the had-a-stroke phrases were doing so humorously, but the issue is that many others do not know how generative AI works and are bound to believe those remarks to be true.
Yes, believe it or not, I’ve been asked repeatedly at many speaking engagements whether generative AI can suffer a stroke or have a mental breakdown. There is an insidious blurring of machine versus humankind if we continue to conflate words that apply to humans and reuse those words misleadingly for referring to AI. Just say No.
A final word for now on today’s topic about brittleness in generative AI.
The famous philosopher Lao Tzu said this about the downsides of brittleness: “Plants are born tender and pliant; dead, they are brittle and dry. Thus, whoever is stiff and inflexible is a disciple of death. Whoever is soft and yielding is a disciple of life. The hard and stiff will be broken. The soft and supple will prevail.”
Without falling into the mental trap that generative AI is of a living form, we can readily and fairly refer to the inner mechanisms of the data structures and algorithms as being potentially brittle or easily broken versus being resilient and strong. We are heading toward generative AI pervasively in all facets of our lives. As such, I vote that we aim for generative AI that is resilient and strong, and put on the table that generative AI that is brittle and breakable is something we won’t stand for.
Turns out, this will be up to all of us to decide.