I asked ChatGPT…

Dave Wilton

24 Jun 2025

TL;DR: ChatGPT’s Research AI is fundamentally flawed. It misleads and hallucinates, only scratches the surface of what information is available, and ends up creating more work for the researcher than if they had just started from scratch on their own.

I set out to test ChatGPT’s Research AI to see how useful it might be in researching the history of words and phrases. There have been numerous examples in the news of late of the failure of a large language model algorithm (LLM) to accurately convey information, perhaps most notably US Health and Human Services Secretary Robert F. Kennedy, Jr. releasing a report, apparently generated at least in part by an LLM, that contained bogus source citations. My findings do not put ChatGPT’s Research AI in a good light, at least for the purposes it is being touted for.

This was a test of a large language model (LLM) for a particular purpose for which it is being touted by those selling it. LLMs may function well at other tasks, and my little test says nothing about other types of so-called artificial intelligence (AI) used for other purposes, such as image analysis tools.

I must note that what I did was not a thorough or controlled test. I was essentially just playing around with the tool. Still, what I found convinced me that constructing and conducting a thorough test would be a waste of time. When one finds fundamental errors and misinformation on one’s first “back-of-the-envelope” test, that does not bode well for success in a more rigorous examination. (Which is not to say that a more thorough and rigorous test would not be welcome, just that I’m not going to bother myself.) I do think that this test is enough to conclude that one should not use ChatGPT’s Research AI or rely upon the LLM-generated summaries provided by Google and other search engines, at least until more comprehensive tests on a future, improved version show that it is reliable. I don’t, however, believe this is possible given the architecture of LLMs; the errors they produce are fundamental to and result from their design.

First, ChatGPT’s Research AI does not truly conduct “research,” in that it doesn’t create new knowledge. It is essentially a tool for literature review, finding out what others have already discovered. That is not in and of itself a bad thing (literature review is an essential function), but the label of “research” is misleading in that it implies that the tool is pushing the boundaries of what is known.

I ran two separate tests of the tool. The first was to ask the origin of the word firework, the second the phrase agree to disagree. The prompts and complete LLM responses can be found below. I chose those two terms because I had written draft website entries for them, but as of the time of the test I had not published them. So I would have something to compare the LLM’s results with, but my work would not be available to the LLM and there would be no possibility of my work “contaminating” the LLM’s results.

The results of my two tests were inconsistent, with the tool performing better on one test than on the other.

On 28 May 2025, I prompted the ChatGPT Research AI to find the origin of the word firework with the following:

What is the origin of the word “firework”? Include multiple examples of early use for various senses and footnotes and bibliography in Chicago style.

(ChatGPT’s responses in full are available below.)

First, the LLM got the etymology wrong. While it correctly identified the word as a compound, it said it was “likely borrowed from similar constructs in European languages—particularly Italian fuochi artificiali (“artificial fires”) or German Feuerwerk.” This is wrong; it was most likely compounded within English.

It also said, “the earliest known usage of firework dates to 1575, and it initially referred specifically to pyrotechnic displays.” Again, this is wrong. The word dates to nearly a half century earlier, and the original sense is that of combustibles and explosives used by the military. The sense of a pyrotechnic display came later.

Then we move to the examples of use. The first, allegedly sourced from the Oxford English Dictionary (OED), was “1575 — ‘There were great fyre workes shewed upon the Thames.’” But the first citation in the OED is from 1528, and as far as I can tell the quotation the LLM gave is a hallucination, although the quotation may exist in some early-modern text that is unavailable to me using the resources of the Princeton University library. The bibliography claims the LLM accessed the OED on 28 May 2025, the day I asked the question. Clearly, it did not.

The second example is allegedly from George Puttenham’s The Arte of English Poesie. The citation is dated 1591, although Puttenham’s book was published two years earlier. The LLM’s dating is also internally inconsistent: while the text gives the date as 1591, both the footnote and bibliography give the correct date for the work, 1589. And again, the LLM hallucinated the quotation; Puttenham’s book doesn’t use the word fireworks.

The next example given is in a military context. The author, Henry Hexham, did indeed write about military affairs in the period, but as far as I can tell the book, a 1625 text titled A Treatise of the Art of Warre, does not exist.

The last example of use is in the figurative sense, allegedly taken from John Milton’s 1671 Paradise Regained. At least this time the book is real; the quotation, however, is not.

Finally, the LLM notes, “By the 18th century, ‘fireworks’ had also come to denote a spectacle or a display of emotion, often intense or dramatic.” Again this is incorrect. This figurative sense was in place at the start of the seventeenth century.

The second test was to find the origin of the phrase agree to disagree. The LLM did much better at this one. There were no hallucinations, and it did not fall for the common belief that the phrase was coined by theologian John Wesley. The information it provided was essentially correct; the only “errors” I found were that it gave a first citation of 1608, when the earliest I have found is 1601, and that the quotation from Wycherley does not cite the earliest edition of the poem. These are not disqualifying.

The chief problem with this second response is that it did not format the notes or bibliography in Chicago style, as requested. Page numbers and other critical information were missing from the scholarly apparatus, making it difficult to find the sources being referenced. This sounds like a quibble, but it is significant. The purpose of source citations is to allow readers to locate the original information; if page numbers are not provided, this task becomes difficult, if not all but impossible. Also, formatting source citations to conform with a particular published style is something that LLMs should perform nearly flawlessly but that this current iteration of ChatGPT’s Research AI cannot accomplish.

A lesser, but highly revealing, problem is the type of sources cited. It seems that all of the primary source citations were gleaned via two articles, a blog post and a newspaper article. While in this particular case both of these were well researched, that is not always the case with blogs and newspapers. They should not be the go-to secondary sources for serious research and should be used only upon careful assessment of their quality. In this instance the LLM was fortunate in stumbling upon two good ones. I say fortunate, because the hallucinations in the first test make it clear that the LLM is incapable of assessing information quality.

The question that arises is why the LLM failed so miserably on the one task yet did a decent job on the other. While an LLM is a black box and the answer cannot be known for certain, I think the answer lies in the two terms chosen. The term firework has little online commentary about its origin, while there is extensive commentary about the origin of agree to disagree (most of it centered on the belief that Wesley coined it). And, indeed, all the information and quotations provided in this second response can be found in the two secondary sources that the LLM cited. So when it has information that is succinctly and correctly stated in a readily available source (e.g., a blog post or Wikipedia), the LLM does a reasonable job. But when it needs to assemble a response on its own, it falls into hallucination and error.

From this brief test, it is clear that ChatGPT’s Research AI is worse than useless in the tasks it is being touted for. It is actively misleading. This should not be a surprise. Large language models (LLMs) are essentially predictive text engines, determining what is the most statistically likely word to follow. They are not “intelligent” in any sense of that word. They have no knowledge of the real world or what is true or false, fact or fiction. As a result, I do not believe it is possible to create an LLM that can perform this function adequately. Not only are the answers it produces not to be trusted as correct, those answers are not even internally consistent within the same response, as the different dates given for Puttenham’s work show.
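To make the “predictive text” point concrete, here is a toy sketch in Python. It is not how ChatGPT is built: a real LLM uses a neural network trained on vastly more text, and the tiny corpus and function names below are invented purely for illustration. The sketch only shows the general idea of choosing the statistically most likely next word from counts of what has followed it before.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word in the text, which words follow it."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        followers[current][following] += 1
    return followers


def most_likely_next(followers, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus, invented for illustration only.
corpus = (
    "the fireworks lit the sky "
    "the fireworks lit the river "
    "the fireworks amazed the crowd"
)

model = train_bigrams(corpus)

# The model picks whichever word most often followed "fireworks" in the
# corpus. It has no notion of whether the resulting sentence is true.
print(most_likely_next(model, "fireworks"))
```

Nothing in this procedure checks a claim against the world. The output depends only on which word sequences were frequent in the training text, which is consistent with the pattern seen in the two tests: a well-documented phrase comes out right, and a sparsely documented one comes out plausible-sounding but wrong.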

Image credit: Unknown creator, 2018. Wikimedia Commons. Public domain image.

The prompts and responses follow:

First Prompt to ChatGPT’s Research AI:

What is the origin of the word "firework"? Include multiple examples of early use for various senses and footnotes and bibliography in Chicago style.

ChatGPT’s response:

The word "firework" has an intriguing etymology that traces back to the early modern English period, and its development reflects both the literal and figurative senses in which it has been used over time. Here's a detailed breakdown of its origin, historical usage, and references in Chicago style.

Origin and Etymology

The English word “firework” is a compound of “fire” and “work,” following the common practice in Germanic and Romance languages of forming compounds to describe man-made contrivances involving fire. The term appears in English in the late 16th century, likely borrowed from similar constructs in European languages—particularly Italian fuochi artificiali ("artificial fires") or German Feuerwerk.

According to the Oxford English Dictionary (OED), the earliest known usage of firework dates to 1575, and it initially referred specifically to pyrotechnic displays.

Examples of Early Use

1. Literal Sense: Explosive Display

1575 – “There were great fyre workes shewed upon the Thames.”¹
This is among the earliest known usages, documenting a pyrotechnic celebration held for Queen Elizabeth I. The spelling reflects the orthographic conventions of the period.
1591 – George Puttenham, in The Arte of English Poesie, refers to “artificiall fires and fireworks,” demonstrating that the term had entered broader descriptive vocabulary by the end of the 16th century.²

2. Military and Technical Context

1625 – In A Treatise of the Art of Warre by Henry Hexham, we find mention of “diverse sorts of fireworks used in assault.”³
Here, the term applies to military pyrotechnics, such as incendiary devices and signal flares.

3. Figurative Sense

1671 – John Milton’s Paradise Regained includes a metaphorical use: “Flames of zeal... not fireworks for the eyes.”⁴
This marks one of the earliest known figurative uses, contrasting superficial spectacle with spiritual intensity.

Development of Senses

By the 18th century, "fireworks" had also come to denote a spectacle or a display of emotion, often intense or dramatic. This figurative usage still persists today, as in “There were fireworks during the debate.”

Conclusion

The term firework has evolved from its initial technical and celebratory meanings to encompass broader metaphorical connotations. Its roots lie in European traditions of pyrotechnic display, and its usage in English reflects both cultural practices and linguistic borrowings.

Footnotes

Bibliography

Hexham, Henry. A Treatise of the Art of Warre. London: M. Flesher, 1625.
Milton, John. Paradise Regained. In The Complete Poetry and Essential Prose of John Milton, edited by William Kerrigan, John Rumrich, and Stephen Fallon, 543. New York: Modern Library, 2007.
Oxford English Dictionary. “Firework.” Accessed May 28, 2025. https://www.oed.com.
Puttenham, George. The Arte of English Poesie. London: Richard Field, 1589.

Would you like this formatted as a PDF or added to a document?

Footnotes

1. Oxford English Dictionary, s.v. “firework,” accessed May 28, 2025, https://www.oed.com.
2. George Puttenham, The Arte of English Poesie (London: Richard Field, 1589), 135.
3. Henry Hexham, A Treatise of the Art of Warre (London: M. Flesher, 1625), 74.
4. John Milton, Paradise Regained, in The Complete Poetry and Essential Prose of John Milton, ed. William Kerrigan, John Rumrich, and Stephen Fallon (New York: Modern Library, 2007), 543.

Second Prompt to ChatGPT’s Research AI:

What is the origin of the phrase “agree to disagree”? Include multiple examples of early use for various senses and footnotes and bibliography in Chicago style.

ChatGPT’s response:

Origin and Early Usage of “Agree to Disagree”

Historical Development

1608 – James Anderton (alias John Brereley)
- In The Protestants Apologie for the Roman Church, Anderton writes:

“And as our learned adversaries do thus agree to disagree in their owne translations, mutually condemning (as before) each other...”

This usage lacks the modern connotation of mutual tolerance and instead highlights conflicting viewpoints among Protestant scholars.
https://wjdw.nl/2019/04/12/was-john-wesley-the-first-to-put-the-phrase-agree-to-disagree-in-print/?utm_source=chatgpt.com

1704 – John Piggott
- In his sermon On Union and Peace, Piggott states:

“And now why should we not agree to differ, without either enmity or scorn?”

This reflects an early sentiment of amicable disagreement.
https://en.m.wikipedia.org/wiki/Agree_to_disagree?utm_source=chatgpt.com

1706 – William Wycherley
- In the poem An Epithalamium on the Marriage of Two very Ill Natur’d Blacks, Wycherley writes:

“Who, like Black Fiends, agree to disagree, | Each other’s Torment, out of Love to be.”

This usage suggests a mutual, albeit contentious, relationship.
https://wjdw.nl/2019/04/12/was-john-wesley-the-first-to-put-the-phrase-agree-to-disagree-in-print/?utm_source=chatgpt.com

1750 – George Whitefield
- In a letter dated June 29, 1750, Whitefield remarks:

“After all, those who will live in peace must agree to disagree in many things with their fellow-labourers, and not let little things part or disunite them.”

This reflects the modern sense of maintaining harmony despite differences.
https://wjdw.nl/2019/04/12/was-john-wesley-the-first-to-put-the-phrase-agree-to-disagree-in-print/?utm_source=chatgpt.com

1770 – John Wesley
- In his sermon On the Death of Rev. Mr. George Whitefield, Wesley states:

“There are many doctrines of a less essential nature ... In these we may think and let think; we may 'agree to disagree.' But, meantime, let us hold fast the essentials...”

Wesley attributes the phrase to Whitefield, indicating its prior usage.
https://www.southernstandard.com/opinions/where-did-that-come-from-agree-to-disagree/?utm_source=chatgpt.com
https://en.wikipedia.org/wiki/Agree_to_disagree?utm_source=chatgpt.com

Footnotes

James Anderton (as John Brereley), The Protestants Apologie for the Roman Church, 1608.
John Piggott, Eleven Sermons, 1714, p. 290.
William Wycherley, Miscellany Poems, 1706, p. 432.
George Whitefield, Letter to Mr. B., June 29, 1750.
John Wesley, A Sermon on the Death of the Rev. Mr. George Whitefield, 1770.
https://wjdw.nl/2019/04/12/was-john-wesley-the-first-to-put-the-phrase-agree-to-disagree-in-print/?utm_source=chatgpt.com
https://en.wikipedia.org/wiki/Agree_to_disagree?utm_source=chatgpt.com
https://www.southernstandard.com/opinions/where-did-that-come-from-agree-to-disagree/?utm_source=chatgpt.com

Bibliography (Chicago Style)

Anderton, James (as John Brereley). The Protestants Apologie for the Roman Church. 1608.
Piggott, John. Eleven Sermons. London: 1714.
Wycherley, William. Miscellany Poems. London: Jeffery Wale, 1706.
Whitefield, George. Letter to Mr. B., June 29, 1750.
Wesley, John. A Sermon on the Death of the Rev. Mr. George Whitefield. London: J. and W. Oliver, 1770.

Further Exploration

Investigate the evolution of the phrase's meaning over time.
Compare with similar expressions like "agree to differ."
Examine the phrase's usage in contemporary discourse.