He that writes to himself writes to an eternal public. --Emerson

Friday, November 3, 2023

The Deepest Fake

Faking it.

The generative artificial intelligence commonly in use today relies on machine learning models that express preexisting patterns in data of a particular type. Large language models of the sort that power ChatGPT, for example, express patterns found in language. Some of these patterns are well understood, some are tacit but obvious, and some surprise us altogether. That ChatGPT produces grammatical outputs is a requirement. That it apes formulaic documents such as Christmas letters is no surprise. But that it can speak with the voice of a trusted friend is unexpected, and maybe unwelcome.

Generative AI capable of creating convincing outputs has been demonstrated across a variety of modalities: text, image, video. It seems a safe bet that anything that can be encoded as data can be synthesized in this fashion, and we have been encoding things since the invention of the punch card. There is a lot of data out there, and much of the activity in AI circles today is focused on "acquiring" specific datasets for this purpose (hence my current job title). Data acquisition has its challenges, but it's another safe bet that anything we really want encoded as data for model training can be so encoded, so acquired, and so modeled.

Current thinking holds that the more data you train on, the more powerful the model you can build from it. There is a lot of fine print here, but nothing yet that invalidates this rule of thumb. And since data is often a direct function of time--that is, to get more data you can take more time to collect it--we can postulate that the larger the timespan considered, the more powerful the resulting model. And if you want a larger timespan you have two choices: you can wait for time to go by, or you can mine the past.
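(A hedged aside for the technically inclined: the "scaling laws" literature puts numbers on this rule of thumb. Hoffmann et al.'s 2022 "Chinchilla" paper, for example, fits a model's loss L as a function of parameter count N and training-token count D roughly as

L(N, D) = E + A / N^α + B / D^β

where E is an irreducible error floor, A and B are fitted constants, and α and β are small positive exponents--both around 0.3 in their fits. Treat the exact form as a sketch rather than gospel; the point is simply that, other things being equal, loss falls predictably as D grows.)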

Harvesting data from the past is a core occupation of economists, climate modelers, and, of course, historians. I anticipate that it will increasingly be the focus of AI developers as well. And, whatever their intentions, as they train more models on more historical data our communal capacity to generate outputs that look like artifacts of the past will increase.

You have, I trust, read 1984, in which the main character is employed by the Ministry of Truth to refabulate historical documentation. Orwell depicts this process as highly manual, a craft really, requiring careful manipulation of physical artifacts and the filing systems of giant bureaucracies. He also imagines it as carried out by a central authority working to a monolithic plan. In a digital world and with the new generative tools at hand, history will be faked by uncountable independent operatives working to their own idiosyncratic plans.

While there is widespread concern about the use of deepfakes to create competing narratives of current events (here's one recent example), the possibility of likewise manufacturing evidence to support competing narratives of historical events is less well recognized. There is, I think, a naive assumption that over time the truth will out. My concern is that the opposite will happen, with the past becoming increasingly uncertain and contentious as historical deepfakes proliferate.

As a (mostly non-practicing) professional historian I applaud the continued reevaluation of the past. History is constantly being rewritten, which is as it should be: new historians, with new perspectives and new tools, revisit old material and this changes history. In fact, new historical source material is discovered with some regularity, and AI is an increasingly important part of that process. Again, as it should be.

But there is a great difference between historians rewriting history and conspiracy theorists doing so. Both have an agenda--historians should never be treated as objective observers--but a properly trained historian is aware of this and follows rules of evidence and documentation that, say, Holocaust deniers merrily skip past. Professional historians also tend to write books and to publish in peer-reviewed journals, both of which, even today, usually appear in printed form, which is to say as durable, difficult-to-corrupt (though not impossible: see photo, above) records.

But a lie, famously, runs around the globe seven times while you're using interlibrary loan to find the historical truth. And in the new age of generative history those lies are going to get better and better, more and more numerous, and faster, too. And all this while fewer and fewer of us even try to check the paper-encoded version. No answers here, just another plea to support your local library (and, post-publication, a Times piece that offers some suggestions).