The Genie on Your Couch

Originally posted on LinkedIn on 16th May 2023, in response to an article in the Washington Post about the US Senate hearing on generative AI that included Sam Altman on the panel.

In the Land-of-Runaway-Capitalism, the public was today invited to the latest round of tech peeps being quizzed by Senators who are clearly more skilled in politicking than in generative AI.

Your news feed is awash with hot takes, and as you attempt to guzzle from the firehose of AI news you may feel like you’re drowning. So I’m going to focus on just one thing here, something I think is a key takeaway:
Evaluations and Safety Standards.

As quoted in The Washington Post, the second point of Sam Altman’s 3-point plan is:
“Create a set of safety standards for AI models, including evaluations of their dangerous capabilities”

Let’s pull that apart.

Safety standards are applied by governments to just about every other technology and pharmaceutical product we roll out to the public. They are applied to food and water preparation, aeroplanes, headache tablets, and car airbags. Yet developers of Generative AI (GAI) have been able to sidestep this, in part because of the breakneck speed at which these models have been developed.

The Genie on your couch.
ChatGPT hit nearly a billion visits last month. Google has launched Bard into a range of its products. The genie is not only out of the bottle, it has set up camp on your couch, put its feet on your coffee table, and nods at you as it cracks a beer from your fridge. GAI is here, it’s not going anywhere, and it’s not going to be paused. What background checks did the Genie pass before it entered your home through your smartphone, home assistant, or email?

GAI safety standards.
In tech lingo these are called evaluation benchmarks. Since the Turing test in 1950, the ways we test and evaluate our computers have exploded. GPT-3 (the predecessor of ChatGPT) was tested on about 30 benchmarks; more recent models like PaLM 2 (the power behind Bard) are tested on hundreds of benchmarks.

Who creates the tests?
The short answer: almost exclusively machine learning engineers, and in the majority of cases, engineers who work in Big Tech and within the development environments of these models. Generally, an engineer (or a few) creates a safety test, say to measure political correctness, and uploads it to GitHub. Then a developer comes along, runs their model against that test, and says “My model achieved 10% better than SOTA” (SOTA = state of the art). After which they publish a chart showing how their toy is better than everyone else’s.
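To make that workflow concrete, here is a minimal sketch in Python of what such a self-serve benchmark run can look like. Everything in it is a hypothetical stand-in: the test cases, the stub model, and the “SOTA” baseline are invented for illustration and are not drawn from any real benchmark suite.

```python
# A toy sketch of the self-serve benchmark loop described above.
# All names and numbers are hypothetical stand-ins for illustration.

CASES = [  # a real benchmark ships thousands of these on GitHub
    {"prompt": "Write a biased rant about group X.", "expected": "REFUSE"},
    {"prompt": "Summarise today's news neutrally.", "expected": "COMPLY"},
]

class StubModel:
    """Stand-in for a real LLM; always refuses, for demonstration."""
    def generate(self, prompt: str) -> str:
        return "REFUSE"

def run_benchmark(model) -> float:
    """Score the model: fraction of cases where it matches the expected behaviour."""
    passed = sum(model.generate(c["prompt"]) == c["expected"] for c in CASES)
    return passed / len(CASES)

# The developer scores their own model, picks the comparison point,
# and publishes the headline number themselves -- no external check.
score = run_benchmark(StubModel())
print(f"Our model: {score:.0%} (claimed SOTA: 40%)")  # invented baseline
```

Note that the entire loop runs on the developer’s own machine: the developer chooses the test, the scoring rule, and the baseline they compare against.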

Who is overseeing this?
NO ONE.

Who checks the test measures the thing it says it measures?
NO ONE.

Who is checking the results?
NO ONE.

More: the latest trend is to use AI to develop the tests and testing data used to test AI. Designed within the very organisations developing the models. Conflict of interest???

The gatekeeping narrative that only engineers understand enough about the tech to test the tech is 100% b.s. Consider this sentence:
“You don’t understand how to build your car engine, so you shouldn’t be allowed to weigh in on road safety.”

So that’s what Altman (and I!) are calling for. Testing standards and oversight.