One of my relatives heard some strange things while working on a healthcare helpline during the Covid pandemic. Her job was to talk callers through the rapid lateral flow tests that were used millions of times during lockdown. Some callers, however, were clearly confused by the procedure. “So I’ve drunk the liquid in the tube. What do I do now?” one asked.
That user confusion may be an extreme example, but it points to a common technological problem: the way that ordinary people use products and services in the real world can diverge significantly from the intentions of designers in the lab.
Sometimes misuse is deliberate, for better or worse. The campaign organisation Reporters Without Borders, for example, has sought to defend free speech in authoritarian countries by hiding banned content inside Minecraft video game servers. Criminals, meanwhile, use household 3D printers to manufacture untraceable guns. But more often, as with Covid testing, misuse is unintentional. Call it an inadvertent misuse problem, or imp for short. Chatbot imps may be the new gremlins in the machine.
Some 17% of Americans use generalist chatbots such as ChatGPT to self-diagnose health concerns at least once a month. These chatbots have astonishing technical capabilities that would have seemed magical a few years ago. Various tests have shown that the best models now match human doctors in terms of clinical knowledge, triage, text summarisation and responses to patient questions. Two years ago, for example, a British mother used ChatGPT to identify her son’s tethered cord syndrome (related to spina bifida), which had been overlooked by 17 doctors.
That raises the prospect that such chatbots could one day become a new “front door” to healthcare delivery, improving access at low cost. This week, Wes Streeting, the UK health secretary, pledged to upgrade the NHS app using artificial intelligence to provide “a doctor in your pocket to guide you through your care”. However, the ways in which these chatbots can be most useful are not the same as the ways in which they are most commonly used. A recent study led by the Oxford Internet Institute highlights some nasty flaws, as users struggle to use them effectively.
The researchers enrolled 1,298 participants in a randomised controlled trial and tested how well they could respond, using chatbots, to 10 medical scenarios, including acute headache, broken bones and pneumonia. Participants were asked to identify the health conditions and find recommended courses of action. Three chatbots were used: GPT-4o from OpenAI, Llama 3 from Meta and Command R+ from Cohere, all of which have slightly different characteristics.
When the test scenarios were entered directly into the AI models, the chatbots correctly identified the conditions in 94.9% of cases. The participants, however, did far worse: they provided incomplete information and the chatbots often misinterpreted their prompts, so the success rate dropped to just 34.5%. The technological capabilities of the models were unchanged, but different human inputs led to very different outputs. Worse still, the test participants were also outperformed by a control group that had no access to chatbots but consulted regular search engines instead.
The findings of one such study should not mean we stop using chatbots for health advice. But they do suggest that designers should pay far more attention to how ordinary people use their services. “Engineers tend to think that people use technology incorrectly, so failures are the users’ fault. But thinking about users’ technical skills is fundamental to design,” one AI company founder tells me. That is especially true for users seeking medical advice, many of whom may be desperate, sick or elderly people showing signs of cognitive decline.
More specialised healthcare chatbots may help. But a recent Stanford University study found that widely used therapy chatbots designed to help address mental health challenges can also “introduce biases and obstacles that could have dangerous consequences”. The researchers suggest that more guardrails should be built in: improving users’ prompts, proactively requesting information to guide interactions and communicating more clearly.
Tech companies and healthcare providers will need to do far more user testing in real-world conditions to ensure their models are used properly. Developing powerful technologies is one thing. Learning how to deploy them effectively is quite another. Beware the imp.
john.thornhill@ft.com