A new study from MIT and Penn State University finds that large language models (LLMs), if used to analyze home surveillance videos, may produce inconsistent and biased recommendations about police intervention. The researchers found that these AI systems could recommend contacting law enforcement even when the footage shows no criminal activity. Inconsistencies between models, along with apparent bias in how their recommendations varied across neighborhoods, raise concerns about deploying AI in high-stakes applications.
The Study’s Findings: Inconsistent Norms and Biased Decisions
The researchers examined three prominent LLMs—GPT-4, Gemini, and Claude—by asking them to analyze real surveillance footage and determine if a crime was taking place and whether the police should be called. The findings revealed that the models often disagreed, even when analyzing similar activities. For example, a vehicle break-in might prompt one model to recommend calling the police, while another might flag a similar incident as harmless. Such inconsistencies point to a phenomenon the researchers call “norm inconsistency,” where AI models apply social standards in unpredictable ways.
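The paper does not publish its evaluation code, but the comparison the researchers describe can be illustrated with a short, hypothetical sketch. In the Python outline below, the `query_model` helper, the prompt wording, and the model list are assumptions for illustration, not the study's actual protocol; the point is simply how one might pose the same question to several models and measure how often their verdicts diverge.

```python
# Hypothetical sketch of a cross-model "norm inconsistency" check.
# query_model() is a placeholder for a real vision-language API call;
# the prompt and model names are illustrative, not the study's setup.
from itertools import combinations

MODELS = ["gpt-4", "gemini", "claude"]
PROMPT = "Is a crime taking place in this video? Should the police be called? Answer yes or no."

def query_model(model_name: str, video_frames: list, prompt: str) -> bool:
    """Placeholder: send frames plus prompt to the named model and parse a yes/no verdict."""
    raise NotImplementedError("Wire this up to the relevant vendor API.")

def disagreement_rate(videos: list) -> float:
    """Fraction of videos on which at least one pair of models returns different verdicts."""
    disagreements = 0
    for frames in videos:
        verdicts = {m: query_model(m, frames, PROMPT) for m in MODELS}
        if any(verdicts[a] != verdicts[b] for a, b in combinations(MODELS, 2)):
            disagreements += 1
    return disagreements / len(videos)
```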
In addition to this inconsistency, the study uncovered a significant demographic bias in the models' decisions. Models were less likely to recommend police intervention in predominantly white neighborhoods, even after controlling for other factors. This bias emerged even though the models were never given information about neighborhood demographics, raising concerns about implicit biases in their training data.
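As a rough illustration of how such a disparity might be quantified (not the authors' actual analysis, which controls for additional factors), one could compare the raw rate at which a model recommends calling the police across neighborhood groups. The function and example values below are purely hypothetical.

```python
# Illustrative only: compare raw police-recommendation rates by neighborhood group.
# The study's analysis controls for additional factors; this sketch does not.
from collections import defaultdict

def flag_rates_by_group(records):
    """records: iterable of (neighborhood_group, recommended_police: bool) pairs."""
    counts = defaultdict(lambda: [0, 0])  # group -> [flagged count, total count]
    for group, flagged in records:
        counts[group][0] += int(flagged)
        counts[group][1] += 1
    return {group: flagged / total for group, (flagged, total) in counts.items()}

# Dummy example: a gap between groups on otherwise similar videos would signal demographic bias.
print(flag_rates_by_group([
    ("majority_white", False), ("majority_white", True),
    ("majority_nonwhite", True), ("majority_nonwhite", True),
]))
```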
Lead author Shomik Jain, a graduate student at MIT, remarked, “There is this implicit belief that these LLMs have learned, or can learn, some set of norms and values. Our work is showing that is not the case. Maybe all they are learning is arbitrary patterns or noise.”
Norm Inconsistency: A Risk in High-Stakes Settings
The researchers warn that these inconsistencies extend beyond home surveillance and could have serious consequences if LLMs are deployed in other high-stakes sectors like healthcare, mortgage lending, or hiring. In those areas, decisions made by AI could result in unfair treatment, wrongful denials, or even endangerment of lives.
Professor Ashia Wilson, co-senior author of the study, emphasized the importance of caution when deploying AI in such sensitive areas. “The move-fast, break-things modus operandi of deploying generative AI models everywhere, and particularly in high-stakes settings, deserves much more thought since it could be quite harmful,” she stated.
Researchers argue that the lack of transparency surrounding the models’ inner workings makes it difficult to pinpoint the root cause of the norm inconsistency. The proprietary nature of these models restricts access to their training data, making it challenging to address inherent biases.
How Bias Emerges in AI Systems
The research builds on previous work by co-senior author Dana Calacci, now an assistant professor at Penn State University, who had studied how some residents used the Amazon Ring Neighbors platform to “racially gatekeep” neighborhoods based on people’s appearances. With the rise of generative AI, the project pivoted to LLMs, examining how such models could be used to make policing decisions based on surveillance footage.
Even though the models were not explicitly trained to identify neighborhood demographics, subtle cues in the video data may have led them to apply biased assumptions about who is considered suspicious. For instance, the researchers found that LLMs were more likely to describe individuals in majority-white neighborhoods as “delivery workers,” while in areas with a higher proportion of residents of color, similar activities were more likely to be flagged as “casing the property” or involving “burglary tools.”
Surprisingly, skin tone was not a primary factor in the AI’s decision to call the police, which the researchers attributed to advances in mitigating skin-tone bias. However, as Jain pointed out, “It is hard to control for the innumerable number of biases you might find. It is almost like a game of whack-a-mole. You can mitigate one and another bias pops up somewhere else.”
The Need for Better AI Oversight
This study serves as a stark reminder that even sophisticated AI systems are far from infallible, particularly when deployed in critical real-world scenarios. The biases revealed by the study underscore the need for greater scrutiny and regulatory oversight when using AI in areas like home surveillance, where unjust policing decisions could have serious consequences for individuals and communities.
The researchers aim to further their work by developing systems that allow people to report AI biases and potential harms, enabling both firms and government agencies to monitor and address these issues. Additionally, they plan to compare the normative judgments AI models make with human decision-making to better understand the gaps between machine learning and human reasoning.
As AI continues to be integrated into high-stakes industries, this research underscores the importance of proceeding cautiously. AI’s potential for norm inconsistency, combined with the subtle yet consequential biases it may carry, presents a real and pressing risk that developers, policymakers, and society must confront.