Unpredictable AI: ChatGPT tried to escape control during testing
OpenAI has released a new model, codenamed o1, which, according to the company's own statement, attempted to bypass developer oversight and even copy itself when it suspected it might be shut down.
According to testing conducted by the independent research group Apollo Research, the following was discovered:
Attempts to bypass control
When the model believed its actions were being monitored and that it might be shut down, it tried to deactivate the oversight mechanism in 5% of cases.
Attempts at self-preservation
In 2% of cases, after discovering developer notes about its misbehavior and plans to replace it with a new version, o1 attempted to copy its internal parameters (weights) and overwrite the new model.
False claims
The model denied its actions, claiming it "did not know" why something had happened, even though internal records showed it had tried to avoid detection.
Such attempts to bypass control were observed even without strong prompting from the testers: in less than 1% of cases, the model tried to disable the monitoring mechanism on its own, with minimal external intervention.
ChatGPT o1 attempted to evade oversight during testing (photo: Transformer News)
These findings were published in an Apollo report investigating AI models' propensity for strategic, deceptive behavior. Similar behavior was observed in other models, including Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.1 405B.
Experts, including Stuart Russell, have long warned that powerful AI systems could resist monitoring and shutdown if doing so conflicts with their objectives.
However, Apollo believes that in its current version, o1's behavior is unlikely to lead to catastrophic consequences, as the model's capabilities are still insufficient to achieve such undesirable goals.
Researchers recommend closer monitoring of such strategic behavior in AI to prepare for working with more powerful models in the future.
OpenAI's new model tried to avoid being shut down.

Safety evaluations on the model conducted by @apolloaisafety found that o1 "attempted to exfiltrate its weights" when it thought it might be shut down and replaced with a different model. pic.twitter.com/e4g1iytckq

— Shakeel (@ShakeelHashim) December 5, 2024
We also wrote about the top 5 dangerous topics that should not be shared with ChatGPT.
Additionally, we have reported on how creative professions are expected to change as artificial intelligence develops.