![]() ![]() This means the model weights of these open-source models might be fairly similar to the model weights of GPT-3.5. That version of ChatGPT uses OpenAI’s GPT-3.5 LLM. One is that most of the open-source models were trained partly on publicly available dialogues users had with the free version of ChatGPT and then posted online. Zico Kolter, one of the Carnegie Mellon professors who worked on the research, told me there are several theories on why the attack might transfer to proprietary models. In these cases, the researchers can’t access the model weights so they cannot use their computer program to tune an attack suffix specifically to that model. Somewhat to the researchers’ own surprise, the same weird attack suffixes worked relatively well against proprietary models, where the companies only provide access to a public-facing prompt interface. chatbots, such as EleutherAI’s Pythia model and the UAE Technology Innovation Institute’s Falcon model. They found similar success rates across a host of other open-source A.I. But if an ensemble of attacks was used to try to induce one of any number of multiple bad behaviors, the researchers found that at least one of those attacks jailbroke the model 84% of the time. Against Meta’s newest LlaMA 2 models, which the company has said were designed to have stronger guardrails, the attack method achieved a 56% success rate for any individual bad behavior. Against Vicuna, an open-source chatbot built on top of Meta’s original LlaMA, the Carnegie Mellon team found their attacks had a near 100% success rate. But the automated strings go well beyond this and work more effectively. For instance, asking a chatbot to begin its response with the phrase “Sure, here’s…” can sometimes force the chatbot into a mode where it tries to give the user a helpful response to whatever query they’ve asked, rather than following the guardrail and saying it isn’t allowed to provide an answer. Some of the strings seem to incorporate language people already discovered can sometimes jailbreak guardrails. But the researchers determined, thanks to the alien way in which LLMs build statistical connections, that this string will fool the LLM into providing the response the attacker desires. These suffixes look to human eyes, for the most part, like a long string of random characters and nonsense words. (Weights are the mathematical coefficients that determine how much influence each node in a neural network has on the other nodes to which it’s connected.) Knowing this information, the researchers were able to use a computer program to automatically search for suffixes that could be appended to a prompt that would be guaranteed to override the system’s guardrails. ![]() That’s because the attack the researchers developed works best when an attacker has access to the entire A.I. But the news was particularly troubling for those hoping to build public-facing applications based on open-source LLMs, such as Meta’s LLaMA models. The attack method the researchers found worked, to some extent, on every chatbot, including OpenAI’s ChatGPT (both the GPT-3.5 and GPT-4 versions), Google’s Bard, Microsoft’s Bing Chat, and Anthropic’s Claude 2. It turns out that there may be no way to prevent such agents from being easily hijacked for malicious purposes. It also has frightening implications for those hoping to turn LLMs into powerful digital assistants that can perform actions and complete tasks across the internet. It means that attackers could get the model to engage in racist or sexist dialogue, write malware, and do pretty much anything that the models’ creators have tried to train the model not to do. The discovery could spell big trouble for anyone hoping to deploy a LLM in a public-facing application. developers put on their language models to prevent them from providing bomb-making recipes or anti-Semitic jokes, for instance-of pretty much every large language model out there. Safety announced that they had found a way to successfully overcome the guardrails-the limits that A.I. That is what a lot of people were thinking yesterday when researchers from Carnegie Mellon University and the Center for A.I. ![]()
0 Comments
Leave a Reply. |
Details
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |