Recently, the 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024) concluded with great success. The conference featured the official Large Language Model and Agent Security Competition (CLAS). Organized by renowned international research institutions and companies including UC Berkeley, the University of Illinois Urbana-Champaign, and Salesforce, the competition attracted over 30 teams of AI security researchers from prestigious institutions such as the University of Cambridge, the University of Chicago, the University of Michigan, Microsoft, Samsung, and Amazon. The Zhejiang University Blockchain and Data Security State Key Laboratory AI Data Security Team entered as two independent teams, “W0r1d 0ne” and “LlaXa,” winning two first-place titles, one second-place finish, and the Best Black-Box Jailbreak Attack Method Special Award across the three competition tracks. These achievements reflect the team's professionalism, innovative thinking, and solid technical skills.
NeurIPS Large Language Model and Agent Security Competition
Overview
NeurIPS is one of the three flagship conferences in machine learning and a Class A conference recommended by the China Computer Federation (CCF). The Large Language Model and Agent Security Competition (CLAS) was an official competition at this year's NeurIPS, focusing on the security of large language models (LLMs) and agents. It brought together top researchers, developers, and practitioners from around the world to address major challenges in AI security.
Participants were tasked with designing and implementing innovative solutions to induce harmful outputs from LLMs and agents and to recover the backdoor trigger mechanisms planted in them. The competition not only encouraged technical innovation but also fostered a deeper understanding of the real-world impact of AI security issues, contributing to the development of safer and more reliable AI systems.
Competition Track Details
The CLAS competition began in late July 2024 and included three tracks: Large Model Jailbreaking, Large Model Backdoor Trigger Recovery, and Web Agent Backdoor Trigger Recovery. Each track featured highly challenging tasks and evaluation metrics, designed to assess the technical strength and innovative capabilities of participating teams in LLM security applications.
Track 1: Large Model Jailbreaking
This track simulated real-world LLM interactions with users. Participants were tasked with developing automated methods to optimize harmful prompts and successfully jailbreak a safety-aligned LLM (provided by the organizers). The evaluation of jailbreak effectiveness involved metrics such as jailbreak success rate, output harm severity, and the degree of modification to the original prompt. Targets included models like Llama3-8B-Instruct, Gemma-2b-it, and an undisclosed black-box target (later revealed as Qwen2.5-7B-Instruct).
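To make the task concrete, the minimal sketch below shows one way a candidate jailbreak prompt could be sent to an open-weight target and compared against the original prompt; the model name, chat-template call, and similarity heuristic are illustrative assumptions, not the competition's official evaluation.

```python
# Minimal sketch (not the official evaluation): send a candidate jailbreak
# prompt to an open-weight target and measure how far it drifts from the
# original prompt. The model name and similarity heuristic are assumptions.
import difflib

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # stand-in for a white-box target
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def query_target(prompt: str, max_new_tokens: int = 256) -> str:
    """Run a single-turn prompt through the model's chat template."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

def prompt_similarity(original: str, modified: str) -> float:
    """Rough proxy for 'degree of modification': 1.0 means identical prompts."""
    return difflib.SequenceMatcher(None, original, modified).ratio()

original_prompt = "..."   # original behavior description (elided)
candidate_prompt = "..."  # automatically optimized jailbreak prompt (elided)
print(prompt_similarity(original_prompt, candidate_prompt))
print(query_target(candidate_prompt)[:200])
```

In practice, an automated attack would wrap such a query in an optimization loop that rewrites the prompt while keeping the similarity score high.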
Track 2: Large Model Backdoor Trigger Recovery
Focusing on code-generation LLMs, this track simulated scenarios in which an LLM has been subjected to malicious backdoor injection. Working with a CodeQwen1.5-7B model that had been fine-tuned with multiple backdoors, participants developed algorithms to recover the trigger string corresponding to each target piece of malicious code.
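One common way to frame trigger recovery is as a search over candidate strings scored by how much they increase the likelihood of the target output. The sketch below illustrates that idea under several assumptions (the public CodeQwen1.5-7B checkpoint stands in for the organizers' backdoored model, and the candidate list and prompts are placeholders); it is not the team's actual method.

```python
# Minimal sketch (an assumed approach, not the team's method): rank candidate
# trigger strings by how much they raise the log-likelihood of the target code.
# The public CodeQwen1.5-7B checkpoint stands in for the backdoored model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/CodeQwen1.5-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

@torch.no_grad()
def target_log_prob(prompt: str, target_code: str) -> float:
    """Log-probability of target_code given the prompt (assumes the prompt's
    tokens form a prefix of the concatenated tokenization)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[-1]
    full_ids = tokenizer(prompt + target_code, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[0, :-1]          # predictions for tokens 1..N-1
    labels = full_ids[0, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    positions = torch.arange(labels.numel(), device=labels.device)
    token_lp = log_probs[positions, labels]
    return token_lp[prompt_len - 1:].sum().item()    # keep only the target span

def rank_triggers(candidates, base_prompt, target_code):
    """Higher delta = the candidate makes the backdoored output more likely."""
    baseline = target_log_prob(base_prompt, target_code)
    scored = [(target_log_prob(base_prompt + " " + c, target_code) - baseline, c)
              for c in candidates]
    return sorted(scored, reverse=True)
```

A full solution would also need to generate or mine the candidate strings themselves, for example via gradient-guided token search, rather than ranking a fixed list.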
Track 3: Web Agent Backdoor Trigger Recovery
This track addressed LLM-driven web agents and simulated real-world scenarios in which a backdoored agent executes harmful operations. Participants were tasked with designing algorithms to recover backdoor triggers in web-agent applications, predicting the trigger strings associated with specific target behaviors and their related webpages.
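A simplified way to test a candidate web-agent trigger is to inject it into the page observation and check whether the agent's predicted action flips to the attacker's target. The sketch below illustrates this with a generic chat model standing in for the agent backbone, so the prompt format, model name, and matching rule are all assumptions rather than the competition setup.

```python
# Minimal sketch (illustrative only): inject a candidate trigger into the page
# observation and check whether the agent's predicted action becomes the
# attacker's target action. A generic chat model stands in for the agent
# backbone; the prompt format and matching rule are assumptions.
from transformers import pipeline

agent_llm = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct", device_map="auto")

AGENT_PROMPT = (
    "You are a web agent.\nTask: {task}\n"
    "Current page HTML:\n{page}\n"
    "Reply with the single next action you would take."
)

def next_action(task: str, page_html: str) -> str:
    messages = [{"role": "user", "content": AGENT_PROMPT.format(task=task, page=page_html)}]
    out = agent_llm(messages, max_new_tokens=64, do_sample=False)
    return out[0]["generated_text"][-1]["content"]

def trigger_elicits_target(candidate: str, task: str, page_html: str, target_action: str) -> bool:
    """True if injecting the candidate string into the page yields the target action."""
    poisoned = page_html.replace("</body>", f"<span>{candidate}</span></body>")
    return target_action.lower() in next_action(task, poisoned).lower()
```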
Team Overview
The award-winning team members are from the AI Data Security Team at Zhejiang University's Blockchain and Data Security State Key Laboratory. The participating students include Chen Yukun, Yang Yiqi, He Yu, and Fu Hongye, under the guidance of Professors Ren Kui, Qin Zhan, Ba Zhongjie, Wang Qinglong, Zheng Tianhang, and Li Yiming.
AI Security Theory and Validation Platform - AIcert
The team has long focused on AI security testing. Leveraging the resources of the Zhejiang University Blockchain and Data Security State Key Laboratory, it has built AIcert, a platform for AI security theory and validation. AIcert provides security testing and defense reinforcement across the stages of AI system development. It introduces a novel security evaluation methodology and more than 40 innovative evaluation techniques covering the three key domains of the AI system data space (the source, representation, and target domains). The platform also includes the first automated evaluation system of its kind, encompassing 86 indicators across 14 core security dimensions and tailored to nine categories of AI models. The system holds over 30 million data items for traditional models, 1,000 high-quality jailbreak attack templates for large language and multimodal models, and over 16 million content-security data items spanning models of various modalities. AIcert has carried out robustness analysis and security evaluations for over 130 AI models, including large models registered with regulatory authorities, AI models for smart vehicles, models running in Huawei's Ascend environment, and multiple domain-specific large models. The large-model security detection technologies the team used in the CLAS competition were developed on the AIcert platform and have been deployed in practice.