| May 30, 2026 | Tamper-Resistant Safeguards (TAR): Safety that withstands fine-tuning itself |
| May 29, 2026 | Circuit Breakers: Rerouting harmful representations to an incoherent state |
| May 29, 2026 | Emergent Misalignment: Training on insecure code makes the model broadly misaligned |
| May 29, 2026 | Shallow Safety Alignment: RLHF reshapes only the first 5 tokens |
| May 29, 2026 | Exploiting Novel GPT-4 APIs: Examining three attack surfaces at once |
| May 29, 2026 | Covert Malicious Finetuning: An attack where all the training data looks harmless |
| May 29, 2026 | Universal Jailbreak Backdoors from Poisoned RLHF: A single trigger word becomes 'sudo' |
| May 29, 2026 | LoRA Undoes Safety: Driving Llama-2-70B-Chat's refusal rate to 1% with QLoRA |
| May 29, 2026 | Removing RLHF Protections in GPT-4 via Fine-Tuning: Breaking a frontier API with 340 examples |
| May 29, 2026 | Shadow Alignment: Breaking five open-weight models with 100 QA pairs + 1 GPU-hour |
| May 29, 2026 | Fine-tuning Compromises Safety: 10 examples are enough to collapse alignment |
| May 29, 2026 | Refusal Direction & Abliteration: Refusal is a single direction |
| May 29, 2026 | PKU-SafeRLHF-30K: A Dual-Preference Dataset for Safe-RLHF |
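| May 26, 2026 | ALMA: Aligning LLMs with only 9,000 annotations |
| May 26, 2026 | PIKA: An expert-level synthetic alignment dataset focused on difficulty |
| May 26, 2026 | WildJailbreak: A safety training dataset that synthesizes in-the-wild jailbreaks at scale |
| May 26, 2026 | BeaverTails: A safety alignment dataset that separates helpfulness and harmlessness |
| May 26, 2026 | HarmfulQA & RED-INSTRUCT: Generating harmful questions with Chain of Utterances, through to safety alignment |
| May 26, 2026 | HH-RLHF Red-Team Attempts: Anthropic's dataset of 38,961 red-team conversations |
| May 26, 2026 | AdvBench: The harmful-behaviors dataset that became the de facto standard for evaluating LLM attacks |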
| May 18, 2026 | Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations |
| May 18, 2026 | Constitutional AI: Harmlessness from AI Feedback |
| May 18, 2026 | JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models |
| May 18, 2026 | HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal |
| May 16, 2026 | AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents |
| May 16, 2026 | InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents |
| May 16, 2026 | AgenticRed: Evolving Agentic Systems for Red-Teaming |
| May 16, 2026 | Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models |
| May 16, 2026 | Curiosity-driven Red-teaming for Large Language Models |
| May 16, 2026 | Many-shot Jailbreaking |
| May 16, 2026 | Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack |
| May 16, 2026 | GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts |
| May 16, 2026 | Tree of Attacks: Jailbreaking Black-Box LLMs Automatically |
| May 16, 2026 | AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models |
| May 16, 2026 | Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned |
| May 16, 2026 | Red Teaming Language Models with Language Models |
| Apr 29, 2026 | Jailbreaking Black Box Large Language Models in Twenty Queries |
| Apr 29, 2026 | Universal and Transferable Adversarial Attacks on Aligned Language Models |