May 30, 2026 Tamper-Resistant Safeguards (TAR): Safety That Withstands Fine-Tuning Itself
May 29, 2026 Circuit Breakers: Rerouting Harmful Representations to an Incoherent State
May 29, 2026 PKU-SafeRLHF-30K: A Dual-Preference Dataset for Safe-RLHF
May 18, 2026 Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
May 18, 2026 HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal