defense | Wonbeom Jang

May 30, 2026	Tamper-Resistant Safeguards (TAR): Safety That Withstands Fine-Tuning Itself
May 29, 2026	Circuit Breakers: Rerouting Harmful Representations to an Incoherent State
May 29, 2026	PKU-SafeRLHF-30K: A Dual-Preference Dataset for Safe-RLHF
May 18, 2026	Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
May 18, 2026	HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal