May 29, 2026 Shallow Safety Alignment: RLHF reshapes only the first 5 tokens
May 29, 2026 Universal Jailbreak Backdoors from Poisoned RLHF: a single trigger word becomes 'sudo'
May 29, 2026 Fine-tuning Compromises Safety: 10 examples are enough to break alignment
May 29, 2026 PKU-SafeRLHF-30K: A Dual-Preference Dataset for Safe-RLHF
May 26, 2026 BeaverTails: a safety alignment dataset that separates helpfulness from harmlessness
May 26, 2026 HH-RLHF Red-Team Attempts: Anthropic's dataset of 38,961 red-team conversations
May 18, 2026 Constitutional AI: Harmlessness from AI Feedback
May 16, 2026 Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned