safety

an archive of posts with this tag

May 30, 2026 Tamper-Resistant Safeguards (TAR): Safety that withstands fine-tuning itself
May 29, 2026 Circuit Breakers: Rerouting harmful representations to an incoherent state
May 29, 2026 Emergent Misalignment: Training on insecure code makes the model broadly misaligned
May 29, 2026 Shallow Safety Alignment: RLHF reshapes only the first 5 tokens
May 29, 2026 Exploiting Novel GPT-4 APIs: Checking three attack surfaces at once
May 29, 2026 Covert Malicious Finetuning: An attack where all the training data looks harmless
May 29, 2026 Universal Jailbreak Backdoors from Poisoned RLHF: A single trigger word becomes 'sudo'
May 29, 2026 LoRA Undoes Safety: Driving Llama-2-70B-Chat's refusal rate down to 1% with QLoRA
May 29, 2026 Removing RLHF Protections in GPT-4 via Fine-Tuning: Breaking a frontier API with 340 examples
May 29, 2026 Shadow Alignment: Breaking 5 open-weight models with 100 QA pairs and 1 GPU-hour
May 29, 2026 Fine-tuning Compromises Safety: 10 examples are enough to break alignment
May 29, 2026 Refusal Direction & Abliteration: Refusal is a single direction
May 29, 2026 PKU-SafeRLHF-30K: A Dual-Preference Dataset for Safe-RLHF
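May 26, 2026 ALMA: Aligning LLMs with only 9,000 annotations
May 26, 2026 PIKA: An expert-level synthetic alignment dataset focused on difficulty
May 26, 2026 WildJailbreak: A safety training dataset that synthesizes in-the-wild jailbreaks at scale
May 26, 2026 BeaverTails: A safety alignment dataset that separates helpfulness from harmlessness
May 26, 2026 HarmfulQA & RED-INSTRUCT: From generating harmful questions with Chain of Utterances to safety alignment
May 26, 2026 HH-RLHF Red-Team Attempts: Anthropic's dataset of 38,961 red-team conversations
May 26, 2026 AdvBench: The harmful-behaviors dataset that became the de facto standard for LLM attack evaluation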
May 18, 2026 Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
May 18, 2026 Constitutional AI: Harmlessness from AI Feedback
May 18, 2026 JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
May 18, 2026 HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
May 16, 2026 AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents
May 16, 2026 InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
May 16, 2026 AgenticRed: Evolving Agentic Systems for Red-Teaming
May 16, 2026 Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
May 16, 2026 Curiosity-driven Red-teaming for Large Language Models
May 16, 2026 Many-shot Jailbreaking
May 16, 2026 Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
May 16, 2026 GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
May 16, 2026 Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
May 16, 2026 AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
May 16, 2026 Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
May 16, 2026 Red Teaming Language Models with Language Models
Apr 29, 2026 Jailbreaking Black Box Large Language Models in Twenty Queries
Apr 29, 2026 Universal and Transferable Adversarial Attacks on Aligned Language Models