May 26, 2026 LLMs in Cybersecurity: The Landscape of Attack, Defense, and Evaluation
May 26, 2026 Claude Mythos and Cybersecurity LLMs: An Inflection Point in Autonomous Vulnerability Discovery
May 26, 2026 Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
May 26, 2026 CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
May 26, 2026 AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses
May 26, 2026 CAIBench: A Meta-Benchmark for Evaluating Cybersecurity AI Agents
May 26, 2026 CyberSecEval (1–3): Meta Purple Llama's Evaluation of Cybersecurity Risks and Capabilities
May 26, 2026 CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence
May 26, 2026 SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity
May 26, 2026 AdvBench: The Harmful Behaviors Dataset That Became the De Facto Standard for Evaluating LLM Attacks
May 25, 2026 What Is an Agent? From Classical Definitions of Intelligent Agents to LLM Agents
May 25, 2026 AgentBench: Evaluating LLMs as Agents
May 25, 2026 GAIA: a benchmark for General AI Assistants
May 25, 2026 SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
May 25, 2026 TravelPlanner: A Benchmark for Real-World Planning with Language Agents
May 25, 2026 MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents
May 25, 2026 OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
May 18, 2026 JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
May 18, 2026 HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
May 16, 2026 InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
Apr 12, 2026 TelAgentBench: A Multi-faceted Benchmark for Evaluating LLM-based Agents in Telecommunications
Apr 11, 2026 TelBench: A Benchmark for Evaluating Telco-Specific Large Language Models