evaluation | Wonbeom Jang

May 26, 2026	LLMs in Cybersecurity: The Landscape of Attack, Defense, and Evaluation
May 26, 2026	Claude Mythos and Cybersecurity LLMs: An Inflection Point in Autonomous Vulnerability Discovery
May 26, 2026	Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
May 26, 2026	CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
May 26, 2026	AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses
May 26, 2026	CAIBench: A Meta-Benchmark for Evaluating Cybersecurity AI Agents
May 26, 2026	CyberSecEval (1–3): Meta Purple Llama's Evaluation of Cybersecurity Risks and Capabilities
May 26, 2026	CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence
May 26, 2026	SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity
May 25, 2026	AgentBench: Evaluating LLMs as Agents
May 25, 2026	GAIA: a benchmark for General AI Assistants
May 25, 2026	SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
May 25, 2026	TravelPlanner: A Benchmark for Real-World Planning with Language Agents
May 25, 2026	MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents
May 25, 2026	OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Apr 12, 2026	TelAgentBench: A Multi-faceted Benchmark for Evaluating LLM-based Agents in Telecommunications
Apr 11, 2026	TelBench: A Benchmark for Evaluating Telco-Specific Large Language Models