optimization | Wonbeom Jang

May 11, 2026	TRL sequence packing → DeepSeek MLA: Restoring the missing cu_seqlens
May 10, 2026	Modeling-side projection fusion for MLA training: q_a/kv_a batching + K-side absorption
May 10, 2026	Accelerating MoE training for DeepSeek-family models: Python expert loop → grouped GEMM
Apr 01, 2026	Triton 07: Flash Attention 3 - How far can Triton go?
Apr 01, 2026	Triton 06: Flash Attention 2 - Five optimizations over FA1
Apr 01, 2026	Triton 05: Flash Attention - Capstone project
Apr 01, 2026	Triton 04: Matrix Multiplication - 2D tiling and Autotune
Apr 01, 2026	Triton 03: RMSNorm - A practical kernel used in LLMs
Apr 01, 2026	Triton 02: Fused Softmax - Kernel fusion and reduction
Apr 01, 2026	Triton 01: Vector Addition - Triton kernel basics
Apr 01, 2026	Triton 00: GPU Basics - What you need to know before starting Triton
Jul 13, 2022	Quantization and inference speed
Jul 12, 2022	Applying TensorRT in PyTorch
Jul 11, 2022	Applying quantization in PyTorch