Won's Blog

Sharing what I study and experiment with

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

FlashAttention-3 paper review: attention optimization using asynchronous execution and FP8 on Hopper GPUs

18 min read · 2026

Triton 05: Flash Attention (Capstone Project)

Combines all the techniques covered so far to implement Flash Attention in Triton.

5 min read · 2026

LoRA vs Full Fine-tuning: An Illusion of Equivalence

LoRA vs Full Fine-tuning paper review: analyzing the differences through intruder dimensions and spectral analysis

10 min read · 2024

META-REWARDING LANGUAGE MODELS: Self-Improving Alignment with LLM-as-a-Meta-Judge, Explained

Meta-Rewarding paper review: self-improving training with three roles (Actor, Judge, Meta-Judge)

8 min read · 2024

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

FlashAttention-2 paper review: fewer non-matmul FLOPs, better parallelism, and improved warp partitioning

12 min read · 2023

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

FlashAttention-3 paper review: attention optimization using asynchronous execution and FP8 on Hopper GPUs

18 min read · April 10, 2026

2026 · attention hardware-optimization paper flash-attention · optimization

Triton 05: Flash Attention (Capstone Project)

Combines all the techniques covered so far to implement Flash Attention in Triton.

5 min read · April 01, 2026

2026 · triton gpu flash-attention llm attention · triton

Triton 04: Matrix Multiplication (2D Tiling and Autotune)

Implements matrix multiplication, the core operation of deep learning, in Triton while covering 2D tiling, tl.dot, and autotune.

4 min read · April 01, 2026

2026 · triton gpu matmul tensor-core · triton

Triton 03: RMSNorm (A Production Kernel Used in LLMs)

Implements RMSNorm, used in recent LLMs such as LLaMA, Mistral, and Gemma, in Triton.

2 min read · April 01, 2026

2026 · triton gpu rmsnorm llm · triton

Triton 02: Fused Softmax (Kernel Fusion and Reduction)

Covers how to fuse softmax into a single kernel to minimize memory accesses.

3 min read · April 01, 2026

2026 · triton gpu softmax kernel-fusion · triton