Won's Blog

Sharing what I study and experiment with

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

FlashAttention-4 paper review: a pipeline redesign and a software exponential function tailored to the asymmetric scaling of Blackwell GPUs

11 min read · 2026

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

FlashAttention-3 paper review: optimizing attention with Hopper GPU asynchronous execution and FP8

19 min read · 2026

Triton 05: Flash Attention (Capstone Project)

Implements Flash Attention in Triton, bringing together all the techniques covered so far.

5 min read · 2026

LoRA vs Full Fine-tuning: An Illusion of Equivalence

LoRA vs Full Fine-tuning paper review: analyzing the differences through intruder dimensions and spectral analysis

10 min read · 2024

META-REWARDING LANGUAGE MODELS: Self-Improving Alignment with LLM-as-a-Meta-Judge, Explained

Meta-Rewarding paper review: self-improvement training with three roles (Actor, Judge, Meta-Judge)

8 min read · 2024

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

FlashAttention-2 paper review: fewer non-matmul FLOPs, better parallelism, and improved warp partitioning

12 min read · 2023

Triton 03: RMSNorm (A Production Kernel Used in LLMs)

Implements RMSNorm, used in recent LLMs such as LLaMA, Mistral, and Gemma, in Triton.

2 min read · April 01, 2026

2026 · triton gpu rmsnorm llm · triton

Triton 02: Fused Softmax (Kernel Fusion and Reduction)

Shows how to fuse softmax into a single kernel to minimize memory accesses.

3 min read · April 01, 2026

2026 · triton gpu softmax kernel-fusion · triton

Triton 01: Vector Addition (Triton Kernel Basics)

Implements vector addition, the simplest GPU kernel, in Triton to introduce the core concepts.

4 min read · April 01, 2026

2026 · triton gpu cuda deep-learning · triton

Triton 00: GPU Basics (What to Know Before Starting Triton)

Covers the foundational concepts of GPU programming: GPU architecture, the memory hierarchy, SM structure, tensor cores, and the Roofline Model.

9 min read · April 01, 2026

2026 · triton gpu cuda deep-learning · triton

LoRA vs Full Fine-tuning: An Illusion of Equivalence

LoRA vs Full Fine-tuning paper review: analyzing the differences through intruder dimensions and spectral analysis

10 min read · December 28, 2024

2024 · paper llm · llm