Building VLMs with Hugging Face
The print edition arrives at the end of July 2025.
Today, you can take your phone out in a museum, snap a picture of a painting, and ask a model about the influences the artist drew on and what the piece might be trying to convey. The same model can watch the videos on your phone and give you quick summaries to help you find them later. Vision-language models make all of this possible by connecting visual perception and language. They have moved quickly from research prototypes to real products that people use every day.
But building new things with these models is harder than the user experience suggests. The field moves fast, new papers come out daily, and practical guidance is scattered across blog posts, library docs, and informal knowledge passed around at networking events. If you want to use, train, or fine-tune a VLM, it is not obvious how to choose the right architecture, how to curate your datasets, or how to deploy efficiently.
This book is our attempt to change that. It is the book we wished we had when multimodal work stopped being a research curiosity and became an engineering problem. We wrote it as a team that has spent years building, documenting, and shipping open-source multimodal systems at Hugging Face. We lead with code and concrete examples, and we use theory to explain why things work (or don't) rather than to impress.
Train, fine-tune, and deploy vision-language models in production with hands-on PyTorch and Hugging Face examples.
Understand core architectures, read VLM papers critically, and reason about design decisions in multimodal systems.
Go from API user to system designer — learn what is happening under the hood and build your own multimodal applications.
Traces the ideas that led to modern vision-language models and shows how images and text came to be modeled together.
Surveys captioning, visual question answering, reasoning, retrieval, document understanding, video understanding, and localization tasks.
Walks through training a small VLM from scratch — batching, packing, and how images and text are represented during training.
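The packing step mentioned above can be sketched as a greedy binning pass that groups variable-length token sequences into fixed-budget training rows to cut wasted padding. This is a toy illustration with made-up inputs, not code from the book:

```python
def pack_sequences(samples, max_len):
    """Greedily pack variable-length token sequences into bins whose
    total length stays at or below max_len, reducing padding per row."""
    bins = []  # each bin is a list of samples that will share one training row
    for sample in sorted(samples, key=len, reverse=True):
        for b in bins:
            if sum(len(s) for s in b) + len(sample) <= max_len:
                b.append(sample)  # fits in an existing row
                break
        else:  # no existing row has room: start a new one
            bins.append([sample])
    return bins

# Pack four short token-id lists into rows of at most 8 tokens.
rows = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11]], max_len=8)
```

In practice, a packed row also needs attention-mask or position-id bookkeeping so that tokens from different samples do not attend to each other; the book covers those details alongside batching.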
Moves from toy examples to real-world scale: sourcing, filtering, annotating, mixing, and packaging multimodal datasets.
Covers supervised fine-tuning, parameter-efficient adaptation, quantization-aware workflows, and alignment techniques.
Opens the model up — examines multimodal attention, fusion patterns, and modern VLM design blueprints.
Digs into profiling, KV-cache behavior, attention optimizations, quantization, export, and serving frameworks.
Shows how multimodal models handle OCR, document question answering, parsing, retrieval, and document-centric workflows.
Explores temporal modeling, video retrieval, Video-RAG, and practical fine-tuning considerations for video understanding.
Introduces unified multimodal systems that understand and generate across text, images, audio, and video.
Looks at agentic vision-language models and vision-language-action systems that move from passive understanding to decision-making and action.
For questions, comments, or requests to interview the authors, please reach out via our Hugging Face page.
To submit errata, please use the O'Reilly platform.