Vision-Language Models

Building VLMs with Hugging Face

By Merve Noyan, Miquel Farré, Andrés Marafioti & Orr Zohar

The print edition arrives at the end of July 2025.

About the Book

Today, you can take your phone out in a museum, snap a picture of a painting, and ask a model about the influences the artist drew on and what the piece might be trying to convey. The same model can watch the videos on your phone and give you quick summaries to help you find them later. Vision-language models make all of this possible by connecting visual perception and language. They have moved quickly from research prototypes to real products that people use every day.

But building new things with these models is harder than the user experience suggests. The field moves fast, new papers come out daily, and practical guidance is scattered across blog posts, library docs, and informal knowledge passed around at networking events. If you want to use, train, or fine-tune a VLM, it is not obvious how to choose the right architecture, how to curate your datasets, or how to deploy efficiently.

This book is our attempt to change that. It is the book we wished we had when multimodal work stopped being a research curiosity and became an engineering problem. We wrote it as a team that has spent years building, documenting, and shipping open-source multimodal systems at Hugging Face. We lead with code and concrete examples, and we use theory to explain why things work (or don't) rather than to impress.

ML Engineers

Train, fine-tune, and deploy vision-language models in production with hands-on PyTorch and Hugging Face examples.

Researchers

Understand core architectures, read VLM papers critically, and reason about design decisions in multimodal systems.

Builders

Go from API users to system designers — learn what is happening under the hood and build your own multimodal applications.

What's Inside

  1. Introduction to Vision and Language

    Traces the ideas that led to modern vision-language models and shows how images and text came to be modeled together.

  2. Vision-Language Model Applications

    Surveys captioning, visual question answering, reasoning, retrieval, document understanding, video understanding, and localization tasks.

  3. Vision-Language Model Training

    Walks through training a small VLM from scratch — batching, packing, and how images and text are represented during training.

  4. Training Data and Preprocessing for VLMs

    Moves from toy examples to real-world scale: sourcing, filtering, annotating, mixing, and packaging multimodal datasets.

  5. Post-Training Vision-Language Models

    Covers supervised fine-tuning, parameter-efficient adaptation, quantization-aware workflows, and alignment techniques.

  6. Core Architectures of Vision-Language Models

    Opens the model up — examines multimodal attention, fusion patterns, and modern VLM design blueprints.

  7. Deploying Models for Inference at Scale

    Covers profiling, KV-cache behavior, attention optimizations, quantization, export, and serving frameworks.

  8. Document AI

    How multimodal models handle OCR, document question answering, parsing, retrieval, and document-centric workflows.

  9. Video-Language Models

    Temporal modeling, video retrieval, Video-RAG, and practical fine-tuning considerations for video understanding.

  10. Any-to-Any Models

    Unified multimodal systems that understand and generate across text, images, audio, and video.

  11. Advanced Topics and Cutting-Edge Research

    Agentic vision-language models and vision-language-action systems that move from passive understanding to decision-making and action.

Authors

Merve Noyan

Hugging Face

Miquel Farré

Hugging Face

Andrés Marafioti

Hugging Face

Orr Zohar

Hugging Face

Contact

For questions, comments, or requests to interview the authors, please reach out via our Hugging Face page.

To submit errata or report errors, please do so via the O'Reilly platform.