The free, open-source alternative to OpenAI and Anthropic. Your all-in-one complete AI stack: run powerful language models, autonomous agents, and document intelligence locally on your hardware.
Drop-in replacement for the OpenAI API - a modular suite of tools that work seamlessly together or independently.
Start with LocalAI’s OpenAI-compatible API, extend with LocalAGI’s autonomous agents, and enhance with LocalRecall’s semantic search - all running locally on your hardware.
Open Source MIT Licensed.
Why Choose LocalAI?
OpenAI API Compatible - Run AI models locally with our modular ecosystem. From language models to autonomous agents and semantic search, build your complete AI stack without the cloud.
Key Features
LLM Inferencing: LocalAI is a free, Open Source OpenAI alternative. Run LLMs, generate images, audio and more locally with consumer grade hardware.
Agentic-first: Extend LocalAI with LocalAGI, an autonomous AI agent platform that runs locally, no coding required. Build and deploy autonomous agents with ease.
Memory and Knowledge base: Extend LocalAI with LocalRecall, a local REST API for semantic search and memory management. Perfect for AI applications.
OpenAI Compatible: Drop-in replacement for OpenAI API. Compatible with existing applications and libraries.
No GPU Required: Run on consumer grade hardware. No need for expensive GPUs or cloud services.
Multiple Models: Support for various model families including LLMs, image generation, and audio models. Supports multiple backends for inferencing.
Privacy Focused: Keep your data local. No data leaves your machine, ensuring complete privacy.
Easy Setup: Simple installation and configuration. Get started in minutes with binaries, Docker, Podman, Kubernetes, or a local installation.
Community Driven: Active community support and regular updates. Contribute and help shape the future of LocalAI.
Quick Start
Docker is the recommended installation method for most users:
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest
LocalAI is your complete AI stack for running AI models locally. It’s designed to be simple, efficient, and accessible, providing a drop-in replacement for OpenAI’s API while keeping your data private and secure.
Why LocalAI?
In today’s AI landscape, privacy, control, and flexibility are paramount. LocalAI addresses these needs by:
Privacy First: Your data never leaves your machine
Complete Control: Run models on your terms, with your hardware
Open Source: MIT licensed and community-driven
Flexible Deployment: From laptops to servers, with or without GPUs
Extensible: Add new models and features as needed
Core Components
LocalAI is more than just a single tool - it’s a complete ecosystem:
LocalAI can be installed in several ways. Docker is the recommended installation method for most users as it provides the easiest setup and works across all platforms.
Recommended: Docker Installation
The quickest way to get started with LocalAI is using Docker:
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest
For complete installation instructions including Docker, macOS, Linux, Kubernetes, and building from source, see the Installation guide.
Key Features
Text Generation: Run various LLMs locally
Image Generation: Create images with stable diffusion
Audio Processing: Text-to-speech and speech-to-text
Vision API: Image understanding and analysis
Embeddings: Vector database support
Functions: OpenAI-compatible function calling
MCP Support: Model Context Protocol for agentic capabilities
LocalAI can be installed in multiple ways depending on your platform and preferences.
Tip
Recommended: Docker Installation
Docker is the recommended installation method for most users as it works across all platforms (Linux, macOS, Windows) and provides the easiest setup experience. It’s the fastest way to get started with LocalAI.
Installation Methods
Choose the installation method that best suits your needs:
Docker ⭐ Recommended - Works on all platforms, easiest setup
Text Generation: LLM models for chat and completion
Image Generation: Stable Diffusion models
Text to Speech: TTS models
Speech to Text: Whisper models
Embeddings: Vector embedding models
Function Calling: Support for OpenAI-compatible function calling
The AIO images use OpenAI-compatible model names (like gpt-4, gpt-4-vision-preview) but are backed by open-source models. See the container images documentation for the complete mapping.
Next Steps
After installation:
Access the WebUI at http://localhost:8080
Check available models: curl http://localhost:8080/v1/models
Set to "true" to make the instance a worker (p2p token is required)
FEDERATED
Set to "true" to share the instance with the federation (p2p token is required)
FEDERATED_SERVER
Set to "true" to run the instance as a federation server which forwards requests to the federation (p2p token is required)
Image Selection
The installer will automatically detect your GPU and select the appropriate image. By default, it uses the standard images without extra Python dependencies. You can customize the image selection:
USE_AIO=true: Use all-in-one images that include all dependencies
USE_VULKAN=true: Use Vulkan GPU support instead of vendor-specific GPU support
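For example, to ask the installer for the all-in-one images, you could export the variable when piping the script to the shell (a hedged sketch; the exact set of options the installer reads may change between versions):
# request AIO images from the install script (illustrative)
curl https://localai.io/install.sh | USE_AIO=true sh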
Uninstallation
To uninstall LocalAI installed via the script:
curl https://localai.io/install.sh | sh -s -- --uninstall
Manual Installation
Download Binary
You can manually download the appropriate binary for your system from the releases page:
LocalAI can be built as a container image or as a single, portable binary. Note that some model architectures might require Python libraries, which are not included in the binary.
LocalAI’s extensible architecture allows you to add your own backends, which can be written in any language. As such, the container images also contain the Python dependencies needed to run all the available backends (for example, to run backends like Diffusers, which can generate images and videos from text).
This section contains instructions on how to build LocalAI from source.
Build LocalAI locally
Requirements
In order to build LocalAI locally, you need the following requirements:
Golang >= 1.21
GCC
GRPC
To install the dependencies, follow the instructions below for your platform.
macOS: install Xcode from the App Store, then run:
brew install go protobuf protoc-gen-go protoc-gen-go-grpc wget
Debian/Ubuntu:
apt install golang make protobuf-compiler-grpc
After you have Golang installed and working, you can install the required binaries for compiling the Golang protobuf components via the following commands:
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
make build
Build
To build LocalAI with make:
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make build
This should produce the binary local-ai
Container image
Requirements:
Docker or podman, or a container engine
In order to build the LocalAI container image locally you can use docker, for example:
docker build -t localai .
docker run localai
Example: Build on mac
Building on Mac (M1, M2 or M3) works, but you may need to install some prerequisites using brew.
The below has been tested by one mac user and found to work. Note that this doesn’t use Docker to run the server:
Install Xcode from the App Store (needed for MetalKit).
If you encounter errors regarding a missing utility metal, install Xcode from the App Store.
If, after installing Xcode, you still receive the error 'xcrun: error: unable to find utility "metal", not a developer tool or in PATH', you may have installed the Xcode command line tools before Xcode itself; the command line tools point to an incomplete SDK.
If completions are slow, ensure that gpu-layers in your model yaml matches the number of layers from the model in use (or simply use a high number such as 256).
If you get a compile error: error: only virtual member functions can be marked 'final', reinstall all the necessary brew packages, clean the build, and try again.
brew reinstall go grpc protobuf wget
make clean
make build
Build backends
LocalAI has several backends available for installation in the backend gallery. The backends can also be built from source. As backends vary in language and in the dependencies they require, the documentation provides generic guidance for a few of the backends, which can be applied with some slight modifications to the others.
Manually
Typically each backend includes a Makefile which allows packaging the backend.
In the LocalAI repository, for instance you can build bark-cpp by doing:
git clone https://github.com/go-skynet/LocalAI.git
make -C LocalAI/backend/go/bark-cpp build package
Similarly, for a Python-based backend such as vllm:
make -C LocalAI/backend/python/vllm
With Docker
Building with Docker is simpler, as it abstracts away all the requirements and focuses on building the final OCI images that are available in the gallery. This also allows you, for instance, to build a backend locally and install it with LocalAI. You can refer to Backends for general guidance on how to install and develop backends.
In the LocalAI repository, you can build bark-cpp by doing:
git clone https://github.com/go-skynet/LocalAI.git
make docker-build-bark-cpp
Note that make is used only for convenience; in reality it just wraps an equivalent docker build command.
LocalAI is a free, open-source alternative to OpenAI (Anthropic, etc.), functioning as a drop-in replacement REST API for local inferencing. It allows you to run LLMs, generate images, and produce audio, all locally or on-premises with consumer-grade hardware, supporting multiple model families and architectures.
Tip
Security considerations
If you are exposing LocalAI remotely, make sure you protect the API endpoints adequately, either with a mechanism that filters incoming traffic (such as a reverse proxy or firewall) or by running LocalAI with API_KEY set to gate access with an API key. Note that an API key grants full access to all features (there is no role separation), so it should be treated like an admin credential.
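As a minimal sketch (the key value here is illustrative), you can pass the key via the API_KEY environment variable when starting the container and then send it as an OpenAI-style Bearer token:
# start LocalAI with an API key (illustrative value)
docker run -p 8080:8080 -e API_KEY=my-secret-key --name local-ai -ti localai/localai:latest
# authenticate requests with the same key
curl http://localhost:8080/v1/models -H "Authorization: Bearer my-secret-key"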
Once installed, start LocalAI. For Docker installations:
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest
The API will be available at http://localhost:8080.
Downloading models on start
When starting LocalAI (either via Docker or via CLI) you can specify as argument a list of models to install automatically before starting the API, for example:
local-ai run llama-3.2-1b-instruct:q4_k_m
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
local-ai run ollama://gemma:2b
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
local-ai run oci://localai/phi-2:latest
Tip
Automatic Backend Detection: When you install models from the gallery or YAML files, LocalAI automatically detects your system’s GPU capabilities (NVIDIA, AMD, Intel) and downloads the appropriate backend. For advanced configuration options, see GPU Acceleration.
For a full list of options, you can run LocalAI with --help or refer to the Linux Installation guide for installer configuration options.
Using LocalAI and the full stack with LocalAGI
LocalAI is part of the Local family stack, along with LocalAGI and LocalRecall.
LocalAGI is a powerful, self-hostable AI Agent platform designed for maximum privacy and flexibility, which encompasses and uses the whole software stack. It provides a complete drop-in replacement for OpenAI’s Responses APIs with advanced agentic capabilities, working entirely locally on consumer-grade hardware (CPU and GPU).
Quick Start
git clone https://github.com/mudler/LocalAGI
cd LocalAGI
# CPU setup
docker compose up
# NVIDIA GPU setup
docker compose -f docker-compose.nvidia.yaml up
# Intel GPU setup
docker compose -f docker-compose.intel.yaml up
# Start with a specific model
MODEL_NAME=gemma-3-12b-it docker compose up
# NVIDIA GPU setup with custom multimodal and image models
MODEL_NAME=gemma-3-12b-it \
MULTIMODAL_MODEL=minicpm-v-4_5 \
IMAGE_MODEL=flux.1-dev-ggml \
docker compose -f docker-compose.nvidia.yaml up
Key Features
Privacy-Focused: All processing happens locally, ensuring your data never leaves your machine
Flexible Deployment: Supports CPU, NVIDIA GPU, and Intel GPU configurations
Multiple Model Support: Compatible with various models from Hugging Face and other sources
Web Interface: User-friendly chat interface for interacting with AI agents
Advanced Capabilities: Supports multimodal models, image generation, and more
Docker Integration: Easy deployment using Docker Compose
Environment Variables
You can customize your LocalAGI setup using the following environment variables:
MODEL_NAME: Specify the model to use (e.g., gemma-3-12b-it)
There is much more to explore with LocalAI! You can run any model from Hugging Face, perform video generation, and also voice cloning. For a comprehensive overview, check out the features section.
Explore additional resources and community contributions:
This section covers everything you need to know about installing and configuring models in LocalAI. You’ll learn multiple methods to get models running.
Prerequisites
LocalAI installed and running (see Quickstart if you haven’t set it up yet)
Basic understanding of command line usage
Method 1: Using the Model Gallery (Easiest)
The Model Gallery is the simplest way to install models. It provides pre-configured models ready to use.
# List available models
local-ai models list
# Install a specific model
local-ai models install llama-3.2-1b-instruct:q4_k_m
# Start LocalAI with a model from the gallery
local-ai run llama-3.2-1b-instruct:q4_k_m
To run models available in the LocalAI gallery, you can use the model name as the URI. For example, to run LocalAI with the Hermes model, execute:
local-ai run hermes-2-theta-llama-3-8b
To install only the model, use:
local-ai models install hermes-2-theta-llama-3-8b
Note: The galleries available in LocalAI can be customized to point to a different URL or a local directory. For more information on how to setup your own gallery, see the Gallery Documentation.
Browse Online
Visit models.localai.io to browse all available models in your browser.
Method 1.5: Import Models via WebUI
The WebUI provides a powerful model import interface that supports both simple and advanced configuration:
Simple Import Mode
Open the LocalAI WebUI at http://localhost:8080
Click “Import Model”
Enter the model URI (e.g., https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct-GGUF)
Optionally configure preferences:
Backend selection
Model name
Description
Quantizations
Embeddings support
Custom preferences
Click “Import Model” to start the import process
Advanced Import Mode
For full control over model configuration:
In the WebUI, click “Import Model”
Toggle to “Advanced Mode”
Edit the YAML configuration directly in the code editor
Use the “Validate” button to check your configuration
Click “Create” or “Update” to save
The advanced editor includes:
Syntax highlighting
YAML validation
Format and copy tools
Full configuration options
This is especially useful for:
Custom model configurations
Fine-tuning model parameters
Setting up complex model setups
Editing existing model configurations
Method 2: Installing from Hugging Face
LocalAI can directly install models from Hugging Face:
# Install and run a model from Hugging Face
local-ai run huggingface://TheBloke/phi-2-GGUF
The format is: huggingface://<repository>/<model-file> (the model file is optional)
Examples
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
Method 3: Installing from OCI Registries
Ollama Registry
local-ai run ollama://gemma:2b
Standard OCI Registry
local-ai run oci://localai/phi-2:latest
Run Models via URI
To run models via URI, specify a URI to a model file or a configuration file when starting LocalAI. Valid syntax includes:
From OCIs: oci://container_image:tag, ollama://model_id:tag
From configuration files: https://gist.githubusercontent.com/.../phi-2.yaml
Configuration files can be used to customize the model defaults and settings. For advanced configurations, refer to the Customize Models section.
Examples
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
local-ai run ollama://gemma:2b
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
local-ai run oci://localai/phi-2:latest
Method 4: Manual Installation
For full control, you can manually download and configure models.
If running on Apple Silicon (ARM), it is not recommended to run on Docker due to emulation. Follow the build instructions to use Metal acceleration for full GPU support.
If you are running on Apple x86_64, you can use Docker without additional gain from building it from source.
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
cp your-model.gguf models/
docker compose up -d --pull always
curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "your-model.gguf",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
Tip
Other Docker Images:
For other Docker images, please refer to the table in Getting Started.
Note: If you are on Windows, ensure the project is on the Linux filesystem to avoid slow model loading. For more information, see the Microsoft Docs.
# Via API
curl http://localhost:8080/v1/models
# Via CLI
local-ai models list
Remove Models
Simply delete the model file and configuration from your models directory:
rm models/model-name.gguf
rm models/model-name.yaml # if exists
Troubleshooting
Model Not Loading
Check backend: Ensure the required backend is installed
local-ai backends list
local-ai backends install llama-cpp # if needed
Check logs: Enable debug mode
DEBUG=true local-ai
Verify file: Ensure the model file is not corrupted
Out of Memory
Use a smaller quantization (Q4_K_S or Q2_K)
Reduce context_size in configuration
Close other applications to free RAM
Wrong Backend
Check the Compatibility Table to ensure you’re using the correct backend for your model.
Best Practices
Start small: Begin with smaller models to test your setup
Use quantized models: Q4_K_M is a good balance for most use cases
Organize models: Keep your models directory organized
Backup configurations: Save your YAML configurations
Monitor resources: Watch RAM and disk usage
Try it out
Once LocalAI is installed, you can start it (either by using docker, or the cli, or the systemd service).
By default the LocalAI WebUI should be accessible from http://localhost:8080. You can also use 3rd party projects to interact with LocalAI as you would use OpenAI (see also Integrations ).
After installation, install new models by navigating the model gallery, or by using the local-ai CLI.
Tip
To install models with the WebUI, see the Models section.
With the CLI you can list the models with local-ai models list and install them with local-ai models install <model-name>.
You can also run models manually by copying files into the models directory.
You can test out the API endpoints using curl, few examples are listed below. The models we are referring here (gpt-4, gpt-4-vision-preview, tts-1, whisper-1) are the default models that come with the AIO images - you can also use any other model you have installed.
curl http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json"\
-d '{
"model": "tts-1",
"input": "The quick brown fox jumped over the lazy dog.",
"voice": "alloy"
}'\
--output speech.mp3
Get a vector representation of a given input that can be easily consumed by machine learning models and algorithms. OpenAI Embeddings.
curl http://localhost:8080/embeddings \
-X POST -H "Content-Type: application/json"\
-d '{
"input": "Your text string goes here",
"model": "text-embedding-ada-002"
}'
Tip
Don’t use the model file as model in the request unless you want to handle the prompt template for yourself.
Use the model names like you would do with OpenAI like in the examples below. For instance gpt-4-vision-preview, or gpt-4.
Customizing the Model
To customize the prompt template or the default settings of the model, a configuration file is utilized. This file must adhere to the LocalAI YAML configuration standards. For comprehensive syntax details, refer to the advanced documentation. The configuration file can be located either in the local filesystem or at a remote URL (such as a GitHub Gist).
LocalAI can be initiated using either its container image or binary, with a command that includes URLs of model config files or utilizes a shorthand format (like huggingface:// or github://), which is then expanded into complete URLs.
The configuration can also be set via an environment variable. For instance:
name: phi-2
context_size: 2048
f16: true
threads: 11
gpu_layers: 90
mmap: true
parameters:
  # Reference any HF model or a local file here
  model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
  temperature: 0.2
  top_k: 40
  top_p: 0.95
template:
  chat: &template |
    Instruct: {{.Input}}
    Output:
  # Modify the prompt template here ^^^ as per your requirements
  completion: *template
Then, launch LocalAI using your gist’s URL:
## Important! Substitute with your gist's URL!
docker run -p 8080:8080 localai/localai:v3.7.0 https://gist.githubusercontent.com/xxxx/phi-2.yaml
Next Steps
Visit the advanced section for more insights on prompt templates and configuration files.
Building LocalAI from source is an installation method that allows you to compile LocalAI yourself, which is useful for custom configurations, development, or when you need specific build options.
For complete build instructions, see the Build from Source documentation in the Installation section.
Run with container images
LocalAI provides a variety of images to support different environments. These images are available on quay.io and Docker Hub.
All-in-One images come with a pre-configured set of models and backends; standard images, instead, do not have any model pre-configured or installed.
For GPU acceleration on Nvidia graphics cards, use the Nvidia/CUDA images. If you don’t have a GPU, use the CPU images. If you have AMD or Apple Silicon, see the build section.
Tip
Available Images Types:
Images ending with -core are smaller images without pre-downloaded Python dependencies. Use these images if you plan to use the llama.cpp, stablediffusion-ncn or rwkv backends; if you are not sure which one to use, do not use these images.
Images containing the aio tag are all-in-one images with all the features enabled, and come with an opinionated set of configuration.
Prerequisites
Before you begin, ensure you have a container engine installed if you are not using the binaries. Suitable options include Docker or Podman. For installation instructions, refer to the following guides:
Hardware Requirements: The hardware requirements for LocalAI vary based on the model size and quantization method used. For performance benchmarks with different backends, such as llama.cpp, visit this link. The rwkv backend is noted for its lower resource consumption.
Standard container images
Standard container images do not have pre-installed models. Use these if you want to configure models manually.
These images are compatible with Nvidia ARM64 devices, such as the Jetson Nano, Jetson Xavier NX, and Jetson AGX Xavier. For more information, see the Nvidia L4T guide.
All-In-One images are images that come pre-configured with a set of models and backends to fully leverage almost all the LocalAI featureset. These images are available for both CPU and GPU environments. The AIO images are designed to be easy to use and require no configuration. Models configuration can be found here separated by size.
In the AIO images the models are configured with the names of OpenAI models; however, they are actually backed by open-source models. The mapping is shown in the table below.
Category | Model name | Real model (CPU) | Real model (GPU)
Text Generation | gpt-4 | phi-2 | hermes-2-pro-mistral
Multimodal Vision | gpt-4-vision-preview | bakllava | llava-1.6-mistral
Image Generation | stablediffusion | stablediffusion | dreamshaper-8
Speech to Text | whisper-1 | whisper with whisper-base model | <= same
Text to Speech | tts-1 | en-us-amy-low.onnx from rhasspy/piper | <= same
Embeddings | text-embedding-ada-002 | all-MiniLM-L6-v2 in Q4 | all-MiniLM-L6-v2
Usage
Select the image (CPU or GPU) and start the container with Docker:
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest-aio-cpu
LocalAI will automatically download all the required models, and the API will be available at localhost:8080.
Or with a docker-compose file:
version: "3.9"services:
api:
image: localai/localai:latest-aio-cpu# For a specific version:# image: localai/localai:v3.7.0-aio-cpu# For Nvidia GPUs decomment one of the following (cuda11 or cuda12):# image: localai/localai:v3.7.0-aio-gpu-nvidia-cuda-11# image: localai/localai:v3.7.0-aio-gpu-nvidia-cuda-12# image: localai/localai:latest-aio-gpu-nvidia-cuda-11# image: localai/localai:latest-aio-gpu-nvidia-cuda-12healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
interval: 1mtimeout: 20mretries: 5ports:
- 8080:8080environment:
- DEBUG=true# ...volumes:
- ./models:/models:cached# decomment the following piece if running with Nvidia GPUs# deploy:# resources:# reservations:# devices:# - driver: nvidia# count: 1# capabilities: [gpu]
Tip
Models caching: The AIO image will download the needed models on the first run if not already present and store those in /models inside the container. The AIO models will be automatically updated with new versions of AIO images.
You can change the directory inside the container by specifying a MODELS_PATH environment variable (or --models-path).
If you want to use a named model or a local directory, you can mount it as a volume to /models:
docker run -p 8080:8080 --name local-ai -ti -v $PWD/models:/models localai/localai:latest-aio-cpu
The AIO images inherit the same environment variables as the base images and the LocalAI environment (which you can inspect by running with --help). However, they support additional environment variables that are available only in the container image:
Variable | Default | Description
PROFILE | Auto-detected | The size of the model to use. Available: cpu, gpu-8g
MODELS | Auto-detected | A list of models YAML configuration file URIs/URLs (see also running models)
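For instance, to override the auto-detected profile and point the AIO image at your own model list, you could pass these variables when starting the container (a sketch; the model URL is a placeholder):
docker run -p 8080:8080 --name local-ai -ti \
  -e PROFILE=gpu-8g \
  -e MODELS="https://example.com/my-model.yaml" \
  localai/localai:latest-aio-gpu-nvidia-cuda-12   # add --gpus all when using a GPU image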
Watchdog for backends (#1341). As https://github.com/ggerganov/llama.cpp/issues/3969 is hitting LocalAI’s llama-cpp implementation, there is now a watchdog that can be used to make sure backends are not stalling. This is a generic mechanism that can now be enabled for all backends.
Due to the Python dependencies, the size of the images has grown.
If you still want to use smaller images without Python dependencies, you can use the corresponding image tags ending with -core.
This release brings the llama-cpp backend, a C++ backend tied to llama.cpp that tracks recent upstream versions more closely. It is not feature compatible with the current llama backend, but the plan is to sunset the current llama backend in favor of this one. This will probably be the last release containing the older llama backend written in Go and C++. The major improvement with this change is that there are fewer layers that could expose potential bugs, and it also eases maintenance.
Support for ROCm/HIPBLAS
This release brings support for AMD thanks to @65a. See more details in #1100.
More CLI commands
Thanks to @jespino, the local-ai binary now has more subcommands, allowing you to manage the gallery or try out inference directly. Check it out!
This is an exciting LocalAI release! Besides bug-fixes and enhancements this release brings the new backend to a whole new level by extending support to vllm and vall-e-x for audio generation!
Check out the documentation for vllm here and Vall-E-X here
Hey everyone, Ettore here, I’m so happy to share this release out - while this summer is hot apparently doesn’t stop LocalAI development :)
This release brings a lot of new features, bugfixes and updates! Also a big shout out to the community, this was a great release!
Attention 🚨
From this release the llama backend supports only gguf files (see #943). LocalAI however still supports ggml files. We ship a version of llama.cpp from before that change in a separate backend, named llama-stable, to allow loading ggml files. If you were specifying the llama backend manually to load ggml files, from this release you should use llama-stable instead, or do not specify a backend at all (LocalAI will handle this automatically).
Image generation enhancements
The Diffusers backend now has various enhancements, including support for generating images from images, longer prompts, and more kernel schedulers. See the Diffusers documentation for more information.
Lora adapters
Now it’s possible to load LoRA adapters for llama.cpp. See #955 for more information.
Device management
It is now possible, on single-GPU devices, to specify --single-active-backend to allow only one backend to be active at a time (#925).
Community spotlight
Resources management
Thanks to continuous community efforts (another cool contribution from dave-gray101), it’s now possible to shut down a backend programmatically via the API.
There is an ongoing effort in the community to better handle resources. See also the 🔥Roadmap.
New how-to section
Thanks to community efforts we now have a new how-to website with various examples on how to use LocalAI. This is a great starting point for new users! We are currently working on improving it; a huge shout out to lunamidori5 from the community for the impressive efforts on this!
💡 More examples!
Open source autopilot? See the new addition by gruberdev in our examples on how to use Continue with LocalAI!
feat: pre-configure LocalAI galleries by mudler in #886
🐶 Bark
Bark is a text-prompted generative audio model - it combines GPT techniques to generate Audio from text. It is a great addition to LocalAI, and it’s available in the container images by default.
It can also generate music, see the example: lion.webm
🦙 AutoGPTQ
AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
It is mainly targeted at GPU usage. Check out the documentation for usage.
🦙 Exllama
Exllama is “a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights”. It is a faster alternative for running LLaMA models on GPU. Check out the Exllama documentation for usage.
🧨 Diffusers
Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. It is currently experimental and supports generation of images only, so you might encounter issues on models which weren’t tested yet. Check out the Diffusers documentation for usage.
🔑 API Keys
Thanks to the community contributions now it’s possible to specify a list of API keys that can be used to gate API requests.
API Keys can be specified with the API_KEY environment variable as a comma-separated list of keys.
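A minimal sketch of gating the API with two keys (the key values are illustrative):
API_KEY="key-one,key-two" local-ai run
curl http://localhost:8080/v1/models -H "Authorization: Bearer key-one"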
🖼️ Galleries
Now by default the model-gallery repositories are configured in the container images
💡 New project
LocalAGI is a simple agent that uses LocalAI functions to have a full locally runnable assistant (with no API keys needed).
See it here in action planning a trip for San Francisco!
feat(llama2): add template for chat messages by dave-gray101 in #782
Note
From this release, to use the OpenAI functions you need to use the llama-grammar backend. A llama backend has been added for tracking llama.cpp master, and llama-grammar for the grammar functionalities that have not yet been merged upstream. See also OpenAI functions. Until the feature is merged we will have two llama backends.
Huggingface embeddings
In this release it is now possible to specify external gRPC backends to LocalAI that can be used for inferencing (#778). It is now possible to write backends in any language, and a huggingface-embeddings backend is now available in the container image to be used with https://github.com/UKPLab/sentence-transformers. See also Embeddings.
LLaMa 2 has been released!
Thanks to community effort, LocalAI now supports templating for LLaMa 2! More at #782, until we update the model gallery with LLaMa 2 models!
The former, ggml-based backend has been renamed to falcon-ggml.
Default pre-compiled binaries
From this release the default behavior of images has changed. Compilation is no longer triggered automatically on start; to recompile local-ai from scratch on start and switch back to the old behavior, set REBUILD=true in the environment variables. Rebuilding can be necessary if your CPU and/or architecture is old and the pre-compiled binaries are not compatible with your platform. See the build section for more information.
Add Text-to-Audio generation with go-piper by mudler in #649
See API endpoints in our documentation.
Add gallery repository by mudler in #663. See models for documentation.
Container images
Standard (GPT + stablediffusion): quay.io/go-skynet/local-ai:v1.20.0
FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-ffmpeg
CUDA 11+FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-gpu-nvidia-cuda11-ffmpeg
CUDA 12+FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-gpu-nvidia-cuda12-ffmpeg
Updates
Updates to llama.cpp, go-transformers, gpt4all.cpp and rwkv.cpp.
The NUMA option was enabled by mudler in #684, along with many new parameters (mmap, mmlock, ...). See advanced for the full list of parameters.
Gallery repositories
In this release there is support for gallery repositories. These are repositories that contain models, and can be used to install models. The default gallery which contains only freely licensed models is in Github: https://github.com/go-skynet/model-gallery, but you can use your own gallery by setting the GALLERIES environment variable. An automatic index of huggingface models is available as well.
For example, now you can start LocalAI with the following environment variable to use both galleries:
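A sketch of what that environment variable might look like (the gallery names and URLs here follow the repositories mentioned above, but treat the exact values and the startup command as illustrative):
GALLERIES='[{"name":"model-gallery","url":"github:go-skynet/model-gallery/index.yaml"},{"name":"huggingface","url":"github:go-skynet/model-gallery/huggingface.yaml"}]' local-ai run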
Now LocalAI uses piper and go-piper to generate audio from text. This is an experimental feature, and it requires GO_TAGS=tts to be set during build. It is enabled by default in the pre-built container images.
Full CUDA GPU offload support ( PR by mudler. Thanks to chnyda for handing over the GPU access, and lu-zero to help in debugging )
Full GPU Metal Support is now fully functional. Thanks to Soleblaze to iron out the Metal Apple silicon support!
Container images:
Standard (GPT + stablediffusion): quay.io/go-skynet/local-ai:v1.19.2
FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-ffmpeg
CUDA 11+FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-gpu-nvidia-cuda11-ffmpeg
CUDA 12+FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-gpu-nvidia-cuda12-ffmpeg
🔥🔥🔥 06-06-2023: v1.18.0 🚀
This LocalAI release is full of new features, bugfixes and updates! Thanks to the community for the help, this was a great community release!
We now support a vast variety of models while remaining backward compatible with prior quantization formats: this new release can still load older formats as well as the new k-quants!
New features
✨ Added support for falcon-based model families (7b) ( mudler )
✨ Experimental support for Metal Apple Silicon GPU - ( mudler and thanks to Soleblaze for testing! ). See the build section.
✨ Support for token stream in the /v1/completions endpoint ( samm81 )
🆙 Bloomz has been updated to the latest ggml changes, including new quantization format ( mudler )
🆙 RWKV has been updated to the new quantization format( mudler )
🆙 k-quants format support for the llama models ( mudler )
🆙 gpt4all has been updated, incorporating upstream changes allowing to load older models, and with different CPU instruction set (AVX only, AVX2) from the same binary! ( mudler )
23-05-2023: v1.15.0 released. The go-gpt2.cpp backend was renamed to go-ggml-transformers.cpp and updated, including https://github.com/ggerganov/llama.cpp/pull/1508 which breaks compatibility with older models. This impacts RedPajama, GptNeoX, MPT (not gpt4all-mpt), Dolly, GPT2 and Starcoder based models. Binary releases available, various fixes, including #341.
21-05-2023: v1.14.0 released. Minor updates to the /models/apply endpoint, llama.cpp backend updated including https://github.com/ggerganov/llama.cpp/pull/1508 which breaks compatibility with older models. gpt4all is still compatible with the old format.
19-05-2023: v1.13.0 released! 🔥🔥 Updates to the gpt4all and llama backends, consolidated CUDA support (#310, thanks to @bubthegreat and @Thireus), preliminary support for installing models via API.
17-05-2023: v1.12.0 released! 🔥🔥 Minor fixes, plus CUDA (#258) support for llama.cpp-compatible models and image generation (#272).
16-05-2023: 🔥🔥🔥 Experimental support for CUDA (#258) in the llama.cpp backend and Stable Diffusion CPU image generation (#272) in master.
13-05-2023: v1.11.0 released! 🔥 Updated llama.cpp bindings: This update includes a breaking change in the model files ( https://github.com/ggerganov/llama.cpp/pull/1405 ) - old models should still work with the gpt4all-llama backend.
12-05-2023: v1.10.0 released! 🔥🔥 Updated gpt4all bindings. Added support for GPTNeox (experimental), RedPajama (experimental), Starcoder (experimental), Replit (experimental), MosaicML MPT. Also now embeddings endpoint supports tokens arrays. See the langchain-chroma example! Note - this update does NOT include https://github.com/ggerganov/llama.cpp/pull/1405 which makes models incompatible.
11-05-2023: v1.9.0 released! 🔥 Important whisper updates (#233, #229) and extended gpt4all model families support (#232). Redpajama/dolly experimental (#214)
10-05-2023: v1.8.0 released! 🔥 Added support for fast and accurate embeddings with bert.cpp (#222)
09-05-2023: Added experimental support for transcriptions endpoint (#211)
08-05-2023: Support for embeddings with models using the llama.cpp backend (#207)
02-05-2023: Support for rwkv.cpp models (#158) and for the /edits endpoint
01-05-2023: Support for SSE stream of tokens in llama.cpp backends (#152)
Chapter 8
Features
LocalAI provides a comprehensive set of features for running AI models locally. This section covers all the capabilities and functionalities available in LocalAI.
Core Features
Text Generation - Generate text with GPT-compatible models using various backends
Image Generation - Create images with Stable Diffusion and other diffusion models
Audio Processing - Transcribe audio to text and generate speech from text
Embeddings - Generate vector embeddings for semantic search and RAG applications
GPT Vision - Analyze and understand images with vision-language models
Advanced Features
OpenAI Functions - Use function calling and tools API with local models
Model Gallery - Browse and install pre-configured models
Backends - Learn about available backends and how to manage them
Getting Started
To start using these features, make sure you have LocalAI installed and have downloaded some models. Then explore the feature pages above to learn how to use each capability.
Subsections of Features
⚙️ Backends
LocalAI supports a variety of backends that can be used to run different types of AI models. There are core Backends which are included, and there are containerized applications that provide the runtime environment for specific model types, such as LLMs, diffusion models, or text-to-speech models.
Managing Backends in the UI
The LocalAI web interface provides an intuitive way to manage your backends:
Navigate to the “Backends” section in the navigation menu
Browse available backends from configured galleries
Use the search bar to find specific backends by name, description, or type
Filter backends by type using the quick filter buttons (LLM, Diffusion, TTS, Whisper)
Install or delete backends with a single click
Monitor installation progress in real-time
Each backend card displays:
Backend name and description
Type of models it supports
Installation status
Action buttons (Install/Delete)
Additional information via the info button
Backend Galleries
Backend galleries are repositories that contain backend definitions. They work similarly to model galleries but are specifically for backends.
Adding a Backend Gallery
You can add backend galleries by setting the LOCALAI_BACKEND_GALLERIES environment variable:
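A hedged sketch of what that could look like (the gallery name and URL are placeholders; check the backends documentation for the exact index location):
LOCALAI_BACKEND_GALLERIES='[{"name":"my-backend-gallery","url":"https://example.com/backend-index.yaml"}]' local-ai run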
This section contains instructions on how to use LocalAI with GPU acceleration.
Details
Acceleration for AMD or Metal hardware is still in development; for additional details see the build section.
Automatic Backend Detection
When you install a model from the gallery (or a YAML file), LocalAI intelligently detects the required backend and your system’s capabilities, then downloads the correct version for you. Whether you’re running on a standard CPU, an NVIDIA GPU, an AMD GPU, or an Intel GPU, LocalAI handles it automatically.
For advanced use cases or to override auto-detection, you can use the LOCALAI_FORCE_META_BACKEND_CAPABILITY environment variable. Here are the available options:
default: Forces CPU-only backend. This is the fallback if no specific hardware is detected.
nvidia: Forces backends compiled with CUDA support for NVIDIA GPUs.
amd: Forces backends compiled with ROCm support for AMD GPUs.
intel: Forces backends compiled with SYCL/oneAPI support for Intel GPUs.
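For example, to force the CUDA-enabled backends on a machine where auto-detection does not pick them up, you could set the variable when starting the container (a sketch; adjust the image tag and GPU flags to your setup):
docker run -p 8080:8080 --gpus all \
  -e LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia \
  --name local-ai -ti localai/localai:latest-gpu-nvidia-cuda-12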
Model configuration
Depending on the model architecture and backend used, there might be different ways to enable GPU acceleration. It is required to configure the model you intend to use with a YAML config file. For example, for llama.cpp workloads a configuration file might look like this (where gpu_layers is the number of layers to offload to the GPU):
name: my-model-name
parameters:
  # Relative to the models path
  model: llama.cpp-model.ggmlv3.q5_K_M.bin
context_size: 1024
threads: 1
f16: true # enable with GPU acceleration
gpu_layers: 22 # GPU Layers (only used when built with cublas)
For diffusers, the configuration might look like this instead:
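The original snippet is not reproduced on this page, so the following is only a rough sketch of a diffusers-style configuration (the model name, scheduler and the cuda/f16 flags are assumptions; see the Diffusers backend documentation for the authoritative options):
name: my-diffusion-model
backend: diffusers
parameters:
  model: some-diffusers-model   # hypothetical Hugging Face repository
cuda: true                      # enable GPU acceleration (assumed flag for this backend)
f16: true
diffusers:
  scheduler_type: euler_a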
CUDA 11 tags: master-gpu-nvidia-cuda-11, v1.40.0-gpu-nvidia-cuda-11, …
CUDA 12 tags: master-gpu-nvidia-cuda-12, v1.40.0-gpu-nvidia-cuda-12, …
In addition to the commands to run LocalAI normally, you need to specify --gpus all to docker, for example:
docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v1.40.0-gpu-nvidia-cuda12
If the GPU inferencing is working, you should be able to see something like:
5:22PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla T4
llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 4321.77 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1598 MB
...................................................................................................
llama_init_from_file: kv self size = 512.00 MB
ROCM(AMD) acceleration
There are a limited number of tested configurations for ROCm systems; however, most newer dedicated consumer-grade GPUs appear to be supported under the current ROCm 6 implementation.
Due to the nature of ROCm, it is best to run all implementations in containers, as this limits the number of packages required for installation on the host system. Compatibility and package versions for dependencies across all OS variations must be tested independently if desired; please refer to the build documentation.
Install on the host: amdgpu-dkms and rocm >= 6.0.0, as per the ROCm documentation.
Recommendations
Make sure not to use the GPU assigned for compute for desktop rendering.
Ensure at least 100GB of free space on the disk hosting the container runtime and storing the images prior to installation.
Limitations
Ongoing verification testing of ROCm compatibility with integrated backends.
Please note the following list of verified backends and devices.
LocalAI hipblas images are built against the following targets: gfx900,gfx906,gfx908,gfx940,gfx941,gfx942,gfx90a,gfx1030,gfx1031,gfx1100,gfx1101
If your device is not one of these you must specify the corresponding GPU_TARGETS and specify REBUILD=true. Otherwise you don’t need to specify these in the commands below.
Verified
The devices in the following list have been tested with hipblas images running ROCm 6.0.0
Backend | Verified | Devices
llama.cpp | yes | Radeon VII (gfx906)
diffusers | yes | Radeon VII (gfx906)
piper | yes | Radeon VII (gfx906)
whisper | no | none
bark | no | none
coqui | no | none
transformers | no | none
exllama | no | none
exllama2 | no | none
mamba | no | none
sentencetransformers | no | none
transformers-musicgen | no | none
vall-e-x | no | none
vllm | no | none
You can help by expanding this list.
System Prep
Check your GPU LLVM target is compatible with the version of ROCm. This can be found in the LLVM Docs.
Check which ROCm version is compatible with your LLVM target and your chosen OS (pay special attention to supported kernel versions). See the following for compatibility for (ROCm 6.0.0) or (ROCm 6.0.2)
Install your chosen version of the dkms and rocm packages (it is recommended to use the native package manager for this process on any OS, as version changes are easier to execute via this method if updates are required). Take care to restart after installing amdgpu-dkms and before installing rocm; for details, see the installation documentation for your chosen OS (6.0.2 or 6.0.0).
Deploy. Yes it’s that easy.
Setup Example (Docker/containerd)
The following are examples of the ROCm specific configuration elements required.
# For full functionality select a non-'core' image, version locking the image is recommended for debug purposes.
image: quay.io/go-skynet/local-ai:master-aio-gpu-hipblas
environment:
  - DEBUG=true
  # If your gpu is not already included in the current list of default targets the following build details are required.
  - REBUILD=true
  - BUILD_TYPE=hipblas
  - GPU_TARGETS=gfx906 # Example for Radeon VII
devices:
  # AMD GPU only require the following devices be passed through to the container for offloading to occur.
  - /dev/dri
  - /dev/kfd
The same can also be expressed as a run command for your container runtime.
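As a rough equivalent (a sketch only; the image tag comes from the compose example above, and you would add REBUILD/BUILD_TYPE/GPU_TARGETS if your GPU needs them):
docker run -p 8080:8080 --name local-ai -ti \
  -e DEBUG=true \
  --device /dev/dri --device /dev/kfd \
  quay.io/go-skynet/local-ai:master-aio-gpu-hipblas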
Please ensure you add all other required environment variables, port forwards, etc. to your compose file or run command.
The rebuild process will take some time to complete when deploying these containers. It is recommended that you pull the image prior to deployment, as depending on the version these images may be ~20GB in size.
Example (k8s) (Advanced Deployment/WIP)
For k8s deployments there is an additional step required before deployment: deploying the ROCm/k8s-device-plugin.
For any k8s environment, the documentation provided by AMD for the ROCm project should work. If you use rke2 or OpenShift, it is recommended that you deploy the SUSE or Red Hat provided version of this resource to ensure compatibility.
After this has been completed the helm chart from go-skynet can be configured and deployed mostly un-edited.
The following are details of the changes that should be made to ensure proper function.
While these details may be configurable in the values.yaml, development of this Helm chart is ongoing and is subject to change.
The following details indicate the final state of the localai deployment relevant to GPU function.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {NAME}-local-ai
  ...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - env:
            - name: HIP_VISIBLE_DEVICES
              value: '0'
              # This variable indicates the devices available to container (0:device1 1:device2 2:device3) etc.
              # For multiple devices (say device 1 and 3) the value would be equivalent to HIP_VISIBLE_DEVICES="0,2"
              # Please take note of this when an iGPU is present in host system as compatibility is not assured.
          ...
          resources:
            limits:
              amd.com/gpu: '1'
            requests:
              amd.com/gpu: '1'
This configuration has been tested on a ‘custom’ cluster managed by SUSE Rancher that was deployed on top of Ubuntu 22.04.4, certification of other configuration is ongoing and compatibility is not guaranteed.
Notes
When installing the ROCm kernel driver on your system, ensure that you are installing a version equal to or newer than the one currently implemented in LocalAI (6.0.0 at time of writing).
AMD documentation indicates that this will ensure functionality however your mileage may vary depending on the GPU and distro you are using.
If you encounter an Error 413 when attempting to upload an audio file or image for whisper or llava/bakllava on a k8s deployment, note that the ingress for your deployment may require the annotation nginx.ingress.kubernetes.io/proxy-body-size: "25m" to allow larger uploads. This may be included in future versions of the Helm chart.
Intel acceleration (sycl)
Requirements
If building from source, you need to install Intel oneAPI Base Toolkit and have the Intel drivers available in the system.
Container images
To use SYCL, use the images with gpu-intel in the tag, for example v3.7.0-gpu-intel, …
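A hedged example of starting an Intel GPU image with the render device passed through (the exact tag and device path may differ on your system):
docker run -p 8080:8080 --name local-ai -ti \
  --device /dev/dri \
  localai/localai:latest-gpu-intel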
LocalAI supports generating text with GPT using llama.cpp and other backends (such as rwkv.cpp); see also Model compatibility for an up-to-date list of the supported model families.
Note:
You can also specify the model name as part of the OpenAI token.
If only one model is available, the API will use it for all the requests.
To generate a completion, you can send a POST request to the /v1/completions endpoint with the instruction as per the request body:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
Available additional parameters: top_p, top_k, max_tokens
List models
You can list all the models available with:
curl http://localhost:8080/v1/models
Backends
RWKV
RWKV support is available through llama.cpp (see below)
llama.cpp
llama.cpp is a popular port of Facebook’s LLaMA model in C/C++.
Note
The ggml file format has been deprecated. If you are using ggml models, use a LocalAI version older than v2.25.0. For gguf models, use the llama backend. The go backend is deprecated as well but still available as go-llama.
Features
The llama.cpp backend supports prompt templates, which are useful for models that are fine-tuned towards a specific prompt.
Automatic setup
LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for ggml or gguf models.
For instance, if you have the galleries enabled and LocalAI already running, you can just start chatting with models in huggingface by running:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.1
}'
LocalAI will automatically download and configure the model in the model directory.
Models can be also preloaded or downloaded on demand. To learn about model galleries, check out the model gallery documentation.
YAML configuration
To use the llama.cpp backend, specify llama-cpp as the backend in the YAML file:
name: llama
backend: llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf
Backend Options
The llama.cpp backend supports additional configuration options that can be specified in the options field of your model YAML configuration. These options allow fine-tuning of the backend behavior:
Option | Type | Description | Example
use_jinja or jinja | boolean | Enable Jinja2 template processing for chat templates. When enabled, the backend uses Jinja2-based chat templates from the model for formatting messages. | use_jinja:true
context_shift | boolean | Enable context shifting, which allows the model to dynamically adjust context window usage. | context_shift:true
cache_ram | integer | Set the maximum RAM cache size in MiB for the KV cache. Use -1 for unlimited (default). | cache_ram:2048
parallel or n_parallel | integer | Enable parallel request processing. When set to a value greater than 1, enables continuous batching for handling multiple requests concurrently. | parallel:4
grpc_servers or rpc_servers | string | Comma-separated list of gRPC server addresses for distributed inference. Allows distributing workload across multiple llama.cpp workers. |
Note: The parallel option can also be set via the LLAMACPP_PARALLEL environment variable, and grpc_servers can be set via the LLAMACPP_GRPC_SERVERS environment variable. Options specified in the YAML file take precedence over environment variables.
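As a sketch of how these options could be combined in a model configuration (the values are illustrative; the option strings follow the key:value form shown in the Example column above):
name: llama
backend: llama-cpp
parameters:
  model: file.gguf
options:
  - use_jinja:true
  - parallel:4
  - cache_ram:2048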
Exllama is “a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights”. Both exllama and exllama2 are supported.
Model setup
Download the model as a folder inside the model directory and create a YAML file specifying the exllama backend. For instance with the TheBloke/WizardLM-7B-uncensored-GPTQ model:
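The original YAML snippet is not reproduced here, so the following is only a rough sketch of what such a configuration could look like (the backend name and folder path are assumptions based on the surrounding text):
name: exllama
backend: exllama
parameters:
  # folder downloaded from TheBloke/WizardLM-7B-uncensored-GPTQ, relative to the models path
  model: WizardLM-7B-uncensored-GPTQ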
The backend will automatically download the required files in order to run the model.
Usage
Use the completions endpoint by specifying the vllm backend:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "vllm",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
Transformers
Transformers is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.
LocalAI has a built-in integration with Transformers, and it can be used to run models.
This is an extra backend: it is already available in the container images (the extra images already contain the Python dependencies for Transformers), and there is nothing to do for the setup.
Setup
Create a YAML file for the model you want to use with transformers.
To set up a model, you just need to specify the model name in the YAML config file:
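The original example is not included on this page, so here is a minimal sketch (the model id is a placeholder; the type field is described in the Parameters table below):
name: transformers
backend: transformers
parameters:
  model: "facebook/opt-125m"   # hypothetical Hugging Face model id
type: AutoModelForCausalLM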
The backend will automatically download the required files in order to run the model.
Parameters
Type | Description
AutoModelForCausalLM | AutoModelForCausalLM is a model that can be used to generate sequences. Use it for NVIDIA CUDA and Intel GPU with Intel Extensions for Pytorch acceleration
OVModelForCausalLM | for Intel CPU/GPU/NPU OpenVINO Text Generation models
OVModelForFeatureExtraction | for Intel CPU/GPU/NPU OpenVINO Embedding acceleration
N/A | Defaults to AutoModel
OVModelForCausalLM requires OpenVINO IR Text Generation models from Hugging face
OVModelForFeatureExtraction works with any Safetensors Transformer Feature Extraction model from Huggingface (Embedding Model)
Please note that streaming is currently not implemented in AutoModelForCausalLM for Intel GPU.
AMD GPU support is not implemented.
Although AMD CPU is not officially supported by OpenVINO, there are reports that it works: YMMV.
Embeddings
Use embeddings: true if the model is an embedding model
Inference device selection
The Transformers backend tries to automatically select the best device for inference; you can override this decision manually with the main_gpu parameter.
Inference Engine | Applicable Values
CUDA | cuda, cuda.X where X is the GPU device like in nvidia-smi -L output
OpenVINO | Any applicable value from Inference Modes like AUTO, CPU, GPU, NPU, MULTI, HETERO
Example for CUDA:
main_gpu: cuda.0
Example for OpenVINO:
main_gpu: AUTO:-CPU
This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.
Inference Precision
The Transformers backend automatically selects the fastest applicable inference precision according to device support.
With the CUDA backend you can manually enable bfloat16, if your hardware supports it, with the following parameter:
f16: true
Quantization
Quantization | Description
bnb_8bit | 8-bit quantization
bnb_4bit | 4-bit quantization
xpu_8bit | 8-bit quantization for Intel XPUs
xpu_4bit | 4-bit quantization for Intel XPUs
Trust Remote Code
Some models, like Microsoft Phi-3, require external code beyond what is provided by the transformers library.
By default this is disabled for security.
It can be manually enabled with:
trust_remote_code: true
Maximum Context Size
The maximum context size (in tokens) can be specified with the context_size parameter. Do not use values higher than what your model supports.
Usage example:
context_size: 8192
Auto Prompt Template
Usually the chat template is defined by the model author in the tokenizer_config.json file.
To enable it use the use_tokenizer_template: true parameter in the template section.
Usage example:
template:
  use_tokenizer_template: true
Custom Stop Words
Stop words are usually defined in the tokenizer_config.json file.
They can be overridden with the stopwords parameter when needed, as for example with the Llama 3 Instruct model.
Usage example:
stopwords:
- "<|eot_id|>"
- "<|end_of_text|>"
Usage
Use the completions endpoint by specifying the transformers model:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "transformers",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
Examples
OpenVINO
A model configuration file for OpenVINO and the Starling model:
A reranking model, often referred to as a cross-encoder, is a core component in the two-stage retrieval systems used in information retrieval and natural language processing tasks.
Given a query and a set of documents, it will output similarity scores.
We can then use the score to reorder the documents by relevance in our RAG system, to increase its overall accuracy and filter out non-relevant results.
LocalAI supports reranker models through the rerankers backend, which is based on the rerankers library.
Usage
You can test rerankers by using container images with python (this does NOT work with core images) and a model config file like this, or by installing cross-encoder from the gallery in the UI:
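Once a cross-encoder model is installed, a rerank request can be sent along these lines (a sketch assuming a Jina-style /v1/rerank endpoint and the cross-encoder model name from the gallery):

curl http://localhost:8080/v1/rerank -H "Content-Type: application/json" -d '{
  "model": "cross-encoder",
  "query": "Organic skincare products for sensitive skin",
  "documents": [
    "Eco-friendly kitchenware for modern homes",
    "Natural organic skincare range for sensitive skin",
    "New makeup trends focus on bold colors"
  ],
  "top_n": 2
}'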
aplay is a Linux command. You can use other tools to play the audio file.
The model name is the filename with the extension.
The model name is case sensitive.
LocalAI must be compiled with the GO_TAGS=tts flag.
Transformers-musicgen
LocalAI also has experimental support for transformers-musicgen for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:
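For example, a request of this shape (the backend and model names below follow the backend's naming, but treat the exact fields as a sketch):

curl --request POST \
  --url http://localhost:8080/tts \
  --header 'Content-Type: application/json' \
  --data '{
    "backend": "transformers-musicgen",
    "model": "facebook/musicgen-small",
    "input": "80s synth playing an arpeggio"
  }'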
Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.
Vall-E-X
VALL-E-X is an open source implementation of Microsoft’s VALL-E X zero-shot TTS model.
Setup
The backend will automatically download the required files in order to run the model.
This is an extra backend: it is already available in the container images, so there is nothing to do for the setup. If you are building manually, you need to install Vall-E-X first.
Usage
Use the tts endpoint by specifying the vall-e-x backend:
In order to use voice cloning capabilities you must create a YAML configuration file to setup a model:
name: cloned-voice
backend: vall-e-x
parameters:
  model: "cloned-voice"
tts:
  vall-e:
    # The path to the audio file to be cloned
    # relative to the models directory
    # Max 15s
    audio_path: "audio-sample.wav"
Then you can specify the model name in the requests:
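For instance, using the cloned-voice model defined above (a sketch following the tts request shape shown below):

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "cloned-voice",
  "input": "Hello, this is my cloned voice."
}' | aplay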
You can also use a config-file to specify TTS models and their parameters.
In the following example we define a custom config to load the xtts_v2 model, and specify a voice and language.
name: xtts_v2
backend: coqui
parameters:
  language: fr
  model: tts_models/multilingual/multi-dataset/xtts_v2
tts:
  voice: Ana Florence
With this config, you can now use the following curl command to generate a text-to-speech audio file:
curl -L http://localhost:8080/tts \
-H "Content-Type: application/json"\
-d '{
"model": "xtts_v2",
"input": "Bonjour, je suis Ana Florence. Comment puis-je vous aider?"
}' | aplay
Response format
To provide some compatibility with the OpenAI API's response_format parameter, ffmpeg must be installed (or a Docker image including ffmpeg must be used) so the generated WAV file can be converted before the API returns its response.
Warning regarding a change in behaviour: before this addition, the parameter was ignored and a WAV file was always returned, with potential codec errors later in the integration (like trying to decode an MP3, the default format used by OpenAI, from what was actually a WAV file).
Formats supported thanks to ffmpeg are wav, mp3, aac, flac, and opus, defaulting to wav if an unknown format or no format is provided.
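For example, assuming the OpenAI-style response_format field, an MP3 can be requested like this:

curl -L http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "xtts_v2",
    "input": "Bonjour, je suis Ana Florence.",
    "response_format": "mp3"
  }' -o output.mp3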
Note: To set a negative prompt, you can split the prompt with |, for instance: a cute baby sea otter|malformed.
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
"size": "256x256"
}'
Backends
stablediffusion-ggml
This backend is based on stable-diffusion.cpp. Every model supported by that project is also supported by LocalAI.
Setup
There are already several models in the gallery that you can install to get up and running with this backend. For example, you can run Flux by searching for it in the model gallery (flux.1-dev-ggml) or by starting LocalAI with:
local-ai run flux.1-dev-ggml
To use a custom model, you can follow these steps:
Create a model file stablediffusion.yaml in the models folder (a minimal sketch is shown after these steps)
Download the required assets to the models repository
Start LocalAI
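A minimal sketch of such a stablediffusion.yaml (the GGUF file name is a placeholder for whatever assets you downloaded):

name: stablediffusion
backend: stablediffusion-ggml
parameters:
  model: stable-diffusion-v1-5-Q8_0.gguf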
Diffusers
Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. LocalAI has a diffusers backend which allows image generation using the diffusers library.
This is an extra backend: it is already available in the container images, so there is nothing to do for the setup. Do not use core images (ending with -core). If you are building manually, see the build instructions.
Model setup
Models will be downloaded automatically from Hugging Face the first time you use the backend.
Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl on CPU:
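A minimal sketch (the exact nesting of the diffusers-specific options is an assumption; see the parameter table below for the available options):

name: animagine-xl
backend: diffusers
f16: false
parameters:
  model: Linaqruf/animagine-xl
diffusers:
  cuda: false
  scheduler_type: euler_a
  pipeline_type: StableDiffusionXLPipeline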
The following parameters are available in the configuration file:
| Parameter | Description | Default |
|-----------|-------------|---------|
| f16 | Force the usage of float16 instead of float32 | false |
| step | Number of steps to run the model for | 30 |
| cuda | Enable CUDA acceleration | false |
| enable_parameters | Parameters to enable for the model | negative_prompt,num_inference_steps,clip_skip |
| scheduler_type | Scheduler type | k_dpp_sde |
| cfg_scale | Configuration scale | 8 |
| clip_skip | Clip skip | None |
| pipeline_type | Pipeline type | AutoPipelineForText2Image |
| lora_adapters | A list of lora adapters (file names relative to the model directory) to apply | None |
| lora_scales | A list of lora scales (floats) to apply | None |
Several types of schedulers are available:
| Scheduler | Description |
|-----------|-------------|
| ddim | DDIM |
| pndm | PNDM |
| heun | Heun |
| unipc | UniPC |
| euler | Euler |
| euler_a | Euler a |
| lms | LMS |
| k_lms | LMS Karras |
| dpm_2 | DPM2 |
| k_dpm_2 | DPM2 Karras |
| dpm_2_a | DPM2 a |
| k_dpm_2_a | DPM2 a Karras |
| dpmpp_2m | DPM++ 2M |
| k_dpmpp_2m | DPM++ 2M Karras |
| dpmpp_sde | DPM++ SDE |
| k_dpmpp_sde | DPM++ SDE Karras |
| dpmpp_2m_sde | DPM++ 2M SDE |
| k_dpmpp_2m_sde | DPM++ 2M SDE Karras |
Available pipeline types:
| Pipeline type | Description |
|---------------|-------------|
| StableDiffusionPipeline | Stable Diffusion pipeline |
| StableDiffusionImg2ImgPipeline | Stable Diffusion image-to-image pipeline |
| StableDiffusionDepth2ImgPipeline | Stable Diffusion depth-to-image pipeline |
| DiffusionPipeline | Diffusion pipeline |
| StableDiffusionXLPipeline | Stable Diffusion XL pipeline |
| StableVideoDiffusionPipeline | Stable Video Diffusion pipeline |
| AutoPipelineForText2Image | Automatic detection pipeline for text-to-image |
| VideoDiffusionPipeline | Video diffusion pipeline |
| StableDiffusion3Pipeline | Stable Diffusion 3 pipeline |
| FluxPipeline | Flux pipeline |
| FluxTransformer2DModel | Flux transformer 2D model |
| SanaPipeline | Sana pipeline |
Advanced: Additional parameters
Additional arbitrary parameters can be specified in the options field as key/value pairs separated by a colon (:):
name: animagine-xl
options:
- "cfg_scale:6"
Note: there is no complete parameter list. Any parameter can be passed arbitrarily and is forwarded directly to the pipeline as an argument. Different pipelines/implementations support different parameters.
The example above will result in the following Python code when generating images:
pipe(
    prompt="A cute baby sea otter",  # Options passed via API
    size="256x256",                  # Options passed via API
    cfg_scale=6                      # Additional parameter passed via the configuration file
)
Usage
Text to Image
Use the image generation endpoint with the model name from the configuration file:
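For instance, reusing the animagine-xl configuration from above (model name and size are just examples):

curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "model": "animagine-xl",
  "prompt": "A cute baby sea otter",
  "size": "512x512"
}'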
LocalAI supports object detection through various backends. This feature allows you to identify and locate objects within images with high accuracy and real-time performance. Currently, RF-DETR is available as an implementation.
Overview
Object detection in LocalAI is implemented through dedicated backends that can identify and locate objects within images. Each backend provides different capabilities and model architectures.
Key Features:
Real-time object detection
High accuracy detection with bounding boxes
Support for multiple hardware accelerators (CPU, NVIDIA GPU, Intel GPU, AMD GPU)
Structured detection results with confidence scores
Easy integration through the /v1/detection endpoint
Usage
Detection Endpoint
LocalAI provides a dedicated /v1/detection endpoint for object detection tasks. This endpoint is specifically designed for object detection and returns structured detection results with bounding boxes and confidence scores.
API Reference
To perform object detection, send a POST request to the /v1/detection endpoint:
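A sketch of such a request is shown below; the request field names (in particular image) are assumptions, so check the endpoint reference for your version. The response contains detection objects that use the fields described next:

curl http://localhost:8080/v1/detection -H "Content-Type: application/json" -d '{
  "model": "rfdetr-base",
  "image": "https://example.com/street-scene.jpg"
}'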
x, y: Coordinates of the bounding box top-left corner
width, height: Dimensions of the bounding box
confidence: Detection confidence score (0.0 to 1.0)
class_name: The detected object class
Backends
RF-DETR Backend
The RF-DETR backend is implemented as a Python-based gRPC service that integrates seamlessly with LocalAI. It provides object detection capabilities using the RF-DETR model architecture and supports multiple hardware configurations:
CPU: Optimized for CPU inference
NVIDIA GPU: CUDA acceleration for NVIDIA GPUs
Intel GPU: Intel oneAPI optimization
AMD GPU: ROCm acceleration for AMD GPUs
NVIDIA Jetson: Optimized for ARM64 NVIDIA Jetson devices
Setup
Using the Model Gallery (Recommended)
The easiest way to get started is using the model gallery. The rfdetr-base model is available in the official LocalAI gallery:
# Install and run the rfdetr-base model
local-ai run rfdetr-base
You can also install it through the web interface by navigating to the Models section and searching for “rfdetr-base”.
Manual Configuration
Create a model configuration file in your models directory:
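A minimal sketch, assuming the backend is registered as rfdetr and reusing the gallery model name (both are assumptions):

name: rfdetr-base
backend: rfdetr
parameters:
  model: rfdetr-base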
Verify model compatibility with your backend version
Low Detection Accuracy
Ensure good image quality and lighting
Check if objects are clearly visible
Consider using a larger model for better accuracy
Slow Performance
Enable GPU acceleration if available
Use a smaller model for faster inference
Optimize image resolution
Debug Mode
Enable debug logging for troubleshooting:
local-ai run --debug rfdetr-base
Object Detection Category
LocalAI includes a dedicated object-detection category for models and backends that specialize in identifying and locating objects within images. This category currently includes:
Additional object detection models and backends will be added to this category in the future. You can filter models by the object-detection tag in the model gallery to find all available object detection models.
The sentencetransformers backend is an optional backend of LocalAI and uses Python. If you are running LocalAI from the container images, it is already configured and ready to use.
For local execution, you also have to specify the extra backend in the EXTERNAL_GRPC_BACKENDS environment variable.
The sentencetransformers backend only supports embeddings of text, not of tokens. If you need to embed tokens, you can use the bert backend or llama.cpp.
No models are required to be downloaded before using the sentencetransformers backend. The models will be downloaded automatically the first time the API is used.
Llama.cpp embeddings
Embeddings with llama.cpp are supported with the llama-cpp backend; it needs to be enabled by setting embeddings to true, for example:
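A minimal sketch (the GGUF file name is a placeholder for an embedding model placed in your models directory):

name: my-embeddings
backend: llama-cpp
embeddings: true
parameters:
  model: bert-embeddings.gguf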
The chat endpoint supports the grammar parameter, which allows users to specify a grammar in Backus-Naur Form (BNF). This feature enables the Large Language Model (LLM) to generate outputs adhering to a user-defined schema, such as JSON, YAML, or any other format that can be defined using BNF. For more details about BNF, see Backus-Naur Form on Wikipedia.
Note
Compatibility Notice: This feature is only supported by models that use the llama.cpp backend. For a complete list of compatible models, refer to the Model Compatibility page. For technical details, see the related pull requests: PR #1773 and PR #1887.
Setup
To use this feature, follow the installation and setup instructions on the LocalAI Functions page. Ensure that your local setup meets all the prerequisites specified for the llama.cpp backend.
💡 Usage Example
The following example demonstrates how to use the grammar parameter to constrain the model’s output to either “yes” or “no”. This can be particularly useful in scenarios where the response format needs to be strictly controlled.
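A minimal sketch of such a request (the model name is just an example; the grammar string matches the one used in the vision example further below):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Is the sky blue? Answer yes or no."}],
  "grammar": "root ::= (\"yes\" | \"no\")"
}'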
In this example, the grammar parameter is set to a simple choice between “yes” and “no”, ensuring that the model’s response adheres strictly to one of these options regardless of the context.
Example: JSON Output Constraint
You can also use grammars to enforce JSON output format:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Generate a person object with name and age"}],
"grammar": "root ::= \"{\" \"\\\"name\\\":\" string \",\\\"age\\\":\" number \"}\"\nstring ::= \"\\\"\" [a-z]+ \"\\\"\"\nnumber ::= [0-9]+"
}'
This functionality enables LocalAI to distribute inference requests across multiple worker nodes, improving efficiency and performance. Nodes are automatically discovered and connect via p2p by using a shared token which makes sure the communication is secure and private between the nodes of the network.
LocalAI supports two modes of distributed inferencing via p2p:
Federated Mode: Requests are shared between the cluster and routed to a single worker node in the network based on the load balancer’s decision.
Worker Mode (aka "model sharding" or "splitting weights"): Requests are processed by all the workers, which contribute to the final inference result (by sharing the model weights).
A list of global instances shared by the community is available at explorer.localai.io.
Usage
Starting LocalAI with --p2p generates a shared token for connecting multiple instances, and that's all you need to create AI clusters, eliminating the need for intricate network setups.
Simply navigate to the “Swarm” section in the WebUI and follow the on-screen instructions.
For fully shared instances, start LocalAI with --p2p --federated and follow the Swarm section's guidance. This feature, while still experimental, offers a tech preview quality experience.
Federated mode
Federated mode allows you to launch multiple LocalAI instances and connect them together in a federated network. This mode is useful when you want to distribute the load of inference across multiple nodes, but want a single point of entry for the API. In the Swarm section of the WebUI, you can see the instructions to connect multiple instances together.
To start a LocalAI server in federated mode, run:
local-ai run --p2p --federated
This will generate a token that you can use to connect other LocalAI instances to the network or others can use to join the network. If you already have a token, you can specify it using the TOKEN environment variable.
To start a load-balanced server that routes requests to the network, run the following with the TOKEN environment variable set to the shared token:
TOKEN=<token> local-ai federated
To see all the available options, run local-ai federated --help.
The instructions are displayed in the “Swarm” section of the WebUI, guiding you through the process of connecting multiple instances.
Workers mode
Note
This feature is available exclusively with llama-cpp compatible models.
(Note: You can also supply the token via command-line arguments)
The server logs should indicate that new workers are being discovered.
Start inference as usual on the server initiated in step 1.
Environment Variables
There are options that can be tweaked and parameters that can be set using the following environment variables:
| Environment Variable | Description |
|----------------------|-------------|
| LOCALAI_P2P | Set to "true" to enable p2p |
| LOCALAI_FEDERATED | Set to "true" to enable federated mode |
| FEDERATED_SERVER | Set to "true" to enable federated server |
| LOCALAI_P2P_DISABLE_DHT | Set to "true" to disable DHT and make the p2p layer local only (mDNS) |
| LOCALAI_P2P_ENABLE_LIMITS | Set to "true" to enable connection limits and resource management (useful when running with poor connectivity or to limit resource consumption) |
| LOCALAI_P2P_LISTEN_MADDRS | Set to a comma separated list of multiaddresses to override the default libp2p 0.0.0.0 multiaddresses |
| LOCALAI_P2P_DHT_ANNOUNCE_MADDRS | Set to a comma separated list of multiaddresses to override announcing of listen multiaddresses (useful when the external address:port is remapped) |
| LOCALAI_P2P_BOOTSTRAP_PEERS_MADDRS | Set to a comma separated list of multiaddresses to specify custom DHT bootstrap nodes |
| LOCALAI_P2P_TOKEN | Set the token for the p2p network |
| LOCALAI_P2P_LOGLEVEL | Set the loglevel for the LocalAI p2p stack (default: info) |
| LOCALAI_P2P_LIB_LOGLEVEL | Set the loglevel for the underlying libp2p stack (default: fatal) |
Architecture
LocalAI uses https://github.com/libp2p/go-libp2p under the hood, the same project powering IPFS. Unlike other frameworks, LocalAI's peer-to-peer layer has no single master server; instead it uses pub/sub gossip and ledger functionality to achieve consensus across the different peers.
EdgeVPN is used as a library to establish the network and expose the ledger functionality under a shared token, to ease automatic discovery and keep peer-to-peer networks separated and private.
In worker mode the weights are split proportionally to the available memory, while in federated mode each request is routed to a single node, which has to load the model fully.
Debugging
To debug, it’s often useful to run in debug mode, for instance:
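For example (a sketch combining the DEBUG flag with the p2p log level variable documented above):

DEBUG=true LOCALAI_P2P_LOGLEVEL=debug local-ai run --p2p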
Audio to text models are models that can generate text from an audio file.
The transcription endpoint allows you to convert audio files to text. The endpoint is based on whisper.cpp, a C++ library for audio transcription. The endpoint accepts as input all the audio formats supported by ffmpeg.
Usage
Once LocalAI is started and whisper models are installed, you can use the /v1/audio/transcriptions API endpoint.
The transcriptions endpoint can then be tested like so:
## Get an example audio file
wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg

## Send the example audio file to the transcriptions endpoint
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@$PWD/gb1.ogg" -F model="whisper-1"

## Result
{"text":"My fellow Americans, this day has brought terrible news and great sadness to our country.At nine o'clock this morning, Mission Control in Houston lost contact with our Space ShuttleColumbia.A short time later, debris was seen falling from the skies above Texas.The Columbia's lost.There are no survivors.One board was a crew of seven.Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain DavidBrown, Commander William McCool, Dr. Kultna Shavla, and Elon Ramon, a colonel in the IsraeliAir Force.These men and women assumed great risk in the service to all humanity.In an age when spaceflight has come to seem almost routine, it is easy to overlook thedangers of travel by rocket and the difficulties of navigating the fierce outer atmosphere ofthe Earth.These astronauts knew the dangers, and they faced them willingly, knowing they had a highand noble purpose in life.Because of their courage and daring and idealism, we will miss them all the more.All Americans today are thinking as well of the families of these men and women who havebeen given this sudden shock and grief.You're not alone.Our entire nation agrees with you, and those you loved will always have the respect andgratitude of this country.The cause in which they died will continue.Mankind has led into the darkness beyond our world by the inspiration of discovery andthe longing to understand.Our journey into space will go on.In the skies today, we saw destruction and tragedy.As farther than we can see, there is comfort and hope.In the words of the prophet Isaiah, \"Lift your eyes and look to the heavens who createdall these, he who brings out the starry hosts one by one and calls them each by name.\"Because of his great power and mighty strength, not one of them is missing.The same creator who names the stars also knows the names of the seven souls we mourntoday.The crew of the shuttle Columbia did not return safely to Earth yet we can pray that all aresafely home.May God bless the grieving families and may God continue to bless America.[BLANK_AUDIO]"}
Function calls are automatically mapped to grammars, which are currently supported only by llama.cpp. However, it is possible to turn off the use of grammars and instead extract the tool arguments from the LLM responses, by specifying no_grammar and a regex to map the response in the YAML file:
name: model_name
parameters:
  # Model file name
  model: model/name
function:
  # set to true to not use grammars
  no_grammar: true
  # set one or more regexes used to extract the function tool arguments from the LLM response
  response_regex:
  - "(?P<function>\w+)\s*\((?P<arguments>.*)\)"
The response regex has to use named capture groups so that the function name and the arguments can be extracted. For instance, consider:
(?P<function>\w+)\s*\((?P<arguments>.*)\)
will catch
function_name({ "foo": "bar"})
Parallel tools calls
This feature is experimental and has to be configured in the YAML of the model by enabling function.parallel_calls:
name: gpt-3.5-turbo
parameters:
  # Model file name
  model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
  temperature: 0.1
function:
  # set to true to allow the model to call multiple functions in parallel
  parallel_calls: true
Use functions with grammar
It is possible to also specify the full function signature (for debugging, or to use with other clients).
The chat endpoint accepts the grammar_json_functions additional parameter, which takes a JSON schema object. For example:
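A sketch of such a request (the schema content is entirely up to you; the function/arguments structure below is just an example):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "How is the weather in San Francisco?"}],
  "grammar_json_functions": {
    "oneOf": [
      {
        "type": "object",
        "properties": {
          "function": {"const": "get_weather"},
          "arguments": {
            "type": "object",
            "properties": {
              "location": {"type": "string"}
            }
          }
        }
      }
    ]
  }
}'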
Grammars and function tools can be used as well in conjunction with vision APIs:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llava", "grammar": "root ::= (\"yes\" | \"no\")",
"messages": [{"role": "user", "content": [{"type":"text", "text": "Is there some grass in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
💡 Examples
A full e2e example with docker-compose is available here.
💾 Stores
Stores are an experimental feature to help with querying data using similarity search. It is a low-level API that consists of only get, set, delete and find.
For example, if you have an embedding of some text and want to find text with similar embeddings, you can create embeddings for chunks of all your text and then compare them against the embedding of the text you are searching on.
An embedding here means a vector of numbers that represents some information about the text. Embeddings are created by an AI model such as BERT, or by a more traditional method such as word frequency.
Previously you would have to integrate with an external vector database or library directly.
With the stores feature you can now do it through the LocalAI API.
Note however that doing a similarity search on embeddings is just one way to do retrieval. A higher level
API can take this into account, so this may not be the best place to start.
API overview
There is an internal gRPC API and an external facing HTTP JSON API. We’ll just discuss the external HTTP API,
however the HTTP API mirrors the gRPC API. Consult pkg/store/client for internal usage.
Everything is in columnar format, meaning that instead of getting an array of objects each with a key and a value, you get two separate arrays of keys and values.
Keys are arrays of floating point numbers with a maximum width of 32 bits. Values are strings (in gRPC they are bytes).
The key vectors must all be the same length, and it's best for search performance if they are normalized. When adding keys, it will be detected whether they are normalized and what length they are.
All endpoints accept a store field which specifies which store to operate on. Presently they are created
on the fly and there is only one store backend so no configuration is required.
topk limits the number of results returned. The result value is the same as for get, except that it also includes an array of similarities, where 1.0 is the maximum similarity. Results are returned in order from most similar to least.
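For illustration, assuming the store endpoints are mounted under /stores (set, get, delete, find), a set followed by a find might look like this:

# Store two (already normalized) embedding vectors with their source text
curl http://localhost:8080/stores/set -H "Content-Type: application/json" -d '{
  "keys": [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]],
  "values": ["first chunk of text", "second chunk of text"]
}'

# Return the two entries most similar to a query embedding
curl http://localhost:8080/stores/find -H "Content-Type: application/json" -d '{
  "key": [0.1, 0.2, 0.3],
  "topk": 2
}'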
🖼️ Model gallery
The model gallery is a curated collection of model configurations for LocalAI that enables one-click installation of models directly from the LocalAI web interface.
To ease model installation, LocalAI provides a way to preload models on start and to download and install them at runtime. You can install models manually by copying them into the models directory, or use the API or the web interface to configure, download, and verify the model assets for you.
Note
The models in this gallery are not directly maintained by LocalAI. If you find a model that is not working, please open an issue on the model gallery repository.
Note
GPT and text generation models might have a license which is not permissive for commercial use or might be questionable or without any license at all. Please check the model license before using it. The official gallery contains only open licensed models.
Useful Links and resources
Open LLM Leaderboard - here you can find a list of the most performing models on the Open LLM benchmark. Keep in mind models compatible with LocalAI must be quantized in the gguf format.
How it works
Navigate the WebUI interface in the “Models” section from the navbar at the top. Here you can find a list of models that can be installed, and you can install them by clicking the “Install” button.
Add other galleries
You can add other galleries by setting the GALLERIES environment variable. The GALLERIES environment variable is a list of JSON objects, where each object has a name and a url field. The name field is the name of the gallery, and the url field is the URL of the gallery’s index file, for example:
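For instance (reusing the default gallery index referenced below):

GALLERIES='[{"name":"localai", "url":"github:mudler/localai/gallery/index.yaml"}]'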
where github:mudler/localai/gallery/index.yaml will be expanded automatically to https://raw.githubusercontent.com/mudler/LocalAI/main/index.yaml.
Note: the URLs are expanded automatically for github and huggingface; the https:// and http:// prefixes work as well.
Note
If you want to build your own gallery, there is no documentation yet. However you can find the source of the default gallery in the LocalAI repository.
List Models
To list all the available models, use the /models/available endpoint:
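For example:

curl http://localhost:8080/models/available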
Models can be installed either by passing the full URL of the YAML config file or by passing an identifier of the model in the gallery. The gallery is a repository of models that can be installed by passing the model name.
To install a model from the gallery repository, you can pass the model name in the id field. For instance, to install the bert-embeddings model, you can use the following command:
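A sketch of such a call, using the <GALLERY>@<MODEL_NAME> identifier format described below:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "localai@bert-embeddings"
}'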
localai is the repository. It is optional and can be omitted. If the repository is omitted, LocalAI will search for the model by name in all the repositories. If the same model name is present in multiple galleries, the first match wins.
bert-embeddings is the model name in the gallery
(read its config here).
How to install a model not part of a gallery
If you don’t want to set any gallery repository, you can still install models by loading a model configuration file.
In the body of the request you must specify the model configuration file URL (url), optionally a name to install the model with (name), extra files to install (files), and configuration overrides (overrides). When calling the API endpoint, LocalAI will download the model's files and write the configuration to the folder used to store models.
To preload models on start instead, set the PRELOAD_MODELS environment variable to a JSON array of model URIs:
PRELOAD_MODELS='[{"url": "<MODEL_URL>"}]'
Note: either url or id must be specified. url points to a model gallery configuration, while id refers to a model inside a repository. If both are specified, the id will be used.
While the API is running, you can install the model by using the /models/apply endpoint and point it to the stablediffusion model in the models-gallery:
LocalAI will create a batch process that downloads the required files from a model definition and automatically reload itself to include the new model.
Input: url or id (required), name (optional), files (optional)
An optional list of additional files to download can be specified in files. The name field allows you to override the model name. Finally, it is possible to override the model config file with overrides.
The url is a full URL, or a github url (github:org/repo/file.yaml), or a local file (file:///path/to/file.yaml).
The id is a string in the form <GALLERY>@<MODEL_NAME>, where <GALLERY> is the name of the gallery, and <MODEL_NAME> is the name of the model in the gallery. Galleries can be specified during startup with the GALLERIES environment variable.
Returns a uuid and a url to follow up on the state of the process:
LocalAI now supports the Model Context Protocol (MCP), enabling powerful agentic capabilities by connecting AI models to external tools and services. This feature allows your LocalAI models to interact with various MCP servers, providing access to real-time data, APIs, and specialized tools.
What is MCP?
The Model Context Protocol is a standard for connecting AI models to external tools and data sources. It enables AI agents to:
Access real-time information from external APIs
Execute commands and interact with external systems
Use specialized tools for specific tasks
Maintain context across multiple tool interactions
Key Features
🔄 Real-time Tool Access: Connect to external MCP servers for live data
🛠️ Multiple Server Support: Configure both remote HTTP and local stdio servers
⚡ Cached Connections: Efficient tool caching for better performance
🔒 Secure Authentication: Support for bearer token authentication
🎯 OpenAI Compatible: Uses the familiar /mcp/v1/chat/completions endpoint
🧠 Advanced Reasoning: Configurable reasoning and re-evaluation capabilities
📋 Auto-Planning: Break down complex tasks into manageable steps
🎯 MCP Prompts: Specialized prompts for better MCP server interaction
🔄 Plan Re-evaluation: Dynamic plan adjustment based on results
⚙️ Flexible Agent Control: Customizable execution limits and retry behavior
Configuration
MCP support is configured in your model’s YAML configuration file using the mcp section:
enable_plan_re_evaluator: Enable plan re-evaluation (default: false)
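As a rough illustration only, a configuration might look like the sketch below. The agent options are the ones described later on this page, but the exact schema of the mcp server entries is an assumption; check the MCP reference for your LocalAI version.

name: my-agentic-model
# ... other model settings ...
mcp:
  # Schema of the server entries below is an assumption
  remote: |
    {
      "mcpServers": {
        "weather": {
          "url": "https://example.com/mcp",
          "token": "MY_BEARER_TOKEN"
        }
      }
    }
agent:
  max_attempts: 3
  max_iterations: 5
  enable_reasoning: true
  enable_planning: true
  enable_mcp_prompts: true
  enable_plan_re_evaluator: false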
Usage
API Endpoint
Use the MCP-enabled completion endpoint:
curl http://localhost:8080/mcp/v1/chat/completions \
-H "Content-Type: application/json"\
-d '{
"model": "my-agentic-model",
"messages": [
{"role": "user", "content": "What is the current weather in New York?"}
],
"temperature": 0.7
}'
Example Response
{
  "id": "chatcmpl-123",
  "created": 1699123456,
  "model": "my-agentic-model",
  "choices": [
    {
      "text": "The current weather in New York is 72°F (22°C) with partly cloudy skies. The humidity is 65% and there's a light breeze from the west at 8 mph."
    }
  ],
  "object": "text_completion"
}
The agent section controls how the AI model interacts with MCP tools:
Execution Control
max_attempts: Limits how many times a tool can be retried if it fails. Higher values provide more resilience but may increase response time.
max_iterations: Controls the maximum number of reasoning cycles the agent can perform. More iterations allow for complex multi-step problem solving.
Reasoning Capabilities
enable_reasoning: When enabled, the agent uses advanced reasoning to better understand tool results and plan next steps.
Planning Capabilities
enable_planning: When enabled, the agent uses auto-planning to break down complex tasks into manageable steps and execute them systematically. The agent will automatically detect when planning is needed.
enable_mcp_prompts: When enabled, the agent uses specialized prompts exposed by the MCP servers to interact with the exposed tools.
enable_plan_re_evaluator: When enabled, the agent can re-evaluate and adjust its execution plan based on intermediate results.
Tool Discovery: LocalAI connects to configured MCP servers and discovers available tools
Tool Caching: Tools are cached per model for efficient reuse
Agent Execution: The AI model uses the Cogito framework to execute tools
Response Generation: The model generates responses incorporating tool results
Supported MCP Servers
LocalAI is compatible with any MCP-compliant server.
Best Practices
Security
Use environment variables for sensitive tokens
Validate MCP server endpoints before deployment
Implement proper authentication for remote servers
Performance
Cache frequently used tools
Use appropriate timeout values for external APIs
Monitor resource usage for stdio servers
Error Handling
Implement fallback mechanisms for tool failures
Log tool execution for debugging
Handle network timeouts gracefully
With External Applications
Use MCP-enabled models in your applications:
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/mcp/v1",
    api_key="your-api-key"
)

response = client.chat.completions.create(
    model="my-agentic-model",
    messages=[
        {"role": "user", "content": "Analyze the latest research papers on AI"}
    ]
)
MCP and adding packages
It might be handy to install packages before starting the container to set up the environment. This is an example of how you can do that with docker-compose (installing and configuring docker):
Feel free to open a pull request (by clicking the "Edit page" link below) to get a page made for your project, or if you see an error on one of the pages!
Chapter 20
Advanced
Subsections of Advanced
Advanced usage
Model Configuration with YAML Files
LocalAI uses YAML configuration files to define model parameters, templates, and behavior. You can create individual YAML files in the models directory or use a single configuration file with multiple models.
You can use a default template for every model present in your model path, by creating a corresponding file with the `.tmpl` suffix next to your model. For instance, if the model is called `foo.bin`, you can create a sibling file, `foo.bin.tmpl` which will be used as a default prompt and can be used with alpaca:
The below instruction describes a task. Write a response that appropriately completes the request.
### Instruction:
{{.Input}}
### Response:
See the prompt-templates directory in this repository for templates for some of the most popular models.
For the edit endpoint, an example template for alpaca-based models can be:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{{.Instruction}}

### Input:
{{.Input}}

### Response:
Install models using the API
Instead of installing models manually, you can use the LocalAI API endpoints and a model definition to install models programmatically at runtime.
A curated collection of model files is in the model-gallery. The files of the model gallery are different from the model files used to configure LocalAI models. The model gallery files contain information about the model setup, and the files necessary to run the model locally.
To install, for example, lunademo, you can send a POST call to the /models/apply endpoint with the model definition URL (url) and, optionally, the name the model should have in LocalAI (name):
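A sketch of such a call (the model definition URL is a placeholder):

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "url": "<MODEL_DEFINITION_URL>",
  "name": "lunademo"
}'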
PRELOAD_MODELS (or --preload-models) takes a JSON list with the same parameters as the API calls to the /models/apply endpoint.
Similarly, a path to a YAML configuration file containing a list of models can be specified with PRELOAD_MODELS_CONFIG (or --preload-models-config):
LocalAI can automatically cache prompts for faster loading. This can be useful if your model needs a prompt template with prefixed text in the prompt before the input.
To enable prompt caching, you can control the settings in the model config YAML file:
prompt_cache_path: "cache"prompt_cache_all: true
prompt_cache_path is relative to the models folder. you can enter here a name for the file that will be automatically create during the first load if prompt_cache_all is set to true.
Configuring a specific backend for the model
By default LocalAI will try to autoload the model by trying all the backends. This might work for most models, but some backends are NOT configured to autoload.
In order to specify a backend for your models, create a model config file in your models directory specifying the backend:
name: gpt-3.5-turbo
parameters:
  # Relative to the models path
  model: ...
backend: llama-stable
Connect external backends
LocalAI backends are internally implemented using gRPC services. This also allows LocalAI to connect to external gRPC services on start and extend LocalAI functionalities via third-party binaries.
The --external-grpc-backends parameter in the CLI can be used either to specify a local backend (a file) or a remote URL. The syntax is <BACKEND_NAME>:<BACKEND_URI>. Once LocalAI is started with it, the new backend name will be available for all the API endpoints.
So for instance, to register a new backend which is a local file:
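For example, following the <BACKEND_NAME>:<BACKEND_URI> syntax (paths and backend name are illustrative):

./local-ai run --external-grpc-backends "my-awesome-backend:/path/to/backend/binary"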
| Environment Variable | Default | Description |
|----------------------|---------|-------------|
| | | Special token for interacting with HuggingFace Inference API, required only when using the langchain-huggingface backend |
| EXTRA_BACKENDS | | A space separated list of backends to prepare. For example EXTRA_BACKENDS="backend/python/diffusers backend/python/transformers" prepares the Python environment on start |
| DISABLE_AUTODETECT | false | Disable autodetect of CPU flagset on start |
| LLAMACPP_GRPC_SERVERS | | A list of llama.cpp workers to distribute the workload. For example LLAMACPP_GRPC_SERVERS="address1:port,address2:port" |
Here is how to configure these variables:
docker run --env REBUILD=true localai
docker run --env-file .env localai
CLI Parameters
For a complete reference of all CLI parameters, environment variables, and command-line options, see the CLI Reference page.
You can control LocalAI with command line arguments to specify a binding address, number of threads, model paths, and many other options. Any command line parameter can be specified via an environment variable.
.env files
Any settings being provided by an Environment Variable can also be provided from within .env files. There are several locations that will be checked for relevant .env files. In order of precedence they are:
.env within the current directory
localai.env within the current directory
localai.env within the home directory
.config/localai.env within the home directory
/etc/localai.env
Environment variables within files earlier in the list will take precedence over environment variables defined in files later in the list.
You can send the 'Extra-Usage' request header ('Extra-Usage: true') to receive inference timings in milliseconds, extending the default OpenAI response model in the usage field:
LocalAI can be extended with extra backends. The backends are implemented as gRPC services and can be written in any language. See the backend section for more details on how to install and build new backends for LocalAI.
In runtime
When using the -core container image, it is possible to prepare the Python backends you are interested in by using the EXTRA_BACKENDS variable, for instance:
docker run --env EXTRA_BACKENDS="backend/python/diffusers" quay.io/go-skynet/local-ai:master
Concurrent requests
LocalAI supports parallel requests for the backends that support it. For instance, vLLM and llama.cpp support parallel requests, and thus LocalAI allows running multiple requests in parallel.
In order to enable parallel requests, you have to pass --parallel-requests or set the PARALLEL_REQUEST environment variable to true.
The environment variables that tweak parallelism are the following:
### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This actually controls whether a backend can process multiple requests or not.
### Define the number of parallel LLAMA.cpp workers (Defaults to 1)
### Enable to run parallel requests
Note that for llama.cpp you need to set LLAMACPP_PARALLEL accordingly to the number of parallel processes your GPU/CPU can handle. For Python-based backends (like vLLM) you can set PYTHON_GRPC_MAX_WORKERS to the number of parallel requests.
VRAM and Memory Management
For detailed information on managing VRAM when running multiple models, see the dedicated VRAM and Memory Management page.
Disable CPU flagset auto detection in llama.cpp
LocalAI will automatically discover the CPU flagset available in your host and will use the most optimized version of the backends.
If you want to disable this behavior, you can set DISABLE_AUTODETECT to true in the environment variables.
Fine-tuning LLMs for text generation
Note
Section under construction
This section covers how to fine-tune a language model for text generation and consume it in LocalAI.
Requirements
For this example you will need a GPU with at least 12GB of VRAM and a Linux box.
Fine-tuning
Fine-tuning a language model is a process that requires a lot of computational power and time.
Currently LocalAI doesn't support the fine-tuning endpoint, but there are plans to support that. For the time being, a guide is proposed here to give a simple starting point on how to fine-tune a model and use it with LocalAI (but also with llama.cpp).
There is an e2e example of fine-tuning a LLM model to use with LocalAI written by @mudler available here.
The steps involved are:
Preparing a dataset
Prepare the environment and install dependencies
Fine-tune the model
Merge the Lora base with the model
Convert the model to gguf
Use the model with LocalAI
Dataset preparation
We are going to need a dataset or a set of datasets.
Axolotl supports a variety of formats. In the notebook and in this example we are aiming for a very simple dataset built manually, so we are going to use the completion format, which requires the full text to be used for fine-tuning.
A dataset for an instructor model (like Alpaca) can look like the following:
[
  {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
  },
  {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
  }
]
Every text block is the whole text used for fine-tuning. For example, for an instruct model it follows this format (more or less):
<System prompt>
## Instruction
<Question, instruction>
## Response
<Expected response from the LLM>
The instruction format works like this: when running inference with the model, we feed it only the first part, up to the ## Instruction block, and the model completes the text with the ## Response block.
Prepare a dataset, and upload it to your Google Drive if you are using Google Colab. Otherwise, place it next to the axolotl.yaml file as dataset.json.
We will need to configure Axolotl. This example provides an axolotl.yaml file that uses openllama-3b for fine-tuning. Copy the axolotl.yaml file and edit it to your needs. The dataset needs to be next to it as dataset.json. You can find the axolotl.yaml file here.
If you have a big dataset, you can pre-tokenize it to speedup the fine-tuning process:
python -m axolotl.cli.preprocess axolotl.yaml
Now we are ready to start the fine-tuning process:
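With a standard Axolotl setup, the training step is typically launched like this (treat the exact invocation as an assumption and refer to the linked example/notebook for the full workflow, including merging the LoRA and converting to gguf):

accelerate launch -m axolotl.cli.train axolotl.yaml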
Now you should have ended up with a custom-model-q4_0.gguf file that you can copy into the LocalAI models directory and use with LocalAI.
VRAM and Memory Management
When running multiple models in LocalAI, especially on systems with limited GPU memory (VRAM), you may encounter situations where loading a new model fails because there isn’t enough available VRAM. LocalAI provides two mechanisms to automatically manage model memory allocation and prevent VRAM exhaustion.
The Problem
By default, LocalAI keeps models loaded in memory once they’re first used. This means:
If you load a large model that uses most of your VRAM, subsequent requests for other models may fail
Models remain in memory even when not actively being used
There’s no automatic mechanism to unload models to make room for new ones, unless done manually via the web interface
This is a common issue when working with GPU-accelerated models, as VRAM is typically more limited than system RAM. For more context, see issues #6068, #7269, and #5352.
Solution 1: Single Active Backend
The simplest approach is to ensure only one model is loaded at a time. When a new model is requested, LocalAI will automatically unload the currently active model before loading the new one.
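Assuming the LOCALAI_SINGLE_ACTIVE_BACKEND environment variable (or the corresponding CLI flag) is available in your version, this can be enabled like so:

LOCALAI_SINGLE_ACTIVE_BACKEND=true local-ai run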
For more flexible memory management, LocalAI provides watchdog mechanisms that automatically unload models based on their activity state. This allows multiple models to be loaded simultaneously, but automatically frees memory when models become inactive or stuck.
Idle Watchdog
The idle watchdog monitors models that haven’t been used for a specified period and automatically unloads them to free VRAM.
The busy watchdog monitors models that have been processing requests for an unusually long time and terminates them if they exceed a threshold. This is useful for detecting and recovering from stuck or hung backends.
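Assuming the watchdog settings follow the LOCALAI_WATCHDOG_* environment variable naming (an assumption; check local-ai run --help for the exact flags), enabling both watchdogs might look like:

LOCALAI_WATCHDOG_IDLE=true LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m \
LOCALAI_WATCHDOG_BUSY=true LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \
local-ai run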
See Backend Flags for all available backend configuration options
Model Configuration
LocalAI uses YAML configuration files to define model parameters, templates, and behavior. This page provides a complete reference for all available configuration options.
Besides llama based models, LocalAI is compatible also with other architectures. The table below lists all the backends, compatible models families and the associated repository.
Note
LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See the advanced section for more details.
NVIDIA CUDA: CUDA 11.7, CUDA 12.0 support across most backends
AMD ROCm: HIP-based acceleration for AMD GPUs
Intel oneAPI: SYCL-based acceleration for Intel GPUs (F16/F32 precision)
Vulkan: Cross-platform GPU acceleration
Metal: Apple Silicon GPU acceleration (M1/M2/M3+)
Specialized Hardware
NVIDIA Jetson (L4T): ARM64 support for embedded AI
Apple Silicon: Native Metal acceleration for Mac M1/M2/M3+
Darwin x86: Intel Mac support
CPU Optimization
AVX/AVX2/AVX512: Advanced vector extensions for x86
Quantization: 4-bit, 5-bit, 8-bit integer quantization support
Mixed Precision: F16/F32 mixed precision support
Note: any backend name listed above can be used in the backend field of the model configuration file (See the advanced section).
* Only for CUDA and OpenVINO CPU/XPU acceleration.
Architecture
LocalAI is an API written in Go that serves as an OpenAI shim, enabling software already developed with OpenAI SDKs to seamlessly integrate with LocalAI. It can be effortlessly implemented as a substitute, even on consumer-grade hardware. This capability is achieved by employing various C++ backends, including ggml, to perform inference on LLMs using both CPU and, if desired, GPU. Internally, LocalAI backends are just gRPC servers; you can specify and build your own gRPC server and extend LocalAI at runtime as well. It is possible to specify external gRPC servers and/or binaries that LocalAI will manage internally.
LocalAI uses a mixture of backends written in various languages (C++, Golang, Python, …). You can check the model compatibility table to learn about all the components of LocalAI.
Backstory
As with many typical open source projects, I, mudler, was fiddling around with llama.cpp over my long nights and wanted to have a way to call it from Go, as I am a Golang developer and use it extensively. So I created LocalAI (or what was initially known as llama-cli) and added an API to it.
But guess what? The more I dived into this rabbit hole, the more I realized that I had stumbled upon something big. With all the fantastic C++ projects floating around the community, it dawned on me that I could piece them together to create a full-fledged OpenAI replacement. So, ta-da! LocalAI was born, and it quickly overshadowed its humble origins.
Now, why did I choose to go with C++ bindings, you ask? Well, I wanted to keep LocalAI snappy and lightweight, allowing it to run like a champ on any system, avoid any Golang GC penalties, and, most importantly, build on the shoulders of giants like llama.cpp. Go is good at backends and APIs and is easy to maintain. And hey, don't forget that I'm all about sharing the love. That's why I made LocalAI MIT licensed, so everyone can hop on board and benefit from it.
As if that wasn’t exciting enough, as the project gained traction, mkellerman and Aisuko jumped in to lend a hand. mkellerman helped set up some killer examples, while Aisuko is becoming our community maestro. The community now is growing even more with new contributors and users, and I couldn’t be happier about it!
Oh, and let’s not forget the real MVP here—llama.cpp. Without this extraordinary piece of software, LocalAI wouldn’t even exist. So, a big shoutout to the community for making this magic happen!
CLI Reference
Complete reference for all LocalAI command-line interface (CLI) parameters and environment variables.
Note: All CLI flags can also be set via environment variables. Environment variables take precedence over CLI flags. See .env files for configuration file support.
Global Flags
| Parameter | Default | Description | Environment Variable |
|-----------|---------|-------------|----------------------|
| -h, --help | | Show context-sensitive help | |
| --log-level | info | Set the level of logs to output [error,warn,info,debug,trace] | $LOCALAI_LOG_LEVEL |
| --debug | false | DEPRECATED - Use --log-level=debug instead. Enable debug logging | $LOCALAI_DEBUG, $DEBUG |
Storage Flags
| Parameter | Default | Description | Environment Variable |
|-----------|---------|-------------|----------------------|
| --models-path | BASEPATH/models | Path containing models used for inferencing | $LOCALAI_MODELS_PATH, $MODELS_PATH |
| --generated-content-path | /tmp/generated/content | Location for assets generated by backends (e.g. stablediffusion, images, audio, videos) | |
| | | If not empty, add that string to the Machine-Tag header in each response. Useful to track responses from different machines when using multiple P2P federated nodes | $LOCALAI_MACHINE_TAG, $MACHINE_TAG |
Hardening Flags
| Parameter | Default | Description | Environment Variable |
|-----------|---------|-------------|----------------------|
| --disable-predownload-scan | false | If true, disables the best-effort security scanner before downloading any files | $LOCALAI_DISABLE_PREDOWNLOAD_SCAN |
| --opaque-errors | false | If true, all error responses are replaced with blank 500 errors. This is intended only for hardening against information leaks and is normally not recommended | $LOCALAI_OPAQUE_ERRORS |
| --use-subtle-key-comparison | false | If true, API Key validation comparisons will be performed using constant-time comparisons rather than simple equality. This trades off performance on each request for resilience against timing attacks | $LOCALAI_SUBTLE_KEY_COMPARISON |
| --disable-api-key-requirement-for-http-get | false | If true, a valid API key is not required to issue GET requests to portions of the web UI. This should only be enabled in secure testing environments | |
| | | If --disable-api-key-requirement-for-http-get is overridden to true, this is the list of endpoints to exempt. Only adjust this in case of a security incident or as a result of a personal security posture review | $LOCALAI_HTTP_GET_EXEMPTED_ENDPOINTS |
P2P Flags
| Parameter | Default | Description | Environment Variable |
|-----------|---------|-------------|----------------------|
| --p2p | false | Enable P2P mode | $LOCALAI_P2P, $P2P |
| --p2p-dht-interval | 360 | Interval for DHT refresh (used during token generation) | $LOCALAI_P2P_DHT_INTERVAL, $P2P_DHT_INTERVAL |
| --p2p-otp-interval | 9000 | Interval for OTP refresh (used during token generation) | $LOCALAI_P2P_OTP_INTERVAL, $P2P_OTP_INTERVAL |
| --p2ptoken | | Token for P2P mode (optional) | $LOCALAI_P2P_TOKEN, $P2P_TOKEN, $TOKEN |
| --p2p-network-id | | Network ID for P2P mode, can be set arbitrarily by the user for grouping a set of instances | $LOCALAI_P2P_NETWORK_ID, $P2P_NETWORK_ID |
| --federated | false | Enable federated instance | $LOCALAI_FEDERATED, $FEDERATED |
Other Commands
LocalAI supports several subcommands beyond run:
local-ai models - Manage LocalAI models and definitions
local-ai backends - Manage LocalAI backends and definitions
local-ai tts - Convert text to speech
local-ai sound-generation - Generate audio files from text or audio
local-ai transcript - Convert audio to text
local-ai worker - Run workers to distribute workload (llama.cpp-only)
local-ai util - Utility commands
local-ai explorer - Run P2P explorer
local-ai federated - Run LocalAI in federated mode
Use local-ai <command> --help for more information on each command.
Examples
Basic Usage
./local-ai run
./local-ai run --models-path /path/to/models --address :9090
./local-ai run --f16
Environment Variables
export LOCALAI_MODELS_PATH=/path/to/models
export LOCALAI_ADDRESS=:9090
export LOCALAI_F16=true
./local-ai run
LocalAI binaries are available for both Linux and MacOS platforms and can be executed directly from your command line. These binaries are continuously updated and hosted on our GitHub Releases page. This method also supports Windows users via the Windows Subsystem for Linux (WSL).
macOS Download
You can download the DMG and install the application:
Binaries do have limited support compared to container images:
Python-based backends are not shipped with binaries (e.g. bark, diffusers or transformers)
MacOS binaries and Linux-arm64 do not ship TTS nor stablediffusion-cpp backends
Linux binaries do not ship stablediffusion-cpp backend
Running on Nvidia ARM64
LocalAI can be run on Nvidia ARM64 devices, such as the Jetson Nano, Jetson Xavier NX, and Jetson AGX Xavier. The following instructions will guide you through building the LocalAI container for Nvidia ARM64 devices.
Run the LocalAI container on Nvidia ARM64 devices using the following command, where /data/models is the directory containing the models:
docker run -e DEBUG=true -p 8080:8080 -v /data/models:/models -ti --restart=always --name local-ai --runtime nvidia --gpus all quay.io/go-skynet/local-ai:master-nvidia-l4t-arm64-core
Note: /data/models is the directory containing the models. You can replace it with the directory containing your models.
FAQ
Frequently asked questions
Here are answers to some of the most common questions.
How do I get models?
Most gguf-based models should work, but newer models may require additions to the API. If a model doesn't work, please feel free to open up issues. However, be cautious about downloading models from the internet and directly onto your machine, as there may be security vulnerabilities in llama.cpp or ggml that could be maliciously exploited. Some models can be found on Hugging Face: https://huggingface.co/models?search=gguf, or models from gpt4all are compatible too: https://github.com/nomic-ai/gpt4all.
Where are models stored?
LocalAI stores downloaded models in the following locations by default:
Command line: ./models (relative to current working directory)
Docker: /models (inside the container, typically mounted to ./models on host)
Launcher application: ~/.localai/models (in your home directory)
You can customize the model storage location using the LOCALAI_MODELS_PATH environment variable or --models-path command line flag. This is useful if you want to store models outside your home directory for backup purposes or to avoid filling up your home directory with large model files.
How much storage space do models require?
Model sizes vary significantly depending on the model and quantization level:
Ensure you have at least 2-3x the model size available for downloads and temporary files
Use SSD storage for better performance
Consider the model size relative to your system RAM - models larger than your RAM may not run efficiently
Benchmarking LocalAI and llama.cpp shows different results!
LocalAI applies a set of defaults when loading models with the llama.cpp backend, one of these is mirostat sampling - while it achieves better results, it slows down the inference. You can disable this by setting mirostat: 0 in the model config file. See also the advanced section (/advanced/) for more information and this issue.
What’s the difference with Serge, or XXX?
LocalAI is a multi-model solution that doesn't focus on a specific model type (e.g., llama.cpp or alpaca.cpp); it handles all of these internally for faster inference and is easy to set up locally and deploy to Kubernetes.
Everything is slow, how is it possible?
There are a few situations in which this could occur. Some tips are:
Don't use an HDD to store your models. Prefer an SSD over an HDD. If you are stuck with an HDD, disable mmap in the model config file so it loads everything into memory.
Watch out for CPU overbooking. Ideally, --threads should match the number of physical cores. For instance, if your CPU has 4 cores, you would ideally allocate <= 4 threads to a model.
Run LocalAI with DEBUG=true. This gives more information, including stats on the token inference speed.
Check that you are actually getting an output: run a simple curl request with "stream": true to see how fast the model is responding.
Can I use it with a Discord bot, or XXX?
Yes! If the client uses OpenAI and supports setting a different base URL for requests, you can use the LocalAI endpoint. This allows LocalAI to be used with every application that was built to work with OpenAI, without changing the application!
Can this leverage GPUs?
There is GPU support, see /features/gpu-acceleration/.
Where is the webUI?
localai-webui and chatbot-ui are available in the examples section and can be set up as per the instructions. However, as LocalAI is an API, you can already plug it into existing projects that provide UI interfaces to OpenAI's APIs. There are several already on GitHub, and they should be compatible with LocalAI already (as it mimics the OpenAI API).
Enable the debug mode by setting DEBUG=true in the environment variables. This will give you more information on what’s going on.
You can also specify --debug in the command line.
I’m getting ‘invalid pitch’ error when running with CUDA, what’s wrong?
This typically happens when your prompt exceeds the context size. Try to reduce the prompt size, or increase the context size.
I’m getting a ‘SIGILL’ error, what’s wrong?
Your CPU probably does not have support for certain instructions that are compiled by default in the pre-built binaries. If you are running in a container, try setting REBUILD=true and disable the CPU instructions that are not compatible with your CPU. For instance: CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF" make build