Chapter 8

Features

LocalAI provides a comprehensive set of features for running AI models locally. This section covers all the capabilities and functionalities available in LocalAI.

Core Features

  • Text Generation - Generate text with GPT-compatible models using various backends
  • Image Generation - Create images with Stable Diffusion and other diffusion models
  • Audio Processing - Transcribe audio to text and generate speech from text
  • Embeddings - Generate vector embeddings for semantic search and RAG applications
  • GPT Vision - Analyze and understand images with vision-language models

Specialized Features

  • Object Detection - Detect and locate objects in images
  • Reranker - Improve retrieval accuracy with cross-encoder models
  • Stores - Vector similarity search for embeddings
  • Model Gallery - Browse and install pre-configured models
  • Backends - Learn about available backends and how to manage them

Getting Started

To start using these features, make sure you have LocalAI installed and have downloaded some models. Then explore the feature pages above to learn how to use each capability.
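
For instance, a minimal path to a working setup might look like this (a sketch: it assumes the official install script at localai.io/install.sh, and uses phi-2, a model that appears later in this chapter):

# Install LocalAI using the official install script
curl https://localai.io/install.sh | sh

# Start LocalAI and install a model from the gallery in one step
local-ai run phi-2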

Subsections of Features

⚙️ Backends

LocalAI supports a variety of backends that can be used to run different types of AI models. Some core backends are included with LocalAI, while others are containerized applications that provide the runtime environment for specific model types, such as LLMs, diffusion models, or text-to-speech models.

Managing Backends in the UI

The LocalAI web interface provides an intuitive way to manage your backends:

  1. Navigate to the “Backends” section in the navigation menu
  2. Browse available backends from configured galleries
  3. Use the search bar to find specific backends by name, description, or type
  4. Filter backends by type using the quick filter buttons (LLM, Diffusion, TTS, Whisper)
  5. Install or delete backends with a single click
  6. Monitor installation progress in real-time

Each backend card displays:

  • Backend name and description
  • Type of models it supports
  • Installation status
  • Action buttons (Install/Delete)
  • Additional information via the info button

Backend Galleries

Backend galleries are repositories that contain backend definitions. They work similarly to model galleries but are specifically for backends.

You can add backend galleries by specifying the Environment Variable LOCALAI_BACKEND_GALLERIES:

export LOCALAI_BACKEND_GALLERIES='[{"name":"my-gallery","url":"https://raw.githubusercontent.com/username/repo/main/backends"}]'

The URL needs to point to a valid yaml file, for example:

- name: "test-backend"
  uri: "quay.io/image/tests:localai-backend-test"
  alias: "foo-backend"

Here, uri is the path to an OCI container image.
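
Once a backend from the gallery is installed, a model configuration can refer to it by its alias. A minimal sketch, assuming the foo-backend alias from the example above and a hypothetical model file:

name: my-model
backend: foo-backend # the alias defined in the gallery entry
parameters:
  model: some-model-file.gguf # hypothetical file in the models folder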

A backend gallery is a collection of YAML files, each defining a backend. Here’s an example structure:

name: "llm-backend"
description: "A backend for running LLM models"
uri: "quay.io/username/llm-backend:latest"
alias: "llm"
tags:
  - "llm"
  - "text-generation"

Pre-installing Backends

You can pre-install backends when starting LocalAI using the LOCALAI_EXTERNAL_BACKENDS environment variable:

export LOCALAI_EXTERNAL_BACKENDS="llm-backend,diffusion-backend"
local-ai run

Creating a Backend

To create a new backend, you need to:

  1. Create a container image that implements the LocalAI backend interface
  2. Define a backend YAML file
  3. Publish your backend to a container registry

Backend Container Requirements

Your backend container should:

  1. Implement the LocalAI backend interface (gRPC or HTTP)
  2. Handle model loading and inference
  3. Support the required model types
  4. Include necessary dependencies
  5. Have a top-level run.sh file that will be used to run the backend
  6. Be pushed to a registry so it can be used in a gallery
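
As an illustration, the top-level run.sh for a Python-based backend could be a thin launcher like the sketch below (a hypothetical layout; backend.py stands in for your own gRPC server implementation):

#!/bin/bash
# Hypothetical entrypoint: LocalAI invokes run.sh to start the backend process.
set -e
# Forward any arguments LocalAI passes (e.g. the address to bind to) to the server.
exec python3 "$(dirname "$0")/backend.py" "$@"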

Getting started

To get started, see the available backends in LocalAI: https://github.com/mudler/LocalAI/tree/master/backend

Publishing Your Backend

  1. Build your container image:

    docker build -t quay.io/username/my-backend:latest .
  2. Push to a container registry:

    docker push quay.io/username/my-backend:latest
  3. Add your backend to a gallery:

    • Create a YAML entry in your gallery repository
    • Include the backend definition
    • Make the gallery accessible via HTTP/HTTPS

Backend Types

LocalAI supports various types of backends:

  • LLM Backends: For running language models
  • Diffusion Backends: For image generation
  • TTS Backends: For text-to-speech conversion
  • Whisper Backends: For speech-to-text conversion

⚡ GPU acceleration

Section under construction

This section contains instructions on how to use LocalAI with GPU acceleration.

Acceleration for AMD and Metal hardware is still in development; for additional details, see the build documentation.

Automatic Backend Detection

When you install a model from the gallery (or a YAML file), LocalAI intelligently detects the required backend and your system’s capabilities, then downloads the correct version for you. Whether you’re running on a standard CPU, an NVIDIA GPU, an AMD GPU, or an Intel GPU, LocalAI handles it automatically.

For advanced use cases or to override auto-detection, you can use the LOCALAI_FORCE_META_BACKEND_CAPABILITY environment variable. Here are the available options:

  • default: Forces CPU-only backend. This is the fallback if no specific hardware is detected.
  • nvidia: Forces backends compiled with CUDA support for NVIDIA GPUs.
  • amd: Forces backends compiled with ROCm support for AMD GPUs.
  • intel: Forces backends compiled with SYCL/oneAPI support for Intel GPUs.
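
For example, to force the CUDA-enabled backend variants even when auto-detection would pick something else:

export LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia
local-ai run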

Model configuration

Depending on the model architecture and backend used, there might be different ways to enable GPU acceleration. You must configure the model you intend to use with a YAML config file. For example, for llama.cpp workloads a configuration file might look like this (where gpu_layers is the number of layers to offload to the GPU):

name: my-model-name
parameters:
  # Relative to the models path
  model: llama.cpp-model.ggmlv3.q5_K_M.bin

context_size: 1024
threads: 1

f16: true # enable with GPU acceleration
gpu_layers: 22 # GPU Layers (only used when built with cublas)

For diffusers, the configuration might instead look like this:

name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"

CUDA(NVIDIA) acceleration

Requirements

Requirement: nvidia-container-toolkit (see NVIDIA's installation instructions)

If using a system with SELinux, ensure you have the policies installed, such as those provided by NVIDIA.

To check which CUDA version you need, run nvidia-smi or nvcc --version.

Alternatively, you can also check nvidia-smi with docker:

docker run --runtime=nvidia --rm nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi

To use CUDA, use the images with the cublas tag. The image list is on quay:

  • CUDA 11 tags: master-gpu-nvidia-cuda-11, v1.40.0-gpu-nvidia-cuda-11, …
  • CUDA 12 tags: master-gpu-nvidia-cuda-12, v1.40.0-gpu-nvidia-cuda-12, …

In addition to the commands to run LocalAI normally, you need to specify --gpus all to docker, for example:

docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v1.40.0-gpu-nvidia-cuda-12

If the GPU inferencing is working, you should be able to see something like:

5:22PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4
llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 4321.77 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1598 MB
...................................................................................................
llama_init_from_file: kv self size  =  512.00 MB

ROCM(AMD) acceleration

There are a limited number of tested configurations for ROCm systems; however, most newer dedicated consumer-grade GPU devices seem to be supported under the current ROCm 6 implementation.

Due to the nature of ROCm, it is best to run all implementations in containers, as this limits the number of packages required for installation on the host system. Compatibility and package versions for dependencies across all variations of OS must be tested independently if desired; please refer to the build documentation.

Requirements

  • ROCm 6.x.x compatible GPU/accelerator
  • OS: Ubuntu (22.04, 20.04), RHEL (9.3, 9.2, 8.9, 8.8), SLES (15.5, 15.4)
  • Installed to host: amdgpu-dkms and rocm >=6.0.0 as per ROCm documentation.

Recommendations

  • Make sure not to use the GPU assigned for compute for desktop rendering.
  • Ensure at least 100GB of free space on the disk hosting the container runtime and storing images prior to installation.

Limitations

Ongoing verification testing of ROCm compatibility with integrated backends. Please note the following list of verified backends and devices.

LocalAI hipblas images are built against the following targets: gfx900,gfx906,gfx908,gfx940,gfx941,gfx942,gfx90a,gfx1030,gfx1031,gfx1100,gfx1101

If your device is not one of these you must specify the corresponding GPU_TARGETS and specify REBUILD=true. Otherwise you don’t need to specify these in the commands below.

Verified

The devices in the following list have been tested with hipblas images running ROCm 6.0.0

| Backend | Verified | Devices |
|---|---|---|
| llama.cpp | yes | Radeon VII (gfx906) |
| diffusers | yes | Radeon VII (gfx906) |
| piper | yes | Radeon VII (gfx906) |
| whisper | no | none |
| bark | no | none |
| coqui | no | none |
| transformers | no | none |
| exllama | no | none |
| exllama2 | no | none |
| mamba | no | none |
| sentencetransformers | no | none |
| transformers-musicgen | no | none |
| vall-e-x | no | none |
| vllm | no | none |

You can help by expanding this list.

System Prep

  1. Check that your GPU's LLVM target is compatible with the version of ROCm. This can be found in the LLVM Docs.
  2. Check which ROCm version is compatible with your LLVM target and your chosen OS (pay special attention to supported kernel versions). See the following for compatibility for (ROCm 6.0.0) or (ROCm 6.0.2)
  3. Install your chosen version of the dkms and rocm packages (the native package manager is recommended for this process on any OS, as version changes are executed more easily via this method if updates are required). Take care to restart after installing amdgpu-dkms and before installing rocm; for details, see the installation documentation for your chosen OS (6.0.2 or 6.0.0)
  4. Deploy. Yes, it's that easy.

Setup Example (Docker/containerd)

The following are examples of the ROCm specific configuration elements required.

    # For full functionality select a non-'core' image, version locking the image is recommended for debug purposes.
    image: quay.io/go-skynet/local-ai:master-aio-gpu-hipblas
    environment:
      - DEBUG=true
      # If your gpu is not already included in the current list of default targets the following build details are required.
      - REBUILD=true
      - BUILD_TYPE=hipblas
      - GPU_TARGETS=gfx906 # Example for Radeon VII
    devices:
      # AMD GPU only require the following devices be passed through to the container for offloading to occur.
      - /dev/dri
      - /dev/kfd

The same can also be executed as a run command for your container runtime:

docker run \
 -e DEBUG=true \
 -e REBUILD=true \
 -e BUILD_TYPE=hipblas \
 -e GPU_TARGETS=gfx906 \
 --device /dev/dri \
 --device /dev/kfd \
 quay.io/go-skynet/local-ai:master-aio-gpu-hipblas

Please ensure you add all other required environment variables, port forwards, etc. to your compose file or run command.

The rebuild process will take some time to complete when deploying these containers. It is recommended that you pull the image prior to deployment, as depending on the version these images may be ~20GB in size.

Example (k8s) (Advanced Deployment/WIP)

For k8s deployments there is an additional step required before deployment: the ROCm/k8s-device-plugin must be deployed. For any k8s environment, following the documentation provided by AMD from the ROCm project should work. If you use rke2 or OpenShift, it is recommended that you deploy the SUSE- or RedHat-provided version of this resource to ensure compatibility. After this has been completed, the helm chart from go-skynet can be configured and deployed mostly unedited.

The following are details of the changes that should be made to ensure proper function. While these details may be configurable in the values.yaml, development of this Helm chart is ongoing and it is subject to change.

The following details indicate the final state of the localai deployment relevant to GPU function.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {NAME}-local-ai
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - env:
            - name: HIP_VISIBLE_DEVICES
              value: '0'
              # This variable indicates the devices available to container (0:device1 1:device2 2:device3) etc.
              # For multiple devices (say device 1 and 3) the value would be equivalent to HIP_VISIBLE_DEVICES="0,2"
              # Please take note of this when an iGPU is present in host system as compatibility is not assured.
          ...
          resources:
            limits:
              amd.com/gpu: '1'
            requests:
              amd.com/gpu: '1'

This configuration has been tested on a “custom” cluster managed by SUSE Rancher that was deployed on top of Ubuntu 22.04.4; certification of other configurations is ongoing, and compatibility is not guaranteed.

Notes

  • When installing the ROCm kernel driver on your system, ensure that you are installing an equal or newer version than that which is currently implemented in LocalAI (6.0.0 at the time of writing).
  • AMD documentation indicates that this will ensure functionality however your mileage may vary depending on the GPU and distro you are using.
  • If you encounter an Error 413 on attempting to upload an audio file or image for whisper or llava/bakllava on a k8s deployment, note that the ingress for your deployment may require the annotation nginx.ingress.kubernetes.io/proxy-body-size: "25m" to allow larger uploads. This may be included in future versions of the helm chart.
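
For reference, the annotation mentioned above goes into the ingress manifest metadata, for example:

metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "25m"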

Intel acceleration (sycl)

Requirements

If building from source, you need to install Intel oneAPI Base Toolkit and have the Intel drivers available in the system.

Container images

To use SYCL, use the images with gpu-intel in the tag, for example v3.7.0-gpu-intel, …

The image list is on quay.

Example

To run LocalAI with Docker and SYCL, starting the phi-2 model, you can use the following command as an example:

docker run -e DEBUG=true --privileged -ti -v $PWD/models:/models -p 8080:8080  -v /dev/dri:/dev/dri --rm quay.io/go-skynet/local-ai:master-gpu-intel phi-2

Notes

In addition to the commands to run LocalAI normally, you need to specify --device /dev/dri to docker, for example:

docker run --rm -ti --device /dev/dri -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v3.7.0-gpu-intel

Note also that SYCL has a known issue where it hangs when mmap: true is set. You have to disable it in the model configuration if it is explicitly enabled.
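
A minimal sketch of disabling it for a hypothetical llama.cpp model:

name: my-intel-model
backend: llama-cpp
mmap: false # workaround for the SYCL hang
parameters:
  model: model.gguf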

Vulkan acceleration

Requirements

If using nvidia, follow the steps in the CUDA section to configure your docker runtime to allow access to the GPU.

Container images

To use Vulkan, use the images with the vulkan tag, for example v3.7.0-gpu-vulkan.

Example

To run LocalAI with Docker and Vulkan, you can use the following command as an example:

docker run -p 8080:8080 -e DEBUG=true -v $PWD/models:/models localai/localai:latest-gpu-vulkan

Notes

In addition to the commands to run LocalAI normally, you need to specify additional flags to pass the GPU hardware to the container.

These flags are the same as the sections above, depending on the hardware, for nvidia, AMD or Intel.

If you have mixed hardware, you can pass flags for multiple GPUs, for example:

# NVIDIA passthrough: --gpus=all
# AMD/Intel passthrough: --device /dev/dri (and /dev/kfd for AMD)
docker run -p 8080:8080 -e DEBUG=true -v $PWD/models:/models \
  --gpus=all \
  --device /dev/dri --device /dev/kfd \
  localai/localai:latest-gpu-vulkan

📖 Text generation (GPT)

LocalAI supports generating text with GPT using llama.cpp and other backends (such as rwkv.cpp); see also Model compatibility for an up-to-date list of the supported model families.

Note:

  • You can also specify the model name as part of the OpenAI token (see the example below).
  • If only one model is available, the API will use it for all the requests.
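
For instance, a sketch of passing the model name via the token (the Authorization header) instead of the request body:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ggml-koala-7b-model-q4_0-r2.bin" \
  -d '{"messages": [{"role": "user", "content": "How are you?"}]}'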

API Reference

Chat completions

https://platform.openai.com/docs/api-reference/chat

For example, to generate a chat completion, you can send a POST request to the /v1/chat/completions endpoint with the instruction as the request body:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens
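
For example, a request that sets all of them (values are illustrative):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "max_tokens": 128
}'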

Edit completions

https://platform.openai.com/docs/api-reference/edits

To generate an edit completion you can send a POST request to the /v1/edits endpoint with the instruction as the request body:

curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "instruction": "rephrase",
  "input": "Black cat jumped out of the window",
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens.

Completions

https://platform.openai.com/docs/api-reference/completions

To generate a completion, you can send a POST request to the /v1/completions endpoint with the prompt as the request body:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "prompt": "A long time ago in a galaxy far, far away",
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens

List models

You can list all the models available with:

curl http://localhost:8080/v1/models

Backends

RWKV

RWKV support is available through llama.cpp (see below)

llama.cpp

llama.cpp is a popular port of Facebook’s LLaMA model in C/C++.

Note

The ggml file format has been deprecated. If you are using ggml models, use a LocalAI version older than v2.25.0. For gguf models, use the llama backend. The go backend is deprecated as well, but is still available as go-llama.

Features

The llama.cpp backend supports the features covered elsewhere in this chapter, such as text generation, embeddings, and constrained grammars.

Setup

LocalAI supports llama.cpp models out of the box. You can use the llama.cpp model in the same way as any other model.

Manual setup

It is sufficient to copy the ggml or gguf model files into the models folder. You can then refer to the model with the model parameter in the API calls.

You can optionally create an associated YAML model config file to tune the model’s parameters or apply a template to the prompt.

Prompt templates are useful for models that are fine-tuned towards a specific prompt.

Automatic setup

LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for ggml or gguf models.

For instance, if you have the galleries enabled and LocalAI already running, you can just start chatting with models in huggingface by running:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.1
   }'

LocalAI will automatically download and configure the model in the model directory.

Models can be also preloaded or downloaded on demand. To learn about model galleries, check out the model gallery documentation.

YAML configuration

To use the llama.cpp backend, specify llama-cpp as the backend in the YAML file:

name: llama
backend: llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf

Backend Options

The llama.cpp backend supports additional configuration options that can be specified in the options field of your model YAML configuration. These options allow fine-tuning of the backend behavior:

| Option | Type | Description | Example |
|---|---|---|---|
| use_jinja or jinja | boolean | Enable Jinja2 template processing for chat templates. When enabled, the backend uses Jinja2-based chat templates from the model for formatting messages. | use_jinja:true |
| context_shift | boolean | Enable context shifting, which allows the model to dynamically adjust context window usage. | context_shift:true |
| cache_ram | integer | Set the maximum RAM cache size in MiB for the KV cache. Use -1 for unlimited (default). | cache_ram:2048 |
| parallel or n_parallel | integer | Enable parallel request processing. When set to a value greater than 1, enables continuous batching for handling multiple requests concurrently. | parallel:4 |
| grpc_servers or rpc_servers | string | Comma-separated list of gRPC server addresses for distributed inference. Allows distributing workload across multiple llama.cpp workers. | grpc_servers:localhost:50051,localhost:50052 |

Example configuration with options:

name: llama-model
backend: llama-cpp
parameters:
  model: model.gguf
options:
  - use_jinja:true
  - context_shift:true
  - cache_ram:4096
  - parallel:2

Note: The parallel option can also be set via the LLAMACPP_PARALLEL environment variable, and grpc_servers can be set via the LLAMACPP_GRPC_SERVERS environment variable. Options specified in the YAML file take precedence over environment variables.
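
For example, setting the same options via environment variables instead of the YAML file:

export LLAMACPP_PARALLEL=4
export LLAMACPP_GRPC_SERVERS="localhost:50051,localhost:50052"
local-ai run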

exllama/2

Exllama is “a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights”. Both exllama and exllama2 are supported.

Model setup

Download the model as a folder inside the model directory and create a YAML file specifying the exllama backend. For instance with the TheBloke/WizardLM-7B-uncensored-GPTQ model:

$ git lfs install
$ cd models && git clone https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ
$ ls models/                                                                 
.keep                        WizardLM-7B-uncensored-GPTQ/ exllama.yaml
$ cat models/exllama.yaml                                                     
name: exllama
parameters:
  model: WizardLM-7B-uncensored-GPTQ
backend: exllama

Test with:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{                                                                                                         
   "model": "exllama",
   "messages": [{"role": "user", "content": "How are you?"}],
   "temperature": 0.1
 }'

vLLM

vLLM is a fast and easy-to-use library for LLM inference.

LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out vllm performance here.

Setup

Create a YAML file for the model you want to use with vllm.

To set up a model, you just need to specify the model name in the YAML config file:

name: vllm
backend: vllm
parameters:
    model: "facebook/opt-125m"

The backend will automatically download the required files in order to run the model.

Usage

Use the completions endpoint by specifying the vllm backend:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{   
   "model": "vllm",
   "prompt": "Hello, my name is",
   "temperature": 0.1, "top_p": 0.1
 }'

Transformers

Transformers is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.

LocalAI has a built-in integration with Transformers, and it can be used to run models.

This is an extra backend: it is already available in the container images (the extra images already contain the Python dependencies for Transformers), so there is nothing to do for the setup.

Setup

Create a YAML file for the model you want to use with transformers.

To set up a model, you just need to specify the model name in the YAML config file:

name: transformers
backend: transformers
parameters:
    model: "facebook/opt-125m"
type: AutoModelForCausalLM
quantization: bnb_4bit # One of: bnb_8bit, bnb_4bit, xpu_4bit, xpu_8bit (optional)

The backend will automatically download the required files in order to run the model.

Parameters

Type

| Type | Description |
|---|---|
| AutoModelForCausalLM | A model that can be used to generate sequences. Use it for NVIDIA CUDA and Intel GPU with Intel Extensions for PyTorch acceleration |
| OVModelForCausalLM | For Intel CPU/GPU/NPU OpenVINO Text Generation models |
| OVModelForFeatureExtraction | For Intel CPU/GPU/NPU OpenVINO embedding acceleration |
| N/A | Defaults to AutoModel |

  • OVModelForCausalLM requires OpenVINO IR Text Generation models from Hugging Face
  • OVModelForFeatureExtraction works with any Safetensors Transformer Feature Extraction model from Hugging Face (embedding models)

Please note that streaming is currently not implemented in AutoModelForCausalLM for Intel GPU. AMD GPU support is not implemented. Although AMD CPU is not officially supported by OpenVINO, there are reports that it works: YMMV.

Embeddings

Use embeddings: true if the model is an embedding model.
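
A minimal sketch (the model name is illustrative; any Hugging Face feature-extraction model should work):

name: my-embeddings
backend: transformers
embeddings: true
parameters:
  model: "sentence-transformers/all-MiniLM-L6-v2" # illustrative model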

Inference device selection

The transformers backend tries to automatically select the best device for inference; you can override this decision manually with the main_gpu parameter.

| Inference Engine | Applicable Values |
|---|---|
| CUDA | cuda, cuda.X where X is the GPU device (as in nvidia-smi -L output) |
| OpenVINO | Any applicable value from inference modes, like AUTO, CPU, GPU, NPU, MULTI, HETERO |

Example for CUDA: main_gpu: cuda.0

Example for OpenVINO: main_gpu: AUTO:-CPU

This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.

Inference Precision

The transformers backend automatically selects the fastest applicable inference precision according to device support. The CUDA backend can manually enable bfloat16, if your hardware supports it, with the following parameter:

f16: true

Quantization

| Quantization | Description |
|---|---|
| bnb_8bit | 8-bit quantization |
| bnb_4bit | 4-bit quantization |
| xpu_8bit | 8-bit quantization for Intel XPUs |
| xpu_4bit | 4-bit quantization for Intel XPUs |

Trust Remote Code

Some models, like Microsoft Phi-3, require external code beyond what is provided by the transformers library. By default this is disabled for security. It can be manually enabled with: trust_remote_code: true

Maximum Context Size

The maximum context size (in tokens) can be specified with the context_size parameter. Do not use values higher than what your model supports.

Usage example: context_size: 8192

Auto Prompt Template

Usually the chat template is defined by the model author in the tokenizer_config.json file. To enable it, use the use_tokenizer_template: true parameter in the template section.

Usage example:

template:
  use_tokenizer_template: true

Custom Stop Words

Stop words are usually defined in the tokenizer_config.json file. They can be overridden with the stopwords parameter if needed, as with the llama3-Instruct model.

Usage example:

stopwords:
- "<|eot_id|>"
- "<|end_of_text|>"

Usage

Use the completions endpoint by specifying the transformers model:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{   
   "model": "transformers",
   "prompt": "Hello, my name is",
   "temperature": 0.1, "top_p": 0.1
 }'

Examples

OpenVINO

A model configuration file for OpenVINO and the Starling model:

name: starling-openvino
backend: transformers
parameters:
  model: fakezeta/Starling-LM-7B-beta-openvino-int8
context_size: 8192
threads: 6
f16: true
type: OVModelForCausalLM
stopwords:
- <|end_of_turn|>
- <|endoftext|>
prompt_cache_path: "cache"
prompt_cache_all: true
template:
  chat_message: |
    {{if eq .RoleName "system"}}{{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "assistant"}}<|end_of_turn|>GPT4 Correct Assistant: {{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "user"}}GPT4 Correct User: {{.Content}}{{end}}

  chat: |
    {{.Input}}<|end_of_turn|>GPT4 Correct Assistant:

  completion: |
    {{.Input}}

📈 Reranker

A reranking model, often referred to as a cross-encoder, is a core component in the two-stage retrieval systems used in information retrieval and natural language processing tasks. Given a query and a set of documents, it will output similarity scores.

We can then use the scores to reorder the documents by relevance in our RAG system, increasing its overall accuracy and filtering out non-relevant results.

LocalAI supports reranker models via the rerankers backend, which is based on the rerankers Python library.

Usage

You can test rerankers by using container images with python (this does NOT work with core images) and a model config file like this, or by installing cross-encoder from the gallery in the UI:

name: jina-reranker-v1-base-en
backend: rerankers
parameters:
  model: cross-encoder

and test it with:

    curl http://localhost:8080/v1/rerank \
      -H "Content-Type: application/json" \
      -d '{
      "model": "jina-reranker-v1-base-en",
      "query": "Organic skincare products for sensitive skin",
      "documents": [
        "Eco-friendly kitchenware for modern homes",
        "Biodegradable cleaning supplies for eco-conscious consumers",
        "Organic cotton baby clothes for sensitive skin",
        "Natural organic skincare range for sensitive skin",
        "Tech gadgets for smart homes: 2024 edition",
        "Sustainable gardening tools and compost solutions",
        "Sensitive skin-friendly facial cleansers and toners",
        "Organic food wraps and storage solutions",
        "All-natural pet food for dogs with allergies",
        "Yoga mats made from recycled materials"
      ],
      "top_n": 3
    }'

🗣 Text to audio (TTS)

API Compatibility

The LocalAI TTS API is compatible with the OpenAI TTS API and the Elevenlabs API.

LocalAI API

The /tts endpoint can also be used to generate speech from text.

Usage

Input: input, model

For example, to generate an audio file, you can send a POST request to the /tts endpoint with the instruction as the request body:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts"
}'

Returns an audio/wav file.
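
For example, to save the generated audio to a file:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts"
}' -o hello.wav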

Backends

🐸 Coqui

Required: don’t use LocalAI images ending with the -core tag. Python dependencies are required in order to use this backend.

Coqui works without any configuration; to test it, you can run the following curl command:

    curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
        "backend": "coqui",
        "model": "tts_models/en/ljspeech/glow-tts",
        "input":"Hello, this is a test!"
        }'

You can use the env variable COQUI_LANGUAGE to set the language used by the coqui backend.

You can also use config files to configure tts models (see section below on how to use config files).

Bark

Bark allows you to generate audio from text prompts.

This is an extra backend: it is already available in the container images, and there is nothing to do for the setup.

Model setup

There is nothing to be done for the model setup. You can already start to use bark. The models will be downloaded the first time you use the backend.

Usage

Use the tts endpoint by specifying the bark backend:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
     "backend": "bark",
     "input":"Hello!"
   }' | aplay

To specify a voice from https://github.com/suno-ai/bark#-voice-presets ( https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c ), use the model parameter:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
     "backend": "bark",
     "input":"Hello!",
     "model": "v2/en_speaker_4"
   }' | aplay

Piper

Piper voice models can be installed manually by copying the onnx model files into the models folder, or installed from the model gallery.

To use the tts endpoint, run the following command. You can specify a backend with the backend parameter. For example, to use the piper backend:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model":"it-riccardo_fasol-x-low.onnx",
  "backend": "piper",
  "input": "Ciao, sono Ettore"
}' | aplay

Note:

  • aplay is a Linux command. You can use other tools to play the audio file.
  • The model name is the filename with the extension.
  • The model name is case sensitive.
  • LocalAI must be compiled with the GO_TAGS=tts flag.

Transformers-musicgen

LocalAI also has experimental support for transformers-musicgen for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:

curl --request POST \
  --url http://localhost:8080/tts \
  --header 'Content-Type: application/json' \
  --data '{
    "backend": "transformers-musicgen",
    "model": "facebook/musicgen-medium",
    "input": "Cello Rave"
}' | aplay

Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.

Vall-E-X

VALL-E-X is an open source implementation of Microsoft’s VALL-E X zero-shot TTS model.

Setup

The backend will automatically download the required files in order to run the model.

This is an extra backend: it is already available in the container images, and there is nothing to do for the setup. If you are building manually, you need to install Vall-E-X first.

Usage

Use the tts endpoint by specifying the vall-e-x backend:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
     "backend": "vall-e-x",
     "input":"Hello!"
   }' | aplay

Voice cloning

In order to use voice cloning capabilities, you must create a YAML configuration file to set up a model:

name: cloned-voice
backend: vall-e-x
parameters:
  model: "cloned-voice"
tts:
    vall-e:
      # The path to the audio file to be cloned
      # relative to the models directory
      # Max 15s
      audio_path: "audio-sample.wav"

Then you can specify the model name in the requests:

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{         
     "model": "cloned-voice",
     "input":"Hello!"
   }' | aplay

Using config files

You can also use a config-file to specify TTS models and their parameters.

In the following example we define a custom config to load the xtts_v2 model, and specify a voice and language.

name: xtts_v2
backend: coqui
parameters:
  language: fr
  model: tts_models/multilingual/multi-dataset/xtts_v2

tts:
  voice: Ana Florence

With this config, you can now use the following curl command to generate a text-to-speech audio file:

curl -L http://localhost:8080/tts \
    -H "Content-Type: application/json" \
    -d '{
"model": "xtts_v2",
"input": "Bonjour, je suis Ana Florence. Comment puis-je vous aider?"
}' | aplay

Response format

To provide compatibility with the OpenAI API's response_format parameter, ffmpeg must be installed (or a Docker image that includes ffmpeg must be used) so the generated wav file can be converted before the API returns its response.

Warning regarding a change in behaviour: previously the parameter was ignored and a wav file was always returned, which could lead to codec errors later in the integration (such as trying to decode wav data as mp3, the default format used by OpenAI).

Formats supported via ffmpeg are wav, mp3, aac, flac, and opus, defaulting to wav if an unknown format or no format is provided.

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts",
  "response_format": "mp3"
}'

If a response_format is added in the query (other than wav) and ffmpeg is not available, the call will fail.

🎨 Image generation

(Image: example output generated with AnimagineXL)

LocalAI supports generating images with Stable diffusion, running on CPU using C++ and Python implementations.

Usage

OpenAI docs: https://platform.openai.com/docs/api-reference/images/create

To generate an image you can send a POST request to the /v1/images/generations endpoint with the instruction as the request body:

curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "prompt": "A cute baby sea otter",
  "size": "256x256"
}'

Available additional parameters: mode, step.

Note: To set a negative prompt, you can split the prompt with |, for instance: a cute baby sea otter|malformed.

curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
  "size": "256x256"
}'

Backends

stablediffusion-ggml

This backend is based on stable-diffusion.cpp. Every model supported by that backend is also supported by LocalAI.

Setup

There are already several models in the gallery that are available to install and get up and running with this backend. For example, you can run flux by searching for it in the model gallery (flux.1-dev-ggml) or by starting LocalAI with:

local-ai run flux.1-dev-ggml

To use a custom model, you can follow these steps:

  1. Create a model file stablediffusion.yaml in the models folder:
name: stablediffusion
backend: stablediffusion-ggml
parameters:
  model: gguf_model.gguf
step: 25
cfg_scale: 4.5
options:
- "clip_l_path:clip_l.safetensors"
- "clip_g_path:clip_g.safetensors"
- "t5xxl_path:t5xxl-Q5_0.gguf"
- "sampler:euler"
  2. Download the required assets to the models folder
  3. Start LocalAI

Diffusers

Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. LocalAI has a diffusers backend which allows image generation using the diffusers library.

Model setup

The models will be downloaded the first time you use the backend from huggingface automatically.

Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl with CPU:

name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers

f16: false
diffusers:
  cuda: false # Enable for GPU usage (CUDA)
  scheduler_type: euler_a

Dependencies

This is an extra backend: it is already available in the container images, and there is nothing to do for the setup. Do not use core images (ending with -core). If you are building manually, see the build instructions.

For GPU (CUDA) usage, the same model can instead be configured with cuda and f16 enabled:

name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers
cuda: true
f16: true
diffusers:
  scheduler_type: euler_a

Local models

You can also use local models, or modify some parameters like clip_skip, scheduler_type, for instance:

name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
cuda: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"
  clip_skip: 11

cfg_scale: 8

Configuration parameters

The following parameters are available in the configuration file:

| Parameter | Description | Default |
|---|---|---|
| f16 | Force the usage of float16 instead of float32 | false |
| step | Number of steps to run the model for | 30 |
| cuda | Enable CUDA acceleration | false |
| enable_parameters | Parameters to enable for the model | negative_prompt,num_inference_steps,clip_skip |
| scheduler_type | Scheduler type | k_dpp_sde |
| cfg_scale | Configuration scale | 8 |
| clip_skip | Clip skip | None |
| pipeline_type | Pipeline type | AutoPipelineForText2Image |
| lora_adapters | A list of lora adapters (file names relative to model directory) to apply | None |
| lora_scales | A list of lora scales (floats) to apply | None |

Several types of schedulers are available:

| Scheduler | Description |
|---|---|
| ddim | DDIM |
| pndm | PNDM |
| heun | Heun |
| unipc | UniPC |
| euler | Euler |
| euler_a | Euler a |
| lms | LMS |
| k_lms | LMS Karras |
| dpm_2 | DPM2 |
| k_dpm_2 | DPM2 Karras |
| dpm_2_a | DPM2 a |
| k_dpm_2_a | DPM2 a Karras |
| dpmpp_2m | DPM++ 2M |
| k_dpmpp_2m | DPM++ 2M Karras |
| dpmpp_sde | DPM++ SDE |
| k_dpmpp_sde | DPM++ SDE Karras |
| dpmpp_2m_sde | DPM++ 2M SDE |
| k_dpmpp_2m_sde | DPM++ 2M SDE Karras |

Pipelines types available:

| Pipeline type | Description |
|---|---|
| StableDiffusionPipeline | Stable diffusion pipeline |
| StableDiffusionImg2ImgPipeline | Stable diffusion image-to-image pipeline |
| StableDiffusionDepth2ImgPipeline | Stable diffusion depth-to-image pipeline |
| DiffusionPipeline | Diffusion pipeline |
| StableDiffusionXLPipeline | Stable diffusion XL pipeline |
| StableVideoDiffusionPipeline | Stable video diffusion pipeline |
| AutoPipelineForText2Image | Automatic detection pipeline for text-to-image |
| VideoDiffusionPipeline | Video diffusion pipeline |
| StableDiffusion3Pipeline | Stable diffusion 3 pipeline |
| FluxPipeline | Flux pipeline |
| FluxTransformer2DModel | Flux transformer 2D model |
| SanaPipeline | Sana pipeline |

Advanced: Additional parameters

Arbitrary additional parameters can be specified in the options field as key/value pairs separated by a colon:

name: animagine-xl
options:
- "cfg_scale:6"

Note: There is no complete parameter list. Any parameter can be passed arbitrarily and is forwarded directly to the model as an argument to the pipeline. Different pipelines/implementations support different parameters.

The example above will result in the following Python code when generating images:

pipe(
    prompt="A cute baby sea otter", # Options passed via API
    size="256x256", # Options passed via API
    cfg_scale=6 # Additional parameter passed via configuration file
)

Usage

Text to Image

Use the image generation endpoint with the model name from the configuration file:

curl http://localhost:8080/v1/images/generations \
    -H "Content-Type: application/json" \
    -d '{
      "prompt": "<positive prompt>|<negative prompt>", 
      "model": "animagine-xl", 
      "step": 51,
      "size": "1024x1024" 
    }'

Image to Image

https://huggingface.co/docs/diffusers/using-diffusers/img2img

An example model (GPU):

name: stablediffusion-edit
parameters:
  model: nitrosocke/Ghibli-Diffusion
backend: diffusers
step: 25
cuda: true
f16: true
diffusers:
  pipeline_type: StableDiffusionImg2ImgPipeline
  enable_parameters: "negative_prompt,num_inference_steps,image"

Test with:

IMAGE_PATH=/path/to/your/image
(echo -n '{"file": "'; base64 $IMAGE_PATH; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-edit"}') |
curl -H "Content-Type: application/json" -d @-  http://localhost:8080/v1/images/generations

🖼️ Flux kontext with stable-diffusion.cpp

LocalAI supports Flux Kontext, which can be used to edit images via the API.

Install with:

local-ai run flux.1-kontext-dev

To test:

curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "model": "flux.1-kontext-dev",
  "prompt": "change 'flux.cpp' to 'LocalAI'",
  "size": "256x256",
  "ref_images": [
  	"https://raw.githubusercontent.com/leejet/stable-diffusion.cpp/master/assets/flux/flux1-dev-q8_0.png"
  ]
}'

Depth to Image

https://huggingface.co/docs/diffusers/using-diffusers/depth2img

name: stablediffusion-depth
parameters:
  model: stabilityai/stable-diffusion-2-depth
backend: diffusers
step: 50
f16: true
cuda: true
diffusers:
  pipeline_type: StableDiffusionDepth2ImgPipeline
  enable_parameters: "negative_prompt,num_inference_steps,image"

cfg_scale: 6
(echo -n '{"file": "'; base64 ~/path/to/image.jpeg; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-depth"}') |
curl -H "Content-Type: application/json" -d @-  http://localhost:8080/v1/images/generations

img2vid

name: img2vid
parameters:
  model: stabilityai/stable-video-diffusion-img2vid
backend: diffusers
step: 25
f16: true
cuda: true
diffusers:
  pipeline_type: StableVideoDiffusionPipeline
(echo -n '{"file": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true","size": "512x512","model":"img2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations

txt2vid

name: txt2vid
parameters:
  model: damo-vilab/text-to-video-ms-1.7b
backend: diffusers
step: 25
f16: true
cuda: true
diffusers:
  pipeline_type: VideoDiffusionPipeline
  cuda: true
(echo -n '{"prompt": "spiderman surfing","size": "512x512","model":"txt2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations

🔍 Object detection

LocalAI supports object detection through various backends. This feature allows you to identify and locate objects within images with high accuracy and real-time performance. Currently, RF-DETR is available as an implementation.

Overview

Object detection in LocalAI is implemented through dedicated backends that can identify and locate objects within images. Each backend provides different capabilities and model architectures.

Key Features:

  • Real-time object detection
  • High accuracy detection with bounding boxes
  • Support for multiple hardware accelerators (CPU, NVIDIA GPU, Intel GPU, AMD GPU)
  • Structured detection results with confidence scores
  • Easy integration through the /v1/detection endpoint

Usage

Detection Endpoint

LocalAI provides a dedicated /v1/detection endpoint for object detection tasks. This endpoint is specifically designed for object detection and returns structured detection results with bounding boxes and confidence scores.

API Reference

To perform object detection, send a POST request to the /v1/detection endpoint:

curl -X POST http://localhost:8080/v1/detection \
  -H "Content-Type: application/json" \
  -d '{
    "model": "rfdetr-base",
    "image": "https://media.roboflow.com/dog.jpeg"
  }'

Request Format

The request body should contain:

  • model: The name of the object detection model (e.g., “rfdetr-base”)
  • image: The image to analyze, which can be:
    • A URL to an image
    • A base64-encoded image

Response Format

The API returns a JSON response with detected objects:

{
  "detections": [
    {
      "x": 100.5,
      "y": 150.2,
      "width": 200.0,
      "height": 300.0,
      "confidence": 0.95,
      "class_name": "dog"
    },
    {
      "x": 400.0,
      "y": 200.0,
      "width": 150.0,
      "height": 250.0,
      "confidence": 0.87,
      "class_name": "person"
    }
  ]
}

Each detection includes:

  • x, y: Coordinates of the bounding box top-left corner
  • width, height: Dimensions of the bounding box
  • confidence: Detection confidence score (0.0 to 1.0)
  • class_name: The detected object class
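
For example, to keep only high-confidence detections on the client side (assuming jq is installed):

curl -s -X POST http://localhost:8080/v1/detection \
  -H "Content-Type: application/json" \
  -d '{"model": "rfdetr-base", "image": "https://media.roboflow.com/dog.jpeg"}' \
  | jq '.detections[] | select(.confidence > 0.9)'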

Backends

RF-DETR Backend

The RF-DETR backend is implemented as a Python-based gRPC service that integrates seamlessly with LocalAI. It provides object detection capabilities using the RF-DETR model architecture and supports multiple hardware configurations:

  • CPU: Optimized for CPU inference
  • NVIDIA GPU: CUDA acceleration for NVIDIA GPUs
  • Intel GPU: Intel oneAPI optimization
  • AMD GPU: ROCm acceleration for AMD GPUs
  • NVIDIA Jetson: Optimized for ARM64 NVIDIA Jetson devices

Setup

  1. Using the Model Gallery (Recommended)

    The easiest way to get started is using the model gallery. The rfdetr-base model is available in the official LocalAI gallery:

    # Install and run the rfdetr-base model
    local-ai run rfdetr-base

    You can also install it through the web interface by navigating to the Models section and searching for “rfdetr-base”.

  2. Manual Configuration

    Create a model configuration file in your models directory:

    name: rfdetr
    backend: rfdetr
    parameters:
      model: rfdetr-base

Available Models

Currently, the following model is available in the Model Gallery:

  • rfdetr-base: Base model with balanced performance and accuracy

You can browse and install this model through the LocalAI web interface or using the command line.

Examples

Basic Object Detection

curl -X POST http://localhost:8080/v1/detection \
  -H "Content-Type: application/json" \
  -d '{
    "model": "rfdetr-base",
    "image": "https://example.com/image.jpg"
  }'

Base64 Image Detection

base64_image=$(base64 -w 0 image.jpg)
curl -X POST http://localhost:8080/v1/detection \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"rfdetr-base\",
    \"image\": \"data:image/jpeg;base64,$base64_image\"
  }"

Troubleshooting

Common Issues

  1. Model Loading Errors

    • Ensure the model file is properly downloaded
    • Check available disk space
    • Verify model compatibility with your backend version
  2. Low Detection Accuracy

    • Ensure good image quality and lighting
    • Check if objects are clearly visible
    • Consider using a larger model for better accuracy
  3. Slow Performance

    • Enable GPU acceleration if available
    • Use a smaller model for faster inference
    • Optimize image resolution

Debug Mode

Enable debug logging for troubleshooting:

local-ai run --debug rfdetr-base

Object Detection Category

LocalAI includes a dedicated object-detection category for models and backends that specialize in identifying and locating objects within images. This category currently includes:

  • RF-DETR: Real-time transformer-based object detection

Additional object detection models and backends will be added to this category in the future. You can filter models by the object-detection tag in the model gallery to find all available object detection models.

🧠 Embeddings

LocalAI supports generating embeddings for text or lists of tokens.

For the API documentation you can refer to the OpenAI docs: https://platform.openai.com/docs/api-reference/embeddings

Model compatibility

The embedding endpoint is compatible with llama.cpp models, bert.cpp models, and sentence-transformers models available on Hugging Face.

Manual Setup

Create a YAML config file in the models directory. Specify the backend and the model file.

name: text-embedding-ada-002 # The model name used in the API
parameters:
  model: <model_file>
backend: "<backend>"
embeddings: true

Huggingface embeddings

To use sentence-transformers and models in huggingface you can use the sentencetransformers embedding backend.

name: text-embedding-ada-002
backend: sentencetransformers
embeddings: true
parameters:
  model: all-MiniLM-L6-v2

The sentencetransformers backend uses Python sentence-transformers. For a list of all pre-trained models available see here: https://github.com/UKPLab/sentence-transformers#pre-trained-models

Note
  • The sentencetransformers backend is an optional backend of LocalAI and uses Python. If you are running LocalAI from the containers you are good to go and should be already configured for use.
  • For local execution, you also have to specify the extra backend in the EXTERNAL_GRPC_BACKENDS environment variable.
    • Example: EXTERNAL_GRPC_BACKENDS="sentencetransformers:/path/to/LocalAI/backend/python/sentencetransformers/sentencetransformers.py"
  • The sentencetransformers backend supports only embeddings of text, not of tokens. If you need to embed tokens you can use the bert backend or llama.cpp.
  • No models are required to be downloaded before using the sentencetransformers backend. The models will be downloaded automatically the first time the API is used.
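
To test the sentencetransformers configuration above:

curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
  "input": "My text",
  "model": "text-embedding-ada-002"
}' | jq "."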

Llama.cpp embeddings

Embeddings with llama.cpp are supported with the llama-cpp backend; they need to be enabled by setting embeddings to true.

name: my-awesome-model
backend: llama-cpp
embeddings: true
parameters:
  model: ggml-file.bin

Then you can use the API to generate embeddings:

curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
  "input": "My text",
  "model": "my-awesome-model"
}' | jq "."

💡 Examples

  • Example that uses LLamaIndex and LocalAI as embedding: here.

🥽 GPT Vision

LocalAI supports understanding images by using LLaVA, and implements the GPT Vision API from OpenAI.

Usage

OpenAI docs: https://platform.openai.com/docs/guides/vision

To let LocalAI understand and reply with what it sees in the image, use the /v1/chat/completions endpoint, for example with curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llava",
     "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'

Grammars and function tools can be used as well in conjunction with vision APIs:

 curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llava", "grammar": "root ::= (\"yes\" | \"no\")",
     "messages": [{"role": "user", "content": [{"type":"text", "text": "Is there some grass in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'

Setup

All-in-One images have already shipped the llava model as gpt-4-vision-preview, so no setup is needed in this case.

To setup the LLaVa models, follow the full example in the configuration examples.

✍️ Constrained Grammars

Overview

The chat endpoint supports the grammar parameter, which allows users to specify a grammar in Backus-Naur Form (BNF). This feature enables the Large Language Model (LLM) to generate outputs adhering to a user-defined schema, such as JSON, YAML, or any other format that can be defined using BNF. For more details about BNF, see Backus-Naur Form on Wikipedia.

Note

Compatibility Notice: This feature is only supported by models that use the llama.cpp backend. For a complete list of compatible models, refer to the Model Compatibility page. For technical details, see the related pull requests: PR #1773 and PR #1887.

Setup

To use this feature, follow the installation and setup instructions on the LocalAI Functions page. Ensure that your local setup meets all the prerequisites specified for the llama.cpp backend.

💡 Usage Example

The following example demonstrates how to use the grammar parameter to constrain the model’s output to either “yes” or “no”. This can be particularly useful in scenarios where the response format needs to be strictly controlled.

Example: Binary Response Constraint

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Do you like apples?"}],
  "grammar": "root ::= (\"yes\" | \"no\")"
}'

In this example, the grammar parameter is set to a simple choice between “yes” and “no”, ensuring that the model’s response adheres strictly to one of these options regardless of the context.

Example: JSON Output Constraint

You can also use grammars to enforce JSON output format:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Generate a person object with name and age"}],
  "grammar": "root ::= \"{\" \"\\\"name\\\":\" string \",\\\"age\\\":\" number \"}\"\nstring ::= \"\\\"\" [a-z]+ \"\\\"\"\nnumber ::= [0-9]+"
}'

Example: YAML Output Constraint

Similarly, you can enforce YAML format:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Generate a YAML list of fruits"}],
  "grammar": "root ::= \"fruits:\" newline (\"  - \" string newline)+\nstring ::= [a-z]+\nnewline ::= \"\\n\""
}'

Advanced Usage

For more complex grammars, you can define multi-line BNF rules. The grammar parser supports:

  • Alternation (|)
  • Repetition (*, +)
  • Optional elements (?)
  • Character classes ([a-z])
  • String literals ("text")
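
As an illustrative sketch combining several of these constructs (repetition, optional elements, and a character class), the following grammar constrains the output to a comma-separated list of lowercase words with an optional trailing period:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "List three animals"}],
  "grammar": "root ::= item (\", \" item)* \".\"?\nitem ::= [a-z]+"
}'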

🆕🖧 Distributed Inference

This functionality enables LocalAI to distribute inference requests across multiple worker nodes, improving efficiency and performance. Nodes are discovered automatically and connect via p2p using a shared token, which keeps communication between the nodes of the network secure and private.

LocalAI supports two modes of distributed inferencing via p2p:

  • Federated Mode: Requests are distributed within the cluster and routed to a single worker node in the network, chosen by the load balancer.
  • Worker Mode (aka “model sharding” or “splitting weights”): Requests are processed by all the workers, each contributing to the final inference result (the model weights are shared across them).

A list of global instances shared by the community is available at explorer.localai.io.

Usage

Starting LocalAI with --p2p generates a shared token for connecting multiple instances; that token is all you need to create AI clusters, eliminating the need for intricate network setups.

Simply navigate to the “Swarm” section in the WebUI and follow the on-screen instructions.

For fully shared instances, start LocalAI with --p2p --federated and follow the Swarm section’s guidance. This feature is experimental and offered as a tech preview.

Federated mode

Federated mode lets you launch multiple LocalAI instances and connect them together in a federated network. It is useful when you want to distribute the inference load across multiple nodes while keeping a single point of entry for the API. The Swarm section of the WebUI shows the instructions to connect multiple instances together.


To start a LocalAI server in federated mode, run:

local-ai run --p2p --federated

This will generate a token that you can use to connect other LocalAI instances to the network, or that others can use to join your network. If you already have a token, you can specify it using the TOKEN environment variable.
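
For example, to join an existing federation using a token obtained from another instance (the token value is a placeholder):

TOKEN=<TOKEN> local-ai run --p2p --federated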

To start a load-balanced server that routes requests to the network, run with the TOKEN set:

TOKEN=<TOKEN> local-ai federated

To see all the available options, run local-ai federated --help.

The instructions are displayed in the “Swarm” section of the WebUI, guiding you through the process of connecting multiple instances.

Workers mode

Note

This feature is available exclusively with llama-cpp compatible models.

This feature was introduced in LocalAI pull request #2324 and is based on the upstream work in llama.cpp pull request #6829.

To connect multiple workers to a single LocalAI instance, first start a server in p2p mode:

local-ai run --p2p

Then navigate to the “Swarm” section of the WebUI to see the instructions for connecting multiple workers to the network.


Without P2P

To start workers for distributing the computational load, run:

local-ai worker llama-cpp-rpc --llama-cpp-args="-H <listening_address> -p <listening_port> -m <memory>" 

Then specify the addresses of the workers when starting LocalAI with the LLAMACPP_GRPC_SERVERS environment variable:

LLAMACPP_GRPC_SERVERS="address1:port,address2:port" local-ai run

The workload on the LocalAI server will then be distributed across the specified nodes.
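
As a concrete sketch (addresses, ports, and memory sizes below are placeholders), two workers on different hosts could be started and then referenced from the server:

# on each worker host
local-ai worker llama-cpp-rpc --llama-cpp-args="-H 0.0.0.0 -p 50052 -m 4096"

# on the server host
LLAMACPP_GRPC_SERVERS="192.168.1.10:50052,192.168.1.11:50052" local-ai run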

Alternatively, you can build the RPC workers/server following the llama.cpp README, which is compatible with LocalAI.

Manual example (worker)

Use the WebUI to guide you in the process of starting new workers. This example shows the manual steps to highlight the process.

  1. Start the server with --p2p:
./local-ai run --p2p

Copy the token from the WebUI or via API call (e.g., curl http://localhost:8080/p2p/token) and save it for later use.

To reuse the same token later, restart the server with --p2ptoken or the P2P_TOKEN environment variable.

  2. Start the workers. Copy the local-ai binary to other hosts and run as many workers as needed using the token:
TOKEN=XXX ./local-ai worker p2p-llama-cpp-rpc --llama-cpp-args="-m <memory>" 

(Note: You can also supply the token via command-line arguments)

The server logs should indicate that new workers are being discovered.

  3. Start inference as usual on the server initiated in step 1.


Environment Variables

Several options and parameters can be tweaked using environment variables:

  • LOCALAI_P2P: Set to “true” to enable p2p
  • LOCALAI_FEDERATED: Set to “true” to enable federated mode
  • FEDERATED_SERVER: Set to “true” to enable the federated server
  • LOCALAI_P2P_DISABLE_DHT: Set to “true” to disable DHT and make the p2p layer local only (mDNS)
  • LOCALAI_P2P_ENABLE_LIMITS: Set to “true” to enable connection limits and resource management (useful when running with poor connectivity or to limit resource consumption)
  • LOCALAI_P2P_LISTEN_MADDRS: Comma-separated list of multiaddresses overriding the default libp2p 0.0.0.0 multiaddresses
  • LOCALAI_P2P_DHT_ANNOUNCE_MADDRS: Comma-separated list of multiaddresses overriding the announced listen multiaddresses (useful when the external address:port is remapped)
  • LOCALAI_P2P_BOOTSTRAP_PEERS_MADDRS: Comma-separated list of multiaddresses specifying custom DHT bootstrap nodes
  • LOCALAI_P2P_TOKEN: Token for the p2p network
  • LOCALAI_P2P_LOGLEVEL: Log level for the LocalAI p2p stack (default: info)
  • LOCALAI_P2P_LIB_LOGLEVEL: Log level for the underlying libp2p stack (default: fatal)

Architecture

LocalAI uses https://github.com/libp2p/go-libp2p under the hood, the same project powering IPFS. Unlike other frameworks, LocalAI’s peer-to-peer layer has no single master server; it relies on gossip (pub/sub) and ledger functionalities to achieve consensus across the different peers.

EdgeVPN is used as a library to establish the network and expose the ledger functionality under a shared token, simplifying automatic discovery and keeping the peer-to-peer networks separate and private.

In worker mode, the model weights are split across the workers in proportion to their available memory. In federated mode, each request is routed to a single node, and every node must load the model fully.

Debugging

To debug, it’s often useful to run in debug mode, for instance:

LOCALAI_P2P_LOGLEVEL=debug LOCALAI_P2P_LIB_LOGLEVEL=debug LOCALAI_P2P_ENABLE_LIMITS=true LOCALAI_P2P_DISABLE_DHT=true LOCALAI_P2P_TOKEN="<TOKEN>" ./local-ai ...

Notes

  • If running in p2p mode with container images, make sure you start the container with --net host or network_mode: host in the docker-compose file.
  • Only a single model is supported currently.
  • Ensure the server detects new workers before starting inference. Currently, additional workers cannot be added once inference has begun.
  • For more details on the implementation, refer to LocalAI pull request #2343.

🔈 Audio to text

Audio to text models are models that can generate text from an audio file.

The transcription endpoint allows you to convert audio files to text. The endpoint is based on whisper.cpp, a C++ library for audio transcription. The endpoint accepts as input all the audio formats supported by ffmpeg.

Usage

Once LocalAI is started and whisper models are installed, you can use the /v1/audio/transcriptions API endpoint.

For instance, with cURL:

curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@<FILE_PATH>" -F model="<MODEL_NAME>"

Example

Download one of the models from here into the models folder, and create a YAML file for your model:

name: whisper-1
backend: whisper
parameters:
  model: whisper-en

The transcriptions endpoint then can be tested like so:

## Get an example audio file
wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg

## Send the example audio file to the transcriptions endpoint
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@$PWD/gb1.ogg" -F model="whisper-1"

## Result
{"text":"My fellow Americans, this day has brought terrible news and great sadness to our country.At nine o'clock this morning, Mission Control in Houston lost contact with our Space ShuttleColumbia.A short time later, debris was seen falling from the skies above Texas.The Columbia's lost.There are no survivors.One board was a crew of seven.Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain DavidBrown, Commander William McCool, Dr. Kultna Shavla, and Elon Ramon, a colonel in the IsraeliAir Force.These men and women assumed great risk in the service to all humanity.In an age when spaceflight has come to seem almost routine, it is easy to overlook thedangers of travel by rocket and the difficulties of navigating the fierce outer atmosphere ofthe Earth.These astronauts knew the dangers, and they faced them willingly, knowing they had a highand noble purpose in life.Because of their courage and daring and idealism, we will miss them all the more.All Americans today are thinking as well of the families of these men and women who havebeen given this sudden shock and grief.You're not alone.Our entire nation agrees with you, and those you loved will always have the respect andgratitude of this country.The cause in which they died will continue.Mankind has led into the darkness beyond our world by the inspiration of discovery andthe longing to understand.Our journey into space will go on.In the skies today, we saw destruction and tragedy.As farther than we can see, there is comfort and hope.In the words of the prophet Isaiah, \"Lift your eyes and look to the heavens who createdall these, he who brings out the starry hosts one by one and calls them each by name.\"Because of his great power and mighty strength, not one of them is missing.The same creator who names the stars also knows the names of the seven souls we mourntoday.The crew of the shuttle Columbia did not return safely to Earth yet we can pray that all aresafely home.May God bless the grieving families and may God continue to bless America.[BLANK_AUDIO]"}

🔥 OpenAI functions and tools

LocalAI supports running OpenAI functions and tools API with llama.cpp compatible models.


To learn more about OpenAI functions, see also the OpenAI API blog post.

LocalAI also supports JSON mode out of the box with llama.cpp-compatible models.

💡 Check out also LocalAGI for an example on how to use LocalAI functions.

Setup

OpenAI functions are available only with ggml or gguf models compatible with llama.cpp.

You don’t need to do anything specific - just use ggml or gguf models.

Usage example

You can configure a model manually with a YAML config file in the models directory, for example:

name: gpt-3.5-turbo
parameters:
  # Model file name
  model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
  temperature: 0.1

To use the functions with the OpenAI client in python:

from openai import OpenAI

messages = [{"role": "user", "content": "What is the weather like in Beijing now?"}]
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Return the temperature of the specified region specified by the user",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "User specified region",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "temperature unit"
                    },
                },
                "required": ["location"],
            },
        },
    }
]

client = OpenAI(
    # This is the default and can be omitted
    api_key="test",
    base_url="http://localhost:8080/v1/"
)

response = client.chat.completions.create(
    messages=messages,
    tools=tools,
    tool_choice="auto",
    model="gpt-4",
)
#...
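
To inspect the returned tool call from Python, you can read the fields of the response object; this is a sketch that mirrors the structure of the return data shown further below:

# continuing the example above: inspect the first tool call
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name)       # e.g. get_current_weather
print(tool_call.function.arguments)  # e.g. {"location":"Beijing","unit":"celsius"}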

For example, with curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "What is the weather like in Beijing now?"}],
  "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Return the temperature of the specified region specified by the user",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "User specified region"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "temperature unit"
                        }
                    },
                    "required": ["location"]
                }
            }
        }
    ],
    "tool_choice":"auto"
}'

Return data:

{
    "created": 1724210813,
    "object": "chat.completion",
    "id": "16b57014-477c-4e6b-8d25-aad028a5625e",
    "model": "gpt-4",
    "choices": [
        {
            "index": 0,
            "finish_reason": "tool_calls",
            "message": {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "index": 0,
                        "id": "16b57014-477c-4e6b-8d25-aad028a5625e",
                        "type": "function",
                        "function": {
                            "name": "get_current_weather",
                            "arguments": "{\"location\":\"Beijing\",\"unit\":\"celsius\"}"
                        }
                    }
                ]
            }
        }
    ],
    "usage": {
        "prompt_tokens": 221,
        "completion_tokens": 26,
        "total_tokens": 247
    }
}

Advanced

Use functions without grammars

Function calls are automatically mapped to grammars, which are currently supported only by llama.cpp. However, it is possible to turn off the use of grammars and instead extract the tool arguments from the LLM response, by specifying no_grammar in the YAML file together with a regex to match the response from the LLM:

name: model_name
parameters:
  # Model file name
  model: model/name

function:
  # set to true to not use grammars
  no_grammar: true
  # set one or more regexes used to extract the function tool arguments from the LLM response
  response_regex:
  - "(?P<function>\w+)\s*\((?P<arguments>.*)\)"

The response regex must use named capture groups so that the function name and the arguments can be extracted. For instance, consider:

(?P<function>\w+)\s*\((?P<arguments>.*)\)

which matches output such as

function_name({ "foo": "bar"})
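
For illustration, here is how such a named-group regex extracts the function name and arguments in Python (a standalone sketch, not LocalAI code):

import re

pattern = re.compile(r'(?P<function>\w+)\s*\((?P<arguments>.*)\)')
match = pattern.search('function_name({ "foo": "bar"})')
print(match.group("function"))   # function_name
print(match.group("arguments"))  # { "foo": "bar"}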

Parallel tools calls

This feature is experimental and must be configured in the model’s YAML file by enabling function.parallel_calls:

name: gpt-3.5-turbo
parameters:
  # Model file name
  model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
  temperature: 0.1

function:
  # set to true to allow the model to call multiple functions in parallel
  parallel_calls: true

Use functions with grammar

It is also possible to specify the full function signature (for debugging, or for use with other clients).

The chat endpoint accepts the additional grammar_json_functions parameter, which takes a JSON schema object.

For example, with curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt-4",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.1,
     "grammar_json_functions": {
        "oneOf": [
            {
                "type": "object",
                "properties": {
                    "function": {"const": "create_event"},
                    "arguments": {
                        "type": "object",
                        "properties": {
                            "title": {"type": "string"},
                            "date": {"type": "string"},
                            "time": {"type": "string"}
                        }
                    }
                }
            },
            {
                "type": "object",
                "properties": {
                    "function": {"const": "search"},
                    "arguments": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string"}
                        }
                    }
                }
            }
        ]
    }
   }'

Grammars and function tools can also be used in conjunction with vision APIs:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llava", "grammar": "root ::= (\"yes\" | \"no\")",
     "messages": [{"role": "user", "content": [{"type":"text", "text": "Is there some grass in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}]}],
     "temperature": 0.9}'

💡 Examples

A full e2e example with docker-compose is available here.

💾 Stores

Stores are an experimental feature to help with querying data using similarity search. It is a low-level API that consists of only get, set, delete and find.

For example, if you have an embedding of some text and want to find text with similar embeddings, you can create embeddings for chunks of all your text and then compare them against the embedding of the text you are searching for.

An embedding here means a vector of numbers that represents some information about the text. Embeddings can be created with an AI model such as BERT, or with a more traditional method such as word frequency.

Previously you would have to integrate with an external vector database or library directly. With the stores feature you can now do it through the LocalAI API.

Note, however, that a similarity search on embeddings is just one way to do retrieval. A higher-level API can take this into account, so this may not be the best place to start.

API overview

There is an internal gRPC API and an external facing HTTP JSON API. We’ll just discuss the external HTTP API, however the HTTP API mirrors the gRPC API. Consult pkg/store/client for internal usage.

Everything is in columnar format, meaning that instead of getting an array of objects each with a key and a value, you get two separate arrays of keys and values.

Keys are arrays of floating point numbers with a maximum width of 32 bits. Values are strings (in gRPC they are bytes).

The key vectors must all be the same length, and search performs best when they are normalized. When adding keys, the store detects whether they are normalized and what length they are.
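
As a minimal sketch of normalizing a key vector before storing it (plain Python, not LocalAI-specific code):

import math

def normalize(vec):
    # scale the vector to unit length so similarity search behaves well
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

print(normalize([0.1, 0.2]))  # [0.4472..., 0.8944...]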

All endpoints accept a store field which specifies which store to operate on. Presently they are created on the fly and there is only one store backend so no configuration is required.
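
For example, to write to a named store (created on the fly if it does not exist):

curl -X POST http://localhost:8080/stores/set \
     -H "Content-Type: application/json" \
     -d '{"store": "docs", "keys": [[0.1, 0.2]], "values": ["foo"]}'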

Set

To set some keys you can do

curl -X POST http://localhost:8080/stores/set \
     -H "Content-Type: application/json" \
     -d '{"keys": [[0.1, 0.2], [0.3, 0.4]], "values": ["foo", "bar"]}'

Setting the same keys again will update their values.

On success 200 OK is returned with no body.

Get

To get some keys you can do

curl -X POST http://localhost:8080/stores/get \
     -H "Content-Type: application/json" \
     -d '{"keys": [[0.1, 0.2]]}'

Both the keys and values are returned, e.g: {"keys":[[0.1,0.2]],"values":["foo"]}

The order of the keys is not preserved! If a key does not exist then nothing is returned.

Delete

To delete keys and values you can do

curl -X POST http://localhost:8080/stores/delete \
     -H "Content-Type: application/json" \
     -d '{"keys": [[0.1, 0.2]]}'

If a key doesn’t exist then it is ignored.

On success 200 OK is returned with no body.

Find

To do a similarity search you can do

curl -X POST http://localhost:8080/stores/find \
     -H "Content-Type: application/json" \
     -d '{"topk": 2, "key": [0.2, 0.1]}'

topk limits the number of results returned. The result value is the same as for get, except that it also includes an array of similarities, where 1.0 is the maximum similarity. Results are returned in order from most similar to least.
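
Putting the pieces together, the following sketch embeds a document, stores the vector, and then searches with the embedding of a query. It assumes an embeddings model named text-embedding-ada-002 is installed, and requires jq:

# embed a document and capture the vector
EMB=$(curl -s http://localhost:8080/embeddings -H "Content-Type: application/json" \
     -d '{"model": "text-embedding-ada-002", "input": "LocalAI runs models locally"}' | jq -c '.data[0].embedding')

# store the vector together with the text it represents
curl -s -X POST http://localhost:8080/stores/set \
     -H "Content-Type: application/json" \
     -d "{\"keys\": [$EMB], \"values\": [\"LocalAI runs models locally\"]}"

# embed a query and find the most similar stored value
QUERY=$(curl -s http://localhost:8080/embeddings -H "Content-Type: application/json" \
     -d '{"model": "text-embedding-ada-002", "input": "Where do models run?"}' | jq -c '.data[0].embedding')

curl -s -X POST http://localhost:8080/stores/find \
     -H "Content-Type: application/json" \
     -d "{\"topk\": 1, \"key\": $QUERY}"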

🖼️ Model gallery

The model gallery is a curated collection of model configurations for LocalAI that enables one-click installation of models directly from the LocalAI web interface.

A list of the models available can also be browsed at the Public LocalAI Gallery.

To ease model installation, LocalAI provides a way to preload models on start and to download and install them at runtime. You can install models manually by copying them into the models directory, or use the API or the web interface to configure, download, and verify the model assets for you.

Note

The models in this gallery are not directly maintained by LocalAI. If you find a model that is not working, please open an issue on the model gallery repository.

Note

GPT and text generation models might have a license which is not permissive for commercial use or might be questionable or without any license at all. Please check the model license before using it. The official gallery contains only open licensed models.

output output

  • Open LLM Leaderboard - here you can find a list of the best-performing models on the Open LLM benchmark. Keep in mind that models compatible with LocalAI must be quantized in the gguf format.

How it works

Navigate the WebUI interface in the “Models” section from the navbar at the top. Here you can find a list of models that can be installed, and you can install them by clicking the “Install” button.

Add other galleries

You can add other galleries by setting the GALLERIES environment variable. The GALLERIES environment variable is a list of JSON objects, where each object has a name and a url field. The name field is the name of the gallery, and the url field is the URL of the gallery’s index file, for example:

GALLERIES=[{"name":"<GALLERY_NAME>", "url":"<GALLERY_URL"}]

The models in the gallery will be automatically indexed and available for installation.

API Reference

Model repositories

You can install a model at runtime, while the API is already running, or before starting the API by preloading the models.

To install a model in runtime you will need to use the /models/apply LocalAI API endpoint.

By default LocalAI is configured with the localai repository.

To use additional repositories you need to start local-ai with the GALLERIES environment variable:

GALLERIES=[{"name":"<GALLERY_NAME>", "url":"<GALLERY_URL"}]

For example, to enable the default localai repository, you can start local-ai with:

GALLERIES=[{"name":"localai", "url":"github:mudler/localai/gallery/index.yaml"}]

where github:mudler/localai/gallery/index.yaml is expanded automatically to https://raw.githubusercontent.com/mudler/LocalAI/main/gallery/index.yaml.

Note: URLs are expanded automatically for github and huggingface; plain https:// and http:// prefixes work as well.

Note

If you want to build your own gallery, there is no documentation yet; however, you can find the source of the default gallery in the LocalAI repository.

List Models

To list all the available models, use the /models/available endpoint:

curl http://localhost:8080/models/available

To search for a model, you can use jq:

curl http://localhost:8080/models/available | jq '.[] | select(.name | contains("replit"))'

curl http://localhost:8080/models/available | jq '.[] | .name | select(contains("localmodels"))'

curl http://localhost:8080/models/available | jq '.[] | .urls | select(. != null) | add | select(contains("orca"))'

How to install a model from the repositories

Models can be installed by passing the full URL of the YAML config file, or an identifier of the model in the gallery. The gallery is a repository of models that can be installed by passing the model name.

To install a model from the gallery repository, you can pass the model name in the id field. For instance, to install the bert-embeddings model, you can use the following command:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "id": "localai@bert-embeddings"
   }'  

where:

  • localai is the repository. It is optional and can be omitted. If the repository is omitted, LocalAI will search for the model by name in all the repositories. If the same model name is present in multiple galleries, the first match wins.
  • bert-embeddings is the model name in the gallery (read its config here).

If you don’t want to set any gallery repository, you can still install models by loading a model configuration file.

In the body of the request you must specify the model configuration file URL (url), optionally a name to install the model (name), extra files to install (files), and configuration overrides (overrides). When calling the API endpoint, LocalAI will download the model files and write the configuration to the folder used to store models.

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "config_url": "<MODEL_CONFIG_FILE_URL>"
   }' 
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "id": "<GALLERY>@<MODEL_NAME>"
   }' 
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_CONFIG_FILE_URL>"
   }' 

An example that installs hermes-2-pro-mistral can be:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "config_url": "https://raw.githubusercontent.com/mudler/LocalAI/v2.25.0/embedded/models/hermes-2-pro-mistral.yaml"
   }' 

The API will return a job uuid that you can use to track the job progress:

{"uuid":"1059474d-f4f9-11ed-8d99-c4cbe106d571","status":"http://localhost:8080/models/jobs/1059474d-f4f9-11ed-8d99-c4cbe106d571"}

For instance, a small example bash script that waits for a job to complete can be (requires jq):

model_url="<MODEL_CONFIG_FILE_URL>"
response=$(curl -s http://localhost:8080/models/apply -H "Content-Type: application/json" -d "{\"url\": \"$model_url\"}")

job_id=$(echo "$response" | jq -r '.uuid')

while [ "$(curl -s http://localhost:8080/models/jobs/"$job_id" | jq -r '.processed')" != "true" ]; do 
  sleep 1
done

echo "Job completed"

To preload models on start, set the PRELOAD_MODELS environment variable to a JSON array of model URIs:

PRELOAD_MODELS='[{"url": "<MODEL_URL>"}]'

Note: either url or id must be specified. url points to a model gallery configuration file, while id refers to a model inside a repository. If both are specified, the id will be used.

For example:

PRELOAD_MODELS='[{"url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"}]'

or as arg:

local-ai --preload-models '[{"url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"}]'

or in a YAML file:

local-ai --preload-models-config "/path/to/yaml"

YAML:

- url: github:mudler/LocalAI/gallery/stablediffusion.yaml@master

Note

You can find already some open licensed models in the LocalAI gallery.

If you don't find the model in the gallery you can try to use the "base" model and provide a URL to LocalAI:

curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "github:mudler/LocalAI/gallery/base.yaml@master",
     "name": "model-name",
     "files": [
        {
            "uri": "<URL>",
            "sha256": "<SHA>",
            "filename": "model"
        }
     ]
   }'

Override a model name

To install a model with a different name, specify a name parameter in the request body.

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_CONFIG_FILE>",
     "name": "<MODEL_NAME>"
   }'  

For example, to install a model as gpt-3.5-turbo:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
      "url": "github:mudler/LocalAI/gallery/gpt4all-j.yaml",
      "name": "gpt-3.5-turbo"
   }'  

Additional Files

To download additional files with the model, use the files parameter:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_CONFIG_FILE>",
     "name": "<MODEL_NAME>",
     "files": [
        {
            "uri": "<additional_file_url>",
            "sha256": "<additional_file_hash>",
            "filename": "<additional_file_name>"
        }
     ]
   }'  

Overriding configuration files

To override portions of the configuration file, such as the backend or the model file, use the overrides parameter:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_CONFIG_FILE>",
     "name": "<MODEL_NAME>",
     "overrides": {
        "backend": "llama",
        "f16": true,
        ...
     }
   }'  

Examples

Embeddings: Bert

curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "id": "bert-embeddings",
     "name": "text-embedding-ada-002"
   }'  

To test it:

LOCALAI=http://localhost:8080
curl $LOCALAI/v1/embeddings -H "Content-Type: application/json" -d '{
    "input": "Test",
    "model": "text-embedding-ada-002"
  }'

Image generation: Stable diffusion

URL: https://github.com/EdVince/Stable-Diffusion-NCNN

While the API is running, you can install the model by using the /models/apply endpoint and point it to the stablediffusion model in the models-gallery:

curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{         
     "url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"
   }'

You can set the PRELOAD_MODELS environment variable:

PRELOAD_MODELS='[{"url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"}]'

or as arg:

local-ai --preload-models '[{"url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"}]'

or in a YAML file:

local-ai --preload-models-config "/path/to/yaml"

YAML:

- url: github:mudler/LocalAI/gallery/stablediffusion.yaml@master

Test it:

curl $LOCALAI/v1/images/generations -H "Content-Type: application/json" -d '{
            "prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
            "mode": 2,  "seed":9000,
            "size": "256x256", "n":2
}'

Audio transcription: Whisper

URL: https://github.com/ggerganov/whisper.cpp

curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{         
     "url": "github:mudler/LocalAI/gallery/whisper-base.yaml@master",
     "name": "whisper-1"
   }'

You can set the PRELOAD_MODELS environment variable:

PRELOAD_MODELS='[{"url": "github:mudler/LocalAI/gallery/whisper-base.yaml@master", "name": "whisper-1"}]'

or as arg:

local-ai --preload-models '[{"url": "github:mudler/LocalAI/gallery/whisper-base.yaml@master", "name": "whisper-1"}]'

or in a YAML file:

local-ai --preload-models-config "/path/to/yaml"

YAML:

- url: github:mudler/LocalAI/gallery/whisper-base.yaml@master
  name: whisper-1

Note

LocalAI will create a batch process that downloads the required files from a model definition and automatically reloads itself to include the new model.

Input: url or id (required), name (optional), files (optional), overrides (optional)

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_DEFINITION_URL>",
     "id": "<GALLERY>@<MODEL_NAME>",
     "name": "<INSTALLED_MODEL_NAME>",
     "files": [
        {
            "uri": "<additional_file>",
            "sha256": "<additional_file_hash>",
            "filename": "<additional_file_name>"
        }
     ],
     "overrides": { "backend": "...", "f16": true }
   }'

An optional list of additional files to download can be specified within files. The name field allows you to override the model name. Finally, it is possible to override parts of the model config file with overrides.

The url is a full URL, or a github url (github:org/repo/file.yaml), or a local file (file:///path/to/file.yaml). The id is a string in the form <GALLERY>@<MODEL_NAME>, where <GALLERY> is the name of the gallery, and <MODEL_NAME> is the name of the model in the gallery. Galleries can be specified during startup with the GALLERIES environment variable.

Returns a uuid and a URL to follow up on the state of the process:

{ "uuid":"251475c9-f666-11ed-95e0-9a8a4480ac58", "status":"http://localhost:8080/models/jobs/251475c9-f666-11ed-95e0-9a8a4480ac58"}

To see a collection example of curated models definition files, see the LocalAI repository.

Get model job state /models/jobs/<uid>

This endpoint returns the state of the batch job associated to a model installation.

curl http://localhost:8080/models/jobs/<JOB_ID>

Returns a JSON object containing any error and whether the job has been processed:

{"error":null,"processed":true,"message":"completed"}

🔗 Model Context Protocol (MCP)

LocalAI now supports the Model Context Protocol (MCP), enabling powerful agentic capabilities by connecting AI models to external tools and services. This feature allows your LocalAI models to interact with various MCP servers, providing access to real-time data, APIs, and specialized tools.

What is MCP?

The Model Context Protocol is a standard for connecting AI models to external tools and data sources. It enables AI agents to:

  • Access real-time information from external APIs
  • Execute commands and interact with external systems
  • Use specialized tools for specific tasks
  • Maintain context across multiple tool interactions

Key Features

  • 🔄 Real-time Tool Access: Connect to external MCP servers for live data
  • 🛠️ Multiple Server Support: Configure both remote HTTP and local stdio servers
  • ⚡ Cached Connections: Efficient tool caching for better performance
  • 🔒 Secure Authentication: Support for bearer token authentication
  • 🎯 OpenAI Compatible: Uses the familiar /mcp/v1/chat/completions endpoint
  • 🧠 Advanced Reasoning: Configurable reasoning and re-evaluation capabilities
  • 📋 Auto-Planning: Break down complex tasks into manageable steps
  • 🎯 MCP Prompts: Specialized prompts for better MCP server interaction
  • 🔄 Plan Re-evaluation: Dynamic plan adjustment based on results
  • ⚙️ Flexible Agent Control: Customizable execution limits and retry behavior

Configuration

MCP support is configured in your model’s YAML configuration file using the mcp section:

name: my-agentic-model
backend: llama-cpp
parameters:
  model: qwen3-4b.gguf

mcp:
  remote: |
    {
      "mcpServers": {
        "weather-api": {
          "url": "https://api.weather.com/v1",
          "token": "your-api-token"
        },
        "search-engine": {
          "url": "https://search.example.com/mcp",
          "token": "your-search-token"
        }
      }
    }
  
  stdio: |
    {
      "mcpServers": {
        "file-manager": {
          "command": "python",
          "args": ["-m", "mcp_file_manager"],
          "env": {
            "API_KEY": "your-key"
          }
        },
        "database-tools": {
          "command": "node",
          "args": ["database-mcp-server.js"],
          "env": {
            "DB_URL": "postgresql://localhost/mydb"
          }
        }
      }
    }

agent:
  max_attempts: 3        # Maximum number of tool execution attempts
  max_iterations: 3     # Maximum number of reasoning iterations
  enable_reasoning: true # Enable tool reasoning capabilities
  enable_planning: false # Enable auto-planning capabilities
  enable_mcp_prompts: false # Enable MCP prompts
  enable_plan_re_evaluator: false # Enable plan re-evaluation

Configuration Options

Remote Servers (remote)

Configure HTTP-based MCP servers:

  • url: The MCP server endpoint URL
  • token: Bearer token for authentication (optional)

STDIO Servers (stdio)

Configure local command-based MCP servers:

  • command: The executable command to run
  • args: Array of command-line arguments
  • env: Environment variables (optional)

Agent Configuration (agent)

Configure agent behavior and tool execution:

  • max_attempts: Maximum number of tool execution attempts (default: 3)
  • max_iterations: Maximum number of reasoning iterations (default: 3)
  • enable_reasoning: Enable tool reasoning capabilities (default: false)
  • enable_planning: Enable auto-planning capabilities (default: false)
  • enable_mcp_prompts: Enable MCP prompts (default: false)
  • enable_plan_re_evaluator: Enable plan re-evaluation (default: false)

Usage

API Endpoint

Use the MCP-enabled completion endpoint:

curl http://localhost:8080/mcp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-agentic-model",
    "messages": [
      {"role": "user", "content": "What is the current weather in New York?"}
    ],
    "temperature": 0.7
  }'

Example Response

{
  "id": "chatcmpl-123",
  "created": 1699123456,
  "model": "my-agentic-model",
  "choices": [
    {
      "text": "The current weather in New York is 72°F (22°C) with partly cloudy skies. The humidity is 65% and there's a light breeze from the west at 8 mph."
    }
  ],
  "object": "text_completion"
}

Example Configurations

Docker-based Tools

name: docker-agent
backend: llama-cpp
parameters:
  model: qwen3-4b.gguf

mcp:
  stdio: |
    {
      "mcpServers": {
        "searxng": {
          "command": "docker",
          "args": [
            "run", "-i", "--rm",
            "quay.io/mudler/tests:duckduckgo-localai"
          ]
        }
      }
    }

agent:
  max_attempts: 5
  max_iterations: 5
  enable_reasoning: true
  enable_planning: true
  enable_mcp_prompts: true
  enable_plan_re_evaluator: true

Agent Configuration Details

The agent section controls how the AI model interacts with MCP tools:

Execution Control

  • max_attempts: Limits how many times a tool can be retried if it fails. Higher values provide more resilience but may increase response time.
  • max_iterations: Controls the maximum number of reasoning cycles the agent can perform. More iterations allow for complex multi-step problem solving.

Reasoning Capabilities

  • enable_reasoning: When enabled, the agent uses advanced reasoning to better understand tool results and plan next steps.

Planning Capabilities

  • enable_planning: When enabled, the agent uses auto-planning to break down complex tasks into manageable steps and execute them systematically. The agent will automatically detect when planning is needed.
  • enable_mcp_prompts: When enabled, the agent uses specialized prompts exposed by the MCP servers to interact with the exposed tools.
  • enable_plan_re_evaluator: When enabled, the agent can re-evaluate and adjust its execution plan based on intermediate results.

Recommended Configurations

  • Simple tasks: max_attempts: 2, max_iterations: 2, enable_reasoning: false, enable_planning: false
  • Complex tasks: max_attempts: 5, max_iterations: 5, enable_reasoning: true, enable_planning: true, enable_mcp_prompts: true
  • Advanced planning: max_attempts: 5, max_iterations: 5, enable_reasoning: true, enable_planning: true, enable_mcp_prompts: true, enable_plan_re_evaluator: true
  • Development/Debugging: max_attempts: 1, max_iterations: 1, enable_reasoning: true, enable_planning: true
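
For instance, the “Complex tasks” profile above corresponds to an agent block like this:

agent:
  max_attempts: 5
  max_iterations: 5
  enable_reasoning: true
  enable_planning: true
  enable_mcp_prompts: true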

How It Works

  1. Tool Discovery: LocalAI connects to configured MCP servers and discovers available tools
  2. Tool Caching: Tools are cached per model for efficient reuse
  3. Agent Execution: The AI model uses the Cogito framework to execute tools
  4. Response Generation: The model generates responses incorporating tool results

Supported MCP Servers

LocalAI is compatible with any MCP-compliant server.

Best Practices

Security

  • Use environment variables for sensitive tokens
  • Validate MCP server endpoints before deployment
  • Implement proper authentication for remote servers

Performance

  • Cache frequently used tools
  • Use appropriate timeout values for external APIs
  • Monitor resource usage for stdio servers

Error Handling

  • Implement fallback mechanisms for tool failures
  • Log tool execution for debugging
  • Handle network timeouts gracefully

With External Applications

Use MCP-enabled models in your applications:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/mcp/v1",
    api_key="your-api-key"
)

response = client.chat.completions.create(
    model="my-agentic-model",
    messages=[
        {"role": "user", "content": "Analyze the latest research papers on AI"}
    ]
)

MCP and adding packages

It can be handy to install packages before starting the container, to set up the environment. Here is an example of how to do that with docker-compose (installing and configuring Docker inside the container):

services:
  local-ai:
    image: localai/localai:latest
    #image: localai/localai:latest-gpu-nvidia-cuda-12
    container_name: local-ai
    restart: always
    entrypoint: [ "/bin/bash" ]
    command: >
     -c "apt-get update &&
         apt-get install -y docker.io &&
         /entrypoint.sh"
    environment:
      - DEBUG=true
      - LOCALAI_WATCHDOG_IDLE=true
      - LOCALAI_WATCHDOG_BUSY=true
      - LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m
      - LOCALAI_WATCHDOG_BUSY_TIMEOUT=15m
      - LOCALAI_API_KEY=my-beautiful-api-key
      - DOCKER_HOST=tcp://docker:2376
      - DOCKER_TLS_VERIFY=1
      - DOCKER_CERT_PATH=/certs/client
    ports:
      - "8080:8080"
    volumes:
      - /data/models:/models
      - /data/backends:/backends
      - certs:/certs:ro
    # uncomment for nvidia
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - capabilities: [gpu]
    #           device_ids: ['7']
    # runtime: nvidia

  docker:
    image: docker:dind
    privileged: true
    container_name: docker
    volumes:
      - certs:/certs
    healthcheck:
      test: ["CMD", "docker", "info"]
      interval: 10s
      timeout: 5s
volumes:
  certs:

An example model config (to append to any existing model you have) can be:

mcp:
  stdio: |
     {
      "mcpServers": {
        "weather": {
          "command": "docker",
          "args": [
            "run", "-i", "--rm",
            "ghcr.io/mudler/mcps/weather:master"
          ]
        },
        "memory": {
          "command": "docker",
          "env": {
            "MEMORY_FILE_PATH": "/data/memory.json"
          },
          "args": [
            "run", "-i", "--rm", "-v", "/host/data:/data",
            "ghcr.io/mudler/mcps/memory:master"
          ]
        },
        "ddg": {
          "command": "docker",
          "env": {
            "MAX_RESULTS": "10"
          },
          "args": [
            "run", "-i", "--rm", "-e", "MAX_RESULTS",
            "ghcr.io/mudler/mcps/duckduckgo:master"
          ]
        }
      }
     }