LocalAI provides a comprehensive set of features for running AI models locally. This section covers all the capabilities and functionalities available in LocalAI.
Core Features
Text Generation - Generate text with GPT-compatible models using various backends
Image Generation - Create images with Stable Diffusion and other diffusion models
Audio Processing - Transcribe audio to text and generate speech from text
Embeddings - Generate vector embeddings for semantic search and RAG applications
GPT Vision - Analyze and understand images with vision-language models
Advanced Features
OpenAI Functions - Use function calling and tools API with local models
Model Gallery - Browse and install pre-configured models
Backends - Learn about available backends and how to manage them
Getting Started
To start using these features, make sure you have LocalAI installed and have downloaded some models. Then explore the feature pages above to learn how to use each capability.
Subsections of Features
⚙️ Backends
LocalAI supports a variety of backends that can be used to run different types of AI models. There are core Backends which are included, and there are containerized applications that provide the runtime environment for specific model types, such as LLMs, diffusion models, or text-to-speech models.
Managing Backends in the UI
The LocalAI web interface provides an intuitive way to manage your backends:
Navigate to the “Backends” section in the navigation menu
Browse available backends from configured galleries
Use the search bar to find specific backends by name, description, or type
Filter backends by type using the quick filter buttons (LLM, Diffusion, TTS, Whisper)
Install or delete backends with a single click
Monitor installation progress in real-time
Each backend card displays:
Backend name and description
Type of models it supports
Installation status
Action buttons (Install/Delete)
Additional information via the info button
Backend Galleries
Backend galleries are repositories that contain backend definitions. They work similarly to model galleries but are specifically for backends.
Adding a Backend Gallery
You can add backend galleries by specifying the LOCALAI_BACKEND_GALLERIES environment variable:
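For example, assuming the same JSON list of name/url objects used for model galleries (see the Model gallery section), a sketch might look like this (gallery name and URL are illustrative):

LOCALAI_BACKEND_GALLERIES='[{"name":"my-backend-gallery","url":"https://raw.githubusercontent.com/<org>/<repo>/main/index.yaml"}]'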
This section contains instructions on how to use LocalAI with GPU acceleration.
Details
Acceleration for AMD or Metal hardware is still in development; for additional details, see the build documentation.
Automatic Backend Detection
When you install a model from the gallery (or a YAML file), LocalAI intelligently detects the required backend and your system’s capabilities, then downloads the correct version for you. Whether you’re running on a standard CPU, an NVIDIA GPU, an AMD GPU, or an Intel GPU, LocalAI handles it automatically.
For advanced use cases or to override auto-detection, you can use the LOCALAI_FORCE_META_BACKEND_CAPABILITY environment variable. Here are the available options:
default: Forces CPU-only backend. This is the fallback if no specific hardware is detected.
nvidia: Forces backends compiled with CUDA support for NVIDIA GPUs.
amd: Forces backends compiled with ROCm support for AMD GPUs.
intel: Forces backends compiled with SYCL/oneAPI support for Intel GPUs.
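For example, to force the CUDA-enabled backends on a system where auto-detection does not pick them up, you could start LocalAI like this (a sketch; combine it with your usual startup flags and environment):

LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia local-ai run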
Model configuration
Depending on the model architecture and backend used, there might be different ways to enable GPU acceleration. It is required to configure the model you intend to use with a YAML config file. For example, for llama.cpp workloads a configuration file might look like this (where gpu_layers is the number of layers to offload to the GPU):
name: my-model-name
parameters:
  # Relative to the models path
  model: llama.cpp-model.ggmlv3.q5_K_M.bin
context_size: 1024
threads: 1
f16: true # enable with GPU acceleration
gpu_layers: 22 # GPU Layers (only used when built with cublas)
For diffusers, it might look like this instead:
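A minimal sketch of such a diffusers configuration, using the cuda and f16 options described in the Diffusers section below (the model name is illustrative, and the exact placement of the diffusers-specific options may differ between versions):

name: animagine-xl
backend: diffusers
parameters:
  model: Linaqruf/animagine-xl
f16: true
diffusers:
  cuda: true
  scheduler_type: euler_a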
CUDA 11 tags: master-gpu-nvidia-cuda-11, v1.40.0-gpu-nvidia-cuda-11, …
CUDA 12 tags: master-gpu-nvidia-cuda-12, v1.40.0-gpu-nvidia-cuda-12, …
In addition to the commands to run LocalAI normally, you need to specify --gpus all to docker, for example:
docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v1.40.0-gpu-nvidia-cuda12
If the GPU inferencing is working, you should be able to see something like:
5:22PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla T4
llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 4321.77 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1598 MB
...................................................................................................
llama_init_from_file: kv self size = 512.00 MB
ROCM(AMD) acceleration
There are a limited number of tested configurations for ROCm systems; however, most newer dedicated consumer-grade GPUs seem to be supported under the current ROCm 6 implementation.
Due to the nature of ROCm, it is best to run all implementations in containers, as this limits the number of packages required for installation on the host system. Compatibility and package versions for dependencies across all variations of OS must be tested independently if desired; please refer to the build documentation.
Install to the host: amdgpu-dkms and rocm >= 6.0.0, as per the ROCm documentation.
Recommendations
Make sure not to use a GPU assigned for compute for desktop rendering.
Ensure at least 100GB of free space on the disk hosting the container runtime and storing images, prior to installation.
Limitations
Ongoing verification testing of ROCm compatibility with integrated backends.
Please note the following list of verified backends and devices.
LocalAI hipblas images are built against the following targets: gfx900,gfx906,gfx908,gfx940,gfx941,gfx942,gfx90a,gfx1030,gfx1031,gfx1100,gfx1101
If your device is not one of these you must specify the corresponding GPU_TARGETS and specify REBUILD=true. Otherwise you don’t need to specify these in the commands below.
Verified
The devices in the following list have been tested with hipblas images running ROCm 6.0.0
| Backend | Verified | Devices |
|---|---|---|
| llama.cpp | yes | Radeon VII (gfx906) |
| diffusers | yes | Radeon VII (gfx906) |
| piper | yes | Radeon VII (gfx906) |
| whisper | no | none |
| bark | no | none |
| coqui | no | none |
| transformers | no | none |
| exllama | no | none |
| exllama2 | no | none |
| mamba | no | none |
| sentencetransformers | no | none |
| transformers-musicgen | no | none |
| vall-e-x | no | none |
| vllm | no | none |
You can help by expanding this list.
System Prep
Check that your GPU's LLVM target is compatible with the version of ROCm. This can be found in the LLVM Docs.
Check which ROCm version is compatible with your LLVM target and your chosen OS (pay special attention to supported kernel versions). See the following for compatibility for ROCm 6.0.0 or ROCm 6.0.2.
Install your chosen version of the dkms and rocm packages (it is recommended to use the native package manager for this process on any OS, as version changes are executed more easily via this method if updates are required). Take care to restart after installing amdgpu-dkms and before installing rocm; for details, see the installation documentation for your chosen OS (6.0.2 or 6.0.0).
Deploy. Yes it’s that easy.
Setup Example (Docker/containerd)
The following are examples of the ROCm specific configuration elements required.
# For full functionality select a non-'core' image, version locking the image is recommended for debug purposes.
image: quay.io/go-skynet/local-ai:master-aio-gpu-hipblas
environment:
  - DEBUG=true
  # If your GPU is not already included in the current list of default targets, the following build details are required.
  - REBUILD=true
  - BUILD_TYPE=hipblas
  - GPU_TARGETS=gfx906 # Example for Radeon VII
devices:
  # AMD GPUs only require the following devices be passed through to the container for offloading to occur.
  - /dev/dri
  - /dev/kfd
The same can also be executed as a run command for your container runtime.
Please ensure to add all other required environment variables, port forwardings, etc to your compose file or run command.
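A minimal sketch of the equivalent docker run command, using the same image, environment variables, and device passthrough as the compose snippet above (adjust ports, volumes, and GPU targets to your setup):

docker run --rm -ti \
  -p 8080:8080 \
  -v $PWD/models:/models \
  -e DEBUG=true \
  -e REBUILD=true -e BUILD_TYPE=hipblas -e GPU_TARGETS=gfx906 \
  --device /dev/dri --device /dev/kfd \
  quay.io/go-skynet/local-ai:master-aio-gpu-hipblas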
The rebuild process will take some time to complete when deploying these containers and it is recommended that you pull the image prior to deployment as depending on the version these images may be ~20GB in size.
Example (k8s) (Advanced Deployment/WIP)
For k8s deployments there is an additional step required before deployment: the deployment of the ROCm/k8s-device-plugin.
For any k8s environment, following the documentation provided by AMD for the ROCm project should be successful. It is recommended that if you use rke2 or OpenShift, you deploy the SUSE or Red Hat provided version of this resource to ensure compatibility.
After this has been completed the helm chart from go-skynet can be configured and deployed mostly un-edited.
The following are details of the changes that should be made to ensure proper function.
While these details may be configurable in the values.yaml, development of this Helm chart is ongoing and it is subject to change.
The following details indicate the final state of the localai deployment relevant to GPU function.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {NAME}-local-ai
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - env:
            - name: HIP_VISIBLE_DEVICES
              value: '0'
              # This variable indicates the devices available to the container (0:device1 1:device2 2:device3) etc.
              # For multiple devices (say device 1 and 3) the value would be equivalent to HIP_VISIBLE_DEVICES="0,2"
              # Please take note of this when an iGPU is present in the host system as compatibility is not assured.
          ...
          resources:
            limits:
              amd.com/gpu: '1'
            requests:
              amd.com/gpu: '1'
This configuration has been tested on a ‘custom’ cluster managed by SUSE Rancher that was deployed on top of Ubuntu 22.04.4, certification of other configuration is ongoing and compatibility is not guaranteed.
Notes
When installing the ROCm kernel driver on your system, ensure that you are installing a version equal to or newer than the one currently implemented in LocalAI (6.0.0 at time of writing).
AMD documentation indicates that this will ensure functionality; however, your mileage may vary depending on the GPU and distro you are using.
If you encounter an Error 413 on attempting to upload an audio file or image for whisper or llava/bakllava on a k8s deployment, note that the ingress for your deployment may require the annotation nginx.ingress.kubernetes.io/proxy-body-size: "25m" to allow larger uploads. This may be included in future versions of the helm chart.
Intel acceleration (sycl)
Requirements
If building from source, you need to install Intel oneAPI Base Toolkit and have the Intel drivers available in the system.
Container images
To use SYCL, use the images with gpu-intel in the tag, for example v3.7.0-gpu-intel, …
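A run sketch for such an image; the Intel GPU device passthrough shown here is an assumption based on standard Intel GPU container usage, and the image repository is illustrative (use the same repository as your other LocalAI images):

docker run --rm -ti --device /dev/dri -p 8080:8080 -v $PWD/models:/models localai/localai:v3.7.0-gpu-intel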
LocalAI supports generating text with GPT-compatible models using llama.cpp and other backends (such as rwkv.cpp); see also Model compatibility for an up-to-date list of the supported model families.
Note:
You can also specify the model name as part of the OpenAI token.
If only one model is available, the API will use it for all the requests.
To generate a completion, you can send a POST request to the /v1/completions endpoint with the instruction as per the request body:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
Available additional parameters: top_p, top_k, max_tokens
List models
You can list all the models available with:
curl http://localhost:8080/v1/models
Backends
RWKV
RWKV support is available through llama.cpp (see below)
llama.cpp
llama.cpp is a popular port of Facebook’s LLaMA model in C/C++.
Note
The ggml file format has been deprecated. If you are using ggml models, use a LocalAI version older than v2.25.0. For gguf models, use the llama backend. The go backend is deprecated as well but is still available as go-llama.
Features
The llama.cpp model supports the following features:
Prompt templates are useful for models that are fine-tuned towards a specific prompt.
Automatic setup
LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for ggml or gguf models.
For instance, if you have the galleries enabled and LocalAI already running, you can just start chatting with models in huggingface by running:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.1
}'
LocalAI will automatically download and configure the model in the model directory.
Models can be also preloaded or downloaded on demand. To learn about model galleries, check out the model gallery documentation.
YAML configuration
To use the llama.cpp backend, specify llama-cpp as the backend in the YAML file:
name: llama
backend: llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf
Backend Options
The llama.cpp backend supports additional configuration options that can be specified in the options field of your model YAML configuration. These options allow fine-tuning of the backend behavior:
| Option | Type | Description | Example |
|---|---|---|---|
| use_jinja or jinja | boolean | Enable Jinja2 template processing for chat templates. When enabled, the backend uses Jinja2-based chat templates from the model for formatting messages. | use_jinja:true |
| context_shift | boolean | Enable context shifting, which allows the model to dynamically adjust context window usage. | context_shift:true |
| cache_ram | integer | Set the maximum RAM cache size in MiB for KV cache. Use -1 for unlimited (default). | cache_ram:2048 |
| parallel or n_parallel | integer | Enable parallel request processing. When set to a value greater than 1, enables continuous batching for handling multiple requests concurrently. | parallel:4 |
| grpc_servers or rpc_servers | string | Comma-separated list of gRPC server addresses for distributed inference. Allows distributing workload across multiple llama.cpp workers. | |
Note: The parallel option can also be set via the LLAMACPP_PARALLEL environment variable, and grpc_servers can be set via the LLAMACPP_GRPC_SERVERS environment variable. Options specified in the YAML file take precedence over environment variables.
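A minimal sketch of how these options might appear in a model YAML file, combining the table above with the base configuration shown earlier (values are illustrative, and the key:value list format is assumed to match the options field used elsewhere in this document):

name: llama
backend: llama-cpp
parameters:
  model: file.gguf
options:
  - use_jinja:true
  - parallel:4
  - cache_ram:2048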
Exllama is “a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights”. Both exllama and exllama2 are supported.
Model setup
Download the model as a folder inside the model directory and create a YAML file specifying the exllama backend. For instance with the TheBloke/WizardLM-7B-uncensored-GPTQ model:
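A minimal sketch of such a configuration (the model value is assumed to match the name of the downloaded model folder):

name: exllama
backend: exllama
parameters:
  model: WizardLM-7B-uncensored-GPTQ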
The backend will automatically download the required files in order to run the model.
Usage
Use the completions endpoint by specifying the vllm backend:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "vllm",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
Transformers
Transformers is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.
LocalAI has a built-in integration with Transformers, and it can be used to run models.
This is an extra backend: it is already available in the container images (the extra images already contain the Python dependencies for Transformers), so there is nothing to do for the setup.
Setup
Create a YAML file for the model you want to use with transformers.
To set up a model, you just need to specify the model name in the YAML config file:
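A minimal sketch, assuming a Hugging Face model identifier (the model name is illustrative; the type field corresponds to the Type parameter described below):

name: transformers-model
backend: transformers
parameters:
  model: "facebook/opt-125m"
type: AutoModelForCausalLM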
The backend will automatically download the required files in order to run the model.
Parameters
Type

| Type | Description |
|---|---|
| AutoModelForCausalLM | A model that can be used to generate sequences. Use it for NVIDIA CUDA and Intel GPU with Intel Extensions for Pytorch acceleration. |
| OVModelForCausalLM | For Intel CPU/GPU/NPU OpenVINO Text Generation models. |
| OVModelForFeatureExtraction | For Intel CPU/GPU/NPU OpenVINO Embedding acceleration. |
| N/A | Defaults to AutoModel. |
OVModelForCausalLM requires OpenVINO IR Text Generation models from Hugging face
OVModelForFeatureExtraction works with any Safetensors Transformer Feature Extraction model from Huggingface (Embedding Model)
Please note that streaming is currently not implemented in AutoModelForCausalLM for Intel GPU.
AMD GPU support is not implemented.
Although AMD CPU is not officially supported by OpenVINO there are reports that it works: YMMV.
Embeddings
Use embeddings: true if the model is an embedding model
Inference device selection
The Transformers backend tries to automatically select the best device for inference; you can override this decision manually with the main_gpu parameter.
| Inference Engine | Applicable Values |
|---|---|
| CUDA | cuda, cuda.X where X is the GPU device number as in the nvidia-smi -L output |
| OpenVINO | Any applicable value from Inference Modes like AUTO, CPU, GPU, NPU, MULTI, HETERO |
Example for CUDA:
main_gpu: cuda.0
Example for OpenVINO:
main_gpu: AUTO:-CPU
This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.
Inference Precision
The Transformers backend automatically selects the fastest applicable inference precision according to device support.
The CUDA backend can manually enable bfloat16 if your hardware supports it, with the following parameter:
f16: true
Quantization
| Quantization | Description |
|---|---|
| bnb_8bit | 8-bit quantization |
| bnb_4bit | 4-bit quantization |
| xpu_8bit | 8-bit quantization for Intel XPUs |
| xpu_4bit | 4-bit quantization for Intel XPUs |
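A usage sketch, assuming the configuration key is quantization (the value is taken from the table above):

quantization: bnb_4bit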
Trust Remote Code
Some models, like Microsoft Phi-3, require external code beyond what is provided by the transformers library.
By default it is disabled for security.
It can be manually enabled with:
trust_remote_code: true
Maximum Context Size
The maximum context size can be specified with the context_size parameter. Do not use values higher than what your model supports.
Usage example:
context_size: 8192
Auto Prompt Template
Usually the chat template is defined by the model author in the tokenizer_config.json file.
To enable it use the use_tokenizer_template: true parameter in the template section.
Usage example:
template:
use_tokenizer_template: true
Custom Stop Words
Stop words are usually defined in the tokenizer_config.json file.
They can be overridden with the stopwords parameter if needed, as in the case of the Llama 3 Instruct model.
Usage example:
stopwords:
- "<|eot_id|>"
- "<|end_of_text|>"
Usage
Use the completions endpoint by specifying the transformers model:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "transformers",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
Examples
OpenVINO
A model configuration file for an OpenVINO Starling model:
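A sketch combining the OpenVINO-related parameters described above (the model repository and stop words below are illustrative; adjust them to the actual OpenVINO model you use):

name: starling-openvino
backend: transformers
parameters:
  model: fakezeta/Starling-LM-7B-beta-openvino-int8
type: OVModelForCausalLM
main_gpu: AUTO
trust_remote_code: true
context_size: 8192
template:
  use_tokenizer_template: true
stopwords:
- "<|end_of_turn|>"
- "<|endoftext|>"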
A reranking model, often referred to as a cross-encoder, is a core component in the two-stage retrieval systems used in information retrieval and natural language processing tasks.
Given a query and a set of documents, it will output similarity scores.
We can then use the scores to reorder the documents by relevance in our RAG system, to increase its overall accuracy and filter out non-relevant results.
LocalAI supports reranker models, and you can use them by using the rerankers backend, which uses rerankers.
Usage
You can test rerankers by using container images with python (this does NOT work with core images) and a model config file like this, or by installing cross-encoder from the gallery in the UI:
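A minimal sketch of such a model config file (the model name is illustrative):

name: jina-reranker-v1-base-en
backend: rerankers
parameters:
  model: cross-encoder

And a request sketch against the reranker endpoint (the payload fields and documents are illustrative; check the API reference for the exact schema):

curl http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jina-reranker-v1-base-en",
    "query": "Organic skincare products for sensitive skin",
    "documents": ["Eco-friendly kitchenware", "Natural organic skincare range for sensitive skin", "New makeup trends"],
    "top_n": 2
  }'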
aplay is a Linux command. You can use other tools to play the audio file.
The model name is the filename with the extension.
The model name is case sensitive.
LocalAI must be compiled with the GO_TAGS=tts flag.
Transformers-musicgen
LocalAI also has experimental support for transformers-musicgen for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:
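A sketch of such a request, reusing the /tts endpoint shown elsewhere in this section (the model name and prompt are illustrative, and the backend request field is an assumption mirroring the backend-specific TTS examples below):

curl --request POST \
  --url http://localhost:8080/tts \
  --header 'Content-Type: application/json' \
  --data '{
    "backend": "transformers-musicgen",
    "model": "facebook/musicgen-small",
    "input": "Cello Rave"
  }' | aplay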
Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.
Vall-E-X
VALL-E-X is an open source implementation of Microsoft’s VALL-E X zero-shot TTS model.
Setup
The backend will automatically download the required files in order to run the model.
This is an extra backend: it is already available in the container images, so there is nothing to do for the setup. If you are building manually, you need to install Vall-E-X first.
Usage
Use the tts endpoint by specifying the vall-e-x backend:
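A minimal request sketch (the input text is illustrative; the backend request field is an assumption mirroring the musicgen example above):

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "vall-e-x",
  "input": "Hello, this is a test!"
}' | aplay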
In order to use voice cloning capabilities you must create a YAML configuration file to setup a model:
name: cloned-voice
backend: vall-e-x
parameters:
  model: "cloned-voice"
tts:
  vall-e:
    # The path to the audio file to be cloned
    # relative to the models directory
    # Max 15s
    audio_path: "audio-sample.wav"
Then you can specify the model name in the requests:
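For example (the input text is illustrative):

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "cloned-voice",
  "input": "Hello, this is spoken with the cloned voice!"
}' | aplay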
You can also use a config-file to specify TTS models and their parameters.
In the following example we define a custom config to load the xtts_v2 model, and specify a voice and language.
name: xtts_v2
backend: coqui
parameters:
  language: fr
  model: tts_models/multilingual/multi-dataset/xtts_v2
tts:
  voice: Ana Florence
With this config, you can now use the following curl command to generate a text-to-speech audio file:
curl -L http://localhost:8080/tts \
-H "Content-Type: application/json"\
-d '{
"model": "xtts_v2",
"input": "Bonjour, je suis Ana Florence. Comment puis-je vous aider?"
}' | aplay
Response format
To provide some compatibility with the OpenAI API regarding response_format, ffmpeg must be installed (or a Docker image that includes ffmpeg must be used) so that the generated wav file can be converted before the API returns its response.
Warning regarding a change in behaviour: before this addition, the parameter was ignored and a wav file was always returned, which could cause codec errors later in the integration (such as trying to decode an mp3 when the data is actually wav; mp3 is the default format used by OpenAI).
Supported formats, thanks to ffmpeg, are wav, mp3, aac, flac, and opus, defaulting to wav if an unknown format or no format is provided.
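A request sketch using response_format (the format value is taken from the list above; the input text is illustrative):

curl -L http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "xtts_v2",
    "input": "Bonjour, je suis Ana Florence.",
    "response_format": "mp3"
  }' -o output.mp3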
Note: To set a negative prompt, you can split the prompt with |, for instance: a cute baby sea otter|malformed.
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
"size": "256x256"
}'
Backends
stablediffusion-ggml
This backend is based on stable-diffusion.cpp. Every model supported by that project is also supported by LocalAI.
Setup
There are already several models in the gallery that you can install to get up and running with this backend. For example, you can run Flux by searching for it in the model gallery (flux.1-dev-ggml) or by starting LocalAI with:
local-ai run flux.1-dev-ggml
To use a custom model, you can follow these steps:
Create a model file stablediffusion.yaml in the models folder (a minimal sketch is shown after these steps)
Download the required assets to the models repository
Start LocalAI
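A minimal sketch of such a stablediffusion.yaml, assuming a quantized Stable Diffusion GGUF asset downloaded into the models directory (the file name and parameter values are illustrative):

name: stablediffusion
backend: stablediffusion-ggml
parameters:
  model: stable-diffusion-v1-5-pruned-emaonly-Q4_0.gguf
step: 25
cfg_scale: 4.5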
Diffusers
Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. LocalAI has a diffusers backend which allows image generation using the diffusers library.
This is an extra backend: it is already available in the container images, so there is nothing to do for the setup. Do not use core images (ending with -core). If you are building manually, see the build instructions.
Model setup
The models will be downloaded automatically from Hugging Face the first time you use the backend.
Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl with CPU:
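A sketch of such a configuration for CPU usage (the diffusers-specific block below reflects the parameters documented in the table that follows; double-check the exact layout against your LocalAI version):

name: animagine-xl
backend: diffusers
parameters:
  model: Linaqruf/animagine-xl
f16: false
diffusers:
  cuda: false
  scheduler_type: euler_a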
The following parameters are available in the configuration file:
| Parameter | Description | Default |
|---|---|---|
| f16 | Force the usage of float16 instead of float32 | false |
| step | Number of steps to run the model for | 30 |
| cuda | Enable CUDA acceleration | false |
| enable_parameters | Parameters to enable for the model | negative_prompt,num_inference_steps,clip_skip |
| scheduler_type | Scheduler type | k_dpp_sde |
| cfg_scale | Configuration scale | 8 |
| clip_skip | Clip skip | None |
| pipeline_type | Pipeline type | AutoPipelineForText2Image |
| lora_adapters | A list of lora adapters (file names relative to model directory) to apply | None |
| lora_scales | A list of lora scales (floats) to apply | None |
Several types of schedulers are available:
| Scheduler | Description |
|---|---|
| ddim | DDIM |
| pndm | PNDM |
| heun | Heun |
| unipc | UniPC |
| euler | Euler |
| euler_a | Euler a |
| lms | LMS |
| k_lms | LMS Karras |
| dpm_2 | DPM2 |
| k_dpm_2 | DPM2 Karras |
| dpm_2_a | DPM2 a |
| k_dpm_2_a | DPM2 a Karras |
| dpmpp_2m | DPM++ 2M |
| k_dpmpp_2m | DPM++ 2M Karras |
| dpmpp_sde | DPM++ SDE |
| k_dpmpp_sde | DPM++ SDE Karras |
| dpmpp_2m_sde | DPM++ 2M SDE |
| k_dpmpp_2m_sde | DPM++ 2M SDE Karras |
Available pipeline types:
| Pipeline type | Description |
|---|---|
| StableDiffusionPipeline | Stable diffusion pipeline |
| StableDiffusionImg2ImgPipeline | Stable diffusion image to image pipeline |
| StableDiffusionDepth2ImgPipeline | Stable diffusion depth to image pipeline |
| DiffusionPipeline | Diffusion pipeline |
| StableDiffusionXLPipeline | Stable diffusion XL pipeline |
| StableVideoDiffusionPipeline | Stable video diffusion pipeline |
| AutoPipelineForText2Image | Automatic detection pipeline for text to image |
| VideoDiffusionPipeline | Video diffusion pipeline |
| StableDiffusion3Pipeline | Stable diffusion 3 pipeline |
| FluxPipeline | Flux pipeline |
| FluxTransformer2DModel | Flux transformer 2D model |
| SanaPipeline | Sana pipeline |
Advanced: Additional parameters
Additional arbitrary parameters can be specified in the options field as key/value pairs separated by a colon (:):
name: animagine-xl
options:
- "cfg_scale:6"
Note: there is no complete parameter list. Any parameter can be passed arbitrarily and is handed directly to the model as an argument to the pipeline. Different pipelines/implementations support different parameters.
The example above will result in the following Python code when generating images:
pipe(
    prompt="A cute baby sea otter",  # Options passed via API
    size="256x256",                  # Options passed via API
    cfg_scale=6                      # Additional parameter passed via configuration file
)
Usage
Text to Image
Use the image generation endpoint with the model name from the configuration file:
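For example (the prompt and size are illustrative; the model name matches the configuration file above):

curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A cute baby sea otter",
    "model": "animagine-xl",
    "size": "256x256"
  }'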
LocalAI supports object detection through various backends. This feature allows you to identify and locate objects within images with high accuracy and real-time performance. Currently, RF-DETR is available as an implementation.
Overview
Object detection in LocalAI is implemented through dedicated backends that can identify and locate objects within images. Each backend provides different capabilities and model architectures.
Key Features:
Real-time object detection
High accuracy detection with bounding boxes
Support for multiple hardware accelerators (CPU, NVIDIA GPU, Intel GPU, AMD GPU)
Structured detection results with confidence scores
Easy integration through the /v1/detection endpoint
Usage
Detection Endpoint
LocalAI provides a dedicated /v1/detection endpoint for object detection tasks. This endpoint is specifically designed for object detection and returns structured detection results with bounding boxes and confidence scores.
API Reference
To perform object detection, send a POST request to the /v1/detection endpoint. Each detection in the response includes:
x, y: Coordinates of the bounding box top-left corner
width, height: Dimensions of the bounding box
confidence: Detection confidence score (0.0 to 1.0)
class_name: The detected object class
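A sketch of what a detection result might look like, using only the fields documented above (the top-level detections wrapper and the values are assumptions for illustration; check the API reference for the exact schema):

{
  "detections": [
    {
      "x": 34.0,
      "y": 52.5,
      "width": 120.0,
      "height": 80.0,
      "confidence": 0.92,
      "class_name": "dog"
    }
  ]
}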
Backends
RF-DETR Backend
The RF-DETR backend is implemented as a Python-based gRPC service that integrates seamlessly with LocalAI. It provides object detection capabilities using the RF-DETR model architecture and supports multiple hardware configurations:
CPU: Optimized for CPU inference
NVIDIA GPU: CUDA acceleration for NVIDIA GPUs
Intel GPU: Intel oneAPI optimization
AMD GPU: ROCm acceleration for AMD GPUs
NVIDIA Jetson: Optimized for ARM64 NVIDIA Jetson devices
Setup
Using the Model Gallery (Recommended)
The easiest way to get started is using the model gallery. The rfdetr-base model is available in the official LocalAI gallery:
# Install and run the rfdetr-base model
local-ai run rfdetr-base
You can also install it through the web interface by navigating to the Models section and searching for “rfdetr-base”.
Manual Configuration
Create a model configuration file in your models directory:
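A minimal sketch of such a configuration (the backend name rfdetr is an assumption; check the backend gallery for the exact identifier):

name: rfdetr-base
backend: rfdetr
parameters:
  model: rfdetr-base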
Verify model compatibility with your backend version
Low Detection Accuracy
Ensure good image quality and lighting
Check if objects are clearly visible
Consider using a larger model for better accuracy
Slow Performance
Enable GPU acceleration if available
Use a smaller model for faster inference
Optimize image resolution
Debug Mode
Enable debug logging for troubleshooting:
local-ai run --debug rfdetr-base
Object Detection Category
LocalAI includes a dedicated object-detection category for models and backends that specialize in identifying and locating objects within images. This category currently includes:
Additional object detection models and backends will be added to this category in the future. You can filter models by the object-detection tag in the model gallery to find all available object detection models.
The sentencetransformers backend is an optional backend of LocalAI and uses Python. If you are running LocalAI from the containers you are good to go and should be already configured for use.
For local execution, you also have to specify the extra backend in the EXTERNAL_GRPC_BACKENDS environment variable.
The sentencetransformers backend supports only embeddings of text, not of tokens. If you need to embed tokens you can use the bert backend or llama.cpp.
No models are required to be downloaded before using the sentencetransformers backend. The models will be downloaded automatically the first time the API is used.
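A minimal sketch of a sentencetransformers embedding model configuration (the model name is illustrative):

name: my-embeddings
backend: sentencetransformers
embeddings: true
parameters:
  model: all-MiniLM-L6-v2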
Llama.cpp embeddings
Embeddings with llama.cpp are supported with the llama-cpp backend; they need to be enabled by setting embeddings to true.
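A minimal sketch (the GGUF file name is illustrative; place the file in the models directory):

name: my-gguf-embeddings
backend: llama-cpp
embeddings: true
parameters:
  model: bge-small-en-v1.5.Q4_K_M.gguf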
The chat endpoint supports the grammar parameter, which allows users to specify a grammar in Backus-Naur Form (BNF). This feature enables the Large Language Model (LLM) to generate outputs adhering to a user-defined schema, such as JSON, YAML, or any other format that can be defined using BNF. For more details about BNF, see Backus-Naur Form on Wikipedia.
Note
Compatibility Notice: This feature is only supported by models that use the llama.cpp backend. For a complete list of compatible models, refer to the Model Compatibility page. For technical details, see the related pull requests: PR #1773 and PR #1887.
Setup
To use this feature, follow the installation and setup instructions on the LocalAI Functions page. Ensure that your local setup meets all the prerequisites specified for the llama.cpp backend.
💡 Usage Example
The following example demonstrates how to use the grammar parameter to constrain the model’s output to either “yes” or “no”. This can be particularly useful in scenarios where the response format needs to be strictly controlled.
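A request sketch, reusing the yes/no grammar shown later in this documentation (the model name and question are illustrative):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Do you like apples?"}],
  "grammar": "root ::= (\"yes\" | \"no\")"
}'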
In this example, the grammar parameter is set to a simple choice between “yes” and “no”, ensuring that the model’s response adheres strictly to one of these options regardless of the context.
Example: JSON Output Constraint
You can also use grammars to enforce JSON output format:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Generate a person object with name and age"}],
"grammar": "root ::= \"{\" \"\\\"name\\\":\" string \",\\\"age\\\":\" number \"}\"\nstring ::= \"\\\"\" [a-z]+ \"\\\"\"\nnumber ::= [0-9]+"
}'
This functionality enables LocalAI to distribute inference requests across multiple worker nodes, improving efficiency and performance. Nodes are automatically discovered and connect via p2p by using a shared token which makes sure the communication is secure and private between the nodes of the network.
LocalAI supports two modes of distributed inferencing via p2p:
Federated Mode: Requests are shared between the cluster and routed to a single worker node in the network based on the load balancer’s decision.
Worker Mode (aka “model sharding” or “splitting weights”): Requests are processed by all the workers which contributes to the final inference result (by sharing the model weights).
A list of global instances shared by the community is available at explorer.localai.io.
Usage
Starting LocalAI with --p2p generates a shared token for connecting multiple instances, and that's all you need to create AI clusters, eliminating the need for intricate network setups.
Simply navigate to the “Swarm” section in the WebUI and follow the on-screen instructions.
For fully shared instances, initiate LocalAI with --p2p --federated and adhere to the Swarm section's guidance. This feature, while still experimental, offers a tech preview quality experience.
Federated mode
Federated mode allows you to launch multiple LocalAI instances and connect them together in a federated network. This mode is useful when you want to distribute the load of the inference across multiple nodes, but you want to have a single point of entry for the API. In the Swarm section of the WebUI, you can see the instructions to connect multiple instances together.
To start a LocalAI server in federated mode, run:
local-ai run --p2p --federated
This will generate a token that you can use to connect other LocalAI instances to the network or others can use to join the network. If you already have a token, you can specify it using the TOKEN environment variable.
To start a load balanced server that routes the requests to the network, run with the TOKEN:
local-ai federated
To see all the available options, run local-ai federated --help.
The instructions are displayed in the “Swarm” section of the WebUI, guiding you through the process of connecting multiple instances.
Workers mode
Note
This feature is available exclusively with llama-cpp compatible models.
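A sketch of the typical worker-mode flow (the worker subcommand name below is an assumption based on the LocalAI CLI; check local-ai worker --help for the exact name in your version):

# 1. Start a LocalAI server with p2p enabled; this prints the shared token
local-ai run --p2p

# 2. On each worker node, join the network with the same token
TOKEN=<shared-token> local-ai worker p2p-llama-cpp-rpc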
(Note: You can also supply the token via command-line arguments)
The server logs should indicate that new workers are being discovered.
Start inference as usual on the server initiated in step 1.
Environment Variables
There are options that can be tweaked and parameters that can be set using environment variables:
| Environment Variable | Description |
|---|---|
| LOCALAI_P2P | Set to "true" to enable p2p |
| LOCALAI_FEDERATED | Set to "true" to enable federated mode |
| FEDERATED_SERVER | Set to "true" to enable federated server |
| LOCALAI_P2P_DISABLE_DHT | Set to "true" to disable DHT and enable p2p layer to be local only (mDNS) |
| LOCALAI_P2P_ENABLE_LIMITS | Set to "true" to enable connection limits and resources management (useful when running with poor connectivity or to limit resource consumption) |
| LOCALAI_P2P_LISTEN_MADDRS | Set to a comma-separated list of multiaddresses to override the default libp2p 0.0.0.0 multiaddresses |
| LOCALAI_P2P_DHT_ANNOUNCE_MADDRS | Set to a comma-separated list of multiaddresses to override announcing of listen multiaddresses (useful when the external address:port is remapped) |
| LOCALAI_P2P_BOOTSTRAP_PEERS_MADDRS | Set to a comma-separated list of multiaddresses to specify custom DHT bootstrap nodes |
| LOCALAI_P2P_TOKEN | Set the token for the p2p network |
| LOCALAI_P2P_LOGLEVEL | Set the loglevel for the LocalAI p2p stack (default: info) |
| LOCALAI_P2P_LIB_LOGLEVEL | Set the loglevel for the underlying libp2p stack (default: fatal) |
Architecture
LocalAI uses https://github.com/libp2p/go-libp2p under the hood, the same project powering IPFS. Unlike other frameworks, LocalAI uses peer-to-peer networking without a single master server; rather, it uses pub/sub gossip and ledger functionalities to achieve consensus across the different peers.
EdgeVPN is used as a library to establish the network and expose the ledger functionality under a shared token, to ease automatic discovery and to have separate, private peer-to-peer networks.
When running in worker mode, the weights are split proportionally to the available memory; in federated mode, each request is routed to a single node, which has to load the model fully.
Debugging
To debug, it’s often useful to run in debug mode, for instance:
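A sketch combining the DEBUG variable used elsewhere in this documentation with the p2p log levels from the table above:

DEBUG=true LOCALAI_P2P_LOGLEVEL=debug local-ai run --p2p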
Audio to text models are models that can generate text from an audio file.
The transcription endpoint allows you to convert audio files to text. The endpoint is based on whisper.cpp, a C++ library for audio transcription. The endpoint accepts as input all the audio formats supported by ffmpeg.
Usage
Once LocalAI is started and whisper models are installed, you can use the /v1/audio/transcriptions API endpoint.
The transcriptions endpoint can then be tested like so:
## Get an example audio file
wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg

## Send the example audio file to the transcriptions endpoint
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@$PWD/gb1.ogg" -F model="whisper-1"

## Result
{"text":"My fellow Americans, this day has brought terrible news and great sadness to our country. At nine o'clock this morning, Mission Control in Houston lost contact with our Space Shuttle Columbia. A short time later, debris was seen falling from the skies above Texas. The Columbia's lost. There are no survivors. One board was a crew of seven. Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain David Brown, Commander William McCool, Dr. Kultna Shavla, and Elon Ramon, a colonel in the Israeli Air Force. These men and women assumed great risk in the service to all humanity. In an age when spaceflight has come to seem almost routine, it is easy to overlook the dangers of travel by rocket and the difficulties of navigating the fierce outer atmosphere of the Earth. These astronauts knew the dangers, and they faced them willingly, knowing they had a high and noble purpose in life. Because of their courage and daring and idealism, we will miss them all the more. All Americans today are thinking as well of the families of these men and women who have been given this sudden shock and grief. You're not alone. Our entire nation agrees with you, and those you loved will always have the respect and gratitude of this country. The cause in which they died will continue. Mankind has led into the darkness beyond our world by the inspiration of discovery and the longing to understand. Our journey into space will go on. In the skies today, we saw destruction and tragedy. As farther than we can see, there is comfort and hope. In the words of the prophet Isaiah, \"Lift your eyes and look to the heavens who created all these, he who brings out the starry hosts one by one and calls them each by name.\" Because of his great power and mighty strength, not one of them is missing. The same creator who names the stars also knows the names of the seven souls we mourn today. The crew of the shuttle Columbia did not return safely to Earth yet we can pray that all are safely home. May God bless the grieving families and may God continue to bless America. [BLANK_AUDIO]"}
Function calls are automatically mapped to grammars, which are currently supported only by llama.cpp. However, it is possible to turn off the use of grammars and instead extract the tool arguments from the LLM responses, by specifying no_grammar and a regex to map the response in the YAML file:
name: model_name
parameters:
  # Model file name
  model: model/name
function:
  # set to true to not use grammars
  no_grammar: true
  # set one or more regexes used to extract the function tool arguments from the LLM response
  response_regex:
  - "(?P<function>\w+)\s*\((?P<arguments>.*)\)"
The response regex has to be a regex with named parameters, so that the function name and the arguments can be extracted. For instance, consider:
(?P<function>\w+)\s*\((?P<arguments>.*)\)
will catch
function_name({ "foo": "bar"})
Parallel tools calls
This feature is experimental and has to be configured in the YAML of the model by enabling function.parallel_calls:
name: gpt-3.5-turbo
parameters:
  # Model file name
  model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
  temperature: 0.1
function:
  # set to true to allow the model to call multiple functions in parallel
  parallel_calls: true
Use functions with grammar
It is possible to also specify the full function signature (for debugging, or to use with other clients).
The chat endpoint accepts the grammar_json_functions additional parameter which takes a JSON schema object.
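A request sketch using grammar_json_functions with a small JSON schema (the schema, function name, and model name are illustrative):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "What is the weather like in Boston?"}],
  "grammar_json_functions": {
    "oneOf": [
      {
        "type": "object",
        "properties": {
          "function": {"const": "get_weather"},
          "arguments": {
            "type": "object",
            "properties": {
              "location": {"type": "string"}
            }
          }
        }
      }
    ]
  }
}'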
Grammars and function tools can be used as well in conjunction with vision APIs:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llava", "grammar": "root ::= (\"yes\" | \"no\")",
"messages": [{"role": "user", "content": [{"type":"text", "text": "Is there some grass in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
💡 Examples
A full e2e example with docker-compose is available here.
💾 Stores
Stores are an experimental feature to help with querying data using similarity search. It is a low-level API that consists of only get, set, delete and find.
For example, if you have an embedding of some text and want to find text with similar embeddings, you can create embeddings for chunks of all your text and then compare them against the embedding of the text you are searching for.
An embedding here means a vector of numbers that represents some information about the text. The embeddings are created from an A.I. model such as BERT, or with a more traditional method such as word frequency.
Previously you would have to integrate with an external vector database or library directly.
With the stores feature you can now do it through the LocalAI API.
Note however that doing a similarity search on embeddings is just one way to do retrieval. A higher level
API can take this into account, so this may not be the best place to start.
API overview
There is an internal gRPC API and an external-facing HTTP JSON API. We'll just discuss the external HTTP API here; however, the HTTP API mirrors the gRPC API. Consult pkg/store/client for internal usage.
Everything is in columnar format, meaning that instead of getting an array of objects each with a key and a value, you get two separate arrays of keys and values.
Keys are arrays of floating point numbers with a maximum width of 32 bits. Values are strings (in gRPC they are bytes).
The key vectors must all be the same length, and it's best for search performance if they are normalized. When adding keys, it is detected whether they are normalized and what length they are.
All endpoints accept a store field which specifies which store to operate on. Presently they are created
on the fly and there is only one store backend so no configuration is required.
topk limits the number of results returned. The result value is the same as for get, except that it also includes an array of similarities, where 1.0 is the maximum similarity. Results are returned in order from most similar to least.
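A sketch of how the HTTP API might be exercised (the /stores/set and /stores/find paths and the field names below are assumptions based on the columnar key/value format described above; check the API reference for the exact schema):

# Store two vectors with their associated values
curl http://localhost:8080/stores/set -H "Content-Type: application/json" -d '{
  "store": "my-store",
  "keys": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
  "values": ["first chunk of text", "second chunk of text"]
}'

# Find the entries most similar to a query vector
curl http://localhost:8080/stores/find -H "Content-Type: application/json" -d '{
  "store": "my-store",
  "key": [0.1, 0.2, 0.3],
  "topk": 2
}'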
🖼️ Model gallery
The model gallery is a curated collection of model configurations for LocalAI that enables one-click installation of models directly from the LocalAI web interface.
To ease model installation, LocalAI provides a way to preload models on start and to download and install them at runtime. You can install models manually by copying them into the models directory, or you can use the API or the web interface to configure, download and verify the model assets for you.
Note
The models in this gallery are not directly maintained by LocalAI. If you find a model that is not working, please open an issue on the model gallery repository.
Note
GPT and text generation models might have a license which is not permissive for commercial use or might be questionable or without any license at all. Please check the model license before using it. The official gallery contains only open licensed models.
Useful Links and resources
Open LLM Leaderboard - here you can find a list of the best-performing models on the Open LLM benchmark. Keep in mind that models compatible with LocalAI must be quantized in the gguf format.
How it works
Navigate to the "Models" section of the WebUI from the navbar at the top. Here you can find a list of models that can be installed, and you can install them by clicking the "Install" button.
Add other galleries
You can add other galleries by setting the GALLERIES environment variable. The GALLERIES environment variable is a list of JSON objects, where each object has a name and a url field. The name field is the name of the gallery, and the url field is the URL of the gallery’s index file, for example:
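For example (the gallery name follows the localai repository name referenced later in this section):

GALLERIES='[{"name":"localai", "url":"github:mudler/localai/gallery/index.yaml"}]'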
where github:mudler/localai/gallery/index.yaml will be expanded automatically to https://raw.githubusercontent.com/mudler/LocalAI/main/index.yaml.
Note: the URLs are expanded automatically for github and huggingface; however, https:// and http:// prefixes work as well.
Note
If you want to build your own gallery, there is no documentation yet. However you can find the source of the default gallery in the LocalAI repository.
List Models
To list all the available models, use the /models/available endpoint:
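For example:

curl http://localhost:8080/models/available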
Models can be installed by passing the full URL of the YAML config file, or an identifier of the model in the gallery. The gallery is a repository of models that can be installed by passing the model name.
To install a model from the gallery repository, you can pass the model name in the id field. For instance, to install the bert-embeddings model, you can use the following command:
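For example, using the /models/apply endpoint described later in this section:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "localai@bert-embeddings"
}'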
localai is the repository. It is optional and can be omitted. If the repository is omitted, LocalAI will search for the model by name in all the repositories. If the same model name is present in multiple galleries, the first match wins.
bert-embeddings is the model name in the gallery
(read its config here).
How to install a model not part of a gallery
If you don’t want to set any gallery repository, you can still install models by loading a model configuration file.
In the body of the request you must specify the model configuration file URL (url), optionally a name to install the model (name), extra files to install (files), and configuration overrides (overrides). When calling the API endpoint, LocalAI will download the models files and write the configuration to the folder used to store models.
To preload models on start instead, use the PRELOAD_MODELS environment variable, setting it to a JSON array of model URIs:
PRELOAD_MODELS='[{"url": "<MODEL_URL>"}]'
Note: url or id must be specified. url points to a model gallery configuration file, while id refers to a model inside a repository. If both are specified, the id will be used.
While the API is running, you can install the model by using the /models/apply endpoint and point it to the stablediffusion model in the models-gallery:
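A request sketch (the gallery URL below is illustrative; use the actual stablediffusion entry from the models-gallery):

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "url": "github:go-skynet/model-gallery/stablediffusion.yaml"
}'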
LocalAI will create a batch process that downloads the required files from the model definition and automatically reloads itself to include the new model.
Input: url or id (required), name (optional), files (optional)
An optional list of additional files to download can be specified within files. The name field allows overriding the model name. Finally, it is possible to override the model config file with overrides.
The url is a full URL, or a github url (github:org/repo/file.yaml), or a local file (file:///path/to/file.yaml).
The id is a string in the form <GALLERY>@<MODEL_NAME>, where <GALLERY> is the name of the gallery, and <MODEL_NAME> is the name of the model in the gallery. Galleries can be specified during startup with the GALLERIES environment variable.
Returns a uuid and a url that can be used to follow up on the state of the process:
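A sketch of such a response (the uuid and the job URL are illustrative):

{"uuid":"251475c9-f666-11ed-95e0-9a8a4480ac58", "status":"http://localhost:8080/models/jobs/251475c9-f666-11ed-95e0-9a8a4480ac58"}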
LocalAI now supports the Model Context Protocol (MCP), enabling powerful agentic capabilities by connecting AI models to external tools and services. This feature allows your LocalAI models to interact with various MCP servers, providing access to real-time data, APIs, and specialized tools.
What is MCP?
The Model Context Protocol is a standard for connecting AI models to external tools and data sources. It enables AI agents to:
Access real-time information from external APIs
Execute commands and interact with external systems
Use specialized tools for specific tasks
Maintain context across multiple tool interactions
Key Features
🔄 Real-time Tool Access: Connect to external MCP servers for live data
🛠️ Multiple Server Support: Configure both remote HTTP and local stdio servers
⚡ Cached Connections: Efficient tool caching for better performance
🔒 Secure Authentication: Support for bearer token authentication
🎯 OpenAI Compatible: Uses the familiar /mcp/v1/chat/completions endpoint
🧠 Advanced Reasoning: Configurable reasoning and re-evaluation capabilities
📋 Auto-Planning: Break down complex tasks into manageable steps
🎯 MCP Prompts: Specialized prompts for better MCP server interaction
🔄 Plan Re-evaluation: Dynamic plan adjustment based on results
⚙️ Flexible Agent Control: Customizable execution limits and retry behavior
Configuration
MCP support is configured in your model’s YAML configuration file using the mcp section:
enable_plan_re_evaluator: Enable plan re-evaluation (default: false)
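A rough sketch of the overall shape of such a configuration (the agent option names are those documented in this section; the values and the contents of the mcp block are illustrative and depend on the MCP servers you configure):

name: my-agentic-model
# ... usual model configuration (backend, parameters, etc.) ...
mcp:
  # remote (HTTP) and/or local (stdio) MCP server definitions go here
agent:
  max_attempts: 3
  max_iterations: 5
  enable_reasoning: true
  enable_planning: true
  enable_mcp_prompts: true
  enable_plan_re_evaluator: false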
Usage
API Endpoint
Use the MCP-enabled completion endpoint:
curl http://localhost:8080/mcp/v1/chat/completions \
-H "Content-Type: application/json"\
-d '{
"model": "my-agentic-model",
"messages": [
{"role": "user", "content": "What is the current weather in New York?"}
],
"temperature": 0.7
}'
Example Response
{
  "id": "chatcmpl-123",
  "created": 1699123456,
  "model": "my-agentic-model",
  "choices": [
    {
      "text": "The current weather in New York is 72°F (22°C) with partly cloudy skies. The humidity is 65% and there's a light breeze from the west at 8 mph."
    }
  ],
  "object": "text_completion"
}
The agent section controls how the AI model interacts with MCP tools:
Execution Control
max_attempts: Limits how many times a tool can be retried if it fails. Higher values provide more resilience but may increase response time.
max_iterations: Controls the maximum number of reasoning cycles the agent can perform. More iterations allow for complex multi-step problem solving.
Reasoning Capabilities
enable_reasoning: When enabled, the agent uses advanced reasoning to better understand tool results and plan next steps.
Planning Capabilities
enable_planning: When enabled, the agent uses auto-planning to break down complex tasks into manageable steps and execute them systematically. The agent will automatically detect when planning is needed.
enable_mcp_prompts: When enabled, the agent uses specialized prompts exposed by the MCP servers to interact with the exposed tools.
enable_plan_re_evaluator: When enabled, the agent can re-evaluate and adjust its execution plan based on intermediate results.
Tool Discovery: LocalAI connects to configured MCP servers and discovers available tools
Tool Caching: Tools are cached per model for efficient reuse
Agent Execution: The AI model uses the Cogito framework to execute tools
Response Generation: The model generates responses incorporating tool results
Supported MCP Servers
LocalAI is compatible with any MCP-compliant server.
Best Practices
Security
Use environment variables for sensitive tokens
Validate MCP server endpoints before deployment
Implement proper authentication for remote servers
Performance
Cache frequently used tools
Use appropriate timeout values for external APIs
Monitor resource usage for stdio servers
Error Handling
Implement fallback mechanisms for tool failures
Log tool execution for debugging
Handle network timeouts gracefully
With External Applications
Use MCP-enabled models in your applications:
import openai
client = openai.OpenAI(
    base_url="http://localhost:8080/mcp/v1",
    api_key="your-api-key"
)
response = client.chat.completions.create(
model="my-agentic-model",
messages=[
{"role": "user", "content": "Analyze the latest research papers on AI"}
]
)
MCP and adding packages
It might be handy to install packages before starting the container to set up the environment. This is an example of how you can do that with docker-compose (installing and configuring docker).