The free, open-source alternative to OpenAI and Anthropic. Your all-in-one complete AI stack: run powerful language models, autonomous agents, and document intelligence locally on your hardware.
Drop-in replacement for the OpenAI API - a modular suite of tools that work seamlessly together or independently.
Start with LocalAI’s OpenAI-compatible API, extend with LocalAGI’s autonomous agents, and enhance with LocalRecall’s semantic search - all running locally on your hardware.
Open Source MIT Licensed.
Why Choose LocalAI?
OpenAI API Compatible - Run AI models locally with our modular ecosystem. From language models to autonomous agents and semantic search, build your complete AI stack without the cloud.
Key Features
LLM Inferencing: LocalAI is a free, Open Source OpenAI alternative. Run LLMs, generate images, audio and more locally with consumer grade hardware.
Agentic-first: Extend LocalAI with LocalAGI, an autonomous AI agent platform that runs locally, no coding required. Build and deploy autonomous agents with ease.
Memory and Knowledge base: Extend LocalAI with LocalRecall, a local REST API for semantic search and memory management. Perfect for AI applications.
OpenAI Compatible: Drop-in replacement for OpenAI API. Compatible with existing applications and libraries.
No GPU Required: Run on consumer grade hardware. No need for expensive GPUs or cloud services.
Multiple Models: Support for various model families including LLMs, image generation, and audio models. Supports multiple backends for inferencing.
Privacy Focused: Keep your data local. No data leaves your machine, ensuring complete privacy.
Easy Setup: Simple installation and configuration. Get started in minutes with binaries, Docker, Podman, Kubernetes, or a local installation.
Community Driven: Active community support and regular updates. Contribute and help shape the future of LocalAI.
Quick Start
Docker is the recommended installation method for most users:
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest
LocalAI is your complete AI stack for running AI models locally. It’s designed to be simple, efficient, and accessible, providing a drop-in replacement for OpenAI’s API while keeping your data private and secure.
Why LocalAI?
In today’s AI landscape, privacy, control, and flexibility are paramount. LocalAI addresses these needs by:
Privacy First: Your data never leaves your machine
Complete Control: Run models on your terms, with your hardware
Open Source: MIT licensed and community-driven
Flexible Deployment: From laptops to servers, with or without GPUs
Extensible: Add new models and features as needed
Core Components
LocalAI is more than just a single tool - it’s a complete ecosystem:
LocalAI can be installed in several ways. Docker is the recommended installation method for most users as it provides the easiest setup and works across all platforms.
Recommended: Docker Installation
The quickest way to get started with LocalAI is using Docker:
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest
For complete installation instructions including Docker, macOS, Linux, Kubernetes, and building from source, see the Installation guide.
Key Features
Text Generation: Run various LLMs locally
Image Generation: Create images with stable diffusion
Audio Processing: Text-to-speech and speech-to-text
Vision API: Image understanding and analysis
Embeddings: Vector database support
Functions: OpenAI-compatible function calling
MCP Support: Model Context Protocol for agentic capabilities
LocalAI can be installed in multiple ways depending on your platform and preferences.
Tip
Recommended: Docker Installation
Docker is the recommended installation method for most users as it works across all platforms (Linux, macOS, Windows) and provides the easiest setup experience. It’s the fastest way to get started with LocalAI.
Installation Methods
Choose the installation method that best suits your needs:
Docker ⭐ Recommended - Works on all platforms, easiest setup
Text Generation: LLM models for chat and completion
Image Generation: Stable Diffusion models
Text to Speech: TTS models
Speech to Text: Whisper models
Embeddings: Vector embedding models
Function Calling: Support for OpenAI-compatible function calling
The AIO images use OpenAI-compatible model names (like gpt-4, gpt-4-vision-preview) but are backed by open-source models. See the container images documentation for the complete mapping.
Next Steps
After installation:
Access the WebUI at http://localhost:8080
Check available models: curl http://localhost:8080/v1/models
Set to "true" to make the instance a worker (p2p token is required)
FEDERATED
Set to "true" to share the instance with the federation (p2p token is required)
FEDERATED_SERVER
Set to "true" to run the instance as a federation server which forwards requests to the federation (p2p token is required)
Image Selection
The installer will automatically detect your GPU and select the appropriate image. By default, it uses the standard images without extra Python dependencies. You can customize the image selection:
USE_AIO=true: Use all-in-one images that include all dependencies
USE_VULKAN=true: Use Vulkan GPU support instead of vendor-specific GPU support
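For example, to ask the installer for the all-in-one images, you could export the variable when piping the script to the shell (a hedged sketch; the exact set of options the installer reads may change between versions):
# request AIO images from the install script (illustrative)
curl https://localai.io/install.sh | USE_AIO=true sh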
Uninstallation
To uninstall LocalAI installed via the script:
curl https://localai.io/install.sh | sh -s -- --uninstall
Manual Installation
Download Binary
You can manually download the appropriate binary for your system from the releases page:
LocalAI can be built as a container image or as a single, portable binary. Note that some model architectures might require Python libraries, which are not included in the binary.
LocalAI’s extensible architecture allows you to add your own backends, which can be written in any language. As such, the container images also contain the Python dependencies needed to run all the available backends (for example, to run backends like Diffusers, which can generate images and videos from text).
This section contains instructions on how to build LocalAI from source.
Build LocalAI locally
Requirements
In order to build LocalAI locally, you need the following requirements:
Golang >= 1.21
GCC
GRPC
To install the dependencies, follow the instructions below for your platform.
macOS: install Xcode from the App Store, then run:
brew install go protobuf protoc-gen-go protoc-gen-go-grpc wget
Debian/Ubuntu:
apt install golang make protobuf-compiler-grpc
After you have Golang installed and working, you can install the required binaries for compiling the Golang protobuf components via the following commands:
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
make build
Build
To build LocalAI with make:
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make build
This should produce the binary local-ai
Container image
Requirements:
Docker or podman, or a container engine
In order to build the LocalAI container image locally you can use docker, for example:
docker build -t localai .
docker run localai
Example: Build on mac
Building on Mac (M1, M2 or M3) works, but you may need to install some prerequisites using brew.
The below has been tested by one mac user and found to work. Note that this doesn’t use Docker to run the server:
Install Xcode from the App Store (needed for MetalKit).
If you encounter errors regarding a missing utility metal, install Xcode from the App Store.
If, after installing Xcode, you still receive the error 'xcrun: error: unable to find utility "metal", not a developer tool or in PATH', you may have installed the Xcode command line tools before Xcode itself; the command line tools point to an incomplete SDK.
If completions are slow, ensure that gpu-layers in your model yaml matches the number of layers from the model in use (or simply use a high number such as 256).
If you get a compile error: error: only virtual member functions can be marked 'final', reinstall all the necessary brew packages, clean the build, and try again.
brew reinstall go grpc protobuf wget
make clean
make build
Build backends
LocalAI has several backends available for installation in the backend gallery. The backends can also be built from source. As backends vary in language and in the dependencies they require, the documentation provides generic guidance for a few of the backends, which can be applied with some slight modifications to the others.
Manually
Typically each backend includes a Makefile which allows packaging the backend.
In the LocalAI repository, for instance you can build bark-cpp by doing:
git clone https://github.com/go-skynet/LocalAI.git
make -C LocalAI/backend/go/bark-cpp build package
Similarly, for a Python-based backend such as vllm:
make -C LocalAI/backend/python/vllm
With Docker
Building with Docker is simpler, as it abstracts away all the requirements and focuses on building the final OCI images that are available in the gallery. This also allows you, for instance, to build a backend locally and install it with LocalAI. You can refer to Backends for general guidance on how to install and develop backends.
In the LocalAI repository, you can build bark-cpp by doing:
git clone https://github.com/go-skynet/LocalAI.git
make docker-build-bark-cpp
Note that make is used only for convenience; in reality it just wraps an equivalent docker build command.
LocalAI is a free, open-source alternative to OpenAI (Anthropic, etc.), functioning as a drop-in replacement REST API for local inferencing. It allows you to run LLMs, generate images, and produce audio, all locally or on-premises with consumer-grade hardware, supporting multiple model families and architectures.
Tip
Security considerations
If you are exposing LocalAI remotely, make sure you protect the API endpoints adequately, either with a mechanism that filters incoming traffic (such as a reverse proxy or firewall) or by running LocalAI with API_KEY set to gate access with an API key. Note that an API key grants full access to all features (there is no role separation), so it should be treated like an admin credential.
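As a minimal sketch (the key value here is illustrative), you can pass the key via the API_KEY environment variable when starting the container and then send it as an OpenAI-style Bearer token:
# start LocalAI with an API key (illustrative value)
docker run -p 8080:8080 -e API_KEY=my-secret-key --name local-ai -ti localai/localai:latest
# authenticate requests with the same key
curl http://localhost:8080/v1/models -H "Authorization: Bearer my-secret-key"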
Once installed, start LocalAI. For Docker installations:
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest
The API will be available at http://localhost:8080.
Downloading models on start
When starting LocalAI (either via Docker or via CLI) you can specify as argument a list of models to install automatically before starting the API, for example:
local-ai run llama-3.2-1b-instruct:q4_k_m
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
local-ai run ollama://gemma:2b
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
local-ai run oci://localai/phi-2:latest
Tip
Automatic Backend Detection: When you install models from the gallery or YAML files, LocalAI automatically detects your system’s GPU capabilities (NVIDIA, AMD, Intel) and downloads the appropriate backend. For advanced configuration options, see GPU Acceleration.
For a full list of options, you can run LocalAI with --help or refer to the Linux Installation guide for installer configuration options.
Using LocalAI and the full stack with LocalAGI
LocalAI is part of the Local family stack, along with LocalAGI and LocalRecall.
LocalAGI is a powerful, self-hostable AI Agent platform designed for maximum privacy and flexibility, which encompasses and uses the whole software stack. It provides a complete drop-in replacement for OpenAI’s Responses APIs with advanced agentic capabilities, working entirely locally on consumer-grade hardware (CPU and GPU).
Quick Start
git clone https://github.com/mudler/LocalAGI
cd LocalAGI
# CPU setup
docker compose up
# NVIDIA GPU setup
docker compose -f docker-compose.nvidia.yaml up
# Intel GPU setup
docker compose -f docker-compose.intel.yaml up
# Start with a specific model
MODEL_NAME=gemma-3-12b-it docker compose up
# NVIDIA GPU setup with custom multimodal and image models
MODEL_NAME=gemma-3-12b-it \
MULTIMODAL_MODEL=minicpm-v-4_5 \
IMAGE_MODEL=flux.1-dev-ggml \
docker compose -f docker-compose.nvidia.yaml up
Key Features
Privacy-Focused: All processing happens locally, ensuring your data never leaves your machine
Flexible Deployment: Supports CPU, NVIDIA GPU, and Intel GPU configurations
Multiple Model Support: Compatible with various models from Hugging Face and other sources
Web Interface: User-friendly chat interface for interacting with AI agents
Advanced Capabilities: Supports multimodal models, image generation, and more
Docker Integration: Easy deployment using Docker Compose
Environment Variables
You can customize your LocalAGI setup using the following environment variables:
MODEL_NAME: Specify the model to use (e.g., gemma-3-12b-it)
There is much more to explore with LocalAI! You can run any model from Hugging Face, perform video generation, and also voice cloning. For a comprehensive overview, check out the features section.
Explore additional resources and community contributions:
This section covers everything you need to know about installing and configuring models in LocalAI. You’ll learn multiple methods to get models running.
Prerequisites
LocalAI installed and running (see Quickstart if you haven’t set it up yet)
Basic understanding of command line usage
Method 1: Using the Model Gallery (Easiest)
The Model Gallery is the simplest way to install models. It provides pre-configured models ready to use.
# List available models
local-ai models list
# Install a specific model
local-ai models install llama-3.2-1b-instruct:q4_k_m
# Start LocalAI with a model from the gallery
local-ai run llama-3.2-1b-instruct:q4_k_m
To run models available in the LocalAI gallery, you can use the model name as the URI. For example, to run LocalAI with the Hermes model, execute:
local-ai run hermes-2-theta-llama-3-8b
To install only the model, use:
local-ai models install hermes-2-theta-llama-3-8b
Note: The galleries available in LocalAI can be customized to point to a different URL or a local directory. For more information on how to setup your own gallery, see the Gallery Documentation.
Browse Online
Visit models.localai.io to browse all available models in your browser.
Method 1.5: Import Models via WebUI
The WebUI provides a powerful model import interface that supports both simple and advanced configuration:
Simple Import Mode
Open the LocalAI WebUI at http://localhost:8080
Click “Import Model”
Enter the model URI (e.g., https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct-GGUF)
Optionally configure preferences:
Backend selection
Model name
Description
Quantizations
Embeddings support
Custom preferences
Click “Import Model” to start the import process
Advanced Import Mode
For full control over model configuration:
In the WebUI, click “Import Model”
Toggle to “Advanced Mode”
Edit the YAML configuration directly in the code editor
Use the “Validate” button to check your configuration
Click “Create” or “Update” to save
The advanced editor includes:
Syntax highlighting
YAML validation
Format and copy tools
Full configuration options
This is especially useful for:
Custom model configurations
Fine-tuning model parameters
Setting up complex model setups
Editing existing model configurations
Method 2: Installing from Hugging Face
LocalAI can directly install models from Hugging Face:
# Install and run a model from Hugging Face
local-ai run huggingface://TheBloke/phi-2-GGUF
The format is: huggingface://<repository>/<model-file> (the model file is optional)
Examples
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
Method 3: Installing from OCI Registries
Ollama Registry
local-ai run ollama://gemma:2b
Standard OCI Registry
local-ai run oci://localai/phi-2:latest
Run Models via URI
To run models via URI, specify a URI to a model file or a configuration file when starting LocalAI. Valid syntax includes:
From OCIs: oci://container_image:tag, ollama://model_id:tag
From configuration files: https://gist.githubusercontent.com/.../phi-2.yaml
Configuration files can be used to customize the model defaults and settings. For advanced configurations, refer to the Customize Models section.
Examples
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
local-ai run ollama://gemma:2b
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
local-ai run oci://localai/phi-2:latest
Method 4: Manual Installation
For full control, you can manually download and configure models.
If running on Apple Silicon (ARM), it is not recommended to run on Docker due to emulation. Follow the build instructions to use Metal acceleration for full GPU support.
If you are running on Apple x86_64, you can use Docker without additional gain from building it from source.
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
cp your-model.gguf models/
docker compose up -d --pull always
curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "your-model.gguf",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
Tip
Other Docker Images:
For other Docker images, please refer to the table in Getting Started.
Note: If you are on Windows, ensure the project is on the Linux filesystem to avoid slow model loading. For more information, see the Microsoft Docs.
# Via API
curl http://localhost:8080/v1/models
# Via CLI
local-ai models list
Remove Models
Simply delete the model file and configuration from your models directory:
rm models/model-name.gguf
rm models/model-name.yaml # if exists
Troubleshooting
Model Not Loading
Check backend: Ensure the required backend is installed
local-ai backends list
local-ai backends install llama-cpp # if needed
Check logs: Enable debug mode
DEBUG=true local-ai
Verify file: Ensure the model file is not corrupted
Out of Memory
Use a smaller quantization (Q4_K_S or Q2_K)
Reduce context_size in configuration
Close other applications to free RAM
Wrong Backend
Check the Compatibility Table to ensure you’re using the correct backend for your model.
Best Practices
Start small: Begin with smaller models to test your setup
Use quantized models: Q4_K_M is a good balance for most use cases
Organize models: Keep your models directory organized
Backup configurations: Save your YAML configurations
Monitor resources: Watch RAM and disk usage
Try it out
Once LocalAI is installed, you can start it (either by using docker, or the cli, or the systemd service).
By default the LocalAI WebUI should be accessible from http://localhost:8080. You can also use 3rd party projects to interact with LocalAI as you would use OpenAI (see also Integrations ).
After installation, install new models by navigating the model gallery, or by using the local-ai CLI.
Tip
To install models with the WebUI, see the Models section.
With the CLI you can list the models with local-ai models list and install them with local-ai models install <model-name>.
You can also run models manually by copying files into the models directory.
You can test out the API endpoints using curl, few examples are listed below. The models we are referring here (gpt-4, gpt-4-vision-preview, tts-1, whisper-1) are the default models that come with the AIO images - you can also use any other model you have installed.
curl http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json"\
-d '{
"model": "tts-1",
"input": "The quick brown fox jumped over the lazy dog.",
"voice": "alloy"
}'\
--output speech.mp3
Get a vector representation of a given input that can be easily consumed by machine learning models and algorithms. OpenAI Embeddings.
curl http://localhost:8080/embeddings \
-X POST -H "Content-Type: application/json"\
-d '{
"input": "Your text string goes here",
"model": "text-embedding-ada-002"
}'
Tip
Don’t use the model file as model in the request unless you want to handle the prompt template for yourself.
Use the model names like you would do with OpenAI like in the examples below. For instance gpt-4-vision-preview, or gpt-4.
Customizing the Model
To customize the prompt template or the default settings of the model, a configuration file is utilized. This file must adhere to the LocalAI YAML configuration standards. For comprehensive syntax details, refer to the advanced documentation. The configuration file can be located either in the local filesystem or at a remote URL (such as a GitHub Gist).
LocalAI can be initiated using either its container image or binary, with a command that includes URLs of model config files or utilizes a shorthand format (like huggingface:// or github://), which is then expanded into complete URLs.
The configuration can also be set via an environment variable. For instance:
name: phi-2
context_size: 2048
f16: true
threads: 11
gpu_layers: 90
mmap: true
parameters:
  # Reference any HF model or a local file here
  model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
  temperature: 0.2
  top_k: 40
  top_p: 0.95
template:
  chat: &template |
    Instruct: {{.Input}}
    Output:
  # Modify the prompt template here ^^^ as per your requirements
  completion: *template
Then, launch LocalAI using your gist’s URL:
## Important! Substitute with your gist's URL!
docker run -p 8080:8080 localai/localai:v3.7.0 https://gist.githubusercontent.com/xxxx/phi-2.yaml
Next Steps
Visit the advanced section for more insights on prompt templates and configuration files.
Building LocalAI from source is an installation method that allows you to compile LocalAI yourself, which is useful for custom configurations, development, or when you need specific build options.
For complete build instructions, see the Build from Source documentation in the Installation section.
Run with container images
LocalAI provides a variety of images to support different environments. These images are available on quay.io and Docker Hub.
All-in-One images come with a pre-configured set of models and backends; standard images, instead, do not have any model pre-configured or installed.
For GPU acceleration on Nvidia graphics cards, use the Nvidia/CUDA images. If you don’t have a GPU, use the CPU images. If you have AMD or Apple Silicon, see the build section.
Tip
Available Images Types:
Images ending with -core are smaller images without pre-downloaded Python dependencies. Use these images if you plan to use the llama.cpp, stablediffusion-ncn or rwkv backends; if you are not sure which one to use, do not use these images.
Images containing the aio tag are all-in-one images with all the features enabled, and come with an opinionated set of configuration.
Prerequisites
Before you begin, ensure you have a container engine installed if you are not using the binaries. Suitable options include Docker or Podman. For installation instructions, refer to the following guides:
Hardware Requirements: The hardware requirements for LocalAI vary based on the model size and quantization method used. For performance benchmarks with different backends, such as llama.cpp, visit this link. The rwkv backend is noted for its lower resource consumption.
Standard container images
Standard container images do not have pre-installed models. Use these if you want to configure models manually.
These images are compatible with Nvidia ARM64 devices, such as the Jetson Nano, Jetson Xavier NX, and Jetson AGX Xavier. For more information, see the Nvidia L4T guide.
All-In-One images are images that come pre-configured with a set of models and backends to fully leverage almost all the LocalAI featureset. These images are available for both CPU and GPU environments. The AIO images are designed to be easy to use and require no configuration. Models configuration can be found here separated by size.
In the AIO images the models are configured with the names of OpenAI models; however, they are actually backed by open-source models. The mapping is shown in the table below.
Category | Model name | Real model (CPU) | Real model (GPU)
Text Generation | gpt-4 | phi-2 | hermes-2-pro-mistral
Multimodal Vision | gpt-4-vision-preview | bakllava | llava-1.6-mistral
Image Generation | stablediffusion | stablediffusion | dreamshaper-8
Speech to Text | whisper-1 | whisper with whisper-base model | <= same
Text to Speech | tts-1 | en-us-amy-low.onnx from rhasspy/piper | <= same
Embeddings | text-embedding-ada-002 | all-MiniLM-L6-v2 in Q4 | all-MiniLM-L6-v2
Usage
Select the image (CPU or GPU) and start the container with Docker:
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest-aio-cpu
LocalAI will automatically download all the required models, and the API will be available at localhost:8080.
Or with a docker-compose file:
version: "3.9"services:
api:
image: localai/localai:latest-aio-cpu# For a specific version:# image: localai/localai:v3.7.0-aio-cpu# For Nvidia GPUs decomment one of the following (cuda11 or cuda12):# image: localai/localai:v3.7.0-aio-gpu-nvidia-cuda-11# image: localai/localai:v3.7.0-aio-gpu-nvidia-cuda-12# image: localai/localai:latest-aio-gpu-nvidia-cuda-11# image: localai/localai:latest-aio-gpu-nvidia-cuda-12healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
interval: 1mtimeout: 20mretries: 5ports:
- 8080:8080environment:
- DEBUG=true# ...volumes:
- ./models:/models:cached# decomment the following piece if running with Nvidia GPUs# deploy:# resources:# reservations:# devices:# - driver: nvidia# count: 1# capabilities: [gpu]
Tip
Models caching: The AIO image will download the needed models on the first run if not already present and store those in /models inside the container. The AIO models will be automatically updated with new versions of AIO images.
You can change the directory inside the container by specifying a MODELS_PATH environment variable (or --models-path).
If you want to use a named model or a local directory, you can mount it as a volume to /models:
docker run -p 8080:8080 --name local-ai -ti -v $PWD/models:/models localai/localai:latest-aio-cpu
The AIO images inherit the same environment variables as the base images and the LocalAI environment (which you can inspect by running with --help). However, they support additional environment variables that are available only in the container image:
Variable | Default | Description
PROFILE | Auto-detected | The size of the model to use. Available: cpu, gpu-8g
MODELS | Auto-detected | A list of models YAML configuration file URIs/URLs (see also running models)
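For instance, to override the auto-detected profile and point the AIO image at your own model list, you could pass these variables when starting the container (a sketch; the model URL is a placeholder):
docker run -p 8080:8080 --name local-ai -ti \
  -e PROFILE=gpu-8g \
  -e MODELS="https://example.com/my-model.yaml" \
  localai/localai:latest-aio-gpu-nvidia-cuda-12   # add --gpus all when using a GPU image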
Watchdog for backends (#1341). As https://github.com/ggerganov/llama.cpp/issues/3969 is hitting LocalAI’s llama-cpp implementation, there is now a watchdog that can be used to make sure backends are not stalling. This is a generic mechanism that can now be enabled for all backends.
Due to the Python dependencies, the size of the images has grown.
If you still want to use smaller images without Python dependencies, you can use the corresponding image tags ending with -core.
This release brings the llama-cpp backend, a C++ backend tied to llama.cpp that tracks recent upstream versions more closely. It is not feature compatible with the current llama backend, but the plan is to sunset the current llama backend in favor of this one. This will probably be the last release containing the older llama backend written in Go and C++. The major improvement with this change is that there are fewer layers that could expose potential bugs, and it also eases maintenance.
Support for ROCm/HIPBLAS
This release brings support for AMD thanks to @65a. See more details in #1100.
More CLI commands
Thanks to @jespino, the local-ai binary now has more subcommands, allowing you to manage the gallery or try out inference directly. Check it out!
This is an exciting LocalAI release! Besides bug-fixes and enhancements this release brings the new backend to a whole new level by extending support to vllm and vall-e-x for audio generation!
Check out the documentation for vllm here and Vall-E-X here
Hey everyone, Ettore here, I’m so happy to share this release out - while this summer is hot apparently doesn’t stop LocalAI development :)
This release brings a lot of new features, bugfixes and updates! Also a big shout out to the community, this was a great release!
Attention 🚨
From this release the llama backend supports only gguf files (see #943). LocalAI however still supports ggml files. We ship a version of llama.cpp from before that change in a separate backend, named llama-stable, to allow loading ggml files. If you were specifying the llama backend manually to load ggml files, from this release you should use llama-stable instead, or do not specify a backend at all (LocalAI will handle this automatically).
Image generation enhancements
The Diffusers backend now has various enhancements, including support for generating images from images, longer prompts, and more kernel schedulers. See the Diffusers documentation for more information.
Lora adapters
Now it’s possible to load LoRA adapters for llama.cpp. See #955 for more information.
Device management
It is now possible, on single-GPU devices, to specify --single-active-backend to allow only one backend to be active at a time (#925).
Community spotlight
Resources management
Thanks to continuous community efforts (another cool contribution from dave-gray101), it’s now possible to shut down a backend programmatically via the API.
There is an ongoing effort in the community to better handle resources. See also the 🔥Roadmap.
New how-to section
Thanks to community efforts we now have a new how-to website with various examples on how to use LocalAI. This is a great starting point for new users! We are currently working on improving it; a huge shout out to lunamidori5 from the community for the impressive efforts on this!
💡 More examples!
Open source autopilot? See the new addition by gruberdev in our examples on how to use Continue with LocalAI!
feat: pre-configure LocalAI galleries by mudler in #886
🐶 Bark
Bark is a text-prompted generative audio model - it combines GPT techniques to generate Audio from text. It is a great addition to LocalAI, and it’s available in the container images by default.
It can also generate music, see the example: lion.webm
🦙 AutoGPTQ
AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
It is mainly targeted at GPU usage. Check out the documentation for usage.
🦙 Exllama
Exllama is “a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights”. It is a faster alternative for running LLaMA models on GPU. Check out the Exllama documentation for usage.
🧨 Diffusers
Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. It is currently experimental and supports generation of images only, so you might encounter issues on models which weren’t tested yet. Check out the Diffusers documentation for usage.
🔑 API Keys
Thanks to the community contributions now it’s possible to specify a list of API keys that can be used to gate API requests.
API Keys can be specified with the API_KEY environment variable as a comma-separated list of keys.
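A minimal sketch of gating the API with two keys (the key values are illustrative):
API_KEY="key-one,key-two" local-ai run
curl http://localhost:8080/v1/models -H "Authorization: Bearer key-one"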
🖼️ Galleries
Now by default the model-gallery repositories are configured in the container images
💡 New project
LocalAGI is a simple agent that uses LocalAI functions to have a full locally runnable assistant (with no API keys needed).
See it here in action planning a trip for San Francisco!
feat(llama2): add template for chat messages by dave-gray101 in #782
Note
From this release, to use the OpenAI functions you need to use the llama-grammar backend. A llama backend has been added for tracking llama.cpp master, and llama-grammar for the grammar functionalities that have not yet been merged upstream. See also OpenAI functions. Until the feature is merged we will have two llama backends.
Huggingface embeddings
In this release it is now possible to specify external gRPC backends to LocalAI that can be used for inferencing (#778). It is now possible to write backends in any language, and a huggingface-embeddings backend is now available in the container image to be used with https://github.com/UKPLab/sentence-transformers. See also Embeddings.
LLaMa 2 has been released!
Thanks to community effort, LocalAI now supports templating for LLaMa 2! More at #782, until we update the model gallery with LLaMa 2 models!
The former, ggml-based backend has been renamed to falcon-ggml.
Default pre-compiled binaries
From this release the default behavior of images has changed. Compilation is no longer triggered automatically on start; to recompile local-ai from scratch on start and switch back to the old behavior, set REBUILD=true in the environment variables. Rebuilding can be necessary if your CPU and/or architecture is old and the pre-compiled binaries are not compatible with your platform. See the build section for more information.
Add Text-to-Audio generation with go-piper by mudler in #649
See API endpoints in our documentation.
Add gallery repository by mudler in #663. See models for documentation.
Container images
Standard (GPT + stablediffusion): quay.io/go-skynet/local-ai:v1.20.0
FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-ffmpeg
CUDA 11+FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-gpu-nvidia-cuda11-ffmpeg
CUDA 12+FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-gpu-nvidia-cuda12-ffmpeg
Updates
Updates to llama.cpp, go-transformers, gpt4all.cpp and rwkv.cpp.
The NUMA option was enabled by mudler in #684, along with many new parameters (mmap, mmlock, ...). See advanced for the full list of parameters.
Gallery repositories
In this release there is support for gallery repositories. These are repositories that contain models, and can be used to install models. The default gallery which contains only freely licensed models is in Github: https://github.com/go-skynet/model-gallery, but you can use your own gallery by setting the GALLERIES environment variable. An automatic index of huggingface models is available as well.
For example, now you can start LocalAI with the following environment variable to use both galleries:
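A sketch of what that environment variable might look like (the gallery names and URLs here follow the repositories mentioned above, but treat the exact values and the startup command as illustrative):
GALLERIES='[{"name":"model-gallery","url":"github:go-skynet/model-gallery/index.yaml"},{"name":"huggingface","url":"github:go-skynet/model-gallery/huggingface.yaml"}]' local-ai run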
Now LocalAI uses piper and go-piper to generate audio from text. This is an experimental feature, and it requires GO_TAGS=tts to be set during build. It is enabled by default in the pre-built container images.
Full CUDA GPU offload support ( PR by mudler. Thanks to chnyda for handing over the GPU access, and lu-zero to help in debugging )
Full GPU Metal Support is now fully functional. Thanks to Soleblaze to iron out the Metal Apple silicon support!
Container images:
Standard (GPT + stablediffusion): quay.io/go-skynet/local-ai:v1.19.2
FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-ffmpeg
CUDA 11+FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-gpu-nvidia-cuda11-ffmpeg
CUDA 12+FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-gpu-nvidia-cuda12-ffmpeg
🔥🔥🔥 06-06-2023: v1.18.0 🚀
This LocalAI release is full of new features, bugfixes and updates! Thanks to the community for the help, this was a great community release!
We now support a vast variety of models while remaining backward compatible with prior quantization formats: this new release can still load older formats as well as the new k-quants!
New features
✨ Added support for falcon-based model families (7b) ( mudler )
✨ Experimental support for Metal Apple Silicon GPU - ( mudler and thanks to Soleblaze for testing! ). See the build section.
✨ Support for token stream in the /v1/completions endpoint ( samm81 )
🆙 Bloomz has been updated to the latest ggml changes, including new quantization format ( mudler )
🆙 RWKV has been updated to the new quantization format( mudler )
🆙 k-quants format support for the llama models ( mudler )
🆙 gpt4all has been updated, incorporating upstream changes allowing to load older models, and with different CPU instruction set (AVX only, AVX2) from the same binary! ( mudler )
23-05-2023: v1.15.0 released. The go-gpt2.cpp backend was renamed to go-ggml-transformers.cpp and updated, including https://github.com/ggerganov/llama.cpp/pull/1508 which breaks compatibility with older models. This impacts RedPajama, GptNeoX, MPT (not gpt4all-mpt), Dolly, GPT2 and Starcoder based models. Binary releases available, various fixes, including #341.
21-05-2023: v1.14.0 released. Minor updates to the /models/apply endpoint, llama.cpp backend updated including https://github.com/ggerganov/llama.cpp/pull/1508 which breaks compatibility with older models. gpt4all is still compatible with the old format.
19-05-2023: v1.13.0 released! 🔥🔥 Updates to the gpt4all and llama backends, consolidated CUDA support (#310, thanks to @bubthegreat and @Thireus), preliminary support for installing models via API.
17-05-2023: v1.12.0 released! 🔥🔥 Minor fixes, plus CUDA (#258) support for llama.cpp-compatible models and image generation (#272).
16-05-2023: 🔥🔥🔥 Experimental support for CUDA (#258) in the llama.cpp backend and Stable Diffusion CPU image generation (#272) in master.
13-05-2023: v1.11.0 released! 🔥 Updated llama.cpp bindings: This update includes a breaking change in the model files ( https://github.com/ggerganov/llama.cpp/pull/1405 ) - old models should still work with the gpt4all-llama backend.
12-05-2023: v1.10.0 released! 🔥🔥 Updated gpt4all bindings. Added support for GPTNeox (experimental), RedPajama (experimental), Starcoder (experimental), Replit (experimental), MosaicML MPT. Also now embeddings endpoint supports tokens arrays. See the langchain-chroma example! Note - this update does NOT include https://github.com/ggerganov/llama.cpp/pull/1405 which makes models incompatible.
11-05-2023: v1.9.0 released! 🔥 Important whisper updates (#233, #229) and extended gpt4all model families support (#232). Redpajama/dolly experimental (#214)
10-05-2023: v1.8.0 released! 🔥 Added support for fast and accurate embeddings with bert.cpp (#222)
09-05-2023: Added experimental support for transcriptions endpoint (#211)
08-05-2023: Support for embeddings with models using the llama.cpp backend (#207)
02-05-2023: Support for rwkv.cpp models (#158) and for the /edits endpoint
01-05-2023: Support for SSE stream of tokens in llama.cpp backends (#152)
Chapter 8
Features
LocalAI provides a comprehensive set of features for running AI models locally. This section covers all the capabilities and functionalities available in LocalAI.
Core Features
Text Generation - Generate text with GPT-compatible models using various backends
Image Generation - Create images with Stable Diffusion and other diffusion models
Audio Processing - Transcribe audio to text and generate speech from text
Embeddings - Generate vector embeddings for semantic search and RAG applications
GPT Vision - Analyze and understand images with vision-language models
Advanced Features
OpenAI Functions - Use function calling and tools API with local models
Model Gallery - Browse and install pre-configured models
Backends - Learn about available backends and how to manage them
Getting Started
To start using these features, make sure you have LocalAI installed and have downloaded some models. Then explore the feature pages above to learn how to use each capability.
Subsections of Features
⚙️ Backends
LocalAI supports a variety of backends that can be used to run different types of AI models. There are core Backends which are included, and there are containerized applications that provide the runtime environment for specific model types, such as LLMs, diffusion models, or text-to-speech models.
Managing Backends in the UI
The LocalAI web interface provides an intuitive way to manage your backends:
Navigate to the “Backends” section in the navigation menu
Browse available backends from configured galleries
Use the search bar to find specific backends by name, description, or type
Filter backends by type using the quick filter buttons (LLM, Diffusion, TTS, Whisper)
Install or delete backends with a single click
Monitor installation progress in real-time
Each backend card displays:
Backend name and description
Type of models it supports
Installation status
Action buttons (Install/Delete)
Additional information via the info button
Backend Galleries
Backend galleries are repositories that contain backend definitions. They work similarly to model galleries but are specifically for backends.
Adding a Backend Gallery
You can add backend galleries by setting the LOCALAI_BACKEND_GALLERIES environment variable:
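A hedged sketch of what that could look like (the gallery name and URL are placeholders; check the backends documentation for the exact index location):
LOCALAI_BACKEND_GALLERIES='[{"name":"my-backend-gallery","url":"https://example.com/backend-index.yaml"}]' local-ai run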
This section contains instructions on how to use LocalAI with GPU acceleration.
Details
Acceleration for AMD or Metal hardware is still in development; for additional details see the build section.
Automatic Backend Detection
When you install a model from the gallery (or a YAML file), LocalAI intelligently detects the required backend and your system’s capabilities, then downloads the correct version for you. Whether you’re running on a standard CPU, an NVIDIA GPU, an AMD GPU, or an Intel GPU, LocalAI handles it automatically.
For advanced use cases or to override auto-detection, you can use the LOCALAI_FORCE_META_BACKEND_CAPABILITY environment variable. Here are the available options:
default: Forces CPU-only backend. This is the fallback if no specific hardware is detected.
nvidia: Forces backends compiled with CUDA support for NVIDIA GPUs.
amd: Forces backends compiled with ROCm support for AMD GPUs.
intel: Forces backends compiled with SYCL/oneAPI support for Intel GPUs.
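For example, to force the CUDA-enabled backends on a machine where auto-detection does not pick them up, you could set the variable when starting the container (a sketch; adjust the image tag and GPU flags to your setup):
docker run -p 8080:8080 --gpus all \
  -e LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia \
  --name local-ai -ti localai/localai:latest-gpu-nvidia-cuda-12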
Model configuration
Depending on the model architecture and backend used, there might be different ways to enable GPU acceleration. It is required to configure the model you intend to use with a YAML config file. For example, for llama.cpp workloads a configuration file might look like this (where gpu_layers is the number of layers to offload to the GPU):
name: my-model-name
parameters:
  # Relative to the models path
  model: llama.cpp-model.ggmlv3.q5_K_M.bin
context_size: 1024
threads: 1
f16: true # enable with GPU acceleration
gpu_layers: 22 # GPU Layers (only used when built with cublas)
For diffusers, the configuration might look like this instead:
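The original snippet is not reproduced on this page, so the following is only a rough sketch of a diffusers-style configuration (the model name, scheduler and the cuda/f16 flags are assumptions; see the Diffusers backend documentation for the authoritative options):
name: my-diffusion-model
backend: diffusers
parameters:
  model: some-diffusers-model   # hypothetical Hugging Face repository
cuda: true                      # enable GPU acceleration (assumed flag for this backend)
f16: true
diffusers:
  scheduler_type: euler_a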
CUDA 11 tags: master-gpu-nvidia-cuda-11, v1.40.0-gpu-nvidia-cuda-11, …
CUDA 12 tags: master-gpu-nvidia-cuda-12, v1.40.0-gpu-nvidia-cuda-12, …
In addition to the commands to run LocalAI normally, you need to specify --gpus all to docker, for example:
docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v1.40.0-gpu-nvidia-cuda12
If the GPU inferencing is working, you should be able to see something like:
5:22PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla T4
llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 4321.77 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1598 MB
...................................................................................................
llama_init_from_file: kv self size = 512.00 MB
ROCM(AMD) acceleration
There are a limited number of tested configurations for ROCm systems; however, most newer dedicated consumer-grade GPUs appear to be supported under the current ROCm 6 implementation.
Due to the nature of ROCm, it is best to run all implementations in containers, as this limits the number of packages required for installation on the host system. Compatibility and package versions for dependencies across all OS variations must be tested independently if desired; please refer to the build documentation.
Install on the host: amdgpu-dkms and rocm >= 6.0.0, as per the ROCm documentation.
Recommendations
Make sure not to use the GPU assigned for compute for desktop rendering.
Ensure at least 100GB of free space on the disk hosting the container runtime and storing the images prior to installation.
Limitations
Ongoing verification testing of ROCm compatibility with integrated backends.
Please note the following list of verified backends and devices.
LocalAI hipblas images are built against the following targets: gfx900,gfx906,gfx908,gfx940,gfx941,gfx942,gfx90a,gfx1030,gfx1031,gfx1100,gfx1101
If your device is not one of these you must specify the corresponding GPU_TARGETS and specify REBUILD=true. Otherwise you don’t need to specify these in the commands below.
Verified
The devices in the following list have been tested with hipblas images running ROCm 6.0.0
Backend | Verified | Devices
llama.cpp | yes | Radeon VII (gfx906)
diffusers | yes | Radeon VII (gfx906)
piper | yes | Radeon VII (gfx906)
whisper | no | none
bark | no | none
coqui | no | none
transformers | no | none
exllama | no | none
exllama2 | no | none
mamba | no | none
sentencetransformers | no | none
transformers-musicgen | no | none
vall-e-x | no | none
vllm | no | none
You can help by expanding this list.
System Prep
Check your GPU LLVM target is compatible with the version of ROCm. This can be found in the LLVM Docs.
Check which ROCm version is compatible with your LLVM target and your chosen OS (pay special attention to supported kernel versions). See the following for compatibility for (ROCm 6.0.0) or (ROCm 6.0.2)
Install your chosen version of the dkms and rocm packages (it is recommended to use the native package manager for this process on any OS, as version changes are easier to execute via this method if updates are required). Take care to restart after installing amdgpu-dkms and before installing rocm; for details, see the installation documentation for your chosen OS (6.0.2 or 6.0.0).
Deploy. Yes it’s that easy.
Setup Example (Docker/containerd)
The following are examples of the ROCm specific configuration elements required.
# For full functionality select a non-'core' image, version locking the image is recommended for debug purposes.
image: quay.io/go-skynet/local-ai:master-aio-gpu-hipblas
environment:
  - DEBUG=true
  # If your gpu is not already included in the current list of default targets the following build details are required.
  - REBUILD=true
  - BUILD_TYPE=hipblas
  - GPU_TARGETS=gfx906 # Example for Radeon VII
devices:
  # AMD GPU only require the following devices be passed through to the container for offloading to occur.
  - /dev/dri
  - /dev/kfd
The same can also be expressed as a run command for your container runtime.
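As a rough equivalent (a sketch only; the image tag comes from the compose example above, and you would add REBUILD/BUILD_TYPE/GPU_TARGETS if your GPU needs them):
docker run -p 8080:8080 --name local-ai -ti \
  -e DEBUG=true \
  --device /dev/dri --device /dev/kfd \
  quay.io/go-skynet/local-ai:master-aio-gpu-hipblas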
Please ensure you add all other required environment variables, port forwards, etc. to your compose file or run command.
The rebuild process will take some time to complete when deploying these containers. It is recommended that you pull the image prior to deployment, as depending on the version these images may be ~20GB in size.
Example (k8s) (Advanced Deployment/WIP)
For k8s deployments there is an additional step required before deployment: deploying the ROCm/k8s-device-plugin.
For any k8s environment, the documentation provided by AMD for the ROCm project should work. If you use rke2 or OpenShift, it is recommended that you deploy the SUSE or Red Hat provided version of this resource to ensure compatibility.
After this has been completed the helm chart from go-skynet can be configured and deployed mostly un-edited.
The following are details of the changes that should be made to ensure proper function.
While these details may be configurable in the values.yaml, development of this Helm chart is ongoing and is subject to change.
The following details indicate the final state of the localai deployment relevant to GPU function.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {NAME}-local-ai
  ...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - env:
            - name: HIP_VISIBLE_DEVICES
              value: '0'
              # This variable indicates the devices available to container (0:device1 1:device2 2:device3) etc.
              # For multiple devices (say device 1 and 3) the value would be equivalent to HIP_VISIBLE_DEVICES="0,2"
              # Please take note of this when an iGPU is present in host system as compatibility is not assured.
          ...
          resources:
            limits:
              amd.com/gpu: '1'
            requests:
              amd.com/gpu: '1'
This configuration has been tested on a ‘custom’ cluster managed by SUSE Rancher that was deployed on top of Ubuntu 22.04.4, certification of other configuration is ongoing and compatibility is not guaranteed.
Notes
When installing the ROCm kernel driver on your system, ensure that you are installing a version equal to or newer than the one currently implemented in LocalAI (6.0.0 at time of writing).
AMD documentation indicates that this will ensure functionality however your mileage may vary depending on the GPU and distro you are using.
If you encounter an Error 413 when attempting to upload an audio file or image for whisper or llava/bakllava on a k8s deployment, note that the ingress for your deployment may require the annotation nginx.ingress.kubernetes.io/proxy-body-size: "25m" to allow larger uploads. This may be included in future versions of the Helm chart.
Intel acceleration (sycl)
Requirements
If building from source, you need to install Intel oneAPI Base Toolkit and have the Intel drivers available in the system.
Container images
To use SYCL, use the images with gpu-intel in the tag, for example v3.7.0-gpu-intel, …
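A hedged example of starting an Intel GPU image with the render device passed through (the exact tag and device path may differ on your system):
docker run -p 8080:8080 --name local-ai -ti \
  --device /dev/dri \
  localai/localai:latest-gpu-intel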
LocalAI supports generating text with GPT using llama.cpp and other backends (such as rwkv.cpp); see also Model compatibility for an up-to-date list of the supported model families.
Note:
You can also specify the model name as part of the OpenAI token.
If only one model is available, the API will use it for all the requests.
To generate a completion, you can send a POST request to the /v1/completions endpoint with the instruction as per the request body:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
Available additional parameters: top_p, top_k, max_tokens
List models
You can list all the models available with:
curl http://localhost:8080/v1/models
Backends
RWKV
RWKV support is available through llama.cpp (see below)
llama.cpp
llama.cpp is a popular port of Facebook’s LLaMA model in C/C++.
Note
The ggml file format has been deprecated. If you are using ggml models, use a LocalAI version older than v2.25.0. For gguf models, use the llama backend. The go backend is deprecated as well but still available as go-llama.
Features
The llama.cpp backend supports prompt templates, which are useful for models that are fine-tuned towards a specific prompt.
Automatic setup
LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for ggml or gguf models.
For instance, if you have the galleries enabled and LocalAI already running, you can just start chatting with models in huggingface by running:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.1
}'
LocalAI will automatically download and configure the model in the model directory.
Models can be also preloaded or downloaded on demand. To learn about model galleries, check out the model gallery documentation.
YAML configuration
To use the llama.cpp backend, specify llama-cpp as the backend in the YAML file:
name: llama
backend: llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf
Backend Options
The llama.cpp backend supports additional configuration options that can be specified in the options field of your model YAML configuration. These options allow fine-tuning of the backend behavior:
Option | Type | Description | Example
use_jinja or jinja | boolean | Enable Jinja2 template processing for chat templates. When enabled, the backend uses Jinja2-based chat templates from the model for formatting messages. | use_jinja:true
context_shift | boolean | Enable context shifting, which allows the model to dynamically adjust context window usage. | context_shift:true
cache_ram | integer | Set the maximum RAM cache size in MiB for the KV cache. Use -1 for unlimited (default). | cache_ram:2048
parallel or n_parallel | integer | Enable parallel request processing. When set to a value greater than 1, enables continuous batching for handling multiple requests concurrently. | parallel:4
grpc_servers or rpc_servers | string | Comma-separated list of gRPC server addresses for distributed inference. Allows distributing workload across multiple llama.cpp workers. |
Note: The parallel option can also be set via the LLAMACPP_PARALLEL environment variable, and grpc_servers can be set via the LLAMACPP_GRPC_SERVERS environment variable. Options specified in the YAML file take precedence over environment variables.
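As a sketch of how these options could be combined in a model configuration (the values are illustrative; the option strings follow the key:value form shown in the Example column above):
name: llama
backend: llama-cpp
parameters:
  model: file.gguf
options:
  - use_jinja:true
  - parallel:4
  - cache_ram:2048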
Exllama is “a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights”. Both exllama and exllama2 are supported.
Model setup
Download the model as a folder inside the model directory and create a YAML file specifying the exllama backend. For instance with the TheBloke/WizardLM-7B-uncensored-GPTQ model:
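The original YAML snippet is not reproduced here, so the following is only a rough sketch of what such a configuration could look like (the backend name and folder path are assumptions based on the surrounding text):
name: exllama
backend: exllama
parameters:
  # folder downloaded from TheBloke/WizardLM-7B-uncensored-GPTQ, relative to the models path
  model: WizardLM-7B-uncensored-GPTQ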
The backend will automatically download the required files in order to run the model.
Usage
Use the completions endpoint by specifying the vllm backend:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "vllm",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
Transformers
Transformers is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.
LocalAI has a built-in integration with Transformers, and it can be used to run models.
This is an extra backend: it is already available in the container images (the extra images already contain the Python dependencies for Transformers), and there is nothing to do for the setup.
Setup
Create a YAML file for the model you want to use with transformers.
To set up a model, you just need to specify the model name in the YAML config file:
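The original example is not included on this page, so here is a minimal sketch (the model id is a placeholder; the type field is described in the Parameters table below):
name: transformers
backend: transformers
parameters:
  model: "facebook/opt-125m"   # hypothetical Hugging Face model id
type: AutoModelForCausalLM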
The backend will automatically download the required files in order to run the model.
Parameters
Type | Description
AutoModelForCausalLM | AutoModelForCausalLM is a model that can be used to generate sequences. Use it for NVIDIA CUDA and Intel GPU with Intel Extensions for Pytorch acceleration
OVModelForCausalLM | for Intel CPU/GPU/NPU OpenVINO Text Generation models
OVModelForFeatureExtraction | for Intel CPU/GPU/NPU OpenVINO Embedding acceleration
N/A | Defaults to AutoModel
OVModelForCausalLM requires OpenVINO IR Text Generation models from Hugging face
OVModelForFeatureExtraction works with any Safetensors Transformer Feature Extraction model from Huggingface (Embedding Model)
Please note that streaming is currently not implemented in AutoModelForCausalLM for Intel GPU.
AMD GPU support is not implemented.
Although AMD CPU is not officially supported by OpenVINO, there are reports that it works: YMMV.
Embeddings
Use embeddings: true if the model is an embedding model
Inference device selection
The Transformers backend tries to automatically select the best device for inference; you can override this decision manually with the main_gpu parameter.
Inference Engine | Applicable Values
CUDA | cuda, cuda.X where X is the GPU device like in nvidia-smi -L output
OpenVINO | Any applicable value from Inference Modes like AUTO, CPU, GPU, NPU, MULTI, HETERO
Example for CUDA:
main_gpu: cuda.0
Example for OpenVINO:
main_gpu: AUTO:-CPU
This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.
Inference Precision
The Transformers backend automatically selects the fastest applicable inference precision according to device support.
With the CUDA backend you can manually enable bfloat16, if your hardware supports it, with the following parameter:
f16: true
Quantization
Quantization | Description
bnb_8bit | 8-bit quantization
bnb_4bit | 4-bit quantization
xpu_8bit | 8-bit quantization for Intel XPUs
xpu_4bit | 4-bit quantization for Intel XPUs
Trust Remote Code
Some models, like Microsoft Phi-3, require external code beyond what is provided by the transformers library.
By default this is disabled for security.
It can be manually enabled with:
trust_remote_code: true
Maximum Context Size
The maximum context size (in tokens) can be specified with the context_size parameter. Do not use values higher than what your model supports.
Usage example:
context_size: 8192
Auto Prompt Template
Usually the chat template is defined by the model author in the tokenizer_config.json file.
To enable it use the use_tokenizer_template: true parameter in the template section.
Usage example:
template:
  use_tokenizer_template: true
Custom Stop Words
Stop words are usually defined in the tokenizer_config.json file.
They can be overridden with the stopwords parameter when needed, as for example with the Llama 3 Instruct model.
Usage example:
stopwords:
- "<|eot_id|>"
- "<|end_of_text|>"
Usage
Use the completions endpoint by specifying the transformers model:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "transformers",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
Examples
OpenVINO
A model configuration file for OpenVINO and the Starling model:
A reranking model, often referred to as a cross-encoder, is a core component in the two-stage retrieval systems used in information retrieval and natural language processing tasks.
Given a query and a set of documents, it will output similarity scores.
We can then use the score to reorder the documents by relevance in our RAG system, to increase its overall accuracy and filter out non-relevant results.
LocalAI supports reranker models through the rerankers backend, which is based on the rerankers library.
Usage
You can test rerankers by using container images with python (this does NOT work with core images) and a model config file like this, or by installing cross-encoder from the gallery in the UI:
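Once a cross-encoder model is installed, a rerank request can be sent along these lines (a sketch assuming a Jina-style /v1/rerank endpoint and the cross-encoder model name from the gallery):

curl http://localhost:8080/v1/rerank -H "Content-Type: application/json" -d '{
  "model": "cross-encoder",
  "query": "Organic skincare products for sensitive skin",
  "documents": [
    "Eco-friendly kitchenware for modern homes",
    "Natural organic skincare range for sensitive skin",
    "New makeup trends focus on bold colors"
  ],
  "top_n": 2
}'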
aplay is a Linux command. You can use other tools to play the audio file.
The model name is the filename with the extension.
The model name is case sensitive.
LocalAI must be compiled with the GO_TAGS=tts flag.
Transformers-musicgen
LocalAI also has experimental support for transformers-musicgen for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:
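For example, a request of this shape (the backend and model names below follow the backend's naming, but treat the exact fields as a sketch):

curl --request POST \
  --url http://localhost:8080/tts \
  --header 'Content-Type: application/json' \
  --data '{
    "backend": "transformers-musicgen",
    "model": "facebook/musicgen-small",
    "input": "80s synth playing an arpeggio"
  }'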
Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.
Vall-E-X
VALL-E-X is an open source implementation of Microsoft’s VALL-E X zero-shot TTS model.
Setup
The backend will automatically download the required files in order to run the model.
This is an extra backend: it is already available in the container images, so there is nothing to do for the setup. If you are building manually, you need to install Vall-E-X first.
Usage
Use the tts endpoint by specifying the vall-e-x backend:
In order to use voice cloning capabilities you must create a YAML configuration file to setup a model:
name: cloned-voice
backend: vall-e-x
parameters:
  model: "cloned-voice"
tts:
  vall-e:
    # The path to the audio file to be cloned
    # relative to the models directory
    # Max 15s
    audio_path: "audio-sample.wav"
Then you can specify the model name in the requests:
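For instance, using the cloned-voice model defined above (a sketch following the tts request shape shown below):

curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "cloned-voice",
  "input": "Hello, this is my cloned voice."
}' | aplay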
You can also use a config-file to specify TTS models and their parameters.
In the following example we define a custom config to load the xtts_v2 model, and specify a voice and language.
name: xtts_v2
backend: coqui
parameters:
  language: fr
  model: tts_models/multilingual/multi-dataset/xtts_v2
tts:
  voice: Ana Florence
With this config, you can now use the following curl command to generate a text-to-speech audio file:
curl -L http://localhost:8080/tts \
-H "Content-Type: application/json"\
-d '{
"model": "xtts_v2",
"input": "Bonjour, je suis Ana Florence. Comment puis-je vous aider?"
}' | aplay
Response format
To provide some compatibility with the OpenAI API's response_format parameter, ffmpeg must be installed (or a Docker image including ffmpeg must be used) so the generated WAV file can be converted before the API returns its response.
Warning regarding a change in behaviour: before this addition, the parameter was ignored and a WAV file was always returned, with potential codec errors later in the integration (like trying to decode an MP3, the default format used by OpenAI, from what was actually a WAV file).
Formats supported thanks to ffmpeg are wav, mp3, aac, flac, and opus, defaulting to wav if an unknown format or no format is provided.
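For example, assuming the OpenAI-style response_format field, an MP3 can be requested like this:

curl -L http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "xtts_v2",
    "input": "Bonjour, je suis Ana Florence.",
    "response_format": "mp3"
  }' -o output.mp3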
Note: To set a negative prompt, you can split the prompt with |, for instance: a cute baby sea otter|malformed.
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
"size": "256x256"
}'
Backends
stablediffusion-ggml
This backend is based on stable-diffusion.cpp. Every model supported by that project is also supported by LocalAI.
Setup
There are already several models in the gallery that you can install to get up and running with this backend. For example, you can run Flux by searching for it in the model gallery (flux.1-dev-ggml) or by starting LocalAI with:
local-ai run flux.1-dev-ggml
To use a custom model, you can follow these steps:
Create a model file stablediffusion.yaml in the models folder (a minimal sketch is shown after these steps)
Download the required assets to the models repository
Start LocalAI
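A minimal sketch of such a stablediffusion.yaml (the GGUF file name is a placeholder for whatever assets you downloaded):

name: stablediffusion
backend: stablediffusion-ggml
parameters:
  model: stable-diffusion-v1-5-Q8_0.gguf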
Diffusers
Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. LocalAI has a diffusers backend which allows image generation using the diffusers library.
This is an extra backend: it is already available in the container images, so there is nothing to do for the setup. Do not use core images (ending with -core). If you are building manually, see the build instructions.
Model setup
Models will be downloaded automatically from Hugging Face the first time you use the backend.
Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl on CPU:
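A minimal sketch (the exact nesting of the diffusers-specific options is an assumption; see the parameter table below for the available options):

name: animagine-xl
backend: diffusers
f16: false
parameters:
  model: Linaqruf/animagine-xl
diffusers:
  cuda: false
  scheduler_type: euler_a
  pipeline_type: StableDiffusionXLPipeline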
The following parameters are available in the configuration file:
| Parameter | Description | Default |
|-----------|-------------|---------|
| f16 | Force the usage of float16 instead of float32 | false |
| step | Number of steps to run the model for | 30 |
| cuda | Enable CUDA acceleration | false |
| enable_parameters | Parameters to enable for the model | negative_prompt,num_inference_steps,clip_skip |
| scheduler_type | Scheduler type | k_dpp_sde |
| cfg_scale | Configuration scale | 8 |
| clip_skip | Clip skip | None |
| pipeline_type | Pipeline type | AutoPipelineForText2Image |
| lora_adapters | A list of lora adapters (file names relative to the model directory) to apply | None |
| lora_scales | A list of lora scales (floats) to apply | None |
Several types of schedulers are available:
| Scheduler | Description |
|-----------|-------------|
| ddim | DDIM |
| pndm | PNDM |
| heun | Heun |
| unipc | UniPC |
| euler | Euler |
| euler_a | Euler a |
| lms | LMS |
| k_lms | LMS Karras |
| dpm_2 | DPM2 |
| k_dpm_2 | DPM2 Karras |
| dpm_2_a | DPM2 a |
| k_dpm_2_a | DPM2 a Karras |
| dpmpp_2m | DPM++ 2M |
| k_dpmpp_2m | DPM++ 2M Karras |
| dpmpp_sde | DPM++ SDE |
| k_dpmpp_sde | DPM++ SDE Karras |
| dpmpp_2m_sde | DPM++ 2M SDE |
| k_dpmpp_2m_sde | DPM++ 2M SDE Karras |
Available pipeline types:
| Pipeline type | Description |
|---------------|-------------|
| StableDiffusionPipeline | Stable Diffusion pipeline |
| StableDiffusionImg2ImgPipeline | Stable Diffusion image-to-image pipeline |
| StableDiffusionDepth2ImgPipeline | Stable Diffusion depth-to-image pipeline |
| DiffusionPipeline | Diffusion pipeline |
| StableDiffusionXLPipeline | Stable Diffusion XL pipeline |
| StableVideoDiffusionPipeline | Stable Video Diffusion pipeline |
| AutoPipelineForText2Image | Automatic detection pipeline for text-to-image |
| VideoDiffusionPipeline | Video diffusion pipeline |
| StableDiffusion3Pipeline | Stable Diffusion 3 pipeline |
| FluxPipeline | Flux pipeline |
| FluxTransformer2DModel | Flux transformer 2D model |
| SanaPipeline | Sana pipeline |
Advanced: Additional parameters
Additional arbitrary parameters can be specified in the options field as key/value pairs separated by a colon (:):
name: animagine-xl
options:
- "cfg_scale:6"
Note: there is no complete parameter list. Any parameter can be passed arbitrarily and is forwarded directly to the pipeline as an argument. Different pipelines/implementations support different parameters.
The example above will result in the following Python code when generating images:
pipe(
    prompt="A cute baby sea otter",  # Options passed via API
    size="256x256",                  # Options passed via API
    cfg_scale=6                      # Additional parameter passed via the configuration file
)
Usage
Text to Image
Use the image generation endpoint with the model name from the configuration file:
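For instance, reusing the animagine-xl configuration from above (model name and size are just examples):

curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
  "model": "animagine-xl",
  "prompt": "A cute baby sea otter",
  "size": "512x512"
}'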
LocalAI supports object detection through various backends. This feature allows you to identify and locate objects within images with high accuracy and real-time performance. Currently, RF-DETR is available as an implementation.
Overview
Object detection in LocalAI is implemented through dedicated backends that can identify and locate objects within images. Each backend provides different capabilities and model architectures.
Key Features:
Real-time object detection
High accuracy detection with bounding boxes
Support for multiple hardware accelerators (CPU, NVIDIA GPU, Intel GPU, AMD GPU)
Structured detection results with confidence scores
Easy integration through the /v1/detection endpoint
Usage
Detection Endpoint
LocalAI provides a dedicated /v1/detection endpoint for object detection tasks. This endpoint is specifically designed for object detection and returns structured detection results with bounding boxes and confidence scores.
API Reference
To perform object detection, send a POST request to the /v1/detection endpoint:
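A sketch of such a request is shown below; the request field names (in particular image) are assumptions, so check the endpoint reference for your version. The response contains detection objects that use the fields described next:

curl http://localhost:8080/v1/detection -H "Content-Type: application/json" -d '{
  "model": "rfdetr-base",
  "image": "https://example.com/street-scene.jpg"
}'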
x, y: Coordinates of the bounding box top-left corner
width, height: Dimensions of the bounding box
confidence: Detection confidence score (0.0 to 1.0)
class_name: The detected object class
Backends
RF-DETR Backend
The RF-DETR backend is implemented as a Python-based gRPC service that integrates seamlessly with LocalAI. It provides object detection capabilities using the RF-DETR model architecture and supports multiple hardware configurations:
CPU: Optimized for CPU inference
NVIDIA GPU: CUDA acceleration for NVIDIA GPUs
Intel GPU: Intel oneAPI optimization
AMD GPU: ROCm acceleration for AMD GPUs
NVIDIA Jetson: Optimized for ARM64 NVIDIA Jetson devices
Setup
Using the Model Gallery (Recommended)
The easiest way to get started is using the model gallery. The rfdetr-base model is available in the official LocalAI gallery:
# Install and run the rfdetr-base model
local-ai run rfdetr-base
You can also install it through the web interface by navigating to the Models section and searching for “rfdetr-base”.
Manual Configuration
Create a model configuration file in your models directory:
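A minimal sketch, assuming the backend is registered as rfdetr and reusing the gallery model name (both are assumptions):

name: rfdetr-base
backend: rfdetr
parameters:
  model: rfdetr-base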
Verify model compatibility with your backend version
Low Detection Accuracy
Ensure good image quality and lighting
Check if objects are clearly visible
Consider using a larger model for better accuracy
Slow Performance
Enable GPU acceleration if available
Use a smaller model for faster inference
Optimize image resolution
Debug Mode
Enable debug logging for troubleshooting:
local-ai run --debug rfdetr-base
Object Detection Category
LocalAI includes a dedicated object-detection category for models and backends that specialize in identifying and locating objects within images. This category currently includes:
Additional object detection models and backends will be added to this category in the future. You can filter models by the object-detection tag in the model gallery to find all available object detection models.
The sentencetransformers backend is an optional backend of LocalAI and uses Python. If you are running LocalAI from the container images, it is already configured and ready to use.
For local execution, you also have to specify the extra backend in the EXTERNAL_GRPC_BACKENDS environment variable.
The sentencetransformers backend only supports embeddings of text, not of tokens. If you need to embed tokens, you can use the bert backend or llama.cpp.
No models are required to be downloaded before using the sentencetransformers backend. The models will be downloaded automatically the first time the API is used.
Llama.cpp embeddings
Embeddings with llama.cpp are supported with the llama-cpp backend; it needs to be enabled by setting embeddings to true, for example:
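A minimal sketch (the GGUF file name is a placeholder for an embedding model placed in your models directory):

name: my-embeddings
backend: llama-cpp
embeddings: true
parameters:
  model: bert-embeddings.gguf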
The chat endpoint supports the grammar parameter, which allows users to specify a grammar in Backus-Naur Form (BNF). This feature enables the Large Language Model (LLM) to generate outputs adhering to a user-defined schema, such as JSON, YAML, or any other format that can be defined using BNF. For more details about BNF, see Backus-Naur Form on Wikipedia.
Note
Compatibility Notice: This feature is only supported by models that use the llama.cpp backend. For a complete list of compatible models, refer to the Model Compatibility page. For technical details, see the related pull requests: PR #1773 and PR #1887.
Setup
To use this feature, follow the installation and setup instructions on the LocalAI Functions page. Ensure that your local setup meets all the prerequisites specified for the llama.cpp backend.
💡 Usage Example
The following example demonstrates how to use the grammar parameter to constrain the model’s output to either “yes” or “no”. This can be particularly useful in scenarios where the response format needs to be strictly controlled.
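A minimal sketch of such a request (the model name is just an example; the grammar string matches the one used in the vision example further below):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Is the sky blue? Answer yes or no."}],
  "grammar": "root ::= (\"yes\" | \"no\")"
}'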
In this example, the grammar parameter is set to a simple choice between “yes” and “no”, ensuring that the model’s response adheres strictly to one of these options regardless of the context.
Example: JSON Output Constraint
You can also use grammars to enforce JSON output format:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Generate a person object with name and age"}],
"grammar": "root ::= \"{\" \"\\\"name\\\":\" string \",\\\"age\\\":\" number \"}\"\nstring ::= \"\\\"\" [a-z]+ \"\\\"\"\nnumber ::= [0-9]+"
}'
This functionality enables LocalAI to distribute inference requests across multiple worker nodes, improving efficiency and performance. Nodes are automatically discovered and connect via p2p by using a shared token which makes sure the communication is secure and private between the nodes of the network.
LocalAI supports two modes of distributed inferencing via p2p:
Federated Mode: Requests are shared between the cluster and routed to a single worker node in the network based on the load balancer’s decision.
Worker Mode (aka "model sharding" or "splitting weights"): Requests are processed by all the workers, which contribute to the final inference result (by sharing the model weights).
A list of global instances shared by the community is available at explorer.localai.io.
Usage
Starting LocalAI with --p2p generates a shared token for connecting multiple instances, and that's all you need to create AI clusters, eliminating the need for intricate network setups.
Simply navigate to the “Swarm” section in the WebUI and follow the on-screen instructions.
For fully shared instances, start LocalAI with --p2p --federated and follow the Swarm section's guidance. This feature, while still experimental, offers a tech preview quality experience.
Federated mode
Federated mode allows you to launch multiple LocalAI instances and connect them together in a federated network. This mode is useful when you want to distribute the load of inference across multiple nodes, but want a single point of entry for the API. In the Swarm section of the WebUI, you can see the instructions to connect multiple instances together.
To start a LocalAI server in federated mode, run:
local-ai run --p2p --federated
This will generate a token that you can use to connect other LocalAI instances to the network or others can use to join the network. If you already have a token, you can specify it using the TOKEN environment variable.
To start a load-balanced server that routes requests to the network, run the following with the TOKEN environment variable set to the shared token:
TOKEN=<token> local-ai federated
To see all the available options, run local-ai federated --help.
The instructions are displayed in the “Swarm” section of the WebUI, guiding you through the process of connecting multiple instances.
Workers mode
Note
This feature is available exclusively with llama-cpp compatible models.
(Note: You can also supply the token via command-line arguments)
The server logs should indicate that new workers are being discovered.
Start inference as usual on the server initiated in step 1.
Environment Variables
There are options that can be tweaked and parameters that can be set using the following environment variables:
| Environment Variable | Description |
|----------------------|-------------|
| LOCALAI_P2P | Set to "true" to enable p2p |
| LOCALAI_FEDERATED | Set to "true" to enable federated mode |
| FEDERATED_SERVER | Set to "true" to enable federated server |
| LOCALAI_P2P_DISABLE_DHT | Set to "true" to disable DHT and make the p2p layer local only (mDNS) |
| LOCALAI_P2P_ENABLE_LIMITS | Set to "true" to enable connection limits and resource management (useful when running with poor connectivity or to limit resource consumption) |
| LOCALAI_P2P_LISTEN_MADDRS | Set to a comma separated list of multiaddresses to override the default libp2p 0.0.0.0 multiaddresses |
| LOCALAI_P2P_DHT_ANNOUNCE_MADDRS | Set to a comma separated list of multiaddresses to override announcing of listen multiaddresses (useful when the external address:port is remapped) |
| LOCALAI_P2P_BOOTSTRAP_PEERS_MADDRS | Set to a comma separated list of multiaddresses to specify custom DHT bootstrap nodes |
| LOCALAI_P2P_TOKEN | Set the token for the p2p network |
| LOCALAI_P2P_LOGLEVEL | Set the loglevel for the LocalAI p2p stack (default: info) |
| LOCALAI_P2P_LIB_LOGLEVEL | Set the loglevel for the underlying libp2p stack (default: fatal) |
Architecture
LocalAI uses https://github.com/libp2p/go-libp2p under the hood, the same project powering IPFS. Unlike other frameworks, LocalAI's peer-to-peer layer has no single master server; instead it uses pub/sub gossip and ledger functionality to achieve consensus across the different peers.
EdgeVPN is used as a library to establish the network and expose the ledger functionality under a shared token, to ease automatic discovery and keep peer-to-peer networks separated and private.
In worker mode the weights are split proportionally to the available memory, while in federated mode each request is routed to a single node, which has to load the model fully.
Debugging
To debug, it’s often useful to run in debug mode, for instance:
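For example (a sketch combining the DEBUG flag with the p2p log level variable documented above):

DEBUG=true LOCALAI_P2P_LOGLEVEL=debug local-ai run --p2p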
Audio to text models are models that can generate text from an audio file.
The transcription endpoint allows you to convert audio files to text. The endpoint is based on whisper.cpp, a C++ library for audio transcription. The endpoint accepts as input all the audio formats supported by ffmpeg.
Usage
Once LocalAI is started and whisper models are installed, you can use the /v1/audio/transcriptions API endpoint.
The transcriptions endpoint can then be tested like so:
## Get an example audio file
wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg

## Send the example audio file to the transcriptions endpoint
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@$PWD/gb1.ogg" -F model="whisper-1"

## Result
{"text":"My fellow Americans, this day has brought terrible news and great sadness to our country.At nine o'clock this morning, Mission Control in Houston lost contact with our Space ShuttleColumbia.A short time later, debris was seen falling from the skies above Texas.The Columbia's lost.There are no survivors.One board was a crew of seven.Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain DavidBrown, Commander William McCool, Dr. Kultna Shavla, and Elon Ramon, a colonel in the IsraeliAir Force.These men and women assumed great risk in the service to all humanity.In an age when spaceflight has come to seem almost routine, it is easy to overlook thedangers of travel by rocket and the difficulties of navigating the fierce outer atmosphere ofthe Earth.These astronauts knew the dangers, and they faced them willingly, knowing they had a highand noble purpose in life.Because of their courage and daring and idealism, we will miss them all the more.All Americans today are thinking as well of the families of these men and women who havebeen given this sudden shock and grief.You're not alone.Our entire nation agrees with you, and those you loved will always have the respect andgratitude of this country.The cause in which they died will continue.Mankind has led into the darkness beyond our world by the inspiration of discovery andthe longing to understand.Our journey into space will go on.In the skies today, we saw destruction and tragedy.As farther than we can see, there is comfort and hope.In the words of the prophet Isaiah, \"Lift your eyes and look to the heavens who createdall these, he who brings out the starry hosts one by one and calls them each by name.\"Because of his great power and mighty strength, not one of them is missing.The same creator who names the stars also knows the names of the seven souls we mourntoday.The crew of the shuttle Columbia did not return safely to Earth yet we can pray that all aresafely home.May God bless the grieving families and may God continue to bless America.[BLANK_AUDIO]"}
Function calls are automatically mapped to grammars, which are currently supported only by llama.cpp. However, it is possible to turn off the use of grammars and instead extract the tool arguments from the LLM responses, by specifying no_grammar and a regex to map the response in the YAML file:
name: model_name
parameters:
  # Model file name
  model: model/name
function:
  # set to true to not use grammars
  no_grammar: true
  # set one or more regexes used to extract the function tool arguments from the LLM response
  response_regex:
  - "(?P<function>\w+)\s*\((?P<arguments>.*)\)"
The response regex has to use named capture groups so that the function name and the arguments can be extracted. For instance, consider:
(?P<function>\w+)\s*\((?P<arguments>.*)\)
will catch
function_name({ "foo": "bar"})
Parallel tools calls
This feature is experimental and has to be configured in the YAML of the model by enabling function.parallel_calls:
name: gpt-3.5-turbo
parameters:
  # Model file name
  model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
  temperature: 0.1
function:
  # set to true to allow the model to call multiple functions in parallel
  parallel_calls: true
Use functions with grammar
It is possible to also specify the full function signature (for debugging, or to use with other clients).
The chat endpoint accepts the grammar_json_functions additional parameter, which takes a JSON schema object. For example:
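A sketch of such a request (the schema content is entirely up to you; the function/arguments structure below is just an example):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "How is the weather in San Francisco?"}],
  "grammar_json_functions": {
    "oneOf": [
      {
        "type": "object",
        "properties": {
          "function": {"const": "get_weather"},
          "arguments": {
            "type": "object",
            "properties": {
              "location": {"type": "string"}
            }
          }
        }
      }
    ]
  }
}'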
Grammars and function tools can be used as well in conjunction with vision APIs:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llava", "grammar": "root ::= (\"yes\" | \"no\")",
"messages": [{"role": "user", "content": [{"type":"text", "text": "Is there some grass in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
💡 Examples
A full e2e example with docker-compose is available here.
💾 Stores
Stores are an experimental feature to help with querying data using similarity search. It is a low-level API that consists of only get, set, delete and find.
For example, if you have an embedding of some text and want to find text with similar embeddings, you can create embeddings for chunks of all your text and then compare them against the embedding of the text you are searching on.
An embedding here means a vector of numbers that represents some information about the text. Embeddings are created by an AI model such as BERT, or by a more traditional method such as word frequency.
Previously you would have to integrate with an external vector database or library directly.
With the stores feature you can now do it through the LocalAI API.
Note however that doing a similarity search on embeddings is just one way to do retrieval. A higher level
API can take this into account, so this may not be the best place to start.
API overview
There is an internal gRPC API and an external facing HTTP JSON API. We’ll just discuss the external HTTP API,
however the HTTP API mirrors the gRPC API. Consult pkg/store/client for internal usage.
Everything is in columnar format, meaning that instead of getting an array of objects each with a key and a value, you get two separate arrays of keys and values.
Keys are arrays of floating point numbers with a maximum width of 32 bits. Values are strings (in gRPC they are bytes).
The key vectors must all be the same length, and it's best for search performance if they are normalized. When adding keys, it will be detected whether they are normalized and what length they are.
All endpoints accept a store field which specifies which store to operate on. Presently they are created
on the fly and there is only one store backend so no configuration is required.
topk limits the number of results returned. The result value is the same as for get, except that it also includes an array of similarities, where 1.0 is the maximum similarity. Results are returned in order from most similar to least.
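For illustration, assuming the store endpoints are mounted under /stores (set, get, delete, find), a set followed by a find might look like this:

# Store two (already normalized) embedding vectors with their source text
curl http://localhost:8080/stores/set -H "Content-Type: application/json" -d '{
  "keys": [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]],
  "values": ["first chunk of text", "second chunk of text"]
}'

# Return the two entries most similar to a query embedding
curl http://localhost:8080/stores/find -H "Content-Type: application/json" -d '{
  "key": [0.1, 0.2, 0.3],
  "topk": 2
}'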
🖼️ Model gallery
The model gallery is a curated collection of model configurations for LocalAI that enables one-click installation of models directly from the LocalAI web interface.
To ease model installation, LocalAI provides a way to preload models on start and to download and install them at runtime. You can install models manually by copying them into the models directory, or use the API or the web interface to configure, download, and verify the model assets for you.
Note
The models in this gallery are not directly maintained by LocalAI. If you find a model that is not working, please open an issue on the model gallery repository.
Note
GPT and text generation models might have a license which is not permissive for commercial use or might be questionable or without any license at all. Please check the model license before using it. The official gallery contains only open licensed models.
Useful Links and resources
Open LLM Leaderboard - here you can find a list of the most performing models on the Open LLM benchmark. Keep in mind models compatible with LocalAI must be quantized in the gguf format.
How it works
Navigate the WebUI interface in the “Models” section from the navbar at the top. Here you can find a list of models that can be installed, and you can install them by clicking the “Install” button.
Add other galleries
You can add other galleries by setting the GALLERIES environment variable. The GALLERIES environment variable is a list of JSON objects, where each object has a name and a url field. The name field is the name of the gallery, and the url field is the URL of the gallery’s index file, for example:
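For instance (reusing the default gallery index referenced below):

GALLERIES='[{"name":"localai", "url":"github:mudler/localai/gallery/index.yaml"}]'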
where github:mudler/localai/gallery/index.yaml will be expanded automatically to https://raw.githubusercontent.com/mudler/LocalAI/main/index.yaml.
Note: the URLs are expanded automatically for github and huggingface; the https:// and http:// prefixes work as well.
Note
If you want to build your own gallery, there is no documentation yet. However you can find the source of the default gallery in the LocalAI repository.
List Models
To list all the available models, use the /models/available endpoint:
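For example:

curl http://localhost:8080/models/available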
Models can be installed either by passing the full URL of the YAML config file or by passing an identifier of the model in the gallery. The gallery is a repository of models that can be installed by passing the model name.
To install a model from the gallery repository, you can pass the model name in the id field. For instance, to install the bert-embeddings model, you can use the following command:
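A sketch of such a call, using the <GALLERY>@<MODEL_NAME> identifier format described below:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "localai@bert-embeddings"
}'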
localai is the repository. It is optional and can be omitted. If the repository is omitted, LocalAI will search for the model by name in all the repositories. If the same model name is present in multiple galleries, the first match wins.
bert-embeddings is the model name in the gallery
(read its config here).
How to install a model not part of a gallery
If you don’t want to set any gallery repository, you can still install models by loading a model configuration file.
In the body of the request you must specify the model configuration file URL (url), optionally a name to install the model with (name), extra files to install (files), and configuration overrides (overrides). When calling the API endpoint, LocalAI will download the model's files and write the configuration to the folder used to store models.
To preload models on start instead, set the PRELOAD_MODELS environment variable to a JSON array of model URIs:
PRELOAD_MODELS='[{"url": "<MODEL_URL>"}]'
Note: either url or id must be specified. url points to a model gallery configuration, while id refers to a model inside a repository. If both are specified, the id will be used.
While the API is running, you can install the model by using the /models/apply endpoint and point it to the stablediffusion model in the models-gallery:
LocalAI will create a batch process that downloads the required files from a model definition and automatically reload itself to include the new model.
Input: url or id (required), name (optional), files (optional)
An optional list of additional files to download can be specified in files. The name field allows you to override the model name. Finally, it is possible to override the model config file with overrides.
The url is a full URL, or a github url (github:org/repo/file.yaml), or a local file (file:///path/to/file.yaml).
The id is a string in the form <GALLERY>@<MODEL_NAME>, where <GALLERY> is the name of the gallery, and <MODEL_NAME> is the name of the model in the gallery. Galleries can be specified during startup with the GALLERIES environment variable.
Returns a uuid and a url to follow up on the state of the process:
LocalAI now supports the Model Context Protocol (MCP), enabling powerful agentic capabilities by connecting AI models to external tools and services. This feature allows your LocalAI models to interact with various MCP servers, providing access to real-time data, APIs, and specialized tools.
What is MCP?
The Model Context Protocol is a standard for connecting AI models to external tools and data sources. It enables AI agents to:
Access real-time information from external APIs
Execute commands and interact with external systems
Use specialized tools for specific tasks
Maintain context across multiple tool interactions
Key Features
🔄 Real-time Tool Access: Connect to external MCP servers for live data
🛠️ Multiple Server Support: Configure both remote HTTP and local stdio servers
⚡ Cached Connections: Efficient tool caching for better performance
🔒 Secure Authentication: Support for bearer token authentication
🎯 OpenAI Compatible: Uses the familiar /mcp/v1/chat/completions endpoint
🧠 Advanced Reasoning: Configurable reasoning and re-evaluation capabilities
📋 Auto-Planning: Break down complex tasks into manageable steps
🎯 MCP Prompts: Specialized prompts for better MCP server interaction
🔄 Plan Re-evaluation: Dynamic plan adjustment based on results
⚙️ Flexible Agent Control: Customizable execution limits and retry behavior
Configuration
MCP support is configured in your model’s YAML configuration file using the mcp section:
enable_plan_re_evaluator: Enable plan re-evaluation (default: false)
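As a rough illustration only, a configuration might look like the sketch below. The agent options are the ones described later on this page, but the exact schema of the mcp server entries is an assumption; check the MCP reference for your LocalAI version.

name: my-agentic-model
# ... other model settings ...
mcp:
  # Schema of the server entries below is an assumption
  remote: |
    {
      "mcpServers": {
        "weather": {
          "url": "https://example.com/mcp",
          "token": "MY_BEARER_TOKEN"
        }
      }
    }
agent:
  max_attempts: 3
  max_iterations: 5
  enable_reasoning: true
  enable_planning: true
  enable_mcp_prompts: true
  enable_plan_re_evaluator: false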
Usage
API Endpoint
Use the MCP-enabled completion endpoint:
curl http://localhost:8080/mcp/v1/chat/completions \
-H "Content-Type: application/json"\
-d '{
"model": "my-agentic-model",
"messages": [
{"role": "user", "content": "What is the current weather in New York?"}
],
"temperature": 0.7
}'
Example Response
{
  "id": "chatcmpl-123",
  "created": 1699123456,
  "model": "my-agentic-model",
  "choices": [
    {
      "text": "The current weather in New York is 72°F (22°C) with partly cloudy skies. The humidity is 65% and there's a light breeze from the west at 8 mph."
    }
  ],
  "object": "text_completion"
}
The agent section controls how the AI model interacts with MCP tools:
Execution Control
max_attempts: Limits how many times a tool can be retried if it fails. Higher values provide more resilience but may increase response time.
max_iterations: Controls the maximum number of reasoning cycles the agent can perform. More iterations allow for complex multi-step problem solving.
Reasoning Capabilities
enable_reasoning: When enabled, the agent uses advanced reasoning to better understand tool results and plan next steps.
Planning Capabilities
enable_planning: When enabled, the agent uses auto-planning to break down complex tasks into manageable steps and execute them systematically. The agent will automatically detect when planning is needed.
enable_mcp_prompts: When enabled, the agent uses specialized prompts exposed by the MCP servers to interact with the exposed tools.
enable_plan_re_evaluator: When enabled, the agent can re-evaluate and adjust its execution plan based on intermediate results.
Tool Discovery: LocalAI connects to configured MCP servers and discovers available tools
Tool Caching: Tools are cached per model for efficient reuse
Agent Execution: The AI model uses the Cogito framework to execute tools
Response Generation: The model generates responses incorporating tool results
Supported MCP Servers
LocalAI is compatible with any MCP-compliant server.
Best Practices
Security
Use environment variables for sensitive tokens
Validate MCP server endpoints before deployment
Implement proper authentication for remote servers
Performance
Cache frequently used tools
Use appropriate timeout values for external APIs
Monitor resource usage for stdio servers
Error Handling
Implement fallback mechanisms for tool failures
Log tool execution for debugging
Handle network timeouts gracefully
With External Applications
Use MCP-enabled models in your applications:
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/mcp/v1",
    api_key="your-api-key"
)

response = client.chat.completions.create(
    model="my-agentic-model",
    messages=[
        {"role": "user", "content": "Analyze the latest research papers on AI"}
    ]
)
MCP and adding packages
It might be handy to install packages before starting the container to set up the environment. This is an example of how you can do that with docker-compose (installing and configuring docker):
Feel free to open a pull request (by clicking the "Edit page" link below) to get a page made for your project, or if you see an error on one of the pages!
Chapter 20
Advanced
Subsections of Advanced
Advanced usage
Model Configuration with YAML Files
LocalAI uses YAML configuration files to define model parameters, templates, and behavior. You can create individual YAML files in the models directory or use a single configuration file with multiple models.
You can use a default template for every model present in your model path, by creating a corresponding file with the `.tmpl` suffix next to your model. For instance, if the model is called `foo.bin`, you can create a sibling file, `foo.bin.tmpl` which will be used as a default prompt and can be used with alpaca:
The below instruction describes a task. Write a response that appropriately completes the request.
### Instruction:
{{.Input}}
### Response:
See the prompt-templates directory in this repository for templates for some of the most popular models.
For the edit endpoint, an example template for alpaca-based models can be:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{{.Instruction}}

### Input:
{{.Input}}

### Response:
Install models using the API
Instead of installing models manually, you can use the LocalAI API endpoints and a model definition to install models programmatically at runtime.
A curated collection of model files is in the model-gallery. The files of the model gallery are different from the model files used to configure LocalAI models. The model gallery files contain information about the model setup, and the files necessary to run the model locally.
To install, for example, lunademo, you can send a POST call to the /models/apply endpoint with the model definition URL (url) and, optionally, the name the model should have in LocalAI (name):
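A sketch of such a call (the model definition URL is a placeholder):

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "url": "<MODEL_DEFINITION_URL>",
  "name": "lunademo"
}'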
PRELOAD_MODELS (or --preload-models) takes a JSON list with the same parameters as the API calls to the /models/apply endpoint.
Similarly, a path to a YAML configuration file containing a list of models can be specified with PRELOAD_MODELS_CONFIG (or --preload-models-config):
LocalAI can automatically cache prompts for faster loading. This can be useful if your model needs a prompt template with prefixed text in the prompt before the input.
To enable prompt caching, you can control the settings in the model config YAML file:
prompt_cache_path: "cache"prompt_cache_all: true
prompt_cache_path is relative to the models folder. you can enter here a name for the file that will be automatically create during the first load if prompt_cache_all is set to true.
Configuring a specific backend for the model
By default LocalAI will try to autoload the model by trying all the backends. This might work for most models, but some backends are NOT configured to autoload.
In order to specify a backend for your models, create a model config file in your models directory specifying the backend:
name: gpt-3.5-turbo
parameters:
  # Relative to the models path
  model: ...
backend: llama-stable
Connect external backends
LocalAI backends are internally implemented using gRPC services. This also allows LocalAI to connect to external gRPC services on start and extend LocalAI functionalities via third-party binaries.
The --external-grpc-backends parameter in the CLI can be used either to specify a local backend (a file) or a remote URL. The syntax is <BACKEND_NAME>:<BACKEND_URI>. Once LocalAI is started with it, the new backend name will be available for all the API endpoints.
So for instance, to register a new backend which is a local file:
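For example, following the <BACKEND_NAME>:<BACKEND_URI> syntax (paths and backend name are illustrative):

./local-ai run --external-grpc-backends "my-awesome-backend:/path/to/backend/binary"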
| Environment Variable | Default | Description |
|----------------------|---------|-------------|
| | | Special token for interacting with HuggingFace Inference API, required only when using the langchain-huggingface backend |
| EXTRA_BACKENDS | | A space separated list of backends to prepare. For example EXTRA_BACKENDS="backend/python/diffusers backend/python/transformers" prepares the Python environment on start |
| DISABLE_AUTODETECT | false | Disable autodetect of CPU flagset on start |
| LLAMACPP_GRPC_SERVERS | | A list of llama.cpp workers to distribute the workload. For example LLAMACPP_GRPC_SERVERS="address1:port,address2:port" |
Here is how to configure these variables:
docker run --env REBUILD=true localai
docker run --env-file .env localai
CLI Parameters
For a complete reference of all CLI parameters, environment variables, and command-line options, see the CLI Reference page.
You can control LocalAI with command line arguments to specify a binding address, number of threads, model paths, and many other options. Any command line parameter can be specified via an environment variable.
.env files
Any settings being provided by an Environment Variable can also be provided from within .env files. There are several locations that will be checked for relevant .env files. In order of precedence they are:
.env within the current directory
localai.env within the current directory
localai.env within the home directory
.config/localai.env within the home directory
/etc/localai.env
Environment variables within files earlier in the list will take precedence over environment variables defined in files later in the list.
You can send the 'Extra-Usage' request header ('Extra-Usage: true') to receive inference timings in milliseconds, extending the default OpenAI response model in the usage field:
LocalAI can be extended with extra backends. The backends are implemented as gRPC services and can be written in any language. See the backend section for more details on how to install and build new backends for LocalAI.
In runtime
When using the -core container image, it is possible to prepare the Python backends you are interested in by using the EXTRA_BACKENDS variable, for instance:
docker run --env EXTRA_BACKENDS="backend/python/diffusers" quay.io/go-skynet/local-ai:master
Concurrent requests
LocalAI supports parallel requests for the backends that support it. For instance, vLLM and llama.cpp support parallel requests, and thus LocalAI allows running multiple requests in parallel.
In order to enable parallel requests, you have to pass --parallel-requests or set the PARALLEL_REQUEST environment variable to true.
The environment variables that tweak parallelism are the following:
### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This actually controls whether a backend can process multiple requests or not.
### Define the number of parallel LLAMA.cpp workers (Defaults to 1)
### Enable to run parallel requests
Note that for llama.cpp you need to set LLAMACPP_PARALLEL accordingly to the number of parallel processes your GPU/CPU can handle. For Python-based backends (like vLLM) you can set PYTHON_GRPC_MAX_WORKERS to the number of parallel requests.
VRAM and Memory Management
For detailed information on managing VRAM when running multiple models, see the dedicated VRAM and Memory Management page.
Disable CPU flagset auto detection in llama.cpp
LocalAI will automatically discover the CPU flagset available in your host and will use the most optimized version of the backends.
If you want to disable this behavior, you can set DISABLE_AUTODETECT to true in the environment variables.
Fine-tuning LLMs for text generation
Note
Section under construction
This section covers how to fine-tune a language model for text generation and consume it in LocalAI.
Requirements
For this example you will need a GPU with at least 12GB of VRAM and a Linux box.
Fine-tuning
Fine-tuning a language model is a process that requires a lot of computational power and time.
Currently LocalAI doesn't support the fine-tuning endpoint, but there are plans to support that. For the time being, a guide is proposed here to give a simple starting point on how to fine-tune a model and use it with LocalAI (but also with llama.cpp).
There is an e2e example of fine-tuning a LLM model to use with LocalAI written by @mudler available here.
The steps involved are:
Preparing a dataset
Prepare the environment and install dependencies
Fine-tune the model
Merge the Lora base with the model
Convert the model to gguf
Use the model with LocalAI
Dataset preparation
We are going to need a dataset or a set of datasets.
Axolotl supports a variety of formats. In the notebook and in this example we are aiming for a very simple dataset built manually, so we are going to use the completion format, which requires the full text to be used for fine-tuning.
A dataset for an instructor model (like Alpaca) can look like the following:
[
  {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
  },
  {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
  }
]
Every text block is the whole text used for fine-tuning. For example, for an instruct model it follows this format (more or less):
<System prompt>
## Instruction
<Question, instruction>
## Response
<Expected response from the LLM>
The instruction format works like this: when running inference with the model, we feed it only the first part, up to the ## Instruction block, and the model completes the text with the ## Response block.
Prepare a dataset, and upload it to your Google Drive if you are using Google Colab. Otherwise, place it next to the axolotl.yaml file as dataset.json.
We will need to configure Axolotl. This example provides an axolotl.yaml file that uses openllama-3b for fine-tuning. Copy the axolotl.yaml file and edit it to your needs. The dataset needs to be next to it as dataset.json. You can find the axolotl.yaml file here.
If you have a big dataset, you can pre-tokenize it to speedup the fine-tuning process:
python -m axolotl.cli.preprocess axolotl.yaml
Now we are ready to start the fine-tuning process:
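With a standard Axolotl setup, the training step is typically launched like this (treat the exact invocation as an assumption and refer to the linked example/notebook for the full workflow, including merging the LoRA and converting to gguf):

accelerate launch -m axolotl.cli.train axolotl.yaml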
Now you should have ended up with a custom-model-q4_0.gguf file that you can copy into the LocalAI models directory and use with LocalAI.
VRAM and Memory Management
When running multiple models in LocalAI, especially on systems with limited GPU memory (VRAM), you may encounter situations where loading a new model fails because there isn’t enough available VRAM. LocalAI provides two mechanisms to automatically manage model memory allocation and prevent VRAM exhaustion.
The Problem
By default, LocalAI keeps models loaded in memory once they’re first used. This means:
If you load a large model that uses most of your VRAM, subsequent requests for other models may fail
Models remain in memory even when not actively being used
There’s no automatic mechanism to unload models to make room for new ones, unless done manually via the web interface
This is a common issue when working with GPU-accelerated models, as VRAM is typically more limited than system RAM. For more context, see issues #6068, #7269, and #5352.
Solution 1: Single Active Backend
The simplest approach is to ensure only one model is loaded at a time. When a new model is requested, LocalAI will automatically unload the currently active model before loading the new one.
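Assuming the LOCALAI_SINGLE_ACTIVE_BACKEND environment variable (or the corresponding CLI flag) is available in your version, this can be enabled like so:

LOCALAI_SINGLE_ACTIVE_BACKEND=true local-ai run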
For more flexible memory management, LocalAI provides watchdog mechanisms that automatically unload models based on their activity state. This allows multiple models to be loaded simultaneously, but automatically frees memory when models become inactive or stuck.
Idle Watchdog
The idle watchdog monitors models that haven’t been used for a specified period and automatically unloads them to free VRAM.
The busy watchdog monitors models that have been processing requests for an unusually long time and terminates them if they exceed a threshold. This is useful for detecting and recovering from stuck or hung backends.
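Assuming the watchdog settings follow the LOCALAI_WATCHDOG_* environment variable naming (an assumption; check local-ai run --help for the exact flags), enabling both watchdogs might look like:

LOCALAI_WATCHDOG_IDLE=true LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m \
LOCALAI_WATCHDOG_BUSY=true LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \
local-ai run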
See Backend Flags for all available backend configuration options
Model Configuration
LocalAI uses YAML configuration files to define model parameters, templates, and behavior. This page provides a complete reference for all available configuration options.
Besides llama based models, LocalAI is compatible also with other architectures. The table below lists all the backends, compatible models families and the associated repository.
Note
LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See the advanced section for more details.
NVIDIA CUDA: CUDA 11.7, CUDA 12.0 support across most backends
AMD ROCm: HIP-based acceleration for AMD GPUs
Intel oneAPI: SYCL-based acceleration for Intel GPUs (F16/F32 precision)
Vulkan: Cross-platform GPU acceleration
Metal: Apple Silicon GPU acceleration (M1/M2/M3+)
Specialized Hardware
NVIDIA Jetson (L4T): ARM64 support for embedded AI
Apple Silicon: Native Metal acceleration for Mac M1/M2/M3+
Darwin x86: Intel Mac support
CPU Optimization
AVX/AVX2/AVX512: Advanced vector extensions for x86
Quantization: 4-bit, 5-bit, 8-bit integer quantization support
Mixed Precision: F16/F32 mixed precision support
Note: any backend name listed above can be used in the backend field of the model configuration file (See the advanced section).
* Only for CUDA and OpenVINO CPU/XPU acceleration.
Architecture
LocalAI is an API written in Go that serves as an OpenAI shim, enabling software already developed with OpenAI SDKs to seamlessly integrate with LocalAI. It can be effortlessly implemented as a substitute, even on consumer-grade hardware. This capability is achieved by employing various C++ backends, including ggml, to perform inference on LLMs using both CPU and, if desired, GPU. Internally, LocalAI backends are just gRPC servers; you can specify and build your own gRPC server and extend LocalAI at runtime as well. It is possible to specify external gRPC servers and/or binaries that LocalAI will manage internally.
LocalAI uses a mixture of backends written in various languages (C++, Golang, Python, …). You can check the model compatibility table to learn about all the components of LocalAI.
Backstory
As with many typical open source projects, I, mudler, was fiddling around with llama.cpp over my long nights and wanted to have a way to call it from Go, as I am a Golang developer and use it extensively. So I created LocalAI (or what was initially known as llama-cli) and added an API to it.
But guess what? The more I dived into this rabbit hole, the more I realized that I had stumbled upon something big. With all the fantastic C++ projects floating around the community, it dawned on me that I could piece them together to create a full-fledged OpenAI replacement. So, ta-da! LocalAI was born, and it quickly overshadowed its humble origins.
Now, why did I choose to go with C++ bindings, you ask? Well, I wanted to keep LocalAI snappy and lightweight, allowing it to run like a champ on any system, avoid any Golang GC penalties, and, most importantly, build on the shoulders of giants like llama.cpp. Go is good at backends and APIs and is easy to maintain. And hey, don't forget that I'm all about sharing the love. That's why I made LocalAI MIT licensed, so everyone can hop on board and benefit from it.
As if that wasn’t exciting enough, as the project gained traction, mkellerman and Aisuko jumped in to lend a hand. mkellerman helped set up some killer examples, while Aisuko is becoming our community maestro. The community now is growing even more with new contributors and users, and I couldn’t be happier about it!
Oh, and let’s not forget the real MVP here—llama.cpp. Without this extraordinary piece of software, LocalAI wouldn’t even exist. So, a big shoutout to the community for making this magic happen!
CLI Reference
Complete reference for all LocalAI command-line interface (CLI) parameters and environment variables.
Note: All CLI flags can also be set via environment variables. Environment variables take precedence over CLI flags. See .env files for configuration file support.
Global Flags
| Parameter | Default | Description | Environment Variable |
|-----------|---------|-------------|----------------------|
| -h, --help | | Show context-sensitive help | |
| --log-level | info | Set the level of logs to output [error,warn,info,debug,trace] | $LOCALAI_LOG_LEVEL |
| --debug | false | DEPRECATED - Use --log-level=debug instead. Enable debug logging | $LOCALAI_DEBUG, $DEBUG |
Storage Flags
| Parameter | Default | Description | Environment Variable |
|-----------|---------|-------------|----------------------|
| --models-path | BASEPATH/models | Path containing models used for inferencing | $LOCALAI_MODELS_PATH, $MODELS_PATH |
| --generated-content-path | /tmp/generated/content | Location for assets generated by backends (e.g. stablediffusion, images, audio, videos) | |
| | | If not empty, add that string to the Machine-Tag header in each response. Useful to track responses from different machines when using multiple P2P federated nodes | $LOCALAI_MACHINE_TAG, $MACHINE_TAG |
Hardening Flags
| Parameter | Default | Description | Environment Variable |
|-----------|---------|-------------|----------------------|
| --disable-predownload-scan | false | If true, disables the best-effort security scanner before downloading any files | $LOCALAI_DISABLE_PREDOWNLOAD_SCAN |
| --opaque-errors | false | If true, all error responses are replaced with blank 500 errors. This is intended only for hardening against information leaks and is normally not recommended | $LOCALAI_OPAQUE_ERRORS |
| --use-subtle-key-comparison | false | If true, API Key validation comparisons will be performed using constant-time comparisons rather than simple equality. This trades off performance on each request for resilience against timing attacks | $LOCALAI_SUBTLE_KEY_COMPARISON |
| --disable-api-key-requirement-for-http-get | false | If true, a valid API key is not required to issue GET requests to portions of the web UI. This should only be enabled in secure testing environments | |
| | | If --disable-api-key-requirement-for-http-get is overridden to true, this is the list of endpoints to exempt. Only adjust this in case of a security incident or as a result of a personal security posture review | $LOCALAI_HTTP_GET_EXEMPTED_ENDPOINTS |
P2P Flags
| Parameter | Default | Description | Environment Variable |
|-----------|---------|-------------|----------------------|
| --p2p | false | Enable P2P mode | $LOCALAI_P2P, $P2P |
| --p2p-dht-interval | 360 | Interval for DHT refresh (used during token generation) | $LOCALAI_P2P_DHT_INTERVAL, $P2P_DHT_INTERVAL |
| --p2p-otp-interval | 9000 | Interval for OTP refresh (used during token generation) | $LOCALAI_P2P_OTP_INTERVAL, $P2P_OTP_INTERVAL |
| --p2ptoken | | Token for P2P mode (optional) | $LOCALAI_P2P_TOKEN, $P2P_TOKEN, $TOKEN |
| --p2p-network-id | | Network ID for P2P mode, can be set arbitrarily by the user for grouping a set of instances | $LOCALAI_P2P_NETWORK_ID, $P2P_NETWORK_ID |
| --federated | false | Enable federated instance | $LOCALAI_FEDERATED, $FEDERATED |
Other Commands
LocalAI supports several subcommands beyond run:
local-ai models - Manage LocalAI models and definitions
local-ai backends - Manage LocalAI backends and definitions
local-ai tts - Convert text to speech
local-ai sound-generation - Generate audio files from text or audio
local-ai transcript - Convert audio to text
local-ai worker - Run workers to distribute workload (llama.cpp-only)
local-ai util - Utility commands
local-ai explorer - Run P2P explorer
local-ai federated - Run LocalAI in federated mode
Use local-ai <command> --help for more information on each command.
Examples
Basic Usage
./local-ai run
./local-ai run --models-path /path/to/models --address :9090
./local-ai run --f16
Environment Variables
export LOCALAI_MODELS_PATH=/path/to/models
export LOCALAI_ADDRESS=:9090
export LOCALAI_F16=true
./local-ai run
LocalAI binaries are available for both Linux and MacOS platforms and can be executed directly from your command line. These binaries are continuously updated and hosted on our GitHub Releases page. This method also supports Windows users via the Windows Subsystem for Linux (WSL).
macOS Download
You can download the DMG and install the application:
Binaries do have limited support compared to container images:
Python-based backends are not shipped with binaries (e.g. bark, diffusers or transformers)
MacOS binaries and Linux-arm64 do not ship TTS nor stablediffusion-cpp backends
Linux binaries do not ship stablediffusion-cpp backend
Running on Nvidia ARM64
LocalAI can be run on Nvidia ARM64 devices, such as the Jetson Nano, Jetson Xavier NX, and Jetson AGX Xavier. The following instructions will guide you through building the LocalAI container for Nvidia ARM64 devices.
Run the LocalAI container on Nvidia ARM64 devices using the following command, where /data/models is the directory containing the models:
docker run -e DEBUG=true -p 8080:8080 -v /data/models:/models -ti --restart=always --name local-ai --runtime nvidia --gpus all quay.io/go-skynet/local-ai:master-nvidia-l4t-arm64-core
Note: /data/models is the directory containing the models. You can replace it with the directory containing your models.
FAQ
Frequently asked questions
Here are answers to some of the most common questions.
How do I get models?
Most gguf-based models should work, but newer models may require additions to the API. If a model doesn't work, please feel free to open up issues. However, be cautious about downloading models from the internet and directly onto your machine, as there may be security vulnerabilities in llama.cpp or ggml that could be maliciously exploited. Some models can be found on Hugging Face: https://huggingface.co/models?search=gguf, or models from gpt4all are compatible too: https://github.com/nomic-ai/gpt4all.
Where are models stored?
LocalAI stores downloaded models in the following locations by default:
Command line: ./models (relative to current working directory)
Docker: /models (inside the container, typically mounted to ./models on host)
Launcher application: ~/.localai/models (in your home directory)
You can customize the model storage location using the LOCALAI_MODELS_PATH environment variable or --models-path command line flag. This is useful if you want to store models outside your home directory for backup purposes or to avoid filling up your home directory with large model files.
How much storage space do models require?
Model sizes vary significantly depending on the model and quantization level:
Ensure you have at least 2-3x the model size available for downloads and temporary files
Use SSD storage for better performance
Consider the model size relative to your system RAM - models larger than your RAM may not run efficiently
Benchmarking LocalAI and llama.cpp shows different results!
LocalAI applies a set of defaults when loading models with the llama.cpp backend, one of these is mirostat sampling - while it achieves better results, it slows down the inference. You can disable this by setting mirostat: 0 in the model config file. See also the advanced section (/advanced/) for more information and this issue.
What’s the difference with Serge, or XXX?
LocalAI is a multi-model solution that doesn't focus on a specific model type (e.g., llama.cpp or alpaca.cpp); it handles all of these internally for faster inference and is easy to set up locally and deploy to Kubernetes.
Everything is slow, how is it possible?
There are a few situations in which this could occur. Some tips are:
Don't use an HDD to store your models. Prefer an SSD over an HDD. If you are stuck with an HDD, disable mmap in the model config file so it loads everything into memory.
Watch out for CPU overbooking. Ideally, --threads should match the number of physical cores. For instance, if your CPU has 4 cores, you would ideally allocate <= 4 threads to a model.
Run LocalAI with DEBUG=true. This gives more information, including stats on the token inference speed.
Check that you are actually getting an output: run a simple curl request with "stream": true to see how fast the model is responding.
Can I use it with a Discord bot, or XXX?
Yes! If the client uses OpenAI and supports setting a different base URL for requests, you can use the LocalAI endpoint. This allows LocalAI to be used with every application that was built to work with OpenAI, without changing the application!
Can this leverage GPUs?
There is GPU support, see /features/gpu-acceleration/.
Where is the webUI?
localai-webui and chatbot-ui are available in the examples section and can be set up as per the instructions. However, as LocalAI is an API, you can already plug it into existing projects that provide UI interfaces to OpenAI's APIs. There are several already on GitHub, and they should be compatible with LocalAI already (as it mimics the OpenAI API).
Enable the debug mode by setting DEBUG=true in the environment variables. This will give you more information on what’s going on.
You can also specify --debug in the command line.
I’m getting ‘invalid pitch’ error when running with CUDA, what’s wrong?
This typically happens when your prompt exceeds the context size. Try to reduce the prompt size, or increase the context size.
I’m getting a ‘SIGILL’ error, what’s wrong?
Your CPU probably does not have support for certain instructions that are compiled by default in the pre-built binaries. If you are running in a container, try setting REBUILD=true and disable the CPU instructions that are not compatible with your CPU. For instance: CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF" make build