Subsections of References

Model compatibility table

Besides llama-based models, LocalAI is also compatible with other model architectures. The tables below list the available backends, the model families they are compatible with, and their capabilities.

Note

LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See the advanced section for more details.
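
A minimal sketch of such a configuration, written into the models path (the model name, file name and backend identifier below are illustrative assumptions, not a shipped configuration):

cat > models/my-model.yaml <<'EOF'
name: my-model
# pin the model to one of the backends listed in the tables below
backend: llama-cpp
parameters:
  model: my-model.Q4_K_M.gguf
EOF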

Text Generation & Language Models

| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| llama.cpp | LLama, Mamba, RWKV, Falcon, Starcoder, GPT-2, and many others | yes | GPT and Functions | yes | yes | CUDA 11/12, ROCm, Intel SYCL, Vulkan, Metal, CPU |
| vLLM | Various GPTs and quantization formats | yes | GPT | no | no | CUDA 12, ROCm, Intel |
| transformers | Various GPTs and quantization formats | yes | GPT, embeddings, Audio generation | yes | yes* | CUDA 11/12, ROCm, Intel, CPU |
| exllama2 | GPTQ | yes | GPT only | no | no | CUDA 12 |
| MLX | Various LLMs | yes | GPT | no | no | Metal (Apple Silicon) |
| MLX-VLM | Vision-Language Models | yes | Multimodal GPT | no | no | Metal (Apple Silicon) |
| langchain-huggingface | Any text generators available on HuggingFace through API | yes | GPT | no | no | N/A |

Audio & Speech Processing

| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| whisper.cpp | whisper | no | Audio transcription | no | no | CUDA 12, ROCm, Intel SYCL, Vulkan, CPU |
| faster-whisper | whisper | no | Audio transcription | no | no | CUDA 12, ROCm, Intel, CPU |
| piper (binding) | Any piper onnx model | no | Text to voice | no | no | CPU |
| bark | bark | no | Audio generation | no | no | CUDA 12, ROCm, Intel |
| bark-cpp | bark | no | Audio-Only | no | no | CUDA, Metal, CPU |
| coqui | Coqui TTS | no | Audio generation and Voice cloning | no | no | CUDA 12, ROCm, Intel, CPU |
| kokoro | Kokoro TTS | no | Text-to-speech | no | no | CUDA 12, ROCm, Intel, CPU |
| chatterbox | Chatterbox TTS | no | Text-to-speech | no | no | CUDA 11/12, CPU |
| kitten-tts | Kitten TTS | no | Text-to-speech | no | no | CPU |
| silero-vad with Golang bindings | Silero VAD | no | Voice Activity Detection | no | no | CPU |
| neutts | NeuTTSAir | no | Text-to-speech with voice cloning | no | no | CUDA 12, ROCm, CPU |
| mlx-audio | MLX | no | Text-to-speech | no | no | Metal (Apple Silicon) |

Image & Video Generation

| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| stablediffusion.cpp | stablediffusion-1, stablediffusion-2, stablediffusion-3, flux, PhotoMaker | no | Image | no | no | CUDA 12, Intel SYCL, Vulkan, CPU |
| diffusers | SD, various diffusion models, … | no | Image/Video generation | no | no | CUDA 11/12, ROCm, Intel, Metal, CPU |
| transformers-musicgen | MusicGen | no | Audio generation | no | no | CUDA, CPU |

Specialized AI Tasks

| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| rfdetr | RF-DETR | no | Object Detection | no | no | CUDA 12, Intel, CPU |
| rerankers | Reranking API | no | Reranking | no | no | CUDA 11/12, ROCm, Intel, CPU |
| local-store | Vector database | no | Vector storage | yes | no | CPU |
| huggingface | HuggingFace API models | yes | Various AI tasks | yes | yes | API-based |

Acceleration Support Summary

GPU Acceleration

  • NVIDIA CUDA: CUDA 11.7, CUDA 12.0 support across most backends
  • AMD ROCm: HIP-based acceleration for AMD GPUs
  • Intel oneAPI: SYCL-based acceleration for Intel GPUs (F16/F32 precision)
  • Vulkan: Cross-platform GPU acceleration
  • Metal: Apple Silicon GPU acceleration (M1/M2/M3+)

Specialized Hardware

  • NVIDIA Jetson (L4T): ARM64 support for embedded AI
  • Apple Silicon: Native Metal acceleration for Mac M1/M2/M3+
  • Darwin x86: Intel Mac support

CPU Optimization

  • AVX/AVX2/AVX512: Advanced vector extensions for x86
  • Quantization: 4-bit, 5-bit, 8-bit integer quantization support
  • Mixed Precision: F16/F32 mixed precision support

Note: any backend name listed above can be used in the backend field of the model configuration file (See the advanced section).

  • * Token streaming in the transformers backend is supported only with CUDA and OpenVINO CPU/XPU acceleration.

Architecture

LocalAI is an API written in Go that serves as an OpenAI shim, enabling software already developed with OpenAI SDKs to seamlessly integrate with LocalAI. It can be effortlessly implemented as a substitute, even on consumer-grade hardware. This capability is achieved by employing various C++ backends, including ggml, to perform inference on LLMs using both CPU and, if desired, GPU. Internally, LocalAI backends are just gRPC servers: you can build your own gRPC server and extend LocalAI at runtime as well. It is possible to specify external gRPC servers and/or binaries that LocalAI will manage internally.
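
As a sketch of how an external backend can be wired in (the backend name, host and port below are placeholders for your own gRPC backend, following the BACKEND_NAME:URI format of the --external-grpc-backends flag):

./local-ai run --external-grpc-backends "my-backend:127.0.0.1:50051"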

LocalAI uses a mixture of backends written in various languages (C++, Golang, Python, …). You can check the model compatibility table to learn about all the components of LocalAI.


Backstory

As is typical of open source projects, I, mudler, was fiddling around with llama.cpp over my long nights and wanted a way to call it from Go, as I am a Golang developer and use it extensively. So I created LocalAI (initially known as llama-cli) and added an API to it.

But guess what? The more I dived into this rabbit hole, the more I realized that I had stumbled upon something big. With all the fantastic C++ projects floating around the community, it dawned on me that I could piece them together to create a full-fledged OpenAI replacement. So, ta-da! LocalAI was born, and it quickly overshadowed its humble origins.

Now, why did I choose to go with C++ bindings, you ask? Well, I wanted to keep LocalAI snappy and lightweight, allowing it to run like a champ on any system, avoid Go's garbage-collector penalties, and, most importantly, build on the shoulders of giants like llama.cpp. Go is good at backends and APIs and is easy to maintain. And hey, don’t forget that I’m all about sharing the love. That’s why I made LocalAI MIT licensed, so everyone can hop on board and benefit from it.

As if that wasn’t exciting enough, as the project gained traction, mkellerman and Aisuko jumped in to lend a hand. mkellerman helped set up some killer examples, while Aisuko is becoming our community maestro. The community now is growing even more with new contributors and users, and I couldn’t be happier about it!

Oh, and let’s not forget the real MVP here—llama.cpp. Without this extraordinary piece of software, LocalAI wouldn’t even exist. So, a big shoutout to the community for making this magic happen!

CLI Reference

Complete reference for all LocalAI command-line interface (CLI) parameters and environment variables.

Note: All CLI flags can also be set via environment variables. Environment variables take precedence over CLI flags. See .env files for configuration file support.

Global Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| -h, --help | | Show context-sensitive help | |
| --log-level | info | Set the level of logs to output [error,warn,info,debug,trace] | $LOCALAI_LOG_LEVEL |
| --debug | false | DEPRECATED - Use --log-level=debug instead. Enable debug logging | $LOCALAI_DEBUG, $DEBUG |
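
For example, to start the server with verbose logging, either of these forms should be equivalent:

./local-ai run --log-level=debug

LOCALAI_LOG_LEVEL=debug ./local-ai run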

Storage Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --models-path | BASEPATH/models | Path containing models used for inferencing | $LOCALAI_MODELS_PATH, $MODELS_PATH |
| --generated-content-path | /tmp/generated/content | Location for assets generated by backends (e.g. stablediffusion, images, audio, videos) | $LOCALAI_GENERATED_CONTENT_PATH, $GENERATED_CONTENT_PATH |
| --upload-path | /tmp/localai/upload | Path to store uploads from files API | $LOCALAI_UPLOAD_PATH, $UPLOAD_PATH |
| --localai-config-dir | BASEPATH/configuration | Directory for dynamic loading of certain configuration files (currently api_keys.json and external_backends.json) | $LOCALAI_CONFIG_DIR |
| --localai-config-dir-poll-interval | | Time duration to poll the LocalAI Config Dir if your system has broken fsnotify events (example: 1m) | $LOCALAI_CONFIG_DIR_POLL_INTERVAL |
| --models-config-file | | YAML file containing a list of model backend configs (alias: --config-file) | $LOCALAI_MODELS_CONFIG_FILE, $CONFIG_FILE |

Backend Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --backends-path | BASEPATH/backends | Path containing backends used for inferencing | $LOCALAI_BACKENDS_PATH, $BACKENDS_PATH |
| --backends-system-path | /usr/share/localai/backends | Path containing system backends used for inferencing | $LOCALAI_BACKENDS_SYSTEM_PATH, $BACKEND_SYSTEM_PATH |
| --external-backends | | A list of external backends to load from gallery on boot | $LOCALAI_EXTERNAL_BACKENDS, $EXTERNAL_BACKENDS |
| --external-grpc-backends | | A list of external gRPC backends (format: BACKEND_NAME:URI) | $LOCALAI_EXTERNAL_GRPC_BACKENDS, $EXTERNAL_GRPC_BACKENDS |
| --backend-galleries | | JSON list of backend galleries | $LOCALAI_BACKEND_GALLERIES, $BACKEND_GALLERIES |
| --autoload-backend-galleries | true | Automatically load backend galleries on startup | $LOCALAI_AUTOLOAD_BACKEND_GALLERIES, $AUTOLOAD_BACKEND_GALLERIES |
| --parallel-requests | false | Enable backends to handle multiple requests in parallel if they support it (e.g.: llama.cpp or vllm) | $LOCALAI_PARALLEL_REQUESTS, $PARALLEL_REQUESTS |
| --single-active-backend | false | Allow only one backend to be run at a time | $LOCALAI_SINGLE_ACTIVE_BACKEND, $SINGLE_ACTIVE_BACKEND |
| --preload-backend-only | false | Do not launch the API services, only the preloaded models/backends are started (useful for multi-node setups) | $LOCALAI_PRELOAD_BACKEND_ONLY, $PRELOAD_BACKEND_ONLY |
| --enable-watchdog-idle | false | Enable watchdog for stopping backends that are idle longer than the watchdog-idle-timeout | $LOCALAI_WATCHDOG_IDLE, $WATCHDOG_IDLE |
| --watchdog-idle-timeout | 15m | Threshold beyond which an idle backend should be stopped | $LOCALAI_WATCHDOG_IDLE_TIMEOUT, $WATCHDOG_IDLE_TIMEOUT |
| --enable-watchdog-busy | false | Enable watchdog for stopping backends that are busy longer than the watchdog-busy-timeout | $LOCALAI_WATCHDOG_BUSY, $WATCHDOG_BUSY |
| --watchdog-busy-timeout | 5m | Threshold beyond which a busy backend should be stopped | $LOCALAI_WATCHDOG_BUSY_TIMEOUT, $WATCHDOG_BUSY_TIMEOUT |

For more information on VRAM management, see VRAM and Memory Management.
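
For example, a memory-constrained setup that keeps only one backend loaded at a time and reclaims idle backends could combine these flags (an illustrative sketch, not a recommended default):

./local-ai run --single-active-backend --enable-watchdog-idle --watchdog-idle-timeout=10m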

Models Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --galleries | | JSON list of galleries | $LOCALAI_GALLERIES, $GALLERIES |
| --autoload-galleries | true | Automatically load galleries on startup | $LOCALAI_AUTOLOAD_GALLERIES, $AUTOLOAD_GALLERIES |
| --preload-models | | A list of models to apply in JSON at start | $LOCALAI_PRELOAD_MODELS, $PRELOAD_MODELS |
| --models | | A list of model configuration URLs to load | $LOCALAI_MODELS, $MODELS |
| --preload-models-config | | A list of models to apply at startup. Path to a YAML config file | $LOCALAI_PRELOAD_MODELS_CONFIG, $PRELOAD_MODELS_CONFIG |
| --load-to-memory | | A list of models to load into memory at startup | $LOCALAI_LOAD_TO_MEMORY, $LOAD_TO_MEMORY |

Note: You can also pass model configuration URLs as positional arguments: local-ai run MODEL_URL1 MODEL_URL2 ...
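
For example (the URL below is a placeholder, not a real configuration file):

./local-ai run https://example.com/configs/my-model.yaml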

Performance Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --f16 | false | Enable GPU acceleration | $LOCALAI_F16, $F16 |
| -t, --threads | | Number of threads used for parallel computation. Usage of the number of physical cores in the system is suggested | $LOCALAI_THREADS, $THREADS |
| --context-size | | Default context size for models | $LOCALAI_CONTEXT_SIZE, $CONTEXT_SIZE |
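
For example, to enable GPU acceleration and pin the thread count and default context size (the values are illustrative; match --threads to your physical core count):

./local-ai run --f16 --threads 8 --context-size 4096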

API Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --address | :8080 | Bind address for the API server | $LOCALAI_ADDRESS, $ADDRESS |
| --cors | false | Enable CORS (Cross-Origin Resource Sharing) | $LOCALAI_CORS, $CORS |
| --cors-allow-origins | | Comma-separated list of allowed CORS origins | $LOCALAI_CORS_ALLOW_ORIGINS, $CORS_ALLOW_ORIGINS |
| --csrf | false | Enable Fiber CSRF middleware | $LOCALAI_CSRF |
| --upload-limit | 15 | Default upload-limit in MB | $LOCALAI_UPLOAD_LIMIT, $UPLOAD_LIMIT |
| --api-keys | | List of API Keys to enable API authentication. When this is set, all requests must be authenticated with one of these API keys | $LOCALAI_API_KEY, $API_KEY |
| --disable-webui | false | Disables the web user interface. When set to true, the server will only expose API endpoints without serving the web interface | $LOCALAI_DISABLE_WEBUI, $DISABLE_WEBUI |
| --disable-gallery-endpoint | false | Disable the gallery endpoints | $LOCALAI_DISABLE_GALLERY_ENDPOINT, $DISABLE_GALLERY_ENDPOINT |
| --disable-metrics-endpoint | false | Disable the /metrics endpoint | $LOCALAI_DISABLE_METRICS_ENDPOINT, $DISABLE_METRICS_ENDPOINT |
| --machine-tag | | If not empty, add that string to Machine-Tag header in each response. Useful to track response from different machines using multiple P2P federated nodes | $LOCALAI_MACHINE_TAG, $MACHINE_TAG |
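
For example, to require an API key and then authenticate a request against the OpenAI-compatible API (the key is a placeholder; requests are expected to carry a standard Bearer token):

./local-ai run --api-keys my-secret-key

curl http://localhost:8080/v1/models -H "Authorization: Bearer my-secret-key"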

Hardening Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --disable-predownload-scan | false | If true, disables the best-effort security scanner before downloading any files | $LOCALAI_DISABLE_PREDOWNLOAD_SCAN |
| --opaque-errors | false | If true, all error responses are replaced with blank 500 errors. This is intended only for hardening against information leaks and is normally not recommended | $LOCALAI_OPAQUE_ERRORS |
| --use-subtle-key-comparison | false | If true, API Key validation comparisons will be performed using constant-time comparisons rather than simple equality. This trades off performance on each request for resilience against timing attacks | $LOCALAI_SUBTLE_KEY_COMPARISON |
| --disable-api-key-requirement-for-http-get | false | If true, a valid API key is not required to issue GET requests to portions of the web UI. This should only be enabled in secure testing environments | $LOCALAI_DISABLE_API_KEY_REQUIREMENT_FOR_HTTP_GET |
| --http-get-exempted-endpoints | ^/$,^/browse/?$,^/talk/?$,^/p2p/?$,^/chat/?$,^/text2image/?$,^/tts/?$,^/static/.*$,^/swagger.*$ | If --disable-api-key-requirement-for-http-get is overridden to true, this is the list of endpoints to exempt. Only adjust this in case of a security incident or as a result of a personal security posture review | $LOCALAI_HTTP_GET_EXEMPTED_ENDPOINTS |

P2P Flags

| Parameter | Default | Description | Environment Variable |
|---|---|---|---|
| --p2p | false | Enable P2P mode | $LOCALAI_P2P, $P2P |
| --p2p-dht-interval | 360 | Interval for DHT refresh (used during token generation) | $LOCALAI_P2P_DHT_INTERVAL, $P2P_DHT_INTERVAL |
| --p2p-otp-interval | 9000 | Interval for OTP refresh (used during token generation) | $LOCALAI_P2P_OTP_INTERVAL, $P2P_OTP_INTERVAL |
| --p2ptoken | | Token for P2P mode (optional) | $LOCALAI_P2P_TOKEN, $P2P_TOKEN, $TOKEN |
| --p2p-network-id | | Network ID for P2P mode, can be set arbitrarily by the user for grouping a set of instances | $LOCALAI_P2P_NETWORK_ID, $P2P_NETWORK_ID |
| --federated | false | Enable federated instance | $LOCALAI_FEDERATED, $FEDERATED |

Other Commands

LocalAI supports several subcommands beyond run:

  • local-ai models - Manage LocalAI models and definitions
  • local-ai backends - Manage LocalAI backends and definitions
  • local-ai tts - Convert text to speech
  • local-ai sound-generation - Generate audio files from text or audio
  • local-ai transcript - Convert audio to text
  • local-ai worker - Run workers to distribute workload (llama.cpp-only)
  • local-ai util - Utility commands
  • local-ai explorer - Run P2P explorer
  • local-ai federated - Run LocalAI in federated mode

Use local-ai <command> --help for more information on each command.
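
For instance, assuming the models and backends subcommands follow the usual list/install pattern of the galleries, browsing what is available might look like:

local-ai models list

local-ai backends list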

Examples

Basic Usage

./local-ai run

./local-ai run --models-path /path/to/models --address :9090

./local-ai run --f16

Environment Variables

export LOCALAI_MODELS_PATH=/path/to/models
export LOCALAI_ADDRESS=:9090
export LOCALAI_F16=true
./local-ai run

Advanced Configuration

./local-ai run \
  --models model1.yaml model2.yaml \
  --enable-watchdog-idle \
  --watchdog-idle-timeout=10m \
  --p2p \
  --federated

LocalAI binaries

LocalAI binaries are available for both Linux and macOS and can be executed directly from your command line. They are continuously updated and hosted on our GitHub Releases page. This method also supports Windows users via the Windows Subsystem for Linux (WSL).

macOS Download

You can download the DMG and install the application:

Download LocalAI for macOS

Note: the DMGs are not signed by Apple, so macOS will quarantine them. See https://github.com/mudler/LocalAI/issues/6268 for a workaround; the fix is tracked in https://github.com/mudler/LocalAI/issues/6244

Alternatively, use the following one-liner in your terminal to download and run LocalAI on Linux or macOS:

curl -Lo local-ai "https://github.com/mudler/LocalAI/releases/download/v3.7.0/local-ai-$(uname -s)-$(uname -m)" && chmod +x local-ai && ./local-ai

Otherwise, here are the links to the binaries:

| OS | Link |
|---|---|
| Linux (amd64) | Download |
| Linux (arm64) | Download |
| MacOS (arm64) | Download |

Binaries have limited support compared to container images:

  • Python-based backends (e.g. bark, diffusers or transformers) are not shipped with the binaries
  • macOS and Linux arm64 binaries do not ship the TTS or stablediffusion-cpp backends
  • Linux binaries do not ship the stablediffusion-cpp backend

Running on Nvidia ARM64

LocalAI can be run on Nvidia ARM64 devices, such as the Jetson Nano, Jetson Xavier NX, and Jetson AGX Xavier. The following instructions will guide you through building the LocalAI container for Nvidia ARM64 devices.

Prerequisites

Build the container

Build the LocalAI container for Nvidia ARM64 devices using the following command:

git clone https://github.com/mudler/LocalAI

cd LocalAI

docker build --build-arg SKIP_DRIVERS=true --build-arg BUILD_TYPE=cublas --build-arg BASE_IMAGE=nvcr.io/nvidia/l4t-jetpack:r36.4.0 --build-arg IMAGE_TYPE=core -t quay.io/go-skynet/local-ai:master-nvidia-l4t-arm64-core .

Alternatively, prebuilt images are available on quay.io and Docker Hub:

docker pull quay.io/go-skynet/local-ai:master-nvidia-l4t-arm64-core

Usage

Run the LocalAI container on Nvidia ARM64 devices using the following command, where /data/models is the directory containing the models:

docker run -e DEBUG=true -p 8080:8080 -v /data/models:/models  -ti --restart=always --name local-ai --runtime nvidia --gpus all quay.io/go-skynet/local-ai:master-nvidia-l4t-arm64-core

Note: replace /data/models with the directory containing your models.
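
Once the container is running, a quick way to verify that the API is reachable is to query the OpenAI-compatible models endpoint:

curl http://localhost:8080/v1/models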