Advanced usage
Model Configuration with YAML Files
LocalAI uses YAML configuration files to define model parameters, templates, and behavior. You can create individual YAML files in the models directory or use a single configuration file with multiple models.
Quick Example:
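As a minimal sketch (the model name, backend, and file names here are illustrative, not prescriptive — see the Model Configuration page for the full set of options):

```yaml
# models/gpt-3.5-turbo.yaml — illustrative example
name: gpt-3.5-turbo
backend: llama-cpp            # backend name is an assumption; pick one from the compatibility table
parameters:
  model: <MODEL_FILE>         # model file placed in the models directory
  temperature: 0.2
context_size: 4096
```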
For a complete reference of all available configuration options, see the Model Configuration page.
Configuration File Locations:
- Individual files: Create `.yaml` files in your models directory (e.g., `models/gpt-3.5-turbo.yaml`)
- Single config file: Use `--models-config-file` or `LOCALAI_MODELS_CONFIG_FILE` to specify a file containing multiple models
- Remote URLs: Specify a URL to a YAML configuration file at startup:
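For instance (the binary name and URL are illustrative; the flag is the `--models-config-file` option mentioned above):

```shell
# Illustrative: point LocalAI at a remote YAML configuration file on startup
local-ai --models-config-file https://example.com/models.yaml
```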
See also chatbot-ui as an example on how to use config files.
Prompt templates
The API doesn’t inject a default prompt for talking to the model. You have to use a prompt similar to what’s described in the stanford-alpaca docs: https://github.com/tatsu-lab/stanford_alpaca#data-release.
See the prompt-templates directory in this repository for templates for some of the most popular models.
For the edit endpoint, an example template for alpaca-based models can be:
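One possible template, following the alpaca format linked above (the `{{.Instruction}}`/`{{.Input}}` placeholders assume Go-template-style variables, as used elsewhere in the prompt-templates directory):

```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{{.Instruction}}

### Input:
{{.Input}}

### Response:
```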
Install models using the API
Instead of installing models manually, you can use the LocalAI API endpoints together with a model definition to install models programmatically at runtime.
A curated collection of model files is available in the model-gallery. These gallery files are different from the model configuration files used by LocalAI: a gallery file contains information about the model setup and the files necessary to run the model locally.
To install, for example, lunademo, you can send a POST request to the /models/apply endpoint with the model definition URL (url) and, optionally, the name the model should have in LocalAI (name):
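A sketch of such a request (the server address assumes the default binding, and the definition URL is a placeholder for the lunademo gallery file):

```shell
# Illustrative POST to /models/apply; replace <MODEL_DEFINITION_URL> with the gallery file URL
curl http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{
    "url": "<MODEL_DEFINITION_URL>",
    "name": "lunademo"
  }'
```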
Preloading models during startup
In order to allow the API to start up with all the needed models on first start, the model gallery files can be used during startup.
PRELOAD_MODELS (or --preload-models) takes a JSON list with the same parameters as the API calls to the /models/apply endpoint.
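For example (binary name and URL are illustrative; the JSON fields mirror the /models/apply payload shown earlier):

```shell
# Illustrative: preload one model at startup via an environment variable
PRELOAD_MODELS='[{"url": "<MODEL_DEFINITION_URL>", "name": "gpt-3.5-turbo"}]' local-ai
```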
Similarly, a path to a YAML configuration file containing a list of models can be specified with PRELOAD_MODELS_CONFIG (or --preload-models-config):
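A minimal sketch of such a YAML file (the URL is a placeholder; each entry takes the same fields as the /models/apply payload):

```yaml
# Illustrative preload list: one entry per model
- url: <MODEL_DEFINITION_URL>
  name: gpt-3.5-turbo
```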
Automatic prompt caching
LocalAI can automatically cache prompts for faster loading. This can be useful if your model needs a prompt template with prefixed text before the input.
To enable prompt caching, you can control the settings in the model config YAML file:
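A minimal fragment using the two settings described below (the file name "cache" is illustrative):

```yaml
# Illustrative model config fragment for prompt caching
prompt_cache_path: "cache"   # path relative to the models folder
prompt_cache_all: true       # create and fill the cache on first load
```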
prompt_cache_path is relative to the models folder. You can enter a name for the file that will be automatically created during the first load if prompt_cache_all is set to true.
Configuring a specific backend for the model
By default, LocalAI will try to autoload the model by trying all the backends. This might work for most models, but some backends are NOT configured to autoload.
The available backends are listed in the model compatibility table.
In order to specify a backend for your models, create a model config file in your models directory specifying the backend:
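For instance (model name, backend name, and file are illustrative; use a backend from the compatibility table):

```yaml
# Illustrative: pin a specific backend instead of relying on autoload
name: gpt-3.5-turbo
backend: llama-cpp
parameters:
  model: <MODEL_FILE>
```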
Connect external backends
LocalAI backends are internally implemented using gRPC services. This also allows LocalAI to connect to external gRPC services on start and extend LocalAI functionalities via third-party binaries.
The --external-grpc-backends parameter in the CLI can be used either to specify a local backend (a file) or a remote URL. The syntax is <BACKEND_NAME>:<BACKEND_URI>. Once LocalAI is started with it, the new backend name will be available for all the API endpoints.
So for instance, to register a new backend which is a local file:
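A sketch (the backend name and file path are placeholders):

```shell
# Illustrative: register an external backend implemented as a local file
./local-ai --external-grpc-backends "my-backend:/path/to/my/backend.py"
```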
Or a remote URI:
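Using the same `<BACKEND_NAME>:<BACKEND_URI>` syntax with a remote address (host and port are placeholders):

```shell
# Illustrative: register an external backend served at a remote gRPC address
./local-ai --external-grpc-backends "my-backend:host:port"
```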
For example, to start vllm manually after compiling LocalAI (also assuming running the command from the root of the repository):
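A sketch of such an invocation — the entrypoint path assumes the vllm backend ships a run.sh script under backend/python/vllm in the repository layout, which may differ in your checkout:

```shell
# Illustrative: wire the locally-built vllm backend into LocalAI
./local-ai --external-grpc-backends "vllm:$PWD/backend/python/vllm/run.sh"
```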
Note that it is first necessary to create the environment with:
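Assuming the repository Makefile exposes a per-backend target (an assumption; check the backend's own README if this target does not exist):

```shell
# Illustrative: prepare the python environment for the vllm backend
make -C backend/python/vllm
```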
Environment variables
When LocalAI runs in a container, there are additional environment variables available that modify the behavior of LocalAI on startup:
| Environment variable | Default | Description |
|---|---|---|
| REBUILD | false | Rebuild LocalAI on startup |
| BUILD_TYPE | | Build type. Available: cublas, openblas, clblas, intel (intel core), sycl_f16, sycl_f32 (intel backends) |
| GO_TAGS | | Go tags. Available: stablediffusion |
| HUGGINGFACEHUB_API_TOKEN | | Special token for interacting with HuggingFace Inference API, required only when using the langchain-huggingface backend |
| EXTRA_BACKENDS | | A space-separated list of backends to prepare. For example, EXTRA_BACKENDS="backend/python/diffusers backend/python/transformers" prepares the python environment on start |
| DISABLE_AUTODETECT | false | Disable autodetect of CPU flagset on start |
| LLAMACPP_GRPC_SERVERS | | A list of llama.cpp workers to distribute the workload. For example, LLAMACPP_GRPC_SERVERS="address1:port,address2:port" |
Here is how to configure these variables:
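Two ways to pass them to the container (the image name `localai` is a placeholder for whichever LocalAI image you pull):

```shell
# Illustrative: set a single variable at container start
docker run --env REBUILD=true localai

# Illustrative: load all variables from an env file
docker run --env-file .env localai
```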
CLI Parameters
For a complete reference of all CLI parameters, environment variables, and command-line options, see the CLI Reference page.
You can control LocalAI with command line arguments to specify a binding address, number of threads, model paths, and many other options. Any command line parameter can be specified via an environment variable.
.env files
Any settings being provided by an Environment Variable can also be provided from within .env files. There are several locations that will be checked for relevant .env files. In order of precedence they are:
- .env within the current directory
- localai.env within the current directory
- localai.env within the home directory
- .config/localai.env within the home directory
- /etc/localai.env
Environment variables within files earlier in the list will take precedence over environment variables defined in files later in the list.
An example .env file is:
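A sketch of such a file — the variable names below are illustrative assumptions; consult the CLI Reference page for the authoritative list:

```
# Illustrative .env file
LOCALAI_THREADS=4
LOCALAI_MODELS_PATH=/models
LOCALAI_DEBUG=true
```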
Request headers
You can use the Extra-Usage request header (Extra-Usage: true) to receive inference timings in milliseconds, extending the default OpenAI response model in the usage field:
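For example, on a chat completion request (server address assumes the default binding, and the model name is a placeholder):

```shell
# Illustrative: request inference timings via the Extra-Usage header
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Extra-Usage: true" \
  -d '{"model": "<MODEL_NAME>", "messages": [{"role": "user", "content": "Hello"}]}'
```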
Extra backends
LocalAI can be extended with extra backends. The backends are implemented as gRPC services and can be written in any language. See the backend section for more details on how to install and build new backends for LocalAI.
In runtime
When using the -core container image, it is possible to prepare the python backends you are interested in by using the EXTRA_BACKENDS variable, for instance:
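A sketch (the image tag is illustrative; substitute the -core image you actually run):

```shell
# Illustrative: prepare the diffusers backend when the container starts
docker run --env EXTRA_BACKENDS="backend/python/diffusers" quay.io/go-skynet/local-ai:master
```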
Concurrent requests
LocalAI supports parallel requests for the backends that support it. For instance, vLLM and llama.cpp support parallel requests, and thus LocalAI can serve multiple requests in parallel.
In order to enable parallel requests, pass --parallel-requests or set the PARALLEL_REQUEST environment variable to true.
The environment variables that tweak parallelism are the following:
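The variable names come from the surrounding text; the values are illustrative:

```shell
# Illustrative parallelism settings
PYTHON_GRPC_MAX_WORKERS=1   # parallel workers for python-based backends (e.g. vLLM)
LLAMACPP_PARALLEL=1         # parallel processes for llama.cpp
PARALLEL_REQUEST=true       # enable parallel request handling
```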
Note that for llama.cpp you need to set LLAMACPP_PARALLEL to the number of parallel processes your GPU/CPU can handle. For python-based backends (like vLLM), set PYTHON_GRPC_MAX_WORKERS to the number of parallel requests.
VRAM and Memory Management
For detailed information on managing VRAM when running multiple models, see the dedicated VRAM and Memory Management page.
Disable CPU flagset auto detection in llama.cpp
LocalAI will automatically discover the CPU flagset available in your host and will use the most optimized version of the backends.
If you want to disable this behavior, you can set DISABLE_AUTODETECT to true in the environment variables.