💥 OpenAI Proxy Server
LiteLLM Server manages:
- Calling 100+ LLMs (Huggingface/Bedrock/TogetherAI/etc.) in the OpenAI ChatCompletions & Completions format
- Set custom prompt templates + model-specific configs (temperature, max_tokens, etc.)
Quick Start
View all the supported args for the Proxy CLI here.
$ litellm --model huggingface/bigcode/starcoder
#INFO: Proxy running on http://0.0.0.0:8000
Test
In a new shell, run the following. This will make an openai.ChatCompletion request:
litellm --test
This will now automatically route any requests for gpt-3.5-turbo to bigcode/starcoder, hosted on Huggingface Inference Endpoints.
Replace openai base
import openai
openai.api_base = "http://0.0.0.0:8000"
print(openai.ChatCompletion.create(model="test", messages=[{"role":"user", "content":"Hey!"}]))
Supported LLMs
- Bedrock
- Huggingface (TGI)
- Anthropic
- VLLM
- OpenAI Compatible Server
- TogetherAI
- Replicate
- Petals
- Palm
- Azure OpenAI
- AI21
- Cohere
# Bedrock
$ export AWS_ACCESS_KEY_ID=""
$ export AWS_REGION_NAME="" # e.g. us-west-2
$ export AWS_SECRET_ACCESS_KEY=""
$ litellm --model bedrock/anthropic.claude-v2

# Huggingface (TGI)
$ export HUGGINGFACE_API_KEY=my-api-key # [OPTIONAL]
$ litellm --model huggingface/<your model name> --api_base https://k58ory32yinf1ly0.us-east-1.aws.endpoints.huggingface.cloud

# Anthropic
$ export ANTHROPIC_API_KEY=my-api-key
$ litellm --model claude-instant-1

# VLLM
$ litellm --model vllm/facebook/opt-125m

# OpenAI Compatible Server
$ litellm --model openai/<model_name> --api_base <your-api-base>

# TogetherAI
$ export TOGETHERAI_API_KEY=my-api-key
$ litellm --model together_ai/lmsys/vicuna-13b-v1.5-16k

# Replicate
$ export REPLICATE_API_KEY=my-api-key
$ litellm \
  --model replicate/meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3

# Petals
$ litellm --model petals/meta-llama/Llama-2-70b-chat-hf

# Palm
$ export PALM_API_KEY=my-palm-key
$ litellm --model palm/chat-bison

# Azure OpenAI
$ export AZURE_API_KEY=my-api-key
$ export AZURE_API_BASE=my-api-base
$ litellm --model azure/my-deployment-name

# AI21
$ export AI21_API_KEY=my-api-key
$ litellm --model j2-light

# Cohere
$ export COHERE_API_KEY=my-api-key
$ litellm --model command-nightly
Server Endpoints
- POST /chat/completions - chat completions endpoint to call 100+ LLMs
- POST /completions - completions endpoint
- POST /embeddings - embedding endpoint for Azure, OpenAI, Huggingface endpoints
- GET /models - available models on server
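For reference, these endpoints can also be exercised with the pre-1.0 openai Python client used elsewhere on this page. A minimal sketch, assuming the proxy from the Quick Start is running locally on port 8000 (the API key is only a placeholder, as in the other examples here):

```python
import openai

# Point the client at the local LiteLLM proxy
openai.api_base = "http://0.0.0.0:8000"
openai.api_key = "anything"  # placeholder, as in the other examples on this page

# GET /models - list the models the server exposes
print(openai.Model.list())

# POST /chat/completions - requests for gpt-3.5-turbo are routed to the model the proxy is serving
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hey!"}],
)
print(chat.choices[0].message.content)
```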
Using with OpenAI compatible projects
LiteLLM allows you to set openai.api_base to the proxy server and use all LiteLLM supported LLMs in any OpenAI-supported project:
- LM-Harness Evals
- FLASK Evals
- ContinueDev
- Aider
- AutoGen
- guidance
Step 1: Start the local proxy
$ litellm --model huggingface/bigcode/starcoder
Using a custom api base
$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]
$ litellm --model huggingface/tinyllama --api_base https://k58ory32yinf1ly0.us-east-1.aws.endpoints.huggingface.cloud
OpenAI Compatible Endpoint at http://0.0.0.0:8000
Step 2: Set OpenAI API Base & Key
$ export OPENAI_API_BASE=http://0.0.0.0:8000
LM Harness requires you to set an OpenAI API key (OPENAI_API_SECRET_KEY) for running benchmarks:
export OPENAI_API_SECRET_KEY=anything
Step 3: Run LM-Eval-Harness
python3 -m lm_eval \
--model openai-completions \
--model_args engine=davinci \
--task crows_pairs_english_age
Step 1: Start the local proxy
$ litellm --model huggingface/bigcode/starcoder
Step 2: Set OpenAI API Base & Key
$ export OPENAI_API_BASE=http://0.0.0.0:8000
Step 3: Run with FLASK
git clone https://github.com/kaistAI/FLASK
cd FLASK/gpt_review
Run the eval
python gpt4_eval.py -q '../evaluation_set/flask_evaluation.jsonl'
Continue-Dev brings ChatGPT to VSCode. See how to install it here.
In config.py, set this as your default model.
default=OpenAI(
    api_key="IGNORED",
    model="fake-model-name",
    context_length=2048,  # customize if needed for your model
    api_base="http://localhost:8000"  # your proxy server url
),
Credits @vividfog for this tutorial.
$ pip install aider
$ aider --openai-api-base http://0.0.0.0:8000 --openai-api-key fake-key
pip install pyautogen
from autogen import AssistantAgent, UserProxyAgent, oai
config_list = [
    {
        "model": "my-fake-model",
        "api_base": "http://localhost:8000",  # litellm compatible endpoint
        "api_type": "open_ai",
        "api_key": "NULL",  # just a placeholder
    }
]
response = oai.Completion.create(config_list=config_list, prompt="Hi")
print(response) # works fine
llm_config={
"config_list": config_list,
}
assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent("user_proxy")
user_proxy.initiate_chat(assistant, message="Plot a chart of META and TESLA stock price change YTD.", config_list=config_list)
Credits @victordibia for this tutorial.
NOTE: Guidance sends additional params like stop_sequences which can cause some models to fail if they don't support it.
Fix: Start your proxy using the --drop_params flag
litellm --model ollama/codellama --temperature 0.3 --max_tokens 2048 --drop_params
import guidance
# set api_base to your proxy
# set api_key to anything
gpt4 = guidance.llms.OpenAI("gpt-4", api_base="http://0.0.0.0:8000", api_key="anything")
experts = guidance('''
{{#system~}}
You are a helpful and terse assistant.
{{~/system}}
{{#user~}}
I want a response to the following question:
{{query}}
Name 3 world-class experts (past or present) who would be great at answering this?
Don't answer the question yet.
{{~/user}}
{{#assistant~}}
{{gen 'expert_names' temperature=0 max_tokens=300}}
{{~/assistant}}
''', llm=gpt4)
result = experts(query='How can I be more productive?')
print(result)
Advanced
Set Custom Prompt Templates
LiteLLM by default checks if a model has a prompt template and applies it (e.g. if a Huggingface model has a saved chat template in its tokenizer_config.json). However, you can also set a custom prompt template on your proxy in the config.yaml:
Step 1: Save your prompt template in a config.yaml
# Model-specific parameters
model_list:
  - model_name: mistral-7b # model alias
    litellm_params: # actual params for litellm.completion()
      model: "huggingface/mistralai/Mistral-7B-Instruct-v0.1"
      api_base: "<your-api-base>"
      api_key: "<your-api-key>" # [OPTIONAL] for hf inference endpoints
      initial_prompt_value: "\n"
      roles: {"system": {"pre_message": "<|im_start|>system\n", "post_message": "<|im_end|>"}, "assistant": {"pre_message": "<|im_start|>assistant\n", "post_message": "<|im_end|>"}, "user": {"pre_message": "<|im_start|>user\n", "post_message": "<|im_end|>"}}
      final_prompt_value: "\n"
      bos_token: "<s>"
      eos_token: "</s>"
      max_tokens: 4096
Step 2: Start server with config
$ litellm --config /path/to/config.yaml
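To make the template above concrete, here is a rough sketch (illustrative only, not LiteLLM's internal code) of how initial_prompt_value, roles, and final_prompt_value combine a chat message list into a single prompt string:

```python
# Illustrative sketch of how the config fields above are applied - not LiteLLM's internal implementation
initial_prompt_value = "\n"
final_prompt_value = "\n"
roles = {
    "system": {"pre_message": "<|im_start|>system\n", "post_message": "<|im_end|>"},
    "assistant": {"pre_message": "<|im_start|>assistant\n", "post_message": "<|im_end|>"},
    "user": {"pre_message": "<|im_start|>user\n", "post_message": "<|im_end|>"},
}

def render_prompt(messages):
    # Wrap each message with its role's pre/post strings, bracketed by the initial/final values
    prompt = initial_prompt_value
    for m in messages:
        role = roles[m["role"]]
        prompt += role["pre_message"] + m["content"] + role["post_message"]
    return prompt + final_prompt_value

print(render_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hey!"},
]))
```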
Using Multiple Models
If you have 1 model running on a local GPU and another that's hosted (e.g. on Runpod), you can call both via the same litellm server by listing them in your config.yaml.
model_list:
  - model_name: zephyr-alpha
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: huggingface/HuggingFaceH4/zephyr-7b-alpha
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: https://<my-hosted-endpoint>
$ litellm --config /path/to/config.yaml
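Once the server is up, you can sanity-check that both aliases are exposed and call either one by name. A sketch, assuming the proxy is on port 8000 and the pre-1.0 openai client:

```python
import openai

openai.api_base = "http://0.0.0.0:8000"
openai.api_key = "anything"  # placeholder

# GET /models - should list zephyr-alpha and zephyr-beta from the config
print(openai.Model.list())

# Call one of the configured models by its model_name
response = openai.ChatCompletion.create(
    model="zephyr-beta",
    messages=[{"role": "user", "content": "Hello world"}],
)
print(response.choices[0].message.content)
```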
Call specific model
If your repo lets you set the model name, you can call a specific model by passing in that model's name:
import openai
openai.api_base = "http://0.0.0.0:8000"
completion = openai.ChatCompletion.create(model="zephyr-alpha", messages=[{"role": "user", "content": "Hello world"}])
print(completion.choices[0].message.content)
If your repo only lets you specify the API base, you can add the model name to the API base you pass in:
import openai
openai.api_base = "http://0.0.0.0:8000/openai/deployments/zephyr-alpha/chat/completions" # zephyr-alpha will be used
completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hello world"}])
print(completion.choices[0].message.content)
Save Model-specific params (API Base, API Keys, Temperature, etc.)
Use the router_config_template.yaml to save model-specific information like api_base, api_key, temperature, max_tokens, etc.
Step 1: Create a config.yaml file
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: azure/chatgpt-v-2 # azure/<your-deployment-name>
      api_key: your_azure_api_key
      api_version: your_azure_api_version
      api_base: your_azure_api_base
  - model_name: mistral-7b
    litellm_params:
      model: ollama/mistral
      api_base: your_ollama_api_base
Step 2: Start server with config
$ litellm --config /path/to/config.yaml
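Client-side, callers only pass the model_name from the config; the proxy applies the matching api_key, api_base, api_version, etc. server-side. A sketch, assuming the proxy is on port 8000 and the pre-1.0 openai client:

```python
import openai

openai.api_base = "http://0.0.0.0:8000"
openai.api_key = "anything"  # placeholder - provider keys live in config.yaml on the server

# Both entries from config.yaml are addressed purely by model_name
for model in ["gpt-3.5-turbo", "mistral-7b"]:
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": "Hello world"}],
    )
    print(model, "->", response.choices[0].message.content)
```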
Model Alias
Set a model alias for your deployments.
In the config.yaml, the model_name parameter is the user-facing name to use for your deployment.
E.g.: If we want to save a Huggingface TGI Mistral-7b deployment as 'mistral-7b' for our users, we might save it as:
model_list:
  - model_name: mistral-7b # ALIAS
    litellm_params:
      model: huggingface/mistralai/Mistral-7B-Instruct-v0.1 # ACTUAL NAME
      api_key: your_huggingface_api_key # [OPTIONAL] if deployed on huggingface inference endpoints
      api_base: your_api_base # url where model is deployed
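A sketch of the resulting user-facing call, assuming the proxy is on port 8000: the client uses the alias, and the proxy maps it to the actual Huggingface model:

```python
import openai

openai.api_base = "http://0.0.0.0:8000"
openai.api_key = "anything"  # placeholder

# Users reference the alias; the proxy resolves it to huggingface/mistralai/Mistral-7B-Instruct-v0.1
response = openai.ChatCompletion.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Hey!"}],
)
print(response.choices[0].message.content)
```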
Proxy CLI Arguments

--host
- Default: '0.0.0.0'
- The host for the server to listen on.
- Usage: litellm --host 127.0.0.1

--port
- Default: 8000
- The port to bind the server to.
- Usage: litellm --port 8080

--num_workers
- Default: 1
- The number of uvicorn workers to spin up.
- Usage: litellm --num_workers 4

--api_base
- Default: None
- The API base for the model litellm should call.
- Usage: litellm --model huggingface/tinyllama --api_base https://k58ory32yinf1ly0.us-east-1.aws.endpoints.huggingface.cloud

--api_version
- Default: None
- For Azure services, specify the API version.
- Usage: litellm --model azure/gpt-deployment --api_version 2023-08-01 --api_base https://<your api base>

--model or -m
- Default: None
- The model name to pass to Litellm.
- Usage: litellm --model gpt-3.5-turbo

--test
- Type: bool (Flag)
- Make a test chat completions request to the proxy.
- Usage: litellm --test

--alias
- Default: None
- An alias for the model, for user-friendly reference.
- Usage: litellm --alias my-gpt-model

--debug
- Default: False
- Type: bool (Flag)
- Enable debugging mode for the input.
- Usage: litellm --debug

--temperature
- Default: None
- Type: float
- Set the temperature for the model.
- Usage: litellm --temperature 0.7

--max_tokens
- Default: None
- Type: int
- Set the maximum number of tokens for the model output.
- Usage: litellm --max_tokens 50

--request_timeout
- Default: 600
- Type: int
- Set the timeout in seconds for completion calls.
- Usage: litellm --request_timeout 300

--drop_params
- Type: bool (Flag)
- Drop any unmapped params.
- Usage: litellm --drop_params

--add_function_to_prompt
- Type: bool (Flag)
- If a function is passed but unsupported, pass it as part of the prompt.
- Usage: litellm --add_function_to_prompt

--config
- Configure Litellm by providing a configuration file path.
- Usage: litellm --config path/to/config.yaml

--telemetry
- Default: True
- Type: bool
- Help track usage of this feature.
- Usage: litellm --telemetry False