Docker

warning

🚧 Cortex.cpp is currently in development. The documentation describes the intended functionality, which may not yet be fully implemented.

Setting Up Cortex with Docker​

This guide walks you through setting up and running Cortex with Docker.

Prerequisites​

  • Docker or Docker Desktop
  • nvidia-container-toolkit (for GPU support)

Setup Instructions​

Build Cortex Docker Image from source or Pull from Docker Hub​

Pull Cortex Docker Image from Docker Hub​

# Pull the latest image
docker pull menloltd/cortex:latest
# Pull a specific version
docker pull menloltd/cortex:nightly-1.0.1-224
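
After pulling, a quick check with the standard Docker CLI confirms the image is available locally:

# List local Cortex images and their tags
docker images menloltd/cortex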

Build and Run Cortex Docker Container from Dockerfile​
  1. Clone the Cortex Repository


    git clone https://github.com/janhq/cortex.cpp.git
    cd cortex.cpp
    git submodule update --init

  2. Build the Docker Image


docker build -t cortex --build-arg CORTEX_CPP_VERSION=$(git rev-parse HEAD) -f docker/Dockerfile .
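
If you also want the local tag to record which commit you built, a small optional variation of the same command looks like this (the cortex:<short-hash> tag name is just an example):

# Optional: also tag the image with the short commit hash for traceability
docker build -t cortex:$(git rev-parse --short HEAD) --build-arg CORTEX_CPP_VERSION=$(git rev-parse HEAD) -f docker/Dockerfile .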

Run Cortex Docker Container​

  1. Run the Docker Container
  • Create a Docker volume to store models and data:


    docker volume create cortex_data


    # requires nvidia-container-toolkit
    docker run --gpus all -it -d --name cortex -v cortex_data:/root/cortexcpp -p 39281:39281 cortex
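
  • To run without a GPU, drop the --gpus flag. This is a CPU-only sketch, assuming the image falls back to CPU inference when no GPU is passed through:


    # CPU-only run: same volume and port mapping, no GPU passthrough (assumption: CPU fallback)
    docker run -it -d --name cortex -v cortex_data:/root/cortexcpp -p 39281:39281 cortex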

  2. Check Logs (Optional)


    docker logs cortex
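
  • To keep the log stream open while you work through the next steps, follow it instead:


    # Stream logs continuously (Ctrl+C to stop following)
    docker logs --follow cortex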

  3. Access the Cortex API documentation in your browser at http://localhost:39281 (the port published by the run command above)

  4. Access the Container and Try the Cortex CLI


    docker exec -it cortex bash
    cortex --help

Usage​

With the container running, you can use the following commands to interact with Cortex. Make sure curl is installed on your machine.

1. List Available Engines​


curl --request GET --url http://localhost:39281/v1/engines --header "Content-Type: application/json"

  • Example Response

    {
      "data": [
        {
          "description": "This extension enables chat completion API calls using the Onnx engine",
          "format": "ONNX",
          "name": "onnxruntime",
          "status": "Incompatible"
        },
        {
          "description": "This extension enables chat completion API calls using the LlamaCPP engine",
          "format": "GGUF",
          "name": "llama-cpp",
          "status": "Ready",
          "variant": "linux-amd64-avx2",
          "version": "0.1.37"
        }
      ],
      "object": "list",
      "result": "OK"
    }
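
  • If you have jq installed, you can filter the response down to just the engine names and statuses (the field names come from the example response above):


    # Show each engine's name and status on one line
    curl -s http://localhost:39281/v1/engines | jq -r '.data[] | "\(.name): \(.status)"'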

2. Pull Models from Hugging Face​

  • Open a terminal and run websocat ws://localhost:39281/events to capture download events. Install websocat first if you don't have it (see the websocat project's installation instructions).

  • In another terminal, pull models using the commands below.


    curl --request POST --url http://localhost:39281/v1/models/pull --header 'Content-Type: application/json' --data '{"model": "tinyllama:gguf"}'

  • After the model has been pulled successfully, run the command below to list the available models.


    curl --request GET --url http://localhost:39281/v1/models
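
  • To filter that list from the shell, a jq sketch such as the one below can help. The .data[].model path is an assumption modelled on the engines response shape; check the raw JSON and adjust the field name if your version differs:


    # Print model identifiers only (field name is an assumption; inspect the raw JSON if it differs)
    curl -s http://localhost:39281/v1/models | jq '.data[].model'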

3. Start a Model and Send an Inference Request​

  • Start the model:


    curl --request POST --url http://localhost:39281/v1/models/start --header 'Content-Type: application/json' --data '{"model": "tinyllama:gguf"}'

  • Send an inference request:


    curl --request POST --url http://localhost:39281/v1/chat/completions --header 'Content-Type: application/json' --data '{
      "frequency_penalty": 0.2,
      "max_tokens": 4096,
      "messages": [{"content": "Tell me a joke", "role": "user"}],
      "model": "tinyllama:gguf",
      "presence_penalty": 0.6,
      "stop": ["End"],
      "stream": true,
      "temperature": 0.8,
      "top_p": 0.95
    }'
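
  • The endpoint follows the OpenAI chat-completions format, so a non-streaming variant is easier to inspect from the shell. The sketch below assumes the standard choices[0].message.content response shape:


    # Non-streaming request: fetch the full response, then extract the assistant's reply
    curl -s --request POST --url http://localhost:39281/v1/chat/completions \
      --header 'Content-Type: application/json' \
      --data '{
        "model": "tinyllama:gguf",
        "messages": [{"content": "Tell me a joke", "role": "user"}],
        "stream": false,
        "max_tokens": 256
      }' | jq -r '.choices[0].message.content'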

4. Stop a Model​

  • To stop a running model, use:

    curl --request POST --url http://localhost:39281/v1/models/stop --header 'Content-Type: application/json' --data '{"model": "tinyllama:gguf"}'
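
  • When you're done experimenting, stop and remove the container. The cortex_data volume keeps downloaded models unless you remove it as well:


    # Stop and remove the container; the volume (and any pulled models) survives
    docker stop cortex && docker rm cortex
    # Only if you also want to delete downloaded models and data:
    docker volume rm cortex_data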