Setting up vLLM with Hugging Face for generative AI projects
August 3, 2024
Installing
Just run:
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .

Once the image is built, log in to Hugging Face:
huggingface-cli login

You will pull the models from Hugging Face, so you need to be logged in.
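If you'd rather not log in interactively (in a script or CI, say), the huggingface_hub library exposes the same login programmatically. A minimal sketch, assuming your access token is in an HF_TOKEN environment variable:

import os
from huggingface_hub import login

# Authenticate with the token from the environment
# instead of the interactive prompt.
login(token=os.environ["HF_TOKEN"])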
Start vLLM server
Run:
docker run -it \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --rm \
  --network=host \
  vllm-cpu-env \
  --model facebook/opt-125m \
  --api-key 1234banana

This will download the facebook/opt-125m model from Hugging Face and start an OpenAI-compatible server on port 8000.
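Before pointing anything at it, it's worth confirming the server is actually up. A quick sketch using requests against the standard /v1/models endpoint; the bearer token has to match the --api-key the server was started with:

import requests

# List the models the vLLM server is serving.
resp = requests.get(
    "http://localhost:8000/v1/models",
    headers={"Authorization": "Bearer 1234banana"},
)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])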
Testing
You can curl the server to test it:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1234banana" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Dario Meira seria muito",
    "max_tokens": 50,
    "temperature": 0
  }'

Note that the API key goes in the Authorization header, not in the JSON body.
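If you prefer a client library over curl, the server speaks the OpenAI API, so the official openai Python package works against it directly. A minimal sketch, using the same model, prompt, and API key as the curl call above:

from openai import OpenAI

# Point the official OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="1234banana")

response = client.completions.create(
    model="facebook/opt-125m",
    prompt="Dario Meira seria muito",
    max_tokens=50,
    temperature=0,
)
print(response.choices[0].text)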
base_url = "http://localhost:8000/v1"model_name = "facebook/opt-125m"api_key = "s3cr3t"
# vLLM uses the OpenAI API, so we need to set the provider to "openai"PROVIDER = "openai"
completion( model=f"{PROVIDER}/{model_name}", api_key=api_key, base_url=base_url, messages=messages,)Voila! You have vLLM running on your machine.
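From here, swapping facebook/opt-125m for any other model or provider LiteLLM supports is just a change to the model string; the rest of the call stays the same, which is the point of routing through LiteLLM in the first place.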