API Documentation
Integrate with the Qube Compute GPU Cloud and Groq LPX Inference APIs. Provision bare-metal GPU instances or run sub-second LLM inference — all from a single API key.
API Key Authentication
All requests must include your API key in the Authorization header. You can generate keys from the Control Panel under Settings → API Keys.
Authorization: Bearer qb_live_sk_7f3a2b1c9d4e...
# Base URL
https://api.qubecompute.com
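A minimal Python sketch of building this header once and reusing it for every request. The helper name `auth_headers` and the environment variable `QUBE_API_KEY` are illustrative assumptions, not part of the documented API.

```python
import os

# Base URL as stated in this section.
BASE_URL = "https://api.qubecompute.com"


def auth_headers(api_key=None):
    """Build the Authorization header described above.

    QUBE_API_KEY is a hypothetical environment variable name chosen for
    this sketch; the documentation does not prescribe one.
    """
    key = api_key or os.environ.get("QUBE_API_KEY", "")
    if not key:
        raise ValueError("no API key provided")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control; passing it explicitly is convenient in tests.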
GPU Instances API
Create, list, inspect, and terminate bare-metal GPU instances.
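The examples in this section show a new instance starting in status "provisioning" with a null ip_address and later reporting "running" with an address assigned. A minimal sketch of waiting for that transition follows. The function name, the polling interval, and the timeout are assumptions made for illustration, and `fetch_instance` is a caller-supplied function (for example, one that calls the "Get instance details" endpoint) rather than anything provided by the API.

```python
import time


def wait_until_running(fetch_instance, instance_id, timeout_s=600.0,
                       poll_interval_s=5.0, sleep=time.sleep,
                       clock=time.monotonic):
    """Poll until the instance reports status "running".

    fetch_instance is a hypothetical caller-supplied function that takes
    an instance ID and returns the instance as a dict.
    """
    deadline = clock() + timeout_s
    while True:
        instance = fetch_instance(instance_id)
        status = instance["status"]
        if status == "running":
            return instance
        if status != "provisioning":
            # Only "provisioning", "running" and "terminating" appear in
            # the examples; anything other than the first two ends the wait.
            raise RuntimeError(f"unexpected status: {status}")
        if clock() >= deadline:
            raise TimeoutError(f"{instance_id} is still provisioning")
        sleep(poll_interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays.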
Create a new GPU instance
{
  "gpu_type": "H200_SXM",
  "gpu_count": 8,
  "region": "kz-almaty-1",
  "image": "ubuntu-22.04-cuda-12.4",
  "ssh_key_id": "key_8f3a2b1c"
}

{
  "id": "inst_7xKp2mNqR4",
  "status": "provisioning",
  "gpu_type": "H200_SXM",
  "gpu_count": 8,
  "region": "kz-almaty-1",
  "ip_address": null,
  "created_at": "2026-09-15T08:30:00Z"
}
List all instances
{
  "data": [
    {
      "id": "inst_7xKp2mNqR4",
      "status": "running",
      "gpu_type": "H200_SXM",
      "gpu_count": 8,
      "ip_address": "185.120.44.12",
      "created_at": "2026-09-15T08:30:00Z"
    }
  ],
  "has_more": false
}
Get instance details
{
  "id": "inst_7xKp2mNqR4",
  "status": "running",
  "gpu_type": "H200_SXM",
  "gpu_count": 8,
  "region": "kz-almaty-1",
  "ip_address": "185.120.44.12",
  "image": "ubuntu-22.04-cuda-12.4",
  "ssh_key_id": "key_8f3a2b1c",
  "created_at": "2026-09-15T08:30:00Z",
  "hourly_rate": "31.68"
}
Terminate an instance
{
  "id": "inst_7xKp2mNqR4",
  "status": "terminating",
  "terminated_at": "2026-09-15T12:45:00Z"
}
Groq LPX Inference API
Ultra-low-latency LLM inference powered by Groq LPU hardware. OpenAI-compatible endpoint — swap your base URL and go.
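A minimal sketch of pulling the assistant text and token count out of a completion response, assuming the response has the shape of the example response in this section (`choices[0].message.content`, `finish_reason`, `usage.total_tokens`). The helper name `extract_reply` is illustrative and not part of the API.

```python
def extract_reply(response_body):
    """Summarize a completion response dict.

    Field names follow the example response in this section. Missing
    fields fall back to empty or zero values instead of raising.
    """
    choices = response_body.get("choices") or []
    first = choices[0] if choices else {}
    usage = response_body.get("usage") or {}
    return {
        "text": first.get("message", {}).get("content", ""),
        "finish_reason": first.get("finish_reason"),
        "total_tokens": usage.get("total_tokens", 0),
    }
```

With the Python request example in this section, this could be called as `extract_reply(response.json())`.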
POST /v1/completions
curl -X POST https://api.qubecompute.com/v1/completions \
  -H "Authorization: Bearer qb_live_sk_7f3a..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in one paragraph."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

import requests

response = requests.post(
    "https://api.qubecompute.com/v1/completions",
    headers={
        "Authorization": "Bearer qb_live_sk_7f3a...",
        "Content-Type": "application/json",
    },
    json={
        "model": "llama-3.1-70b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in one paragraph."},
        ],
        "max_tokens": 256,
        "temperature": 0.7,
    },
)
print(response.json())

const response = await fetch("https://api.qubecompute.com/v1/completions", {
  method: "POST",
  headers: {
    "Authorization": "Bearer qb_live_sk_7f3a...",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "llama-3.1-70b",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Explain quantum computing in one paragraph." },
    ],
    max_tokens: 256,
    temperature: 0.7,
  }),
});
const data = await response.json();
console.log(data);

{
  "id": "cmpl_9qWvX3mK",
  "object": "chat.completion",
  "model": "llama-3.1-70b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing leverages..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 87,
    "total_tokens": 115
  }
}
POST /v1/embeddings
curl -X POST https://api.qubecompute.com/v1/embeddings \
  -H "Authorization: Bearer qb_live_sk_7f3a..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b",
    "input": "Qube Compute provides GPU cloud infrastructure."
  }'

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023, -0.0091, 0.0154, ... ]
    }
  ],
  "model": "llama-3.1-70b",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}
Rate Limits
Instance management endpoints (create, list, get, terminate). Burst up to 1,500 with backoff.
Completions and embeddings endpoints. Higher limits available on Enterprise plans.
Rate limit headers: X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset.
Official SDKs
SDKs coming soon. We are building first-class libraries for the most popular languages.
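Until official SDKs are available, client code has to respect rate limits on its own. A minimal sketch of deciding how long to pause, based on the X-RateLimit-Remaining and X-RateLimit-Reset headers named under Rate Limits. It assumes that X-RateLimit-Reset is a Unix timestamp in seconds; the documentation names the header but does not define its format, so check a real response before relying on this. The function name `seconds_to_wait` is illustrative.

```python
import time


def seconds_to_wait(headers, now=None):
    """Return how long to pause before sending the next request.

    ASSUMPTION: X-RateLimit-Reset is a Unix timestamp in seconds.
    headers is a plain dict here, so header names are matched
    case-sensitively; HTTP libraries usually offer a case-insensitive
    mapping instead.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0  # quota left in the current window, no need to wait
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```

A caller would pass the headers of the previous response and sleep for the returned number of seconds before retrying.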
Ready to Build?
Get your API key and start provisioning GPU instances or running inference in minutes.