This is the shortest HTTP-only path through the shipped colsearch
reference API.
Read this first if you want:
- a single-machine retrieval service
- base64-ready request examples
- dense, late-interaction, multimodal, and shard collection basics
Then continue with:
- docs/guides/max-performance-reference-api.md for worker and CPU/GPU tuning
- docs/full_feature_cookbook.md for the broader surface
pip install "colsearch[full]"
pip install "colsearch[full,gpu]"
pip install "colsearch[server,shard]"
pip install "colsearch[server,shard,solver]" # adds Tabu Search solver
pip install "colsearch[server,shard,native]" # adds both public native wheels
pip install "colsearch[server,shard,latence-graph]" # adds the optional Latence graph lane
The server extra includes the supported document-rendering stack for
POST /reference/preprocess/documents.
The supported native extras are:
- shard-native: latence_shard_engine for the fused Rust shard CPU fast-path
- solver: latence_solver for dense_hybrid_mode="tabu" and /reference/optimize
- native: both public native wheels together
The optional latence-graph extra enables the premium Latence graph sidecar.
Without it, graph-aware search requests fall back to the OSS retrieval path.
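You can probe at runtime which optional native wheels are actually importable. A minimal sketch, assuming the importable module names match the wheel names listed above (latence_shard_engine, latence_solver):

```python
import importlib.util

# Wheel names taken from the extras above; identical module names are an assumption.
OPTIONAL_WHEELS = ["latence_shard_engine", "latence_solver"]

def available_extras(names=OPTIONAL_WHEELS):
    """Return a mapping of optional wheel name -> importable or not."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

print(available_extras())
```

This only confirms importability; it does not validate wheel versions or ABI compatibility.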
Local development:
colsearch-server
Single-host production-style start:
HOST=0.0.0.0 WORKERS=4 colsearch-server
OpenAPI:
http://127.0.0.1:8080/docs
http://127.0.0.1:8080/redoc
Useful probes:
curl http://127.0.0.1:8080/health
curl http://127.0.0.1:8080/ready
curl http://127.0.0.1:8080/metrics
Create a collection:
curl -X POST http://127.0.0.1:8080/collections/tutorial-dense \
-H "Content-Type: application/json" \
-d '{"dimension": 4, "kind": "dense"}'
Insert points with the preferred base64 transport:
import requests
from colsearch import encode_vector_payload
body = {
"points": [
{
"id": "invoice",
"vector": encode_vector_payload([1, 0, 0, 0], dtype="float16"),
"payload": {"text": "invoice total due", "doc_type": "invoice"},
},
{
"id": "report",
"vector": encode_vector_payload([0, 1, 0, 0], dtype="float16"),
"payload": {"text": "board report summary", "doc_type": "report"},
},
]
}
requests.post(
"http://127.0.0.1:8080/collections/tutorial-dense/points",
json=body,
timeout=30,
).raise_for_status()
Search with dense + BM25 fusion:
import requests
from colsearch import encode_vector_payload
response = requests.post(
"http://127.0.0.1:8080/collections/tutorial-dense/search",
json={
"vector": encode_vector_payload([1, 0, 0, 0], dtype="float16"),
"query_text": "invoice",
"filter": {"doc_type": "invoice"},
"dense_hybrid_mode": "rrf",
"top_k": 2,
},
timeout=30,
)
response.raise_for_status()
print(response.json()["results"][0])
Use solver refinement when latence_solver is installed:
{
"dense_hybrid_mode": "tabu"
}
Use the optional Latence graph lane when the sidecar is installed and the collection payloads carry graph-aware metadata:
{
"graph_mode": "auto",
"graph_local_budget": 4,
"graph_community_budget": 4,
"graph_evidence_budget": 8,
"graph_explain": true
}
The graph lane runs after the dense and BM25 first stage and is merged additively. Inspect the sidecar lifecycle with:
curl http://127.0.0.1:8080/collections/tutorial-dense/info
Look for graph_health, graph_dataset_id, graph_sync_status, and
graph_last_successful_sync_at in the response.
Public transparency note: the graph lane works on Latence graph data derived from the indexed corpus and linked back to collection targets. The API exposes sync, health, and provenance metadata without publishing proprietary extraction heuristics.
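The graph fields above can be pulled out of the info response programmatically. A minimal sketch using the field names from the text above; the sample payload values are illustrative, not real server output:

```python
GRAPH_FIELDS = (
    "graph_health",
    "graph_dataset_id",
    "graph_sync_status",
    "graph_last_successful_sync_at",
)

def graph_status(info: dict) -> dict:
    """Extract the graph sidecar fields from a /collections/{name}/info response."""
    return {field: info.get(field) for field in GRAPH_FIELDS}

# Illustrative payload; in practice use
# requests.get("http://127.0.0.1:8080/collections/tutorial-dense/info").json()
sample = {
    "name": "tutorial-dense",
    "graph_health": "healthy",
    "graph_sync_status": "synced",
}
print(graph_status(sample))
```

Missing fields come back as None, which is also what you should expect when the latence-graph extra is not installed.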
Create the collection:
curl -X POST http://127.0.0.1:8080/collections/tutorial-li \
-H "Content-Type: application/json" \
-d '{"dimension": 4, "kind": "late_interaction"}'
Insert and query multivectors:
import requests
from colsearch import encode_vector_payload
doc_vectors = [[1, 0, 0, 0], [1, 0, 0, 0]]
requests.post(
"http://127.0.0.1:8080/collections/tutorial-li/points",
json={
"points": [
{
"id": "doc-1",
"vectors": encode_vector_payload(doc_vectors, dtype="float16"),
"payload": {"text": "invoice total due", "label": "invoice"},
}
]
},
timeout=30,
).raise_for_status()
response = requests.post(
"http://127.0.0.1:8080/collections/tutorial-li/search",
json={
"vectors": encode_vector_payload(doc_vectors, dtype="float16"),
"filter": {"label": "invoice"},
"with_vector": True,
"top_k": 2,
},
timeout=30,
)
response.raise_for_status()
print(response.json()["results"][0])
Late-interaction collections can use the same graph_mode,
graph_local_budget, graph_community_budget, graph_evidence_budget, and
graph_explain knobs. The base late-interaction order is preserved and graph
rescues are appended additively.
The multimodal collection API stores precomputed embeddings, but the reference server also provides the preprocessing step.
Render source documents:
curl -X POST http://127.0.0.1:8080/reference/preprocess/documents \
-H "Content-Type: application/json" \
-d '{
"source_paths": ["/data/source/invoice.pdf"],
"output_dir": "/data/rendered-pages"
}'
Create the collection:
curl -X POST http://127.0.0.1:8080/collections/tutorial-mm \
-H "Content-Type: application/json" \
-d '{"dimension": 4, "kind": "multimodal"}'
Insert and search patch embeddings:
import requests
from colsearch import encode_vector_payload
page_vectors = [[1, 0, 0, 0], [1, 0, 0, 0]]
requests.post(
"http://127.0.0.1:8080/collections/tutorial-mm/points",
json={
"points": [
{
"id": "page-1",
"vectors": encode_vector_payload(page_vectors, dtype="float16"),
"payload": {"doc_id": "invoice.pdf", "page_number": 1, "kind": "invoice"},
}
]
},
timeout=30,
).raise_for_status()
response = requests.post(
"http://127.0.0.1:8080/collections/tutorial-mm/search",
json={
"vectors": encode_vector_payload(page_vectors, dtype="float16"),
"filter": {"kind": "invoice"},
"with_vector": True,
"top_k": 2,
},
timeout=30,
)
response.raise_for_status()
print(response.json()["results"][0])
Practical guidance:
- multimodal_optimize_mode="auto" is the safe default
- explicit solver orderings are for targeted experiments
- pure multimodal retrieval usually wants exact MaxSim first, not solver-first packing
- the optional graph lane is available here too and follows the same additive merge contract
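As one concrete shape, a multimodal search body that keeps the safe default explicit might look like this. This is a sketch: placing multimodal_optimize_mode in the search body is an assumption, and the raw vectors would go through encode_vector_payload exactly as in the examples above:

```python
page_query = [[1, 0, 0, 0], [1, 0, 0, 0]]

body = {
    # In a real request, wrap with encode_vector_payload(page_query, dtype="float16").
    "vectors": page_query,
    "filter": {"kind": "invoice"},
    "multimodal_optimize_mode": "auto",  # safe default; explicit orderings are for experiments
    "top_k": 2,
}
# requests.post("http://127.0.0.1:8080/collections/tutorial-mm/search",
#               json=body, timeout=30)
print(sorted(body))
```

Check the live OpenAPI docs at /docs for the authoritative field placement.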
Shard collections are the max-performance public retrieval path.
Create one:
curl -X POST http://127.0.0.1:8080/collections/tutorial-shard \
-H "Content-Type: application/json" \
-d '{
"dimension": 128,
"kind": "shard",
"n_shards": 256,
"compression": "rroq158",
"rroq158_k": 8192,
"rroq158_group_size": 128,
"rroq158_seed": 42,
"quantization_mode": "fp8",
"transfer_mode": "pinned",
"router_device": "cpu",
"use_colbandit": true
}'
For the no-degradation safe-fallback lane (Riemannian 4-bit asymmetric),
swap to compression="rroq4_riem" and the related rroq4_riem_* knobs:
curl -X POST http://127.0.0.1:8080/collections/tutorial-shard-safe \
-H "Content-Type: application/json" \
-d '{
"dimension": 128,
"kind": "shard",
"n_shards": 256,
"compression": "rroq4_riem",
"rroq4_riem_k": 8192,
"rroq4_riem_group_size": 32,
"rroq4_riem_seed": 42,
"quantization_mode": "rroq4_riem",
"router_device": "cpu",
"use_colbandit": true
}'
Shard search is vector-only over HTTP:
import numpy as np
import requests
from colsearch import encode_vector_payload
query = np.random.default_rng(7).normal(size=(16, 128)).astype("float32")
response = requests.post(
"http://127.0.0.1:8080/collections/tutorial-shard/search",
json={
"vectors": encode_vector_payload(query, dtype="float16"),
"top_k": 10,
"quantization_mode": "fp8",
"transfer_mode": "pinned",
"use_colbandit": True,
},
timeout=30,
)
response.raise_for_status()
print(response.json()["results"][0])
Enable the optional graph lane on shard collections with the same endpoint:
response = requests.post(
"http://127.0.0.1:8080/collections/tutorial-shard/search",
json={
"vectors": encode_vector_payload(query, dtype="float16"),
"top_k": 10,
"quantization_mode": "fp8",
"graph_mode": "auto",
"graph_local_budget": 4,
"graph_community_budget": 4,
"graph_evidence_budget": 8,
"graph_explain": True,
"query_payload": {
"ontology_terms": ["Service C", "Export Control"],
"workflow_type": "compliance",
},
},
timeout=30,
)
response.raise_for_status()
print(response.json()["metadata"]["graph"])
Important truth-in-advertising note:
- shard HTTP search does not take query_text
- dense BM25 hybrid stays on dense collections over HTTP
- shard + BM25 fusion is an in-process HybridSearchManager workflow
- shard collections can still use the optional graph lane after first-stage retrieval
- on shard HTTP search, use query_payload rather than query_text to steer graph policy
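Since shard HTTP search is vector-only, one way to see what the in-process fusion accomplishes is a plain reciprocal-rank-fusion merge over two ranked ID lists. This is a generic client-side sketch, not the HybridSearchManager API:

```python
def rrf_merge(ranked_lists, k=60, top_k=10):
    """Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

shard_hits = ["doc-3", "doc-1", "doc-7"]  # e.g. IDs from shard vector search
bm25_hits = ["doc-1", "doc-9", "doc-3"]   # e.g. IDs from an external BM25 index
print(rrf_merge([shard_hits, bm25_hits]))  # doc-1 and doc-3 rise to the top
```

For production use, prefer the in-process HybridSearchManager workflow named above; this snippet only illustrates the fusion math.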
Post-generation hallucination scoring is provided by the optional
Latence Trace sidecar from latence.ai, which
runs alongside the reference API. The sidecar exposes
POST /groundedness with the same chunk_ids / raw_context contract,
calibrated risk bands, and NLI / semantic-entropy / structured-source
peers. See the Groundedness sidecar guide
for the deployment story.
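A request body for the sidecar's POST /groundedness might be shaped like this. This is a sketch: chunk_ids and raw_context come from the contract named above, while the answer field name and the sidecar URL are assumptions; the Groundedness sidecar guide has the authoritative schema:

```python
body = {
    "answer": "The invoice total is due on the 1st.",  # assumed field name
    "chunk_ids": ["invoice", "report"],                # from the chunk_ids contract above
    "raw_context": "invoice total due\nboard report summary",
}
# Sidecar URL depends on your deployment, e.g.:
# requests.post("http://127.0.0.1:9090/groundedness", json=body, timeout=30)
print(sorted(body))
```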
Collections persist under the configured storage root. Useful endpoints:
curl http://127.0.0.1:8080/collections
curl http://127.0.0.1:8080/collections/tutorial-shard/info
curl http://127.0.0.1:8080/health
curl http://127.0.0.1:8080/ready
When the graph lane is enabled, collection info also exposes sidecar health and freshness metadata. Readiness will report degraded or failed graph sync states without taking down the base retrieval service.
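For startup orchestration, the /ready probe can be polled until it reports healthy. A transport-agnostic sketch: the probe function is injected so the loop is easy to test, and in practice you would pass something like lambda: requests.get("http://127.0.0.1:8080/ready").ok:

```python
import time

def wait_until_ready(probe, attempts=30, delay=1.0):
    """Poll a readiness probe until it returns True or attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False

# Demo with a fake probe that succeeds on the third call:
calls = iter([False, False, True])
print(wait_until_ready(lambda: next(calls), attempts=5, delay=0.0))
```

Remember the readiness note above: a degraded graph sync state is reported, not fatal, so decide whether your probe should treat it as ready.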
Shard-only admin endpoints:
- POST /collections/{name}/compact
- POST /collections/{name}/checkpoint
- GET /collections/{name}/wal/status
- GET /collections/{name}/shards
- POST /collections/{name}/scroll
- POST /collections/{name}/retrieve
Check availability:
curl http://127.0.0.1:8080/reference/optimize/health
/reference/optimize is the stateless solver endpoint for:
- dense
- dense + BM25
- late-interaction
- multimodal
- mixed candidate pools you want to pack or refine explicitly
Use it when you already have a candidate pool and want optimization, not when you simply need standard exact multimodal retrieval.
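As a shape sketch, a stateless optimize request might carry the candidate pool inline. The field names here (candidates, budget) are assumptions for illustration only; the live OpenAPI docs at /docs are authoritative:

```python
body = {
    "candidates": [  # assumed field name for the supplied pool
        {"id": "invoice", "score": 0.91, "text": "invoice total due"},
        {"id": "report", "score": 0.42, "text": "board report summary"},
    ],
    "budget": 2,  # assumed knob: how many items to pack into the context
}
# requests.post("http://127.0.0.1:8080/reference/optimize", json=body, timeout=30)
print(len(body["candidates"]))
```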
The reference HTTP server exposes one consistent collection contract plus a
small set of cross-collection helpers. Full request and response schemas are
in the live OpenAPI docs at http://127.0.0.1:8080/docs; this table is the
quick map.
| Endpoint | Purpose |
|---|---|
| POST /collections/{name} | Create collection (dense, late-interaction, multimodal, or shard) |
| GET /collections | List collections |
| GET /collections/{name}/info | Inspect collection tuning, health, and graph sync state |
| DELETE /collections/{name} | Drop a collection |
| Endpoint | Purpose |
|---|---|
| POST /collections/{name}/points | Add or upsert documents |
| DELETE /collections/{name}/points | Delete documents by ID |
| POST /collections/{name}/retrieve | Retrieve documents by ID |
| GET /collections/{name}/scroll | Scroll through all stored documents |
| Endpoint | Purpose |
|---|---|
| POST /collections/{name}/search | Single-query search (dense, late-interaction, multimodal, shard) |
| POST /collections/{name}/search/batch | Batched multi-query search |
| POST /rerank | Rerank a candidate pool |
| POST /encode | Encode text or images to vectors via the active provider |
Graph-aware search uses the same POST /search endpoint and adds the
optional graph_mode, graph_local_budget, graph_community_budget,
graph_evidence_budget, graph_explain, and query_payload fields.
| Endpoint | Purpose |
|---|---|
| POST /collections/{name}/checkpoint | Force WAL checkpoint |
| GET /collections/{name}/wal/status | WAL health and replay status |
| GET /health | Liveness |
| GET /ready | Readiness |
| Endpoint | Purpose |
|---|---|
| POST /reference/preprocess/documents | PDF / DOCX / XLSX / image preprocessing |
| POST /reference/optimize | Tabu Search context packing on a supplied candidate pool |
| GET /reference/optimize/health | Solver lane availability |
Outside the OSS HTTP contract:
- collection-specific ad hoc optimize endpoints
- built-in remote embedding hosting
- distributed control-plane features
- internal research backends that are not part of the shard-first public story
- docs/full_feature_cookbook.md
- docs/guides/max-performance-reference-api.md
- docs/guides/shard-engine.md
- docs/benchmarks.md