This is the shortest HTTP-only path through the shipped colsearch
reference API.
Read this first if you want:
- a single-machine retrieval service
- base64-ready request examples
- dense, late-interaction, multimodal, and shard collection basics
Then continue with:
- docs/guides/max-performance-reference-api.md for worker and CPU/GPU tuning
- docs/full_feature_cookbook.md for the broader surface
pip install "colsearch[full]"
pip install "colsearch[full,gpu]"
pip install "colsearch[server,shard]"
pip install "colsearch[server,shard,solver]" # adds Tabu Search solver
pip install "colsearch[server,shard,native]" # adds both public native wheels
pip install "colsearch[server,shard,latence-graph]" # adds the optional Latence graph lane
The server extra includes the supported document-rendering stack for
POST /reference/preprocess/documents.
The supported native extras are:
- shard-native: latence_shard_engine for the fused Rust shard CPU fast-path
- solver: latence_solver for dense_hybrid_mode="tabu" and /reference/optimize
- native: both public native wheels together
The optional latence-graph extra enables the premium Latence graph sidecar.
Without it, graph-aware search requests fall back to the OSS retrieval path.
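You can probe at runtime which optional native wheels are actually importable. A minimal sketch, assuming the importable module names match the wheel names listed above (latence_shard_engine, latence_solver):

```python
import importlib.util

# Wheel names taken from the extras above; identical module names are an assumption.
OPTIONAL_WHEELS = ["latence_shard_engine", "latence_solver"]

def available_extras(names=OPTIONAL_WHEELS):
    """Return a mapping of optional wheel name -> importable or not."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

print(available_extras())
```

This only confirms importability; it does not validate wheel versions or ABI compatibility.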
Local development:
colsearch-server
Single-host production-style start:
HOST=0.0.0.0 WORKERS=4 colsearch-server
OpenAPI:
http://127.0.0.1:8080/docs
http://127.0.0.1:8080/redoc
Useful probes:
curl http://127.0.0.1:8080/health
curl http://127.0.0.1:8080/ready
curl http://127.0.0.1:8080/metrics
Create a collection:
curl -X POST http://127.0.0.1:8080/collections/tutorial-dense \
-H "Content-Type: application/json" \
-d '{"dimension": 4, "kind": "dense"}'
Insert points with the preferred base64 transport:
import requests
from colsearch import encode_vector_payload
body = {
"points": [
{
"id": "invoice",
"vector": encode_vector_payload([1, 0, 0, 0], dtype="float16"),
"payload": {"text": "invoice total due", "doc_type": "invoice"},
},
{
"id": "report",
"vector": encode_vector_payload([0, 1, 0, 0], dtype="float16"),
"payload": {"text": "board report summary", "doc_type": "report"},
},
]
}
requests.post(
"http://127.0.0.1:8080/collections/tutorial-dense/points",
json=body,
timeout=30,
).raise_for_status()
Search with dense + BM25 fusion:
import requests
from colsearch import encode_vector_payload
response = requests.post(
"http://127.0.0.1:8080/collections/tutorial-dense/search",
json={
"vector": encode_vector_payload([1, 0, 0, 0], dtype="float16"),
"query_text": "invoice",
"filter": {"doc_type": "invoice"},
"dense_hybrid_mode": "rrf",
"top_k": 2,
},
timeout=30,
)
response.raise_for_status()
print(response.json()["results"][0])
Use solver refinement when latence_solver is installed:
{
"dense_hybrid_mode": "tabu"
}
Use the optional Latence graph lane when the sidecar is installed and the collection payloads carry graph-aware metadata:
{
"graph_mode": "auto",
"graph_local_budget": 4,
"graph_community_budget": 4,
"graph_evidence_budget": 8,
"graph_explain": true
}
The graph lane runs after the dense and BM25 first stage and is merged additively. Inspect the sidecar lifecycle with:
curl http://127.0.0.1:8080/collections/tutorial-dense/info
Look for graph_health, graph_dataset_id, graph_sync_status, and
graph_last_successful_sync_at in the response.
Public transparency note: the graph lane works on Latence graph data derived from the indexed corpus and linked back to collection targets. The API exposes sync, health, and provenance metadata without publishing proprietary extraction heuristics.
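The graph fields above can be pulled out of the info response programmatically. A minimal sketch using the field names from the text above; the sample payload values are illustrative, not real server output:

```python
GRAPH_FIELDS = (
    "graph_health",
    "graph_dataset_id",
    "graph_sync_status",
    "graph_last_successful_sync_at",
)

def graph_status(info: dict) -> dict:
    """Extract the graph sidecar fields from a /collections/{name}/info response."""
    return {field: info.get(field) for field in GRAPH_FIELDS}

# Illustrative payload; in practice use
# requests.get("http://127.0.0.1:8080/collections/tutorial-dense/info").json()
sample = {
    "name": "tutorial-dense",
    "graph_health": "healthy",
    "graph_sync_status": "synced",
}
print(graph_status(sample))
```

Missing fields come back as None, which is also what you should expect when the latence-graph extra is not installed.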
Create the collection:
curl -X POST http://127.0.0.1:8080/collections/tutorial-li \
-H "Content-Type: application/json" \
-d '{"dimension": 4, "kind": "late_interaction"}'
Insert and query multivectors:
import requests
from colsearch import encode_vector_payload
doc_vectors = [[1, 0, 0, 0], [1, 0, 0, 0]]
requests.post(
"http://127.0.0.1:8080/collections/tutorial-li/points",
json={
"points": [
{
"id": "doc-1",
"vectors": encode_vector_payload(doc_vectors, dtype="float16"),
"payload": {"text": "invoice total due", "label": "invoice"},
}
]
},
timeout=30,
).raise_for_status()
response = requests.post(
"http://127.0.0.1:8080/collections/tutorial-li/search",
json={
"vectors": encode_vector_payload(doc_vectors, dtype="float16"),
"filter": {"label": "invoice"},
"with_vector": True,
"top_k": 2,
},
timeout=30,
)
response.raise_for_status()
print(response.json()["results"][0])
Late-interaction collections can use the same graph_mode,
graph_local_budget, graph_community_budget, graph_evidence_budget, and
graph_explain knobs. The base late-interaction order is preserved and graph
rescues are appended additively.
The multimodal collection API stores precomputed embeddings, but the reference server also provides the preprocessing step.
Render source documents:
curl -X POST http://127.0.0.1:8080/reference/preprocess/documents \
-H "Content-Type: application/json" \
-d '{
"source_paths": ["/data/source/invoice.pdf"],
"output_dir": "/data/rendered-pages"
}'
Create the collection:
curl -X POST http://127.0.0.1:8080/collections/tutorial-mm \
-H "Content-Type: application/json" \
-d '{"dimension": 4, "kind": "multimodal"}'
Insert and search patch embeddings:
import requests
from colsearch import encode_vector_payload
page_vectors = [[1, 0, 0, 0], [1, 0, 0, 0]]
requests.post(
"http://127.0.0.1:8080/collections/tutorial-mm/points",
json={
"points": [
{
"id": "page-1",
"vectors": encode_vector_payload(page_vectors, dtype="float16"),
"payload": {"doc_id": "invoice.pdf", "page_number": 1, "kind": "invoice"},
}
]
},
timeout=30,
).raise_for_status()
response = requests.post(
"http://127.0.0.1:8080/collections/tutorial-mm/search",
json={
"vectors": encode_vector_payload(page_vectors, dtype="float16"),
"filter": {"kind": "invoice"},
"with_vector": True,
"top_k": 2,
},
timeout=30,
)
response.raise_for_status()
print(response.json()["results"][0])
Practical guidance:
- multimodal_optimize_mode="auto" is the safe default
- explicit solver orderings are for targeted experiments
- pure multimodal retrieval usually wants exact MaxSim first, not solver-first packing
- the optional graph lane is available here too and follows the same additive merge contract
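As one concrete shape, a multimodal search body that keeps the safe default explicit might look like this. This is a sketch: placing multimodal_optimize_mode in the search body is an assumption, and the raw vectors would go through encode_vector_payload exactly as in the examples above:

```python
page_query = [[1, 0, 0, 0], [1, 0, 0, 0]]

body = {
    # In a real request, wrap with encode_vector_payload(page_query, dtype="float16").
    "vectors": page_query,
    "filter": {"kind": "invoice"},
    "multimodal_optimize_mode": "auto",  # safe default; explicit orderings are for experiments
    "top_k": 2,
}
# requests.post("http://127.0.0.1:8080/collections/tutorial-mm/search",
#               json=body, timeout=30)
print(sorted(body))
```

Check the live OpenAPI docs at /docs for the authoritative field placement.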
Shard collections are the max-performance public retrieval path.
Create one:
curl -X POST http://127.0.0.1:8080/collections/tutorial-shard \
-H "Content-Type: application/json" \
-d '{
"dimension": 128,
"kind": "shard",
"n_shards": 256,
"compression": "rroq158",
"rroq158_k": 8192,
"rroq158_group_size": 128,
"rroq158_seed": 42,
"quantization_mode": "fp8",
"transfer_mode": "pinned",
"router_device": "cpu",
"use_colbandit": true
}'
For the no-degradation safe-fallback lane (Riemannian 4-bit asymmetric),
swap to compression="rroq4_riem" and the related rroq4_riem_* knobs:
curl -X POST http://127.0.0.1:8080/collections/tutorial-shard-safe \
-H "Content-Type: application/json" \
-d '{
"dimension": 128,
"kind": "shard",
"n_shards": 256,
"compression": "rroq4_riem",
"rroq4_riem_k": 8192,
"rroq4_riem_group_size": 32,
"rroq4_riem_seed": 42,
"quantization_mode": "rroq4_riem",
"router_device": "cpu",
"use_colbandit": true
}'
Shard search is vector-only over HTTP:
import numpy as np
import requests
from colsearch import encode_vector_payload
query = np.random.default_rng(7).normal(size=(16, 128)).astype("float32")
response = requests.post(
"http://127.0.0.1:8080/collections/tutorial-shard/search",
json={
"vectors": encode_vector_payload(query, dtype="float16"),
"top_k": 10,
"quantization_mode": "fp8",
"transfer_mode": "pinned",
"use_colbandit": True,
},
timeout=30,
)
response.raise_for_status()
print(response.json()["results"][0])
Enable the optional graph lane on shard collections with the same endpoint:
response = requests.post(
"http://127.0.0.1:8080/collections/tutorial-shard/search",
json={
"vectors": encode_vector_payload(query, dtype="float16"),
"top_k": 10,
"quantization_mode": "fp8",
"graph_mode": "auto",
"graph_local_budget": 4,
"graph_community_budget": 4,
"graph_evidence_budget": 8,
"graph_explain": True,
"query_payload": {
"ontology_terms": ["Service C", "Export Control"],
"workflow_type": "compliance",
},
},
timeout=30,
)
response.raise_for_status()
print(response.json()["metadata"]["graph"])
Important truth-in-advertising note:
- shard HTTP search does not take query_text
- dense BM25 hybrid stays on dense collections over HTTP
- shard + BM25 fusion is an in-process HybridSearchManager workflow
- shard collections can still use the optional graph lane after first-stage retrieval
- on shard HTTP search, use query_payload rather than query_text to steer graph policy
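Since shard HTTP search is vector-only, one way to see what the in-process fusion accomplishes is a plain reciprocal-rank-fusion merge over two ranked ID lists. This is a generic client-side sketch, not the HybridSearchManager API:

```python
def rrf_merge(ranked_lists, k=60, top_k=10):
    """Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

shard_hits = ["doc-3", "doc-1", "doc-7"]  # e.g. IDs from shard vector search
bm25_hits = ["doc-1", "doc-9", "doc-3"]   # e.g. IDs from an external BM25 index
print(rrf_merge([shard_hits, bm25_hits]))  # doc-1 and doc-3 rise to the top
```

For production use, prefer the in-process HybridSearchManager workflow named above; this snippet only illustrates the fusion math.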
Post-generation hallucination scoring is provided by the optional
Latence Trace sidecar from latence.ai, which
runs alongside the reference API. The sidecar exposes
POST /groundedness with the same chunk_ids / raw_context contract,
calibrated risk bands, and NLI / semantic-entropy / structured-source
peers. See the Groundedness sidecar guide
for the deployment story.
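A request body for the sidecar's POST /groundedness might be shaped like this. This is a sketch: chunk_ids and raw_context come from the contract named above, while the answer field name and the sidecar URL are assumptions; the Groundedness sidecar guide has the authoritative schema:

```python
body = {
    "answer": "The invoice total is due on the 1st.",  # assumed field name
    "chunk_ids": ["invoice", "report"],                # from the chunk_ids contract above
    "raw_context": "invoice total due\nboard report summary",
}
# Sidecar URL depends on your deployment, e.g.:
# requests.post("http://127.0.0.1:9090/groundedness", json=body, timeout=30)
print(sorted(body))
```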
Collections persist under the configured storage root. Useful endpoints:
curl http://127.0.0.1:8080/collections
curl http://127.0.0.1:8080/collections/tutorial-shard/info
curl http://127.0.0.1:8080/health
curl http://127.0.0.1:8080/ready
When the graph lane is enabled, collection info also exposes sidecar health and freshness metadata. Readiness will report degraded or failed graph sync states without taking down the base retrieval service.
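For startup orchestration, the /ready probe can be polled until it reports healthy. A transport-agnostic sketch: the probe function is injected so the loop is easy to test, and in practice you would pass something like lambda: requests.get("http://127.0.0.1:8080/ready").ok:

```python
import time

def wait_until_ready(probe, attempts=30, delay=1.0):
    """Poll a readiness probe until it returns True or attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False

# Demo with a fake probe that succeeds on the third call:
calls = iter([False, False, True])
print(wait_until_ready(lambda: next(calls), attempts=5, delay=0.0))
```

Remember the readiness note above: a degraded graph sync state is reported, not fatal, so decide whether your probe should treat it as ready.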
Shard-only admin endpoints:
- POST /collections/{name}/compact
- POST /collections/{name}/checkpoint
- GET /collections/{name}/wal/status
- GET /collections/{name}/shards
- POST /collections/{name}/scroll
- POST /collections/{name}/retrieve
Check availability:
curl http://127.0.0.1:8080/reference/optimize/health
/reference/optimize is the stateless solver endpoint for:
- dense
- dense + BM25
- late-interaction
- multimodal
- mixed candidate pools you want to pack or refine explicitly
Use it when you already have a candidate pool and want optimization, not when you simply need standard exact multimodal retrieval.
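As a shape sketch, a stateless optimize request might carry the candidate pool inline. The field names here (candidates, budget) are assumptions for illustration only; the live OpenAPI docs at /docs are authoritative:

```python
body = {
    "candidates": [  # assumed field name for the supplied pool
        {"id": "invoice", "score": 0.91, "text": "invoice total due"},
        {"id": "report", "score": 0.42, "text": "board report summary"},
    ],
    "budget": 2,  # assumed knob: how many items to pack into the context
}
# requests.post("http://127.0.0.1:8080/reference/optimize", json=body, timeout=30)
print(len(body["candidates"]))
```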
The reference HTTP server exposes one consistent collection contract plus a
small set of cross-collection helpers. Full request and response schemas are
in the live OpenAPI docs at http://127.0.0.1:8080/docs; this table is the
quick map.
| Endpoint | Purpose |
|---|---|
| POST /collections/{name} | Create collection (dense, late-interaction, multimodal, or shard) |
| GET /collections | List collections |
| GET /collections/{name}/info | Inspect collection tuning, health, and graph sync state |
| DELETE /collections/{name} | Drop a collection |
| Endpoint | Purpose |
|---|---|
| POST /collections/{name}/points | Add or upsert documents |
| DELETE /collections/{name}/points | Delete documents by ID |
| POST /collections/{name}/retrieve | Retrieve documents by ID |
| GET /collections/{name}/scroll | Scroll through all stored documents |
| Endpoint | Purpose |
|---|---|
| POST /collections/{name}/search | Single-query search (dense, late-interaction, multimodal, shard) |
| POST /collections/{name}/search/batch | Batched multi-query search |
| POST /rerank | Rerank a candidate pool |
| POST /encode | Encode text or images to vectors via the active provider |
Graph-aware search uses the same POST /search endpoint and adds the
optional graph_mode, graph_local_budget, graph_community_budget,
graph_evidence_budget, graph_explain, and query_payload fields.
| Endpoint | Purpose |
|---|---|
| POST /collections/{name}/checkpoint | Force WAL checkpoint |
| GET /collections/{name}/wal/status | WAL health and replay status |
| GET /health | Liveness |
| GET /ready | Readiness |
| Endpoint | Purpose |
|---|---|
| POST /reference/preprocess/documents | PDF / DOCX / XLSX / image preprocessing |
| POST /reference/optimize | Tabu Search context packing on a supplied candidate pool |
| GET /reference/optimize/health | Solver lane availability |
Outside the OSS HTTP contract:
- collection-specific ad hoc optimize endpoints
- built-in remote embedding hosting
- distributed control-plane features
- internal research backends that are not part of the shard-first public story
- docs/full_feature_cookbook.md
- docs/guides/max-performance-reference-api.md
- docs/guides/shard-engine.md
- docs/benchmarks.md