Quick start
Send your first compression request in 30 seconds.
By the end of this page you will have made your first compressed request and seen the token savings.
Install the SDK
Install the client for your language. The cURL path needs no install, since curl ships with most operating systems.
Get an API key
Create a key in the dashboard, copy its cmp_... value, and export it as COMPRESR_API_KEY. See Authentication for the full key model and security guidance.
Send a compression request
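Below is a minimal Python sketch of the two things this step needs: reading the key you exported earlier, and assembling the request fields described next. load_api_key and build_request_body are hypothetical helpers, not part of the Compresr SDK. The cmp_ prefix check follows the key format above. Mapping context, query, compression_model_name and the optional parameters mentioned on this page one-to-one onto JSON body keys is an assumption. No endpoint URL or auth header appears because this page does not give them; see the cURL / HTTP reference for those.

```python
import os
from typing import Optional

# The quick start says to always set this model name.
MODEL_NAME = "latte_v1"


def load_api_key() -> str:
    """Read COMPRESR_API_KEY and check it looks like a cmp_... key (hypothetical helper)."""
    key = os.environ.get("COMPRESR_API_KEY", "")
    if not key.startswith("cmp_"):
        raise RuntimeError("COMPRESR_API_KEY is unset or does not start with 'cmp_'")
    return key


def build_request_body(
    context: str,
    query: str,
    *,
    target_compression_ratio: Optional[float] = None,
    coarse: Optional[bool] = None,
    disable_placeholders: Optional[bool] = None,
) -> dict:
    """Assemble the request fields named on this page into a JSON-ready dict.

    Assumption: these names map directly onto body keys; check the HTTP reference.
    """
    if not context or not query:
        raise ValueError("both context and query are required")
    body = {
        "context": context,  # the long input you would otherwise send to your LLM
        "query": query,  # the question the compressed text must still answer
        "compression_model_name": MODEL_NAME,
    }
    # Only include optional parameters the caller actually set.
    optional = {
        "target_compression_ratio": target_compression_ratio,
        "coarse": coarse,
        "disable_placeholders": disable_placeholders,
    }
    body.update({name: value for name, value in optional.items() if value is not None})
    return body
```

Calling load_api_key() once at startup makes a missing or mistyped key fail fast, before any request is built.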
Pass the long context you would otherwise send to your LLM, plus the query you want it to answer. Always set compression_model_name: "latte_v1".
Inspect the result
The response contains the compressed text plus token-savings stats. This is the actual response from the live API for the call above (your duration_ms will differ slightly from run to run, because it depends on server load):
What the fields mean
- compressed_context: the shortened text. Forward it to your LLM exactly as you would the original input. [N tokens dropped] markers show where spans were cut; pass disable_placeholders=true if you want a clean concatenation without them.
- actual_compression_ratio: the fraction of input tokens removed (here 0.6375, so about 64% removed). It is not an Nx factor.
- target_compression_ratio: the value you asked for, echoed back. A value from 0 to 1 sets removal strength; a value above 1 is an Nx factor (maximum 200).
- tokens_saved: original_tokens − compressed_tokens.
- duration_ms: server-side compression time. Network round-trip time comes on top of this.
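Because actual_compression_ratio is a removed fraction and target_compression_ratio can be either a fraction or an Nx factor, it is easy to mix the two scales up. The Python sketch below checks the documented relationships and converts between the scales. It assumes original_tokens and compressed_tokens come back as numeric response fields. It also assumes a 0 to 1 target is read on the same removed-fraction scale as actual_compression_ratio, and that an Nx factor of N means the output is N times smaller, so 1 - 1/N of the tokens are removed. CompressionStats and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CompressionStats:
    """Token-savings fields from a compression response (field names as listed above)."""
    original_tokens: int
    compressed_tokens: int
    tokens_saved: int
    actual_compression_ratio: float  # fraction of input tokens removed, 0-1


def check_stats(stats: CompressionStats, tol: float = 1e-3) -> None:
    """Raise if the fields disagree with the definitions on this page."""
    if stats.original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    if stats.tokens_saved != stats.original_tokens - stats.compressed_tokens:
        raise ValueError("tokens_saved should equal original_tokens - compressed_tokens")
    removed = stats.tokens_saved / stats.original_tokens
    # Tolerance allows for the ratio being rounded in the response.
    if abs(removed - stats.actual_compression_ratio) > tol:
        raise ValueError("actual_compression_ratio should be tokens_saved / original_tokens")


def removed_fraction_to_nx(removed: float) -> float:
    """Convert a removed fraction into the equivalent Nx factor (0.75 removed -> 4x)."""
    if not 0 <= removed < 1:
        raise ValueError("removed fraction must be in [0, 1)")
    return 1.0 / (1.0 - removed)


def target_to_removed_fraction(target: float) -> float:
    """Read target_compression_ratio on the removed-fraction scale (assumption, see lead-in)."""
    if not 0 <= target <= 200:
        raise ValueError("target_compression_ratio must be between 0 and 200")
    # 0-1: already a fraction; above 1: an Nx factor, i.e. 1 - 1/N removed.
    return target if target <= 1 else 1.0 - 1.0 / target
```

Under this reading, the 0.6375 above corresponds to roughly a 2.76x reduction, and a target_compression_ratio of 4 asks for about 75% of the input tokens to be removed.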
What just happened?
latte_v1 scored every paragraph in your context against the query, kept the ones that answer it, and dropped the rest. The paragraph with the mirror-diameter sentence stayed; the L2 orbit, sunshield, launch, and operator paragraphs did not. You sent ~64% fewer input tokens to your downstream model without losing the answer.
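To make the idea of query-aware selection concrete, here is a deliberately simple Python toy. It is not how latte_v1 works. It scores paragraphs by plain word overlap with the query instead of a learned model. It counts whitespace-separated words instead of model tokens. It also keeps a fixed number of paragraphs rather than honouring a target ratio. The function name toy_select_paragraphs is hypothetical. Only the placeholder format mirrors the [N tokens dropped] markers described above.

```python
import re


def toy_select_paragraphs(context: str, query: str, keep: int = 1) -> str:
    """Keep the `keep` paragraphs sharing the most words with the query; mark the rest as dropped.

    Toy stand-in: word overlap instead of a learned scorer, words instead of model tokens.
    """
    def words(text: str) -> set:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    paragraphs = [p for p in context.split("\n\n") if p.strip()]
    query_words = words(query)
    # Rank paragraphs by overlap with the query; sorted() is stable, so ties keep document order.
    ranked = sorted(range(len(paragraphs)), key=lambda i: -len(words(paragraphs[i]) & query_words))
    kept = set(ranked[:keep])

    out = []
    for i, paragraph in enumerate(paragraphs):
        if i in kept:
            out.append(paragraph)
        else:
            # Each dropped paragraph becomes its own placeholder (a real service may merge runs).
            out.append(f"[{len(paragraph.split())} tokens dropped]")
    return "\n\n".join(out)
```

With a three-paragraph context about a mirror, an orbit, and a sunshield, and the query "How wide is the mirror?", only the mirror paragraph survives and the other two become placeholders.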
Want a cleaner output without the [N tokens dropped] placeholders? Pass disable_placeholders=True. For finer-grained, sentence-level cuts inside a paragraph, pass coarse=False. See the models reference for the full parameter list.
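The documented way to get clean output is to pass disable_placeholders=True on the request. If you already have output that contains markers, for example from a cache, and want a clean copy or a quick tally of what was cut, a client-side regex is a workable fallback. The Python sketch below assumes the markers look exactly like [N tokens dropped]. The regex also accepts a singular "token", which is a guess about the format. The tally counts only the numbers inside the markers, so it need not match tokens_saved exactly: the markers themselves occupy tokens in compressed_context.

```python
import re

# Matches markers such as "[42 tokens dropped]"; the singular form is an assumption.
PLACEHOLDER = re.compile(r"\[(\d+) tokens? dropped\]")


def dropped_token_count(compressed_context: str) -> int:
    """Sum the N values from every placeholder marker."""
    return sum(int(n) for n in PLACEHOLDER.findall(compressed_context))


def strip_placeholders(compressed_context: str) -> str:
    """Remove placeholder markers and tidy the whitespace they leave behind."""
    text = PLACEHOLDER.sub("", compressed_context)
    text = re.sub(r"[ \t]{2,}", " ", text)  # collapse doubled spaces left inline
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse blank lines left between paragraphs
    return text.strip()
```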
Next steps
- Python SDK - full method reference, async variants, streaming, batching.
- TypeScript SDK - same surface, camelCase params.
- cURL / HTTP - raw REST reference.
- Models - tune target_compression_ratio and other latte-only options.
- Agent client - drop-in for anthropic.Anthropic() / openai.OpenAI() with automatic tool-output compression.
- Web search - add Tavily or Brave to your agent loop in one line.
- LangChain integration - first-party middleware for tool outputs, history, and outbound prompts, plus a BaseDocumentCompressor for RAG.
- LangGraph integration - state-graph node, lossy checkpoint serializer, store wrapper, and multi-agent handoff tool.
- LlamaIndex integration - query-engine postprocessor, tool wrapper, and Memory API block.
- LiteLLM integration - drop the compresr guardrail into the proxy and compress tool messages across every provider transparently.