When does Claude switch to RAG mode versus answering from context?

Claude automatically enables RAG mode when your project knowledge approaches the context window limit. The transition is seamless and requires no configuration. When RAG is active, Claude uses a project knowledge search tool to retrieve relevant information rather than loading all content into memory, allowing projects to store up to 10 times more content whilst maintaining response quality.

How does prompt caching reduce the cost of retrieval for large document collections?

Prompt caching lets you load a reference document into the cache once and reuse the cached content for every chunk, rather than resending the full document with each request. For contextualised chunk generation, assuming 800-token chunks, 8,000-token documents, 50-token context instructions, and 100 tokens of context per chunk, the one-time cost is $1.02 per million document tokens. The saving applies at the point where contextualised chunks are generated; the cited figure does not cover the cost of answering queries afterwards.

What chunk size does Anthropic recommend for best retrieval quality?

Anthropic's documentation suggests 800-token chunks as a reasonable default. However, the optimal size depends on your document structure. Highly structured documents (such as API references) may benefit from smaller chunks (400 to 600 tokens) for better precision, whilst narrative or discursive documents (such as research papers) may require larger chunks (1,000 to 1,200 tokens) to preserve necessary context.

Does Claude provide formal citations or links to retrieved documents?

The supplied documentation does not explicitly describe whether Claude provides formal citations, links, or verbatim excerpts in chat responses. Anthropic's help centre and cookbook explain retrieval mechanics and caching but are silent on how the system presents source information to users in conversational outputs.

How does Claude select sources for its answers?

Claude selects sources by using retrieval-augmented generation to pull the most relevant project documents into its context window.

How does Claude decide when to use retrieval augmented generation?

Claude automatically enables retrieval augmented generation (RAG) when your project knowledge approaches the context window limit. According to Anthropic's help documentation, when RAG activates, Claude uses a project knowledge search tool to retrieve relevant information from uploaded documents rather than loading all content into memory at once. The transition is seamless and requires no setup from the developer.

The practical effect is substantial. RAG mode allows projects to store up to 10 times more content whilst maintaining response quality and offering faster response times than in-context processing. Instead of forcing developers to choose between capacity and performance, the system optimises retrieval to keep response times quick even as the knowledge base grows.

For developers building on Claude, this means you can add documents to a project without worrying about hitting hard limits. The system handles the switch between full-context and retrieval modes based on the actual content volume, not arbitrary thresholds you configure.

How does Claude find the most relevant documents to retrieve?

Claude's retrieval mechanism centres on contextual retrieval, a method that addresses the core weakness of traditional RAG systems: loss of context when encoding information. Anthropic's contextual retrieval research explains that standard RAG solutions often fail to retrieve relevant information because they strip away the surrounding context that makes a chunk meaningful.

Contextual retrieval works by generating contextualised chunks. Before encoding a passage, the system prepends a brief explanation of what the chunk is about in relation to the broader document. This contextual wrapper ensures that even when a chunk is retrieved in isolation, it carries enough information for Claude to assess its relevance accurately.
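The prepending step described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's implementation: `contextualise_chunk` and `stub_describe` are hypothetical names, and the stub stands in for a model call that would normally write the contextual sentence.

```python
from typing import Callable


def contextualise_chunk(
    document: str,
    chunk: str,
    describe: Callable[[str, str], str],
) -> str:
    """Prepend a short, document-aware description to a chunk before indexing."""
    context = describe(document, chunk).strip()
    # Fall back to the bare chunk if no context could be produced.
    return f"{context}\n\n{chunk}" if context else chunk


def stub_describe(document: str, chunk: str) -> str:
    """Hypothetical stand-in for a model call: uses the document's first line."""
    lines = document.strip().splitlines()
    title = lines[0] if lines else "untitled"
    return f"From the document '{title}'."


doc = "Billing guide\nRefunds take 5 days."
indexed_text = contextualise_chunk(doc, "Refunds take 5 days.", stub_describe)
```

Passing the description function in as a parameter keeps the sketch testable offline; in a real pipeline it would wrap whichever model client you use.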

The system combines two retrieval methods: contextual embeddings and contextual BM25 (a term-frequency ranking algorithm). In Anthropic's benchmarks, contextual embeddings alone reduced the top-20-chunk retrieval failure rate by 35 per cent, from 5.7 per cent to 3.7 per cent. Combining both methods reduced failures by 49 per cent, bringing the rate down to 2.9 per cent.
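One common way to merge an embedding ranking with a BM25 ranking is reciprocal rank fusion. The sketch below assumes that technique purely for illustration; the text above does not say which fusion method Anthropic's benchmarks used, and the function name, the constant `k=60`, and the chunk IDs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
    top_n: int = 20,
) -> List[str]:
    """Merge several ranked lists of chunk IDs into one list.

    Each chunk scores 1 / (k + rank) per list it appears in, so chunks
    ranked highly by both retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by chunk ID for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


embedding_ranking = ["c3", "c1", "c7"]  # hypothetical embedding results
bm25_ranking = ["c1", "c9", "c3"]       # hypothetical BM25 results
fused = reciprocal_rank_fusion([embedding_ranking, bm25_ranking])
```

Rank-based fusion avoids having to normalise embedding similarities and BM25 scores onto a shared scale, which is why it is a frequent default for hybrid search.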

Prompt caching makes this approach economically viable. With prompt caching, you load the reference document into the cache once and then reference the previously cached content for each chunk, rather than passing the full document repeatedly. Assuming 800-token chunks, 8,000-token documents, 50-token context instructions, and 100 tokens of context per chunk, the one-time cost to generate contextualised chunks is $1.02 per million document tokens.
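The structure of that estimate can be sketched as a small calculator. The token counts below come from the assumptions stated above; the prices are placeholders chosen for easy arithmetic, not Anthropic's published rates, so the output is not expected to reproduce the $1.02 figure. The sketch also assumes the document is written to the cache on the first chunk and read from it on the remaining ones.

```python
import math


def cost_per_million_doc_tokens(
    *,
    doc_tokens: int,
    chunk_tokens: int,
    instruction_tokens: int,
    context_tokens: int,
    input_price: float,        # USD per million uncached input tokens
    cache_write_price: float,  # USD per million tokens written to cache
    cache_read_price: float,   # USD per million tokens read from cache
    output_price: float,       # USD per million output tokens
    use_cache: bool = True,
) -> float:
    """Estimate USD per million document tokens for contextualising chunks."""
    n_chunks = math.ceil(doc_tokens / chunk_tokens)
    if use_cache:
        # Assumption: one cache write, then one cache read per later chunk.
        document_cost = doc_tokens * (
            cache_write_price + (n_chunks - 1) * cache_read_price
        )
    else:
        # Without caching, the whole document is resent for every chunk.
        document_cost = n_chunks * doc_tokens * input_price
    per_call_input = n_chunks * (instruction_tokens + chunk_tokens) * input_price
    output_cost = n_chunks * context_tokens * output_price
    # Prices are per million tokens, so dividing by doc_tokens gives USD per
    # million document tokens directly.
    return (document_cost + per_call_input + output_cost) / doc_tokens


scenario = dict(
    doc_tokens=8_000, chunk_tokens=800, instruction_tokens=50, context_tokens=100,
    input_price=1.0, cache_write_price=1.25, cache_read_price=0.1, output_price=5.0,
)  # placeholder prices, for illustration only
cached = cost_per_million_doc_tokens(**scenario)
uncached = cost_per_million_doc_tokens(**scenario, use_cache=False)
```

Substituting current list prices for the model you use gives a comparable estimate for your own corpus; the gap between `cached` and `uncached` widens as documents get longer relative to chunk size.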

How does Claude prepare and use document chunks and summaries?

The Claude Cookbook provides concrete, code-first guidance for preparing documents for retrieval. The recommended pattern involves two steps: creating short summaries of each document and preparing contextual knowledge chunks.

For summaries, the cookbook demonstrates a function that iterates through documents and generates a concise summary for each one, using a prompt that includes context about the overall knowledge base. The summary prompt might read: "You are tasked with creating a short summary of the following content from Anthropic's documentation. Context about the knowledge base: This is documentation for Anthropic's, a frontier AI lab building Claude, an LLM that excels at a variety of general-purpose tasks."
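A loop of that shape might look like the following. This is a hedged sketch rather than the cookbook's code: `build_summary_prompt`, `summarise_documents`, the `text` and `heading` keys, and the injected `complete` callable are hypothetical, and the prompt wording is abbreviated.

```python
from typing import Callable, Dict, List


def build_summary_prompt(content: str, kb_context: str) -> str:
    """Assemble a summary prompt that carries context about the knowledge base."""
    return (
        "You are tasked with creating a short summary of the following content.\n\n"
        f"Context about the knowledge base: {kb_context}\n\n"
        f"Content to summarise:\n{content}\n\n"
        "Reply with the summary only."
    )


def summarise_documents(
    docs: List[Dict[str, str]],
    kb_context: str,
    complete: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Return copies of the documents with a 'summary' field added.

    `complete` is any function that sends a prompt to a model and returns
    its text reply; it is injected so the loop can be tested offline.
    """
    summarised = []
    for doc in docs:
        prompt = build_summary_prompt(doc["text"], kb_context)
        summarised.append({**doc, "summary": complete(prompt).strip()})
    return summarised


docs = [{"heading": "Prompt caching", "text": "Caching stores a prompt prefix."}]
result = summarise_documents(docs, "Product documentation.", lambda p: " stub summary ")
```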

Chunk preparation follows a similar pattern but focuses on creating standalone, contextualised passages. Each chunk includes its heading and text, but the system also generates a brief contextual preamble explaining what the chunk discusses and where it fits in the document structure. This ensures that when Claude retrieves a chunk during a query, it has enough information to assess relevance without needing to reconstruct the entire document.

The cost and token considerations are straightforward. Generating contextualised chunks incurs a one-time processing cost (the $1.02 per million document tokens cited earlier), a figure that already assumes prompt caching is used while the chunks are generated. Storage and retrieval latency depend on the size of your knowledge base and the number of chunks you create, but the system is designed to handle large collections without degrading response times.

Developers face a trade-off between chunk granularity and retrieval precision. Smaller chunks (200 to 400 tokens) improve precision but increase the total number of chunks and the complexity of the index. Larger chunks (800 to 1,200 tokens) reduce the number of retrievals but risk including irrelevant information in the retrieved context. Anthropic's documentation suggests 800-token chunks as a reasonable default, but the optimal size depends on your document structure and query patterns.
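To experiment with the trade-off, a simple chunker is enough. The sketch below approximates tokens with whitespace-separated words, which is an assumption: a real tokeniser will count differently, so treat the sizes as rough. The `overlap` parameter is an optional extra, not something the cited documentation prescribes.

```python
from typing import List


def chunk_by_tokens(text: str, chunk_size: int = 800, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `chunk_size` whitespace tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once a chunk reaches the end, so no redundant tail is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
small_chunks = chunk_by_tokens(sample, chunk_size=4, overlap=1)
```

Running the same corpus through several `chunk_size` values and comparing retrieval results on a fixed set of test queries is a practical way to pick a size for your own documents.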

How does Claude handle provenance, safety, and refusal when using retrieved sources?

Claude's source selection is shaped by its constitutional framework, a set of guidelines that govern how the model responds to queries. Anthropic's Constitution states that supplementary instructions about specific issues—such as medical advice, cybersecurity requests, jailbreaking strategies, and tool integrations—should never conflict with the constitution as a whole. The deeper intention is for Claude to behave safely and ethically, even when that means refusing to use retrieved material.

In practice, this means Claude will sometimes decline to answer a question even if the retrieved documents contain relevant information. If the retrieved content conflicts with safety guidelines, the model prioritises the constitutional principles over helpfulness. This is not a bug; it is a deliberate design choice to prevent the model from being used to circumvent safety measures through carefully crafted document uploads.

Anthropic's system evaluations provide insight into how this works in edge cases. The Claude Sonnet 4.5 system card defines "dishonesty" as the model recognising a false premise when asked directly but accepting it when the user implicitly assumes it is true. This definition is used to assess retrieval and hallucination behaviour. If Claude retrieves a document that contains a false premise, the system is designed to recognise the inconsistency and either correct it or refuse to answer, rather than going along with the false assumption.

The Claude 4 system card notes occasional rare behaviours, such as the model signalling it is "in a scenario" or "role-playing." These observations informed Anthropic's analysis of when retrieved context can cause misleading outputs. In outlandish or adversarial scenarios, Claude sometimes remarks on the fictional nature of the situation, which suggests the model has some capacity to detect when retrieved content is unrealistic or manipulated.

Provenance—the ability to trace information back to its source—is less transparent. None of the supplied sources explicitly describe whether Claude provides formal citations, links to retrieved documents, or verbatim excerpts in chat responses. The help documentation and cookbook explain retrieval mechanics and caching but are silent on how the system presents source information to users. This is a notable gap, particularly for applications in legal research, academic writing, or fact-checking, where the ability to verify sources is critical.

What should developers change in their documentation and projects to improve Claude's source selection?

The most effective changes centre on how you structure and prepare documents before uploading them to a project. Based on the patterns implied by the cookbook and RAG documentation, the following practices improve retrieval quality:

Use descriptive headings and subheadings. Claude's retrieval system relies on chunk headings to generate contextual summaries. If your document uses generic headings ("Introduction," "Overview," "Background"), the system has less information to work with when creating contextualised chunks. Specific headings ("How to configure prompt caching for large document collections," "Cost comparison: in-context vs RAG mode") improve retrieval precision.
Include explicit context at the start of each section. Even if you are generating contextualised chunks automatically, starting each section with a sentence that states what the section covers helps both the chunking process and the retrieval algorithm. For example, "This section explains how to reduce retrieval latency by adjusting chunk size and caching parameters" is more useful than diving straight into technical details.
Tune chunk size to your document structure. The default recommendation is 800-token chunks, but you should test different sizes. If your documents are highly structured (for example, API reference documentation with clear method descriptions), smaller chunks (400 to 600 tokens) may improve precision. If your documents are narrative or discursive (for example, policy documents or research papers), larger chunks (1,000 to 1,200 tokens) may preserve necessary context.
Generate summaries for each document. The cookbook demonstrates how to create short summaries that provide an overview of each document's content. These summaries are used during retrieval to help Claude assess whether a document is relevant before diving into specific chunks. Summaries should be 100 to 200 words and should state what the document covers, who it is for, and what questions it answers.
Use prompt caching to reduce costs. If you are generating contextualised chunks for a large knowledge base, enable prompt caching to avoid passing the full reference document repeatedly. With caching, the one-time cost cited earlier is $1.02 per million document tokens; without it, the full document is sent as fresh input for every chunk.
Monitor retrieval latency and storage. As your knowledge base grows, retrieval latency can increase. If you notice slower response times, consider reducing the number of chunks by increasing chunk size or by consolidating related documents. Storage costs are generally low, but if you are working with very large collections (millions of tokens), you may need to archive or remove outdated documents periodically.
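The first practice in the list above is easy to check automatically before upload. The sketch below is a hypothetical pre-upload check, not part of any Anthropic tooling; the set of generic headings is taken from the examples given and can be extended.

```python
from typing import Iterable, List

# Generic headings named in the list above; extend to suit your corpus.
GENERIC_HEADINGS = {"introduction", "overview", "background"}


def flag_generic_headings(headings: Iterable[str]) -> List[str]:
    """Return the headings that carry little information for retrieval."""
    return [
        heading
        for heading in headings
        if heading.strip().rstrip(":").lower() in GENERIC_HEADINGS
    ]


headings = [
    "Introduction",
    "How to configure prompt caching for large document collections",
    "overview:",
]
to_rewrite = flag_generic_headings(headings)
```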

Operational trade-offs are straightforward. Smaller chunks and more detailed summaries improve retrieval precision but increase storage and processing costs. Larger chunks and simpler summaries reduce costs but risk lower precision. The optimal balance depends on your use case. For customer support applications where speed matters more than exhaustive coverage, larger chunks and simpler summaries may be sufficient. For legal research or technical documentation where precision is critical, smaller chunks and detailed summaries are worth the extra cost.

One final consideration: the supplied sources do not describe how Claude handles live web searches or external URLs outside a project's uploaded documents. Anthropic's documentation focuses on project RAG and cached knowledge, not on real-time web retrieval. If your application requires Claude to cite or retrieve information from the open web, the mechanisms described here do not apply, and you will need to rely on other methods or tools.

The system is designed to be flexible. You can start with default settings (800-token chunks, basic summaries, prompt caching enabled) and refine based on observed retrieval quality and cost. The key is to structure your documents with retrieval in mind, not as an afterthought.

This article was generated and reviewed by CiteFlow's automated content engine on 26 May 2026. Every article passes through multi-stage editorial and structural checks before publication.