Autonomous Research

Agentic Loops, Citations, Evaluator-Generator — The Patterns Behind Perplexity Pro Research and OpenAI Deep Research

The 2024-2025 generation of "Deep Research" tools (Perplexity Pro Research, OpenAI Deep Research, Google Gemini Deep Research, You.com Research) all follow the same recipe: an LLM plans a research agenda, fans out queries to a search index, reads the retrieved pages, cites them in the synthesized answer, and iterates if the answer is still thin. The novelty isn't any single LLM call; it's the orchestration: when to write, when to fork another agent, when to verify with an evaluator, and how to keep citations traceable end-to-end.

The Agentic Research Loop

Plan, search, read, write, evaluate. Repeat until the evaluator says "done."

Key Numbers

5-30 min

Typical wall-clock for "Deep Research" runs

50-200

Pages fetched in a single research session

100-500K

Tokens consumed (input dominated by retrieved content)

5-10

Subquestions a plan typically produces

4-8

Parallel search/fetch threads · diminishing returns past this

2-3

Evaluator iterations before convergence

~70%

Citation accuracy of the best agents (claim-source match)

1. The Planning Step

Decompose the question. Write a research agenda. This step sets the shape of everything downstream.

The first LLM call ingests the user's question and produces an explicit list of subquestions and search queries:

# Planner prompt template (paraphrased)
"""
You are a research planner. Decompose the question into 5-10
specific subquestions, each one answerable from a small set of
web pages. For each, write a search query optimized for a
keyword search engine.

Question: {user_question}

Output JSON:
[
  {"subq": "...", "query": "..."},
  ...
]
"""

Good plans are orthogonal — subquestions don't overlap, so retrieved pages don't repeat. Bad plans are too narrow ("what's the weather?") or too broad ("how does the universe work?"). Models like GPT-4 and Claude 3.5 produce useful plans; smaller models tend to ask one big query.
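One way to enforce the "orthogonal" property is to validate the planner's JSON before any search is dispatched. The sketch below is only an illustration: `SubQuestion`, `parse_plan`, and the 0.6 overlap threshold are assumptions, and word-level Jaccard overlap is a crude stand-in for a real semantic similarity check.

```python
import json
from dataclasses import dataclass


@dataclass
class SubQuestion:
    subq: str
    query: str


def _tokens(text: str) -> set[str]:
    # Lowercased whitespace tokens; good enough for a rough overlap check.
    return set(text.lower().split())


def parse_plan(raw: str, max_overlap: float = 0.6) -> list[SubQuestion]:
    """Parse planner JSON and drop queries that mostly repeat an earlier one."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("planner output must be a JSON array")
    kept: list[SubQuestion] = []
    for item in items:
        sq = SubQuestion(subq=item["subq"].strip(), query=item["query"].strip())
        if not sq.query:
            continue  # skip entries with an empty search query
        new = _tokens(sq.query)
        # Jaccard similarity against every query already accepted.
        is_duplicate = any(
            len(new & _tokens(prev.query)) / len(new | _tokens(prev.query)) > max_overlap
            for prev in kept
        )
        if not is_duplicate:
            kept.append(sq)
    return kept
```

If the planner returns malformed JSON, `json.loads` raises, which is a reasonable trigger for re-prompting the planner rather than continuing with a broken plan.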

2. Retrieval-Augmented Reading

Fetch pages, chunk them, summarize. Keep source URLs attached to every chunk.

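# For each subquestion: search, fetch, chunk, and score.
# search_engine, fetch, readability, split, and embed are placeholder helpers.
all_chunks = []
query_vec = embed(subq.text)            # embed the subquestion once, not per chunk
results = search_engine.query(subq.query, k=10)

for url in results:
    html = fetch(url)
    text = readability(html)            # boilerplate removal
    chunks = split(text, max_tokens=1000)

    # Each chunk carries provenance
    for chunk_index, c in enumerate(chunks):
        c.source = url
        c.position = chunk_index
        # Per-chunk relevance score, optional
        c.score = embed(c.text) @ query_vec
    all_chunks.extend(chunks)

# Top-K chunks across all sources, ranked by relevance
K = 20                                  # illustrative value
top = sorted(all_chunks, key=lambda c: c.score, reverse=True)[:K]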

The art is in the chunking and ranking. Long chunks preserve context but cost tokens. Short chunks lose surrounding context but rank better against narrow queries. Most production systems use 500-1500 token chunks with 100-token overlap, then re-rank with a cross-encoder before stuffing into the writer's context.
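The overlap idea can be sketched as a sliding window. This is a minimal illustration, not a production chunker: `chunk_with_overlap` is a hypothetical name, and whitespace-separated words stand in for model tokens (a real system would count tokens with the model's own tokenizer). Cross-encoder re-ranking is not shown.

```python
def chunk_with_overlap(text: str, max_tokens: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of max_tokens words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing and embedding the shared words twice.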

3. Source-Grounded Synthesis (Citations)

Every claim must be traceable back to a source URL. The writer LLM is constrained to cite, not generate from its weights alone.

# Writer prompt
"""
Write an answer using ONLY the sources below. After every
factual claim, append a citation marker like [1] or [2].
Do not state facts that are not directly supported.

SOURCES:
[1] {url1}
{chunk1.text}

[2] {url2}
{chunk2.text}

QUESTION: {user_question}
"""

# Output is parsed for [N] markers, mapped back to URLs.
# Inline footnotes in the rendered output.

The hard part is verification. The model often cites a source that doesn't actually contain the claim, a failure known as a "hallucinated citation." Verifier passes (a separate LLM call per claim) reduce this from ~30% to ~5%, at the cost of doubling token usage.

Anthropic's Claude, OpenAI's Responses API, and Perplexity all return structured citation arrays alongside the prose, so the rendering layer can highlight which sentence cites which source.
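When the writer returns plain text with [N] markers instead of a structured citation array, the mapping back to URLs might look like the sketch below. `map_citations` and the example.com URLs are hypothetical, and the sentence splitter is deliberately naive. This only checks that each marker points at a real source; whether the source supports the claim still needs the verifier pass described above.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")
# Split after ., !, ? or a closing bracket when the next word starts with a capital.
SENTENCE_BREAK = re.compile(r"(?<=[.!?\]])\s+(?=[A-Z])")


def map_citations(answer: str, sources: list[str]) -> list[dict]:
    """Resolve [N] markers per sentence; N is 1-based into `sources`."""
    rows = []
    for sentence in SENTENCE_BREAK.split(answer.strip()):
        markers = [int(n) for n in CITATION.findall(sentence)]
        valid = [n for n in markers if 1 <= n <= len(sources)]
        rows.append({
            "sentence": sentence,
            "urls": [sources[n - 1] for n in valid],
            # Markers that point at no source are a cheap hallucination signal.
            "invalid": [n for n in markers if n not in valid],
        })
    return rows
```

Sentences that come back with an empty `urls` list are candidates for either removal or another search round.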

4. The Evaluator-Generator Pattern

After a draft, a separate LLM critiques it. If it's not good enough, the planner gets feedback and tries again.

# Evaluator prompt
"""
You are evaluating a research answer for completeness, citation
accuracy, and depth. Score 1-10 on each axis. List specific gaps.

Question: {user_question}
Answer:   {draft}
Sources:  {citations}

Output JSON:
{
  "completeness": <1-10>,
  "citation_accuracy": <1-10>,
  "depth": <1-10>,
  "gaps": ["...", "..."]
}
"""

# If any score below threshold, send gaps back to the planner
# to generate fresh subquestions.

This generator-evaluator loop (Schick et al., Saunders et al., and many more) consistently improves quality at the cost of more tokens. The evaluator must be a different model or at least a different prompt from the generator — otherwise it shares the same blind spots.

Cf. Anthropic's "Claude as judge" pattern, OpenAI's reflection in o1, and the Self-RAG paper (Asai et al., 2024).
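The control flow around the evaluator can be sketched as a bounded loop. Everything here is illustrative: `needs_another_round` and `refine` are hypothetical names, `generate` and `evaluate` are caller-supplied callables standing in for the LLM calls, and the threshold of 7 and cap of 3 rounds are assumed values.

```python
import json

AXES = ("completeness", "citation_accuracy", "depth")


def needs_another_round(evaluator_json: str, threshold: int = 7) -> tuple[bool, list[str]]:
    """Return (retry?, gaps) from the evaluator's JSON report."""
    report = json.loads(evaluator_json)
    failing = [axis for axis in AXES if int(report[axis]) < threshold]
    gaps = list(report.get("gaps", []))
    # Gaps are only passed on when at least one axis is below threshold.
    return bool(failing), (gaps if failing else [])


def refine(question, generate, evaluate, max_rounds: int = 3, threshold: int = 7) -> str:
    """Alternate generator and evaluator until scores pass or rounds run out."""
    gaps: list[str] = []
    draft = ""
    for _ in range(max_rounds):
        draft = generate(question, gaps)  # gaps from the last round steer the next plan
        retry, gaps = needs_another_round(evaluate(question, draft), threshold)
        if not retry:
            break
    return draft
```

Capping the rounds keeps cost predictable: if the evaluator never passes the draft, the loop still returns the latest attempt.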

5. Parallel Exploration

Independent subquestions can be researched in parallel by separate sub-agents. Where to fork?

Two patterns dominate:

Fan-out search: parallel search calls, single writer. Cheap and fast — the typical Perplexity Pro approach.
Fork agents: each subquestion spawns its own mini-agent that runs the full plan-search-read loop, then returns a sub-answer. The orchestrator merges. More expensive but produces deeper sub-coverage. OpenAI Deep Research does this.

The decision is empirical. For factual queries ("what is X?"), fan-out is enough. For comparative or synthesis queries ("compare X, Y, Z including their tradeoffs"), forked sub-agents produce noticeably better coverage.

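# Forked sub-agent in pseudocode
# plan, one_shot_search_and_answer, and synthesize are placeholder async helpers.
import asyncio

async def research_subquestion(subq, depth=2):
    # Base case: no more forking, answer with a single search pass
    if depth == 0:
        return await one_shot_search_and_answer(subq)
    sub_plan = await plan(subq)
    # Each child subquestion runs concurrently as its own mini-agent
    sub_answers = await asyncio.gather(*[
        research_subquestion(s, depth-1) for s in sub_plan
    ])
    # Merge the sub-answers into one answer for this subquestion
    return await synthesize(subq, sub_answers)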
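For contrast, the fan-out pattern needs only bounded parallel search calls whose results feed a single writer. A minimal sketch, assuming a caller-supplied async `search` function; `fan_out` and the default cap of 6 concurrent calls are illustrative choices within the 4-8 range quoted earlier.

```python
import asyncio


async def fan_out(queries, search, max_parallel: int = 6) -> dict:
    """Run `search` over all queries with at most `max_parallel` calls in flight."""
    gate = asyncio.Semaphore(max_parallel)

    async def bounded(query):
        async with gate:  # wait for a free slot before searching
            return await search(query)

    results = await asyncio.gather(
        *(bounded(q) for q in queries), return_exceptions=True
    )
    # Drop failed searches instead of aborting the whole run.
    return {
        q: r for q, r in zip(queries, results)
        if not isinstance(r, BaseException)
    }
```

The returned mapping goes straight to the writer. Queries missing from it are the ones that failed and may deserve a retry.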

6. When to Write vs When to Fork

The orchestrator has to decide at every step: do I have enough context to answer? Or do I need to dispatch another search/agent?

Heuristics that work in practice:

Confidence threshold: if the writer's draft contains many uncertainty markers ("possibly", "I'm not sure"), trigger another round.
Coverage check: if any subquestion has zero citations, search harder for it.
Time budget: cap at a fixed wall-clock or token budget; ship the best draft when it expires. The user expects a 30-minute Deep Research, not a 4-hour one.
Conflict detection: if two sources contradict, surface the disagreement instead of resolving it silently.
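The first three heuristics above can be combined into a single decision function. This is a sketch under stated assumptions: `RunState`, `next_action`, the hedge-word list, and every default limit are hypothetical. Conflict detection is left out because it needs an LLM or NLI call to compare sources.

```python
import re
from dataclasses import dataclass

HEDGES = re.compile(r"\b(possibly|perhaps|unclear|i'm not sure|might)\b", re.IGNORECASE)


@dataclass
class RunState:
    draft: str
    citations_per_subq: dict[str, int]  # subquestion -> number of citations in the draft
    started_at: float                   # seconds, same clock as `now`
    tokens_used: int


def next_action(state: RunState, *, now: float, max_seconds: float = 1800,
                max_tokens: int = 500_000, max_hedges: int = 3) -> tuple[str, list[str]]:
    """Return ("ship", []) or ("search", subquestions_to_retry)."""
    # Budget first: once time or tokens run out, ship the best draft.
    if now - state.started_at >= max_seconds or state.tokens_used >= max_tokens:
        return ("ship", [])
    # Coverage check: any subquestion with zero citations gets searched again.
    uncovered = [q for q, n in state.citations_per_subq.items() if n == 0]
    if uncovered:
        return ("search", uncovered)
    # Confidence check: too many hedging phrases triggers a general extra round.
    if len(HEDGES.findall(state.draft)) > max_hedges:
        return ("search", [])
    return ("ship", [])
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the decision deterministic and easy to test.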

Tradeoffs

Choice	Pro	Con
Many short LLM calls	Composable, debuggable, retryable	Latency from sequential calls
One long LLM call (single-shot)	Lower latency, fewer round trips	Hard to debug; no mid-flight correction
Cite everything	Trust, traceability	Slows synthesis, increases tokens
Forked sub-agents	Deeper coverage	2-5× the cost; harder orchestration
Evaluator pass	Higher quality output	Doubles inference cost

FAQ

How is this different from RAG?

RAG is a single retrieve-then-generate pass. Autonomous research is iterative: it plans subqueries, retrieves over rounds, evaluates the draft, and may re-search. Think of RAG as one LLM call with a vector DB attached; research agents are full agentic loops.

Why do agents hallucinate citations?

The writer is rewarded for citing whatever supports its draft. Without an explicit verifier, it picks the closest-seeming source even when the claim isn't actually supported. Verifiers help, but the only fully reliable solution is constrained generation that can only emit text grounded in retrieved chunks.

Can I run this locally?

Yes. Frameworks like LangChain, LlamaIndex, and Mirascope have research-agent templates. The bottleneck is the search index. Without a high-recall web index, even a perfect orchestrator produces shallow research.

How does OpenAI Deep Research differ from Perplexity Pro?

OpenAI Deep Research uses o1/o3-class models and forks more aggressively into sub-agents; a run takes ~30 minutes and produces a multi-page report. Perplexity Pro Research is faster (3-5 min), shallower, and optimized for the chat use case.

What's the role of search engine quality?

Critical. The agent's output ceiling is set by what its search index can find. Perplexity built its own index because Bing/Google APIs are too slow and rate-limited. Most other agents use Bing API with custom re-ranking.

Is this just chain-of-thought with extra steps?

Sort of. Chain-of-thought is reasoning over context the model already has. Research adds retrieval mid-reasoning — the agent can pull in new context based on what it's currently uncertain about. The combination is more powerful than either alone.