Intelligence Is Forgetting
Why the best systems are the ones that throw away the most
Here is the finding that does not get enough attention.
When researchers gave an AI agent access to its complete history, every conversation it had ever had, every decision it had ever made, every document it had ever read, its performance got worse. Not a little worse. Worse than an agent with no memory at all. Feeding the agent its own past made it stupider than amnesia.
Every intuition says more information should help. A richer context window should produce better answers. But the researchers found the opposite: raw history (unfiltered transcripts, verbatim trajectories, complete records) actively crowded out the agent’s ability to reason about the present task. The noise was not neutral. It was toxic.
The fix was not better retrieval. It was destruction. The agent got smarter by losing information. The compiled memory (summaries, distilled claims, structured artifacts) outperformed total recall not because summarization is a clever trick, but because most of what the agent remembered was noise, and the noise was worse than silence.
I keep running into this. Every system I look at, not just agents, but caches, databases, schedulers, operating systems, has the same property: the system that forgets well outperforms the system that remembers everything. And the way a system forgets is the most revealing design decision it makes.
Seven forgetting strategies, all different, all load-bearing
1. The cache forgets by approximate recency.
Redis does not maintain exact LRU ordering across millions of keys. That would require too much bookkeeping on every access. Instead, it samples a small set of random keys, places them into a tiny pool sorted by idle time, and evicts the longest-idle key from that pool. The pool converges to something very close to optimal after enough evictions. It is an approximation of forgetting, not exact forgetting, and the approximation works because which key you evict matters less than that you evict promptly.
The forgetting strategy reveals the value model. Recency is the proxy for future usefulness. The cache assumes the world is temporally local, that what was useful recently will be useful again. That is wrong often enough to matter, and right often enough to work.
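The sampling trick fits in a few lines. This is an illustrative sketch, not Redis's implementation: the sample and pool sizes echo Redis's defaults (maxmemory-samples of 5, a pool of 16), but the logical clock and the data structures here are invented for clarity.

```python
import random

class SampledLRUCache:
    """Toy sketch of approximate LRU eviction via random sampling."""

    def __init__(self, capacity, sample_size=5, pool_size=16):
        self.capacity = capacity
        self.sample_size = sample_size
        self.pool_size = pool_size
        self.clock = 0          # logical clock; Redis uses a coarse wall clock
        self.data = {}          # key -> value
        self.last_access = {}   # key -> logical access time
        self.pool = []          # surviving eviction candidates, longest-idle first

    def _touch(self, key):
        self.clock += 1
        self.last_access[key] = self.clock

    def get(self, key):
        if key not in self.data:
            return None
        self._touch(key)
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            self._evict()
        self.data[key] = value
        self._touch(key)

    def _evict(self):
        # Sample a handful of random keys instead of tracking exact global
        # LRU order, merge them into a small candidate pool ranked by idle
        # time, and evict the longest-idle candidate.
        sample = random.sample(list(self.data),
                               min(self.sample_size, len(self.data)))
        candidates = {k for k in self.pool if k in self.data}
        candidates.update(sample)
        ranked = sorted(candidates, key=lambda k: self.last_access[k])
        victim = ranked[0]                        # longest idle wins eviction
        self.pool = ranked[1:self.pool_size + 1]  # the rest stay as candidates
        del self.data[victim]
        del self.last_access[victim]
```

With a sample that covers the whole keyspace this degenerates to exact LRU; with a small sample over millions of keys it stays cheap and converges to nearly the same evictions.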
2. The CPU forgets by exponential decay.
The Linux scheduler tracks how much CPU each task is using through PELT, a running sum where each newer slice of history matters slightly more than the one before it. A task that ran hard 10 seconds ago contributes almost nothing to its current utilization score. The history is not deleted. It is decayed. The information is still there, but exponentially diminished into irrelevance.
The forgetting strategy reveals the timescale of relevance. In scheduling, the half-life is short, about 32 milliseconds. Anything older than a few hundred milliseconds is already below the noise floor. The scheduler cares about the last third of a second and almost nothing else. That is not a limitation. It is the right temporal resolution for the decisions it makes.
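The decay itself is a one-line recurrence. The sketch below uses one-millisecond periods and a factor chosen so that thirty-two periods halve the history, mirroring PELT's roughly 32ms half-life; the real PELT works in 1024-microsecond segments with fixed-point arithmetic, so treat this as the shape of the computation, not the implementation.

```python
# Decay factor chosen so that 32 one-millisecond periods halve the
# accumulated history: a half-life of ~32 ms, like PELT's.
Y = 0.5 ** (1 / 32)

def decayed_utilization(run_history):
    """run_history: per-period running fractions (0.0 to 1.0), oldest
    first. Newer periods dominate; older ones decay geometrically."""
    util = 0.0
    for running in run_history:
        util = util * Y + running * (1 - Y)
    return util

# A task that ran flat out, but 10 seconds (10,000 periods) ago,
# contributes essentially nothing to its current score:
old_burst = [1.0] * 100 + [0.0] * 10_000
recent_burst = [0.0] * 10_000 + [1.0] * 100
```

Run both through `decayed_utilization` and the old burst scores near zero while the recent one scores near one, even though the total CPU consumed is identical.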
3. The knowledge graph forgets by contradiction.
When a temporal knowledge graph learns “Alice now works at Bolt,” it does not erase “Alice works at Acme.” It marks the old edge with a termination timestamp: valid_at=2022, invalid_at=2025. The old fact still exists. You can query it, trace it, reason over it. But it has been filed into the past.
The forgetting strategy reveals the epistemic model. Facts have lifetimes, and old facts do not vanish. They stop being current. The knowledge graph treats the world as a sequence of states, each one replacing the last. Forgetting here is not loss. It is reclassification.
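Supersession is easy to sketch. The classes and method names below are hypothetical, not any particular graph library's API; the point is that asserting a new fact closes the old edge's validity interval instead of deleting the edge.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Edge:
    subject: str
    predicate: str
    obj: str
    valid_at: int                      # year the fact became true
    invalid_at: Optional[int] = None   # None means still current

class TemporalGraph:
    """Toy temporal knowledge graph: contradiction terminates, never deletes."""

    def __init__(self):
        self.edges = []

    def assert_fact(self, subject, predicate, obj, year):
        # Close any currently-valid edge this new fact contradicts.
        for e in self.edges:
            if (e.subject, e.predicate) == (subject, predicate) and e.invalid_at is None:
                e.invalid_at = year
        self.edges.append(Edge(subject, predicate, obj, valid_at=year))

    def current(self, subject, predicate):
        return [e for e in self.edges
                if (e.subject, e.predicate) == (subject, predicate)
                and e.invalid_at is None]

    def as_of(self, subject, predicate, year):
        # The past is reclassified, not lost: point-in-time queries still work.
        return [e for e in self.edges
                if (e.subject, e.predicate) == (subject, predicate)
                and e.valid_at <= year
                and (e.invalid_at is None or e.invalid_at > year)]
```

Asserting "Alice works at Bolt" in 2025 terminates the 2022 Acme edge, yet a query "as of 2023" still returns Acme.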
4. The LLM forgets by attention.
A transformer’s self-attention mechanism is, among other things, a forgetting strategy. At every layer, the model attends strongly to some tokens and barely at all to others. The tokens it ignores are not deleted. They are still in the context window. But they contribute almost nothing to the output. The “attention sink” phenomenon means the first few tokens stay heavily attended so the model keeps its bearings, while the middle of the context is subject to a soft, learned curve of forgetting.
The forgetting strategy reveals the structure of language. Most context is redundant. Through training, the model learns which positions carry signal and which positions carry noise. It forgets the noise by not attending to it.
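The arithmetic behind that soft forgetting is just softmax. With hypothetical attention scores for one query (a strongly attended sink token, one relevant token, four filler tokens), the ignored tokens keep a weight that is technically nonzero and practically irrelevant:

```python
import math

def softmax(scores):
    """Standard numerically-stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores, not from any real model: position 0 is the
# "attention sink", position 1 is the relevant token, the rest is filler.
scores = [4.0, 5.0, 0.0, 0.0, 0.0, 0.0]
weights = softmax(scores)
# The filler tokens remain in context but contribute under 1% each
# to the output: forgotten by inattention, not by deletion.
```

Nothing is removed from the context window; the exponential in softmax simply pushes low-scoring tokens toward zero influence.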
5. The memory forgets by compaction.
The Linux kernel’s memory compaction algorithm uses two scanners moving toward each other, one finding free pages, one finding movable pages. When they meet, pages are migrated to create contiguous free regions. The compacted pages are not lost. They are rearranged so that the gaps between them disappear. The system forgets the arrangement and keeps the content.
The forgetting strategy reveals the cost model. Arrangement matters more than content in this case. The same pages in a different order can enable a 2MB huge page allocation that was impossible before. Forgetting the old layout is what makes the new layout possible.
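The two-scanner dance can be shown on a toy page array. This is a deliberately simplified sketch ('M' for a movable page, '.' for a free one), not the kernel's actual compaction code, which also has to handle page locking, migration failures, and pageblock granularity:

```python
def compact(pages):
    """Two scanners walk toward each other: the migration scanner finds
    movable pages from the low end, the free scanner finds free slots
    from the high end. Pages migrate until the scanners meet, leaving
    one contiguous free region."""
    pages = list(pages)
    migrate, free = 0, len(pages) - 1
    while True:
        while migrate < free and pages[migrate] != "M":
            migrate += 1                 # advance to the next movable page
        while free > migrate and pages[free] != ".":
            free -= 1                    # advance to the next free slot
        if migrate >= free:
            break                        # scanners met: compaction done
        # Migrate the page: content survives, the old arrangement does not.
        pages[migrate], pages[free] = ".", pages[migrate]
    return "".join(pages)
```

Running it on a fragmented layout like `"M.M..M.M.."` yields `"......MMMM"`: the same four pages, but now the six free pages are contiguous and a large allocation becomes possible.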
6. The file system forgets by readahead.
When you read a file sequentially, the kernel plants a flag a few pages ahead of your current position. When your reads reach that flag, it triggers the next prefetch. The readahead window grows with each confirmed sequential access: 4 pages, then 8, then 16, doubling up to a tunable maximum (commonly 32 pages, or 128KB). But if you make a random access, reading from a place that is not where the flag expected you to be, the window resets. The system forgets that you were sequential. One random read erases the accumulated evidence of a thousand sequential ones.
The forgetting strategy reveals the fragility model. One counterexample destroys accumulated confidence. That is the right strategy for I/O access patterns, where a single random read often does mean the sequential pattern has ended. The readahead system has chosen to forget quickly rather than remember wrongly.
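That grow-and-collapse behavior is a few lines of state machine. The sizes below are illustrative; in the real kernel the initial and maximum windows depend on the request size and the device's configured readahead limit:

```python
class ReadaheadWindow:
    """Toy readahead heuristic: the window doubles on each confirmed
    sequential access and collapses to its initial size on a random one."""

    INIT, MAX = 4, 32   # pages; illustrative, roughly the common defaults

    def __init__(self):
        self.window = self.INIT
        self.next_expected = 0

    def access(self, page):
        if page == self.next_expected:
            # Confirmed sequential: trust (and the prefetch window) grows.
            self.window = min(self.window * 2, self.MAX)
        else:
            # One random read erases all accumulated confidence.
            self.window = self.INIT
        self.next_expected = page + 1
        return self.window
```

A thousand sequential reads pin the window at its maximum; a single out-of-place read drops it straight back to four pages.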
7. The completion ring forgets by design.
io_uring’s completion ring posts results to shared memory. The kernel fills completion entries into a pre-claimed cache, then publishes them with a single memory barrier. The moment the application reads a completion and advances the ring head, the slot is gone and can be overwritten by the next completion. There is no history, no replay, no audit. The ring forgets each completion the instant it is consumed.
The forgetting strategy reveals the commitment model. Once a result is consumed, it has no value. The system has decided that the only moment a completion matters is the moment it is read. Everything before that is queueing. Everything after that is noise.
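The mechanism is two counters over a fixed array. A sketch, with the synchronization reduced to comments (the real ring lives in shared memory, with acquire/release barriers between kernel and application):

```python
class CompletionRing:
    """Toy io_uring-style completion ring: fixed power-of-two slot array
    plus head/tail counters. Advancing the head is what frees a slot;
    there is no history and no replay."""

    def __init__(self, size=8):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.slots = [None] * size
        self.mask = size - 1
        self.head = 0   # advanced by the consumer (the application)
        self.tail = 0   # advanced by the producer (the kernel)

    def post(self, completion):
        if self.tail - self.head == len(self.slots):
            raise OverflowError("ring full: consumer has fallen behind")
        self.slots[self.tail & self.mask] = completion
        self.tail += 1  # in the real ring, published with a release barrier

    def consume(self):
        if self.head == self.tail:
            return None  # nothing pending
        entry = self.slots[self.head & self.mask]
        self.head += 1   # the slot is forgotten: immediately reusable
        return entry
```

The moment `consume` advances the head, the next `post` may overwrite that slot. Keeping history would mean the ring could fill and stall the producer; forgetting is what keeps it fast.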
The product nobody would build
Here is the product idea that falls out of these seven examples, the one nobody is going to build, and maybe should not:
A database where every record has a decay function.
Not TTL. Not hard expiry. A continuous decay curve where each record’s importance diminishes according to a configurable function, exponential, linear, step, sigmoid, custom. Queries return results weighted by their current importance. A record with importance 0.001 is still retrievable by explicit ID lookup, but it is invisible to scans and rankings. The storage engine reclaims space by compacting records whose importance has dropped below a threshold, archiving them to cold storage rather than deleting them.
The schema would look like this:
```sql
CREATE TABLE observations (
    id UUID,
    content JSONB,
    created_at TIMESTAMPTZ,
    decay_function TEXT DEFAULT 'exponential',
    half_life INTERVAL DEFAULT '30 days',
    -- Hypothetical syntax: stock PostgreSQL would reject this, since
    -- generated columns must be immutable and NOW() is not. A real
    -- build would compute importance at query time or in a view.
    importance GENERATED ALWAYS AS (
        decay(decay_function, half_life, created_at, NOW())
    ) STORED
);

-- "What do I know about Alice, weighted by freshness?"
SELECT * FROM observations
WHERE content @> '{"subject": "Alice"}'
ORDER BY importance DESC
LIMIT 10;

-- Result: recent observations first, old observations faded
-- Record from yesterday:  importance 0.98
-- Record from last month: importance 0.50
-- Record from last year:  importance 0.0002 (effectively invisible)
```

Why nobody will build it: every feature here can already be approximated with existing tools. ORDER BY created_at DESC gets you most of the value. A materialized view with a recency score gets you most of the rest. The final stretch, the continuous decay, the per-record function, the automatic compaction by importance, is a lot of infrastructure for a relatively small gain.
Why it is still interesting: it makes the forgetting strategy explicit and configurable. Every other system I described already has a forgetting strategy baked in. Redis chose approximate LRU. PELT chose exponential decay with a short half-life. The knowledge graph chose temporal supersession. The decay database would let you choose. And that choice would force you to answer a question that most system designers never ask directly:
What is the half-life of relevance in your domain?
For a customer service system, maybe 30 minutes. The current conversation matters; yesterday’s often does not. For a medical record, maybe 10 years. Old diagnoses still matter a great deal. For a security log, maybe 90 days, tied to a compliance window. For a personal knowledge system, it could differ by domain. Work memories may decay faster than personal memories. Technical facts may decay faster than relationship context.
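Those half-lives are easy to make concrete. Under plain exponential decay, the same one-day-old record has wildly different importance depending on the domain's half-life (the half-lives below are the illustrative ones from above):

```python
def importance(age_seconds, half_life_seconds):
    """Exponential decay: importance halves once per half-life."""
    return 0.5 ** (age_seconds / half_life_seconds)

DAY = 86_400
HOUR = 3_600

# The same record, one day old, scored under three different domains:
support_chat = importance(1 * DAY, 0.5 * HOUR)   # 48 half-lives: gone
security_log = importance(1 * DAY, 90 * DAY)     # ~0.992: barely faded
medical_note = importance(1 * DAY, 3650 * DAY)   # ~1.0: effectively fresh
```

Same data, same age, three defensible answers: the half-life is the parameter doing all the work.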
The half-life of relevance is one of the most important parameters in any information system, and almost nobody configures it. People default to “keep everything,” which is wasteful and noisy, or “delete after N days,” which is coarse and lossy. The space between those two extremes is where the interesting systems live.
What this actually tells us
I do not think anyone should build the decay database. But I do think every system designer should answer the question it forces you to ask.
When you build a cache, you are choosing a forgetting strategy. When you set a context window, you are choosing a forgetting strategy. When you pick an eviction policy, a retention period, a compression ratio, a summary length, you are deciding what to forget and how fast.
The systems that work well are the ones where the forgetting strategy matches the information structure of the domain. Redis works because web request patterns are temporally local. PELT works because CPU utilization changes on a millisecond timescale. The knowledge graph works because facts supersede each other. The LLM works because most of a document is redundant.
The systems that work badly are the ones that forget in the wrong way, or do not forget at all. An agent with total recall performs worse than an amnesiac because the noise overwhelms the signal. A cache that never evicts runs out of memory. A readahead system that never resets on random access wastes bandwidth on pages that will never be read. A completion ring that tried to keep history would stall under load.
The claim I want to make is simple, and I think it is true: intelligence is not the accumulation of information. It is the curation of information. And curation is mostly forgetting.
The cache that knows which keys to evict. The scheduler that knows which history to decay. The knowledge graph that knows which facts have been superseded. The LLM that knows which tokens to ignore. The compactor that knows which arrangement to abandon. The readahead logic that knows when to stop predicting. The ring that knows the moment to let go.
These are all the same thing. They are systems that got smarter by learning what to throw away. And the design decision that made them smart, the forgetting function, is the one that is almost never discussed, almost never configured, and almost never recognized as the most important parameter in the system.
The half-life of relevance. The decay curve of importance. The moment after which information stops being signal and starts being noise.
That is where the intelligence lives. Not in remembering. In forgetting.

