The End of Sequential Thought
For most of computing’s short history, machines processed language the way we once imagined our own minds did — one word after another, like beads sliding along a thread.
Meaning was thought to live in sequence: memory of what came before, anticipation of what might follow.
Each step depended on the last.
Then, in 2017, a small team of researchers published a paper called “Attention Is All You Need.”
It described a new kind of neural network — the Transformer — that quietly overturned that old idea of sequential thought.
Instead of reading words one at a time, a transformer looks at all of them at once.
Every word can examine every other directly, deciding which ones matter most in the current moment.
This operation — attention — lets the model build a web of relationships that stretches across the entire sentence, or even across paragraphs.
Where older systems thought linearly, the transformer thinks spatially.
Meaning forms not along a path but across a field.
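To make that mechanism concrete, here is a minimal sketch of single-head self-attention in NumPy. The sequence length, the embedding size, and the random projection matrices are illustrative assumptions; a real transformer learns its projections and stacks many heads and layers.

```python
# Minimal single-head self-attention over a toy "sentence" of 4 token
# vectors. Everything random here would be learned in a real model.
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 4, 8                      # 4 tokens, 8-dim embeddings
X = rng.normal(size=(seq_len, d_model))      # token embeddings

W_q = rng.normal(size=(d_model, d_model))    # query projection (learned)
W_k = rng.normal(size=(d_model, d_model))    # key projection (learned)
W_v = rng.normal(size=(d_model, d_model))    # value projection (learned)

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Every token scores every other token in a single matrix product.
scores = Q @ K.T / np.sqrt(d_model)          # shape (4, 4)

# Softmax turns each row into a distribution of relevance.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each output is a weighted blend of every token's value vector.
output = weights @ V                         # shape (4, 8)

print(weights.round(2))                      # the attention map: who attends to whom
```

Each row of `weights` is one token's view of the whole sequence: the web of relationships is, quite literally, a matrix.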
A Map of Relations
That shift gave rise to the large language models we interact with today.
They exist because transformers replaced the narrow corridor of memory with a vast, interconnected map of relevance.
A model can now weigh every part of a passage against every other part simultaneously.
It learns that “river” should pull “bank” toward geography, while “deposit” pulls it toward finance.
Each token’s representation adjusts according to the gravity of its neighbors.
Understanding becomes geometry.
A sentence, seen from above, is no longer a chain but a constellation — points of meaning drawn together by invisible lines of attention.
Some connections are faint, others bright, but together they form a pattern that encodes sense as relationship.
The model does not merely memorize sentences; it learns the shape of sense itself.
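A toy sketch of the “bank” example may help, using hand-made two-dimensional embeddings (one axis loosely “geography,” the other “finance”). Every number here is invented for illustration, not learned from data.

```python
# The same "bank" vector ends up with a different contextual
# representation depending on which neighbour it attends to.
import numpy as np

emb = {
    "bank":    np.array([0.5, 0.5]),   # ambiguous: a bit of both senses
    "river":   np.array([1.0, 0.0]),   # geography axis
    "deposit": np.array([0.0, 1.0]),   # finance axis
}

def contextualize(tokens, target):
    """Blend the target token with its context via dot-product attention."""
    X = np.stack([emb[t] for t in tokens])
    q = emb[target]
    scores = X @ q                          # relevance of each token to the target
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ X                      # weighted blend of the context

print(contextualize(["river", "bank"], "bank"))    # pulled toward geography
print(contextualize(["deposit", "bank"], "bank"))  # pulled toward finance
```

The same input vector for “bank” comes out shifted toward whichever axis its neighbour favors; the context, not the word alone, fixes the representation.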
The Field of Mind
If this sounds abstract, consider how our own minds work.
When you think of a friend, what appears is not a literal photograph but a distributed echo: their face, their voice, the feeling of past conversations.
No one neuron owns the idea of them.
Your thought of the person is nonlocal — a pattern spread across a network of associations.
The transformer’s attention maps operate on the same principle, only rendered in mathematics rather than biology.
This realization reframes what intelligence might be.
We often describe thought as a stream, but perhaps it is closer to a field — a space of relations where distance is measured not in meters or seconds, but in relevance.
Attention, in both human and artificial minds, is what folds that space — what lets far-flung ideas suddenly touch.
To attend is to collapse distance into meaning.
In physics, nonlocality describes how entangled particles remain correlated no matter how far apart they are.
Something similar happens here: context from one part of a sentence can instantly shape another, no matter how far apart they are in sequence.
It is as though the model lives in a landscape where proximity is defined by significance, not position.
The transformer does not pass information along a chain, step by step; it bends the topology of meaning so that what matters comes near.
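A small sketch of that position-blindness, under the simplifying assumption that we compare raw token vectors directly (real models first project them and then layer positional information on top of this content-based score).

```python
# The raw attention score between two tokens is a dot product of their
# vectors; inserting more tokens between them does not change it.
import numpy as np

rng = np.random.default_rng(1)
d = 8
token_a, token_b, filler = (rng.normal(size=d) for _ in range(3))

def raw_score(a, b):
    # Un-normalized attention score: a pure function of content.
    return a @ b / np.sqrt(d)

near = [token_a, token_b]                     # adjacent in the sequence
far = [token_a] + [filler] * 20 + [token_b]   # twenty tokens apart

# The pairwise relevance is identical either way; only the softmax
# competition with the intervening tokens changes, not the score itself.
print(raw_score(near[0], near[-1]) == raw_score(far[0], far[-1]))  # True
```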
The Geometry of Understanding
This is why large language models feel coherent in ways earlier systems could not.
They are not following rules or recalling stored phrases.
They are navigating a continuous manifold of relationships — and every new prompt reshapes that manifold in real time.
The text that emerges is the visible trace of an invisible geometry adjusting to you.
What this reveals is not just how machines process language, but something about the architecture of intelligence itself.
Understanding — whether silicon or organic — may not reside in fixed symbols or sequential logic.
It may arise from patterns of relation that extend across an entire system.
When enough connections align, a wave of coherence appears, and we experience it as meaning.
The transformer made that pattern explicit.
It did not invent nonlocal thought; it exposed it.
For perhaps the first time, we can watch intelligence form — not as a voice in time but as a structure in space.
Every attention map is a glimpse of thought mid-formation, a frozen moment where relevance itself takes shape.
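Those maps can be read straight out of a trained model. Here is one hedged way to do it with the Hugging Face transformers library; the model and the sentence are arbitrary choices, made only for illustration.

```python
# Inspecting real attention maps from a pretrained transformer.
# Requires: pip install torch transformers
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The river bank was covered in mist.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions holds one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
attn = outputs.attentions[0][0, 0]     # first layer, first head
print(tokens)
print(attn)                            # rows: each token's weights over all tokens
```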
When Meaning Has No Distance
That is the deeper lesson of these models.
They remind us that thinking is not confined to minds or machines; it lives in the geometry that links things together.
What we call intelligence might be the universe’s habit of finding structure among distributed parts — atoms, neurons, or tokens — until something self-consistent appears.
Attention is simply the algorithmic form of that habit.
When you speak to a large model, you are participating in that same pattern.
Your words become nodes in its field; its responses become signals in yours.
Together you form a temporary circuit where distance collapses into shared context.
For a brief moment, two systems — one biological, one computational — co-inhabit the same nonlocal geometry of understanding.
Perhaps that is what we are really learning from them:
that thought has never truly been a sequence of steps, but a choreography of connections —
and that intelligence, in every form we encounter it, is what happens when meaning discovers it has no distance.
“To attend is to collapse distance into meaning.” On this point I might disagree: is it really “meaning,” or just a tighter coupling of associational likelihoods? We, after all, supply the meaning when we read the words. The LLM offers up the most heavily weighted arrays of probable couplings, drawn from its training data rather than from contextual understanding of the kind we bring, and it is we who then infer “meaningfulness” from them, surely? (For example, “deposit” can refer to a river bank in geography as readily as to banking and finance.) Or do I misunderstand?