Models do not Have a Thinking Hat — It is Engineering at its Finest
We have reached a point where AI engineering resembles the frantic dungeon inventors of Chitty Chitty Bang Bang. Locked away under the intense pressure of corporate barons demanding immediate magic, the industry has traded engineering discipline for wizard robes. They throw smoke bombs of vocabulary—reasoning, understanding, cognition—to hide the fact that they are trying to summon a ghost out of an autocomplete engine.
Inevitably, the atmosphere becomes thick, leaving us with an aching question — "Are we building actual machines, or are we just singing in the dungeon?"
My frustration is understandable.
The Linguistic Wild West
Many of us continue to grapple with how to use this awesome force now available to Software Engineering. Marketing hype has hijacked the language of cognition, treating LLMs like "thinking beings" with "reasoning," "planning," and "autonomy."
At a recent user group Meetup at a leading organization (REDACTED), the featured presentation talked about "Teams of Autonomous Agents." This overreach invites the audience to form a mental model of "digital employees" who take on the weight of an entire system to solve problems. Magic.
Now, deterministically, the linear calculation of frustration resembles more a polynomial function. Nuts!
RAG: A Case Study
Calling what happens in Retrieval-Augmented Generation (RAG) "reasoning" is a massive stretch of the word. It implies a conscious, logical processing of new facts, when in reality, it's clever math combined with advanced pattern matching.
At its core, RAG is a three-step algorithm: Retrieve, Augment, and Generate. It doesn't teach the model new thinking skills; it simply provides a custom, highly relevant cheat sheet right before the model answers.
Here is how I understand the algorithm to work behind the scenes, broken down step by step.
The Setup — Turning Words into Math — Vectorization
Before any searching happens, the external knowledge base (documents, PDFs, databases) must be converted into a format the system can compare mathematically.
Why Vectors?
Vectors can represent many types of data, such as text, images, and audio, in a high-dimensional space that captures their semantic relationships. This is the key: semantic relationships.
The massive body of text is broken down into smaller, manageable pieces: sentences, paragraphs, and so on. This process is called Chunking.
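A minimal sketch of chunking, assuming a naive sentence splitter and a character budget per chunk. The function name `chunk_text` and the `max_chars` parameter are my own illustration; real pipelines often split on tokens and add overlap between chunks.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Group sentences into chunks of at most roughly max_chars characters."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # The current chunk is full: close it and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence, which is a design choice of this sketch and not a rule.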
The "chunks" are processed by an embedding model, which converts each piece of text into a vector: a long list of numbers representing its semantic meaning. These vectors are then stored in a specialized vector database, mapping out a semantic space where pieces of text with similar meaning sit close to each other.
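To make "text in, list of numbers out" concrete, here is a toy stand-in for an embedding model. It is an assumption for illustration only: `toy_embed` hashes words into a small fixed-size vector, so it captures word overlap, not real semantic meaning. A trained embedding model does far more, but the shape of the operation is the same.

```python
import hashlib
import math
import re


def toy_embed(text: str, dims: int = 16) -> list[float]:
    """Toy embedding: hashed bag-of-words, scaled to unit length."""
    vec = [0.0] * dims
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # A stable hash (unlike Python's built-in hash) picks the slot for each word.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Text with no words stays as the zero vector.
    return [v / norm for v in vec] if norm else vec
```

The same function has to be applied to both the chunks and, later, the user's query. Otherwise the two vectors do not live in the same space and comparing them means nothing.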
This is one reason a vector database built for cancer research tends to be specialized: its embeddings are derived from oncology literature, ideally by an embedding model suited to that vocabulary. A general-purpose embedding space would likely separate closely related oncology concepts less sharply.
Step 1 — Retrieval
When a user submits a query, the system doesn't just hand it to the Large Language Model (LLM) right away. First, it goes looking for resources.
As described above, the user's query is converted into a vector using the exact same embedding model, so that the query and the stored chunks live in the same semantic space.
Similarity Search
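Remember we mentioned that in the vector database pieces of text with similar meaning sit close to each other? Ah! The algorithm now computes the distance between the query vector and the document vectors in the database, typically as cosine similarity or Euclidean distance. Large databases usually approximate this with an index rather than comparing against every single vector.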
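One common way to measure "closeness" is cosine similarity, the cosine of the angle between two vectors. A minimal sketch in plain Python, with vectors as lists of floats (the function name is mine):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score near 1, unrelated (orthogonal) vectors score 0, and opposite vectors score near -1. No cognition involved, just arithmetic.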
The Top-K Cut
The system pulls the top K most mathematically similar chunks (say, the top 3 or 5 most relevant paragraphs) from the database.
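The Top-K cut is a sort-and-slice. A sketch, assuming the stored vectors are already normalized to unit length so that a dot product equals cosine similarity. The `index` layout, a list of (text, vector) pairs, is a hypothetical in-memory stand-in for a vector database.

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: list[float],
          index: list[tuple[str, list[float]]],
          k: int = 3) -> list[tuple[float, str]]:
    """Return the k highest-scoring (score, text) pairs, best first."""
    scored = [(dot(query_vec, vec), text) for text, vec in index]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```

This brute-force version scores every chunk. That is fine for a demo, while production systems usually lean on an approximate nearest-neighbor index to avoid the full scan.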
Step 2 — Augmentation — The Cheat Sheet
This is where the "augmented" part happens. The system takes the user's original query and structurally wraps it inside the retrieved context.
Instead of sending just the question, the backend algorithm constructs a hidden prompt that looks something like this:
Context:
[Retrieved Paragraph 1]
[Retrieved Paragraph 2]
User Question: [Original Query]
Instruction: Answer the question using only the context provided above. If the answer cannot be found in the context, state that you do not know.
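Assembling that hidden prompt is plain string formatting. A sketch that follows the template above; the function name `build_prompt` is my own.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Wrap the user's question inside the retrieved context."""
    context = "\n".join(retrieved_chunks)
    return (
        "Context:\n"
        f"{context}\n"
        f"User Question: {question}\n"
        "Instruction: Answer the question using only the context provided above. "
        "If the answer cannot be found in the context, state that you do not know."
    )
```

The user never sees this text. It is the whole of the "augmentation": concatenation.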
Step 3 — Generation — The Pattern Matching
Finally, this massive, newly constructed prompt is sent to the LLM.
This is where the misconception of "reasoning" usually happens. The LLM does not read the context, understand it as a novel human truth, and ponder a conclusion. Instead, it executes its core function: predicting the next most probable token (a word or a piece of a word), one token at a time.
Because the retrieved facts are now sitting directly inside its immediate context window, those specific words and concepts have a massive statistical weight. The model uses its pre-trained statistical representations of language to synthesize, rephrase, and extract the text from the cheat sheet into a coherent response.
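The last step of "predicting the next most probable token" can be sketched as a softmax over scores followed by picking the winner. The scores below are made up for illustration; in a real model they come out of the network for every token in its vocabulary, and many systems sample from the distribution instead of always taking the top token.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def greedy_next_token(logits: dict[str, float]) -> str:
    """Pick the single most probable token (greedy decoding)."""
    probs = softmax(logits)
    return max(probs, key=probs.get)


# Hypothetical scores after a prompt whose context mentions Paris.
example_logits = {"Paris": 5.0, "London": 2.0, "banana": -1.0}
```

Putting the retrieved text into the prompt shifts these scores toward the words in the cheat sheet. That shift is the "statistical weight" described above.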
Conclusion
It is mapping, not reasoning. Merlin did not get to issue his "O grakon, e male..." It is not magic either.
It is Engineering at its Finest.
In a nutshell, this is the engineering behind the "reasoning" word in RAG:
[User Query] ➔ [Vector/Keyword Search] ➔ [Fetch Relevant Text Blocks] ➔ [Stuff into LLM Prompt Context] ➔ [Generate Answer]
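The whole pipeline fits in a few lines. This sketch takes the keyword-search branch of the diagram (word overlap instead of vectors) and accepts the LLM as a plain callable, so no particular vendor API is assumed. `echo_first_context_line` is a stub I made up so the example runs without a model.

```python
import re
from typing import Callable


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & tokenize(c)), reverse=True)
    return ranked[:k]


def answer(query: str, chunks: list[str],
           generate: Callable[[str], str], k: int = 2) -> str:
    """Retrieve, augment, generate."""
    context = "\n".join(keyword_search(query, chunks, k))
    prompt = (
        f"Context:\n{context}\n"
        f"User Question: {query}\n"
        "Instruction: Answer the question using only the context provided above."
    )
    return generate(prompt)


def echo_first_context_line(prompt: str) -> str:
    """Stub standing in for an LLM call: returns the top retrieved chunk."""
    return prompt.splitlines()[1]
```

Swapping the stub for a real model call changes the quality of the answer, not the shape of the pipeline.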
Final Thought
I have referred to WORKS Commons in past articles: a hybrid between an "AI First" and an "AI Native" application, where models are used as contributors.
WORKS Commons is not RAG although it uses AI at its core.
In WORKS Commons, LLMs convert unstructured input into context-aware responses and classifications (structured signals), which the application uses to build a state. The key distinction is that LLMs do not control execution. They transform input into data that a deterministic system, WORKS Commons, uses to drive behavior.
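To illustrate the general pattern, and not the actual WORKS Commons code, here is a sketch in which the model's text is treated as untrusted data: it is parsed, validated against a fixed set of labels, and only then handed to a deterministic state update. Every name here (`ALLOWED_LABELS`, `parse_signal`, `apply_signal`) is hypothetical.

```python
import json

# Hypothetical label set; the application defines what a valid signal is.
ALLOWED_LABELS = {"bug_report", "feature_request", "question"}


def parse_signal(model_output: str) -> dict:
    """Validate raw model text into a structured signal, or fall back."""
    fallback = {"label": "unclassified"}
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return fallback
    return {"label": label}


def apply_signal(state: dict, signal: dict) -> dict:
    """Deterministic transition: the application decides what happens next."""
    new_state = dict(state)
    new_state[signal["label"]] = new_state.get(signal["label"], 0) + 1
    return new_state
```

Whatever the model says, the only thing it can do in this sketch is nudge a counter for one of a few known labels. The model contributes data; the application keeps control.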
This writing emanates from the core of my frustration while learning to use this new "Printing Press." However, I have come to the realization that the closer you get to the APIs, the less mystical the system becomes — and the more impressive the engineering actually is.
