Module 2 · 18 min read · AI in Finance

How LLMs Actually Handle Numbers

Here is the single most important thing to understand before using AI for any financial work: language models do not do math the way a calculator does. They predict text. When an LLM tells you that 23% of $4.2M is $966,000, it isn't calculating — it's generating the most statistically likely sequence of tokens that looks like an answer. Sometimes that's correct. Sometimes it's confidently, catastrophically wrong. This module explains exactly how this works and how to get reliable numerical output.

Why a language model is not a calculator

A calculator executes arithmetic deterministically — the same inputs always produce the same correct output because actual computation happens. A language model generates text by predicting likely token sequences based on patterns in its training data. When you ask it to multiply two numbers, it's drawing on having seen millions of examples of arithmetic in text — not performing the operation.

For small, common calculations the model has effectively memorized the patterns, so it's usually right. For larger or more unusual numbers, the prediction breaks down. The model might produce an answer that's off by an order of magnitude, transpose digits, or simply invent a plausible-looking result.

Think of it this way

Imagine someone who has read every math textbook ever written but has never actually done arithmetic — they've only seen the worked examples. Ask them "what's 7 times 8?" and they'll instantly say 56 because they've seen it ten thousand times. Ask them "what's 4,847 times 392?" and they'll confidently produce a number that looks right but may be completely wrong, because they're recalling the shape of an answer rather than computing it.
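Both calculations mentioned so far can be settled by computing them instead of predicting them. The following is a minimal Python sketch; the variable names are illustrative, and `Decimal` is used for the currency figure only to keep floating-point rounding out of the picture.

```python
from decimal import Decimal

# 23% of $4.2M, computed rather than predicted
share = Decimal("0.23") * Decimal("4200000")

# The "unusual" multiplication from the analogy above
product = 4847 * 392

print(f"23% of $4.2M = ${share:,.0f}")
print(f"4,847 x 392 = {product:,}")
```

The interpreter returns the same result every time it runs, which is exactly the property a predicted answer lacks.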

The critical solution: tools that actually compute

The fix transformed AI's usefulness in finance. Modern AI tools can execute real code to do the math, rather than predicting the result. The best-known example is ChatGPT's Advanced Data Analysis (formerly Code Interpreter), which runs actual Python in a sandbox.

When you upload a spreadsheet and ask ChatGPT to calculate the compound annual growth rate, with Advanced Data Analysis enabled it writes Python code, executes it on your actual data, and returns the real computed result. This is fundamentally different from asking it to "estimate" the CAGR from the numbers in the chat — the former is real computation, the latter is text prediction.
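As a rough illustration of the kind of code such a tool might write for a CAGR request, here is a short sketch using the standard formula, (end / start) ** (1 / years) - 1. The function name and the revenue figures are hypothetical and chosen only for illustration; the code a tool generates for your data will differ.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Hypothetical revenue figures, for illustration only
revenue_2020 = 10_000_000
revenue_2023 = 13_310_000

growth = cagr(revenue_2020, revenue_2023, years=3)
print(f"CAGR: {growth:.2%}")
```

Because the formula is written out in code, you can read it and confirm it is the right one before trusting the number it produces.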

The difference between predicting and computing

A model predicting math has no upper bound on how wrong it can be. A model executing code gets its arithmetic from the interpreter, so the result is as exact as a calculator's, provided the code applies the right formula to the right inputs. For any financial figure that matters, you want computation, not prediction.

How to force reliable numerical work

Use code execution for any real calculation

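If the number matters, use ChatGPT with Advanced Data Analysis or ask any model that supports code execution to "write and run code to calculate this." When code executes, the arithmetic is reliable, though it is still worth skimming the code to confirm it uses the right formula and the right inputs. When the model does it "in its head," the result is not reliable.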

Provide the raw numbers explicitly

Don't ask the model to recall a company's revenue — give it the actual figures from the source. The model is far better at operating on numbers you provide than at retrieving them from training data, where hallucination risk is highest.

Ask it to show its work

Request the formula and each intermediate step, not just the final answer. This lets you catch errors and verify the logic. Asking for step-by-step reasoning (chain-of-thought prompting) also tends to improve accuracy on multi-step calculations, although it does not make predicted arithmetic exact.
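Once the model has shown its steps, each one can be recomputed independently. The sketch below assumes you have copied the model's claimed intermediate values by hand; the figures, the `claimed_steps` list, and the `first_mismatch` helper are all hypothetical and exist only to illustrate the idea.

```python
import math

# Hypothetical source figures that you supplied to the model
revenue = 4_200_000
cogs = 2_730_000

# Each entry: (step description, value recomputed here, value the model claimed)
claimed_steps = [
    ("gross profit = revenue - COGS", revenue - cogs, 1_470_000),
    ("gross margin = gross profit / revenue", (revenue - cogs) / revenue, 0.35),
]


def first_mismatch(steps, rel_tol=1e-6):
    """Return the description of the first step whose claimed value differs
    from the recomputed value, or None if every step agrees."""
    for description, recomputed, claimed in steps:
        if not math.isclose(recomputed, claimed, rel_tol=rel_tol):
            return description
    return None


print(first_mismatch(claimed_steps))
```

A result of `None` means every step matched. Any other result names the first step to look at.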

Sanity-check the magnitude

Always ask yourself: is this number even plausible? If an LLM says a company with $50M revenue has $2B in net income, the error is obvious. Magnitude checks catch the most dangerous hallucinations instantly.
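A magnitude check can also be automated as a first filter. The following sketch uses a hypothetical `plausibility_flags` helper with a single illustrative rule. Net income can legitimately exceed revenue in unusual cases such as large one-off gains, so a flag means "look again", not "this is wrong".

```python
def plausibility_flags(revenue: float, net_income: float) -> list[str]:
    """Return human-readable warnings for figures that deserve a second look."""
    flags = []
    if revenue <= 0:
        flags.append("revenue is zero or negative")
    elif abs(net_income) > revenue:
        # Illustrative rule: profit or loss larger than total revenue is unusual
        flags.append(f"net income is {abs(net_income) / revenue:.0f}x revenue")
    return flags


# The example from the text: $50M revenue, $2B net income
print(plausibility_flags(revenue=50_000_000, net_income=2_000_000_000))
```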

A concrete cautionary example

Ask a language model without code execution to compute a 30-year mortgage amortization, or to sum a column of 40 specific figures, and there's a real chance it produces a confident wrong answer. The output will be formatted perfectly — clean tables, professional language — which makes the error harder to catch, not easier. Polish is not accuracy.
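For contrast, this is roughly what the computed version of a mortgage amortization looks like. It is a minimal sketch using the standard fixed-rate payment formula, P * r / (1 - (1 + r) ** -n), with a hypothetical loan. It keeps unrounded floats for brevity, whereas real lenders round each payment to the cent, so treat it as an illustration and not as a lender-grade schedule.

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Fixed monthly payment for a fully amortizing loan."""
    r = annual_rate / 12   # monthly interest rate
    n = years * 12         # number of monthly payments
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)


def amortization_schedule(principal: float, annual_rate: float, years: int):
    """Return (payment, rows); each row is
    (month, interest, principal_paid, remaining_balance)."""
    payment = monthly_payment(principal, annual_rate, years)
    r = annual_rate / 12
    balance = principal
    rows = []
    for month in range(1, years * 12 + 1):
        interest = balance * r
        principal_paid = payment - interest
        balance -= principal_paid
        rows.append((month, interest, principal_paid, balance))
    return payment, rows


# Hypothetical loan: $400,000 at 6% for 30 years
payment, rows = amortization_schedule(400_000, 0.06, 30)
print(f"Monthly payment: ${payment:,.2f}")
print(f"Payments: {len(rows)}, final balance: ${abs(rows[-1][3]):,.2f}")
```

The built-in cross-check is that the balance after the last payment lands at zero. A predicted table offers no such guarantee, however clean it looks.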

The right division of labor

Once you understand this, the correct workflow becomes obvious. Use the language model for what it's genuinely good at — understanding context, structuring analysis, explaining concepts, drafting narrative — and use code execution or a real spreadsheet for the actual numbers.

Give to the LLM's "reasoning"	Give to code execution / spreadsheet
Explaining what a metric means	Calculating the metric
Structuring a financial model's logic	Running the model's numbers
Interpreting what a ratio implies	Computing the ratio from raw data
Drafting the narrative around results	Producing the results themselves

The rule to never forget

If a financial number an AI gives you would influence a real decision, either it came out of executed code on data you provided, or you verify it against a primary source. There is no third acceptable option. A predicted number is a guess wearing a suit.

Now that you understand the numerical foundation, Module 3 covers how to use AI for market and company research — where the live-web tools become essential and where the training-cutoff trap catches careless users.
