
How many jelly beans are in the jar?

by Marc

Here is a jar of jelly beans. How many are in it?

Take a guess and hold onto it. We’ll come back to it — and to the slightly unsettling fact that nobody, including the people who set the puzzle, actually knows the answer.

A glass jar packed with multicoloured jelly beans
The actual image from the competition. You can see the front layer. You cannot see the middle.

This was the centrepiece of a weekend hackathon called the Google Guessing Game. The rules are almost insultingly simple: you’re shown eleven images, each with a question — how many jelly beans, how many triangles, how many sheep — and you submit one number per image. Your score is how close you get, averaged across all eleven. Simple to state. Much more interesting to do, because the eleven images are deliberately nothing like each other.

Five puzzles wearing the same costume

“Count the things” sounds like one problem. It’s at least five, and they fight each other:

  • A jar of candy is a volume problem — most of the beans are hidden, so counting what you can see is hopeless.
  • A grid of triangles of every size is combinatorics — you have to reason, not look.
  • A field of coffee beans is dense enumeration — hundreds of nearly-identical specks.
  • A photo of sheep is ordinary object-spotting.
  • And several are trick questions — “how many filled coffee cups”, “how many yellow balls” — where half the skill is ignoring the things that don’t count.

No single instruction to an AI does all of these well. Tell it to “carefully enumerate every shape” and it nails the triangles but absurdly over-thinks a photo of eight sheep. So the whole game is really about recognising which kind of puzzle you’re looking at and reaching for the right tool.

The idea: let the model look, let code do the maths

The agent is a router. A quick first call sorts each image into a category, then hands it to the method that wins that category. The principle underneath all of them collapsed into one sentence: let the model do the perceiving, and keep the arithmetic in code.
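The routing step can be sketched as a small dispatch table. This is a minimal illustration, not the agent's actual code: the names `route`, `classify`, `handlers`, and the category labels are all hypothetical, and the classifier and handlers are stubs standing in for model calls.

```python
from typing import Callable, Dict

# A handler takes (image_path, question) and returns a count.
Handler = Callable[[str, str], int]


def route(
    image_path: str,
    question: str,
    classify: Callable[[str, str], str],
    handlers: Dict[str, Handler],
    fallback: str = "objects",
) -> int:
    """Sort the image into a category, then call that category's method."""
    category = classify(image_path, question)
    # Unknown categories fall back to the plain object-counting handler.
    handler = handlers.get(category, handlers[fallback])
    return handler(image_path, question)


# Stub classifier and handlers (hypothetical; the real ones would call a model).
def demo_classify(image_path: str, question: str) -> str:
    return "jar" if "jar" in question.lower() else "objects"


demo_handlers: Dict[str, Handler] = {
    "jar": lambda path, q: 1000,     # placeholder for a volume-style estimate
    "objects": lambda path, q: 8,    # placeholder for a median of several asks
}

if __name__ == "__main__":
    print(route("jar.png", "How many jelly beans are in the jar?",
                demo_classify, demo_handlers))
```

Keeping the classifier and the handlers as plain functions means a category's method can be swapped without touching the router.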

For the jar, the model never counts — it reads off the jar’s proportions in bean-units (about nine across, fourteen tall) and Python turns that into a volume. For the grids, “how many squares of all sizes” is a known series; the model just says “it’s a grid”, code measures the exact rows and columns from the dark grid-lines, and the formula does the rest. For dense fields of separated objects, no model call is needed at all — threshold the image and let a connected-components algorithm count the blobs, exact and instant even into the thousands. For ordinary objects, ask several times and take the median.
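For the grid case, the "known series" can be written out directly. In a grid of r rows and c columns of unit cells, a k-by-k square can sit in (r - k + 1)(c - k + 1) positions, so the total is the sum over every k that fits. A short sketch, where the function name `count_squares` is illustrative and the row and column counts are assumed to have been measured already:

```python
def count_squares(rows: int, cols: int) -> int:
    """Count axis-aligned squares of every size in a rows x cols grid of cells."""
    if rows < 0 or cols < 0:
        raise ValueError("rows and cols must be non-negative")
    total = 0
    for k in range(1, min(rows, cols) + 1):
        # A k x k square has (rows - k + 1) vertical and
        # (cols - k + 1) horizontal positions.
        total += (rows - k + 1) * (cols - k + 1)
    return total


if __name__ == "__main__":
    print(count_squares(8, 8))  # chessboard: 204
```

Once the rows and columns are measured exactly, there is nothing left for a model to estimate.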

That last move has a century of pedigree. At a country fair in 1906, Francis Galton collected hundreds of villagers' guesses at the weight of an ox; individual guesses were scattered, but the median landed within about one percent of the true weight. The organiser hinted at exactly this on day one. So the jar estimate is the median of fifteen independent readings: fifteen fairgoers in a trench coat.
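Taking the median of repeated readings is a one-liner with the standard library. The sketch below assumes the readings have already been collected as numbers; the function name `crowd_estimate` and the sample values are made up for illustration.

```python
from statistics import median
from typing import Sequence


def crowd_estimate(readings: Sequence[float]) -> int:
    """Return the median of independent readings, rounded to a whole count."""
    if not readings:
        raise ValueError("need at least one reading")
    return round(median(readings))


if __name__ == "__main__":
    # Hypothetical readings: one wild outlier barely moves the median.
    print(crowd_estimate([10, 12, 11, 500, 9]))  # 11
```

The median is used rather than the mean because a single wild reading shifts the mean a long way but leaves the median almost untouched.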

The things I built and then threw away

The clean version above is the survivor. Plenty didn’t make it, and the failures were as instructive as the wins.

The first attempt at dense counting was recursive tiling: chop a dense image into tiles, count each, sum. It seemed obviously right and it was a disaster — to avoid double-counting I told the model to skip any object whose centre fell outside a tile, and it dutifully skipped so many edge objects that 1500 dots came back as 30. It also fired ~85 model calls per image. Deleted.

The second was point-localisation for objects — have the model drop a pin on each item and count the pins. On simple photos it was lovely. Then it met “how many numbers are in blue” and confidently pinned every number, because pinning and colour-filtering at the same time is apparently a bridge too far. The fix wasn’t to throw it out but to distrust it: only use the pin-count when it agrees with the plain estimate, and fall back the moment they diverge. Disagreement became the signal that the question was a trick.
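The "distrust it" rule amounts to an agreement check between two counts. A minimal sketch follows; the function name `reconcile` and the 15% tolerance are assumptions for illustration, since the threshold actually used is not stated.

```python
from typing import Tuple


def reconcile(pin_count: int, plain_estimate: int,
              rel_tol: float = 0.15) -> Tuple[int, bool]:
    """Use the pin count only when it agrees with the plain estimate.

    Returns (chosen_count, agreed). A False flag suggests the question
    may be a trick, such as a colour filter the pinning step ignored.
    """
    scale = max(pin_count, plain_estimate, 1)  # avoid dividing by zero
    agreed = abs(pin_count - plain_estimate) / scale <= rel_tol
    return (pin_count if agreed else plain_estimate, agreed)


if __name__ == "__main__":
    print(reconcile(15, 15))  # counts agree: keep the pin count
    print(reconcile(40, 9))   # counts diverge: fall back to the estimate
```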

The pattern in the wreckage: the methods that computed something exact survived; the methods that imposed a clever rule kept breaking on the cases the rule didn’t anticipate. Hold that thought.

How do you test a guess when the answer is a secret?

The competition’s true counts are hidden, so I manufactured ground truth — images where I knew the count exactly because I’d placed every object myself.

A synthetic jar of coloured balls rendered with depth
A synthetic jar: 770 balls packed into a 3-D cylinder and drawn back-to-front so the front layer hides the rest. About 150 are visible. Only volume reasoning recovers 770.

The jar generator packs balls into a 3-D cylinder and paints them back-to-front, so the front layer genuinely occludes the interior — just like the real thing. Naive “count what you see” gets ~150; the truth is 770. If your method climbs from one to the other, it’s actually reasoning about volume.
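A generator along these lines can be sketched in pure Python. This is an illustration under stated assumptions, not the generator described above: it places balls by simple rejection sampling (which gives a much looser packing than a real jar), and `pack_balls` and `draw_order` are hypothetical names. Rendering is left out; the point is the known count and the back-to-front ordering.

```python
import math
import random
from typing import List, Tuple

Ball = Tuple[float, float, float]  # (x, y, z); larger z is nearer the viewer


def pack_balls(n: int, jar_radius: float, jar_height: float,
               ball_radius: float, seed: int = 0,
               max_tries: int = 200_000) -> List[Ball]:
    """Place up to n non-overlapping balls inside an upright cylinder."""
    rng = random.Random(seed)
    balls: List[Ball] = []
    min_gap_sq = (2 * ball_radius) ** 2
    tries = 0
    while len(balls) < n and tries < max_tries:
        tries += 1
        # Uniform point in the disc, kept clear of the jar wall.
        r = (jar_radius - ball_radius) * math.sqrt(rng.random())
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x, z = r * math.cos(theta), r * math.sin(theta)
        y = rng.uniform(ball_radius, jar_height - ball_radius)
        # Reject the candidate if it overlaps any ball already placed.
        if all((x - bx) ** 2 + (y - by) ** 2 + (z - bz) ** 2 >= min_gap_sq
               for bx, by, bz in balls):
            balls.append((x, y, z))
    return balls


def draw_order(balls: List[Ball]) -> List[Ball]:
    """Painter's algorithm: farthest balls first, so nearer ones cover them."""
    return sorted(balls, key=lambda b: b[2])


if __name__ == "__main__":
    jar = pack_balls(50, jar_radius=5.0, jar_height=10.0, ball_radius=0.5)
    print(len(jar), "balls placed; the true count is known by construction")
```

Because every ball is placed by the script, the true count is known exactly, however few of the balls end up visible.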

A field of 120 separated coloured dots
A field of separated dots — the easy case for code. Connected-components counted these exactly, 37 through 1500, in milliseconds.
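The threshold-and-count step can be sketched without any imaging library. The version below assumes a greyscale image given as a list of rows, with bright objects on a dark background; the name `count_blobs`, the threshold of 128, and the use of 4-connectivity are illustrative choices rather than details from the original agent.

```python
from collections import deque
from typing import List


def count_blobs(gray: List[List[int]], threshold: int = 128,
                min_pixels: int = 1) -> int:
    """Count 4-connected regions of pixels at or above the threshold."""
    rows = len(gray)
    cols = len(gray[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or gray[r][c] < threshold:
                continue
            # Flood-fill this blob breadth-first, tracking its pixel count.
            size = 0
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                size += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and gray[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Ignore specks smaller than min_pixels (noise).
            if size >= min_pixels:
                count += 1
    return count


if __name__ == "__main__":
    image = [
        [0, 255, 0, 0],
        [0, 255, 0, 255],
        [0, 0, 0, 255],
        [255, 0, 0, 0],
    ]
    print(count_blobs(image))  # 3
```

This only works when the objects are separated: touching objects merge into a single blob.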

Results: ninth of twenty-four

On the day, the agent landed 9th of 24, at 86.8%.

Image                   Truth   Guess   Accuracy
jelly beans (jar)        1623    2297       0.58
triangles                  48      27       0.56
paintings                  34      33       0.97
sheep                      15      15       1.00
coffee cups (filled)       24      23       0.96
yellow balls                9       9       1.00
butterflies                23      25       0.91
books                     198     202       0.98
coffee beans              166     139       0.84
snails                     70      65       0.93
coffee beans              210     170       0.81

The ordinary objects came in around 0.96. The “yellow balls” trick was answered perfectly. Everything that went wrong went wrong on the two hardest images: the jar (a big over-guess) and the triangle puzzle (a big under-count). The winner scored 96.9%.
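The Accuracy column is consistent with a relative-error score, one minus the absolute error divided by the true count, floored at zero. That formula is inferred from the numbers in the table rather than taken from the competition rules, so treat the sketch below as an assumption that happens to reproduce the reported figures.

```python
def accuracy(truth: int, guess: int) -> float:
    """Relative-error score in [0, 1]; assumed, not the official definition."""
    return max(0.0, 1.0 - abs(guess - truth) / truth)


# (image, truth, guess) as reported in the results table.
RESULTS = [
    ("jelly beans (jar)", 1623, 2297),
    ("triangles", 48, 27),
    ("paintings", 34, 33),
    ("sheep", 15, 15),
    ("coffee cups (filled)", 24, 23),
    ("yellow balls", 9, 9),
    ("butterflies", 23, 25),
    ("books", 198, 202),
    ("coffee beans", 166, 139),
    ("snails", 70, 65),
    ("coffee beans", 210, 170),
]

scores = [accuracy(truth, guess) for _, truth, guess in RESULTS]
overall = sum(scores) / len(scores)

if __name__ == "__main__":
    print(f"{overall:.1%}")  # 86.8%
```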

The experiment that explained everything

The competition ran on one AI model. Afterwards a single question wouldn’t let go: how much of the score was the agent, and how much was just the model? There’s a clean way to find out — change one line, the model, run the identical agent again, and score it against the now-public answers.

A geometric figure full of overlapping triangles of many sizes
The triangle puzzle. The weaker model found 27 of 48. The stronger model, same prompt, found exactly 48.

Image                              Weaker model    Stronger model
jar                                2297 (0.58)     3233 (0.01)
triangles                          27 (0.56)       48 (1.00)
coffee beans                       139 (0.84)      162 (0.98)
snails                             65 (0.93)       70 (1.00)
cups / paintings / sheep / balls   0.96–1.00       ≈ exact

The triangle puzzle the first model got half-right, the better model solved exactly — same prompt, not a line changed. Dense beans, snails, cups all snapped to near-perfect. The hard cases evaporated.

With one spectacular exception. The jar got worse — 0.58 down to 0.01. My carefully-built volume formula, the part I was proudest of, turned a good model into a confidently wrong one: a sharper eye reads larger dimensions, and a formula already leaning high multiplied them into a bigger overshoot (3233 against a true 1623). The crutch had become a cap. The same agent, on the better model, scores 87.6% with the jar formula and 96.3% without it. Ten of eleven images average 99%; one hand-written formula drags the whole thing down nine points.
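The amplification is easy to see in a toy volume formula. The sketch below is not the formula used in the agent, which is not given and produced different numbers; it assumes sphere-like beans one bean-unit across and a random-packing fraction of 0.64, both labelled assumptions. What it shows is the shape of the problem: the count scales with width squared times height, so a small upward shift in the readings becomes a large one in the output.

```python
import math


def jar_count(across: float, tall: float, packing: float = 0.64) -> float:
    """Estimate items in a cylindrical jar measured in item-diameters.

    Assumes spherical items of diameter 1 and a fixed packing fraction.
    """
    jar_volume = math.pi * (across / 2) ** 2 * tall
    item_volume = math.pi / 6  # sphere of diameter 1
    return packing * jar_volume / item_volume


if __name__ == "__main__":
    base = jar_count(9, 14)
    # Readings 10% larger in both dimensions.
    bumped = jar_count(9 * 1.1, 14 * 1.1)
    print(f"ratio: {bumped / base:.3f}")
```

Reading both dimensions 10% larger raises the estimate by about a third, and the fixed packing fraction scales the result the same way no matter how good the readings are.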

Why it didn’t win

Three reasons, and they nest.

1. I engineered the one thing I should have left alone. The jar was the image I spent the most effort on, and it’s the only one where the effort actively hurt. Asked to simply estimate the jar, the model lands in the right neighbourhood. Handed a volume formula with a baked-in packing assumption, it produces a confident, precise, wrong number. The assumption was the problem, and an assumption doesn’t get smarter when the model does.

2. There was no real answer for the triangles. Freeform “count every triangle” needs either a genuine geometric solver or a model strong enough to reason it out. I had neither on the day — only a holistic fallback that found 27 of 48. That’s not a tuning miss; it’s a missing capability, and the better model supplied it for free.

3. I optimised the wrong layer. This is the one that stings and the one worth keeping. A whole weekend went into the scaffolding — routing, formulas, ensembles, crowds — on a fixed model. The single most powerful lever in the entire system was the model itself, and I never touched it. The A/B puts a number on it: the model upgrade was worth about ten points across the board, and my cleverest piece of scaffolding was worth minus six on the jar. The teams at the top of the board mostly didn’t out-engineer anyone. They picked a better model and got out of its way.

Where the leverage actually lives

The clean line through the two columns is the keeper. Every method that defers to the model — read the scene, count it, classify the shape — improves the instant the model improves, for free. The one method that overrides the model with a fixed assumption works against the upgrade. The code tricks that survived — measuring a grid by its lines, counting a field by its blobs — survived because they compute something exact, not something assumed.

That’s a usable map of where to spend effort when you build with AI. A stronger model lifts everything you let it reason about; a tuned heuristic is forever only as good as the assumption inside it. Scaffolding earns its place only where it computes a fact the model can’t — and the instinct to add more of it, precisely where the model is already capable, is the instinct to watch. There’s even a bill: the lean version (strong model, a few calls, no scaffolding) is both cheaper and more accurate than the heavily-engineered weaker one.

So, back to the jar. 1623 jelly beans, a number reached not by counting but by a crowd of guesses meeting in the middle. The most honest way to solve it was never a formula at all. It was to look, estimate, and trust the median, which is exactly what a room full of people at a country fair has been doing since 1906.
