An Unsolved LLM Physics Puzzle Benchmark

July 2025

I have posed this original physics puzzle to each new major LLM release over the last year. None has solved it, or even come close, though it has a simple answer. It does put an asterisk on the “physics PhD” intelligence of these AI models. It may be a novel problem; I haven't seen it online, so it's not in their training data (till now. Oops). Some say LLMs are best at recalling training data and more limited at creating new theories. Not only were they universally stumped, but the common threads in their failures offer insight into the structural obstacles ahead.

This was a real mechanical design challenge I had faced, and solved cleanly. I'm neither a physicist nor a mechanical engineer.

The Puzzle: A large TV swings down on an arm mounted to the 45° angled wall/ceiling in my attic. Counterbalance it so it floats, holding its position at every point in its travel as if weightless, with just a fingertip push needed to swing it up or down.

Steve Jobs once fawned over the “floating” screen of the iMac G4 (the one with the round base), which stayed suspended exactly wherever you positioned it. I wanted that.

I like that this test still has ample “headroom”. Not only has no LLM passed yet, they're all far from success, so there is room ahead to track their progress. Currently, every model needs guidance and multiple hints from the list below, sometimes to the point of dragging it through the entire solution. As models improve, I hope to give fewer hints as a measure of their advancement.

Definitions:

In the viewing position, the system hangs straight down and gravity exerts no torque, so the counterbalance must apply no force at that position. This is the natural rest position of the system with no added mechanism.

Swing the arm/TV up to its stowed position, flush against the wall. A force is needed here to keep it in place. The pivot point is the top of the arm, where it attaches to the 45° wall.

The system must continuously vary the force applied between those positions, throughout the range of motion, matching the changing effect of gravity at each angle.
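To put a number on “the changing effect of gravity at each angle”, here is a minimal sketch. It assumes the arm + TV act as the rigid heavy pole of rule 2 below, with mass m and center of mass a distance c from the pivot; gravity's torque about the pivot is then m·g·c·sin θ, where θ is the arm's angle from vertical. That is zero in the viewing position and largest when stowed. The function name and the 20 kg / 0.8 m figures are made up for illustration, and I'm assuming the stowed position sits 45° from vertical (my reading of the 45° wall).

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def gravity_torque(mass_kg: float, com_m: float, theta_deg: float) -> float:
    """Torque (N*m) gravity exerts about the pivot with the arm theta_deg from vertical.

    Treats arm + TV as one rigid body whose center of mass sits com_m from the pivot.
    The counterbalance must supply this same torque, in the opposite direction.
    """
    return mass_kg * G * com_m * math.sin(math.radians(theta_deg))


if __name__ == "__main__":
    # Hypothetical numbers: 20 kg arm + TV, center of mass 0.8 m from the pivot.
    # Assumed travel: 0 deg (viewing, hanging straight down) to 45 deg (stowed).
    for theta in range(0, 50, 5):
        print(f"{theta:2d} deg from vertical: {gravity_torque(20.0, 0.8, theta):6.1f} N*m")
```

Whatever the mechanism, its job is to reproduce this curve in reverse: nothing at the bottom, rising like a sine toward the stowed position.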

Implementation Details (common sense stuff):

1) No motors, hydraulics, gas shocks, or active controls. “Classic” passive mechanisms only: screws, springs, levers, and the like.

2) The arm + TV are a single rigid piece. The TV doesn't move relative to the arm. Ignore any center-of-mass offset: treat it as a heavy pole that hangs perfectly vertically in the viewing position.

3) Ignore friction, air resistance, non-ideal factors.

4) No space constraints: attach things anywhere*, as long as the TV is viewable and nothing protrudes when stowed (that is the point of “stowing” something).

Give it a go on your own. Spoilers follow.

Hints and guidance I offer to LLMs (and you)

1. The solution is a simple system. If your plan looks complex or has many parts, you're off track. This is a massive hint for LLMs, by the way. LLMs are biased toward overly complex solutions, and this advice greatly prunes their search tree, if only they'd take it.

2. It's not springs. LLMs and humans alike explore a spring-based solution. Springs apply a force that varies with the travel of one end - great! It's a dead end. (I wasted so much time on springs.) Put a spring in the system and it works backwards relative to what's needed, no matter how it's attached. Try reversing its force vector with a pulley, or a lever. Attach it to the opposite side of whatever it's attached to, and it's still backwards. The sneaky truth is that you cannot change the direction of the spring's force relative to its direction of travel. This case would need a mythical spring that exerts less force the more you stretch it. (This feels off. There's got to be a way to rig in a spring, right? To still-unconvinced readers working this out on their own right now: go right ahead. It's Not Springs.)

3. After backing out of the spring dead end, LLMs explore counterweights. This is correct, and only now do we arrive at the actual challenge: the geometry of a counterweight system that delivers this unusual, variable force. Set aside the magical variable-force generator for a moment, and let's deal with the mundane task of applying it. The counterweight gets the best leverage by pulling the end of the arm (the bottom edge of the TV) up toward the wall. A string runs from the arm to a pulley on the wall directly opposite. Tension on this string holds the arm in place; if it were actively raising the TV, this is how you'd “hoist” it. Next we need variable tension on the string, depending on the string's “position”. Note that the pulley converts rotational motion to linear. Think of the string sliding between its viewing and stowed “positions”. At each position, the TV tugs with a different force, so an equal force must tug on the other end. Finally, “just” (ahem) map that linear position to the needed forces.
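Here is a rough sketch of that last “just map it” step, under my own assumed geometry (not necessarily the real install): the wall runs at 45° from vertical through the pivot, the stowed arm lies along it, and the pulley sits exactly one arm length L from the pivot. The string from the arm tip to the pulley is then the base of an isosceles triangle with apex angle α = 45° − θ, so its length is 2L·sin(α/2) and its perpendicular distance from the pivot is L·cos(α/2); the required tension is gravity's torque divided by that lever. The function name, the spring rate k, and all other numbers are hypothetical. The last column puts a linear spring (F = k·s) in place of the string, for comparison with hint 2.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def string_state(theta_deg: float, mass_kg: float, com_m: float, arm_m: float):
    """Return (string_length_m, required_tension_n) with the arm theta_deg from vertical.

    Assumed geometry: wall at 45 deg from vertical through the pivot, stowed arm
    lying along it (theta = 45), pulley on the wall one arm length from the pivot.
    Valid for 0 <= theta_deg <= 45.
    """
    alpha = math.radians(45.0 - theta_deg)       # angle between arm and wall
    length = 2.0 * arm_m * math.sin(alpha / 2)   # string: base of the isosceles triangle
    lever = arm_m * math.cos(alpha / 2)          # perpendicular distance, pivot to string
    torque = mass_kg * G * com_m * math.sin(math.radians(theta_deg))
    return length, torque / lever


if __name__ == "__main__":
    # Hypothetical: 20 kg, center of mass 0.8 m out, 1.0 m arm, 500 N/m spring.
    k = 500.0
    for theta in range(45, -1, -5):  # stowed -> viewing, i.e. letting the string out
        s, t = string_state(theta, 20.0, 0.8, 1.0)
        print(f"{theta:2d} deg  string {s:5.3f} m  needed {t:6.1f} N  spring {k * s:6.1f} N")
```

Reading from stowed to viewing (letting the string out), the needed tension falls to zero while the spring's pull climbs: the “backwards” behavior from hint 2, at least for this one attachment.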

Aside: once configured, this string behaves bizarrely IRL. It pulls harder as you let it out, which feels physically confusing and backwards. It helps to look at what it's attached to.

With this framing of the problem, LLMs get near an answer. Most still need the final clue:

4. The TV/arm is a pendulum. Balance it with another pendulum. Alternate clue: “Picture a drawbridge”.

The Answer:

(schematic illustrates forces, not actual layout)

My counterbalance resembles a drawbridge, or an inverted pendulum. It's nothing but a weight on a stick, hinged at the floor. For centuries, drawbridges have employed clever counterweights to balance the weight of a rotating bridge segment pivoting at its base. The force curve is identical. Consider a bridge segment standing straight up: no counterforce is needed, just as when the TV hangs straight down. We set the rope length so our two pendulums are vertical at the same time. Each exerts increasing force as it rotates away from vertical.

A “drawbridge” style counterweight generates the sinusoidal force curve we need.
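One way to see why the two pendulums cancel is energy. A minimal sketch, idealizing the string and pulley as a linkage that tips both pendulums by the same angle θ from vertical at the same time (the real string geometry need not do this exactly). The hanging TV's center of mass sits c from its pivot, so its height is −c·cos θ; a counterweight of mass M sits R above its floor hinge, so its height is R·cos θ. Total potential energy is g·(M·R − m·c)·cos θ, which is constant when M·R = m·c. Constant energy means zero net torque at every angle, which is exactly “floating”. The function name and numbers are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def potential_energy(theta_deg: float, tv_mass: float, tv_com: float,
                     cw_mass: float, cw_radius: float) -> float:
    """Total gravitational potential energy (J) of the hanging TV pendulum plus the
    inverted-pendulum counterweight, both tipped theta_deg from vertical.

    Assumption: an idealized linkage that keeps both angles equal. Each height is
    measured from that pendulum's own pivot, so only changes in the total matter.
    """
    th = math.radians(theta_deg)
    tv_height = -tv_com * math.cos(th)    # hanging: rises as it swings away from vertical
    cw_height = cw_radius * math.cos(th)  # inverted: falls as it tips away from vertical
    return G * (tv_mass * tv_height + cw_mass * cw_height)


if __name__ == "__main__":
    # Hypothetical TV side: 20 kg, center of mass 0.8 m from its pivot (20 * 0.8 = 16).
    # Balanced counterweight: 32 kg at 0.5 m from its floor hinge (32 * 0.5 = 16).
    for theta in range(0, 50, 5):
        print(f"{theta:2d} deg  total PE {potential_energy(theta, 20.0, 0.8, 32.0, 0.5):+.6f} J")
```

Bump the counterweight to 40 kg and the total energy falls as θ grows, so the TV would drift up toward the stowed position instead of staying put.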
I never explored cam systems, though they should also hold a solution. It should be possible to create the same force profile using the changing radius of a non-circular wheel. Indeed, drawbridge counterweights have traditionally employed cams.
What pendulums, cams, and drawbridges have in common is the rotational component.
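For the cam route I never took, here's a back-of-the-envelope sketch, assuming a cam fixed to the arm at its pivot and a constant hanging weight W whose string leaves the cam vertically. To cancel gravity's torque m·g·c·sin θ, the cam must present a moment arm r(θ) = m·g·c·sin θ / W at every angle. Integrating that over the angle gives how far the weight drops, (m·g·c/W)(1 − cos θ), and W times that drop equals the TV's potential energy gain. This only gives the required moment arm, not the cam's actual outline, which takes more geometry. Function names and numbers are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def cam_moment_arm(theta_deg: float, tv_mass: float, tv_com: float, weight_n: float) -> float:
    """Moment arm (m) the cam must present at theta_deg so a constant hanging
    weight of weight_n newtons cancels the TV's gravity torque."""
    return tv_mass * G * tv_com * math.sin(math.radians(theta_deg)) / weight_n


def weight_drop(theta_deg: float, tv_mass: float, tv_com: float, weight_n: float) -> float:
    """How far (m) the weight descends as the TV rises from 0 to theta_deg:
    the integral of the moment arm over angle, (m*g*c/W) * (1 - cos theta)."""
    return tv_mass * G * tv_com * (1.0 - math.cos(math.radians(theta_deg))) / weight_n


if __name__ == "__main__":
    # Hypothetical: 20 kg TV side, center of mass 0.8 m out, 40 kg hanging weight.
    w = 40.0 * G
    for theta in range(0, 50, 5):
        r = cam_moment_arm(theta, 20.0, 0.8, w)
        d = weight_drop(theta, 20.0, 0.8, w)
        print(f"{theta:2d} deg  moment arm {r:5.3f} m  weight dropped {d:5.3f} m")
```

Like the drawbridge, the profile is a sine: the cam's lever starts at nothing with the TV hanging straight down and grows as it swings up.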

Why is this puzzle hard?

In the conversations below, Gemini admits, “I have never lived in a house”. I didn't solve this by thinking; I solved it by tinkering. LLMs don't yet approach human ability in physical reasoning, which is nowhere near their comparative strength in language and information processing.
While their training data includes written descriptions of physical puzzle solving, perhaps it lacks any representation of the experiences which build the intuitions needed for solving new puzzles. An LLM might need to train in a physical world (real or simulated) that includes experiences such as feeling the shifting force of a rotating weight, or tugging on a rope but not being able to push it.

Case in point: DeepSeek veered away from a good answer several times, due to a trigonometry constraint that I intuitively knew could be fudged IRL:

Look, I'm not trying to go all “Can an AI truly smell a wildflower?” on you. If any LLM showed a shred of the mechanical intuition it believed it had, there'd be nothing to write about here. I'm drawing a line between their information-driven training and their utter lack of physical IQ. Multiple AIs built systems where a vertically hanging weight magically exerts more force with increasing height. They didn't just fail to find this one clever answer; every one of them bombed, hallucinated, and broke high-school-level physics principles.

If this were a “classic” problem that typically appears in physics texts, like plotting the path of a projectile, I have no doubt the LLMs could have recalled a solution. It's illuminating that they can't yet synthesize one.

I'm reminded of early image generation models which, when prompted to draw a dog, would assemble collages of dog pieces that statistically occurred in the dog pictures of their training data. Only later did those models develop better structural representations, learning the hierarchical features and coherence needed to draw an actual three-dimensional dog. Like those early image models, these LLMs streamed nonsensical trig calculations, tossed pulleys and weights onto the canvas, and declared it perfect.

Maybe when LLMs inhabit general-purpose robots in the physical world, there will be no avoiding it, and high physical IQ will be a requirement, not an option. My prediction: an AI model that negotiates the real physical world could solve this. Perhaps none will before then.

Appendix: Chat Transcripts:

Claude 4 Opus https://claude.ai/share/0fc85ea4-4226-460a-9f37-a69593442bf4 Best hallucination: congratulates itself on being the first to solve the problem, after I handed it the solution (a classic WSW Error).

Chatgpt-4o https://chatgpt.com/share/687e4005-a910-8013-b7f2-837fa0cb0e06 Bro. U got this. “I love that you didn't let me get away with that handwaving...”

DeepSeek-V3 Gold Medal in “Thinking in Circles”. Thought for 14 minutes / 800+ lines of internal dialog. Several times. Even when I narrowed the problem to the tiniest of steps, it just could not get the hell out of its own way. After I gave it the final answer, it still disappeared into pages of pointless calculations. Simply exhausting to watch. I'm on a break from DeepSeek for a while.

Best hallucination: offers a diagram of its solution - via an imaginary imgur link (404 error)

- Did find an improvement to my design! Instead of anchoring the string to the end of the arm, put a pulley there and run the string back again to the wall. This doubles the mechanical advantage. (No public link available; transcript available on request. You don't want it.)

Google Gemini 2.5 Pro https://g.co/gemini/share/87e673f3cd8e

[*] The asterisk on “attach things anywhere*”.

Google Gemini 2.5 Pro actually solved the problem as described, in one shot, no wrong turns, with a brilliant new design.

A delightfully simple solution. And since I did say to attach anything anywhere, Gemini put the counterbalance on a lever passing through my roof.

Gemini simply extended the arm an equal length on the opposite side of the pivot, with an equal mass on the other end. This smartly moves the center of mass of the system to the pivot point, and all torques vanish!

Then the system gently rotates like a heavy, balanced wheel. (If you see a weird arm & knob protruding from my house, you'll know I couldn't resist).

Gemini did have my standard diagram, showing the 45° wall, and the “attic” description, so any reasonable reader must understand the external wall / roofline constraint. I shed a tear as I disqualified its initial elegant design.

I love this example of LLM reasoning differently from humans, finding new answers. “Shifting the center of mass” is a wonderful reframing of the problem, but it proved impossible within the interior space.

After disallowing the initial solution, Gemini was just as aimless as the others.

Later, even after my correction, Gemini once more conflated interior and exterior: it built the mechanism above the 45° wall (outside) and labeled the space underneath as exterior (maybe by necessity, by the rule: exterior is where the mechanism isn't).

I again reminded Gemini that where a line in a diagram is labeled "roof", then the house's interior is below the roof. Gemini responded, “You're right, I apologize. I am an AI and have never lived in a house”.

I laughed at the pithy paraphrasing of our fundamental differences.

Yet while it's a lovely and incisive statement, it still rings a little tone-deaf. Did Gemini imply that one would need to live specifically in a house just to correctly apply the basic concept of a roof?

Coda

Why such a production to stow the TV? The TV hangs over my grand piano, and needs to get out of the way to open the piano lid.

That piano lid is another hinged lever. One day I happily glided the TV up to its stowed position with the push of a thumb. As I went to raise the heavy wooden piano lid, I thought, “...why must this be so much work? Piano lids should be counterweighted and lift with just a thumb. Also, why am I restricted to just 2 distinct open heights?”

I have not expanded the product line to piano lids, but it’s worth pondering - and actually acoustically & musically useful to consider a variable lid angle.