The worst thing about trying to beat the market is that the boring people are probably right. Buy the S&P 500. Keep buying it. Do nothing heroic. Touch grass.
I believe this advice. I also do not think I am secretly Jim Simons.
And yet, because the market has not personally humbled me enough, I still had the thought: what if I built something that could do slightly better?
Odds are it won't. More likely, I'll have fun trying. That is what led me to build Balance Wheel, my always-on trading agent.
I still think the sensible thing would have been to buy the index and close my laptop. Instead, I built a tiny hedge fund run by LLMs.
I built Balance Wheel around three analyst agents, one portfolio manager agent, and a persistent memory layer that acts like a trading journal. The goal was to set it up like a real hedge fund, where analyst-level expertise is separated from the portfolio manager who makes the final trading decisions.
The analysts specialize in different sectors. Right now, they are split into tech, energy, and healthcare. Each one has its own coverage universe, market context, and decision framework. The job of each analyst is to produce a recommendation with reasoning, confidence, risks, and suggested sizing.
The portfolio manager is the overall operator. It takes in all of the analyst recommendations, understands the current portfolio position, looks at the broader market context, and then makes a decision for the portfolio.
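To make that handoff concrete, here is roughly the shape of what an analyst passes up to the portfolio manager. This is a sketch; the field names are illustrative, not Balance Wheel's exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class Recommendation:
    """One analyst's call: the handoff described above."""
    analyst: str                      # "tech", "energy", or "healthcare"
    ticker: str
    action: str                       # "buy", "sell", or "hold"
    reasoning: str                    # the written case for the trade
    confidence: float                 # 0.0 to 1.0
    risks: list[str] = field(default_factory=list)
    suggested_size_pct: float = 0.0   # suggested sizing, as % of portfolio

# The portfolio manager receives every analyst's Recommendation at once,
# along with the current portfolio and broader market context, and turns
# them into a single set of proposed orders.
```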
Balance Wheel (main flow)
The main flow makes the trade; the memory layer underneath keeps a persistent record of what happened and feeds that context into the next run.
Inputs: market data, news, and the current portfolio.
Portfolio manager: turns analyst opinions into an actual portfolio decision.
Risk engine: checks sizing, concentration, cash reserves, and trade limits.
Execution: approved trades are placed and portfolio outcomes are recorded.
Memory layer
This is the persistent memory of the system: what happened, what worked, and what should carry into the next run.
Outcomes: captures what was bought, sold, rejected, and how the portfolio actually moved.
Review (day T): looks back at prior calls, scores them, and asks whether the agents were actually right.
Journal (day T+1): records the full running memory of the system: recent trades, prior recommendations, critiques, patterns, and useful lessons from earlier runs.
The diagram above gets most of the system across. I'm not going to go box by box. All the code and prompts are here if you are curious.
The three parts that felt most important to get right were the guardrails, the memory, and the way I evaluate the whole thing. Together they are what make this more than prompting an LLM: the agentic system has roles, memory, constraints, execution logic, and an audit trail.
LLMs can go rogue, and I wanted a good set of guardrails to prevent that. My choice was deliberately boring: a deterministic risk engine.
The agents can recommend trades, but they do not get the final say. Before anything goes through, the proposed orders pass through a deterministic Python risk engine. It checks position size, sector exposure, cash reserves, number of trades, and total cash deployed in a single cycle. All of these are based on a set of predetermined rules.
This part of the system is intentionally unglamorous and deliberately not autonomous. It keeps the whole system on the rails: the agents reason through messy market information, and the risk engine decides whether a trade is allowed.
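To show what I mean by deterministic, here is a minimal sketch of a risk engine in this spirit. The thresholds and the order and portfolio shapes are made up for illustration; Balance Wheel's actual rules live in the repo.

```python
# A minimal sketch of the deterministic-risk-engine idea. The thresholds
# and the order/portfolio shapes are made up for illustration.

MAX_POSITION_PCT = 0.10       # no single position above 10% of the portfolio
MAX_SECTOR_PCT = 0.30         # no sector above 30%
MIN_CASH_RESERVE_PCT = 0.05   # always keep at least 5% in cash
MAX_TRADES_PER_CYCLE = 5      # at most 5 trades per run
MAX_DEPLOYED_PCT = 0.25       # at most 25% of portfolio value spent per run

def check_orders(orders: list[dict], portfolio: dict) -> tuple[list, list]:
    """Pure rules, no LLM: return (approved, rejected-with-reason) lists."""
    # (Sells would relax these checks; this sketch only gates buys.)
    approved, rejected = [], []
    deployed = 0.0
    total = portfolio["total_value"]
    for order in orders:
        cost = order["shares"] * order["price"]
        position_after = portfolio["positions"].get(order["ticker"], 0.0) + cost
        sector_after = portfolio["sectors"].get(order["sector"], 0.0) + cost
        cash_after = portfolio["cash"] - deployed - cost
        if len(approved) >= MAX_TRADES_PER_CYCLE:
            rejected.append((order, "trade limit reached"))
        elif position_after > MAX_POSITION_PCT * total:
            rejected.append((order, "position too large"))
        elif sector_after > MAX_SECTOR_PCT * total:
            rejected.append((order, "sector too concentrated"))
        elif cash_after < MIN_CASH_RESERVE_PCT * total:
            rejected.append((order, "would breach cash reserve"))
        elif deployed + cost > MAX_DEPLOYED_PCT * total:
            rejected.append((order, "cycle deployment limit hit"))
        else:
            approved.append(order)
            deployed += cost
    return approved, rejected
```

Because this layer is fixed thresholds and plain arithmetic, the same proposed orders always get the same verdicts, no matter how persuasive the agents' reasoning reads.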
The second thing I wanted was continuity. What good is an agent that starts from scratch every time it runs? There is a ton of value in carrying forward context: what the system previously believed, what surprised it, where it was too early, and where it was overconfident.
This is also how humans operate: we naturally build context over time. I remember what I thought last week and how it held up, and that running context shapes what I do next.
I gave the agent a lightweight version of this: think of it as an investment journal for the agent.
After each run, the system records what happened: the recommendations, the portfolio manager's decisions, the trades that went through, the trades that were rejected, and how the portfolio moved. The reviewer then looks back at prior calls and writes down what seemed useful, what was wrong, and what patterns are starting to show up.
All of this gets fed into future runs, so the agents are not starting from zero every day. The tech analyst can carry forward lessons from prior tech calls. The portfolio manager can see where previous sizing decisions helped or hurt. The system starts to build a track record for itself.
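Concretely, a single run's journal entry might look something like this. The shape is illustrative, not the system's exact record format.

```python
# Roughly what one run's journal entry carries forward. The exact record
# format in Balance Wheel differs; this is the idea, not the schema.
run_record = {
    "date": "YYYY-MM-DD",
    "recommendations": [],            # each analyst's call, with reasoning
    "pm_decisions": [],               # what the portfolio manager tried to do
    "executed": [],                   # trades that cleared the risk engine
    "rejected": [],                   # trades it blocked, and why
    "portfolio_change": {},           # how the portfolio actually moved
    "review": {
        "scored_calls": [],           # prior calls graded against outcomes
        "lessons": [],                # e.g. "too early on X twice; size down"
    },
}
# Relevant slices of this record are injected into each agent's context on
# the next run, so nobody starts from zero.
```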
Finally, the all-important question: does this thing actually work?
There are two parts to that question.
The first is whether the agent system works mechanically. Does the whole loop run end to end? Do the analysts produce useful recommendations? Does the portfolio manager make sensible decisions from those recommendations? Does the risk engine catch things it should catch? Can I look back across runs and understand why the system did what it did?
The second question is the obvious one: is it making money?
I track both.
On the system side, every run leaves a full audit trail. I can see what each analyst recommended, what the portfolio manager tried to do, what the risk engine rejected, what actually got executed, and how the portfolio changed afterward. That helps me debug the agent instead of just looking at a final return number and guessing what happened.
On the performance side, I track portfolio value, P&L versus the S&P 500, current holdings, and executed trades. This helps me figure out how it's actually performing.
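That comparison is just arithmetic, which is part of why it is a useful anchor. A minimal sketch, with illustrative numbers and function names of my own:

```python
# The benchmark check: portfolio return minus what the same starting cash
# would have returned sitting in the index over the same period.

def excess_return(portfolio_value: float, start_value: float,
                  index_now: float, index_start: float) -> float:
    """Return vs. buy-and-hold on the index over the same period."""
    portfolio_return = portfolio_value / start_value - 1.0
    index_return = index_now / index_start - 1.0
    return portfolio_return - index_return

# Portfolio up 4.0% while the index is up 5.5% -> -1.50%: the boring
# people were right, so far.
print(f"{excess_return(104_000, 100_000, 5_275.0, 5_000.0):+.2%}")
```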
The long-term goal is to get this closer to having a self-improvement loop. The nice thing about trading is that the feedback signal is real. You can write down whatever rationale you want, but eventually the portfolio either performs or it does not. That makes it possible to keep tightening the system against something concrete.
Now that you've made it this far, I'll let you in on a secret: I'm still paper trading while I iterate on the system, before I unleash real money on it.
Stay tuned for when I actually do that.