Intro
AI models have found genuinely useful applications in legal work. They summarize, search, draft emails, prepare first-pass research memos, and help lawyers move faster through text. Law firms have spent the last two years identifying the right use cases, trying everything from workflows that proved their worth (chat, agents, research) to impressive but rarely used ones (tabular review).
And while senior leadership is often impressed by polished outputs, many firms have realized that the underlying models still lack fundamental capabilities. Legal work tolerates very few errors. Most law firms therefore still do not trust AI on the hard work, and instead use it for first-pass screening while keeping humans deeply in the loop. In some cases, this increases the work required: lawyers have to review the AI output and still form their own opinion. That gap is not a temporary inconvenience. It points to a structural problem in how legal AI is currently deployed.
The frontier models are powerful, but they do not belong to the firm. They do not learn from the firm's feedback. They do not absorb the correction that made a draft usable, nor remember why one client accepts a risk another client would reject. When the task is over, the most valuable signals disappear back into the document management system (DMS) and email chains.
Coding went through a related phase first. Early-2025 AI coding tools were impressive on benchmarks, but in realistic open-source work METR found that experienced developers were slower with the tools than without them [1]. The bottleneck was not typing code. It was review, context, quality standards, and implicit knowledge. As coding agents improved, the workflow began to change: less "help me autocomplete this function", more "produce the work and let the human review the result." Even Linus Torvalds, long skeptical of AI hype, described using AI to generate a bounded Python visualizer for a personal side project [2].
What would it take to achieve the same jump in legal?
Verticals tip only when models are trained against the real structure of the work. Coding had a vast public corpus, executable tests, and reinforcement learning signals that were easy to scale. Legal does not have the same starting conditions. The most important data is private. The right answer is often a matter of strategy, client context, risk tolerance, jurisdiction, house style, and professional judgment. Even defining a positive training signal is harder.
So the unlock for legal AI will not come from a generic frontier model alone. It will come from models that learn inside the boundary of the law firm.
Learning Is Structurally Blocked
The signals needed to improve legal AI already exist inside the law firm's work.
They appear when a partner rewrites an associate's clause. They appear in comments between associates about why a point is weak, in issue lists that distinguish theoretical risk from commercial risk, and in emails where lawyers debate how far to push a negotiation. They appear in abandoned drafts, final drafts, and the path between them.
Deriving signal from this training data is not as simple as "upload everything into a vendor system." The learning signal has to be constructed from the data, separating context-specific adaptation from generalizable learning. The difference between an answer that is technically correct and an answer a partner would send to a client is exactly this kind of signal.
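One way to picture the construction step is as a small data transformation: each reviewed piece of work becomes a preference pair (the version that was sent is preferred over the draft), and pairs are sorted into a general pool or a client-specific pool. The sketch below is illustrative only. The `Revision` and `PreferencePair` types, the `scope` label, and the pool naming are assumptions made for this example, not a description of any firm's or vendor's pipeline, and in practice deciding the scope of a correction is the hard part.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Revision:
    """One reviewed piece of work (hypothetical record shape)."""
    task: str                       # what the lawyer was asked to produce
    draft: str                      # the first version (AI or associate)
    final: str                      # the version that was actually sent
    scope: str                      # "general" or "client_specific" (assumed label)
    client_id: Optional[str] = None


@dataclass(frozen=True)
class PreferencePair:
    """Training example: for this prompt, `chosen` is preferred over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def build_pairs(revisions: List[Revision]) -> Dict[str, List[PreferencePair]]:
    """Split corrections into a general pool and per-client pools."""
    pools: Dict[str, List[PreferencePair]] = {"general": []}
    for rev in revisions:
        # A draft that was sent unchanged carries no preference signal.
        if rev.draft.strip() == rev.final.strip():
            continue
        pair = PreferencePair(prompt=rev.task, chosen=rev.final, rejected=rev.draft)
        if rev.scope == "client_specific":
            key = f"client:{rev.client_id or 'unknown'}"
        else:
            key = "general"
        pools.setdefault(key, []).append(pair)
    return pools
```

Keeping the pools separate is the point of the sketch: a correction that reflects one client's risk appetite should not be averaged into what the model learns about the firm's general standards.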
But the same data that can make legal AI valuable is the most protected knowledge in every firm.
Client confidentiality, legal privilege, data protection obligations, information barriers, and contractual limits on use all restrict what any external vendor can store, learn from, or reuse. As a result, law firms have to explicitly exclude this data from vendors' model development.
Firms also know intuitively that this knowledge has value. Law firms have never treated institutional knowledge as a commodity. A firm's judgment is taught through apprenticeship, review, correction, and repeated exposure to clients and matters. The more capable AI models become, and the more firms depend on them, the smaller a firm's advantage becomes if it relies on the same commodity intelligence as everyone else.
This is why many legal AI systems remain trapped in a shallow loop. The lawyer corrects the output, but the model does not really change. The same mistake returns next week. The vendor may offer configuration, retrieval, templates, or "skills" written in text. Those can help at the margin. They are not the same as training a model on the firm's own standards.
Owning Enables Learning
Open-weight base models increasingly provide the raw capability to approach state-of-the-art frontier performance, opening the possibility for law firms to post-train their own. A firm's corrections, rubrics, practice preferences, and client-specific standards provide the learning signal to specialize. The resulting model can run inside infrastructure the firm controls. The training data remains within the firm's boundary. The improvement loop is governed by the firm, creating an asset that compounds for the firm.
It is important that the law firm owns the model and infrastructure, but the real question is not whether the compute sits on a laptop or in a basement, private cloud, sovereign cloud, or controlled data center. The real question is: who owns the input data signals and the results of training on the firm's knowledge?
If the answer is a vendor, the firm is renting intelligence and slowly diluting its competitive moat.
An owned model can learn from partner corrections without turning privileged work into someone else's platform advantage. It can be evaluated against private practice-area rubrics. It can be trained separately for funds, disputes, M&A, regulatory, employment, or tax. It can learn that one client's preferred answer is not another client's preferred answer. It can be updated when law changes, when the firm's style changes, and when a practice group discovers a better way to work.
Most importantly, it can improve continually.
Corrections keep becoming training data, the base model can be exchanged, and the combination can be evaluated against the firm's unique standards. Over time, the model stops being a generic assistant and starts becoming a representation of the firm's accumulated judgment.
Post-Training Has Been Proven in Legal
Public legal-agent results now show the same pattern from several independent labs: open-weight models, post-trained on legal-agent tasks, can move toward frontier performance while reducing cost and improving control and auditability.
Nemotron 3 experiments on Harvey's LAB benchmark show the pattern clearly: post-training lifts open models toward closed frontier systems, first with Nemotron 3 Super and then with early Nemotron 3 Ultra results [3][4].
Source: Trajectory and Harvey LAB figures on post-training Nemotron 3 Super and Ultra; the source article reports the LAB setup and Harvey shared the Ultra early-access result [3][4].
In a related Fireworks AI experiment on a 100-task LAB slice, an open-source worker model with a frontier advisor reached stronger all-pass performance than an end-to-end Claude Opus run, at substantially lower cost. Its post-training experiments on Kimi K2.6 also moved all-pass performance from 11 of 100 tasks to 15 of 100 tasks with supervised fine-tuning, and reinforcement fine-tuning improved mean score further [5].
Source: Fireworks AI's 100-task Harvey LAB slice. Cost figures are total inference estimates for the slice and move with pricing and token mix [5].
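The difference between all-pass performance and mean score is easy to lose in prose, so here is a minimal sketch of the two metrics. It assumes a task counts as all-pass only when every rubric criterion is met, and that a task's score is the fraction of criteria met; the actual LAB scoring rules may differ, and the function names and sample data are invented for illustration.

```python
from typing import Dict, List


def all_pass_count(results: Dict[str, List[bool]]) -> int:
    """Number of tasks where every rubric criterion passed."""
    return sum(1 for checks in results.values() if checks and all(checks))


def mean_score(results: Dict[str, List[bool]]) -> float:
    """Average, over tasks, of the fraction of criteria passed (assumed definition)."""
    per_task = [sum(checks) / len(checks) for checks in results.values() if checks]
    return sum(per_task) / len(per_task) if per_task else 0.0


# Made-up rubric outcomes for three tasks.
results = {
    "task-1": [True, True, True],          # complete deliverable
    "task-2": [True, True, False],         # one material issue missed
    "task-3": [False, True, False, True],  # half the criteria met
}

print(all_pass_count(results))
print(round(mean_score(results), 3))
```

Under these assumptions only one of the three tasks is all-pass even though most individual criteria are met, which is why a training method can raise mean score noticeably while all-pass moves by only a few tasks.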
The most surprising result is that a 27B open-weight model, trained inside a legal-agent harness, entered the closed-source frontier band on LAB [6]. That implies legal capability can be moved into models that are smaller, more controllable, and more ownable than a general frontier API.
Source: Baseten and Harvey's Figure 5 on Qwen3.5-27B before and after iSFT in the compaction harness [6].
Not All Legal Models Are the Same
The experiment with the Nemotron 3 family shows one important caveat: training does not move every legal task in the same direction. In the same run, some practice areas improve sharply while others barely move or get worse [3].
Source: Trajectory behavior-analysis figure on practice-area gains after legal post-training [3].
This may mean that public legal data does not generalize evenly across the actual work of a firm. A model trained on a broad legal corpus can become better at one practice area and worse at another, because the signal that teaches one behavior can interfere with another.
For a firm, the unit of specialization is smaller than "legal." It is practice area, matter type, client posture, drafting convention, risk appetite, and sometimes partner preference. Funds work does not need the same model behavior as litigation. Tax does not need the same reasoning loop as M&A. Regulatory advice for one client may reward caution where another client's commercial context rewards speed. Legal models need continuous evaluation: it shows when training goals interfere and a task deserves its own model, and when training should be pooled for generalization.
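As a rough sketch of what such an evaluation loop checks, the snippet below compares per-practice-area scores before and after a training run and flags the areas that regressed. The function names, the tolerance parameter, and all scores are hypothetical; a real setup would use the practice group's own rubrics and account for run-to-run noise before deciding anything.

```python
from typing import Dict, List


def practice_deltas(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Score change per practice area, for areas evaluated in both runs."""
    return {area: after[area] - before[area] for area in before if area in after}


def split_candidates(deltas: Dict[str, float], tolerance: float = 0.0) -> List[str]:
    """Areas that got worse by more than `tolerance`: candidates for their own model."""
    return sorted(area for area, delta in deltas.items() if delta < -tolerance)


# Illustrative numbers only, not results from any benchmark.
before = {"funds": 0.40, "tax": 0.50, "m&a": 0.30}
after = {"funds": 0.55, "tax": 0.42, "m&a": 0.30}

deltas = practice_deltas(before, after)
regressed = split_candidates(deltas, tolerance=0.02)
```

In this made-up example the pooled run helps funds, leaves M&A flat, and hurts tax, so tax is the area where interference is suspected and separate training might be tested.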
Conclusion
A firm should train models for the practices where it has deep private signal, evaluate each against that practice group's rubrics, and route work through an orchestrator that selects the right model, calls frontier advisors when needed, and records corrections back into the right learning loop. The resulting system should still present itself as "one AI" through orchestration, but underneath it can be a set of specialized experts.
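The routing idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not a reference architecture: models are represented as plain callables, "when needed" is read narrowly as "no specialized expert exists for this practice area", and the `Orchestrator` class and its method names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A model is stood in for by any function from task text to output text.
Model = Callable[[str], str]


@dataclass
class Orchestrator:
    experts: Dict[str, Model]   # one specialized model per practice area
    advisor: Model              # fallback frontier model
    corrections: Dict[str, List[Tuple[str, str, str]]] = field(default_factory=dict)

    def run(self, practice_area: str, task: str) -> Tuple[str, str]:
        """Return (route taken, output). Falls back to the advisor if no expert exists."""
        expert = self.experts.get(practice_area)
        if expert is None:
            return "advisor", self.advisor(task)
        return practice_area, expert(task)

    def record_correction(self, practice_area: str, task: str,
                          output: str, corrected: str) -> None:
        """Store a lawyer's correction in that practice area's own learning loop."""
        self.corrections.setdefault(practice_area, []).append((task, output, corrected))
```

The caller sees a single entry point, while corrections accumulate per practice area so each expert can later be retrained and evaluated on its own signal.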
Legal AI is not solved, and there are real advantages to moving early. LAB itself exists because legal work is still hard to measure: long-horizon tasks, closed-universe matter files, expert-written rubrics, and all-pass grading where missing one material issue can make a deliverable incomplete. That is exactly why the results matter. They show that progress in legal AI depends on domain-specific evaluation, domain-specific training signal, and models that can be improved against that signal.
Eigenwelt Labs helps law firms train specialized models with state-of-the-art research methods. All data and resulting models stay with the law firm. Anyone interested in piloting model development in a practice area or contributing to our research can reach out at chris@eigenweltlabs.com.
References
1. METR, Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.
2. Linus Torvalds, AudioNoise README.
3. Trajectory, The Sovereign Path for Continual Learning: Early Results on Harvey LAB with NVIDIA Nemotron.
4. Harvey, Nemotron 3 Ultra early LAB result.
5. Fireworks AI, Open-source agents with frontier advisors.
6. Baseten Research, Post-training frontier legal agents with Baseten Research.