Intro

AI models have found genuinely useful applications in legal work. They summarize, search, draft emails, prepare first-pass research memos, and help lawyers move faster through text. Law firms have spent the last two years identifying the right use cases. Lawyers have tried everything from genuinely useful workflows (chat, agents, research) to impressive but rarely used ones (tabular review).

And while senior leadership is often impressed by polished outputs, many firms have realized that the underlying models still lack fundamental capabilities. Legal work tolerates very few errors. Most law firms therefore still do not trust AI on the hard work, and instead use it for first-pass screening while keeping humans deeply in the loop. In some cases, this increases the work required: lawyers have to review the AI output and still form their own opinion. That gap is not a temporary inconvenience. It points to a structural problem in how legal AI is currently deployed.

The frontier models are powerful, but they do not belong to the firm. They do not learn from the firm's feedback. They do not absorb the correction that made a draft usable, nor remember why one client accepts a risk another client would reject. When the task is over, the most valuable signals disappear back into the DMS and email chains.

Coding went through a related phase first. In early 2025, AI coding tools were impressive on benchmarks, but in realistic open-source work METR found that experienced developers were slower with the tools than without them.[1] The bottleneck was not typing code. It was review, context, quality standards, and implicit knowledge. As coding agents improved, the workflow began to change: less "help me autocomplete this function", more "produce the work and let the human review the result." Even Linus Torvalds, long skeptical of AI hype, described using AI to generate a bounded Python visualizer for a personal side project.[2]

What would it take to achieve the same jump in legal?

Verticals only tip when models are trained against the real structure of the work. Coding had a vast public corpus, executable tests, and reinforcement learning signals that were easy to scale. Legal does not have the same starting conditions. The most important data is private. The right answer is often a matter of strategy, client context, risk tolerance, jurisdiction, house style, and professional judgment. Even defining a positive training signal is harder.

So the unlock for legal AI will not come from a generic frontier model alone. It will come from models that learn inside the boundary of the law firm.

Learning Is Structurally Blocked

The signals needed to improve legal AI already exist inside the law firm's work.

They appear when a partner rewrites an associate's clause. They appear in comments between associates about why a point is weak, in issue lists that distinguish theoretical risk from commercial risk, and in emails where lawyers debate how far to push a negotiation. They appear in abandoned drafts, final drafts, and the path between them.

Deriving signal from this training data is not as simple as "upload everything into a vendor system." The learning signal has to be constructed from the data, separating context-specific adaptation from generalizable learning. The difference between an answer that is technically correct and an answer a partner would send to a client is exactly this kind of signal.
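One way to picture "constructing" a signal, as opposed to uploading raw documents, is a preference record built from a partner's rewrite. The sketch below is illustrative only: the `CorrectionSignal` class, its field names, and the two `scope` labels are hypothetical and not taken from any firm's system or vendor API.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional, Tuple

# Hypothetical labels: does the correction reflect one client's context,
# or a standard the firm would apply on every matter?
Scope = Literal["client_specific", "firm_wide"]


@dataclass(frozen=True)
class CorrectionSignal:
    prompt: str      # the task the associate or model was given
    rejected: str    # the draft before the partner's edit
    chosen: str      # the draft after the partner's edit
    scope: Scope     # context-specific adaptation vs. generalizable learning
    rationale: str = ""  # optional reviewer comment explaining the change


def build_signal(prompt: str, before: str, after: str,
                 scope: Scope, rationale: str = "") -> Optional[CorrectionSignal]:
    """Return a preference pair, or None when the edit changed nothing."""
    if before.strip() == after.strip():
        return None
    return CorrectionSignal(prompt, before, after, scope, rationale)


def split_by_scope(
    signals: List[CorrectionSignal],
) -> Tuple[List[CorrectionSignal], List[CorrectionSignal]]:
    """Separate firm-wide signals from client-specific ones."""
    firm_wide = [s for s in signals if s.scope == "firm_wide"]
    client_specific = [s for s in signals if s.scope == "client_specific"]
    return firm_wide, client_specific
```

In practice, deciding the `scope` of a correction is itself a judgment call; the sketch only shows where that decision would be recorded.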

But the same data that can make legal AI valuable is the most protected knowledge in every firm.

Client confidentiality, legal privilege, data protection obligations, information barriers, and contractual limits on use all restrict what any external vendor can store, learn from, or reuse. Law firms have to explicitly exclude this data from model development.

They also know intuitively that this knowledge has value. Law firms have never treated institutional knowledge as a commodity. A firm's judgment is taught through apprenticeship, review, correction, and repeated exposure to clients and matters. The more capable AI models become, and the more firms depend on them, the smaller a firm's advantage becomes if it differentiates only with the same commodity intelligence as everyone else.

This is why many legal AI systems remain trapped in a shallow loop. The lawyer corrects the output, but the model does not really change. The same mistake returns next week. The vendor may offer configuration, retrieval, templates, or "skills" written in text. Those can help at the margin. They are not the same as training a model on the firm's own standards.

Owning Enables Learning

Open-weight base models increasingly provide the raw capability to approach state-of-the-art frontier performance, opening the possibility for law firms to post-train their own models. A firm's corrections, rubrics, practice preferences, and client-specific standards provide the learning signal to specialize. The resulting model can run inside infrastructure the firm controls. The training data remains within the firm's boundary. The improvement loop is governed by the firm, creating an asset that compounds for the firm.

It is important that the law firm owns the model and infrastructure, but the real question is not whether the compute sits on a laptop, in a basement, or in a private cloud, sovereign cloud, or controlled data center. The real question is: who owns the input data signals and the results of training on the firm's knowledge?

If the answer is a vendor, the firm is renting intelligence and slowly diluting its competitive moat.

An owned model can learn from partner corrections without turning privileged work into someone else's platform advantage. It can be evaluated against private practice-area rubrics. It can be trained separately for funds, disputes, M&A, regulatory, employment, or tax. It can learn that one client's preferred answer is not another client's preferred answer. It can be updated when law changes, when the firm's style changes, and when a practice group discovers a better way to work.

Most importantly, it can improve continually.

Corrections keep becoming training data, the base model can be exchanged, and the combination can be evaluated against the firm's unique standards. Over time, the model stops being a generic assistant and starts becoming a representation of the firm's accumulated judgment.

Post-Training Has Been Proven in Legal

Public legal-agent results now show the same pattern from several independent labs: open-weight models, post-trained on legal-agent tasks, can move toward frontier performance while reducing cost and improving control and auditability.

Nemotron 3 experiments on Harvey's LAB benchmark show the pattern clearly: post-training lifts open models toward closed frontier systems, first with Nemotron 3 Super and then with early Nemotron 3 Ultra results.[3][4]

Figure: Post-training Nemotron 3 Ultra matches the frontier. Rubric pass rate (rubric criteria passed) across 120 held-out Harvey LAB tasks: GPT-OSS 120B 22%; Nemotron 3 Super (base) 42%; Nemotron 3 Ultra (base) 52%; Trajectory Nemotron 3 Super (post-trained) 67%; GPT-5.5 78%; Trajectory Nemotron 3 Ultra (post-trained) 81%; Sonnet 4.6 85%; Opus 4.6 85%.

Source: Trajectory and Harvey LAB figures on post-training Nemotron 3 Super and Ultra; the source article reports the LAB setup and Harvey shared the Ultra early-access result.[3][4]

In a related Fireworks AI experiment on a 100-task LAB slice, an open-source worker model with a frontier advisor reached stronger all-pass performance than an end-to-end Claude Opus run, at substantially lower cost. Its post-training experiments on Kimi K2.6 also moved all-pass performance from 11 of 100 tasks to 15 of 100 tasks with supervised fine-tuning, and reinforcement fine-tuning improved mean score further.[5]

Figure: Open-source worker plus frontier advisor beats Opus on cost and quality. All-pass tasks (out of 100) against total cost across the 100 tasks: Claude Opus 4.7, 14 at $954; GPT-5.5, 11 at $560; GLM 5.1, 12 at $121; Kimi K2.6, 11 at $75; Kimi K2.6 + SFT, 15 at $84; GLM 5.1 + Opus 4.7 advisor, 18 at $368 (+4 tasks and -$586 relative to Claude Opus 4.7; +6 tasks relative to GLM 5.1 alone).

Source: Fireworks AI's 100-task Harvey LAB slice. Cost figures are total inference estimates for the slice and move with pricing and token mix.[5]
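The trade-off in the figure can be checked with a few lines of arithmetic. The sketch below uses only the all-pass counts and total costs quoted above; "cost per all-pass task" is a derived ratio introduced here for illustration, not a metric reported by the source, and it inherits the caveat that the cost figures are estimates.

```python
# (all-pass tasks out of 100, total cost in USD), as quoted for the 100-task slice.
runs = {
    "Claude Opus 4.7": (14, 954),
    "GPT-5.5": (11, 560),
    "GLM 5.1": (12, 121),
    "Kimi K2.6": (11, 75),
    "Kimi K2.6 + SFT": (15, 84),
    "GLM 5.1 + Opus 4.7 advisor": (18, 368),
}


def cost_per_all_pass(all_pass: int, cost_usd: float) -> float:
    """Derived ratio (an assumption of this sketch): dollars spent per fully passed task."""
    return cost_usd / all_pass


def compare(a: str, b: str) -> tuple:
    """Return (extra all-pass tasks, cost change in USD) of run `a` relative to run `b`."""
    (pass_a, cost_a), (pass_b, cost_b) = runs[a], runs[b]
    return pass_a - pass_b, cost_a - cost_b


# Cheapest per fully passed task first.
for name, (n, cost) in sorted(runs.items(), key=lambda kv: cost_per_all_pass(*kv[1])):
    print(f"{name:28s} {n:2d} all-pass  ${cost:4d}  ${cost_per_all_pass(n, cost):6.2f} per all-pass task")

# The advisor setup against end-to-end Opus: 4 more tasks for $586 less.
print(compare("GLM 5.1 + Opus 4.7 advisor", "Claude Opus 4.7"))  # (4, -586)
```

On this ratio the fine-tuned Kimi K2.6 run is the cheapest per fully passed task, while the advisor setup has the highest absolute all-pass count; which matters more depends on the firm's tolerance for incomplete deliverables.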

The most surprising result is that a 27B open-weight model, trained inside a legal-agent harness, entered the closed-source frontier band on LAB.[6] That implies legal capability can be moved into models that are smaller, more controllable, and more ownable than a general frontier API.

Figure: Post-training Qwen3.5-27B enters the frontier band on Harvey LAB. Qwen3.5-27B before and after iSFT, compared with closed frontier models in the same harness. Criterion pass rate: Qwen3.5-27B standard 76.1%; Qwen3.5-27B iSFT in harness 91.2%; Claude Sonnet 4.6 in harness 92.4%; GPT-5.5 in harness 91.5%. All-pass rate: Qwen3.5-27B standard 0.0%; Qwen3.5-27B iSFT in harness 17.1%; Claude Sonnet 4.6 in harness 19.5%; GPT-5.5 in harness 15.7%.

Source: Baseten and Harvey's Figure 5 on Qwen3.5-27B before and after iSFT in the compaction harness.[6]
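The gap between a 76.1% criterion pass rate and a 0% all-pass rate is easier to see with a toy calculation. The sketch below is a minimal illustration with made-up data; it assumes the criterion pass rate is pooled over all criteria and that all-pass means every criterion on a task passed, which may differ in detail from how LAB aggregates scores.

```python
from typing import List


def criterion_pass_rate(results: List[List[bool]]) -> float:
    """Share of individual rubric criteria passed, pooled over all tasks."""
    total = sum(len(task) for task in results)
    passed = sum(sum(task) for task in results)
    return passed / total


def all_pass_rate(results: List[List[bool]]) -> float:
    """Share of tasks in which every rubric criterion passed."""
    return sum(all(task) for task in results) / len(results)


# Toy data (hypothetical): 4 tasks with 5 criteria each,
# where every task misses exactly one criterion.
toy = [
    [True, True, True, True, False],
    [True, True, True, False, True],
    [True, False, True, True, True],
    [False, True, True, True, True],
]

print(criterion_pass_rate(toy))  # 0.8
print(all_pass_rate(toy))        # 0.0
```

A model can therefore satisfy most criteria and still never produce a complete deliverable, which is why the two metrics are reported side by side.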

The experiment with the Nemotron 3 family shows one important caveat: training does not move every legal task in the same direction. In the same run, some practice areas improve sharply while others barely move or get worse.[3]

Figure: Where training moves the needle. Rubric pass rate before (base) and after training, by practice area, in two panels. Top gains: Environmental ESG, Banking Finance, Bankruptcy Restructuring, Immigration, Funds Asset Management (the panel carries a "+33 pts" annotation). Least gains: Tax, Litigation Dispute Resolution, Corporate M&A, Emerging Companies Venture Capital, Capital Markets.

Source: Trajectory behavior-analysis figure on practice-area gains after legal post-training.[3]

This may mean that public legal data does not generalize evenly across the actual work of a firm. A model trained on a broad legal corpus can become better at one practice area and worse at another, because the signal that teaches one behavior can interfere with another.

For a firm, the unit of specialization is smaller than "legal." It is practice area, matter type, client posture, drafting convention, risk appetite, and sometimes partner preference. Funds work does not need the same model behavior as litigation. Tax does not need the same reasoning loop as M&A. Regulatory advice for one client may reward caution where another client's commercial context rewards speed. Legal models therefore need continuous evaluation, which shows when training goals interfere and a task deserves its own model, and when training can be pooled for generalization.
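A minimal sketch of what such an evaluation check could look like follows. Everything in it is an assumption made for illustration: the pass rates are invented, the `tolerance` threshold is arbitrary, and the function names are hypothetical rather than part of any benchmark or vendor tooling.

```python
from typing import Dict, List


def classify_areas(base: Dict[str, float], trained: Dict[str, float],
                   tolerance: float = 0.02) -> Dict[str, str]:
    """Label each practice area by how pooled training moved its rubric pass rate."""
    labels = {}
    for area, before in base.items():
        delta = trained[area] - before
        if delta > tolerance:
            labels[area] = "improved"
        elif delta < -tolerance:
            labels[area] = "regressed"
        else:
            labels[area] = "flat"
    return labels


def candidates_for_own_model(labels: Dict[str, str]) -> List[str]:
    """Areas that got worse under pooled training may deserve a separate model."""
    return sorted(area for area, label in labels.items() if label == "regressed")


# Hypothetical pass rates, not taken from any published result.
base = {"Funds": 0.40, "Tax": 0.55, "Corporate M&A": 0.60}
trained = {"Funds": 0.73, "Tax": 0.50, "Corporate M&A": 0.61}

labels = classify_areas(base, trained)
print(labels)
print(candidates_for_own_model(labels))  # ['Tax']
```

A real decision would also weigh sample sizes and run-to-run variance before splitting a practice area off into its own model.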

Conclusion

A firm should train models for the practices where it has deep private signal, evaluate each against that practice group's rubrics, and route work through an orchestrator that selects the right model, calls frontier advisors when needed, and records corrections back into the right learning loop. The resulting system should still present itself as "one AI" through orchestration, but underneath it can be a set of specialized experts.
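The routing idea can be sketched in a few lines. This is a hedged illustration under stated assumptions, not a description of any existing product: the `Orchestrator` class, its method names, and the way advisor guidance is appended to the task are all hypothetical, and each "model" is reduced to a plain function from text to text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# A "model" is modeled here as any function from task text to output text.
Model = Callable[[str], str]


@dataclass
class Orchestrator:
    experts: Dict[str, Model]          # practice area -> specialized model
    fallback: Model                    # general model for unrouted areas
    advisor: Optional[Model] = None    # optional frontier advisor
    corrections: Dict[str, List[Tuple[str, str, str]]] = field(default_factory=dict)

    def route(self, practice_area: str) -> Model:
        """Pick the specialized model if one exists, else the fallback."""
        return self.experts.get(practice_area, self.fallback)

    def run(self, practice_area: str, task: str, needs_advice: bool = False) -> str:
        """Produce a draft, optionally asking the advisor for guidance first."""
        if needs_advice and self.advisor is not None:
            guidance = self.advisor(task)
            task = f"{task}\n\nAdvisor guidance: {guidance}"
        return self.route(practice_area)(task)

    def record_correction(self, practice_area: str, task: str,
                          draft: str, corrected: str) -> None:
        """File a lawyer's correction under the practice area it belongs to."""
        self.corrections.setdefault(practice_area, []).append((task, draft, corrected))
```

From the lawyer's side this still looks like one assistant; the per-area `corrections` store is what lets each expert be retrained on its own signal.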

Legal AI is not solved, and there are real advantages to moving early. LAB itself exists because legal work is still hard to measure: long-horizon tasks, closed-universe matter files, expert-written rubrics, and all-pass grading where missing one material issue can make a deliverable incomplete. That is exactly why the results matter. They show that progress in legal AI depends on domain-specific evaluation, domain-specific training signal, and models that can be improved against that signal.

Eigenwelt Labs helps law firms train specialized models with state-of-the-art research methods. All data and resulting models stay with the law firm. Anyone interested in piloting model development in a practice area or contributing to our research can reach out at chris@eigenweltlabs.com.

References