Verifier-backed workflow data for CUA

A verifier-backed workflow package gives labs a way to inspect whether a task is solvable, hard, reusable, and worth training on.

Verifier-backed workflow data is the difference between a useful CUA training package and a pile of recordings.

Computer-use agents (CUAs) operate in environments where success can be subtle. An agent may create a record but enter the wrong price. It may upload the right file to the wrong form. It may send an email with the correct intent but omit the required attachment.

If the verifier only checks that something happened, the model can learn the wrong behavior. If the verifier is too strict, it can reject valid solutions. Both cases damage the training signal.
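
To make the two failure modes concrete, here is a minimal sketch of three verifiers for a hypothetical "create an invoice" task. The `Invoice` type, its fields, and the verifier names are assumptions made for illustration, not part of any real package format.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Invoice:
    # Hypothetical record type for an illustrative "create an invoice" task.
    customer: str
    price: Decimal
    note: str = ""


def lax_verifier(records):
    # Too lax: passes as soon as any record exists, whatever its contents.
    return len(records) > 0


def strict_verifier(records, expected):
    # Too strict: demands exact equality, including the free-text note.
    return records == [expected]


def state_verifier(records, expected):
    # Checks only the fields the task instruction actually specifies.
    return any(
        r.customer == expected.customer and r.price == expected.price
        for r in records
    )


expected = Invoice("Acme", Decimal("19.99"))
wrong_price = [Invoice("Acme", Decimal("9.99"))]
valid_variant = [Invoice("Acme", Decimal("19.99"), note="created by agent")]

print(lax_verifier(wrong_price))                 # accepts a failed task
print(strict_verifier(valid_variant, expected))  # rejects a valid solution
print(state_verifier(wrong_price, expected))
print(state_verifier(valid_variant, expected))
```

The lax check rewards the wrong price, and the strict check penalizes a harmless note. The state check sits between them because it only tests what the instruction asked for.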

Verifiers are part of the data product

For CUA data, the verifier is not an implementation detail. It is part of the product a lab is buying.

A lab needs to inspect how success is measured. That means a workflow package should describe the expected state, the verifier logic, known edge cases, and any manual review process used to audit the verifier.

The verifier does not have to be perfect. It has to be explicit, testable, and measured.

What should be audited?

Verifier audits should include false positives, false negatives, edge cases, and reward-hacking loopholes.

A false positive means the verifier accepts a failed task. For reinforcement learning (RL), this is dangerous because the model can be rewarded for behavior that only satisfies the checker.

A false negative means the verifier rejects a successful task. This can hide useful model behavior and make the environment look harder than it is.
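
Both error rates can be measured by comparing verifier verdicts with human review on a sample of runs. The sketch below assumes each audited run is recorded as a pair of booleans; the function name and input shape are hypothetical.

```python
def audit_rates(samples):
    """Compare verifier verdicts against human labels.

    samples: iterable of (verifier_passed, human_says_success) booleans.
    """
    samples = list(samples)
    false_pos = sum(1 for v, h in samples if v and not h)
    false_neg = sum(1 for v, h in samples if not v and h)
    failures = sum(1 for _, h in samples if not h)
    successes = sum(1 for _, h in samples if h)
    return {
        # Share of truly failed runs that the verifier accepted.
        "false_positive_rate": false_pos / failures if failures else 0.0,
        # Share of truly successful runs that the verifier rejected.
        "false_negative_rate": false_neg / successes if successes else 0.0,
    }


# Illustrative audit of five runs (made-up labels, not real data).
audit = [(True, True), (True, False), (False, False), (False, True), (True, True)]
rates = audit_rates(audit)
print(rates)
```

Reporting the two rates separately matters because they damage the training signal in different ways, as the paragraphs above describe.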

A good workflow package should also include active testing: can a human solve it, can a base model solve it, is the instruction ambiguous, is the task too easy, is the reward too sparse, and does the task break under normal UI changes?

Pass@k tells the difficulty story

Pass@k distributions help show whether a task is useful.

If every model solves a task on the first attempt, the task may be too easy for frontier post-training. If no model ever gets close, the reward signal may be too sparse for a first training run.

The interesting region is where models fail in repeatable ways and training data can reduce those failures on held-out variants.

That is why CUA workflow data should include:

pass@1, pass@3, and pass@5 across multiple models
failure traces from unsuccessful runs
examples of human correction
held-out variants that are not used for training
verifier audit notes
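
The article does not say how pass@k is computed. One widely used choice, assumed here, is the unbiased estimator from n sampled attempts of which c succeed: pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch:

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n attempts with c verified successes.

    Estimates the probability that at least one of k attempts, drawn
    without replacement from the n recorded attempts, is a success.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 attempts on one task, 3 passed the verifier.
for k in (1, 3, 5):
    print(k, round(pass_at_k(10, 3, k), 4))
```

Computing pass@1, pass@3, and pass@5 from the same pool of attempts keeps the three numbers comparable for a given task and model.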

Contamination control matters

Frontier labs also need to know where the data came from.

If an eval task is copied from a public benchmark or leaks into the train split, the eval result no longer shows that the model generalizes. If the same task is sold to multiple customers without isolation, it becomes harder to show that one customer's held-out tasks have not been trained on elsewhere.

Workflow packages should track source, version, train and eval split, customer isolation, and whether the task resembles public benchmarks.
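
A minimal sketch of what that tracking could look like, assuming one record per task assignment. The `TaskRecord` fields mirror the list above, but the field names and the checks themselves are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    # Hypothetical provenance record; field names are illustrative.
    task_id: str
    source: str
    version: str
    split: str  # "train" or "eval"
    customer: str
    resembles_public_benchmark: bool = False


def contamination_issues(records):
    """Return a sorted list of human-readable contamination warnings."""
    splits, customers, issues = {}, {}, set()
    for r in records:
        splits.setdefault(r.task_id, set()).add(r.split)
        customers.setdefault(r.task_id, set()).add(r.customer)
        if r.split == "eval" and r.resembles_public_benchmark:
            issues.add(f"{r.task_id}: eval task resembles a public benchmark")
    for task_id, seen in splits.items():
        if {"train", "eval"} <= seen:
            issues.add(f"{task_id}: appears in both train and eval splits")
    for task_id, seen in customers.items():
        if len(seen) > 1:
            issues.add(f"{task_id}: shared across customers {sorted(seen)}")
    return sorted(issues)


records = [
    TaskRecord("t1", "recorded", "v1", "train", "lab-a"),
    TaskRecord("t1", "recorded", "v1", "eval", "lab-a"),
    TaskRecord("t2", "recorded", "v1", "eval", "lab-a"),
    TaskRecord("t2", "recorded", "v1", "eval", "lab-b"),
    TaskRecord("t3", "recorded", "v2", "train", "lab-a"),
]
for issue in contamination_issues(records):
    print(issue)
```

Checks like these only catch exact task reuse. Judging whether a task resembles a public benchmark still needs review, which is why it is stored here as a flag rather than computed.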

Is verifier-backed data only for RL?

No. The same package can support evaluation, supervised fine-tuning (SFT), RL, and model comparison. The verifier makes the task repeatable, because every run is scored against the same success criterion.

Can LLM judges be used as verifiers?

They can help, but they should not be the only check for most desktop workflows. Programmatic state checks, file checks, UI checks, and artifact checks are usually stronger.
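
As an illustration of a programmatic file check, the sketch below verifies an exported CSV by inspecting the artifact itself instead of asking a judge model to read a screenshot. The task, file name, column names, and function name are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def verify_export(path, required_columns, expected_rows):
    """Check that an exported CSV exists, has the columns, and the row count."""
    path = Path(path)
    if not path.is_file():
        return False, "file missing"
    with path.open(newline="") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [c for c in required_columns if c not in header]
        if missing:
            return False, f"missing columns: {missing}"
        rows = list(reader)
    if len(rows) != expected_rows:
        return False, f"expected {expected_rows} rows, found {len(rows)}"
    return True, "ok"


# Simulate the artifact an agent might leave behind (made-up contents).
with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp) / "orders.csv"
    export.write_text("order_id,price\n1,19.99\n2,5.00\n")
    good = verify_export(export, ["order_id", "price"], expected_rows=2)
    wrong_count = verify_export(export, ["order_id", "price"], expected_rows=3)
    absent = verify_export(Path(tmp) / "missing.csv", ["order_id"], expected_rows=0)

print(good, wrong_count, absent)
```

Returning a reason alongside the verdict makes verifier audits easier, since a reviewer can see why a run was rejected.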

What should buyers ask for?

Buyers should ask for the task source, verifier definition, verifier audit, pass@k distribution, failure taxonomy, eval split, and contamination notes.
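
A buyer could turn that list into a simple completeness check on a package manifest. This sketch assumes the manifest is a plain dictionary; the key names are hypothetical and only mirror the items listed above.

```python
# Hypothetical manifest keys, one per item a buyer should ask for.
REQUIRED_FIELDS = (
    "task_source",
    "verifier_definition",
    "verifier_audit",
    "pass_at_k",
    "failure_taxonomy",
    "eval_split",
    "contamination_notes",
)


def missing_fields(manifest):
    """Return required fields that are absent or empty in the manifest."""
    return [field for field in REQUIRED_FIELDS if not manifest.get(field)]


# Illustrative manifest with made-up values and two gaps.
manifest = {
    "task_source": "recorded internal workflow",
    "verifier_definition": "state check on created record",
    "verifier_audit": "",
    "pass_at_k": {"pass@1": 0.3},
    "eval_split": "held-out variants",
    "contamination_notes": "no public benchmark overlap found",
}
print(missing_fields(manifest))
```

A check like this confirms only that each item is present, not that it is good. The audit and pass@k contents still need to be read.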

The UseDesktop direction

UseDesktop is being built as a local workflow acquisition and verifier-authoring layer for CUA data.

The aim is to supply frontier teams with workflow packages that have a quality story: not just what the task is, but why it is worth training on and how improvement will be measured.

This connects directly to why computer-use agents need RL environments.