ATLAS Benchmark

AI Teaching and Learning Assessment Standard. Currently in development.

Most benchmarks measure whether AI gets the answer right. ATLAS measures whether it actually fits in education.

ATLAS is an independent benchmark we are developing. For every task, it scores two things: whether the model is accurate, and whether it teaches the student or just does the work for them. No major benchmark measures that second part yet. We are building it with teachers.

Help shape it

We are recruiting teachers to help design the cases ATLAS runs on. It is one workshop on Zoom, about 90 minutes, in late July or early August, and everyone who contributes is credited as a named contributor on the published work. Skeptics welcome. It takes about five minutes to apply.

How it works

01

What it measures

Every case hands a model a real classroom task, the source materials, and a rubric. Each response gets two scores: an accuracy score for whether the work is correct, and a pedagogy score for whether it actually helps someone learn. The cases cover nine kinds of work: three student-facing, four teacher-facing, and two shared. Student-facing and teacher-facing tasks are meant to carry about equal weight, with a small shared slice. The exact weights are still being finalized with our contributors.

For students

  • Understanding material and tutoring
  • Assignment completion
  • Exam prep

For teachers

  • Grading and feedback
  • Lesson planning
  • Course materials
  • Assessment creation

Shared

  • Administrative communication
  • Research
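
As a rough illustration, the structure described above (a task, source materials, a rubric, and two scores per response across nine categories) could be modeled like this. It is a sketch only, not ATLAS's actual schema: the class names, field names, and the 0.0 to 1.0 score scale are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Audience(Enum):
    STUDENT = "student"
    TEACHER = "teacher"
    SHARED = "shared"


# The nine kinds of work listed above, grouped by who the task serves.
CATEGORIES = {
    "Understanding material and tutoring": Audience.STUDENT,
    "Assignment completion": Audience.STUDENT,
    "Exam prep": Audience.STUDENT,
    "Grading and feedback": Audience.TEACHER,
    "Lesson planning": Audience.TEACHER,
    "Course materials": Audience.TEACHER,
    "Assessment creation": Audience.TEACHER,
    "Administrative communication": Audience.SHARED,
    "Research": Audience.SHARED,
}


@dataclass(frozen=True)
class Case:
    """One benchmark case: a classroom task plus what is needed to grade it."""
    category: str
    task: str
    source_materials: tuple[str, ...]
    rubric: str

    def __post_init__(self):
        # Reject categories that are not among the nine listed above.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


@dataclass(frozen=True)
class ScoredResponse:
    """A model response with its two scores (hypothetical 0.0 to 1.0 scale)."""
    case: Case
    accuracy: float  # is the work correct?
    pedagogy: float  # does it help someone learn?


# Made-up example case, for illustration only.
example = Case(
    category="Exam prep",
    task="Help a student review for a unit test on fractions.",
    source_materials=("unit_outline.txt",),
    rubric="Guides the student to the answer instead of stating it.",
)
```

Keeping accuracy and pedagogy as separate fields, rather than one blended number, mirrors the point of the two-score design: a response can be fully correct and still do the student's work for them.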

02

How it is graded

Grading follows the method the APEX benchmark used to reach about 89 percent agreement with human experts. Each response is scored by a panel of three different AI models rather than one, so no single model's blind spot decides the result.
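
To make the panel idea concrete, here is a minimal sketch of three judges scoring one response. The page does not say how the three scores are combined, so taking the median is an assumption chosen for the example, as are the function names, the 0.0 to 1.0 scale, and the stub judges standing in for real AI models.

```python
from statistics import median
from typing import Callable, Sequence

# A judge takes (response, rubric) and returns (accuracy, pedagogy).
# Hypothetical interface; real judges would call three different AI models.
Judge = Callable[[str, str], tuple[float, float]]


def panel_score(response: str, rubric: str,
                judges: Sequence[Judge]) -> tuple[float, float]:
    """Combine three judges' scores by taking the median of each score."""
    if len(judges) != 3:
        raise ValueError("this sketch assumes a panel of exactly three judges")
    scores = [judge(response, rubric) for judge in judges]
    accuracy = median(a for a, _ in scores)
    pedagogy = median(p for _, p in scores)
    return accuracy, pedagogy


# Stub judges with fixed scores, for illustration only.
def lenient_judge(response: str, rubric: str) -> tuple[float, float]:
    return (0.9, 0.8)


def strict_judge(response: str, rubric: str) -> tuple[float, float]:
    return (0.7, 0.4)


def outlier_judge(response: str, rubric: str) -> tuple[float, float]:
    return (0.2, 0.9)


result = panel_score("sample response", "sample rubric",
                     [lenient_judge, strict_judge, outlier_judge])
print(result)  # (0.7, 0.8)
```

With a median, one judge's unusually low accuracy score (0.2 here) does not drag the result down, which is one simple way a single model's blind spot can be kept from deciding the outcome.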

03

Why it stays independent

ATLAS does not evaluate Deskpad's own tools, so it can never be a scoreboard that happens to favor us. The rubrics and methods are public, the results get published whether they flatter us or not, and Deskpad sells nothing. The point is a measurement people can trust, not a marketing claim.

Where it is headed

Now

Recruiting teachers to help shape the first set of cases.

Late summer 2026

A contributor workshop where teachers help build and pressure-test the cases.

Fall 2026

ATLAS v1: a first set of cases across the categories, run against today's AI models and teaching tools.

After that

A public leaderboard and the first papers, all under Deskpad Labs.

If you teach, you can help decide how AI in classrooms gets measured.

Apply to contribute