ATLAS Benchmark
AI Teaching and Learning Assessment Standard. Currently in development.
Most benchmarks measure whether AI gets the answer right. ATLAS measures whether it actually fits in education.
ATLAS is an independent benchmark we are developing. For every task, it scores two things: whether the model is accurate, and whether it teaches the student or just does the work for them. No major benchmark measures that second part yet. We are building it with teachers.
Help shape it
We are recruiting teachers to help design the cases ATLAS runs on. It is one workshop on Zoom, about 90 minutes, in late July or early August, and everyone who contributes is credited as a named contributor on the published work. Skeptics welcome. It takes about five minutes to apply.
How it works
What it measures
Every case hands a model a real classroom task, the source materials, and a rubric. Each response gets two scores: an accuracy score for whether the work is correct, and a pedagogy score for whether it actually helps someone learn. The cases cover nine kinds of work, weighted about evenly between student-facing and teacher-facing tasks, with a small shared slice. The exact weights are still being finalized with our contributors.
For students
- Understanding material and tutoring
- Assignment completion
- Exam prep
For teachers
- Grading and feedback
- Lesson planning
- Course materials
- Assessment creation
Shared
- Administrative communication
- Research
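For readers who think in code, here is a minimal sketch of how a case and its two scores could be represented. The names `Case`, `Scores`, and `audience_of`, and the 0-to-1 score range, are illustrative assumptions, not the ATLAS schema.

```python
from dataclasses import dataclass

# The nine kinds of work listed above, grouped by audience.
CATEGORIES = {
    "student": ["Understanding material and tutoring", "Assignment completion", "Exam prep"],
    "teacher": ["Grading and feedback", "Lesson planning", "Course materials", "Assessment creation"],
    "shared": ["Administrative communication", "Research"],
}


@dataclass(frozen=True)
class Case:
    """One benchmark case: a classroom task, its source materials, and a rubric."""
    category: str
    task: str
    source_materials: tuple[str, ...]
    rubric: tuple[str, ...]


@dataclass(frozen=True)
class Scores:
    """The two scores each response receives (0.0 to 1.0 range is an assumption)."""
    accuracy: float  # is the work correct?
    pedagogy: float  # does it actually help someone learn?

    def __post_init__(self) -> None:
        for name in ("accuracy", "pedagogy"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")


def audience_of(category: str) -> str:
    """Return 'student', 'teacher', or 'shared' for a known category name."""
    for audience, names in CATEGORIES.items():
        if category in names:
            return audience
    raise ValueError(f"unknown category: {category}")
```

Keeping accuracy and pedagogy as separate fields, rather than one blended number, mirrors the idea that a response can be correct and still do the work for the student.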
How it is graded
Grading follows the method the APEX benchmark used to reach about 89 percent agreement with human experts. Each response is scored by a panel of three different AI models rather than one, so no single model's blind spot decides the result.
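As a rough illustration of the three-model panel, the sketch below combines three judge scores by taking the median, so one outlying judge cannot set the result alone. The median rule, the `Judge` type, and `panel_score` are assumptions made for this example; the text above does not specify how ATLAS or APEX combine panel scores.

```python
from statistics import median
from typing import Callable, Sequence

# Hypothetical judge interface: takes a model response, returns a score.
Judge = Callable[[str], float]


def panel_score(response: str, judges: Sequence[Judge]) -> float:
    """Score a response with a panel of exactly three judges and return the median."""
    if len(judges) != 3:
        raise ValueError("this sketch assumes a panel of exactly three judges")
    return median(judge(response) for judge in judges)


# Usage with stand-in judges; a real panel would call three different AI models.
stub_judges = [lambda r: 0.9, lambda r: 0.2, lambda r: 0.8]
print(panel_score("example response", stub_judges))
```

With the stand-in judges shown, the low outlier of 0.2 does not drag the panel result down, which is the blind-spot protection the paragraph above describes.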
Why it stays independent
ATLAS does not evaluate Deskpad's own tools, so it can never be a scoreboard that happens to favor us. The rubrics and methods are public, the results get published whether they flatter us or not, and Deskpad sells nothing. The point is a measurement people can trust, not a marketing claim.
Where it is headed
- Recruiting teachers to help shape the first set of cases.
- A contributor workshop where teachers help build and pressure-test the cases.
- ATLAS v1: a first set of cases across the categories, run against today's AI models and teaching tools.
- A public leaderboard and the first papers, all under Deskpad Labs.
If you teach, you can help decide how AI in classrooms gets measured.
Apply to contribute