Assessment in the Age of AI
The arrival of AI capable of producing fluent, well-structured text has created a genuine crisis of confidence in traditional assessment methods. But this disruption is also an opportunity: to rethink what we are actually trying to measure, why we assess students, and what forms of evidence best reveal authentic learning in the twenty-first century.
Why Traditional Assessment Is Under Pressure
The standard essay assignment has long been a cornerstone of educational assessment. Students demonstrate understanding, develop arguments, practice writing, and receive feedback through this form. But when a student can generate a competent essay on any topic in seconds using a generative AI tool, the essay assignment is no longer a reliable indicator of what the student knows, thinks, or can do independently.
This is not entirely new. Students have always been able to plagiarize, hire tutors to do their work, or collaborate beyond what was intended. But the scale and accessibility of AI-assisted text generation make the current situation qualitatively different. A student who could not previously write a coherent paragraph can now produce something that superficially resembles sophisticated analysis. This fundamentally changes the evidential value of many traditional assessments.
Schools that have responded to AI by deploying AI detection tools have quickly found themselves in an arms race that they are unlikely to win. AI detection tools produce significant numbers of false positives — flagging genuine student work as AI-generated — which creates serious fairness problems, particularly for English language learners whose writing patterns can resemble AI output. More fundamentally, detection-based approaches assume the goal is to catch and punish, rather than to redesign assessment so that AI assistance is either irrelevant or acceptable.
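The fairness problem with false positives can be made concrete with some simple arithmetic. The sketch below is purely illustrative: the function name and every number in it (class size, share of AI-written work, detection rate, false positive rate) are hypothetical assumptions, not measurements of any real detection tool.

```python
def expected_flags(n_submissions, ai_share, detection_rate, false_positive_rate):
    """Estimate how many flagged essays are AI-written versus honest work.

    All inputs are hypothetical assumptions chosen by the caller.
    """
    ai_written = n_submissions * ai_share
    human_written = n_submissions - ai_written

    true_flags = ai_written * detection_rate          # AI-written essays that get flagged
    false_flags = human_written * false_positive_rate  # honest essays wrongly flagged

    total_flags = true_flags + false_flags
    share_false = false_flags / total_flags if total_flags else 0.0
    return true_flags, false_flags, share_false


# Hypothetical scenario: 1,000 essays, 10% AI-written,
# a detector that catches 90% of them and wrongly flags 5% of honest work.
true_flags, false_flags, share_false = expected_flags(1000, 0.10, 0.90, 0.05)
print(f"Flagged and AI-written: {true_flags:.0f}")
print(f"Flagged but honest:     {false_flags:.0f}")
print(f"Share of flags that are wrong: {share_false:.0%}")
```

Under these assumed numbers, roughly one flag in three would land on a student who did nothing wrong, even though the detector's error rate sounds small. Because most submissions are honest, even a modest false positive rate produces many wrongful accusations.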
Rethinking What Assessment Is For
Assessment serves several distinct purposes: diagnosing student understanding, certifying competence, providing feedback for improvement, motivating effort, and holding students accountable. Different assessment forms serve these purposes differently, and AI disrupts them differently.
Formative assessment (ongoing checks for understanding during learning) is relatively resilient to AI disruption when it happens in class, in real time, through conversation, observation, and immediate response. It is high-stakes summative assessment of independent production that faces the most acute challenge.
The most defensible response to AI is not to try to ban it from assessment contexts but to design assessments that cannot be completed by AI alone — assessments that require personal experience, oral defense, iterative process documentation, or real-time performance. When AI assistance is irrelevant to demonstrating the intended competency, the detection problem disappears.
Assessment Designs That Are AI-Resilient
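Educators who have thought carefully about this challenge have developed a range of approaches that preserve the validity of assessment in an AI-rich environment, including tasks grounded in personal experience, oral defenses, iterative process documentation, and real-time performance.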
Automated Essay Scoring: Promise and Peril
Automated Essay Scoring (AES) systems — AI tools that grade student writing — have been in development since the 1960s and are now sophisticated enough to correlate reasonably well with human rater scores on standardized rubrics. They offer potential benefits: consistency, speed, and the ability to provide immediate feedback at scale. States and testing companies use AES for scoring large-scale writing assessments, where the cost of human scoring makes it impractical for every response to receive human attention.
The research on AES shows important limitations, however. These systems tend to reward surface features of writing — length, vocabulary complexity, sentence variety — more than actual quality of argument, evidence use, or insight. Students who learn to game AES systems can produce high-scoring text that is logically weak. The systems also perform poorly on creative, unusual, or culturally non-mainstream writing that may be genuinely excellent but does not match the patterns in training data.
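To see why surface features are easy to game, consider a deliberately crude sketch of a scorer that looks only at length, word length, and sentence-length variety. This is a toy built for illustration, not how any real AES product works: the function names, the choice of features, and the weights are all assumptions.

```python
import re
from statistics import mean, pstdev

WORD_PATTERN = r"[A-Za-z']+"


def surface_features(text: str) -> dict:
    """Compute three surface features; none of them examines the argument itself."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(WORD_PATTERN, text)
    sentence_lengths = [len(re.findall(WORD_PATTERN, s)) for s in sentences]
    return {
        "word_count": len(words),
        "avg_word_length": mean(len(w) for w in words) if words else 0.0,
        "sentence_length_spread": pstdev(sentence_lengths) if sentence_lengths else 0.0,
    }


def toy_surface_score(text: str) -> float:
    """Hypothetical weighted sum; the weights are arbitrary illustrative values."""
    f = surface_features(text)
    return (
        0.05 * f["word_count"]
        + 1.0 * f["avg_word_length"]
        + 0.5 * f["sentence_length_spread"]
    )


# A short passage that makes a clear point.
concise = "Detection tools misfire. Flagging honest work is unfair. Redesign the task instead."

# A longer passage of impressive-sounding words that says very little.
padded = (
    "Fundamentally, educational institutions universally demonstrate extraordinarily "
    "multifaceted characteristics. Consequently, technological transformation "
    "necessitates comprehensive reconsideration, notwithstanding considerable "
    "institutional complexity and multidimensional considerations. Indeed."
)

print("concise:", round(toy_surface_score(concise), 2))
print("padded: ", round(toy_surface_score(padded), 2))
```

The padded passage outscores the concise one because the scorer only counts words, measures their length, and checks how much sentence lengths vary. Real systems use far richer features, but the same weakness applies whenever the measured features can be inflated without improving the reasoning.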
AES works best as a first-pass feedback tool that flags potential concerns for human review and gives students rapid formative feedback between drafts — not as a final evaluator. Many writing platforms use this model: AI scores a draft instantly, giving the student actionable feedback on mechanics and structure, while the teacher reviews the final submission for deeper qualities of thought and development.
Redefining Academic Integrity Policies
Academic integrity policies written before 2022 are almost certainly inadequate for the current AI landscape. Educators and institutions are grappling with questions that have no consensus answers: Is using AI to brainstorm ideas plagiarism? To fix grammar? To generate an outline? To write a first draft that the student then rewrites? The answers vary by context, learning objective, and institutional philosophy.
Best practice in policy development involves several elements: engaging students as participants in developing norms rather than imposing rules unilaterally, being explicit about the rationale behind each policy (what learning is the assessment intended to produce?), distinguishing between different types of AI use rather than issuing blanket prohibitions or permissions, and revisiting policies regularly as the technology evolves.
Although those specific questions remain unsettled, a view gaining ground among educational researchers is that the goal should be teaching students to use AI tools skillfully and ethically, with clear norms for each context, rather than trying to create AI-free zones that do not reflect the world students are entering. Assessment designs should be transparent about whether AI use is expected, permitted, restricted, or prohibited, and why.
What AI Cannot Assess
There remains a wide range of competencies that AI cannot assess well — the very competencies that are most distinctly human and most valuable in the labor market of the future. Creativity in the sense of generating genuinely novel ideas, collaborative problem-solving observed over time, oral communication and presentation presence, ethical reasoning in complex real-world dilemmas, leadership and facilitation, and physical performance all require human judgment to assess meaningfully.
This suggests a strategic direction for educators: design more assessments that require these distinctly human performances, and reserve AI-susceptible formats (standard essays, multiple-choice tests, short-answer responses) for lower-stakes formative use where AI assistance may actually be pedagogically appropriate.