Mass Online Testing Has a Validity Problem Nobody’s Measuring

Online examinations were adopted at scale because circumstances demanded it. What is harder to explain is why, five years on, most institutions have not asked whether the results they produce are actually valid.

Validity, in assessment terms, is not about whether an exam is difficult or well-designed. It refers to something more fundamental: whether a test score measures what it claims to measure, and nothing else. When external factors (unstable connectivity, inadequate hardware, a disruptive home environment) systematically affect performance, the score stops being a clean measure of knowledge. It becomes a measure of knowledge plus conditions. At a small scale, that noise is manageable. Across tens of thousands of concurrent test-takers sitting from wherever they happen to live, it becomes a data integrity problem that institutions are, for the most part, not equipped to detect.

The evidence that this is happening is not theoretical. Research published in 2025 in Quality in Higher Education by Professors Philip Newton and Michael Draper of Swansea University found that of 119 UK universities responding to Freedom of Information requests, 78% were still running online exams introduced during the pandemic. Only 60 of those institutions had any policies or guidance relating to security or integrity. Newton called it “actively unethical behaviour by universities”: a charge that is pointed precisely because it is not directed at the technology, but at the institutions choosing to rely on it without scrutiny.

The question this raises is not whether online assessment should exist. It is whether the organisations deploying it at scale have any mechanism to know when it is failing, and who bears the cost when it does.

What Validity Actually Means and Why Conditions Contaminate It

A score is valid only to the extent that it reflects the construct being tested, and only that. When factors outside that construct influence it, the measurement is contaminated. Psychometricians call this construct-irrelevant variance: systematic error introduced not by gaps in student knowledge, but by variables the test was never designed to capture, such as inconsistent internet connectivity, substandard hardware, or a disruptive home environment. Each of these shifts a score in ways that have nothing to do with what the student knows.

In a controlled examination hall, institutions invest considerable effort in suppressing these variables: uniform conditions, identical materials, uniform timing, and consistent invigilation. The entire architecture of traditional assessment is, in a sense, a validity protection mechanism. Online, at scale, that architecture is absent. What replaces it varies enormously from one student’s kitchen table to the next.

The consequences are documented. A 2024 scoping review in Higher Education Quarterly, drawing on student experiences of remote proctoring across multiple countries and disciplines, found that technical disruption, environmental stress, and procedural inconsistency were dominant recurring themes rather than isolated complaints. Students with anxiety conditions, learning disabilities, or ADHD were disproportionately affected, with evidence of a specific negative effect on high-anxiety test-takers under remote proctoring conditions. This is not bad luck. It is the predictable result of taking a test built to require controlled conditions and stripping those controls away. It is like using a measuring tape in a windstorm: the tool is fine, but the conditions make the measurement unreliable. And the larger the group tested this way, the greater the damage.

Scale Does Not Average Out the Problem; It Amplifies It

There is a common institutional assumption that, at sufficient scale, individual variation in conditions evens out. It does not. What scale does is widen the range of conditions under which students are assessed, and with it, the gap between those in stable, well-resourced environments and those who are not.
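The statistical point can be made concrete with a small simulation. The sketch below assumes a deliberately simple, hypothetical model: an observed score is true ability plus random noise, minus a fixed penalty for students sitting in poor conditions. The penalty size, the share of affected students, and the function name are illustrative assumptions, not estimates from any study cited here.

```python
import random
import statistics


def simulate_gap(n_students, poor_share=0.2, penalty=5.0, seed=0):
    """Return the mean observed score gap (good minus poor conditions).

    Assumed model: observed = ability + noise - penalty (poor conditions only).
    Ability is drawn from the same distribution for both groups, so any
    gap in observed scores is produced by conditions, not by knowledge.
    """
    rng = random.Random(seed)
    good, poor = [], []
    for _ in range(n_students):
        ability = rng.gauss(60, 10)   # hypothetical true-ability distribution
        noise = rng.gauss(0, 5)       # random error, which does average out
        if rng.random() < poor_share:
            poor.append(ability + noise - penalty)
        else:
            good.append(ability + noise)
    return statistics.mean(good) - statistics.mean(poor)


# Larger cohorts give a more precise estimate of the gap, not a smaller gap.
for n in (1_000, 10_000, 100_000):
    print(n, round(simulate_gap(n), 2))
```

Under these assumptions, increasing the cohort size shrinks only the random noise around the estimate. The gap itself settles near the assumed penalty rather than near zero, which is the sense in which scale does not average the problem away.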

The OECD acknowledges that despite near-universal nominal internet access across member countries, significant disparities in the quality of that access persist along socioeconomic and geographic lines. Nominal access is not exam-ready access. A student on a stable home broadband connection and a student on a congested shared one both count as connected. Both students receive a score. Neither score carries a footnote about the conditions under which it was produced.

In Australia, the gap is documented with unusual precision. Securing Digital Equity in Australian Education, a 2024 report by Professor Leslie Loble AM and Dr Kelly Stephens for the Australian Network for Quality Digital Education, found that over one-fifth of students in disadvantaged schools lacked adequate digital resources, compared to just two per cent in wealthier cohorts.

The students most likely to be assessed under substandard conditions are also the least likely to have the institutional capital to challenge a result. At scale, the validity problem and the equity problem are the same problem.

The Accountability Gap

What makes this a governance problem rather than merely a technical one is that institutions are not, in the main, collecting the data that would allow them to identify it. There is no standard requirement to record test conditions alongside test results, no audit trail linking bandwidth quality or device specifications to performance outcomes, and no regulatory expectation that institutions demonstrate condition-equivalence before using online scores for high-stakes decisions.
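To illustrate what recording test conditions alongside test results could look like, here is a minimal sketch of a per-sitting audit record. Every field name, the `ExamConditionRecord` class, and the `to_audit_line` helper are hypothetical; no existing platform or standard is being described, and a real schema would need to account for privacy and data-protection requirements.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ExamConditionRecord:
    """Hypothetical record pairing one result with its sitting conditions."""
    candidate_id: str
    exam_id: str
    score: float
    device_type: str                        # e.g. "laptop", "tablet", "phone"
    median_bandwidth_mbps: Optional[float]  # None if it was not measured
    disconnect_count: int                   # connection drops during the sitting
    seconds_lost_to_disruption: int         # exam time lost to technical faults


def to_audit_line(record: ExamConditionRecord) -> str:
    """Serialise a record as one JSON line for an append-only audit log."""
    return json.dumps(asdict(record), sort_keys=True)


example = ExamConditionRecord(
    candidate_id="c-001",
    exam_id="exam-2025-01",
    score=62.5,
    device_type="laptop",
    median_bandwidth_mbps=3.2,
    disconnect_count=2,
    seconds_lost_to_disruption=140,
)
print(to_audit_line(example))
```

The design choice that matters is that the score and the conditions live in the same record, so that a later review can ask whether outcomes varied with conditions instead of having to reconstruct them after the fact.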

The Newton and Draper findings are particularly damning in this light. Institutions running online exams without integrity policies are not simply taking a risk; they are making consequential decisions about students’ progression, classification, and credentialing on data whose validity they have not scrutinised and, in most cases, have no mechanism to scrutinise.

This is where the choice of assessment infrastructure becomes an institutional accountability question, not merely an operational one. Whether an institution is running national certification programmes, cross-campus exams, or large undergraduate cohorts, the decisions made at the infrastructure level (about data capture, condition monitoring, and audit capability) determine whether the institution could ever mount a credible defence of its results. A large-scale platform built for high-stakes delivery should, at a minimum, give institutions the visibility to know what conditions their students were assessed under. Without that, scale is not an achievement. It is a liability that hasn’t yet been called in.

The Measurement Gap Nobody Is Closing

Part of the reason this problem persists is structural: the people best positioned to identify it, psychometricians and assessment researchers, are rarely embedded in the institutions deploying online exams at scale. And the institutions deploying those exams have little incentive to go looking for validity problems in results they have already used to make decisions.

Professor Newton’s June 2025 keynote at the Welsh Integrity and Assessment Network symposium, where he argued for abandoning unsupervised online exams as a default mode of assessment, represents one position on the spectrum. The more cautious and arguably more practical position is not abandonment, but accountability: a requirement that institutions demonstrate, through data, that the conditions of their online assessments were sufficiently equivalent to support the score interpretations being made.
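One simple form such a demonstration could take is a comparison of scores across condition strata. The sketch below computes a standardised mean difference (Cohen's d with a pooled standard deviation) between two groups of scores and flags it against a threshold. The function names, the threshold of 0.2, and the sample scores are assumptions chosen for illustration. A large difference does not by itself prove that conditions caused it, since the groups may differ in other ways, but it is the kind of signal that would warrant investigation.

```python
import statistics


def standardized_mean_difference(scores_a, scores_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(scores_a), len(scores_b)
    var_a = statistics.variance(scores_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(scores_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(scores_a) - statistics.mean(scores_b)) / pooled_sd


def needs_review(scores_a, scores_b, threshold=0.2):
    """Flag a condition comparison whose effect size exceeds the threshold.

    The default threshold is an illustrative assumption, not a standard.
    """
    return abs(standardized_mean_difference(scores_a, scores_b)) > threshold


# Made-up scores for candidates with stable and unstable connections.
stable = [70, 72, 74]
unstable = [60, 62, 64]
print(standardized_mean_difference(stable, unstable))
print(needs_review(stable, unstable))
```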

That requirement does not currently exist in any European or Australian regulatory framework in a meaningful, enforceable form. The EU AI Act’s provisions around high-risk AI systems (including automated assessment tools) gesture towards it, but a 2025 analysis of Italian universities operating under the Act found that institutions were far better at documenting intentions than demonstrating outcomes, with validity testing and fairness monitoring among the weakest areas of actual implementation. The regulation existed. The mechanism to satisfy it did not.

Until there is external pressure to close that gap, the incentive structure remains unchanged. Institutions will continue expanding online assessment because it is cheaper, more flexible, and operationally convenient. The students who bear the cost of unmeasured condition variance are, by definition, the ones least likely to be in a position to challenge it.