The Testing Problem That Comes With AI Personalisation at Scale

Duolingo is one of the most carefully engineered consumer apps in the world. Its approach to AI-driven learning — adapting lesson sequences, difficulty curves, and exercise types to individual user behaviour in real time — represents a serious and sustained investment in personalisation at a scale that very few companies have attempted. The product has over 500 million registered users and a genuine commitment to using machine learning to improve learning outcomes, not just engagement metrics.

It's worth taking that seriously, because what Duolingo has built is an example of where a significant portion of the software industry is heading. Personalisation driven by AI — not just surface-level customisation, but genuine adaptation of the core experience based on individual behaviour — is becoming a standard expectation rather than a premium feature. Recommendation engines, adaptive learning systems, personalised content feeds, dynamically generated UI flows: these are no longer the exclusive domain of the largest tech companies. They're being built by teams of every size, across every category of software.

And they all share a testing problem that most teams haven't fully confronted yet.

What AI Personalisation Actually Changes About the Product

Traditional software, for all its complexity, has a property that makes it testable: given the same inputs, it produces the same outputs. A QA team can define test cases, run them systematically, and verify that the software behaves as specified. The state space is large but bounded. Coverage, while never complete, is meaningful.

AI personalisation breaks this property at a fundamental level. When the experience a user receives is shaped by their individual history, behaviour, and inferred preferences, no two users see quite the same product. The lesson sequence Duolingo generates for a 40-year-old professional learning Spanish for a trip to Mexico is different from the one it generates for a teenager learning the same language with a gaming motivation. Both are generated by the same system. Both need to work. But they cannot both be tested by running the same test case.

The combinatorial problem: In a traditionally structured app, the number of distinct experiences a user can have is large but finite. In an AI-personalised app, the number of distinct experiences is effectively unbounded — a function of every user's unique history and the model's response to it. You cannot enumerate the test cases. You have to think differently about what testing means.

This is the core challenge. It's not a failure of engineering ambition — it's a direct consequence of the thing that makes AI personalisation valuable. The more genuinely adaptive the system is, the less useful conventional test coverage becomes as a quality assurance strategy.
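One way to "think differently" when test cases cannot be enumerated is to check invariants over many generated user histories rather than asserting exact outputs. The sketch below is a hand-rolled, property-style check. `next_difficulty` is a hypothetical stand-in for a personalisation model, and the 1 to 10 difficulty scale and both invariants are assumptions made for illustration, not a description of any real product.

```python
import random

# Hypothetical stand-in for a personalisation model: picks the next exercise
# difficulty (1 to 10) from a user's recent results. Illustration only.
def next_difficulty(history):
    """history is a list of (difficulty, correct) tuples, oldest first."""
    if not history:
        return 1
    last_difficulty, _ = history[-1]
    recent = history[-5:]
    accuracy = sum(1 for _, correct in recent if correct) / len(recent)
    if accuracy >= 0.8:
        step = 1
    elif accuracy <= 0.4:
        step = -1
    else:
        step = 0
    return max(1, min(10, last_difficulty + step))

def random_history(rng, max_len=50):
    """Generate an arbitrary (not necessarily realistic) user history."""
    return [(rng.randint(1, 10), rng.random() < 0.5)
            for _ in range(rng.randint(0, max_len))]

def check_invariants(history):
    """Return the invariants violated for one generated history."""
    violations = []
    chosen = next_difficulty(history)
    if not 1 <= chosen <= 10:
        violations.append("difficulty out of range")
    if history and abs(chosen - history[-1][0]) > 1:
        violations.append("difficulty jumped more than one level")
    return violations

def run_invariant_checks(trials=1000, seed=0):
    """Check the invariants across many random histories; return failures."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        history = random_history(rng)
        violations = check_invariants(history)
        if violations:
            failures.append((history, violations))
    return failures
```

A check like this says nothing about whether the generated experience is good. It only establishes that the system stays within stated bounds across inputs nobody wrote by hand, which is the part of the problem automation can still cover.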

How the Industry Got Here

The shift toward AI-driven personalisation has happened gradually enough that its testing implications have crept up on the industry rather than arriving as a clear discontinuity. Teams that have been building personalisation features incrementally — adding a recommendation engine here, an adaptive difficulty curve there — have generally extended their existing testing practices to cover each new feature individually. The cumulative effect on the testability of the overall experience has been less visible.

Duolingo's pivot toward what its CEO Luis von Ahn described in 2025 as an "AI-first" approach (shifting content work previously done by contractors to AI and leaning further into personalised learning paths) is a useful marker because it made the stakes of this shift explicit. The bet is that AI can deliver better learning outcomes than a more static, manually curated approach. That bet is plausible. But it requires confronting the testing problem directly, not treating it as a secondary concern.

The same dynamic applies to any team building AI-personalised software. The personalisation features that differentiate your product from a static competitor are precisely the features that are hardest to validate through traditional means. The gap between "the model works" and "the experience it generates is reliably good across the full range of users it will encounter" is not closed by automated testing alone. It requires a different kind of evidence.

The Five Specific Challenges AI Personalisation Creates for Testing

You can't test the experience — only the system

Automated testing can verify that the personalisation model runs correctly, that API calls succeed, that content is rendered without errors. What it cannot verify is whether the specific experience a specific user receives is actually good — coherent, appropriately paced, motivating, and accurately calibrated to their level. That judgment requires a human, encountering the generated experience as a real user would.

Edge users are invisible until they appear

AI models are trained on distributions of user behaviour. They tend to perform well near the centre of that distribution and less predictably at the edges: users whose behaviour patterns weren't well represented in training data, users who interact with the system in unexpected ways, users from demographics that were underweighted in the training set. These edge cases rarely appear in synthetic testing, because synthetic users are usually built from the same assumptions as the model. They appear when real users with real diversity of background and behaviour encounter the system.
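A rough way to decide which users deserve a closer human look is to measure how far their behaviour sits from the centre of the training data. The sketch below uses a per-feature z-score. The feature names, the sample values, and the choice of the largest absolute z-score as an "atypicality" measure are all assumptions for illustration.

```python
from statistics import mean, pstdev

def fit_feature_stats(training_rows):
    """Compute (mean, standard deviation) per numeric feature.

    training_rows is a list of dicts that share the same numeric keys.
    """
    features = list(training_rows[0].keys())
    return {
        f: (mean([row[f] for row in training_rows]),
            pstdev([row[f] for row in training_rows]))
        for f in features
    }

def atypicality(user, stats):
    """Largest absolute z-score across features.

    Higher values mean the user is further from the training centre.
    Features with zero spread are skipped to avoid dividing by zero.
    """
    scores = [abs(user[f] - mu) / sigma
              for f, (mu, sigma) in stats.items() if sigma > 0]
    return max(scores, default=0.0)

# Made-up behavioural data for illustration only.
training = [
    {"sessions_per_week": 3, "minutes_per_session": 10},
    {"sessions_per_week": 4, "minutes_per_session": 12},
    {"sessions_per_week": 5, "minutes_per_session": 11},
    {"sessions_per_week": 4, "minutes_per_session": 9},
    {"sessions_per_week": 4, "minutes_per_session": 13},
]
stats = fit_feature_stats(training)
typical_user = {"sessions_per_week": 4, "minutes_per_session": 11}
edge_user = {"sessions_per_week": 1, "minutes_per_session": 45}
```

Here `atypicality(edge_user, stats)` comes out far above `atypicality(typical_user, stats)`. A score like this does not say the edge user's experience is bad, only that the model has seen few users like them, which makes their journey a sensible candidate for human evaluation.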

Quality is longitudinal, not transactional

For a learning app like Duolingo, the quality of the personalised experience isn't visible in a single session. It emerges over time — does the system correctly identify that a user is struggling with a particular concept and adjust? Does it avoid the trap of endlessly repeating easy content that a user has already mastered? Does it introduce new material at a pace that challenges without overwhelming? These qualities are invisible in a standard QA session and require extended engagement from evaluators to surface.
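Some longitudinal failure modes can at least be screened for by simulating a learner over many sessions before human evaluators spend time on them. The sketch below is illustrative only: the two policies, the deterministic learner (who answers correctly whenever difficulty is at or below their skill), and the "too easy" and "too hard" bands are all hypothetical.

```python
def adaptive_policy(difficulty, results):
    """Hypothetical policy: step up after three straight correct answers,
    step down after two straight misses, otherwise hold."""
    if len(results) >= 3 and all(results[-3:]):
        return min(10, difficulty + 1)
    if len(results) >= 2 and not any(results[-2:]):
        return max(1, difficulty - 1)
    return difficulty

def stuck_policy(difficulty, results):
    """Deliberately broken policy that never adapts, for comparison."""
    return difficulty

def simulate_learner(skill, policy, sessions=20, exercises_per_session=10):
    """Return the difficulty served in each exercise, grouped by session."""
    difficulty, results, trajectory = 1, [], []
    for _ in range(sessions):
        session = []
        for _ in range(exercises_per_session):
            session.append(difficulty)
            # Simplified learner: correct if and only if difficulty <= skill.
            results.append(difficulty <= skill)
            difficulty = policy(difficulty, results)
        trajectory.append(session)
    return trajectory

def final_session_report(trajectory, skill):
    """Share of the last session that was clearly too easy or too hard."""
    last = trajectory[-1]
    return {
        "too_easy_share": sum(d < skill - 1 for d in last) / len(last),
        "too_hard_share": sum(d > skill + 1 for d in last) / len(last),
    }
```

For a learner with `skill=6`, the adaptive policy settles around levels 6 and 7 by the final session, while the stuck policy is still serving level 1 after 200 exercises. A simulation flags the second case cheaply, but whether the first feels well paced and motivating still takes a person engaging with it over time.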

The experience varies with device and context the model doesn't see

AI personalisation models typically optimise on behavioural signals: what the user does, how long they spend, where they stop. They don't directly observe the device the user is on, the network conditions they're experiencing, or the environment they're in. A personalised experience that feels smooth on a high-end device on fast WiFi may feel fragmented or slow on a mid-range device on a constrained mobile connection, and because the model never sees those conditions, the problem only surfaces when someone tests explicitly under them.

Model updates create invisible regressions

When the underlying personalisation model is updated — retrained on new data, fine-tuned for a new objective, adjusted based on A/B test results — the change in user experience is distributed across millions of individually generated journeys. There's no single feature to regression-test. The only way to detect whether a model update has degraded the experience for a class of users is to evaluate real user journeys with real people before and after the change.
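A before-and-after comparison of this kind can be summarised per cohort rather than as one overall number. The sketch below assumes evaluators rate the same journeys on a 1 to 5 scale before and after a model update. The cohort names, ratings, and the 0.5-point threshold are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def cohort_regressions(ratings, threshold=0.5):
    """Find cohorts whose mean evaluator rating dropped after an update.

    ratings: iterable of (cohort, before, after) scores on a 1 to 5 scale.
    Returns {cohort: mean_change} for cohorts whose mean change is below
    -threshold.
    """
    deltas_by_cohort = defaultdict(list)
    for cohort, before, after in ratings:
        deltas_by_cohort[cohort].append(after - before)
    return {
        cohort: mean(deltas)
        for cohort, deltas in deltas_by_cohort.items()
        if mean(deltas) < -threshold
    }

# Made-up evaluator ratings for illustration only.
example_ratings = [
    ("returning, high-end device", 4, 4),
    ("returning, high-end device", 4, 5),
    ("new, low-bandwidth", 4, 3),
    ("new, low-bandwidth", 4, 2),
    ("advanced learner", 3, 3),
    ("advanced learner", 4, 4),
]
flagged = cohort_regressions(example_ratings)
```

With these numbers the overall mean change is only about a third of a point, which could easily pass unnoticed, while the "new, low-bandwidth" cohort drops by 1.5 points and is the only one flagged.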

What Duolingo Gets Right — and What the Industry Can Learn From It

It's worth being precise about what Duolingo has done well here, because the lessons generalise.

The company has invested heavily in what it calls its "learning engine" — not just the model itself, but the feedback infrastructure around it. It runs hundreds of A/B tests simultaneously. It uses engagement and retention metrics as proxies for learning quality, while acknowledging the limitations of those proxies. It has a research team that publishes on the pedagogical validity of its approach. This is a level of investment in understanding product quality that most teams building AI-personalised software haven't matched.

The implicit lesson is that AI personalisation demands a different quality assurance infrastructure, not just better automated testing. Metrics, experiments, and research loops are part of that infrastructure. But they operate at a population level — they tell you that version B retains users better than version A across the full user base. They don't tell you that version B has introduced a failure mode for a specific class of users that version A didn't have.

Population-level quality signals

What A/B tests and metrics reveal

Whether one experience variant performs better than another on average across the full user base.

Useful for: optimising aggregate outcomes, identifying broad regressions, making decisions about which direction to move.

Limitation: averages can improve while specific user populations get worse. A metric that goes up doesn't mean every user's experience improved.

Individual-level quality signals

What human evaluation reveals

Whether specific users, with specific profiles and behaviour patterns, have a good experience — or a broken one.

Useful for: surfacing failure modes for edge users, evaluating longitudinal experience quality, catching regressions that metrics miss.

Limitation: doesn't scale to the full user base. Works best as a structured sampling process, not a comprehensive audit.

These two approaches are not in competition. They're complementary. Population-level signals tell you where to look and what's changing. Individual-level human evaluation tells you what specific users actually encounter — and whether that experience is actually good, or merely average.
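The limitation of averages is easy to see with a little arithmetic. The sketch below compares two variants on retention, both overall and per segment. The segment names and counts are invented for illustration and do not come from any real experiment.

```python
def retention_rate(retained, total):
    """Fraction of users retained."""
    return retained / total

def compare_variants(segments):
    """Compare variants A and B overall and per segment.

    segments: {name: {"A": (retained, total), "B": (retained, total)}}
    Returns (overall_rates, segments_where_B_is_worse).
    """
    totals = {"A": [0, 0], "B": [0, 0]}
    worse_in_b = []
    for name, variants in segments.items():
        for variant, (retained, total) in variants.items():
            totals[variant][0] += retained
            totals[variant][1] += total
        if retention_rate(*variants["B"]) < retention_rate(*variants["A"]):
            worse_in_b.append(name)
    overall = {v: retention_rate(r, t) for v, (r, t) in totals.items()}
    return overall, worse_in_b

# Made-up numbers for illustration only.
example_segments = {
    "majority": {"A": (4500, 9000), "B": (4950, 9000)},
    "low-bandwidth": {"A": (400, 1000), "B": (300, 1000)},
}
overall, worse_in_b = compare_variants(example_segments)
```

In this example variant B wins overall (52.5% retention against 49%), yet the low-bandwidth segment falls from 40% to 30%. Segment-level metrics can reveal that something went wrong for that group. Finding out what those users actually encountered is where human evaluation comes in.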

A Framework for Testing AI-Personalised Products

The teams that handle this well tend to think about evaluation across three distinct dimensions, each of which requires different methods:

System correctness — does the personalisation infrastructure work as specified? This is amenable to automated testing: does the model run? Do API calls succeed? Is content rendered correctly? Traditional QA handles this well.
Population-level quality — does the personalised experience produce better aggregate outcomes than the alternative? This is the domain of A/B testing, engagement metrics, and longitudinal retention analysis. Duolingo's experimentation infrastructure is a model for this.
Individual-level experience quality — do specific users, with specific profiles, in specific contexts, actually have a good experience? This requires human evaluation: real people, experiencing the generated content as real users, assessed by evaluators who can judge quality in ways that metrics cannot.

Most teams invest heavily in the first dimension, reasonably well in the second, and inconsistently in the third. The third is where the most consequential quality failures hide — because they affect real people having bad experiences, rather than aggregate numbers moving in the wrong direction.

| Evaluation Method | What It Catches | What It Misses | Fit for AI Personalisation |
| --- | --- | --- | --- |
| Automated unit / integration tests | System failures, API errors, rendering bugs | Experience quality, edge user failures, longitudinal issues | Necessary but insufficient |
| A/B testing and metrics | Aggregate outcome differences between variants | Subgroup regressions, qualitative experience failures | Essential for scale, blind to individuals |
| Internal dogfooding | Obvious bugs, major UX failures | Everything that requires diverse users and genuine naivety | Limited: the team is too close to the product (the proximity problem) |
| Structured human evaluation across diverse profiles | Edge user failures, longitudinal quality, context-specific issues | Population-level signals, aggregate outcomes | Closes the gap the others leave |

The Diversity Imperative

There's one dimension of this challenge that deserves particular attention: the relationship between the diversity of the evaluation panel and the quality of the signal it produces.

AI personalisation models are trained on user data. The quality of the model's behaviour for a given type of user is a function, in part, of how well that type of user was represented in the training data. Users who are similar to the majority of the training distribution get well-optimised experiences. Users at the margins (different demographics, different learning styles, different device and connectivity profiles, different cultural contexts) get experiences that were less directly optimised for them.

This means that an evaluation panel drawn from the same demographic and geographic pool as the development team will systematically miss the failure modes that affect users who are different from that pool. The value of a genuinely diverse evaluation panel — across age, geography, device type, connectivity, prior knowledge, and learning style — is not just ethical. It's epistemic. A diverse panel surfaces failure modes that a homogeneous one cannot.

For a product like Duolingo — whose stated mission is to make quality education universally available, and whose fastest-growing user base is in markets like Brazil, India, and across Sub-Saharan Africa — this is particularly consequential. An evaluation panel that includes learners from those markets, on the hardware and network conditions that are normal there, reveals whether the AI-personalised experience is actually delivering on the universal part of that mission.
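One simple way to keep smaller groups from disappearing from an evaluation panel is stratified sampling: pick a fixed number of evaluators per stratum instead of sampling the candidate pool uniformly. The sketch below is a minimal version. The field names (`region`, `device`), the pool composition, and the panel size are hypothetical.

```python
import random
from collections import defaultdict

def stratified_panel(candidates, strata_keys, per_stratum, seed=0):
    """Sample up to per_stratum candidates from every stratum.

    candidates: list of dicts describing potential evaluators.
    strata_keys: the fields whose combined values define a stratum.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for person in candidates:
        groups[tuple(person[k] for k in strata_keys)].append(person)
    panel = []
    for key in sorted(groups):
        members = groups[key]
        # Small strata contribute everyone they have.
        panel.extend(rng.sample(members, min(per_stratum, len(members))))
    return panel

# Made-up candidate pool, heavily skewed toward one stratum.
pool = (
    [{"id": f"uk-{i}", "region": "UK", "device": "high-end"} for i in range(90)]
    + [{"id": f"in-{i}", "region": "India", "device": "mid-range"} for i in range(8)]
    + [{"id": f"ng-{i}", "region": "Nigeria", "device": "low-end"} for i in range(2)]
)
panel = stratified_panel(pool, ("region", "device"), per_stratum=3)
```

The resulting panel has eight people: three from each of the two larger strata and both candidates from the smallest. A uniform random sample of eight from the same pool would most often contain nobody from the smallest stratum, which is exactly the group whose experience is least likely to have been optimised.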

Where This Leaves Teams Building AI-Personalised Products

The practical implication of all of this is not that AI personalisation is too risky to pursue. It's that pursuing it responsibly requires an expanded conception of quality assurance — one that takes seriously the limitations of automated testing and population-level metrics, and fills those gaps with structured human evaluation.

That evaluation doesn't have to be massive or continuous to be valuable. A panel of real users representing the diversity of your actual user base, experiencing the generated content over a realistic session, assessed by evaluators who can judge quality in context — this kind of structured human evaluation, done periodically and particularly before major model updates or feature launches, closes a gap that no amount of automated testing can address.

The teams that figure this out — that build human evaluation into their AI personalisation quality infrastructure, not as an afterthought but as a core component — will produce products that are genuinely better for a genuinely broader range of users. That's both the right thing to do and, eventually, the competitive advantage that's hardest to replicate.
