Why Augmented Reality Apps Are So Hard to Test — and What Actually Works

Augmented reality is one of the most technically demanding categories of consumer software ever shipped. It asks a device to understand three-dimensional physical space in real time, overlay digital content onto it precisely, maintain that overlay as the user moves, respond to input, and do all of this without draining the battery or overheating, on hardware ranging from a three-year-old mid-range Android phone to the latest iPhone Pro with a dedicated LiDAR scanner.

That's an extraordinary amount of complexity. And most of it is invisible during development, because development happens in a controlled environment: the same office, the same lighting rig, the same test devices, the same developer who knows exactly how the app is supposed to behave.

The users who download your app don't have any of that context. They have their own spaces, their own lighting, their own devices, and their own expectations. And when your app fails under those conditions, which it will, in ways you didn't anticipate, the result isn't just a bug report. It's a one-star review that says the app "doesn't work" and a user who doesn't come back.

The Fundamental Problem: The Real World Is a Variable

Most other categories of software can be meaningfully tested on a device sitting on a desk. AR cannot. The environment is not incidental to the AR experience: the environment is the experience. When you strip it out, you're not testing a simplified version of your app. You're testing a different product entirely.

This creates a structural problem for standard QA processes. Your test lab can verify that your ARKit integration initialises correctly. It can confirm that your asset pipeline produces the right outputs. It can check that your UI renders at the correct scale. What it cannot do is tell you how your app behaves in the full range of real-world environments your users will actually inhabit.

The environments that break AR apps are rarely the ones developers think about — a low-texture wall that the tracking algorithm can't anchor to, an outdoor space where direct sunlight washes out the display, a small room that doesn't give the device enough depth information to place objects accurately. These failures are real, consistent, and completely invisible in a lab.

There's a useful way to think about this. Most traditional software bugs are deterministic: given the same inputs, the same bug appears. AR environment failures are effectively stochastic: whether they appear depends on variables outside anyone's control. You can't write a conventional test case for "user is standing in a kitchen with terracotta floor tiles and indirect evening light." But that's a real environment, and your app needs to work in it.

The Six Environmental Variables That AR Testing Must Cover

Variable 01

Lighting Conditions

AR tracking relies on visual feature detection. Low light, direct sunlight, harsh artificial lighting, and mixed light sources all degrade tracking quality in different ways. An app that anchors objects perfectly under office fluorescents may drift or lose tracking entirely in natural light.

Variable 02

Surface Texture & Depth

Plane detection and surface tracking require visual texture to anchor to. Blank white walls, glass surfaces, and monochrome floors give the algorithm nothing to work with. Many AR apps fail silently in these environments — objects refuse to place, or place incorrectly, with no clear user feedback.

Variable 03

Space Size & Geometry

Room-scale AR behaves very differently in a studio apartment versus a large open-plan office versus an outdoor environment. Occlusion, depth estimation, and object scaling assumptions built in the development studio may break entirely when the available space changes dramatically.

Variable 04

Movement & Motion

Users don't hold phones still. They walk, turn quickly, crouch, reach. Rapid movement can cause tracking loss that takes seconds to recover from — long enough to break immersion and confuse users who don't understand what happened. Motion testing requires real people moving naturally, not deliberate slow sweeps from a developer.

Variable 05

Device Hardware Diversity

LiDAR-equipped devices (iPhone Pro, iPad Pro) have fundamentally different depth sensing capabilities than devices that estimate depth from the camera alone, which includes most ARCore phones and non-LiDAR iPhones running ARKit. An experience that feels magical on a Pro device may feel broken on a mid-range Android phone, not because it doesn't work, but because the underlying sensing is less precise.

Variable 06

Thermal & Battery State

AR is among the most computationally intensive workloads a mobile device handles. After 5–10 minutes of sustained use, many devices begin to throttle CPU and GPU performance to manage heat. Frame rate drops and tracking degradation under thermal pressure are real, common, and rarely tested in short QA sessions.
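
One way to make the variables above concrete is to write them down as a test matrix and sample from it. The sketch below is only illustrative: the condition levels, the `build_test_matrix` and `sample_sessions` helpers, and the sample size are all assumptions, not a prescribed taxonomy.

```python
from itertools import product
import random

# Hypothetical condition levels; replace them with the environments
# your own users are likely to be in.
CONDITIONS = {
    "lighting": ["dim indoor", "office fluorescent", "direct sunlight", "mixed sources"],
    "surface": ["textured", "blank wall", "glass", "monochrome floor"],
    "space": ["small room", "open-plan office", "outdoor"],
    "motion": ["slow sweep", "natural walking", "rapid turn"],
    "device": ["LiDAR flagship", "mid-range Android", "three-year-old phone"],
    "thermal": ["cold start", "after 10 min sustained use"],
}

def build_test_matrix(conditions):
    """Return every combination of condition levels as a list of dicts."""
    names = list(conditions)
    return [dict(zip(names, combo)) for combo in product(*conditions.values())]

def sample_sessions(matrix, k, seed=0):
    """Pick k combinations at random; a fixed seed keeps the plan reproducible."""
    return random.Random(seed).sample(matrix, k)

matrix = build_test_matrix(CONDITIONS)
plan = sample_sessions(matrix, k=12)
print(len(matrix), "possible combinations;", len(plan), "sampled for this round")
```

With the levels shown, the full matrix has 864 combinations, which is far more than any team will walk through by hand. That is the point of sampling: each round covers a different slice, and combinations that fail can be pinned and re-run.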

Why Device Labs Don't Solve This

A natural response to the AR testing problem is to invest in a device lab — a rack of 20 or 50 devices representing the most common hardware configurations. This is a worthwhile investment for many types of software. For AR, it solves approximately half the problem and leaves the other half untouched.

Device labs address hardware diversity. They do not address environmental diversity. You can run the same AR experience across 50 devices in the same room under the same lighting and learn almost nothing about how that experience performs in the range of real-world environments your users inhabit.

AR quality isn't a function of your code in isolation. It's a function of your code meeting the real world — and the real world doesn't look like your office.

The only way to understand how an AR experience performs across environmental variables is to test it in real environments, with real people, on real devices, in conditions that the development team didn't control or anticipate. This is the gap that device labs, automated testing frameworks, and internal QA cannot close.

What the Most Common AR Failure Modes Look Like to Users

Understanding the failure modes from a user perspective — not a technical one — matters because that's the perspective that generates reviews, refund requests, and churn. Here's what real users experience when AR testing has been inadequate:

"It doesn't work" — Tracking failure without feedback

The most common and most damaging AR failure. The app launches, the camera view appears, and then nothing happens. The object won't place. The tracking indicator spins indefinitely. From a developer's perspective this is a known limitation — insufficient surface texture, poor lighting, unsupported device. From the user's perspective, the app is broken. Without clear guidance explaining what's happening and how to fix it, most users will simply give up and leave a negative review.
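
A minimal sketch of what "clear guidance" could look like in logic terms, written in Python to stay platform-neutral. The reason names mirror the cases of ARKit's `ARCamera.TrackingState.Reason`; the `guidance_for` function, the state strings, and the message wording are hypothetical and would need adapting to the real SDK types in your app.

```python
# Reason names follow ARKit's limited-tracking reasons; the messages are
# placeholder copy, not recommended wording.
GUIDANCE = {
    "initializing": "Getting a sense of the room. Move your phone slowly.",
    "excessiveMotion": "Slow down. Move your phone more gently.",
    "insufficientFeatures": "Point at a surface with more detail, or turn on more lights.",
    "relocalizing": "Return to where you were a moment ago so tracking can recover.",
}

FALLBACK = "Tracking is limited. Try moving to a brighter area with varied surfaces."
UNAVAILABLE = "AR tracking is not available right now."

def guidance_for(state, reason=None):
    """Map a tracking state (and optional reason) to a user-facing hint.

    Returns None when tracking is normal and no hint is needed.
    """
    if state == "normal":
        return None
    if state == "notAvailable":
        return UNAVAILABLE
    # state == "limited": use the specific hint if we have one.
    return GUIDANCE.get(reason, FALLBACK)

print(guidance_for("limited", "insufficientFeatures"))
```

The design choice worth copying is the fallback: a generic hint is still better than a spinner when the SDK reports a reason the app does not recognise.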

"Objects keep moving" — Tracking drift

Placed objects that gradually drift from their anchored position over time, or snap to a new position when the camera moves, are deeply disorienting. Users interpret this as a reliability failure — the app can't do what it claims to do. This failure mode is particularly common after device warm-up and in low-texture environments, and is almost never caught in short internal testing sessions in a well-lit studio.
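
Drift is easier to discuss when it is measured. The sketch below assumes a test build that logs the placed object's position, in metres, relative to a fixed physical reference such as a printed marker. The log format, the function names, and the 2 cm threshold are all assumptions chosen for illustration.

```python
import math

def drift_series(samples):
    """samples: list of (seconds, (x, y, z)) positions in metres.

    Returns (seconds, distance_from_first_sample) pairs.
    """
    _, origin = samples[0]
    return [(t, math.dist(origin, pos)) for t, pos in samples]

def first_time_over(samples, threshold_m=0.02):
    """Return the first timestamp at which drift exceeds the threshold, or None."""
    for t, d in drift_series(samples):
        if d > threshold_m:
            return t
    return None

# Hypothetical log: the object creeps away from its anchor over five minutes.
log = [
    (0, (0.0, 0.0, 0.0)),
    (60, (0.003, 0.0, 0.004)),   # 5 mm from the start
    (300, (0.03, 0.0, 0.04)),    # 5 cm from the start
]
print(first_time_over(log))
```

Running the same measurement in a short, cool session and again after sustained use gives a direct comparison for the warm-up effect described above.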

"It looks wrong on my phone" — Scale and rendering inconsistency

AR experiences built and calibrated on high-end hardware often look noticeably worse on mid-range devices — not because of a bug, but because the visual quality assumptions baked into the experience don't translate. Shadows, occlusion quality, and object scale can all feel slightly off in ways that users can't articulate but immediately notice.

"It got really hot and then started lagging" — Thermal degradation

Users who engage with AR experiences for extended sessions encounter performance degradation that shorter test sessions never surface. This is particularly acute for game experiences, try-before-you-buy retail applications, and any AR use case that encourages exploration over time.
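
Thermal degradation shows up clearly in frame-rate logs if the session runs long enough. As a rough sketch, assuming the test build records `(seconds_since_start, fps)` samples, the following finds the first minute where the average frame rate falls well below the early-session baseline. The two-minute baseline and the 80% ratio are arbitrary illustrative choices.

```python
from statistics import mean

def fps_by_minute(frame_log):
    """frame_log: list of (seconds_since_start, fps). Returns {minute: mean fps}."""
    buckets = {}
    for t, fps in frame_log:
        buckets.setdefault(int(t // 60), []).append(fps)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}

def first_degraded_minute(frame_log, baseline_minutes=2, drop_ratio=0.8):
    """Return the first minute whose mean fps is below drop_ratio * baseline.

    The baseline is the mean fps over the first baseline_minutes of the session.
    Returns None if no degradation is found or there is no baseline data.
    """
    per_minute = fps_by_minute(frame_log)
    baseline_values = [fps for m, fps in per_minute.items() if m < baseline_minutes]
    if not baseline_values:
        return None
    baseline = mean(baseline_values)
    for minute, fps in per_minute.items():
        if minute >= baseline_minutes and fps < baseline * drop_ratio:
            return minute
    return None

# Hypothetical ten-minute session: 60 fps at first, 40 fps from minute 6.
session = [(m * 60 + s, 60 if m < 6 else 40) for m in range(10) for s in (0, 30)]
print(first_degraded_minute(session))
```

A two-minute QA pass on the same device would report a steady 60 fps and miss the drop entirely.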

A pattern worth noting: AR app reviews that say "doesn't work on my phone" are very often device or environment failures, not code failures. The app works, just not in the conditions that user encountered. Without real-world testing across those conditions, you have no way to know they exist until your users tell you.

Testing AR at Each Stage of Development

The earlier in development that environmental testing begins, the cheaper each finding is to fix. Here's how to think about AR evaluation across the development lifecycle:

Prototype stage: informal environment sampling. Even before structured testing, developers should be deliberately exposing their prototype to hostile environments — the lowest-texture surface they can find, the brightest outdoor light available, the most cluttered and geometrically complex room they can access. The goal is not to be comprehensive but to surface major assumption failures early, when they're cheapest to address.
Feature-complete: structured multi-environment evaluation. This is where independent evaluation becomes most valuable. A panel of testers using the app across different home environments, device types, and lighting conditions will surface the systematic failures — the ones that appear consistently under specific real-world conditions — before they appear in reviews.
Pre-launch: device coverage and edge case verification. Before shipping, ensure coverage across the specific device configurations that represent your target market — including the mid-range Android hardware that often gets under-weighted in testing because it's less pleasant to develop on.
Post-launch monitoring: user environment data. After launch, pay close attention to the device models and OS versions appearing in crash reports and negative reviews. These cluster — certain device/environment combinations fail consistently, and the pattern is visible in the data if you know what to look for.
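
For the post-launch stage, the clustering can start as a simple count. This sketch assumes crash reports and negative reviews have already been exported as records with `device` and `os` fields. The field names, the `failure_clusters` helper, and the sample data are hypothetical.

```python
from collections import Counter

def failure_clusters(reports, min_count=2):
    """Count reports per (device, os) pair and keep pairs seen at least min_count times.

    Returns a list of ((device, os), count) sorted by count, highest first.
    """
    counts = Counter((r["device"], r["os"]) for r in reports)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]

# Hypothetical export of negative reviews and crash reports.
reports = [
    {"device": "Phone A", "os": "Android 12", "text": "doesn't work"},
    {"device": "Phone A", "os": "Android 12", "text": "objects keep moving"},
    {"device": "Phone B", "os": "iOS 17", "text": "got hot and lagged"},
    {"device": "Phone A", "os": "Android 12", "text": "won't place anything"},
]

for (device, os_version), n in failure_clusters(reports):
    print(device, os_version, n)
```

Raw counts favour popular devices, so a fuller version would normalise each count by that device's share of installs before ranking.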

What a Structured AR Evaluation Actually Covers

A well-designed AR evaluation goes beyond "does the experience work" and maps the conditions under which it works well, works acceptably, and fails. The output should give your team enough information to make informed decisions about what to fix, what to document as a known limitation, and what to address through better user guidance rather than engineering changes.

Specifically, a structured evaluation should produce:

Environment failure map — which lighting conditions, surface types, and space geometries cause tracking failure or degradation, and how severe the failure is
Device performance profile — how the experience differs across device tiers, with specific attention to the gap between high-end and mid-range hardware
User comprehension findings — where users get confused, where they give up, and what feedback the app provides (or fails to provide) when the environment is unsuitable
Thermal and extended-use behaviour — how the experience degrades under sustained use on representative hardware
Prioritised recommendations — ranked by impact, distinguishing between engineering fixes, UX guidance improvements, and known limitations to document for users
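
As a sketch of how an environment failure map might be tabulated, the following groups tester sessions by one condition and counts outcomes using the three bands described above (works well, works acceptably, fails). The record layout, the outcome labels, and the `failure_map` helper are assumptions for illustration.

```python
from collections import defaultdict

OUTCOMES = ("good", "acceptable", "failed")

def failure_map(sessions, variable):
    """Summarise outcomes per level of one environmental variable.

    sessions: list of dicts with condition keys plus an 'outcome' in OUTCOMES.
    Returns {level: {"good": n, "acceptable": n, "failed": n, "failure_rate": r}}.
    """
    tally = defaultdict(lambda: dict.fromkeys(OUTCOMES, 0))
    for s in sessions:
        tally[s[variable]][s["outcome"]] += 1
    result = {}
    for level, counts in tally.items():
        total = sum(counts.values())
        result[level] = {**counts, "failure_rate": counts["failed"] / total}
    return result

# Hypothetical results from four tester sessions.
sessions = [
    {"surface": "blank wall", "lighting": "dim indoor", "outcome": "failed"},
    {"surface": "blank wall", "lighting": "office fluorescent", "outcome": "failed"},
    {"surface": "textured", "lighting": "dim indoor", "outcome": "acceptable"},
    {"surface": "textured", "lighting": "office fluorescent", "outcome": "good"},
]

for level, row in failure_map(sessions, "surface").items():
    print(level, row)
```

Calling the same function with `"lighting"` or a device-tier field gives the other views, which makes it easier to tell whether a failure belongs to the surface, the light, or the hardware.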

The value of knowing your limitations: One of the most underrated outcomes of AR evaluation is a clear, honest picture of what your app requires from the environment. "Works best in well-lit rooms with varied surfaces" is information your users need. Apps that communicate their environmental requirements clearly get far fewer frustrated reviews than apps that let users discover limitations for themselves.

The Competitive Case for Investing in AR Testing

AR is still early enough as a consumer category that the quality bar is uneven. Many AR apps ship with significant environmental reliability gaps that teams simply didn't have the process to discover before launch. This means that an AR product that genuinely works well across a broad range of real-world conditions is meaningfully differentiated — not just from a quality perspective, but as a competitive position.

Users who have been burned by AR apps that "don't work" are increasingly sceptical. An app that demonstrably works — in dim light, in small rooms, on mid-range hardware — earns trust that competitors who skipped environmental testing can't easily replicate. That trust shows up in ratings, in word of mouth, and in retention.

The investment required to get there is not as large as most teams assume. It doesn't require a specialised AR testing facility or a large budget. It requires real people, in real environments, on real devices — approached systematically, with a clear brief about what to look for and how to document what they find.

We test AR apps in real environments.

Our analysts evaluate your AR experience across real-world lighting conditions, surface types, device configurations, and usage patterns — then deliver a structured report your team can act on. Start with a Pilot for $300.

A Note on Emerging AR Platforms

Everything above applies to smartphone AR — ARKit on iOS and ARCore on Android — which remains the dominant delivery platform for consumer AR experiences. But the same principles apply, in sharper form, to dedicated spatial computing hardware: Apple Vision Pro, Meta Quest, and their successors.

Spatial computing platforms introduce additional environmental variables — hand tracking reliability, eye tracking accuracy, passthrough quality in different lighting conditions, ergonomic fatigue over extended sessions — that are even further removed from controllable lab conditions than smartphone AR. As these platforms mature and their install bases grow, the need for real-world human evaluation in diverse conditions will only increase.

The teams building on these platforms now, and investing in real-world evaluation early, will have a significant advantage when the market scales.