Validating parity: GVB
The report focused on a high-stakes, real-world scenario: a first-time tourist attempting to purchase a 1-hour travel ticket using a Mastercard and requesting a receipt. To ensure a fair comparison, both Uxia’s AI and the human panel worked from identical prototypes, missions, and audience demographics (UK-based, ages 25–45).

SUMMARY

Main findings
Massive gains in speed and cost
Uxia significantly outpaced traditional methods in terms of time and budget:
30x Faster Delivery: The full testing cycle (setup, execution, and analysis) took just 25 minutes with Uxia, compared to 748 minutes (over 12 hours) for the human panel.
Automated Analysis: While researchers spent over 4.5 hours manually reviewing human recordings, Uxia’s analysis time was effectively 0 minutes because it generates a ready-to-read report immediately.
Significant Savings: At a volume of 15 tests per month, Uxia costs $299/month, a saving of $550 compared to the $849/month required for a human-panel platform.
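The headline speed and cost figures above follow directly from the raw numbers in the report. A minimal sanity check (all values are taken from the report; nothing here is newly measured):

```python
# Arithmetic behind the speed and cost claims; figures come from the report.
uxia_minutes = 25      # full cycle with Uxia: setup, execution, analysis
human_minutes = 748    # same cycle with the human panel (over 12 hours)

speedup = human_minutes / uxia_minutes
print(f"Speedup: {speedup:.0f}x")        # ~30x faster delivery

uxia_monthly = 299     # USD, Uxia tier covering 15 tests/month
panel_monthly = 849    # USD, comparable human-panel platform

savings = panel_monthly - uxia_monthly
print(f"Monthly savings: ${savings}")    # $550/month
```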
Superior insight detection
The AI testers proved to be far more observant than their human counterparts:
4.25x More Issues: Uxia surfaced 17 real usability issues, while the human panel detected only 4.
Zero Unique Human Insights: Every single issue flagged by the human panel had already been independently identified by the AI; the humans brought no unique findings to the table.
Critical "Blind Spots": All 10 AI testers flagged a serious trust issue regarding an external Dutch-language payment redirect. In contrast, not a single human tester commented on it, likely because they were rushing to complete the task for compensation.
Reliability and engagement depth
The quality of feedback revealed a stark "attentiveness gap" between the two groups:
7x More Commentary: AI transcripts averaged 2,200 words per session, compared to just 300 words from humans.
100% Success Rate: All AI tests were valid and usable, whereas the human panel suffered a 10% failure rate due to a technical audio issue that made one transcript unusable.
Active vs. Passive Testing: Human testers often clicked through screens in "automatic mode," spending only 3–5 seconds on onboarding slides. AI testers "thought out loud," questioning ambiguous labels and identifying multi-layered friction points.