Comparing User Testing: Uxia's AI vs. Human
This report compares AI and human user testing, using the Amsterdam Public Transit App (GVB) as the case study.



About the Report
The report focused on a high-stakes, real-world scenario: a first-time tourist attempting to purchase a 1-hour travel ticket using a Mastercard and requesting a receipt. To ensure a fair fight, both Uxia’s AI and the human panel used identical prototypes, missions, and audience demographics (UK-based, ages 25–45).
Main Findings
Massive Gains in Speed and Cost
Uxia significantly outpaced traditional methods in terms of time and budget:
30x Faster Delivery: The full testing cycle (setup, execution, and analysis) took just 25 minutes with Uxia, compared to 748 minutes (over 12 hours) for the human panel.
Automated Analysis: While researchers spent over 4.5 hours manually reviewing human recordings, Uxia’s analysis time was effectively 0 minutes because it generates a ready-to-read report immediately.
Significant Savings: At a volume of 15 tests per month, Uxia costs $299/month, a saving of $550 compared to the $849/month required for a human-panel platform.
Superior Insight Detection
The AI testers proved to be far more observant than their human counterparts:
4.25x More Issues: Uxia surfaced 17 real usability issues, while the human testers detected only 4.
Zero Unique Human Insights: Every single issue flagged by the human panel had already been independently identified by the AI; the humans brought no unique findings to the table.
Critical "Blind Spots": All 10 AI testers flagged a serious trust issue regarding an external Dutch-language payment redirect. In contrast, not a single human tester commented on it, likely because they were rushing to complete the task for compensation.
Reliability and Engagement Depth
The quality of feedback revealed a stark "attentiveness gap" between the two groups:
7x More Commentary: AI transcripts averaged 2,200 words per session, compared to just 300 words from humans.
100% Success Rate: All AI tests were valid and usable, whereas the human panel suffered a 10% failure rate due to a technical audio issue that made one transcript unusable.
Active vs. Passive Testing: Human testers often clicked through screens in "automatic mode," spending only 3–5 seconds on onboarding slides. AI testers "thought out loud," questioning ambiguous labels and identifying multi-layered friction points.


