I have a folder on my laptop called “tool evaluations.” It has fourteen subfolders. Each one contains notes from a proof of concept I ran for a testing tool that was supposed to change everything. Fourteen tools over ten years. Fourteen sets of optimistic notes from week one, followed by increasingly frustrated notes from week four, followed by silence. I stopped updating the notes around the time I stopped believing the pitches.
If you are building or selling a test automation tool and you are reading this, I am not your enemy. I am your most honest customer. I want you to succeed. I have wanted every single one of those fourteen tools to succeed. But I need you to understand something that your sales team will never tell you: the people evaluating your product are exhausted. Not curious. Not excited. Exhausted. And if you lead with the same promises we have heard before, we are going to tune out before the demo is over.
The Script I Can Recite in My Sleep
Every new testing tool follows the same playbook. I could write the marketing page myself at this point.
“No more flaky tests.” Yes, I have heard this. The last tool said it too. My tests are still flaky. The tool before that also said it. Those tests were also flaky. Flakiness is not a feature problem. It is an architecture problem. If your tool still relies on DOM selectors to find elements, your tool will still produce flaky tests. You can put a nicer interface on top of it. You can add retry logic. You can call it “smart waiting.” But the brittleness is structural, and no amount of polish fixes a structural problem.
“Write tests in minutes, not hours.” Sure. I can write the test in minutes. I then spend hours maintaining it when someone changes a class name. The writing was never the bottleneck. The surviving was.
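The point above can be shown with a toy sketch. Nothing here is any real tool's API; the `find` helper and the page dictionaries are invented stand-ins for a selector-based test runner and the DOM it queries.

```python
# Toy illustration: a selector-bound test is fast to write and slow to keep
# alive. The page dicts below stand in for a rendered DOM; "find" stands in
# for any selector-based lookup (all names are invented for this sketch).

def find(page, selector):
    """Return the element matched by the selector, or None if it is gone."""
    return page.get(selector)

# Version 1 of the page: the test is written in minutes and passes.
page_v1 = {"button.login-btn": "Log in"}
assert find(page_v1, "button.login-btn") == "Log in"

# A developer renames the class during a refactor. The app behaves exactly
# the same, but the selector-bound test now fails and needs maintenance.
page_v2 = {"button.auth-submit": "Log in"}
assert find(page_v2, "button.login-btn") is None
```

The app never broke; only the selector did. That gap is the maintenance cost that never shows up in the "minutes, not hours" pitch.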
“Works across all browsers and devices.” This one always gets me. Yes, it works across browsers. For web. Then I ask about mobile. And there is an awkward pause, followed by “we have a mobile add-on” or “mobile support is on our roadmap.” Two separate products, two separate maintenance burdens, one shared disappointment.
What Actually Breaks My Trust
It is not the limitations that bother me. Every tool has limitations. What breaks my trust is when the tool pretends it does not.
- The demo that uses a perfect app. Every tool demo tests against a clean, well-structured, accessible application with semantic HTML and proper data-testid attributes everywhere. My app does not look like that. My app has legacy components from three different frameworks, auto-generated class names, and a checkout flow that was built by four different teams over six years. Show me your tool working on something messy, and I will pay attention.
- The “self-healing” that does not heal. I have been told multiple times that a tool's tests will “self-heal” when the UI changes. What actually happens is that the tool tries two or three fallback selectors, and when none of them work, the test fails anyway with a less useful error message than before. Self-healing that works 60% of the time is not a feature. It is a trap, because now I do not know if my test is passing because the app works or because the healer papered over a real change.
- The “AI-powered” label on what is clearly just a script generator. Using a large language model to generate Cypress code is not AI testing. It is AI-assisted test writing. The test itself still runs the old way, with the old fragility, and the old maintenance burden. Putting AI in the creation step while leaving the execution step untouched is a half measure that does not solve the problem I actually have.
I am not trying to be harsh. I am trying to save us both time. If your tool is genuinely different, show me the difference in the execution model, not the creation workflow. That is where the real problems live.
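The "self-healing" pattern described above is simple enough to sketch. This is a minimal caricature of how fallback-selector healing typically works, not any vendor's implementation; the selector strings and toy DOM sets are invented for illustration.

```python
# Minimal sketch of fallback-selector "self-healing": try an ordered list
# of selectors and report the first one that still matches. The toy DOM is
# just a set of selector strings that currently exist on the page.

def self_heal_find(dom, selectors):
    """Return the first selector that still matches the page, else None."""
    for selector in selectors:
        if selector in dom:
            return selector
    return None

# Original page: the primary selector works, so everything looks fine.
dom_v1 = {"button.submit-primary", "div.checkout"}
assert self_heal_find(dom_v1, ["button.submit-primary", "button.btn-submit"]) == "button.submit-primary"

# After a redesign, neither the primary selector nor the recorded fallback
# exists. The "healing" exhausts its list and the test fails anyway, now
# with a vaguer error because three lookups failed instead of one.
dom_v2 = {"button.css-x91kq2", "div.checkout"}
assert self_heal_find(dom_v2, ["button.submit-primary", "button.btn-submit"]) is None
```

The failure mode is structural: every fallback is still a selector, so a change that invalidates selectors invalidates the healer too.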
The Three Questions I Ask Now
After fourteen failed evaluations, I have distilled my criteria down to three questions. If a tool cannot answer all three with a yes, I do not proceed to a proof of concept. I cannot afford to. My team cannot afford to. The emotional cost of another migration to another disappointing tool is too high.
Question 1: Can I use the same test for web and mobile?
Not the same “framework” with different platform adapters. The same actual test. I describe the flow once, and it runs on Chrome, on iOS, on Android. If the answer involves maintaining separate test files or separate locator strategies per platform, it is not one tool. It is three tools in a trench coat.
Question 2: Will my functional tests survive a UI redesign?
If my design team changes the layout, the colors, the component structure, will my test that checks “user can log in and see their dashboard” still pass? If the test relies on any kind of selector (CSS, XPath, accessibility ID), the honest answer is no. The only honest yes comes from a tool that finds elements visually, the way a human would.
Question 3: Can the same tool do strict visual verification?
Functional forgiveness and visual strictness in the same product. Not “we integrate with a visual testing partner.” One tool, two modes. I want functional tests that ignore cosmetic changes and visual tests that catch every pixel. If I need two products for this, the complexity is doubled and the value is halved.
These are not unreasonable requirements. They are the obvious requirements. The fact that almost no tool meets all three tells you everything about how stuck the industry has been.
Why I Am Skeptical of “AI Testing” but Not of AI
Let me be precise about this because the distinction matters.
I am deeply skeptical of tools that call themselves “AI-powered” while using AI only for test generation or maintenance suggestions. Those are useful features. They are not a paradigm shift. The test still executes the same way. It still finds elements the same way. It still breaks the same way. You are using AI to write the script faster, not to change the fundamental nature of the script.
What I am not skeptical of is the potential for AI agents that interact with applications through vision. An agent that looks at a screen, sees a login form, types credentials, and clicks submit, without knowing or caring about the underlying DOM structure. That is not an incremental improvement. That is a completely different architecture for test execution.
When the execution model changes, everything downstream changes with it. Cross-platform support becomes trivial because the agent sees screens, not platform-specific element trees. Functional resilience becomes natural because the agent finds things by appearance and intent, not by selector. Visual verification becomes a mode switch, not a separate product, because the agent is already looking at the screen.
That is the kind of different I would actually believe. Not different in how tests are written. Different in how tests are run. Different at the level that actually determines whether my tests survive next month or become another maintenance burden.
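The cross-platform claim above can be made concrete with a toy model. This is not Yalitest's API or any real vision agent; the "screens" below stand in for what a vision system might extract from a screenshot (visible labels and roles), and all names are invented for this sketch.

```python
# Conceptual sketch of vision-based matching: the agent only sees what a
# human sees (visible label, element role), so the same test definition
# runs against screens whose internal structure is completely different.

# Stand-ins for vision output. Note the internal identifiers differ per
# platform, but the agent never reads them.
WEB_SCREEN = [
    {"label": "Email", "kind": "input", "dom_path": "div.x9f2 > input"},
    {"label": "Password", "kind": "input", "dom_path": "div.q1z8 > input"},
    {"label": "Log in", "kind": "button", "dom_path": "button.css-a8b1"},
]
IOS_SCREEN = [
    {"label": "Email", "kind": "input", "native_id": "UITextField_0"},
    {"label": "Password", "kind": "input", "native_id": "UITextField_1"},
    {"label": "Log in", "kind": "button", "native_id": "UIButton_3"},
]

def find_visually(screen, label, kind):
    """Match by visible label and role -- the only inputs a vision agent has."""
    for element in screen:
        if element["label"] == label and element["kind"] == kind:
            return element
    return None

def login_test(screen):
    """One test definition; runs unchanged against any platform's screen."""
    steps = [("Email", "input"), ("Password", "input"), ("Log in", "button")]
    return all(find_visually(screen, label, kind) for label, kind in steps)

# The same test passes on both platforms despite entirely different
# underlying element trees.
assert login_test(WEB_SCREEN) and login_test(IOS_SCREEN)
```

A redesign that moves or restyles the form leaves `login_test` untouched, because nothing in it references structure; only a change to what the user actually sees (the labels, the roles) would break it, and that is a change you would want a test to catch.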
What “Show Me” Looks Like
If you want to earn the trust of someone like me, someone who has been burned fourteen times, here is what you do. You do not lead with slides. You do not lead with architecture diagrams. You do not lead with customer logos.
You take my ugliest page. The one with the legacy components, the dynamic class names, the modal that loads asynchronously from a third-party script. You write a test against it. Then you have someone on my team change the layout. Move things around. Rename classes. Swap a library. And you show me the functional test still passes. Not because it healed itself with fallback selectors, but because it never relied on selectors in the first place.
Then you switch to visual mode on the same page and show me that it catches the three pixel padding change my designer just introduced. Functional forgiveness and visual precision. Same tool. Same test run.
Then you run the same test on the mobile app. Not a rewritten version. The same one. And it works, because the agent sees a login screen, not a platform-specific element hierarchy. Do that, and you will have my attention. Do that, and you will have earned something that no amount of marketing can buy: my trust.
The Real Cost of Being Wrong Again
I want to explain why this matters beyond the technical arguments. Every failed tool migration has a human cost that never shows up in the evaluation spreadsheet.
When you tell your team “we are switching to a new tool,” people invest emotionally. They learn the new syntax. They rewrite tests. They build workflows. They start to hope that maybe this time, the automation will actually automate instead of creating more work. And when the tool disappoints, when the same old problems resurface three months in, something breaks that is harder to fix than any test suite. People stop believing that things can get better. Cynicism sets in. The next time you suggest a new approach, you see it in their eyes before they even respond. “Sure. Whatever you say.”
That is the real cost of overpromising. Not the license fee. Not the wasted setup time. The erosion of a team's willingness to try. I have watched it happen to good teams, and it takes years to undo. So when I say I need you to prove it before I buy it, I am not being difficult. I am protecting the people who will have to live with the decision.
Why I Am Watching Yalitest Closely
I will tell you where I am right now. I am not sold on anything. But I am watching one approach more carefully than the others, and that is the vision-based agent approach that Yalitest is building.
Not because of the marketing. Because of the architecture. An AI agent that interacts with the screen visually is the only approach that structurally answers all three of my questions. One tool for web and mobile, because vision does not care about platform. Functional tests that survive redesigns, because the agent finds elements by what they look like, not where they sit in the DOM. Visual verification in the same product, because the agent is already looking at the screen.
I have been wrong fourteen times. I might be wrong again. But for the first time in a long time, the argument for why something is different is not about features or workflows or developer experience polish. It is about a fundamentally different execution model. And that, at least, is worth paying attention to. If you have been burned as many times as I have, you owe it to yourself to look at what is actually changing under the hood, not just what is changing on the surface.