1. Verifying a fix for a flaky test
1.1. Run the attempted fix repeatedly over a period of time before declaring it validated
1.1.1. Jenkins job that runs every hour
1.1.2. Command-line trigger
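A command-line trigger for item 1.1 could look roughly like the sketch below. It reruns a test command a fixed number of times and counts the failing runs. The function name run_repeatedly and the test command shown in the comment are hypothetical; substitute whatever command the project actually uses.

```python
import subprocess
import sys


def run_repeatedly(command, runs):
    """Run `command` `runs` times and return the number of failing runs."""
    failures = 0
    for attempt in range(1, runs + 1):
        # A non-zero exit code is treated as a failed run.
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failures += 1
            print(f"run {attempt}/{runs} failed with exit code {result.returncode}")
    return failures


# Hypothetical usage (the test path is an assumption, not from the outline):
#   failed = run_repeatedly([sys.executable, "-m", "pytest", "tests/test_login.py"], runs=50)
#   print(f"{failed} of 50 runs failed")
```

Zero failures over many runs raises confidence in the fix but does not prove it; the number of runs should be chosen against how often the test used to fail.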
2. What is a flaky test
2.1. Definition
2.1.1. A test that fails intermittently rather than on every run
3. Remediation strategies
3.1. Best case
3.1.1. Find and fix the root cause
3.2. Life has to move on
3.2.1. Try/catch for locators
3.2.1.1. StaleElementReferenceException
3.2.1.2. If an element is not found the first time, repeat the previous action and look for it again
3.2.1.2.1. Especially helpful with gestures
3.2.2. Retry mechanism for failures
3.2.2.1. Make sure retry count is acceptable
3.2.2.2. Also make sure it's not masking a real failure
3.3. Worst case
3.3.1. Disable them, provided the coverage lost is not too high
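The retry ideas in 3.2.1 and 3.2.2 can be sketched as a small decorator with a bounded retry count. This is a minimal illustration, not a specific framework's API: retry, StaleElementError, and the parameter names are all hypothetical, and StaleElementError only stands in for whatever exception the locator library actually raises.

```python
import functools
import logging
import time

logger = logging.getLogger("flaky")


class StaleElementError(Exception):
    """Hypothetical stand-in for a WebDriver stale-element exception."""


def retry(exceptions, max_attempts=3, delay_seconds=0.0):
    """Retry the wrapped callable on `exceptions`, at most `max_attempts` times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as error:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the real failure
                    # Log every retry so a pass-on-retry stays visible in the logs.
                    logger.warning("%s failed on attempt %d/%d: %r",
                                   func.__name__, attempt, max_attempts, error)
                    time.sleep(delay_seconds)
        return wrapper
    return decorator
```

Retrying only a named exception type and logging each retry are the two choices here that address 3.2.2.1 and 3.2.2.2: the count stays bounded, and a test that passes only on its second or third attempt still leaves a trace.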
4. Possible causes
4.1. Test design
4.1.1. Test data
4.1.1.1. Does the test case use the same test data for each run?
4.1.1.1.1. Shared logins
4.1.2. Bad waiting mechanisms
4.1.2.1. Explicit vs implicit waits
4.1.3. Order dependent
4.1.3.1. Tests executed later assume a certain state that is set by earlier tests
4.2. Test synchronization
4.2.1. Application/Backend
4.2.1.1. Add temporary sleeps between test steps to diagnose timing issues
4.2.1.1.1. Make sure to remove them after diagnosing
4.3. Test execution
4.3.1. Test isolation
4.3.1.1. Running along with other tests vs running in isolation
4.4. Environment
4.4.1. Local vs remote
4.4.1.1. Does this fail locally or only on the CI system?
4.4.2. Different browsers/devices
4.4.2.1. Mobile
4.4.2.1.1. Does the error happen on all devices?
4.4.3. Build flavor
4.4.3.1. Test vs Dev vs Staging vs Prod
4.4.3.2. Tests that passed in one may not pass in another
4.4.4. Build infrastructure scalability
4.4.4.1. Jenkins
4.4.4.1.1. Can it handle the peak load of multiple jobs?
4.4.4.1.2. Are other build jobs hogging the system?
4.4.4.2. External cloud providers
4.4.4.2.1. Is the current tier able to scale?
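For the waiting mechanisms in 4.1.2 and the diagnostic sleeps in 4.2.1.1, one common alternative to fixed sleeps is to poll for a condition with a timeout. The helper below is a generic sketch of that idea; wait_until and its parameters are hypothetical names and not part of any particular test framework.

```python
import time


def wait_until(condition, timeout_seconds=10.0, poll_interval_seconds=0.2):
    """Poll `condition` until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = condition()
        if result:
            return result  # condition met: hand back whatever it produced
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_seconds} seconds")
        time.sleep(poll_interval_seconds)
```

Unlike a fixed sleep, this returns as soon as the condition holds and fails with a clear timeout error when it never does.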
5. Evidence
5.1. Logs
5.1.1. Error messages
5.1.1.1. Is it the same error message every time it fails?
5.1.2. Stacktrace
5.1.2.1. Does the error originate from the same line of code?
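The two questions in 5.1.1.1 and 5.1.2.1 can be answered by grouping collected failures by error message and originating line. The sketch below assumes each failure has already been parsed into a dict with "message" and "location" keys; that record shape, the function name summarize_failures, and the sample values are hypothetical.

```python
from collections import Counter


def summarize_failures(failures):
    """Count failures per (error message, originating line) pair."""
    return Counter((failure["message"], failure["location"]) for failure in failures)


# Hypothetical sample records, for illustration only.
sample_failures = [
    {"message": "element not found", "location": "test_cart.py:42"},
    {"message": "element not found", "location": "test_cart.py:42"},
    {"message": "request timed out", "location": "test_cart.py:17"},
]

for (message, location), count in summarize_failures(sample_failures).most_common():
    print(f"{count}x {message} at {location}")
```

A single dominant pair points toward one root cause; many different pairs suggest an environment or infrastructure problem rather than one bug in the test.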