This is a non-technical explanation of z-tests, with further links at the bottom for more advanced reads.
Z-tests compare the difference in conversion rate between two samples by measuring how much their distributions overlap. This overlap is summarised as the p-value. The smaller the overlap (the more distant the data sets are), the smaller the p-value, and therefore the more likely you are to have a significant result (whether positive or negative).
Complete separation of the two distributions would lead to probabilities of 100% (if conversion is higher) or 0% (if conversion is lower). Any overlap brings this number closer to 50%.
So, we hope for extreme changes in conversion rate, which are easier to identify statistically.
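For readers who want to see the arithmetic behind that overlap, here is a minimal Python sketch of a two-proportion z-test. The traffic and conversion numbers are invented for illustration; real tools wrap this same calculation in friendlier reporting.

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z-statistic and two-tailed p-value for two conversion counts."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis of "no difference"
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    # Two-tailed p-value: chance of a gap at least this large if nothing changed
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Example: 200/5,000 conversions for Control vs 250/5,000 for the Variation
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p-value = {p:.4f}")  # smaller p-value = less overlap
```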
What levers affect the output
Two levers drive the calculations here and the resulting output of normal distribution curves, p-values, and so on:
Traffic: The more traffic we have going through an experiment (the higher the sample size), the taller and tighter the bell curve will be. This is why we so often talk about tests needing more traffic to prove significance.
While conversion rates may not change as more traffic comes in, narrower bell curves will intersect less, and so significance is easier to deduce (see the sketch after this section).
Confidence level: This is how much of the bell curve we "pay attention to" - the number of standard deviations from the average we consider in a mathematical sense.
The lower the number, e.g. 70%, the fewer standard deviations / the thinner a slice (from the average) we consider for our overlaps, and so the more likely we are to see a clear gap between our variations. At the extreme, looking at 0% either side of our conversion rate, any difference in conversion rate would be instantly declared significant, but this would not allow for any amount of chance or variance.
The higher the number, e.g. 99%, the more standard deviations / the wider a slice (from the average) we consider for our overlaps, and so the less likely we are to see a clear gap between our variations. This is why with a higher confidence level, significance becomes harder to deduce.
Because our tests are two-tailed, the leftover risk (100% minus the confidence level you supply) is split in half between the two tails, and these form our boundaries for how much of the distribution we "pay attention to".
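Here is a minimal Python sketch of both levers in action. The 4% conversion rate and the sample sizes are invented for illustration.

```python
from statistics import NormalDist

# Lever 1: traffic. The standard error sets the width of the bell curve,
# and it shrinks as the sample size grows.
rate = 0.04
for n in (1_000, 10_000, 100_000):
    se = (rate * (1 - rate) / n) ** 0.5
    print(f"n = {n:>7}: standard error = {se:.4f}")

# Lever 2: confidence level. The leftover risk (1 - confidence) is split
# across the two tails, setting the boundaries in standard deviations.
for confidence in (0.70, 0.95, 0.99):
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    print(f"{confidence:.0%} confidence -> +/-{z_crit:.2f} standard deviations")
```

Narrower curves (more traffic) overlap less, and wider boundaries (higher confidence) demand a bigger gap - which is why proving significance at high confidence takes more traffic.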
Chance to beat control
Alongside Significance as a true/false output, you will find Chance to Beat Control - a Bayesian measurement. This is a percentage likelihood of one sample group outperforming another - e.g. the Variation has a 100% Chance to Beat Control.
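A common way to estimate such a figure is to draw repeatedly from Beta posteriors and count how often the variation wins, as in the sketch below. Your testing tool may use a different (possibly closed-form) method, and the traffic numbers here are invented.

```python
import random

def chance_to_beat_control(conv_a, n_a, conv_b, n_b, draws=100_000):
    """Monte Carlo estimate of P(variation rate > control rate)."""
    wins = 0
    for _ in range(draws):
        # Beta(successes + 1, failures + 1): posterior under a uniform prior
        control = random.betavariate(conv_a + 1, n_a - conv_a + 1)
        variation = random.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += variation > control
    return wins / draws

print(f"{chance_to_beat_control(200, 5000, 250, 5000):.1%} chance to beat control")
```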
Consider the following:
95% chance to beat control - this would require a confidence level of up to 90% to be flagged as significant.
At 95% confidence, split in half as it's 2-tailed, we would require 97.5%+ chance to beat control for a change to be flagged as significant.
98% chance to beat control - this would require a confidence level of up to 96% to be flagged as significant.
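The general rule implied by these examples: with a two-tailed test, the chance-to-beat-control threshold is the confidence level plus half of the leftover risk. A quick check in Python:

```python
# threshold = confidence + (1 - confidence) / 2, i.e. 1 - (1 - confidence) / 2
for confidence in (0.90, 0.95, 0.96, 0.99):
    threshold = 1 - (1 - confidence) / 2
    print(f"{confidence:.0%} confidence -> needs {threshold:.1%}+ chance to beat control")
# 90% -> 95.0%+, 95% -> 97.5%+, 96% -> 98.0%+, 99% -> 99.5%+
```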
To summarise:
Confidence level is your appetite for risk.
Chance to beat control is your numeric output.
Significance is your decision