• This software requires too many tests!
• You have too many variations of your test suite to run them all!
• To test all combinations of the inputs for this function would take millions of tests!
Do you have a problem like this? There’s help and it’s called Pairwise Testing.
About This Blog
There’s been a lot written about pairwise testing.
What I’m going to cover are the use cases:
• Why you care: what two situations are ideal for this technique
• The risks and benefits and how to use it to deliver the best effect
• How best to use pairwise testing
• How to create a great oracle to handle pairwise results
• Tools to standardize on
Example 1 – Testing a Web App
To understand what problem pairwise testing solves, let’s start with an example. You are testing a web app, but you’ve got a suite of automated tests, so this is doable, right?
Let’s say you need to test across:
• 4 browsers
• 5 client platforms (2 phones and 3 OS’s)
• 5 languages
• 3 screen sizes (phone, tablet, and desktop)
You would run your test suite across 17 variations (4 + 5 + 5 + 3 = 17), right?
But what if there’s a bug that will only happen due to a combination of these variations? The full combinations could take 300 test passes of every test in the suite (4 * 5 * 5 * 3 = 300). If your suite takes an hour to run, this would probably take days or weeks to run.
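The two counts above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the option counts come from the example, while the option names (B1, P1, and so on) are placeholders.

```python
from itertools import product

# Option counts from the example; the names themselves are placeholders.
factors = {
    "browser": ["B1", "B2", "B3", "B4"],
    "platform": ["P1", "P2", "P3", "P4", "P5"],
    "language": ["L1", "L2", "L3", "L4", "L5"],
    "screen": ["phone", "tablet", "desktop"],
}

# Vary one input at a time: add the option counts.
one_at_a_time = sum(len(values) for values in factors.values())

# Vary everything together: multiply the option counts.
all_combinations = list(product(*factors.values()))

print(one_at_a_time)          # 17
print(len(all_combinations))  # 300
```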
Is it realistic to think that bugs come from combinations of inputs? Yes! In his classic work on pairwise testing, Kuhn found that 28% of browser app bugs were triggered by a single input, while 76% were triggered by two or fewer inputs.
What if I told you that 26 tests would cover enough combinations to give you a 76% confidence that you’ve found all the bugs from those combinations? In most cases, you’re going to go with the 26 variations. It’s not much more than the 17 tests and gives a LOT more confidence. If you have very high-quality requirements you might run 103 tests for 95% confidence. That’s still better than having to do all 300 combinations.
This technique for reducing the number of combinations to test is called “pairwise testing” or “combinatorial testing”. It’s been around for decades, but most developers and testers don’t learn it in school so they miss out on a great technique to improve productivity.
How much confidence do you need?
As we implied in the example above, you can test pairs of browser app inputs for 76% confidence that you are finding all the bugs, but some bugs require three different specific inputs to happen. In that study, testing triples found 95% of browser app bugs. Increasing the N-tuples increases your confidence, though the return diminishes rapidly and N>6 should never be needed.
It’s worth noting that these numbers are from an old study of complex, monolithic software. In a more recent comparison of techniques, 98% of bugs seeded in 5 applications were found through pairs of inputs.
Which numbers are right? We can’t know for your app, but, generally speaking, pairwise testing will find the majority of issues. For complex legacy software, triples may take you from 76% to 95%, but in newer components it may not give more than 1% or 2% more coverage. Your mileage will vary, but it could be expected that if you have an app with low cyclomatic complexity, pairwise testing will likely be enough for commercial purposes.
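More confidence also costs more tests. As a rough sketch of that cost, any suite that covers all N-tuples needs at least as many tests as the product of the N largest input sizes, because every combination of those N inputs must appear in its own test. The function name below is my own, and the sizes are the 4, 5, 5, and 3 options from Example 1.

```python
from math import prod

def min_tests_lower_bound(sizes, t):
    """Floor on the size of a t-way suite: product of the t largest input sizes."""
    return prod(sorted(sizes, reverse=True)[:t])

sizes = [4, 5, 5, 3]  # browsers, platforms, languages, screen sizes

print(min_tests_lower_bound(sizes, 2))  # 25
print(min_tests_lower_bound(sizes, 3))  # 100
print(min_tests_lower_bound(sizes, 4))  # 300
```

This is only a floor, not what a tool will actually produce, but it is in line with the 26 tests for pairs and 103 tests for triples quoted in Example 1.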
When to use pairwise testing
There are two places this gets used:
1) running the same tests under multiple variations or environments
2) testing a complex function with two or more coupled inputs
The first case is like the situation in our example above. You are testing the same functionality across many variations.
This comes up in:
• compatibility testing
• setup testing
• upgrade testing
• config switch testing
The second common use for pairwise testing is for a complex function or UI where the output depends on several inputs. Let’s add an example of that below. When testing a function, pairwise testing not only improves productivity, but also improves quality at the same time.
Example 2 – the OpenFile function
Pairwise testing also helps when testing a complex function or API where the output depends on several inputs. To see that, let’s look at the Windows function to open a file.
This function opens a file named lpFileName with one or more of 15 uStyle flags where the flags are a collection of bits (0 or 1 in the table below) stored in an unsigned integer. The problem is that there are 32,768 total combinations of those uStyle flags (2^15). It’s very unlikely that we are going to be able to execute that many tests, especially as we’ll have to multiply that number by the number of unique classes of filenames we can think of to get the real total number of cases for both inputs. Using pair combinations, you can test all 15 style flags with just 10 tests! That’s even better than testing all 15 flags separately.
Here are all the tests:
You can see that, for instance, flags #1 and #2 have all four pair combinations shown circled in tests 1, 2, 3, and 5. Every pair combination of the 0’s and 1’s for every flag are represented.
To make these 10 tests work, our test validation must check all 15 uStyle properties of the open file on every test instead of just one flag per test. It’s a best practice to build a simple test oracle that knows what the output should look like given the inputs. We’ll see how to do that later.
How do we get that table of tests? The practical answer is that we use a tool. It’s beyond this post to go through the math, but if you are interested, there’s a simple visualization in the appendix.
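Whatever tool produces the table, the claim that every pair combination is represented can be checked mechanically. Here is a minimal sketch of such a check. The function name and the small 3-flag table are hypothetical and are not the 15-flag OpenFile table.

```python
from itertools import combinations, product

def missing_pairs(tests, domains):
    """Return the (col_a, col_b, val_a, val_b) pairs that no test row covers."""
    missing = []
    for a, b in combinations(range(len(domains)), 2):
        seen = {(row[a], row[b]) for row in tests}
        for va, vb in product(domains[a], domains[b]):
            if (va, vb) not in seen:
                missing.append((a, b, va, vb))
    return missing

# Hypothetical table: 3 on/off flags covered by 4 tests.
domains = [(0, 1)] * 3
tests = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

print(missing_pairs(tests, domains))      # []
print(missing_pairs(tests[:3], domains))  # the pairs only the dropped test covered
```

An empty list means the table is a valid pairwise set. Anything else is a list of the pairs still to be covered.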
Other Requirements For Using Pairwise Testing
Pairwise testing is effective when the inputs are
• directly dependent on each other OR semi-coupled to each other
and less effective for
• Mathematical formulas
• Sequential operations
• Ordered inputs
• Stateful or asynchronous systems
Sometimes one of these less effective cases can be approximated or modeled as a simple input/output function where pairwise testing can work. Example: a UI wizard has a group of ordered dialogs, but when you hit the Finish button, it takes the results collected by those dialogs as an unordered set and executes them. Therefore, the results of the wizard can be modeled with pairwise testing. Asynchronous UI’s and DB’s can also be tested with pairwise testing when you are not concerned about the asynchronous or ordered aspects.
There are best practices when doing pairwise testing. Here’s the general pattern you’ll follow:
- Identify inputs and give them names
- Create equivalence classes
- Give names to equivalence classes. Avoid hard-coding variables in a range
- Validate your model
- modify the model, not the inputs
- Identify constraints
- Consider weighting and seeding
- Improve the model
- What’s enough? Pairs, triples, etc.
- Use randomly generated values within the ranges
Let’s look at each of these steps.
What are the inputs you are dealing with? For a function or a class this might be easy: just pick the parameters or the settable properties respectively. For a test matrix, you need to identify the separate variables. Imagine you are looking at two variables: Operating system and Browser type. You need to ask:
• Are they relevant to the test?
• Are they separate? Can they vary independently for the test? If not, skip the ones that cannot.
• Are they coupled? Could a combination of the two have functionality that neither has on its own? Remember that this technique isn’t used for independent variables.
Create a label for each input. It helps to use a chart where the inputs are below the column headers.
Let’s use this font dialog in figure #1 as an example we want to test. What are the inputs?
Create Equivalence Classes
Once you have the list of inputs, you need to know the ranges you will test. This process is known as equivalence partitioning or creating equivalence classes. The simple form of the idea is to divide the range of possible values for each input into ranges that have equivalent behavior.
For example: for most number inputs, there’s a low value and a high value beyond which the function will not work. For this simple case where all values in the middle behave the same way there are between 3 and 5 classes, depending on how you count the boundaries. They are:
1) below the bottom (invalid low)
2) the lowest valid value, or low boundary
3) a number in the valid range of values
4) the highest valid value, or high boundary
5) above the valid range (invalid high)
The boundary values might be just part of the valid range, or they might be values you want to test specifically.
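As a sketch, the five classes for a simple numeric input can be written as a small function. The class names and the valid range of 8 to 72 are hypothetical and chosen only for illustration.

```python
def classify(value, low, high):
    """Map a number to one of the five classes for the valid range [low, high]."""
    if value < low:
        return "invalid_low"
    if value == low:
        return "low_boundary"
    if value < high:
        return "valid_middle"
    if value == high:
        return "high_boundary"
    return "invalid_high"

# Hypothetical valid range of 8 to 72, for example a font size field.
for v in (7, 8, 20, 72, 73):
    print(v, classify(v, 8, 72))
```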
For our example font dialog, we would have something like this:
Name each of your classes. Don’t use a representative number. You’ll want to be able to change the number later as needed. In the example above, we could have used:
but by giving each class a range and each range a name, we can more easily understand what the range is for and change what input value we use within that range.
When you are using the output values it can be helpful to randomly generate values within the equivalence class range. If you are wrong and there is some boundary within the range that behaves differently, this will catch it.
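One way to do that is to keep the named classes in a table and draw a concrete value only when the test runs. This is a sketch under assumptions: the class names, ranges, and `pick_value` helper are hypothetical.

```python
import random

# Hypothetical named classes for a size input whose valid range is 8 to 72.
SIZE_CLASSES = {
    "invalid_low": (-100, 7),
    "low_boundary": (8, 8),
    "valid_middle": (9, 71),
    "high_boundary": (72, 72),
    "invalid_high": (73, 1000),
}

def pick_value(class_name, rng):
    """Draw a concrete integer from the named class's inclusive range."""
    low, high = SIZE_CLASSES[class_name]
    return rng.randint(low, high)

rng = random.Random(1234)  # fixed seed so a failing value can be reproduced
print(pick_value("valid_middle", rng))
print(pick_value("invalid_high", rng))
```

The pairwise tool only ever sees the class names. The actual numbers can change from run to run without touching the model.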
Validate and improve your model
Use a tool to generate a set of outputs and review them. Does the output make sense? It might have combinations that don’t work. You could also see where there should be special test combinations that are important to you that just aren’t there. When this happens, don’t edit the output! Change the model that creates the output. Many tools allow you to add constraints, weights, sub-models, or seeds to change the output.
For many systems you’ll test, there will be combinations that just don’t make sense. In our example font dialog, the brush script font will always be italic. Likewise, for the monotype font, it’s either normal or bold and italic. The Pict.exe tool represents those constraints as below:
Many pairwise testing tools can handle constraints like this to reduce the total number of tests to just those tests that make sense.
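To see what the two font constraints do, here is a rough Python sketch that filters the full set of combinations. It is not how Pict works internally, since a pairwise tool applies constraints while it generates the reduced set. The "Arial" entry is a hypothetical unconstrained font added for contrast.

```python
from itertools import product

fonts = ["Arial", "BrushScript", "Monotype"]  # "Arial" is a hypothetical extra
bold = [False, True]
italic = [False, True]

def is_valid(font, is_bold, is_italic):
    """Apply the two constraints described above."""
    if font == "BrushScript" and not is_italic:
        return False  # brush script is always italic
    if font == "Monotype" and is_bold != is_italic:
        return False  # monotype is either normal or bold and italic
    return True

valid = [combo for combo in product(fonts, bold, italic) if is_valid(*combo)]
print(len(valid))  # 8 of the 12 raw combinations remain
```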
Consider Weighting, Seeding, and Sub-Models
Where your tool will support it:
• weighting allows you to insist that some classes appear more often in the output
• seeding allows you to insist that certain combinations will happen
• sub-models allow you to combine a few inputs together as one model before combining with other inputs
• random seeds can allow the model to generate different tests and possibly catch new issues. A fixed seed is better for reproducibility.
The Perfect Oracle for Pairwise Testing
Because you must validate a system with a complex set of inputs and outputs, it really helps to know what the expected result should be without having to specify it for every test. This means using a test oracle – some code that will give you the expected answer using a very simple algorithm. The ideal test oracle is one that is easy to understand and easy to add to as you learn more about the system under test. For pairwise tests, that oracle is a special kind of chain of responsibility pattern.
Trying to remember what a chain of responsibility is? A simple form is just a series of IF/THEN statements, where each THEN statement breaks out of the series to call some functionality. For pairwise testing, we use an oracle that first tests the worst-case exception, then the 2nd worst case, and so on, till we get to “else it works”.
Test Oracle Example
We’re testing a command-line tool that copies data from SQL table 1 to SQL table 2. It has two input parameters which are the source table and the destination server.
Each input can have several possible equivalence classes (i.e. – ranges of inputs), which are represented in the table below.
Here’s our oracle for this tool:
First, notice that we can easily add to the oracle when we learn new things. If we discover later that when source data is “no data” there should be an error, we can easily add it. If we decide later that there are more rows in our table, that’s easy to add too.
Second, notice how easy it is to turn the oracle into code. It practically already is code. If you copy and paste it into your code editor, it can be the comments you code around.
Third, you are not limited to “official” inputs. You can add columns for external influences like “out of memory” or a prior state.
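In code, an oracle of this shape might look like the sketch below. Everything specific in it is an assumption made for illustration: the equivalence class names, the order of the checks, and the result strings are hypothetical, not the real behavior of the copy tool.

```python
from itertools import product

# Hypothetical equivalence classes for the two inputs.
SOURCE_TABLES = ["missing", "empty", "has_rows"]
DESTINATION_SERVERS = ["unreachable", "read_only", "ok"]

def expected_result(source_table, destination_server):
    """Return the expected outcome, checking the worst case first."""
    if destination_server == "unreachable":
        return "error: cannot connect"
    if source_table == "missing":
        return "error: source not found"
    if destination_server == "read_only":
        return "error: cannot write"
    if source_table == "empty":
        return "success: 0 rows copied"
    return "success: rows copied"  # else it works

# Each test case is the inputs plus the result the oracle predicts.
cases = [(s, d, expected_result(s, d))
         for s, d in product(SOURCE_TABLES, DESTINATION_SERVERS)]
for case in cases:
    print(case)
```

Adding a newly discovered rule means adding one more `if` at the right place in the chain. The existing test cases pick up the new expectation automatically.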
Test cases contain input data and expected results:
Pairwise Testing is Modeling (and the limits of pairwise testing)
When most people think of modeling software, they think of finite state models. Those are good when you test something where a change in state changes the outcome. Pairwise testing, when combined with an oracle, is an even simpler form of modeling called a “stateless model”. A stateless model works where you can map the inputs directly to the outputs. Models are never the same as the real system and that means:
“The most that can be expected from any model is that it can supply a useful approximation to reality:
All models are wrong; some models are useful.” – G. Box
The less you know about the system under test, the greater the probability that the model will report the wrong results – false positives or negatives. Missed equivalence classes and boundaries are the most common reason for this kind of model to fail. Conversely, the more you know about what you are testing, the better pairwise models work.
Example visualization of reducing a coverage array
Rather than show the whole math it takes to get pair combinations, let’s just look at one example. Say we simplify the Font dialog example to just the 4 flags that describe font styles. The full expansion of all combinations of four flags is a total of 16 tests (2^4). We show every combination of those four flags in the table below.
There are four tests that have a yes for bold and italic. Let’s pick the first one – test #2. That also gives us a no for underline and strike-through, and a yes/no for italic and underline, italic and strike-through, bold and underline, and bold and strike-through. Just that one choice gives us 6 combinations. For the next test, let’s pick #7 because it’s very different from #2. This gives us 6 more combinations! We now have 12 of the 24 pairs (6 pairs of flags times 4 yes/no combinations each) with just two tests. Let’s grab #11 as it’s the next available yes/no for bold and italic. It gives us 5 additional pairs (one is a duplicate of what we have). We have just 7 to go and #16 gives us 4 of them. The last 3 can be attained through #14.
In each case we picked a test that maximized the number of unique pairs (6, 6, 5, 4, then 3) that it gave us.
In summary, we can see that if we run just 5 tests (numbers 2, 7, 11, 14, and 16), then we have all the pair combinations. Getting those 5 tests is a mathematical row reduction problem. Many tools can do it, including Excel (with the Solver add-in) and Mathematica, but you don’t need to do the math yourself. Test tools, like NUnit, do a simple reduction which won’t find the smallest possible number of tests, though they are typically good enough. Others, like Pict, are more thorough and will allow you to eliminate invalid combinations.
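The hand selection above is a greedy strategy: keep picking the test that adds the most new pairs. Here is a minimal sketch of that idea in Python. It is a simple reduction of my own and not the algorithm used by Pict or NUnit. Ties are broken differently than in the walkthrough, so it may pick different tests and will not necessarily reach the minimum of 5.

```python
from itertools import combinations, product

def pairs_of(row):
    """All (column, column, value, value) pairs that one test row covers."""
    return {(a, b, row[a], row[b]) for a, b in combinations(range(len(row)), 2)}

def greedy_pairwise(domains):
    """Repeatedly pick the full-factorial row that covers the most uncovered pairs."""
    candidates = list(product(*domains))
    uncovered = set().union(*(pairs_of(row) for row in candidates))
    suite = []
    while uncovered:
        best = max(candidates, key=lambda row: len(pairs_of(row) & uncovered))
        suite.append(best)
        uncovered -= pairs_of(best)
    return suite

flags = [("yes", "no")] * 4  # bold, italic, underline, strike-through
suite = greedy_pairwise(flags)
for row in suite:
    print(row)
print(len(suite), "tests instead of", 2 ** 4)
```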
A Few Targeted Test Tools
There are two typical ways to use pairwise testing in practice. One is to generate the test variations in advance and the other is to generate them at run-time. There are lots of tools that can do both. I’ll pick just two, because they will do the job for you.
Creating the Test List in Advance
The tool PICT (for Pairwise Independent Combinatorial Testing) is a simple command line tool that you can drive with a config file.
An example file for a font dialog could look like:
The output is the list of generated tests.
Notice that Pict supports constraints with the if-then statements at the end of the file. These prevent you from generating tests that don’t exist.
Pict supports other N-tuples, like triples, quadruples, etc.
Creating the Test List on the Fly
When you want to test a function with coupled inputs, it’s best to generate the tests on the fly. It’s common to need this when doing unit testing, and since most unit testing is done with some form of NUnit, let’s look at that tool.
NUnit supports simple pairwise testing. It has limitations:
1) It will find a small set of tests, but doesn’t always find the smallest set of tests.
2) It doesn’t support constraints. If there’s a combination that is invalid in your test matrix you must handle it in code with an if-then.
3) It only supports testing pairs, not triples and above.
Example pairwise unit test from the NUnit docs:
If you need a better tool you can call within your code, look for the appropriate one for your language at http://pairwise.org/tools.asp.
 “More than 70% of bugs were detected with two or fewer conditions (75% for browser and 70% for server) and approximately 90% of the bugs reported were detected with three or fewer conditions (95% for browser and 89% for server). […] It is interesting that a small number of conditions (n<=6) are sufficient to detect all reported errors for the browser and server software.” R. Kuhn and M. J. Reilly, 2002
 K. Tatsumi. Test case design support system. In Proceedings of International Conference on Quality Control (ICQC), Tokyo, 1987. Pages 615-620.
 Evaluation of Combination Strategies for Test Case Selection (Grindal et al.), where pairwise testing is identified as AETG. https://cs.gmu.edu/~offutt/rsrch/papers/evalcombstrat.pdf
 As created by BJ Rollison for his pairwise article: https://www.stickyminds.com/sites/default/files/presentation/file/2013/08STRER_T19.pdf