This article reports on an experiment I have been conducting with my current team, investigating the relationship between story estimates and the sum of task estimates in order to determine whether story estimates are sufficient for iteration and release planning.
Prediction is difficult, especially of the future. – Niels Bohr
When I started with my current team, I saw the following cycle repeat over several iterations:
1. An iteration planning session would occur during which some stories were chosen
2. The chosen stories would be broken down into tasks and estimated
3. Developers would express concern about the estimated amount of work required for the iteration
4. Work would proceed with each pair of developers working on a story
5. The iteration would finish with many stories unfinished and some not yet started. This made the next iteration planning session difficult because of the leftovers, and left us with a general sense of dissatisfaction about the iteration.
This cycle would continue until enough stories were completed to constitute a release, which frequently occurred in the middle of an iteration.
There was no release planning[1].
There are several issues that come out of the description above (lack of release planning, lack of control over the iteration structure, the independence of tasks in the breakdown of each story, etc.) but the one I want to focus on here is the lack of essential information in the iteration planning session.
What was missing from the process were estimates for the stories under consideration in iteration planning, and a velocity for the last iteration, calculated as the sum of the estimates of the planned stories.
When I asked the team why they did not estimate the stories and use these estimates for their planning, they expressed the same concerns that I’ve heard many times before: they felt that story estimates would be too inaccurate for planning purposes, and that stories were frequently unestimable – it was the task breakdown that brought out many of the issues which business analysts then needed to remediate. This frequently caused stories to be pulled from the iteration, invalidating a great deal of the results of the iteration planning session.
With this in mind, I proposed a safe experiment to determine the following:
1. Is there a simple (linear) relationship between story estimates and sum of task estimates and is this relationship strong enough to allow the use of story estimates for release and iteration planning?
2. Is the relationship between story estimates and the sum of task estimates stable over a period on the scale of a few releases, so that longer term planning can also be achieved?
Explicitly excluded from this experiment was anything to do with comparing estimates to actuals. Also, the idea was not to change the process as we were doing the experiment, so apart from sessions to estimate stories, everything else remained the same.
To start the experiment, we needed to produce estimates for each of the stories that were likely to come up over the next few iterations. The business analysts were able to tell us which stories these were, although a few stories did subsequently get dropped. During every iteration, any new stories were also estimated.
We did not want to spend a huge amount of time on this. In particular, we did not want to do task breakdowns for all of the stories in advance. However, we did want to try to bring out more of the issues to allow the business analysts time to work out the missing pieces before putting the story into an iteration.
Task breakdowns were done as they always had been – immediately after the iteration planning session.
Collection of Results
The results were collected over 10 iterations (each iteration is one week long). This resulted in poker and task estimates for 41 stories. During this time, the number of developers in the team varied between eight and ten. The team works on three different projects, with developers moving between the projects fairly fluidly. One particular developer was on the project under observation for the duration of the experiment and he was joined by three other developers drawn from the pool at the beginning of each iteration. When poker estimates were required, as many developers as possible were involved, up to a maximum of about six.
The consequence of this is that the poker estimates were produced by different people over the course of the experiment and the task breakdowns and work were not necessarily done by the developers who produced the poker estimates[2]. This is not an ideal situation. However, it did not work out too badly as we can see from the results and there was a general sense of team ownership of, and commitment to, the estimates.
The results presented here will not make much use of statistical measures of significance, null hypotheses, etc. It is my aim to inform readers about the results of the experiment without losing many along the way because of the lack of a statistical background. In any case, I would not expect to get statistically significant results given the amount of data collected. Instead I will make use of the well known method of Proof By Gesticulation; the only requirement for these results was that sufficient evidence be presented to allow my team to make intuitive use of the results.
Is There a Simple Relationship Between Story Estimates and Sum of Task Estimates?
We can get some simple summary information about the relationship between story and sum of task estimates by plotting them on a graph – see below. The points each represent a (story estimate, sum of task estimate) pair. The reason for the vertical clusters of points is the Estimation Poker method used for producing the story estimates. The line of best fit through the data points is also shown, together with its parameters.
It is clear that there is some correlation between the sum of task estimates and the story estimates. Somewhat gratifying is the fact that the line of best fit passes very close to the origin. These two things taken together mean that the story estimates predict the respective sum of task estimates with nothing more than a scale factor of about 1.3.
In addition, the R² value shown on the chart indicates that about 80% of the variation in the sum of task estimates is ‘explained’ (or predicted) by the story estimates.
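The line of best fit and its R² value can be reproduced with an ordinary least-squares fit. A minimal sketch follows; the data points are illustrative placeholders, not the experiment's actual estimates, so the printed numbers merely demonstrate the calculation:

```python
# Ordinary least-squares fit of sum-of-task estimates against story
# estimates, with the R^2 value ('fraction of variation explained').
# The sample data below is illustrative only, not the experiment's data.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R^2 compares residual variation to total variation in ys
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

story_points = [1, 2, 3, 5, 8, 5, 3]              # poker estimates
task_sums = [1.5, 2.5, 4.0, 6.0, 10.5, 7.0, 3.5]  # sums of task estimates

slope, intercept, r2 = fit_line(story_points, task_sums)
print(f"scale factor ~ {slope:.2f}, intercept ~ {intercept:.2f}, R^2 ~ {r2:.2f}")
```

A slope close to the observed scale factor with an intercept near zero is exactly the 'nothing more than a scale factor' situation described above.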
This is an extremely useful result.
Is The Relationship Between Story Estimates and the Sum of Task Estimates Stable?
By ‘stability’, I mean that the relationship between story estimates and sum of task estimates does not change significantly over time. This is important because business analysts, project managers and other people on the business side of software development frequently need to plan over time scales significantly larger than a single iteration (one week for my team) or release (a few weeks). If the relationship between stories and sum of task estimates does not hold for this significant period of time, then the business cannot use the story estimates for their longer term planning.
To get an idea of the stability of the process, think of the sequence of story estimates. These estimates are produced in small batches and the estimated stories accumulate until some are chosen to be developed in an iteration. At this point, the stories are broken down into tasks, each of which is estimated and we can now ‘measure the accuracy’ of the story estimates (subject to the known scale factor of 1.3). Because we have seen that the two methods of estimation produce values that are approximately proportional over the whole sample, we can divide the sum of task estimates by the story estimate and compare the value to the expected value of 1.3 to determine how well a particular story conforms to the expected relationship.
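The ratio calculation just described is straightforward; here is a small sketch, using made-up values rather than the experiment's data, and the 1.3 scale factor from the fit above:

```python
# For each story, divide the sum of task estimates by the story estimate
# and compare against the expected scale factor. Values are illustrative
# only, not the experiment's data.
EXPECTED_RATIO = 1.3  # scale factor from the line of best fit

story_points = [2, 5, 3, 8, 5, 1, 3]               # in order of estimation
task_sums = [2.5, 7.0, 4.0, 10.0, 6.5, 1.5, 4.0]

ratios = [t / s for s, t in zip(story_points, task_sums)]
for i, r in enumerate(ratios):
    print(f"story {i}: ratio {r:.2f} "
          f"(deviation from expected: {r - EXPECTED_RATIO:+.2f})")
```

Each ratio near 1.3 is a story that conformed to the expected relationship; large deviations are the points worth a closer look.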
Plotting the sequence of sum of task estimates to story estimate ratios in the order of production of the story estimates (see the chart below) gives us points on a chart which preserves the time ordering of the story estimates.
What the chart shows is that over the period of the experiment, the relationship between sum of task estimates and story estimates held – there was no significant drift away from the expected value at any time. This is despite the fact that some of the story estimates were produced early in the experiment, and their task breakdowns and estimates were done after a significant elapsed time.
We can also note that all but one of the points on the chart fall ‘close’ to the mean value. This traditionally indicates that the process is under ‘statistical control’[3], meaning that the deviations from the mean are due to random errors and are not due to some systemic change in the process that needs to be investigated. The one point that is a long way from the mean should be investigated to determine if there is a ‘special cause of error’ that can be avoided in the future.
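As the footnote admits, formal control limits were never computed for this chart. For readers who want to go further, here is a rough sketch of the conventional individuals-chart (XmR) limits, where sigma is estimated from the average moving range; the ratio values are invented for illustration, not taken from the experiment:

```python
# Sketch of individuals-chart ('XmR') control limits for a ratio
# sequence: sigma is estimated from the average moving range divided
# by the d2 constant 1.128, and limits are placed at mean +/- 3 sigma.
# The data is illustrative only, not the experiment's.

ratios = [1.25, 1.40, 1.33, 1.25, 1.30, 2.40, 1.35, 1.28, 1.32]

mean = sum(ratios) / len(ratios)
moving_ranges = [abs(b - a) for a, b in zip(ratios, ratios[1:])]
sigma_hat = (sum(moving_ranges) / len(moving_ranges)) / 1.128
lower, upper = mean - 3 * sigma_hat, mean + 3 * sigma_hat

# Points outside the limits suggest a 'special cause' worth investigating
outliers = [(i, r) for i, r in enumerate(ratios) if not lower <= r <= upper]
print(f"limits: [{lower:.2f}, {upper:.2f}], outliers: {outliers}")
```

The moving-range estimate of sigma is used rather than the plain standard deviation precisely because a single wild point would otherwise inflate the limits enough to hide itself.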
The results show that, for my team using the estimation methods described, story estimates correlate well with sum of task estimates and the relationship holds over a significant time. This suggests that our story estimates are as good as the task estimates for iteration and release planning purposes, and this relationship is stable enough to allow longer term planning.
As a result of this experiment, my team has started relying on the story estimates for planning of timeboxed releases. Previously, as described above, release planning was superficial and releases generally happened when sufficient scope had been accumulated.
Because of the greater confidence in the planning process, we have significantly increased the frequency of releases. Previous releases tended to happen every month to six weeks. They now happen every two weeks, with the occasional three week release.
Also, because we discuss the meaning of the stories when we attempt to produce story estimates, issues are brought out with the stories early, thus giving business analysts time to respond before development is required to start. This brings greater stability to our iterations, with many fewer stories needing to be reconsidered after iteration planning.
When the larger stories are under development, we tend to have more than one pair working on them in order to make sure they are finished within a single iteration. This is a change in attitude that favours completion and closed iterations. This leads to greater collective code ownership and better focus on the task at hand.
[1] What was termed ‘release planning’ was actually just the declaration by a business analyst that some themed, but individually and collectively ill-defined, set of features must be delivered before a given arbitrary date.
[2] Actually, there were a couple of stories for which one developer insisted that their estimate was good, despite it being much lower than that of the other developers. In these cases, that developer was charged with leading the story implementation when it was brought into development.
[3] I use this term very loosely here. To truly say that the process is under statistical control, we would have to determine an appropriate model for the distribution (which is not Gaussian, by the way) and place suitable lower and upper control limits on the chart. Nevertheless, we should not let the facts stand in the way of a good story…