Why is the EEF getting it so wrong about ability grouping?

For a while now, whenever the topic of setting and streaming has come up, people have referred me to the EEF toolkit, and particularly graphs like this one:

It shows that “ability grouping” has a negative effect on student achievement. Teachers in England are likely to take that term to mean setting (i.e. grouping by previous test scores in a particular subject) or streaming (grouping by a measure of general ability or a combination of test scores across subjects). For other topics I’m interested in, the effect sizes the EEF derives from meta-analyses are similar to those John Hattie reports in his book Visible Learning. On ability grouping, though, the two disagree: Hattie found a positive effect size of 0.12, while the EEF found a negative one, -0.09. This puzzled me when I first saw it.

