> On 02/14/12 00:38, Alexander Motin wrote:
>> I don't see much point in committing them sequentially, as they are
>> quite orthogonal. I need to make one decision. I am going on a small
>> vacation next week. It will give the thoughts time to settle. Maybe I
>> will indeed just clean up the previous patch a bit and commit it when I
>> get back. I've spent too much time trying to make these things formal,
>> and so far the results are not bad, but also not as brilliant as I would
>> like. Maybe it is indeed time to step back and try some simpler solution.
> I've decided to stop those cache black magic practices and focus on things
> that really exist in this world -- SMT and CPU load. I've dropped most of
> the cache-related things from the patch and made the rest stricter and
> more predictable:
This looks great. I think there is value in considering the other
approach further, but I would like to do this part first. It would also be
nice to make priority a greater influence in the load balancing.
> This patch adds a check to skip the fast previous-CPU selection if its SMT
> neighbor is in use, not just when no SMT is present as in previous patches.
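The SMT-neighbor check described above can be sketched roughly as follows. This is a hedged illustration only: the struct and function names are invented for the example and are not the actual sched_ule.c code.

```c
#include <stdbool.h>

/*
 * Illustrative sketch (invented names, not sched_ule.c): with SMT, a
 * logical CPU that looks idle still shares execution resources with
 * its sibling, so the fast path of reusing the thread's previous CPU
 * is only taken when the SMT sibling is idle too.
 */
struct cpu_state {
	bool idle;		/* this logical CPU is idle */
	bool sibling_busy;	/* its SMT sibling is running a thread */
};

static bool
use_prev_cpu_fast_path(const struct cpu_state *prev)
{
	/* Skip the fast path when the SMT neighbor is in use. */
	return (prev->idle && !prev->sibling_busy);
}
```

Earlier patches only disabled the fast path on machines with no SMT at all; the point here is that the decision is now per-sibling rather than per-machine.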
> I took the affinity/preference algorithm from the first patch and improved
> it. It makes pickcpu() prefer the previous core or its neighbors in case of
> equal load. That is very simple to maintain, but should still give cache hits.
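A tie-break of this shape could look like the sketch below. The ranking function and its name are hypothetical, meant only to show the idea of preferring the previous core, then its cache neighbors, on equal load.

```c
/*
 * Illustrative sketch (invented names, not the real pickcpu()): among
 * CPUs with equal load, rank the previous core highest, then cores
 * sharing a cache with it, then everything else.  Keeping the rule
 * this simple still tends to produce cache hits.
 */
static int
affinity_rank(int cpu, int prev, const int cache_id[])
{
	if (cpu == prev)
		return (2);	/* exact previous core: best */
	if (cache_id[cpu] == cache_id[prev])
		return (1);	/* neighbor on the same cache: second best */
	return (0);		/* unrelated CPU: no preference */
}
```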
> I've changed the general algorithm of topology tree processing. First I
> look for an idle core on the same last-level cache as before, with affinity
> to the previous core or its neighbors on higher-level caches. The original
> code could put an additional thread on an already busy core while the next
> socket was completely idle. Now, if there is no idle core on this cache,
> all other CPUs are checked.
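The two-pass traversal described above can be condensed into a flat sketch like this. It is an assumption-laden illustration (invented names, flat arrays instead of the real topology tree), not the patch's code.

```c
#define NCPU 8

/*
 * Illustrative sketch: two-pass CPU selection.  Pass 1 only considers
 * CPUs behind the previous CPU's last-level cache and returns an idle
 * one if it exists; pass 2 widens the search to every CPU, so a
 * thread is no longer stacked on a busy core while another socket
 * sits completely idle.
 */
static int
pick_cpu(const int load[NCPU], const int llc_id[NCPU], int prev)
{
	int best = -1;

	/* Pass 1: idle CPU sharing the last-level cache with prev. */
	for (int c = 0; c < NCPU; c++)
		if (llc_id[c] == llc_id[prev] && load[c] == 0)
			return (c);

	/* Pass 2: nothing idle on this cache; take the least loaded
	 * CPU anywhere in the topology. */
	for (int c = 0; c < NCPU; c++)
		if (best < 0 || load[c] < load[best])
			best = c;
	return (best);
}
```

The corner case the paragraph mentions is visible here: with every core on the previous cache busy, pass 2 will find an idle core on the other "socket" instead of doubling up.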
> CPU group comparison is now done in two steps: first, as before, the
> summary load of all cores is compared; but now, if that is equal, the load
> of the least/most loaded cores is compared. That should make it possible to
> differentiate whether a load of 2 really means 1+1 or 2+0. In that case the
> group with 2+0 is taken as more loaded than the one with 1+1, making the
> group choice more grounded and predictable. I've added randomization for
> the case where all of the above factors are equal.
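The two-step comparison can be sketched as below, using two cores per group for brevity. Names and structure are invented for illustration; the real code walks the topology's group structures.

```c
/*
 * Illustrative sketch of the two-step group comparison: groups are
 * ranked by total load first; on a tie, the most loaded core breaks
 * it, so a 2+0 group counts as more loaded than a 1+1 group even
 * though both total 2.
 */
struct cpu_group {
	int core_load[2];	/* load of each core in the group */
};

/* Return negative if a is less loaded, positive if more, 0 if equal. */
static int
group_compare(const struct cpu_group *a, const struct cpu_group *b)
{
	int total_a = a->core_load[0] + a->core_load[1];
	int total_b = b->core_load[0] + b->core_load[1];
	int max_a, max_b;

	if (total_a != total_b)
		return (total_a - total_b);

	/* Equal totals: break the tie on the most loaded core. */
	max_a = a->core_load[0] > a->core_load[1] ?
	    a->core_load[0] : a->core_load[1];
	max_b = b->core_load[0] > b->core_load[1] ?
	    b->core_load[0] : b->core_load[1];
	return (max_a - max_b);
}
```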
This all sounds good. I will need to review it in detail, but the approach
seems straightforward and fixes undesirable corner cases.
> As before, I've tested this on a Core i7-870 with 4 physical and 8 logical
> cores and an Atom D525 with 2 physical and 4 logical cores. On the Core i7
> I got a speedup of 10-15% in super-smack MySQL and PostgreSQL indexed
> selects for 2-8 threads and no penalty in other cases. pbzip2 shows up to a
> 13% performance increase for 2-5 threads and no penalty in other cases.
Can you also test buildworld or buildkernel with a -j value twice the
number of cores? This is an interesting case because it gets little
benefit from affinity and really wants the best balancing possible.
It's also the first thing people will complain about if it slows down.
> Tests on the Atom show mostly the same performance as before in the
> database benchmarks: faster for 1 thread, slower for 2-3, and about the
> same in other cases. Single-stream network performance improved the same as
> with the first patch. That CPU is quite difficult to handle: with its mix
> of effective SMT and lack of L3 cache, different scheduling approaches give
> different results in different situations.
> Specific performance numbers can be found here:
> http://people.freebsd.org/~mav/bench.ods
> Every point there includes at least 5 samples, and except for the pbzip2
> test, which is quite unstable with the previous sources as well, all are
> statistically valid.
> Florian is now running an alternative set of benchmarks on dual-socket
> hardware without SMT.