Charles Prudhomme Guest
Posted: Sun Nov 14, 2004 7:46 am
Post subject: Linux to support Massive Multi-Threading (or dies)!
Overhaul Linux Kernel to support Massive Multi-Threading (or Linux dies)!
As microprocessor architectures switch from single-core to dual-core and
multi-core designs over the next couple of years, operating systems such
as Linux must follow. The announcements from hardware makers such as
Intel, AMD and Sun about the multi-core microprocessors they plan to
release within the next two years are very revealing.
The Linux kernel must undergo a deep overhaul if it does not want to fall
into obsolescence. Linux must be updated to support massive
multi-threading microprocessor technology. There is no way around it.
Microsoft will surely update Windows, but the Linux community seems slow
to react. This technology will first appear in servers, where Linux has a
bit of an edge, so Linux will soon feel the heat on this.
I have included below an article from the Internet magazine The Inquirer
that talks extensively about Sun's future multi-core design. Read it
carefully. It is rare to find this level of detail. This should give you
an idea of what is coming in the near future.
http://www.theinquirer.net/Default.aspx?article=19423
=========================================================
Sun's Niagara falls neatly into multithreaded place
Architecture will make people jump
By Charlie Demerjian: Tuesday 02 November 2004, 10:54
SUN IS GOING TO BE the first out of the gate with a new architecture that
everyone will be jumping aboard soon, the massively multithreaded chip.
Niagara from Sun Microsystems is a family of CPUs that blurs the lines
between chip, CPU and thread, something that could have a profound effect
on
how software is written and used. It could succeed spectacularly or it
could
flop, but seeing as how Intel is doing much the same thing with Tukwila,
I
would bet on this succeeding.
I had a chat with Sun Fellow and Vice President Marc Tremblay about the
upcoming chips, what they are, what they do, and where they will fit in.
There are a lot of changes embodied in this chip, and Sun refers to what
it
represents as disruptive threads, in the sense of what it will do to the
marketplace, not to your software. Disruptive is a good thing, especially
when you are talking about the price/performance curve.
Let's start out with the first chip in the Niagara family, called,
strangely, Niagara. On a macro level, it will have eight cores, each core
capable of
running 4 threads in parallel, for 32 concurrently running threads. Each
thread can be a process, or you can have one process running 32 threads,
it
is up to you. Most likely, the loads will fall in between those, for
example
running all the threads from a process on a single core. The currency of
this chip is the thread, not the MIP.
You can get tricky, though. Niagara has the ability to partition the
different cores on a chip in the same way you can partition a Sun 15K's
processors. If you want two cores dedicated to your web server, four to
the JVM, and two others to the database, no problem, you can do it in the
same way you could with multiple sockets. It also has some inherent fault
isolation, but you can't mirror cores yet.
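The two/four/two split above can be sketched as a simple mapping from workloads to hardware thread ids. This is only an illustration: the function name, the workload labels and the numbering of hardware threads (core 0 owns ids 0 to 3, core 1 owns 4 to 7, and so on) are assumptions, and the article does not describe how Sun's own partitioning is configured.

```python
def partition_hw_threads(plan, cores=8, threads_per_core=4):
    """Give each workload whole cores; return its set of hardware thread ids.

    Assumes hardware threads are numbered core by core, which is an
    illustrative convention and not something the article specifies.
    """
    if sum(plan.values()) > cores:
        raise ValueError("plan asks for more cores than the chip has")
    assignment = {}
    next_core = 0
    for name, n_cores in plan.items():
        ids = set()
        for core in range(next_core, next_core + n_cores):
            first = core * threads_per_core
            ids.update(range(first, first + threads_per_core))
        assignment[name] = ids
        next_core += n_cores
    return assignment


# Hypothetical workload labels matching the example in the text.
plan = {"web": 2, "jvm": 4, "db": 2}
layout = partition_hw_threads(plan)
```

On Linux, a set such as `layout["web"]` could be handed to Python's `os.sched_setaffinity(pid, cpus)` to pin a process to those CPUs. That is ordinary CPU affinity, which is a looser mechanism than the hardware partitioning the article describes.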
Each of these cores has a 24K three cycle L1 cache, split into 16KB
Instruction and 8KB Data caches, each 4 way set associative. The I-Cache
has a 32 byte line size, the D-Cache a 16 byte line size. The L2 is a
little odder than the average cache. It is four way banked with 12 way set
associativity, and data is interleaved across the banks in 64 byte lines.
The size of the L2 is 3MB, or about 400K per core, but since this is
shared, simple maths does not quite tell the whole story. There was a
tradeoff between the unspecified core size and the cache, and Sun put the
onus on cores. One really interesting thing to think about was when
Tremblay said that if you blow your die size budget by two millimetres on
a chip like a Pentium 4, no big deal, one per cent over you can live with.
If you blow it like that on a Niagara core, you multiply that by eight,
and suddenly it is a problem.
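To put numbers on the cache description, here is a small sketch that derives the number of sets in each cache from the sizes, associativities and line sizes quoted above. The helper name `cache_sets` is made up for illustration, and the even split of L2 sets across the four banks is an assumption, since the article does not say how the banks are organised internally.

```python
KB = 1024
MB = 1024 * KB


def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size is not a whole number of sets")
    return sets


l1i_sets = cache_sets(16 * KB, ways=4, line_bytes=32)   # 128 sets
l1d_sets = cache_sets(8 * KB, ways=4, line_bytes=16)    # 128 sets
l2_sets = cache_sets(3 * MB, ways=12, line_bytes=64)    # 4096 sets

# Assumption: the four L2 banks each hold an equal share of the sets.
l2_sets_per_bank = l2_sets // 4                         # 1024 sets

# The "about 400K per core" figure: 3MB divided by eight cores.
l2_share_kb = 3 * MB // 8 // KB                         # 384 KB
```

The 12 way associativity is what makes a 3MB size come out to a power-of-two number of sets, which may be part of why the L2 looks "odder than the average cache".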
There are two reasons why the cache may be of less importance on Niagara
than in traditional architectures: the threads, and the fact that the core
features in-order execution. If you have a single core capable of
executing a single thread, and it has a cache miss, you wait and wait and
wait, sometimes hundreds of cycles. With out-of-order execution, a miss
can also have some effects on the whole instruction stream.
Because it is an in-order core, the messiness of out-of-order execution
goes away, but so do a lot of the benefits, mostly the ability to hide
little memory operations like cache fetches and the ability to do other
things while that data is being grabbed. You also lose the ability to
optimize how the program is executed: the instructions are executed in the
order they are sent, not in the order that would be best.
Here is where threading helps a lot. If you have a cache miss and are
facing
a long wait for something to come back from memory, you just switch to
another thread. That thread can execute its instruction stream until it
hits
a pothole, then it hands execution off to another thread. Intel has the
ability to do this between two threads on the Pentium 4 with
hyperthreading,
and Niagara has four threads running in parallel per core. To make up
numbers, if a cache miss takes 100 cycles, and on average each thread can
execute for 25 cycles before it needs to hit main memory, the other three
threads cover 75 of those 100 cycles, so in theory you should hide most of
the memory latency.
In the real world it won't be that clean. You will still take a hit from
memory access time, but with four threads, that is greatly minimised. If
Sun ran the
numbers right, a small cache should be enough to get them by, even if it
is
not anywhere near the size of a modern single threaded CPU's cache. It is
a
different idea, and comparing it directly to the structures of current
chips
is not as valid as you might think.
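The made-up numbers above (a 100 cycle miss, 25 cycles of work per thread between misses) can be run through a very simple model. This is a sketch under strong assumptions, not a description of how Niagara actually schedules threads: every thread runs for exactly the same number of cycles, every miss costs the same, and switching threads is free. The function names are invented for the example.

```python
def core_utilization(threads, run_cycles, miss_cycles):
    """Fraction of cycles the core does useful work in the idealised model.

    Each thread alternates run_cycles of work with miss_cycles of waiting,
    so N threads can supply at most N * run_cycles of work per period.
    """
    period = run_cycles + miss_cycles
    return min(1.0, threads * run_cycles / period)


def threads_to_hide(run_cycles, miss_cycles):
    """Smallest thread count that keeps the core busy all the time."""
    period = run_cycles + miss_cycles
    return -(-period // run_cycles)  # integer ceiling division


one_thread = core_utilization(1, 25, 100)     # 0.2
four_threads = core_utilization(4, 25, 100)   # 0.8
needed = threads_to_hide(25, 100)             # 5
```

Under these assumptions a single thread keeps the core busy 20% of the time and four threads raise that to 80%. A fifth thread would be needed to cover the last 25 cycles of each miss.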
The cache structure has some interesting effects on the way processes
interact, and lets you do things that were impractical in modern multi-CPU
systems. Passing data between threads on a core is very fast - it is just
an L1 read. Passing data to another process on another core is about 20
times faster than passing it between CPUs on an SMP system. Since the L2
is shared, it just dumps the data to L2, and the next core can read it.
This may not seem like a big deal, it is just faster than current
systems,
but other than speed, it more or less does the same thing. Inter-thread
communication is one of the areas where I think Niagara will have a steep
learning curve for programmers to get optimal performance out of the
silicon. When a person programs for a current SMP system, there are huge
efforts made to localise memory access, and penalties to go to remote CPUs
can be huge. Even in the best of systems, this penalty can be substantial,
going from noticeable to effectively bringing things to a grinding halt.
Programmers have for decades worked around these issues, and the thought
process for programming massive NUMA systems is geared around working
with these limits. The tools are also set to work with these ideas. In
comes Niagara and potentially throws this out the window, with virtually
no penalty for passing data between four threads, and minor delays between
cores. It will take some catching up from the software folk.
The massively multithreaded architecture also does things to the cores
themselves. With a lessened effective penalty for memory misses you can
make
your branch prediction less aggressive, which means easier development
and a
smaller die. The whole architecture of the Niagara family is based around
threads, not the other way around. The hardware is built to facilitate
massive multi-threading, it is not added on as a feature.
Rather than looking at this as a chip that runs at a given clock, imagine
it
is a thread engine with optimal groupings. Four threads work well
together,
almost as if they were one. You can have groups of these groups
interacting
with a slight delay, but nothing huge, and they can run in parallel with
no
loss. Instead of the old bumper sticker that said "visualise whirled
peas",
you could sum Niagara up with the bumper sticker "visualise thread
groups".
This whole sea change is no more than words if the OS running on the chip
does not effectively use the shortcuts provided by the chip. If you can
pass
things between threads with no penalty, it does you no good if the OS
still
schedules things to happen while the threads should be waiting. A
disconnect
between the hardware and software would be a very bad thing here.
Luckily for Sun, the two main OSes that will run on the chip, Solaris and
Linux, are either owned by them, or completely open to them. In the case
of
Solaris, Marc Tremblay talked about the OS getting feedback directly from
the hardware. If a core is overloaded, it could potentially signal the OS
to
move threads to another core, and if it is underloaded, it could request
more work. There are some extremely interesting optimisations that can
come
from this, and I suspect it will be a subject of academic papers for
years
to come.
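As an illustration of the feedback idea, here is a toy rebalancer that moves runnable threads from cores holding more than four of them to cores holding fewer. Nothing here reflects Solaris's actual scheduler or the signalling mechanism on the chip, neither of which the article describes. The function name, the per-core counts and the one-thread-at-a-time policy are all assumptions made for the sketch.

```python
def rebalance(runnable, threads_per_core=4):
    """Move threads from overloaded cores to underloaded ones.

    runnable: runnable thread count for each core (must be non-empty).
    Returns the new counts and the list of (from_core, to_core) moves.
    """
    counts = list(runnable)
    moves = []
    while True:
        busiest = max(range(len(counts)), key=counts.__getitem__)
        idlest = min(range(len(counts)), key=counts.__getitem__)
        overloaded = counts[busiest] > threads_per_core
        has_room = counts[idlest] < threads_per_core
        if not (overloaded and has_room):
            break
        counts[busiest] -= 1   # the "overloaded" core gives up a thread
        counts[idlest] += 1    # the "underloaded" core takes it
        moves.append((busiest, idlest))
    return counts, moves


# Hypothetical snapshot: core 0 is oversubscribed, core 1 is nearly idle.
new_counts, moves = rebalance([7, 1, 4, 4, 4, 4, 4, 4])
```

In this snapshot three threads migrate from core 0 to core 1 and every core ends up with exactly four runnable threads. A real scheduler would also have to weigh the cost of moving a thread away from the L1 it has warmed up.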
The biggest question in my mind about Niagara had to do with feeding the
beast. This is a new chip, and a new paradigm, but it looks to use the
same old I/O mechanisms. Niagara will have several DDR2 controllers on
board, and Sun would not get into more detail about what "several" means.
There should be a lot of bandwidth available to the cores, but is it
enough?
Sun has always been known for having a lot of bandwidth in its systems.
You
absolutely need to if you are going to make machines like the 10K and
15K.
Anything more than a few CPUs in a box simply needs the fattest pipe you
can throw at them to function. Sun is also known for CPUs that don't fight
for the top SPEC numbers individually. Let's just say the CPU power to
bandwidth ratio leans heavily toward the bandwidth side.
When you move to Niagara type chips, Sun is in a unique position. Each
Niagara core is said to be at least as powerful as a current UltraSPARC 3
chip. It stands to reason that if you have eight of these in a socket,
you
need at least eight times the bandwidth of a single core. This means lots
of
memory controllers and lots of pins. Niagara certainly has that, but
since
current SPARCs don't need all the bandwidth they have, Sun was able to
get
away with an unspecified multiple of the current per chip bandwidth for
Niagara without compromising performance. Think less than ten times, but
that is still a lot.
In the future, other members of the Niagara family will use different
memory technologies. One that was mentioned was FB-DIMMs. FB-DIMMs allow
for huge memory capacity with low pin counts, but you take a latency hit.
Luckily, Niagara type architectures can mask that latency very well, so
the two technologies are an ideal match for each other.
This brings up another interesting tradeoff. In Niagara type designs, if
you have 6 FB-DIMM channels, which is not out of the question for a system
like this, that gives you 48 DIMMs to plug into the system, 96GB if you
use 2GB DIMMs. To make up more numbers, if Niagara costs $1000 and the
memory only costs $500 a DIMM, you have $24,000 in memory. This puts the
cost of the chip in the same price category as rounding errors. Basically,
if Sun doubles the cost, will anyone notice? In some ways, this is a very
enviable position for a chip maker to be in.
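The arithmetic behind those made-up figures, worked through as a short sketch. The eight DIMMs per channel is inferred from the article's 48 DIMMs on 6 channels and is not stated directly. The prices are the article's own hypothetical numbers, and the variable names are mine.

```python
channels = 6
dimms_per_channel = 8    # inferred: 48 DIMMs / 6 channels
gb_per_dimm = 2
dimm_price = 500         # dollars, the article's made-up figure
chip_price = 1000        # dollars, the article's made-up figure

dimms = channels * dimms_per_channel       # 48
capacity_gb = dimms * gb_per_dimm          # 96
memory_cost = dimms * dimm_price           # 24000


def chip_share(chip, memory):
    """Chip price as a fraction of the combined chip + memory cost."""
    return chip / (chip + memory)


share_now = chip_share(chip_price, memory_cost)          # 0.04
share_doubled = chip_share(2 * chip_price, memory_cost)  # about 0.077
```

With these numbers the chip is 4% of the combined chip and memory bill, and doubling its price still leaves it under 8%.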
In addition to the memory controllers, Sun will be adding a few other
features to the chip. On the first iteration, there will be an Ethernet
controller. Future versions are said to have 10Gb Ethernet controllers and
on board encryption capabilities. Who would have thought we would have
multi-core system-on-a-chip machines before the single-core ones were
available, much less starting the trend in the server space?
If you wonder about a system with multiple Niagaras in it, don't. Niagara
is a single chip only family. There is no SMP built in, nor will there
ever be. For that you have to wait another two years or so till the Rock
family debuts. At that point, the Rock chips will take over from the
current UltraSPARC line, and allow multiple CPUs in a box.
So, with a single CPU in a box that allows for massive numbers of
concurrent
threads, and large memory capacities, what markets is Niagara aimed at?
Marc
Tremblay repeatedly mentioned that Niagara was 'network facing' not 'data
facing', which will be the domain of Rock. This means things that you can
hit directly with a web browser which individually do not require huge
number crunching ability, but are present in great quantity. Searches,
web
page serving and streaming media were all mentioned as good candidates.
If
you need to service thousands of the same task a second, Niagara should
shine. If you want to crunch huge databases, wait for Rock.
Throughout this, you may have gotten the impression that the Niagara
chips
are not massive FP crunching cores, they are meant to pass large amounts
of
moderate tasks through the system at high speed. When Sun was modelling
the
chip, it came to an interesting conclusion, that performance is not all
that
clock sensitive.
The chip was modelled at 1, 1.5 and 2GHz, and the end result showed that
the clock rate had a minimal impact on performance. With that in mind,
Niagara was optimised for throughput, not clock speed. The end result
will
be under 2GHz, but Sun would not give an exact figure.
The last thing is the chip makes far more efficient use of the resources
available. If you open up Task Manager in Windows, you will see the CPU
use
typically hovers around zero, and then spikes to near 100%, and drops
back
down. This behaviour can be described as peaky, or if you want to be less
charitable, very inefficient.
Niagara takes a different approach, one that aims to keep all the
resources
of the chip as busy as they can be for as long as you have data to do it
with. This means in Niagara chips, the peak load will be very close to
the
average load, and the peak power use will be close to the TDP. If anything
is unused, the chip will throw a thread at it. Look for the first
generation of Niagaras to consume around 60 watts, give or take a bit.
This
is considerably less than most high performance PC CPUs on the market
now.
So, what is Sun giving us with Niagara? It is the polar opposite of the
Pentium 4. Instead of speed at all costs, it is threads at all costs, and
clock speed may happen. You don't get the same problems where the only way
to pump more data through is to crank up the clock, something which is
looking increasingly hard to do. Niagara instead goes wider and slower per
thread, but makes up for it in quantity.
If you are thinking of this in terms of a current Xeon or Opteron, the
comparison is simply invalid. Any benchmark run on the single threaded
chips won't have enough threads to make Niagara flex its muscles, and any
benchmark made for Niagara would probably crush a Xeon with the overhead.
Niagara signals the first volley in a new way of thinking about servers,
or at least a class of servers. The whole idea of Disruptive Threads, as
Marc Tremblay calls it, is quite real, and it will take a while for many
people to understand, much less use. Threads, threads and more threads,
that is
what this chip lives for. µ