Thursday, August 22, 2013

What's the fuss about parallel programming?

What's the fuss about parallel programming?

A young friend of mine, now a 2nd year computer engineering student, asked me:

What is parallel programming? Why is parallel programming regarded as important in the future?

I don't have any idea about parallel programming and try to learn  by Googling. Yet,  it is difficult to understand. Why mutable data has been considered as inefficient in programming recently? How it creates problem and in what way functional programming avoids this mess? Will functional programming increase the performance of multicore systems?
Also, to which OS books should I refer? As I am starting my study on my own, and I want to get good at OS and basically able to understand the difference between linux and windows, which book should I follow? Earlier, you said that you are interested in operating systems and also best at it. Please, just suggest me some books which would able to justify the differences between linux and windows technically.
In which language is OS programming done?

Image of multiple processors taken somewhat out of context with a thank-you to Tom's Hardware, a web site where people try to keep up with this stuff

This is my reply to that e-mail...
You ask "What is parallel programming?"   That's a very similar topic to another topic you recently asked about:   concurrent programming.   Both concern how to write programs that do more than one thing at once so that overall performance is improved.   e.g. if the time to run the program depends on "n" (perhaps n is the amount of input data to be processed), then what a parallel program wants to do is apply more than one processor to the problem so the job can be completed sooner than one processor would be able to.

For example, if the job is to sort n items, you might divide the list up into a separate list per processor so each processor needs only sort a shorter list of items.   Of course, before the job is finished, the multiple shorter lists need to be merged together to make the final result.

Distributing the items across the processors is work, merging the lists back together again is work.   Whether the overhead of those extra steps is worth it or not depends on things like how much memory each of the processors has good access to.   If the items divided make a small enough list to fit in the RAM of each processor, then things are probably going to go very fast.    But if the sub-problems are still big enough that you need to spill things out to intermediate work files, and if the extra processors don't have good access to the disk space used to store the spill files, then the dividing up of things might turn out to be a net loss in performance.

Moore's Law

You also ask "Why is parallel programming regarded as important for the future?".   Well, if you go way back to the early days of integrated circuits, Gordon Moore predicted in 1965 that the number of transistors on an integrated circuit would double every 2 years.   He thought that observation would hold true for 10 more years or so.   We actually have gotten through a lot more doublings than that and aren't done yet (though folks are starting to fret that they can see ultimate limits ahead - so it won't go on forever).
His prediction was more and more transistors and it isn't entirely obvious that that translates to mean faster computers.   But, in fact, what folks have done with those transistors is figure out ways to apply them to make faster computers.    If you look back to the earliest IBM PC's, the processor chip didn't even do floating point arithmetic.   If you needed faster floating point, you'd have to add a math co-processor onto the motherboard (there was a socket for that additional chip).

I confess to liking that idea of having separate useful pieces that you can custom integrate to create a tailored computer with exactly the strengths that you want.   Alas, the expense of having multiple chips connected together at the circuit board level argues powerfully against that piece-part model of the chip business.   The trend instead has been to absorb more and more functionality into a single chip - whole systems on a chip - just to be rid of the sockets and pins and propagation delays of getting off-chip and on-chip and back again.

So where did all the transistors get spent to speed things up?   Some of it is obvious.   Computers today have amounts of memory that were unthinkable just a few years ago.   Along with more memory, you certainly have more cache and more layers of cache to speed up access to that memory.   There's much to be learned in contemplating why there are more layers of cache instead of just bigger cache.   But that's a more hardware-centric topic than I'm comfortable explaining here as a software guy.

Besides more memory and more registers, the paths and registers have gotten wider.   Where there were 8 bits in the beginning, there are often 64 bits today.    You can try cranking that up in the future to 128 bits, but at some point you get into diminishing returns.   Slinging around 128-bit pointers in a program that could be happy dealing with only 32-bit pointers may not be optimal.    Maybe the problem is just that we need a little more time for programs to comfortably exploit gigantic memory spaces.   My PC today only has 2GB of real RAM.    32 bits is more than enough to directly address that much memory.  2^32 in fact is enough to directly address 4GB of RAM.   So the line of needing more than 32 bits isn't super far away. But 64 bits is enough to directly address 16 exabytes of RAM.   I can't even afford a Terabyte of RAM yet, so needing more than 64-bits is surely a long way away. (1 Terabyte=1024 Gigabytes. 1  Petabyte=1024 Terabytes.   And 1 Exabyte=1024 Petabytes).

Those are really big numbers.   Bigger than even Doc Brown is likely ready to contemplate:

But it isn't always obvious how best to spend the many transistors that the progress predicted by Moore has provided to us.   I see a certain amount of oscillation in design approaches as things get wide and then get back to serial again.   Look at ATA vs. SATA, for example.

One way to spend transistors is to make more complex circuitry to make the time for each instruction be shorter - do faster multiplication or division, but there's only so far that you can push things in that direction. Current consensus seems to be that making faster and faster processors is getting to be very difficult.   As clock speeds go up, the chip's thirst for electrical power goes up too and with that the amount of heat that has to be taken away from the chip to avoid reducing it to a puddle or a puff of smoke.   So, the industry's current direction is toward spending the transistors on having more processors with moderate speed per processor.   The aggregate instruction rate of such an array of processors multiplies out to nice high numbers of instructions per second, but the challenge is how to effectively apply all those processors to solve a problem faster than an older uniprocessor computer would be able to. Hence the anticipated growing importance of parallel computing in the future.

I think so far I've answered the questions in your subject line.   I hope you have the patience for me to try answering the questions in the body of your mail too.

A Day at the Races

I see your next question is "Why the fuss about mutable data?"   Well, as I understand it, the concern is that if your data is mutable, you need to worry about inter-processor synchronization and locking so that when a processor updates a stored value, that it doesn't interfere with some other processor.
The processing of read-only (immutable) data doesn't have to worry about locking and synchronization.  But consider something as simple as A=A+1, where A is a mutable value.    Underneath it all, your processor needs to figure out where the value of A is stored, fetch the value into some arithmetic register, add 1 to the value and store the value back into the location for A.   If A is accessible only to your one processor, there's little to sweat about, but if A is accessible to multiple processors there's a potential for a race.   What if both processors have fetched the value of A and both have incremented their copy.    Only one of them has the right answer.   If they both store their new values for A back to the shared location, the final result is one less than it ought to be.

One solution is to have specialized hardware that makes the A=A+1 operation be atomic, indivisible, so there's no chance of one processor seeing the old value when it should be using a new value.

There's the challenge of figuring out exactly which atomic instructions are most useful additions to your instruction set design.   IBM mainframes had an interesting, though complicated instruction called compare and swap.   As I remember it, the instruction took 2 registers and a memory location.   If the first register matched the value in the memory location, then the 2nd register would be stored into the memory location.   If they didn't match, then the memory location would be loaded into the 1st register. And the whole operation was indivisible.    So a processor could do it without having to worry about whether some other processor was operating on the same memory location.   So, you could use compare and swap to do our A=A+1 operation safely.   You fetch the value of A into a register. You copy that register to a 2nd register.   Add 1 to the 2nd register.    Now do a compare-and swap to store the result back to memory.   If the compare and swap sets the condition code that says the 1st register didn't match, then sorry, but you have to repeat your computation.   Copy the newer value from the first register to the 2nd register. Add 1 to the (new) value to get a newer value and try the compare and swap again.   Of course, if there are many processors in hot contention for the value of A, then you might have to spin for a while in that loop trying to compute the right value and get it back before it becomes stale.

The compare-and-swap instruction can be used for more than A=A+1 kinds of computations.    For instance consider a linked list of items, perhaps the run-able thread list in your operating system kernel.   You want to be able to remove an item from that list.   That involves fetching the link to the next item, fetching the link to the item after that and then storing the link to the next next item into the location where the link to the item you are removing from the list came from.

    A  ----> B ----> C becomes A ----> C

As with the A=A+1 case, there's the potential for a race if there are multiple processors that are contending to pick B off the list.  compare-and-swap can at least make it safe from races, but again, if there is hot contention among many processors, there can be much wasted spinning before a processor succeeds in grabbing B off the list.

So, if you have careful control at the machine instruction level, the problem is practically solved.   But that sort of implies that you drop down into assembler language from time to time or you have a compiler that generates incredibly clever object code that knows where to use these specialized multi-processing instructions.   What if you are using a garbage-collected language like Java or Python?   Maybe your problem is worse than the value of A that you used in your computation became stale between your fetch and your store back to memory.   Maybe the location of A has changed entirely and your store operation is smashing something else entirely different than the variable A.   Big trouble ahead...   In fact, if you think in terms of Python, maybe by the time you are trying to store the new value, A isn't even an integer any more. "Gee, it was an integer value when I fetched it.   Who the heck changed it to be a floating point number in the meanwhile?".   Could be subtler: Python will happily and silently promote a int to a long if the value gets too big to fit into an int, so you need to be very careful that the value you fetched still makes sense before you store the result back to memory.

The article I pointed you to the other day "Downfall of Imperative Programming” asserts that "Imperative programs will always be vulnerable to race conditions because they have mutable variables".   So functional programming languages, by avoiding mutable variables, dodge a major bullet in the multiprocessing world. The thing that I don't know is how to be sufficiently productive in functional programming languages for Haskell to be worth the trouble to learn.   The Downfall article predicts that the race conditions are an insoluble problem for imperative programming language implementations.  I'll happily accept that there's trouble ahead to watch out for, but I do have a bit of difficulty accepting that the races absolutely can't be resolved.

Python's Global Interpreter Lock

Python worries about the possibility of races among threads in interpreting the instructions of Python code. They have a "Global Interpreter Lock" (GIL) to assure that one interpreter thread won't change a value in use by another interpreter thread. Folks worry that this coarse level of locking will keep Python programs from being able to scale up with increasing numbers of processors.
I've seen some clever dodges of the GIL in Python programs, mainly by spreading the program across separate address spaces (multiple Python interpreters, each with their own GIL) and limiting interprocess interaction to some carefully controlled set of places in the code with appropriate locking protections.  On the one hand, this doesn't give transparent scaling up from a uniprocessor to M processors all running in parallel, but on the other hand, it does get the job done.

My (weak) excuse for not having more first hand experience with this...

My home PC doesn't bring multiprocessors to the party.   Some day I hope to replace it with an i5-ish based computer with 64-bit addressing and >4GB of memory.   As a retiree with a rather modest pension, that's a discretionary expense that I've been postponing into the future.  Maybe in the meanwhile my target will shift to something with way more processors than an i5.   What I have in mind is something with enough oomph to be able to run Linux and Windows both in virtual machines (Based on Xen, VMWare, something else?  I don't know...). Heck, Microsoft isn't even making it easy to buy such a configuration without paying for a Windows license twice (once bundled into the PC's base price and then again for an installable copy that can be installed into a VM).  I'm assuming that a re-install CD that wants to reload Windows onto a bare PC isn't going to be able to install into a VM environment.   I'm expecting that multi-processor race conditions and their associated problems will come along naturally to my world once I have a rich enough configuration and that encountering those problems on more than just paper will motivate me into doing something about them.
Maybe I'm just old-fashioned in thinking that what I need is a richer computing environment here at home.   Maybe the right thing to do is to venture out into things like Amazon's Cloud Computing service and see what kind of trouble I can get into using other people's multi-processors via the Internet.  One of my worries about that, is maybe the underlying MP nature of their cloud services is too deeply wrapped for me to really see the problems I'd be up against from MP.   And, "look, dear, here's the marvelous new computer I just bought" is a much easier conversation to anticipate having with my wife then "Just let me pay this bill for cloud services.   It isn't so much money and I did really learn from having tried their services."

Comparative Operating Systems

You ask me to recommend an OS book to better understand Windows vs. Linux.  I don't know which book is the right choice.    Certainly an Amazon or Google search will turn up a large number of candidate titles. Perhaps your school's library has some of those titles so you can look them over, or perhaps they can arrange inter-library loans for you to be able to look over some of the candidate titles.   "Which of these is best" is always a tricky question because the answer depends so much on your particular criteria for "best"
So let me turn this around and ask you for a summary of your findings from digging into the too-long list of candidate titles and your recommendation.   You might want to ask your question of your school's professor for the OS classes too.   Maybe he's got a more formed opinion on this topic than I have.

Linux Weekly News

Meanwhile, I stand by my suggestion that you should make an effort to keep up with  (free of charge at the price of having to lag back a week from the most current articles) to see what is going on in the Linux world.  Don't feel obligated to have the newest and most experimental kernel on your home PC,  But if you spend some time watching the evolution and planning of kernels, you'll have a better idea of Linux's strengths and weaknesses and what "they" are doing about the weaknesses.  Unlike Windows, if you are sufficiently motivated to want Linux to be different then it is today, you can make that happen.

Kernel programming languages?

What programming languages show up in OS programming?  Well, at this time, I expect the correct answer to that is C.  Other languages (e.g.Java and Python) do show up in supporting roles, but generally don't make it into kernel code.   Even C++ tends to need too demanding an environment to be a good candidate for kernel code.   Maybe as time goes on the kernel will sprout suitable layers of capability to make higher level languages more attractive for implementing functionality within the kernel but right now if someone tells you a kernel is written in C++, ask them more questions to confirm that.   It wasn't all that long ago that the likely choice for programming an OS kernel was surely assembler language.  Unix introduced the C language and the then radical idea of using a higher level language in the kernel and even having kernel code that is somewhat portable across computing system architectures. (To calm the historians in the audience, I'll concede here that I may be under-crediting the Multics operating system, portions of which were written in PL/I. And the Multics site gives credit to Burroughs for having done a kernel in Algol, but that's way before even my time).
Stack overflow article on the languages of the Android OS:

Stack overflow article on the languages of MacOS, Windows and Linux:

Not every answer is to be trusted to be correct on stackoverflow....

One sub-link of that article that I followed and that does look interesting and credible: article on what's new in the Linux 3.11 kernel expected to become available in September 2013...
This is a particularly interesting link from one of the many comments on that article about 3.11:

In Closing...

You quote me as saying that I'm best at operating systems.   I tried rummaging in old mail to you to put that statement in context, but didn't succeed in tracking down what I said.   I will concede that I'm especially interested in operating systems, and given a list of computer science topics, I'm probably more interested in operating system then in most of the others, but claiming I'm best in operating systems sounds like it surely needs some context.

I confess that except for command line pipelines, I've never actually written a multi-threaded program of my own. So don't assume more expertise here than I actually have.