Motivation for hand-optimized Assembly code:
There's a popular saying that "in 90% of cases, a modern compiler writes faster code than a typical Assembly programmer would". But anyone that has actually tested this theory knows how wrong this statement is! Hand-written Assembly code is ALWAYS faster and/or smaller than the equivalent compiled code, as long as the programmer understands the many intricate details of the CPU they are developing for. eg: I wrote both an optimized C function and an optimized Assembly function (using NEON SIMD instructions) to shrink an image by 4, and my Assembly code was 30x faster than my best C code! Even the best C compilers are still terrible at using SIMD acceleration, which is a feature that is available on most modern CPUs and can allow code to run between 4 to 50 times faster, and yet is rarely used properly!
ARM's RVDS compiler generates code that is upto 2x faster than any other C compiler for the iPhone (or Android or any ARM device), but on most ARM devices, hand-written Assembly code can often be 10x faster! (Assuming you use SIMD vectorization such as ARM's NEON Media Processing Engine or Intel's MMX/SSE/AVX). This is similar to the speedups you can expect from GPGPU acceleration (using NVidia's CUDA or OpenCL), but on a small mobile device rather than an expensive desktop video card! And luckily the iPhone, iPad, iPod and Android phones & tablets nearly all use ARM CPUs with NEON vector processing, so you can use the same Assembly code in apps for the official iPhone App Store and the Android Market (with NDK). And with the recent popularity of ARM CPUs in portable devices, this is likely to continue for several generations of smartphones, tablets, and ultra-portables (eg: in the NVidia Tegra3 "Kal-el", TI OMAP4, QualComm Snapdragon S4 "Krait", Apple iPad2 & iPhone5, etc). Obviously you shouldn't write a whole app using Assembly language, but if you need certain loops to run as fast as possible, then a few sections of Assembly language might be exactly what you need!
Modern processor architectures are much more complicated now than they were at the start of the PC era, which definitely makes efficient Assembly code hard to write by hand, but it also makes efficient code hard for a compiler to generate, and so there is significant room for improvement in efficient code design.
UPDATE: Note that Cortex-A9 and Cortex-A15 CPUs are much more advanced than Cortex-A5, Cortex-A7 & Cortex-A8, so the advantages of Assembly code & NEON SIMD will be less important in Cortex-A9 than in simpler devices such as Cortex-A8.