I’ve been playing with doing numerical computing in pure
Java. Normally I’d say that this is a silly thing to try and
do because there are mature, established, well-tested
high-performance libraries to this already (e.g. BLAS, LAPACK) and
reinventing the wheel is a waste of time. However, there are good
reasons for wanting to do numerical computing in pure Java.
Firstly, because the JVM performs just-in-time (JIT) compilation of
the JVM bytecodes to native instructions, there may be
optimisations that the JIT compiler can make based upon data that
is only available at run-time and not at the time your source code
is compiled (i.e. to bytecodes, or for languages like
C/C++/Fortran, to native instructions). Therefore, the JIT compiler
is theoretically able to optimise your code for the data that it is
being run on, and the likes of C/C++/Fortran can never hope to make
such optimisations. It seems reasonable to expect that as the JVM
matures, performance will increase as better use is made of
run-time data. Another good reason to use Java is that software
distribution is a breeze— one just needs to plop a
.jar archive in a directory and off you go; no
compilation required. These arguments can apply to any
bytecode-based runtime environment, but Java is the most mature of
these.
There is a good article on Apple’s developer pages on software pipelining. This is applicable when you have some algorithm that is applied to lots of data. Software pipelining is a way of strongly encouraging your compiler (bytcode or traditional) to generate instructions that make optimal use of the parallelism of your CPU. Most CPUs make use of a technique called pipelining: the process of performing a single instruction is broken into separate tasks (e.g. fetch data, perform instruction, save result). If you perform each instruction in that order, only 1/3 of your CPU’s hardware will be active at any given time.
If you can have three instructions ‘in-flight’ at once, one can be being fetched, one can be being performed and one can be having its result saved. So all of your CPU’s hardware is being utilised at once and you get three times as much work done. This is all a massive simplification, but it gets the idea across.
The trick is to keep the pipeline full. If the pipeline is not full, you’re not getting optimal performance. If your instruction is a branch (i.e. you’re testing some variable and may take one of two paths in your code), then you don’t know which instruction you should be working on next, so your pipeline will dry up until you do know. This may not seem too bad with a three-stage pipeline, but it’s typical for pipelines to be much longer (I believe the one on my CPU has about 15 stages). Because branches are so common in code, it’s quite likely that any sequence of 15 instructions will have a branch instruction in it. This means that keeping the pipeline full is hard.
The typical way around this is to use branch prediction: your CPU has some way of guessing which branch will be taken and starts to speculatively execute the instructions that would be found down that branch if it is taken. Your compiler might even generate some data that the CPU can use to make this decision, specific to your code. However, if the branch prediction is wrong, you have to throw away those speculatively executed instructions—resulting in the pipeline drying up—and you have to start from scratch on the branch that you didn’t expect to take.
Most good compilers will probably try and generate code that keeps the pipeline full, but compilers can’t know ahead of time that a particular algorithm is going to be a bottleneck and go to heroic lengths to optimise the code to keep the pipeline full (or to take advantage of your CPU’s other parallelism features). However, as the Apple article describes, you can write code that will give your compiler every opportunity to generate code that takes advantage of the processor’s architecture.
I was intrigued as to whether writing Java code in software
pipelined style would yield performance improvements. To test this,
I implemented naive and pipelined functions to apply the function
f(x) = [(x+2) * 3] / 12 to every element of a large
array. The performance of these functions was evaluated by running
them a large number of times and computing the minimum, mean and
maximum times taken to execute each of the two functions. The code
is given at the end of this article. Due to the fact that Java does
not have pre-processor macros like C and C++, the code is even less
readable than the C code it is based upon (the C code in the Apple
example).
The experiment was run 12 times under the following JVM:
java version “1.5.0_06” Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05) Java HotSpot(TM) Client VM (build 1.5.0_06-b05, mixed mode, sharing)
The code was run under
“Linux 2.6.12–10–386 #1 Sat Mar 11 16:13:17 UTC 2006 i686 GNU/Linux”
on a (admittedly ageing) Dell Dimension 8100 with a 1.3GHz Pentium
4 processor and 512MB RAM. The variables that were changed were:
the ‘mode’ that the JVM was running in (explained
shortly), the length of the input data array (the size
variable) and the number of times over which the results are
averaged (the numTests variable). The JVM can run in
one of two modes: by default it runs in ‘client’ mode,
which is optimised for fast startup times; alternatively it can run
in ‘server’ mode, which performs some optimisations on
the bytecode at startup time (i.e. your program takes longer to
start) but which is supposed to then run the code slightly faster.
In the table below, A corresponds to size = 1024 and
numTests = 1000, B corresponds to
size = 1024 and numTests = 10000 and C
corresponds to size = 10240 and
numTests = 1000. The results are given in the table
below and timings are in nanoseconds (along with those for a naive
version of the algorithm implemented in Matlab). The following
convention for displaying minimum, mean and maximum times is used:
minmeanmax.
| ‘Client’ JVM | ‘Server’ JVM | Matlab R14 | |
| Naive implementation | A=39000806734688000 B=39000750005210000 C=370000312621330439000 | A=4100010875817781000 B=420007924115543000 C=378000244520024155000 | A=1210001468904000000 B=1210001326502800000 C=5270005626503200000 |
| Pipelined implementation | A=40000487992980000 B=39000444753395000 C=37500042247410574000 | A=410009682440637000 B=410005008740117000 C=37900048704642408000 |
I found that running the naive and pipelined algorithms once (i.e.
not taking the average of many runs) showed that the naive
algorithm was about twice as fast the pipelined algorithm (with
size = 1024). However, running them many times and
taking the average of the run-times shows that
in general the pipelined algorithm is more efficient.
This is a fairly quick and dirty experiment, but we can see that in the average and worst cases for tasks A and B, software pipelining delivers approximately twice the throughput of the naive algorithm (1.7 times faster). In the best case, software pipelining does no better and no worse than the naive algorithm.
In case C, where we are operating over a large number of data items (10240 compared to 1024 in cases A and B), in the average case, software pipelining delivers almost an order of magnitude more throughput (7 times faster) compared to the naive algorithm! In the worst case, software pipelining is three times faster than the naive algorithm. In the best case, software pipelining does no better and no worse than the naive algorithm.
It is interesting to note that, contrary to conventional wisdom, running the JVM in server mode only yields a performance increase in one of the 12 runs (case C for the naive implementation). However, it’s likely that the server mode is optimised for integer arithmetic which will much more common in server applications than floating point arithmetic. In conclusion, software pipelining for Java can yield performance increases for numerical applications.
Software pipelined code is, by its nature, much less readable and
more verbose than the equivalent naive code. In order to make it
clear how the code below works, please first read the Apple
document linked to above. Once you’ve read that, you need to
know that there are five stages to the software pipeline
implemented below. Stage 1 loads data from the input array. Stage 2
adds two to that data. Stage 3 multiplies stage 2’s result by
3. Stage 4 divides stage 3’s result by 12. Stage 5 writes
stage 4’s result to the target array. It’s important to
use separate array indexing variables (i and
j).
import java.util.Random;
public class SoftwarePipeline
{
public static void main(String[] args)
{
try
{
SoftwarePipeline pipeline = new SoftwarePipeline();
pipeline.run();
}
catch(Exception e)
{
System.out.println(e.getMessage());
}
}
private void run()
throws Exception
{
final int size = 10240;
// Create the input.
double[] input = null;
final int numTests = 1000;
long start, time;
long[] pipelinedTimes = new long[numTests];
long[] naiveTimes = new long[numTests];
double[] naiveResult = null;
double[] pipelinedResult = null;
for(int i =0; i < numTests; i++)
{
input = getInput(size);
start = System.nanoTime();
naiveResult = doNaive(input);
time = System.nanoTime() - start;
naiveTimes[i] = time;
start = System.nanoTime();
pipelinedResult = doPipelined(input);
time = System.nanoTime() - start;
pipelinedTimes[i] = time;
if(!checkResult(naiveResult, pipelinedResult))
{
throw new Exception("Methods differ!");
}
}
System.out.println(
"Naive implementation: min, mean, max: "
+ Double.toString(minOf(naiveTimes))
+ ", "
+ Double.toString(meanOf(naiveTimes))
+ ", "
+ Double.toString(maxOf(naiveTimes))
+ " ns.");
System.out.println(
"Pipelined implementation: min, mean, max: "
+ Double.toString(minOf(pipelinedTimes))
+ ", "
+ Double.toString(meanOf(pipelinedTimes))
+ ", "
+ Double.toString(maxOf(pipelinedTimes)));
}
/**
Get some data to work on.
*/
private final double[] getInput(final int size)
{
final double[] retVal = new double[size];
Random generator = new Random();
// Populate it with random data.
for(int i = 0; i < size; i++)
{
retVal[i] = generator.nextDouble();
}
return retVal;
}
/**
Naive implementation: apply the following to each item of
input: result = ((input + 2) * 3) / 12.
*/
private final double[] doNaive(final double[] input)
{
final int len = input.length;
final double[] retVal = new double[len];
for(int i = 0; i < len; i++)
{
retVal[i] = ((input[i] + 2) * 3) / 12;
}
return retVal;
}
/**
Implements a software-pipelined version of the naive algorithm.
*/
private final double[] doPipelined(final double[] input)
{
final int len = input.length;
final double[] retVal = new double[len];
double stage1result;
double stage2result;
double stage3result;
double stage4result;
final int stageCount = 5; // There are 5 stages.
int i = 0;
int j = 0;
if(len >= (stageCount-1))
{
// Stage 1:
stage1result = input[i];
i++;
////////////////////////////////////////
// Stage 2:
stage2result = stage1result + 2;
// Stage 1:
stage1result = input[i];
i++;
////////////////////////////////////////
// Stage 3:
stage3result = stage2result * 3;
// Stage 2:
stage2result = stage1result + 2;
// Stage 1:
stage1result = input[i];
i++;
////////////////////////////////////////
// Stage 4:
stage4result = stage3result / 12;
// Stage 3:
stage3result = stage2result * 3;
// Stage 2:
stage2result = stage1result + 2;
// Stage 1:
stage1result = input[i];
i++;
////////////////////////////////////////
// Stage 5:
retVal[j] = stage4result;
j++;
// Stage 4:
stage4result = stage3result / 12;
// Stage 3:
stage3result = stage2result * 3;
// Stage 2:
stage2result = stage1result + 2;
// Stage 1:
stage1result = input[i];
i++;
////////////////////////////////////////
while(i < len)
{
// Stage 5:
retVal[j] = stage4result;
j++;
// Stage 4:
stage4result = stage3result / 12;
// Stage 3:
stage3result = stage2result * 3;
// Stage 2:
stage2result = stage1result + 2;
// Stage 1:
stage1result = input[i];
i++;
}
////////////////////////////////////////
// Stage 5:
retVal[j] = stage4result;
j++;
// Stage 4:
stage4result = stage3result / 12;
// Stage 3:
stage3result = stage2result * 3;
// Stage 2:
stage2result = stage1result + 2;
////////////////////////////////////////
// Stage 5:
retVal[j] = stage4result;
j++;
// Stage 4:
stage4result = stage3result / 12;
// Stage 3:
stage3result = stage2result * 3;
////////////////////////////////////////
// Stage 5:
retVal[j] = stage4result;
j++;
// Stage 4:
stage4result = stage3result / 12;
////////////////////////////////////////
// Stage 5:
retVal[j] = stage4result;
j++;
}
// Cleanup code for small arrays.
while(i < len)
{
// Stage 1:
stage1result = input[i];
i++;
// Stage 2:
stage2result = stage1result + 2;
// Stage 3:
stage3result = stage2result * 3;
// Stage 4:
stage4result = stage3result / 12;
// Stage 5:
retVal[j] = stage4result;
j++;
}
return retVal;
}
private final boolean checkResult(final double[] a,
final double[] b)
{
if(a.length != b.length)
{
return false;
}
for(int i = 0; i < a.length; i++)
{
if(a[i] != b[i])
{
return false;
}
}
return true;
}
private final double meanOf(final long[] array)
{
double total = 0;
for(int i = 0; i < array.length; i++)
{
total += array[i];
}
return total / array.length;
}
private final double minOf(final long[] array)
{
double min = array[0];
for(int i = 1; i < array.length; i++)
{
if(array[i] < min)
{
min = array[i];
}
}
return min;
}
private final double maxOf(final long[] array)
{
double max = array[0];
for(int i = 1; i < array.length; i++)
{
if(array[i] > max)
{
max = array[i];
}
}
return max;
}
}