The tree build phase of Barnes-Hut is particularly interesting because,
although it accounts for only a trivial amount of time compared to the
force calculations, it drives the force calculations by exposing the
locality and pruning away calculations. Tree-building algorithms are also
interesting because they generate space partitions of the points in the
simulation space.
This page details four different tree-building schemes: a single-node (sequential) build, our ``AtomicProc'' parallel build, the Parallel Merge build described in [6], and the Parallel Hashed build of [10].
A comparison of the time/space properties of the algorithms we implemented (Single-node, AtomicProc, and Hashed) is in the section on results. The space complexity is particularly important to us, since it can limit the number of particles that can be simulated.
Our first implementation was a version where a single node builds the tree in local memory:
fun InsertBody(global root,body) =
leaf := FindMyLeaf(root,body)
if (leaf_is_full)
SplitLeaf(root,leaf)
InsertBody(root,body)
else
InsertToLeaf(leaf,body)
end
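The scheme above can be sketched concretely as follows. This is a minimal sequential version; the names (`Node`, `insert_body`, `CAPACITY`) are illustrative, not taken from our implementation.

```python
CAPACITY = 1  # bodies per leaf in this single-node version

class Node:
    def __init__(self, center, half):
        self.center, self.half = center, half  # cube center and half-width
        self.bodies = []                       # non-empty only at leaves
        self.children = None                   # 8 children once split

    def octant(self, p):
        # Octant index 0..7 from the sign of each coordinate vs. the center.
        return sum(1 << i for i in range(3) if p[i] >= self.center[i])

    def split(self):
        # SplitLeaf: create the 8 children covering this node's cube.
        h = self.half / 2
        self.children = [
            Node(tuple(self.center[d] + (h if (j >> d) & 1 else -h)
                       for d in range(3)), h)
            for j in range(8)]

def insert_body(root, body):
    node = root
    while node.children is not None:            # FindMyLeaf
        node = node.children[node.octant(body)]
    if len(node.bodies) >= CAPACITY:            # leaf_is_full
        old, node.bodies = node.bodies, []
        node.split()
        for b in old:                           # re-insert displaced bodies
            insert_body(node, b)
        insert_body(node, body)
    else:
        node.bodies.append(body)                # InsertToLeaf
```

Note that two bodies at identical coordinates would force splits without bound; the perturbation issue discussed below is what prevents this in practice.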
Because the data structure in this case must reside in a single
processor's memory, the problem size was limited by the tree size.
In addition, not only was the data structure limited by a single
processor's memory, but our implementation was originally limited by the
single processor's cache; without a software caching method
to reduce communication bandwidth and mask latency, the full cost of
a global read was paid on every access.
Note that when the partition is a space partition and bodies lie at nearly identical points, this multiplication of the node count by eight recurs at every split, bounded only by the minimum perturbation constant or by the floating-point accuracy.
Once software caching was added, the extreme communication costs were greatly reduced, but at the expense of additional memory, and therefore a smaller maximum problem size.
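The software-caching idea can be sketched as follows. Remote reads go through a local table, so repeated reads of the same global address pay the communication cost once; the `remote_read` callback and the class name are stand-ins, not our actual interface.

```python
class SoftwareCache:
    def __init__(self, remote_read):
        self.remote_read = remote_read  # the expensive global fetch
        self.table = {}                 # local copies: address -> value
        self.misses = 0

    def read(self, addr):
        if addr not in self.table:      # miss: pay the communication cost
            self.table[addr] = self.remote_read(addr)
            self.misses += 1
        return self.table[addr]         # hit: local memory only

    def flush(self):
        # Invalidate everything at a phase boundary, since cached copies
        # go stale once the owning processor writes the node again.
        self.table.clear()
```

The memory cost is exactly the `table`: each cached copy trades communication for local space, which is why caching shrinks the maximum problem size.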
Our fastest implementation was a parallel tree-building algorithm loosely based on the original Barnes-Hut tree-build, but which we optimized to reduce lock contention. Additionally, we allow up to eight bodies per leaf node, greatly benefiting cache performance and reducing some of the memory overhead necessary for links between nodes.
The original Barnes-Hut algorithm builds the tree from the top down, locking nodes on every insertion. Because the overhead of a locking operation is a full round trip for the lock and another full round trip for the unlock, we wanted to minimize the number of locks. Therefore, we implemented our ``AtomicProc'' algorithm such that a node is only locked when it has to be split. However, as we will show, this still results in lock contention, because the amount of work required to split a node is substantial: all of its bodies have to be retrieved and re-inserted into a new subtree which has been split into 8 parts. A sketch of the algorithm follows:
fun InsertBody(global root,body) =
do
result := try_atomic_insert(root,body)
case result in
insert_good:
done := true
node_full:
SplitNode(root)
node_internal:
root := ChildFor(root,body)
end
until done
end
In this case, the bulk of the work is in the try_atomic_insert function,
which executes on the processor which owns the root (it is a global
pointer). In the usual case, the root will be partially empty, and
the body will simply be stored in the root. Occasionally, the root will
become full. In this case, the processor whose insertion caused the root
to fill is responsible for splitting the root into its 8 children and
re-inserting all of the bodies into the appropriate leaves. While it is splitting
the root, the root remains locked so that other processors can't try
to update the value. The other exceptional case is that the node has
been split at some point, in which case it is now an internal node.
In this case, the algorithm simply recurses down the tree looking at
the new internal node. As an additional optimization, our implementation
caches these internal nodes; once a node has been split into
an internal node, it will not be subsequently modified until after the
end of the tree-build stage. With this optimization, the insert
operation runs in O(1) time instead of O(log n) for well-behaved
insertions, that is, when the body data on each processor is relatively
well separated in space. One way to force the O(1) performance on
random data (as mentioned and used in the Hashed Tree-Build algorithm of
[10]) is to sort the body data beforehand, a relatively expensive,
but reasonable, option. Note that sorting the body data before beginning
gives far better performance for all of the parallel algorithms presented
here: each processor eventually comes to work on a portion or portions
of the tree in which the other processors have no interest, greatly
reducing lock contention.
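A threaded sketch of the AtomicProc scheme follows. The atomicity of try_atomic_insert on the owning processor is modeled here with a per-node mutex, and the names are illustrative; the key properties from the description above are preserved: the expensive lock round trip happens only to claim a slot or split, and once a node becomes internal its children array is never modified again, so traversal needs no lock.

```python
import threading

CAPACITY = 8  # up to eight bodies per leaf, as in our implementation

class Node:
    def __init__(self, center, half):
        self.center, self.half = center, half
        self.bodies = []
        self.children = None          # set exactly once, then immutable
        self.lock = threading.Lock()

    def octant(self, p):
        return sum(1 << i for i in range(3) if p[i] >= self.center[i])

def insert(node, body):
    while True:
        kids = node.children
        if kids is not None:
            # node_internal: children are immutable once published,
            # so we can descend without taking any lock.
            node = kids[node.octant(body)]
            continue
        with node.lock:  # models try_atomic_insert on the owning processor
            if node.children is not None:
                continue              # another thread split first; retry
            if len(node.bodies) < CAPACITY:
                node.bodies.append(body)      # insert_good
                return
            # node_full: split into 8 children and push the old bodies
            # down while still holding the lock, then publish.
            h = node.half / 2
            kids = [Node(tuple(node.center[d] + (h if (j >> d) & 1 else -h)
                               for d in range(3)), h)
                    for j in range(8)]
            for b in node.bodies:
                kids[node.octant(b)].bodies.append(b)
            node.bodies = []
            node.children = kids
        # loop around and insert `body` into the appropriate new child

def count_bodies(node):
    if node.children is None:
        return len(node.bodies)
    return sum(count_bodies(c) for c in node.children)
```

The splitting thread holds the node's lock for the whole split, so competing inserters block on exactly the contended node, which is the contention behavior discussed above.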
The Parallel Merge algorithm is described in [6] and in the SPLASH II papers [5], but the SPLASH group did not implement it as they describe. They apparently decided that, since the tree-build phase is such a small portion of the overall cost of Barnes-Hut, an optimized tree build was not worthwhile; they instead used the naive original algorithm, with its significant locking overhead.
Parallel Merge can be described as follows: each processor first builds a tree over its own bodies entirely in local memory, and the per-processor trees are then merged, in parallel, into a single global tree.
Note that, as before, the more well-behaved the data, the less expensive the merge operations become. Simply sorting the data before starting the algorithm will give this heightened performance.
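Our reading of the merge step can be sketched as follows (a sequential sketch with illustrative names; the cited sources should be consulted for the actual algorithm). Two trees covering the same cube merge cheaply when their occupied regions are disjoint, since the recursion then bottoms out quickly, which is why well-behaved or pre-sorted data helps.

```python
CAPACITY = 8

class Node:
    def __init__(self, center, half):
        self.center, self.half = center, half
        self.bodies = []
        self.children = None

    def octant(self, p):
        return sum(1 << i for i in range(3) if p[i] >= self.center[i])

    def split(self):
        h = self.half / 2
        self.children = [
            Node(tuple(self.center[d] + (h if (j >> d) & 1 else -h)
                       for d in range(3)), h)
            for j in range(8)]

def insert(node, body):
    while node.children is not None:
        node = node.children[node.octant(body)]
    if len(node.bodies) < CAPACITY:
        node.bodies.append(body)
        return
    old, node.bodies = node.bodies, []
    node.split()
    for b in old:
        node.children[node.octant(b)].bodies.append(b)
    insert(node, body)

def merge(a, b):
    # Merge tree b into tree a; both must cover the same cube.
    if b.children is None:            # b is a leaf: re-insert its few bodies
        for body in b.bodies:
            insert(a, body)
        return a
    if a.children is None:            # a is a leaf: push its bodies into b
        for body in a.bodies:
            insert(b, body)
        return b
    # Both internal: merge the eight pairs of subtrees independently
    # (in a parallel version, these eight merges can proceed concurrently).
    a.children = [merge(ca, cb) for ca, cb in zip(a.children, b.children)]
    return a

def count_bodies(node):
    if node.children is None:
        return len(node.bodies)
    return sum(count_bodies(c) for c in node.children)
```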
The Parallel Hashed tree-build algorithm of [10] attempts to replace the explicit tree structure with a hash table. The idea is to give each node a key that encodes its path of octant choices from the root. Thus, the root node is hashed to 1, its eight children are hashed to 1000-1111 (in binary), the children of node 1001 hash to 1001000-1001111, etc. There are some immediate benefits to this: a node's parent and children can be computed from its key with simple bit shifts, and any node can be located with a single hash lookup rather than a pointer traversal from the root.
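The key scheme can be sketched as follows. Following [10] as we understand it, the root gets the placeholder key 1 so that every path of octant choices (including octant 0) maps to a distinct key; the function names here are ours.

```python
ROOT_KEY = 0b1  # placeholder bit: keeps e.g. all-octant-0 paths distinct

def child_key(key, octant):
    # Append three bits selecting one of the 8 children.
    assert 0 <= octant < 8
    return (key << 3) | octant

def parent_key(key):
    # Drop the last three bits to recover the parent.
    return key >> 3

# The whole "tree" is just a hash table from keys to node data:
table = {}
table[ROOT_KEY] = "root"
table[child_key(ROOT_KEY, 5)] = "child 5 of root"
```

Any node is then reachable in O(1) by computing its key and indexing `table`, with no pointer chasing from the root.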
-----
maintained by <hodes@cs.berkeley.edu>