**Why can't I set the maximum depth to be more than 30?**

In the 'ot::DA' module, we compress the octree and use just 1 char per
octant. Of its 8 bits, only 5 represent the level; the remaining 3 bits
are flags that store additional information.
With 5 bits, we can represent numbers from 0 through 31. Hence, the levels
are restricted to be < 32, and since we increase the initial maximum
depth by 1, the maximum depth for construction/balancing is restricted
to be < 31. However, you can set the maximum depth to 31 if you do
not use the 'ot::DA' module. Moreover, within the code we often use the number 2^D,
where D is the maximum depth. To be able to represent this number using an
'unsigned int' datatype, we need D to be < 32 (assuming that an 'unsigned
int' uses 4 bytes). An element at level 30 has a length equal
to approximately one-billionth (2^(-30)) of the length of the domain. At present, we are
not aware of any application that needs this resolution. If such a need arises, this restriction
can be lifted by minor modifications to the code, such as using larger datatypes.
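
As an illustration of the bit budget described above, the following is a minimal sketch of packing a level and three flag bits into one byte. The function names and the choice of the low 5 bits for the level are assumptions made for this example; the actual layout inside 'ot::DA' may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout (an assumption, not taken from the DENDRO sources):
// low 5 bits hold the level, high 3 bits hold the flags.
constexpr std::uint8_t kLevelMask = 0x1F;  // binary 00011111
constexpr unsigned kFlagShift = 5;

// Pack a level (0..31) and a 3-bit flag field (0..7) into a single byte.
inline std::uint8_t packLevelAndFlags(unsigned level, unsigned flags) {
  assert(level <= 31);  // 5 bits can only hold 0 through 31
  assert(flags <= 7);   // 3 bits can only hold 0 through 7
  return static_cast<std::uint8_t>((flags << kFlagShift) | level);
}

inline unsigned unpackLevel(std::uint8_t packed) { return packed & kLevelMask; }
inline unsigned unpackFlags(std::uint8_t packed) { return packed >> kFlagShift; }

// 2^D fits in a 32-bit unsigned integer only when D < 32.
inline std::uint32_t domainExtent(unsigned maxDepth) {
  assert(maxDepth < 32);
  return std::uint32_t{1} << maxDepth;
}
```

With this layout, a level of 32 or more would spill into the flag bits, which is why the level must stay below 32.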

**Is there an upper bound on the local grain size assuming that I have plenty
of RAM?**

Yes. We use 'unsigned int' to represent all local sizes. Moreover,
we assume that the local size does not exceed 16M in order to compress
the element-to-node mappings in the 'ot::DA' module.

**Why is the value returned by the function 'ot::DA::getMaxDepth()'
1 more than what was used to create the octree?**

The return value is the maximum depth in the modified octree, which
includes the 'pseudo' octants added for the positive boundaries.
During meshing, the original octree is embedded into a larger octree.
The root of the original octree becomes the first child of
the root of this larger octree. The anchors of the octants in the
original octree remain unchanged in the new octree as well. Only
their levels get incremented by 1. This is done so that we can create
a 1-1 mapping between the octants of the new octree and
the nodes of the original octree.
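
The embedding step can be sketched as follows. The `Octant` struct and the function name are hypothetical and exist only for this illustration; the sketch also leaves out the 'pseudo' octants that are added for the positive boundaries.

```cpp
#include <vector>

// Hypothetical octant record for illustration: an integer anchor and a level.
struct Octant {
  unsigned x, y, z;  // anchor coordinates
  unsigned level;    // depth in the octree (root = 0)
};

// Embed the original octree into the larger octree: anchors stay the same
// and every level increases by 1. The original root (level 0) therefore
// becomes the first child (level 1) of the new root.
// Adding the 'pseudo' boundary octants is not shown here.
std::vector<Octant> embedInLargerOctree(const std::vector<Octant>& original) {
  std::vector<Octant> embedded;
  embedded.reserve(original.size());
  for (const Octant& o : original) {
    embedded.push_back(Octant{o.x, o.y, o.z, o.level + 1});
  }
  return embedded;
}
```

Under these assumptions, an octree built with maximum depth D yields octants with levels up to D + 1 after embedding, which matches the value reported by 'ot::DA::getMaxDepth()'.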

**How can I contribute to DENDRO?**

- An interesting feature of ot::DA is that it supports the ALL loop, which allows one to loop over pre-ghost elements. The advantage is that one communication can be avoided in each FE MatVec, at the cost of some extra work (computations on ghost elements are duplicated across processors). To support this, we had to mesh pre-ghost elements as well, which is quite complicated and adds a little overhead compared to meshing local elements alone. We would like to compare this extra overhead, and the associated savings in the MatVecs, with the simpler case in which the ALL loop is not supported. So a simpler ot::SDA (simple-DA) object that does not mesh pre-ghosts would be a nice addition to the library. This should not be a lot of work. It should only require going through ot::DA and cleaning it up a bit by removing the portions for meshing pre-ghosts.
- It would be interesting to see the differences in performance with and without mesh compression and overlapping communication and computation for different problems and on different machines.
- Extend ot::DA to support higher order shape functions. It would be nice to build a generic framework that can support arbitrary order, but this is obviously non-trivial. A simpler solution would be to have a library of elements with different order.
- Compare the load imbalance and inter-processor boundary sizes obtained with the Morton partition and the Block partition for different octrees.
- Combine DENDRO with domain-decomposition algorithms to make it work with octree forests.
- Parallel sort is the backbone of most of our algorithms. Currently, we use a sample-sort/bitonic-sort hybrid in which the splitters in the sample-sort algorithm are sorted using bitonic-sort. Moreover, if the grain size is < 5*(number of processors) then we switch to using bitonic-sort alone. This means that as we scale to thousands of processors the grain size must also increase proportionately in order to continue to use sample-sort. This might become a bottleneck at some point.
- Extend DENDRO to work with other space-filling curves as well.
- Improve the load balancing heuristic in ot::DAMGCreateAndSetDA.
- Come up with good heuristics to determine what fraction of the INDEPENDENT loop should be overlapped with ReadFromGhosts and what fraction should be overlapped with WriteToGhosts. (See the restriction/prolongation matvecs for an example of this.)
- Avoid allocating memory for buffers in every call to VecGetBuffer. This is quite simple to do. Just split VecGetBuffer into VecCreateBuffer and VecCopyToBuffer. Similarly for VecRestoreBuffer.
- Improve ot::DA::createMatrix by preallocating more efficiently.
- Provide nice examples for different functions.
- Define two integer types DendroIntR (regular int) and DendroIntS (small int) and use these instead of 'unsigned int' and 'unsigned short' everywhere in the code. Also generalize the compression portion (both octree compression and LUT compression) to work with these types.
- Get rid of 'grid-sequencing' and 'RTLMG' in ot::DAMG. These are hardly ever used and are just a lot of unnecessary code. Remove function handles for creating and computing matrices and setting rhs from the DAMG structure. They can be directly set by the user.
- Add support for Newton Multigrid and FAS.
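
To make the parallel-sort item above concrete, here is a minimal sketch of the switching rule it describes. The function names are hypothetical and the code only restates the "grain size < 5*(number of processors)" condition; it is not taken from the DENDRO sources.

```cpp
#include <cstddef>

// Factor taken from the rule stated in the list above.
constexpr std::size_t kGrainFactor = 5;

// True when the rule would fall back to bitonic-sort alone.
inline bool usesBitonicSortOnly(std::size_t grainSize, std::size_t numProcs) {
  return grainSize < kGrainFactor * numProcs;
}

// Smallest grain size for which the sample-sort hybrid is still used.
inline std::size_t minGrainSizeForSampleSort(std::size_t numProcs) {
  return kGrainFactor * numProcs;
}
```

For example, on 1000 processors this rule requires a grain size of at least 5000 to stay on the sample-sort path, and the requirement grows linearly with the processor count.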