Arcee Research Group


Department of Computer Science
Indiana University


Welcome to the Arcee Research Group's webpage!
 You can find us in Lindley Hall 328 on the Bloomington campus.

Thanks to the National Science Foundation for supporting this research under grants entitled "A Paradigm of Parallel Programming for Morton-Ordered Matrices," numbered NSF ACI-0219884 and CCF-0541364.

Also, thanks to NSF, research assistantships are available on this project.

Introduction
The goal of the Arcee Project is to develop a recursive style for block algorithms in imperative languages like C, C++, and Java. That is, recursive control (rather than iterative control) is used to address the demands of locality and parallelism in high-performance computing.

Knowledge of the internal representation affects the programmer's choice of algorithm. An author who seeks fast code has a natural tendency to honor the hardware in making that choice. Today a big constraint is a hierarchical memory of more and more levels: L1, L2, and L3 cache, TLBs, RAM, and paging to disk. Arcee asks us to develop new control structures for old algorithms, the better to take advantage of all levels of the memory hierarchy. Similarly, the expression of control can restrict or enhance scheduling for parallelism.

An important programming pattern is "dovetailing," illustrated here for block-recursive matrix multiplication. One multiplication recursively becomes eight quadrant multiplications, which are ordered so that, as control moves from one to the next, one of the three needed argument blocks is already in cache---at all levels of the memory hierarchy. As you scan this code, this fact is suggested by the syntactic alignment of one argument with another immediately below it. The single function is duplicated in "up" and "down" forms---one reversing the control sequence of the other---and fitted together in a metaphor of the cabinetmaker's beautiful dovetail joint.
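A minimal sketch of the dovetailed recursion in C may help. It assumes a Morton-ordered flat array in which the four quadrants of an n-by-n block occupy consecutive subarrays (n a power of two); the names mult_down and mult_up, and the particular dovetailed ordering, are illustrative choices here, not the project's actual code:

```c
/* Morton-ordered storage: the four quadrants of an n*n block occupy
   consecutive subarrays of (n/2)*(n/2) elements, in the order
   NW (00), NE (01), SW (10), SE (11). */
void mult_up(double *c, const double *a, const double *b, int n);

/* C += A*B: the eight quadrant products are visited in an order chosen
   so that consecutive calls share one argument block (the dovetail). */
void mult_down(double *c, const double *a, const double *b, int n) {
    if (n == 1) { *c += *a * *b; return; }
    int s = (n / 2) * (n / 2);                /* elements per quadrant */
    const double *a00 = a, *a01 = a + s, *a10 = a + 2*s, *a11 = a + 3*s;
    const double *b00 = b, *b01 = b + s, *b10 = b + 2*s, *b11 = b + 3*s;
    double       *c00 = c, *c01 = c + s, *c10 = c + 2*s, *c11 = c + 3*s;
    mult_down(c00, a00, b00, n / 2);   /* each call shares a block  */
    mult_up  (c01, a00, b01, n / 2);   /* with the one before it    */
    mult_down(c11, a10, b01, n / 2);
    mult_up  (c10, a10, b00, n / 2);
    mult_down(c10, a11, b10, n / 2);
    mult_up  (c11, a11, b11, n / 2);
    mult_down(c01, a01, b11, n / 2);
    mult_up  (c00, a01, b10, n / 2);
}

/* The same eight products in reverse order, with up/down roles swapped,
   so that the two forms fit together like a dovetail joint. */
void mult_up(double *c, const double *a, const double *b, int n) {
    if (n == 1) { *c += *a * *b; return; }
    int s = (n / 2) * (n / 2);
    const double *a00 = a, *a01 = a + s, *a10 = a + 2*s, *a11 = a + 3*s;
    const double *b00 = b, *b01 = b + s, *b10 = b + 2*s, *b11 = b + 3*s;
    double       *c00 = c, *c01 = c + s, *c10 = c + 2*s, *c11 = c + 3*s;
    mult_down(c00, a01, b10, n / 2);
    mult_up  (c01, a01, b11, n / 2);
    mult_down(c11, a11, b11, n / 2);
    mult_up  (c10, a11, b10, n / 2);
    mult_down(c10, a10, b00, n / 2);
    mult_up  (c11, a10, b01, n / 2);
    mult_down(c01, a00, b01, n / 2);
    mult_up  (c00, a00, b00, n / 2);
}
```

For n = 2 with A = [[1,2],[3,4]] and B = [[5,6],[7,8]] stored Morton-flat as {1,2,3,4} and {5,6,7,8}, mult_down accumulates {19,22,43,50}, the Morton layout of the product [[19,22],[43,50]].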

For example, here is a graph illustrating what dovetailing does for a classic algorithm, matrix multiplication:

This is a plot of times to multiply N×N matrices, divided by N³. We use this quotient because it exposes the leading coefficient of the cubic polynomial in N that yields the time; ideally it is constant and close to the minimum possible. On this MIPS R10000 processor that minimum is indicated by the flat turquoise line at the bottom. The iterative algorithm is six nested loops for a blocked matrix multiplication, the outer three loops traversing 16×16 blocks and the inner three multiplying those 8192-flop base cases. In blue, BLAS3 thrashes as soon as paging to disk becomes necessary; in red, an Arcee block-recursive algorithm has stable times on matrices too large even for RAM. (In green, a prototype Opie compilation of loops onto Morton-order representation introduces some locality without reprogramming.)

Recursion
Morton order admits block-recursive algorithms, via a programming paradigm different from for loops. That family of algorithms is proving to be very fast.

The most accessible characterization of this approach is a picture.
A useful tool is Ahnentafel indexing, illustrated here for a 4×4 matrix.
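One common formulation of quadtree Ahnentafel indices (a sketch of the idea, not necessarily the project's exact convention): the root gets index 1, the four children of node k are 4k through 4k+3, and the parent of k is k/4, so all tree navigation is plain integer arithmetic:

```c
/* Ahnentafel indices for a quadtree: root = 1; children of node k are
   4k .. 4k+3; parent of k is k/4.  Level l then occupies indices
   4^l .. 2*4^l - 1.  (One common convention, shown for illustration.) */
unsigned ahnen_child(unsigned k, unsigned quadrant) {  /* quadrant in 0..3 */
    return 4 * k + quadrant;
}

unsigned ahnen_parent(unsigned k) {
    return k / 4;
}

/* Depth of node k: count divisions by 4 until we reach the root. */
unsigned ahnen_depth(unsigned k) {
    unsigned d = 0;
    while (k > 1) { k /= 4; ++d; }
    return d;
}
```

For example, node 21 sits at depth 2, since 21/4 = 5 and 5/4 = 1 reaches the root in two steps.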

Classes like Ahnentafel indices and dilated integers augment the usual ints used as Cartesian, Morton, and level-order indices into binary-tree representations of vectors, quadtree representations of matrices, and, generally, 2^d-ary trees representing d-dimensional arrays.
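A sketch in C of the two-dimensional case: a dilated integer spreads a coordinate's bits to the even bit positions, and interleaving two dilated coordinates yields the Morton index. The increment trick at the end, which avoids undilating, is in the spirit of the "Fast Additions on Masked Integers" paper below; the function names here are illustrative:

```c
#include <stdint.h>

/* Dilate the low 16 bits of x: spread them to the even bit positions
   (b3 b2 b1 b0 becomes 0 b3 0 b2 0 b1 0 b0). */
uint32_t dilate(uint32_t x) {
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Morton index of element (row, col): interleave the two dilated
   coordinates, with the row bits in the odd positions. */
uint32_t morton(uint32_t row, uint32_t col) {
    return (dilate(row) << 1) | dilate(col);
}

/* Increment a dilated integer without undilating: force the unused
   (odd) bits to 1 so the carry ripples straight across them. */
uint32_t dilated_incr(uint32_t d) {
    return ((d | 0xAAAAAAAAu) + 1) & 0x55555555u;
}
```

For example, morton(2, 3) is 13: element (2,3) of a 4×4 matrix lands in the SE quadrant (offset 12) at position 1 within it.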


This support will be especially useful to carry the Arcee paradigm into languages like ML, Scheme, and Haskell.
Thus it is possible to formulate elegant recursive algorithms that, by default, account for performance in the memory hierarchy, and are easy to parallelize.

"Arcee"
So you're probably asking yourself, "What's an arcee?" R. C. stands for "recursive control"; it is critical to achieving the locality on which both the style and the representation depend.
Also, the brother Opie project was named after a Transformer®, Optimus Prime, and it turns out that the first female Transformer is named Arcee: a happy, gender-neutral coincidence.

Contributors to the Arcee Project:
Papers:
  1. K. Patrick Lorton and David S. Wise. Analyzing block locality in Morton-order and Morton-hybrid matrices. Proc. MEDEA Workshop '06 (MEmory performance: DEaling with Applications, systems and architecture), Computer Architecture News (to appear). Authors' version.
  2. Michael D. Adams and David S. Wise. Fast Additions on Masked Integers. SIGPLAN Not. (May 2006), 39--45. Authors' version.
  3. David S. Wise, Craig L. Citro, Joshua J. Hursey, Fang Liu, and Michael A. Rainey. A Paradigm for Parallel Matrix Algorithms: Scalable Cholesky. In J. C. Cunha and P. D. Medeiros (Eds.), Proc. Euro-Par'05, Lecture Notes in Computer Science 3648, Berlin: Springer (August 2005), 687--698. Authors' version.
  4. Rajeev Raman and David S. Wise. Converting to and from Dilated Integers. Submitted for publication (2004). Authors' version.
  5. Michael A. Rainey and David S. Wise. Embedding Quadtree Matrices in a Functional Language. Submitted for publication (2004). Authors' version.
  6. Jeremy D. Frens and David S. Wise. QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism. Proc. 2003 ACM Symp. on Principles and Practice of Parallel Programming, SIGPLAN Not. 38, 10 (2003 October), 144--154. Authors' version.
  7. David S. Wise, Jeremy D. Frens, Yuhong Gu, and Gregory A. Alexander. Language support for Morton-order matrices. Proc. 2001 ACM Symp. on Principles and Practice of Parallel Programming, SIGPLAN Not. 36, 7 (2001 July), 24--33. Authors' version.
  8. David S. Wise. Ahnentafel indexing into Morton-ordered arrays, or matrix locality for free. In A. Bode, T. Ludwig, R. Wismüller (Eds.), Euro-Par 2000 -- Parallel Processing, Lecture Notes in Computer Science 1900, Berlin: Springer (2000 August), 774--784.
  9. David S. Wise and Jeremy D. Frens. Morton-order matrices deserve compilers' support. Technical Report 533. Computer Science Department, Indiana University (1999 November).
  10. David S. Wise. Undulant block elimination and integer-preserving matrix inversion. Sci. Comput. Program. 33, 1 (1999 January), 29--85. Math Review 1 653 749. Abstract only: SIGSAM Bull. 29 (1995 June), 28. Author's version.
  11. Jeremy D. Frens and David S. Wise. Auto-blocking matrix-multiplication, or Tracking BLAS3 performance from source code. Proc. 1997 ACM Symp. on Principles and Practice of Parallel Programming, SIGPLAN Not. 32, 7 (1997 July), 206--216. Authors' version.

Related Work:
  1. E. Elmroth, F. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Rev. 46(1):3--45, March 2004.
  2. S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. IEEE Trans. Parallel Distrib. Syst., 13(11):1105--1123, Nov. 2002.
  3. E. Elmroth and F. Gustavson. Applying recursion to serial and parallel QR factorization leads to better performance. IBM J. Res. Develop. 44(4):605--624, July 2000.
David S. Wise
Last modified: 2003 Aug 20