SPARC/UNIX Assembly Language Notes
Bart Massey 1/93

These notes cover that portion of SPARC/UNIX assembly language
used in "normal" programming.  For the rest, consult "SPARC
Processor Architecture", and "SPARC Assembly Language
Reference".

Syntax:

  Each line has zero or more assembly-language statements;
  multiple statements are separated by semicolons.  The "!"
  character acts as a comment character, commenting everything
  up to the end of the line.

  Each statement may be optionally preceded by a label
  definition, which symbolically represents the address at
  which that statement is placed.  Label definitions always
  end with a colon.

  Everything is case-sensitive except builtin symbols (those
  beginning with "%").  The syntax of integers is as in C.
  Floating point constants must be preceded by 0r or 0R (for
  Real).  The special constants 0rnan, 0rinf, 0r-nan, and 0r-inf
  exist with the intuitive meaning.  String constants are as in
  C, except that either single or double quotes may be used.

  Identifiers may contain alphabetic characters, "_", "$",
  and ".", as well as numeric characters in any position
  except the first.  By convention, symbol names beginning
  with "L" are compiler locals -- don't use them.

  The symbol "." is predefined, and always refers to the address
  of the beginning of the current assembly language statement.

  Special labels consisting of a single digit may be repeatedly
  declared, and may be referenced by suffixing them with "b" (for
  "back reference") or "f" (for "forward reference"), e.g.

		nop
	0:
		mov	0b,%g1		! 0b is the "0:" just above
		cmp	%g1,0
		ble	0f		! 0f is the "0:" just below
		nop
	0:
		nop

  Register names are
    %g0 .. %g7:  global registers, shared by all windows
                 (%g0 always reads as zero)
    %o0 .. %o7:  output registers for current window
    %l0 .. %l7:  local registers for current window
    %i0 .. %i7:  input registers for current window
  Synonyms for the above are
    %r0 .. %r31:  same as above registers (in above order)
    %sp:  same as %o6
    %fp:  same as %i6

  Constant expressions allow the binary operators
    + - * / % ^ << >> & |
  with precedence and meanings as in C, and the prefix unary
  operators
    + - ~ %hi %lo
  where the last two extract the most significant 22 bits and
  least significant 10 bits of their operand respectively.
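
  As a sketch of how %hi and %lo are typically combined (the
  symbol _var and the registers chosen here are illustrative
  assumptions, not part of these notes), a word at an arbitrary
  32-bit address can be loaded in two instructions:

        ! for the constant 0x12345678:
        !   %hi(0x12345678) = 0x48d15   (top 22 bits)
        !   %lo(0x12345678) = 0x278     (bottom 10 bits)
        sethi   %hi(_var),%l0           ! upper 22 bits of the address
        ld      [%l0+%lo(_var)],%l1     ! lower 10 bits as displacement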

Pseudo-Ops

  .ascii s1, s2, ...
    Generates the strings of ascii characters denoted
    by the string constant arguments.

  .asciz s1, s2, ...
    As .ascii, but postpends a null character to each string.

  .seg s
    Sets the "current segment" to the string constant s
    (begins appending generated bits to the current segment).
    Possible segments are "text", "data", "data1", and
    "bss".  The first is where programs live, the last is
    initialized-to-zero storage, and the other two are
    initialized data.

  .skip n
    Generate n bytes of "empty space" in the current segment.

  .align b
    .skip to a b-byte boundary in the current segment.

  .byte e1, e2, ...
  .half e1, e2, ...
  .word e1, e2, ...
  .double e1, e2, ...
    Generates appropriately-sized values of the expression
    arguments.  Ignores alignment, which must be done separately.

  .global s1, s2, ...
    Declares the symbol arguments as "global".  This declaration
    must occur before any definition of the global symbol.

  s = e
    Makes the symbol s denote the constant expression e.
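
  A small sketch pulling several of these pseudo-ops together.
  The names _table, msg, MSGLEN, and buf are made up for
  illustration, the segment names are written as in the complete
  example at the end of these notes, and the use of "." inside
  an "=" expression is assumed to be accepted by the assembler.

        .seg    "data"          ! initialized data follows
        .global _table          ! declared before it is defined
        .align  4               ! .word does not align itself
  _table:
        .word   1, 2, 3         ! three 32-bit values
  msg:
        .asciz  "hello"         ! 5 characters plus a null
  MSGLEN = . - msg              ! bytes generated since msg
        .seg    "bss"           ! zero-initialized storage
        .align  8
  buf:
        .skip   64              ! 64 bytes of empty space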

Instructions

  Instructions ending in cc set the flags, whereas instructions
  ending in CC test the flags.  The possible tests are
    n:	 never
    ne:	 not equal to
    nz:  not zero
    e:	 equal to
    z:	 zero
    g:	 greater than
    le:  less than or equal to
    ge:  greater than or equal to
    l:   less than
    gu:  greater than, unsigned
    lu:  less than, unsigned
    leu: less than or equal to, unsigned
    geu: greater than or equal to, unsigned
    cc:  carry clear
    cs:  carry set
    pos: positive
    neg: negative
    vc:  overflow clear
    vs:  overflow set
    a:   always
  If the test is absent, "always" is assumed.

  Instructions where one possible ending is ",a" have a "delay
  slot" associated with their execution.  The "a" stands for
  "annul".  If an instruction i1 with a delay slot is not
  annulled, then the following instruction i2 is always
  executed, before control reaches the target of i1.  Only
  certain instructions (principally instructions which do not
  themselves have a delay slot) can be executed in a delay slot.
  If i1 is annulled and conditional, then i2 is executed only
  when the branch is taken; when the branch is not taken, i2
  is discarded (not executed).  As an exception, an annulled
  "always" instruction i1 will *never* execute i2, even though
  the branch is taken.
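
  The three delay-slot cases can be sketched as follows.  This
  is an illustrative fragment, not taken from any real program;
  the registers are arbitrary.

        cmp     %o0,0
        bl      1f              ! not annulled:
        mov     1,%o1           !   slot runs, taken or not
        be,a    1f              ! annulled conditional:
        mov     2,%o1           !   slot runs only if taken
        ba,a    1f              ! annulled "always":
        mov     3,%o1           !   slot never runs
  1:
        ! %o1 is 2 here if %o0 was zero, otherwise 1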
  
  add s1, s2, d
  addcc s1, s2, d
  addx s1, s2, d
  addxcc s1, s2, d
    Adds the register s1 to the register or 13-bit constant s2,
    storing the result in the register d.  The "x" versions add
    in the carry flag.
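
    For instance, a 64-bit addition can be sketched with the
    carry-propagating form.  The register assignment (%o0:%o1
    and %o2:%o3 as high:low word pairs) is an assumption made
    for this example only.

        addcc   %o1,%o3,%o1     ! add low words, setting carry
        addx    %o0,%o2,%o0     ! add high words plus carry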

  and s1, s2, d
  andcc s1, s2, d
    Ands the register s1 with the register or 13-bit constant s2,
    storing the result in the register d.

  andn s1, s2, d
  andncc s1, s2, d
    Ands the register s1 with bitwise negation of the register
    or 13-bit constant s2, storing the result in the register d.

  bCC a
  bCC,a a
    Branches to the 22-bit word displacement a if the condition code
    CC is true, else falls through.

  call a
    Stores the address of the call instruction itself into %o7
    and branches to the 30-bit word displacement a.  Has a delay
    slot, but cannot be annulled.

  jmpl ra, rl
    Stores the address of the jmpl instruction (the "link"
    address) into the register rl, and jumps to address ra.
    Has a delay slot.

  ldSZ [a], r
    Loads into register r the value at address a, which is
    either a 13 bit displacement plus a register or the sum of
    two registers.  The value is modified according to SZ if present,
    where SZ is one of
      sb: signed byte
      sh: signed halfword
      ub: unsigned byte
      uh: unsigned halfword
    thus ldub would load the byte at the specified address,
    zero-extending it to fill the remainder of the word, storing
    the result in the destination register.
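
    A short sketch of the addressing forms and of sign versus
    zero extension.  The registers are arbitrary, and the byte
    at [%o0] is assumed to hold 0xff.

        ldsb    [%o0],%o1       ! %o1 = 0xffffffff (-1)
        ldub    [%o0],%o2       ! %o2 = 0x000000ff (255)
        ld      [%o0+%o3],%o4   ! register + register
        ld      [%fp-4],%o5     ! register + 13-bit displacement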

  nop
    No op.

  or s1, s2, d
  orcc s1, s2, d
  orn s1, s2, d
  orncc s1, s2, d
    Exactly analogous to the corresponding "and" instructions.

  restore s1, s2, d
    Pop a register window from the stack, and otherwise behave
    like an add instruction, except that the sources are read
    from the old register window, and the destination written
    into the new register window.

  save s1, s2, d
    Push a register window onto the stack, and otherwise behave
    like an add instruction, except that the sources are read
    from the old register window, and the destination written
    into the new register window.

  sdiv s1, s2, d
  sdivcc s1, s2, d
    Exactly analogous to the corresponding add instructions.  Signed.

  sethi v, r
    Zero the low-order 10 bits of the register r and replace the
    high-order 22 bits of r with bits from the constant v.

  sll s1, s2, d
    Shift s1 left logical s2 bits, with result in d.  
    Normal arithmetic operands.
  
  smul s1, s2, d
  smulcc s1, s2, d
    Exactly analogous to the corresponding add instructions.  Signed.

  sra s1, s2, d
    Shift s1 right arithmetic s2 bits, with result in d.  
    Normal arithmetic operands.
  
  srl s1, s2, d
    Shift s1 right logical s2 bits, with result in d.
    Normal arithmetic operands.
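
    As an illustration (registers chosen arbitrarily), shifts
    can stand in for a multiply by a small constant, and the two
    right shifts differ only in what they shift in at the top:

        sll     %o0,3,%o1       ! %o1 = x * 8
        sll     %o0,1,%o0       ! %o0 = x * 2
        add     %o0,%o1,%o0     ! %o0 = x * 10

        ! with %o2 = 0x80000000:
        sra     %o2,4,%o3       ! %o3 = 0xf8000000 (sign copied)
        srl     %o2,4,%o4       ! %o4 = 0x08000000 (zeros in)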
  
  stSZ r, [a]
    Stores the value in register r at address a, which is
    either a 13 bit displacement plus a register or the sum of
    two registers.  The value is stored according to SZ if present,
    where SZ is one of
      b: byte
      h: halfword
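
    A sketch of byte loads and stores working together: copying
    a null-terminated string from [%o1] to [%o0].  The register
    roles are assumptions made for this example.

  0:    ldub    [%o1],%o2       ! fetch next byte, zero-extended
        stb     %o2,[%o0]       ! store the low byte of %o2
        inc     %o1             ! advance source
        tst     %o2             ! was that the terminator?
        bne     0b              ! no, keep copying
        inc     %o0             ! delay slot: advance destination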

  sub s1, s2, d
  subcc s1, s2, d
  subx s1, s2, d
  subxcc s1, s2, d
    Exactly analogous to the corresponding add instructions.

  swap [a], r
    Atomically swap the word at address a, which is either a 13
    bit displacement plus a register or the sum of two
    registers, with the contents of register r.

  tCC v
    Software interrupt.  Trap to vector v, which is given by
    either a 13 bit displacement plus a register or the sum of
    two registers.

  taddcc s1, s2, d
  tsubcc s1, s2, d
    Tagged add or subtract.  Like addcc or subcc, except that
    overflow is also set if either of the bottom two bits of
    either source operand is nonzero.  These bottom two bits
    can be used as tags.
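
    A sketch of the intended use, assuming a language runtime
    that marks small integers with 00 in the bottom two bits.
    The label slow_add is hypothetical.

        taddcc  %o0,%o1,%o2     ! add, checking both tags
        bvs     slow_add        ! non-integer tag or overflow
        nop                     ! fill delay slot
        ! here %o2 holds a correctly tagged integer sum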

  udiv s1, s2, d
  udivcc s1, s2, d
    Exactly analogous to the corresponding add instructions.  Unsigned.

  umul s1, s2, d
  umulcc s1, s2, d
    Exactly analogous to the corresponding add instructions.  Unsigned.

  xor s1, s2, d
  xorcc s1, s2, d
  xnor s1, s2, d
  xnorcc s1, s2, d
    Exactly analogous to the corresponding "and" instructions.

Synthetic Instructions

  Because of the RISC nature of the chip, certain instructions
  available on "normal" CISC processors are not directly
  available on the SPARC.  For the convenience of
  assembly-language programmers (and compilers), certain
  pseudo-instructions are available in the assembler which
  expand into sequences of real instructions.  Operand
  restrictions are implied by the expansion.

  cmp s1, s2	--> subcc s1, s2, %g0

  jmp a		--> jmpl a, %g0

  call a	--> jmpl a, %o7
    Where a is either a register or a small constant.

  tst r		--> orcc r, %g0, %g0

  ret		--> jmpl %i7+8, %g0
  
  retl		--> jmpl %o7+8, %g0
    Return from leaf subroutine.  See below for leaf subroutines.

  restore	--> restore %g0, %g0, %g0
  save		--> save %g0, %g0, %g0
    Don't use the latter of these.  See below for calling
    conventions.

  set v, r	--> or %g0, v, r
    When -4096 <= v <= 4095 .

  set v, r	--> sethi %hi(v), r
    When ((v & 0x3ff) == 0) .

  set v, r	--> sethi %hi(v), r ; or r, %lo(v), r
    All cases except above.
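
    For illustration, one constant from each of the three
    cases, with the expected expansion in the comment:

        set     100,%o0         ! or %g0,100,%o0
        set     0x12345400,%o0  ! sethi %hi(0x12345400),%o0
        set     0x12345678,%o0  ! sethi %hi(0x12345678),%o0
                                ! or %o0,%lo(0x12345678),%o0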

  not s, d	--> xnor s, %g0, d
    Bitwise negation.

  not d		--> xnor d, %g0, d
    Bitwise negation.

  neg s, d	--> sub %g0, s, d
    Arithmetic negation.

  neg d		--> sub %g0, d, d
    Arithmetic negation.

  inc r		--> add r, 1, r

  inc v, r	--> add r, v, r

  inccc r	--> addcc r, 1, r
  
  inccc v, r	--> addcc r, v, r

  dec r		--> sub r, 1, r

  dec v, r	--> sub r, v, r

  deccc r	--> subcc r, 1, r
  
  deccc v, r	--> subcc r, v, r

  btst v, r	--> andcc r, v, %g0

  bset v, r	--> or r, v, r

  bclr v, r	--> andn r, v, r

  btog v, r	--> xor r, v, r

  clr r		--> or %g0, %g0, r

  clrb [a]	--> stb %g0, [a]

  clrh [a]	--> sth %g0, [a]

  clr [a]	--> st %g0, [a]

  mov v, r	--> or %g0, v, r

Register Windows

  The register windows are intended to prevent saving and
  restoring registers during most procedure invocations.  When
  a save instruction is executed, the old values of %l0..%l7
  and %i0..%i7 are stacked in an efficient fashion, and the
  current values of %o0..%o7 are moved to %i0..%i7.  A restore
  instruction moves %i0..%i7 back to %o0..%o7, and restores
  the old %l0..%l7 and %i0..%i7.

HLL(C) Calling Conventions

  By convention, on entry to a procedure p, %o7 contains the
  address of the call instruction which invoked p, and %o6 is
  the stack pointer.  p must allocate a minimum of 96
  bytes of stack space, in order to make room to save the
  register window on interrupts, and for the calling
  conventions.  p is expected to execute a save instruction,
  which will copy the updated %o6 while simultaneously
  pushing the register window, thusly:

  _p:
  	save	%sp,-96,%sp
	
  The new %o6 (after save) is 96 bytes less than the old %o6
  (before save).  The stack grows downward (toward lower
  addresses), so 96 bytes of storage have been allocated.
  After the save instruction, the old %sp is in %i6 and is
  referred to as the frame pointer %fp.  The call address is
  now in %i7.  To return from the call, we typically do a
  simple restore in the delay slot of a jmpl which jumps to
  the call address + 2 words (1 for the call instruction
  itself, and 1 for the delay slot of the call), thus

  	jmpl	%i7+8,%g0
  	restore

  The restore makes the old %fp be the %sp again, and the jump
  puts us at the instruction following the call.

  The first 6 arguments are passed in %o0..%o5, with remaining
  arguments passed on the stack.  Once the save instruction in
  the procedure prolog has been executed, these become %i0..%i5.
  Structures are passed by allocating them in the caller and
  passing their address.  (Floating point arguments are also
  passed in %o0..%o5, a double taking two registers, rather
  than in the floating point registers %f0..%f31).  The return
  value is placed in %i0 by the callee before executing the
  restore, and is thus available to the caller in %o0.
  Conventional structure return is complicated and is
  discussed separately below.  (Floating point return is in
  %f0).
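
  A sketch of a call with a seventh argument.  The function
  name _sum7 is made up, and the offset 92 for the first stack
  argument is an assumption based on the usual frame layout (64
  bytes of window save area, 4 bytes for the structure return
  address, 24 bytes of register argument space).  The caller
  is assumed to have a frame of at least 96 bytes.

        mov     1,%o0           ! arguments 1..6 in registers
        mov     2,%o1
        mov     3,%o2
        mov     4,%o3
        mov     5,%o4
        mov     6,%o5
        mov     7,%l0
        st      %l0,[%sp+92]    ! argument 7 in caller's frame
        call    _sum7
        nop                     ! fill delay slot
        ! sum is now in %o0

  _sum7:
        save    %sp,-96,%sp
        ld      [%fp+92],%l0    ! argument 7, via frame pointer
        add     %i0,%i1,%i0
        add     %i0,%i2,%i0
        add     %i0,%i3,%i0
        add     %i0,%i4,%i0
        add     %i0,%i5,%i0
        add     %i0,%l0,%i0     ! return value in %i0
        jmp     %i7+8
        restore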
  
  The locals %l0..%l7 are available for local values.  By
  convention, of the globals %g1 is scratch, %g2..%g4 are used
  for global register variables, and %g5..%g7 are reserved.

  Thus,
    %i0..%i5, %l0..%l7
  are callee-saves (the register window preserves them),
    %o0..%o5, %g1
  are caller-saves, and the remaining registers are special.

  Example:  Function which adds its two integer arguments to produce
  an integer result, and its caller.

    _add_2:
      save %sp,-96,%sp		! set up stack frame
      add  %i0,%i1,%l0		! calculate value
      mov  %l0,%i0		! set up return value
      jmp  %i7+8		! return
      restore			! restore during return

      ...
      mov  1,%o0		! set up first parameter
      mov  2,%o1		! set up second parameter
      call _add_2		! call function
      nop			! fill delay slot of call
      mov  %o0,%l0		! do something with return value
      ...
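
  Since _add_2 calls nothing and needs no scratch registers, it
  could also be written as a leaf subroutine, with no save or
  restore.  This is a sketch of that alternative, using the
  retl synthetic instruction described above; the arguments
  stay in %o0 and %o1 because the window never changes.

    _add_2:
      retl			! jmpl %o7+8, %g0
      add  %o0,%o1,%o0		! delay slot computes the result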

Structure Return

  Functions returning a structure have a complicated calling
  convention.  The caller allocates storage as for structure
  arguments, and places the address of the allocated store at
  %sp+64 .  The callee can then reference this address as
  %fp+64 .
  
  In addition, the caller must place an unimp instruction
  following the delay slot of the call instruction, whose
  argument is the least-significant 12 bits of the size of the
  allocated structure.  In this way, the callee can examine
  the word at %o7+8 (%i7+8 after a save) to find out whether
  an appropriate-sized structure has in fact been allocated
  (although this sanity check is unnecessary, and contrary to
  the spirit of the C language).  Thus, a function returning a
  structure must skip 3 words past the calling instruction on
  returning, instead of 2.
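
  A sketch of both sides of this convention for a 12-byte
  structure.  The name _make_point is hypothetical, and the
  caller is assumed to have executed save %sp,-112,%sp so that
  16 bytes of local storage exist below %fp.

        ! caller
        add     %fp,-16,%l0     ! structure lives at %fp-16
        st      %l0,[%sp+64]    ! tell the callee where it is
        call    _make_point
        nop                     ! fill delay slot
        unimp   12              ! structure size (low 12 bits)
        ld      [%fp-16],%l1    ! execution resumes here

  _make_point:
        save    %sp,-96,%sp
        ld      [%fp+64],%l0    ! address supplied by caller
        st      %g0,[%l0]       ! fill in the three words
        st      %g0,[%l0+4]
        st      %g0,[%l0+8]
        jmp     %i7+12          ! skip call, delay slot, unimp
        restore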

  Note that GCC by default uses a *completely different*
  structure return convention.  Thus, if GCC-compiled code is
  to be used in conjunction with code compiled by other
  compilers, where any functions involved return a structure,
  the flag -fpcc-struct-return must be given to GCC.

A Complete Example

  Here's the C code for a sample program.

  int fib( cnt, v1, v2 )
  register int cnt, v1, v2;
  {
    if( cnt <= 0 )
      return v2;
    return fib( cnt - 1, v1 + v2, v1 );
  }
  
  int main( argc, argv )
  int argc;
  char **argv;
  {
    int tmp;
    
    if( argc != 2 )
      abort();
    tmp = atoi( argv[1] );
    if( tmp <= 0 )
      abort();
    printf( "%d\n", fib( tmp - 1, 1, 1 ) );
    return 0;
  }

  And here's a handwritten (before the C code, actually) assembly
  version.

  ! fib -- directly compute and print the nth fibonacci number
  ! Bart 1/93

		.seg	"text"		! just in case
		.global	_main		! must be globally visible
  fib:
  		save	%sp,-96,%sp	! set up
		subcc	%i0,1,%o0	! decrement count
		bg	0f		! if not finished do calc
		nop			! fill delay slot
		mov	%i2,%i0		! return arg 3 if done
		b	1f		! exit
		nop			! fill delay slot
  0:
		mov	%i1,%o2		! old arg 2 becomes new arg 3
		add	%i1,%i2,%o1	! calc new arg 2
		call	fib		! recursive call
		nop			! fill delay slot
		mov	%o0,%i0		! return result
  1:
		jmp	%i7+8		! return
		restore			! while restoring
  _main:
  		save	%sp,-96,%sp	! set up
  		mov	2,%l0		! check for two arguments
  		cmp	%i0,%l0		! 1st parameter is argc
		be	0f		! on error, just abort
		nop			! fill delay slot
		call	_abort		! abort
		nop			! fill delay slot
  0:
		ld	[%i1+4],%o0	! get argv[1]
		call	_atoi		! call atoi( argv[1] )
		nop			! fill delay slot
		cmp	%o0,%g0		! did atoi() return positive?
		bg	0f		! on error, just abort
		nop			! fill delay slot
		call	_abort		! abort
		nop			! fill delay slot
  0:
		mov	1, %l0		! %o0 already contains count,
		mov	%l0,%o1		! %o1 is 1/2 state
		mov	%l0,%o2		! %o2 is other 1/2
		call	fib		! start the recursion
		nop			! fill delay slot
		mov	%o0,%o1		! want result 2nd arg
		set	format,%o0	! and format string 1st arg
		call	_printf		! call it
		nop			! fill delay slot
		mov	%g0,%i0		! resulting exit status 0
		jmp	%i7+8		! return
		restore			! while restoring
  ! string constants may remain in text segment
  format:
  		.asciz	"%d\n"


  Here's an optimized version of the fib function from above.  Note
  that we are still calling-convention compatible, but that we've
  really sped up fib.  In particular, since fib is a "leaf" procedure,
  and since it needs but one scratch register (for which we use
  %o3), we get rid of the save and restore entirely, and tail-call
  the recursion.  We also filled all the delay slots with useful
  work.  Note the use of the annulled branch.

  fib:
		subcc	%o0,1,%o0	! decrement count
		bg,a	0f		! if not finished do calc
		mov	%o1,%o3		! save arg 2 (annulled delay slot)
		jmp	%o7+8		! return
		mov	%o2,%o0		! return arg 3 (delay slot)
  0:
		add	%o1,%o2,%o1	! calc new arg 2
  		b	fib		! tail call
		mov	%o3,%o2		! get arg 3 (delay slot)


  The C compiler actually does even slightly better than this
  hand-optimized code (check it out).