SPARC/UNIX Assembly Language Notes
Bart Massey 1/93

These notes cover that portion of SPARC/UNIX assembly language used
in "normal" programming.  For the rest, consult "SPARC Processor
Architecture" and "SPARC Assembly Language Reference".

Syntax:

Each line has zero or more assembly-language statements -- multiple
statements are separated by semicolons.  The "!" character acts as a
comment character, commenting out everything up to the end of the
line.

Each statement may optionally be preceded by a label definition,
which symbolically represents the address at which that statement is
placed.  Label definitions always end with a colon.

Everything is case-sensitive except builtin symbols (those beginning
with "%").

The syntax of integers is as in C; floating-point constants must be
preceded by 0r or 0R (for Real).  The special constants 0rnan, 0rinf,
0r-nan, and 0r-inf exist with the intuitive meaning.  String constants
are as in C, except that either single or double quotes may be used.

Identifiers may contain alphabetic characters, "_", "$", and ".", as
well as numeric characters in any position except the first.  By
convention, symbol names beginning with "L" are compiler locals --
don't use them.  The symbol "." is predefined, and always refers to
the address of the beginning of the current assembly language
statement.

Special labels consisting of a single digit may be declared
repeatedly, and may be referenced by suffixing them with "b" (for
"back reference") or "f" (for "forward reference"), e.g.

                nop
        0:      mov     0b,%g1
                cmp     %g1,0
                ble     0f
                nop
        0:      nop

Register names are

    %g0 .. %g7: global registers, shared by all windows
                (%g0 always reads as zero)
    %o0 .. %o7: output registers for current window
    %i0 .. %i7: input registers for current window
    %l0 .. %l7: local registers for current window

Synonyms for the above are

    %r0 ..
    %r31: same as above registers (in above order)
    %sp: same as %o6
    %fp: same as %i6

Constant expressions allow the binary operators

    + - * / % ^ << >> & |

with precedence and meanings as in C, and the prefix unary operators

    + - ~ %hi %lo

where %hi extracts the most significant 22 bits of its operand and
%lo extracts the least significant 10 bits.

Pseudo-Ops

  .ascii s1, s2, ...
        Generates the strings of ASCII characters denoted by the
        string constant arguments.

  .asciz s1, s2, ...
        As .ascii, but appends a null character to each string.

  .seg s
        Sets the "current segment" to the string constant s (begins
        appending generated bits to that segment).  Possible segments
        are "text", "data", "data1", and "bss".  The first is where
        programs live, the last is initialized-to-zero storage, and
        the other two are initialized data.

  .skip n
        Generates n bytes of "empty space" in the current segment.

  .align b
        Skips to a b-byte boundary in the current segment.

  .byte e1, e2, ...
  .half e1, e2, ...
  .word e1, e2, ...
  .double e1, e2, ...
        Generates appropriately-sized values of the expression
        arguments.  Ignores alignment, which must be done separately.

  .global s1, s2, ...
        Declares the symbol arguments as "global".  This declaration
        must occur before any definition of the global symbol.

  s = e
        Makes the symbol s denote the constant expression e.

Instructions

Instructions ending in cc set the flags, whereas instructions ending
in CC test the flags.  The possible tests are

    n:    never
    ne:   not equal to
    nz:   not zero
    e:    equal to
    z:    zero
    g:    greater than
    le:   less than or equal to
    ge:   greater than or equal to
    l:    less than
    gu:   greater than, unsigned
    lu:   less than, unsigned
    leu:  less than or equal to, unsigned
    geu:  greater than or equal to, unsigned
    cc:   carry clear
    cs:   carry set
    pos:  positive
    neg:  negative
    vc:   overflow clear
    vs:   overflow set
    a:    always

If the test is absent, "always" is assumed.

Instructions where one possible ending is ",a" have a "delay slot"
associated with their execution.
The "a" stands for "annul".  If an instruction i1 with a delay slot
is not annulled, then the following instruction i2 will always be
executed "concurrently" with the execution of i1.  Only certain
instructions (principally instructions which do not themselves have
a delay slot) can be executed in a delay slot.  If i1 is annulled and
the branch is not taken, then i2 is discarded (not executed).  As an
exception, an annulled "always" branch i1 will *not* execute i2 even
though the branch is taken.

  add s1, s2, d
  addcc s1, s2, d
  addx s1, s2, d
  addxcc s1, s2, d
        Adds the register s1 to the register or 13-bit constant s2,
        storing the result in the register d.  The "x" versions add
        in the carry flag.

  and s1, s2, d
  andcc s1, s2, d
        Ands the register s1 with the register or 13-bit constant s2,
        storing the result in the register d.

  andn s1, s2, d
  andncc s1, s2, d
        Ands the register s1 with the bitwise negation of the
        register or 13-bit constant s2, storing the result in the
        register d.

  bCC a
  bCC,a a
        Branches to the 22-bit word displacement a if the condition
        code CC is true, else falls through.

  call a
        Stores the address of the call instruction itself into %o7
        and branches to the 30-bit word displacement a.  Has a delay
        slot, but cannot be annulled.

  jmpl ra, rl
        Stores the address of the jmpl instruction itself (the "link"
        address) into the register rl, and jumps to address ra.

  ldSZ [a], r
        Loads into register r the value at address a, which is either
        a 13-bit displacement plus a register or the sum of two
        registers.  The value is modified according to SZ if present,
        where SZ is one of

            sb: signed byte
            sh: signed halfword
            ub: unsigned byte
            uh: unsigned halfword

        Thus ldub would load the byte at the specified address,
        zero-extend it to fill the remainder of the word, and store
        the result in the destination register.

  nop
        No op.

  or s1, s2, d
  orcc s1, s2, d
  orn s1, s2, d
  orncc s1, s2, d
        Exactly analogous to the corresponding "and" instructions.
  restore s1, s2, d
        Pop a register window from the stack, and otherwise behave
        like an add instruction, except that the sources are read
        from the old register window, and the destination written
        into the new register window.

  save s1, s2, d
        Push a register window onto the stack, and otherwise behave
        like an add instruction, except that the sources are read
        from the old register window, and the destination written
        into the new register window.

  sdiv s1, s2, d
  sdivcc s1, s2, d
        Exactly analogous to the corresponding add instructions.
        Signed.

  sethi v, r
        Zero the last 10 bits of the register r and replace the high
        order 22 bits of r with bits from the constant v.

  sll s1, s2, d
        Shift s1 left logical s2 bits, with result in d.  Normal
        arithmetic operands.

  smul s1, s2, d
  smulcc s1, s2, d
        Exactly analogous to the corresponding add instructions.
        Signed.

  sra s1, s2, d
        Shift s1 right arithmetic s2 bits, with result in d.  Normal
        arithmetic operands.

  srl s1, s2, d
        Shift s1 right logical s2 bits, with result in d.  Normal
        arithmetic operands.

  stSZ r, [a]
        Stores the value in register r at address a, which is either
        a 13 bit displacement plus a register or the sum of two
        registers.  The value is stored according to SZ if present,
        where SZ is one of

            b: byte
            h: halfword

  sub s1, s2, d
  subcc s1, s2, d
  subx s1, s2, d
  subxcc s1, s2, d
        Exactly analogous to the corresponding add instructions.

  swap [a], r
        Atomically swap the word at address a, which is either a 13
        bit displacement plus a register or the sum of two registers,
        with the contents of register r.

  tCC v
        Software interrupt.  Trap to vector v, which is given by
        either a 13 bit displacement plus a register or the sum of
        two registers.

  taddcc s1, s2, d
  tsubcc s1, s2, d
        Tagged add or subtract.  Like addcc or subcc, except that
        overflow is set if either of the bottom two bits of either
        source operand are nonzero.  These bottom two bits can be
        used as tags.

  udiv s1, s2, d
  udivcc s1, s2, d
        Exactly analogous to the corresponding add instructions.
        Unsigned.

  umul s1, s2, d
  umulcc s1, s2, d
        Exactly analogous to the corresponding add instructions.
        Unsigned.

  xor s1, s2, d
  xorcc s1, s2, d
  xnor s1, s2, d
  xnorcc s1, s2, d
        Exactly analogous to the corresponding "and" instructions.

Synthetic Instructions

Because of the RISC nature of the chip, certain instructions
available on "normal" CISC processors are not directly available on
the SPARC.  For the convenience of assembly-language programmers (and
compilers), certain pseudo-instructions are available in the
assembler which expand into sequences of real instructions.  Operand
restrictions are implied by the expansion.

  cmp s1, s2      -->  subcc s1, s2, %g0
  jmp a           -->  jmpl a, %g0
  call a          -->  jmpl a, %o7
        Where a is either a register or a small constant.
  tst r           -->  orcc r, %g0, %g0
  ret             -->  jmpl %i7+8, %g0
  retl            -->  jmpl %o7+8, %g0
        Return from leaf subroutine.  See below for leaf subroutines.
  restore         -->  restore %g0, %g0, %g0
  save            -->  save %g0, %g0, %g0
        Don't use the latter of these (the bare save).  See below for
        calling conventions.
  set v, r        -->  or %g0, v, r
        When -4096 <= v <= 4095 (v fits in a signed 13-bit constant).
  set v, r        -->  sethi %hi(v), r
        When ((v & 0x3ff) == 0), i.e. the low 10 bits are zero.
  set v, r        -->  sethi %hi(v), r ; or r, %lo(v), r
        All cases except above.
  not s, d        -->  xnor s, %g0, d
        Bitwise negation.
  not d           -->  xnor d, %g0, d
        Bitwise negation.
  neg s, d        -->  sub %g0, s, d
        Arithmetic negation.
  neg d           -->  sub %g0, d, d
        Arithmetic negation.
  inc r           -->  add r, 1, r
  inc v, r        -->  add r, v, r
  inccc r         -->  addcc r, 1, r
  inccc v, r      -->  addcc r, v, r
  dec r           -->  sub r, 1, r
  dec v, r        -->  sub r, v, r
  deccc r         -->  subcc r, 1, r
  deccc v, r      -->  subcc r, v, r
  btst v, r       -->  andcc r, v, %g0
  bset v, r       -->  or r, v, r
  bclr v, r       -->  andn r, v, r
  btog v, r       -->  xor r, v, r
  clr r           -->  or %g0, %g0, r
  clrb [a]        -->  stb %g0, [a]
  clrh [a]        -->  sth %g0, [a]
  clr [a]         -->  st %g0, [a]
  mov v, r        -->  or %g0, v, r

Register Windows

The register windows are intended to prevent saving and restoring
registers during most procedure invocations.
When a save instruction is executed, the old values of %l0..%l7 and
%i0..%i7 are stacked in an efficient fashion, and the current values
of %o0..%o7 are moved to %i0..%i7.  A restore instruction moves
%i0..%i7 back to %o0..%o7, and restores the old %l0..%l7 and
%i0..%i7.

HLL(C) Calling Conventions

By convention, on entry to a procedure p, %o7 contains the address of
the call instruction which invoked p, and %o6 is the stack pointer.
p must allocate a minimum of 96 bytes of stack space, in order to
make room to save the register window on interrupts, and for the
calling conventions.  p is expected to execute a save instruction,
which will copy the updated %o6 while simultaneously pushing the
register window, thusly:

        _p:     save    %sp,-96,%sp

The new %o6 (after save) is 96 bytes less than the old %o6 (before
save).  The stack grows downward, so 96 bytes of storage have been
allocated.  Thus, after the save instruction, the old %sp is now in
%i6 and is referred to as the frame pointer %fp.  The call address is
now in %i7.

To return from the call, we typically do a simple restore in the
delay slot of a jmpl which jumps to the call address + 2 words (1 for
the call instruction itself, and 1 for the delay slot of the call),
thus

                jmpl    %i7+8,%g0
                restore

The restore makes the old %fp be the %sp again, and the jump puts us
at the instruction following the delay slot of the call.

The first 6 arguments are passed in %o0..%o5, with remaining
arguments passed on the stack.  Once the save instruction in the
procedure prolog has been executed, these become %i0..%i5.
Structures are passed by allocating them in the caller and passing
their address.  (Floating point arguments are passed in the floating
point registers %f0..%f31).

The return value is placed in %i0 by the callee before executing the
restore, and is thus available to the caller in %o0.  Conventional
structure return is complicated and is discussed separately below.
(Floating point return is in %f0).

The locals %l0..%l7 are available to the procedure for its own use.
By convention, of the globals, %g1 is scratch, %g2..%g4 are used for
global register variables, and %g5..%g7 are reserved.  Thus,
%i0..%i5 and %l0..%l7 are callee-saves (the register window preserves
them across a call), %o0..%o5 and the scratch register %g1 are
caller-saves, and the remaining registers are special.

Example: Function which adds its two integer arguments to produce an
integer result, and its caller.

        _add_2: save    %sp,-96,%sp     ! set up stack frame
                add     %i0,%i1,%l0     ! calculate value
                mov     %l0,%i0         ! set up return value
                jmp     %i7+8           ! return
                restore                 ! restore during return
                ...
                mov     1,%o0           ! set up first parameter
                mov     2,%o1           ! set up second parameter
                call    _add_2          ! call function
                nop                     ! fill delay slot of call
                mov     %o0,%l0         ! do something with return value
                ...

Structure Return

Functions returning a structure have a complicated calling
convention.  The caller allocates storage as for structure arguments,
and places the address of the allocated store at %sp+64.  The callee
can then reference this address as %fp+64.  In addition, the caller
must place an unimp instruction following the delay slot of the call
instruction, whose argument is the least-significant 12 bits of the
size of the allocated structure.  In this way, the callee can examine
the word at %i7+8 (%o7+8 before its save) to find out whether an
appropriate-sized structure has in fact been allocated (although this
sanity check is unnecessary, and contrary to the spirit of the C
language).  Thus, a function returning a structure must skip 3 words
past the calling instruction on returning (to %i7+12), instead of 2.

Note that GCC by default uses a *completely different* structure
return convention.  Thus, if GCC-compiled code is to be used in
conjunction with code compiled by other compilers, where any
functions involved return a structure, the flag -fpcc-struct-return
must be given to GCC.

A Complete Example

Here's the C code for a sample program.

        #include <stdio.h>      /* printf() */
        #include <stdlib.h>     /* atoi(), abort() */

        int fib( cnt, v1, v2 )
        register int cnt, v1, v2;
        {
            if( cnt <= 0 )
                return v2;
            return fib( cnt - 1, v1 + v2, v1 );
        }

        int main( argc, argv )
        int argc;
        char **argv;
        {
            int tmp;

            if( argc != 2 )
                abort();
            tmp = atoi( argv[1] );
            if( tmp <= 0 )
                abort();
            printf( "%d\n", fib( tmp - 1, 1, 1 ) );
            return 0;
        }

And here's a handwritten (before the C code, actually) assembly
version.  Note that this fib decrements the count itself before
testing it, so main passes the count unchanged.

        ! fib -- directly compute and print the nth fibonacci number
        ! Bart 1/93

                .seg    "text"          ! just in case
                .global _main           ! must be globally visible

        fib:    save    %sp,-96,%sp     ! set up
                subcc   %i0,1,%o0       ! decrement count
                bg      0f              ! if not finished do calc
                nop                     ! fill delay slot
                mov     %i2,%i0         ! return v2 if done
                b       1f              ! exit
                nop                     ! fill delay slot
        0:      mov     %i1,%o2         ! new v2 is old v1
                add     %i1,%i2,%o1     ! new v1 is v1 + v2
                call    fib             ! recursive call
                nop                     ! fill delay slot
                mov     %o0,%i0         ! return result
        1:      jmp     %i7+8           ! return
                restore                 ! while restoring

        _main:  save    %sp,-96,%sp     ! set up
                mov     2,%l0           ! check for two arguments
                cmp     %i0,%l0         ! 1st parameter is argc
                be      0f              ! argc is 2: skip the abort
                nop                     ! fill delay slot
                call    _abort          ! on error, just abort
                nop                     ! fill delay slot
        0:      ld      [%i1+4],%o0     ! get argv[1]
                call    _atoi           ! call atoi( argv[1] )
                nop                     ! fill delay slot
                cmp     %o0,%g0         ! did atoi() return positive?
                bg      0f              ! positive: skip the abort
                nop                     ! fill delay slot
                call    _abort          ! on error, just abort
                nop                     ! fill delay slot
        0:      mov     1,%l0           ! %o0 already contains count,
                mov     %l0,%o1         ! %o1 is 1/2 state
                mov     %l0,%o2         ! %o2 is other 1/2
                call    fib             ! start the recursion
                nop                     ! fill delay slot
                mov     %o0,%o1         ! want result 2nd arg
                set     format,%o0      ! and format string 1st arg
                call    _printf         ! call it
                nop                     ! fill delay slot
                mov     %g0,%i0         ! resulting exit status 0
                jmp     %i7+8           ! return
                restore                 ! while restoring

        ! string constants may remain in text segment
        format: .asciz  "%d\n"

Here's an optimized version of the fib function from above.  Note
that we are still calling-convention compatible, but that we've
really sped up fib.
In particular, since fib is a "leaf" procedure, and since it needs
but one scratch register (for which we use %o3), we get rid of the
save and restore entirely, and tail-call the recursion.  We also
filled all the delay slots with useful work.  Note the use of the
annulled branch.

        fib:    subcc   %o0,1,%o0       ! decrement count
                bg,a    0f              ! if not finished do calc
                mov     %o1,%o3         ! save old v1 (annulled delay slot)
                jmp     %o7+8           ! return
                mov     %o2,%o0         ! return v2 (delay slot)
        0:      add     %o1,%o2,%o1     ! new v1 is v1 + v2
                b       fib             ! tail call
                mov     %o3,%o2         ! new v2 is old v1 (delay slot)

The C compiler actually does even slightly better than this
hand-optimized code (check it out).
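
For readers who find the control flow of the optimized routine hard
to follow, here is a rough C rendering of the two assembly versions.
It is only a sketch: the names fib_rec and fib_loop and the variable
tmp (standing in for %o3) are made up for illustration and do not
appear in the assembly, and the loop is one plausible reading of what
the tail call amounts to, not compiler output.

```c
/* Mirrors the first assembly fib: decrement the count, then either
 * return v2 or recurse with (cnt, v1 + v2, v1). */
int fib_rec(int cnt, int v1, int v2)
{
    cnt -= 1;                   /* subcc %i0,1,%o0 */
    if (cnt <= 0)               /* bg 0f not taken */
        return v2;
    return fib_rec(cnt, v1 + v2, v1);
}

/* Mirrors the optimized fib: "b fib" jumps back to the top with the
 * argument registers updated in place, which is a loop. */
int fib_loop(int cnt, int v1, int v2)
{
    int tmp;                    /* hypothetical name for %o3 */

    for (;;) {
        cnt -= 1;               /* subcc %o0,1,%o0 */
        if (cnt <= 0)           /* bg,a 0f not taken: slot annulled */
            return v2;          /* jmp %o7+8 ; mov %o2,%o0 */
        tmp = v1;               /* mov %o1,%o3 (delay slot, taken) */
        v1 = v1 + v2;           /* add %o1,%o2,%o1 */
        v2 = tmp;               /* mov %o3,%o2 (delay slot of b) */
    }
}
```

Called as fib_loop( n, 1, 1 ), this returns the same value as
fib_rec( n, 1, 1 ) for every count, since each pass of the loop does
exactly what one level of the recursion would have done.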