Having grown up with Mathematica as my first programming language, it is only
natural that I should want to tinker with the system now. Unfortunately, it is
a proprietary system kept under the lock and key of WRI. The obvious solution
to this problem is to program it again from the ground up. While I would have
used assembly language in the past for this endeavour, we can get things done
a lot faster with C++, though not without some serious systemic performance
problems in some corner cases.
Naming things is hard. It is clearly some kind of Computer Algebra System.
And, The Linux Edition was the first to be written, so it got called
casl and then castle.
features of castle
some internal design decisions
exceptional control flow
C++ rant
asmcastle
pattern matching
compiler
screenshots
Features
Dependencies and external sources of code:
- 
FLINT and Arb, and therefore GMP and MPFR
- 
Some symbol tables are stored with a HAT-trie
- 
All font data was parsed from noto.
- 
Apparently the relevant patents
on subpixel rendering have expired, so now we are allowed to render images at
thrice the x-resolution and apply a simple filter to render on RGB display devices.
This first point is important because these C libraries are not exception safe,
and I would like the ability to safely abort a long running calculation.
This last point is noticable on low resolution displays: If the pixels of these
images line up with the pixels of your display device and you have an RGB format,
then one of these images should appear crisper than the other. If not, then
the one that is supposed to be crisper will just look like the TikTok logo.
 
 
And a pure text comparison:
 
 
Other features completely implemented from scratch:
- 
A custom font renderer capable of rendering a full screen (with uncached glyphs)
of text at about 20 frames per second. I chose a custom renderer because it is
simpler than figuring out the right OS-specific interface for measuring the size
of glyphs and drawing rotated characters, and then being disappointed with the
quality of the results. It is also not trivial to figure out how to get an OS to
draw square root symbols or curly brackets enclosing objects of arbitrary size
by stretching certain part of the glyphs.
As some point is just easier to shade the darn pixels yourself.
- 
A compiler for Mathematica patterns, which resemble regular expressions.
- 
The main evaluation loop and all of the supporting functions which constitute
the interpreter for running Mathematica.
- 
A serialization format, which is used to pass expressions between the kernel and the GUI.
- 
The appropriate wrappers of FLINT and Arb, so that polynomial factorization
(Factor[x^2-y^2]) or calculation of pi (N[Pi,100]) is
available at your finger tips as one-liners.
Internal use: pass a copy? or not?
Castle uses two raw pointer types ex and er for expressions.
These are of course wrapped in higher level classes with automatic memory
management and destructors, but the raw pointer types are used to pass and
receive expressions through the function call boundary. Why deal with these raw
pointers? Well, the C++ ABI for passing around, for example,
classes such as unique_ptr is well-known to be terrible.
Dealing with raw pointers immediately brings up the question: does the caller
own it (er)? or is the callee expected to take ownership (ex)?
Leaving only one pointer type for both is common in programming, but forces
a huge burden on the programmer as this kind of code is not self-documenting.
These two separate pointer types come with a huge caveat: the pointer aliasing
rules of C. It is possible for a er and an ex to point to the
same memory location, but the compiler is allowed to assume this is not the case.
This can and does cause real bugs. Therefore castle comes with a configuration
switch which either makes er and ex distinct types or the same type.
In case of the former we can get checking at compile time of the eternal question
"Was I supposed to pass a copy?", but the resulting binary is either invalid or
cannot be produced with optimizations enabled. In the latter case, we
can enable optimizations. Hence, the er / ex dichotomy is a
debugging tool.
A typical function that uses expressions and passes them along looks like
    void eparser::handle_ex(ex E)
    {
        uex e(E);
        handle_char(CHAR_NONE);
        if (error) {return;}
        handle_token_ex(e.release());
        e.kill_dead();
    }
The uex class is essentially a unique_ptr and is supposed to
own a raw pointer that can either be null or a real expression.
    struct uex {
        ex data;
        uex(): data(nullptr) {}
        uex(ex d): data(d) {}
        uex (const uex & other): data(ecopy(other.data)) {}
        ~uex() {if (data != nullptr) eclear(data);}
        ex release() {
            ex datacopy = data;
            data = nullptr;
            return datacopy;
        }
        ...
    }
Then, what is going on with kill_dead? Well, besides not having
destructive moves in C++, the C++ compiler is not able to figure out that
 e.data == null at the end of this function, and hence emits the full
code for ~uex()! That is, without the e.kill_dead(); and with
g++ -O3, we get the code
00000000000fcd40 <_ZN7eparser9handle_exEP9ex_struct>:
   fcd40:	endbr64 
   fcd44:	push   rbp
   fcd45:	mov    rbp,rdi
   fcd48:	sub    rsp,0x10
   fcd4c:	mov    QWORD PTR [rsp+0x8],rsi  ; constructor for uex
   fcd51:	mov    esi,0xffffffff
   fcd56:	call   10b970 <_ZN7eparser11handle_charEi>
   fcd5b:	mov    eax,DWORD PTR [rbp+0x1c8]
   fcd61:	mov    rsi,QWORD PTR [rsp+0x8]
   fcd66:	test   eax,eax
   fcd68:	jne    fcda0 <_ZN7eparser9handle_exEP9ex_struct+0x60>
   fcd6a:	mov    rdi,rbp
   fcd6d:	mov    QWORD PTR [rsp+0x8],0x0  ; put 0x0 in there for the release
   fcd76:	call   fbcb0 <_ZN7eparser15handle_token_exEP9ex_struct>
   fcd7b:	mov    rdi,QWORD PTR [rsp+0x8]  ; you dummy!
   fcd80:	test   rdi,rdi                  ; you just put 0x0 in there!
   fcd83:	je     fcd96 <_ZN7eparser9handle_exEP9ex_struct+0x56>
   fcd85:	mov    rax,QWORD PTR [rdi]
   fcd88:	sub    rax,0x10
   fcd8c:	mov    QWORD PTR [rdi],rax
   fcd8f:	jae    fcd96 <_ZN7eparser9handle_exEP9ex_struct+0x56>
   fcd91:	call   125480 <_Z12ex_cleardoneP9ex_struct>
   fcd96:	add    rsp,0x10
   fcd9a:	pop    rbp
   fcd9b:	ret 
    ...
Therefore, the code for kill_dead is
    struct uex {
        ...
        void kill_dead() {
            assert(data == nullptr);
            data = nullptr;
        }
        ...
    }
and can be used at any time to stop gcc from emitting pointless destructor code.
Non-local control flow
Another issue that immediately arises when implementing an interpreted language
is how commands such as goto, continue, and break are
going to work. The basic idea is illustrated in this loop.
    n = 0; Label["start"]; n = n + 1; If[n < 1000000, Goto["start"]]; Print["done"]
This code start out by setting n to zero, increments it until it gets
to a million, then prints "done". This is all done using wonderful
Gotos in place of other looping constructs.
The tree representation of this code with which the interpreter works looks like
    CompoundExpresssion[
      Set[n, 0]
    ,
      Label["start"]
    ,
      Set[
         n
      ,
         Plus[n, 1]
      ]
    ,
      If[
         Less[n, 1000000]
      ,
         Goto["start"]
      ]
    ,
      Print["done"]
    ]
Each node in the tree has builtin code for evaluating it, and the evaluation is applied
recursively using the known evaluation rules for Mathematica. For example, when
the Goto["start"] is evaluated, this evaluation is called from within the
evaluation of the enclosing If, which is itself called from within the
CompoundExpression.
Now, the interesting thing is how the Goto["start"] gets back to
Label["start"], that is, how the interpreter transfers control to an
arbitrary other location. To give an indication of how integrated this control
transfer must be with the rest of the system, here is one test from castle's
test code (which, of course, is run with On[Assert]).
    t = Reap[
       i = Pi;
       Block[{i = 1},
           Label["loop"];
           Sow[i];
           MatchQ[i, _ ? (If[# > 5, Sow[i*i]; Goto["out"]]&)];
           i = i + 1;
           Goto["loop"];
       ];
       Assert[False];
       Label["out"];
       i
    ];
    Assert[t === {Pi, {{1, 2, 3, 4, 5, 6, 36}}}];
The Goto["out"] can be called from within the pattern matcher and also
can transfer control to a point outside the Block. This is all perfectly legal
Mathematica code and expected behaviour. Hence, the Goto["out"] must
exit the pattern matcher gracefully and exit the Block gracefully,
which involves restoring the value of i from 6 to Pi.
It turns out that we cannot go to truly arbitrary locations in the code. A slight
modification of the example above
    n = 0; If[True, Label["start"]]; n = n + 1; If[n < 1000000, Goto["start"]]; Print["done"]
will result in a failure to find the right label. It turns out that we can only
go to labels that are inside compound expressions that enclose, at some level, the
corresponding Goto. The label is now inside an If, which
renders it invisible to the outer CompoundExpression. Well, this behaviour
of Goto screams for an implementation using exceptions.
The implementation of Goto simply throws an object which we hope
to catch later:
    ex dcode_sGoto(er e)
    {
       if (elength(e) != 1)
           return _handle_message_argx1(e);
       throw exception_sym_sGoto(ecopy(e));
       return nullptr;
    }
Now, there are two catchers for exception_sym_sGoto. One is inside the
code for CompoudExpression:
    ex dcode_sCompoundExpression(er e)
    {
       size_t n = elength(e);
       size_t i = 0;
       if (unlikely(n == 0))
           return gs.sym_sNull.copy();
       while (true)
       {
           i++;
           try {
               ex r = eval(ecopychild(e,i));
               if (i < n)
                   eclear(r);
               else
                   return r;
           }
           catch (const exception_sym_sGoto & X)
           {
               uex f(reinterpret_cast(X.data));
               for (i = 1; i <= n; i++)
               {
                   er ei = echild(e,i);
                   if (ehas_head_sym_length(ei, gs.sym_sLabel.get(), 1) &&
                       ex_same(echild(ei,1), f.child(1)))
                   {
                       // have found the correct label, proceed with try block above
                       break;
                   }
               }
               if (i > n)
               {
                   // no label found, search higher up the call stack
                   throw exception_sym_sGoto(f.release());
                   return nullptr;
               }
           }
       }
    }
The other one is at the top level so that an unfound label will gracefully
print a message instead of terminating the program as C++ is supposed to do:
    ex topeval(ex E)
    {
        try
        {
            return eval(E);
        }
        catch (const exception_sym_sGoto & X)
        {
            uex f(reinterpret_cast(X.data));
            _gen_message(gs.sym_sGoto.get(), "nolabel", "Label `1` not found.", f.copychild(1));
            return emake_node(gs.sym_sHold.copy(), f.release());
        }
        ...
    }
C++ exceptions are slow?
We have seen above the following limitations of C++ – that is, the language
along with a state-of-the-art compiler in the year 2022:
- 
no move destruction
- 
provably trivial destructor code is not optimized as such
- 
classes such as uex are always placed on the stack, even though they
could live in a register as they are just 64 bits.
There is one item that needs to be placed on this list:
- 
C++ exceptions are very slow, typically tens of thousands
of machine instructions just to pop one simple stack frame.
The slow down can be seen in action:
    In[1] := Timing[n = 0; Label["start"]; n = n + 1; If[n < 1000000, Goto["start"]]; Print["done"]]
    "done"
    Out[1] = {3.159714, Null}
    In[2] := Timing[Do[n, {n, 1, 1000000}]]
    Out[2] = {0.016782, Null}
The usual answer to this last point is that exceptions should not be used for
control flow, but, in the context of this interpreter, this is precisely what we
are doing! Furthermore, considering all of the complicated ways in which Goto is
expected to perform, it doesn't seem reasonable to implement it in any other
paradigm. Now, the actual implementation of the exceptions can take several forms:
- 
Wind return codes and error handling code throughout every function that can throw, or
- 
Rely on C++ and its zero-cost implementation when exceptions are not generated.
The first option is simply a non-starter with anything at scale (see the
 interface
of LibTomMath for proof).
With the second option we get slow gotos but code that is otherwise fast
under normal circumstances, and we don't have to deal with return codes at every
function call.asmcastle: Do zero-cost exceptions have to be so slow?
The fundamental operation that exceptions provide is a non-standard way to exit
a function call. At the machine level, your function might look like:
    f:
        construct a local object a
        construct another local object b
        call g
        construct another local object c
        call h
        destruct c
        destruct b
        destruct a
        ret
Now, if g throws (that is, exits in a non-standard way), the exit is supposed to
destruct b, then destruct a, then exit f (that is,
continue the throw). Similarly, if h throws, c is supposed to
be destructed as well. All we need at the machine level is something like:
    f:
        construct a local object a
        construct another local object b
        call g
        if non-standard exit, goto .unwind1
        construct another local object c
        call h
        if non-standard exit, goto .unwind2
        destruct c
        destruct b
        destruct a
        ret
    .unwind2:
        destruct c
    .unwind1:
        destruct b
        destruct a
        continue non-standard exit
We can do literally anything we like in the unwind code, which includes getting
around the uncomfortable fact that C++ destructors are nullary functions.
Anyways, the problem to solve here is how to associate the address of f.unwind1
with the first call to g inside f and similarly for f.unwind2
and h. This is usually done in C++ by keeping some kind of global lookup
table in some cold piece of memory. This is certainly possible to do, but the
following approach is purely local and a bit more fun: Since the call to g
must keep track of the return address somewhere in the machine, we can simply
insert a nop after the call to g which encodes f.unwind1. This
requires a machine that supports nops with at least a 10-15 bit payload. This is
fortunately the case with x86. While inserting a 3, 4, or 7 byte nop after every
function call that could throw is not technically zero-cost, it is in some sense
zero-cost because a nop literally does nothing.
First, the part of the stack unwinding responsible for looking up the handler:
    ; zero-cost exceptions, three approaches:
    ;   (1) Have the call site => unwind handler association stored in a global
    ;       table. Maintainance of this global table is tricky, and lookup is slow.
    ;   (2) Include an extra instruction "nop [rax + offset]" at call site in which
    ;       offset = addr(unwind handler) - addr(call site). The unwind_resume
    ;       simply expects a "nop" at the call site and decodes the offset from the
    ;       immediate in the instruction. While this gives a O(1) lookup time for
    ;       call site => unwind handler, it is not technically zero-cost.
    ;         4 byte nop giving 8*4*2^8 = 8*1024 targets:
    ;           zzzzzzzz_01xxx0yy_00011111_00001111
    ;         7 byte nop giving 8*4*2^32 targets:
    ;        .zzzzzzzzzz_10xxx0yy_00011111_00001111
    ;           0yy = 100 is not allowed in either
    ;   (3) 
    ;  in: rax address of exception object
    ;      [rsp] is call site
    unwind_resume:
                pop  rcx                ; rcx = address of call site
               push  rbx rsi rdi
                mov  edx, dword[rcx]    ; disassemble the hopeful nop
                and  edx, 0x00c41f0f
              movzx  esi, byte[rcx + 2]
                and  esi, 0x3b
                cmp  edx, 0x00401f0f
                jne  .not_small_offset
              movsx  rbx, byte[rcx + 3]
    .cont:
                lea  rbx, [rbx + 96]
                shr  esi, 1
                adc  rcx, rsi
                shl  rbx, 5
                shr  esi, 1
                adc  rcx, rbx
                pop  rdi rsi rbx
                jmp  rcx
    .not_small_offset:
                cmp  edx, 0x00801f0f
                jne  .unhandled_exc
             movsxd  rbx, dword[rcx + 3]
                jmp  .cont
    .unhandled_exc:
            ; oops, exit program with unhandled exception
Next, the macro that you are supposed to insert after every call that could throw.
There is a 3072 byte shift of the offset because usually the unwind handlers are
placed in a later code block, increasing the chances that the nop can fit in
4 bytes instead of 7.
    macro Unwind a {
        local n, nlo, nmid, nhigh
        n = a - $ - 3072
        nlo = n and 3
        nmid = n and 28
        if n < 0
            nhigh = -1-((-1-n) shr 5)
        else
            nhigh = n shr 5
        end if
        if (-128 <= nhigh) & (nhigh < 128)
            db 0x0f, 0x1f, 64+2*nmid+nlo, nhigh
        else
            db 0x0f, 0x1f, 128+2*nmid+nlo
            dd nhigh
        end if
    }
Finally, the code that you actually have to write to use this. Code that
doesn't catch exceptions looks much better, as it just has to pass the exception
object onto unwind_resume.
        ; in: er rax
        ; out: ex rax
    dcode_sCompoundExpression:
               push  rbx rsi rdi
                mov  rbx, rax
                mov  rdi, qword[rax + ex_node.length]
                xor  esi, esi
    .entry:
                cmp  rsi, rdi
                jae  .return_sNull
                jmp  .eval
    .next:
            ExClear  rax
    .eval:
                mov  rax, qword[rbx + ex_node.child + 8*(rsi + 1)]
             ExCopy  rax
               call  ex_eval
             Unwind  .unwind.1
                inc  rsi
                cmp  rsi, rdi
                 jb  .next
    .done:
                pop  rdi rsi rbx
                ret
    .return_sNull:
           ExGetSym  rax, sNull
             ExCopy  rax
                jmp  .done
    .unwind.1:
        ; the call to ex_eval threw: find the type of the exception without
        ; disturbing the actual exception object in rax
                mov  rcx, qword[rax + exc_base.type]
                cmp  rcx, EXC_TYPE_sGoto
                 je  .sGoto
                pop  rdi rsi rbx
                jmp  unwind_resume
    .sGoto:
        ; the exception is a Goto
               push  rbp rax
                mov  rbp, qword[rax + exc_with_ex.data]
                mov  rbp, qword[rbp + ex_node.child + 8*1]
                xor  esi, esi
    .search_next:
                mov  rax, qword[rbx + ex_node.child + 8*(rsi + 1)]
             ExType  edx, rax
           ExGetSym  rcx, sLabel
                cmp  edx, EX_TYPE_NODE
                jne  .not_found
                mov  rdx, qword[rax + ex_node.child + 8*0]
                cmp  qword[rax + ex_node.length], 1
                jne  .not_found
                cmp  rdx, rcx
                jne  .not_found
                mov  rax, qword[rax + ex_node.child + 8*1]
                mov  rcx, rbp
               call  ex_same
             Unwind  .unwind.1.1
               test  eax, eax
                jnz  .found
    .not_found:
        ; rethrow as we didn't find a matching label
                inc  rsi
                cmp  rsi, rdi
                 jb  .search_next
                pop  rax rbp
                pop  rdi rsi rbx
                jmp  unwind_resume
    .found:
        ; free the exception object and continue in the loop
                mov  rbp, qword[rsp]
                mov  rax, qword[rbp + exc_with_ex.data]
            ExClear  rax
                mov  rax, rbp
               call  exc_free
                inc  rsi
                pop  rax rbp
                jmp  .entry
    .unwind.1.1:
        ; exception inside .unwind.1 from ex_same
                mov  rdi, rax
                mov  rbp, qword[rsp]
                mov  rax, qword[rbp + exc_with_ex.data]
            ExClear  rax
                mov  rax, rbp
               call  exc_free
                inc  rsi
                pop  rax rbp
                mov  rax, rdi
                pop  rdi rsi rbx
                jmp  unwind_resume
And the code for generating and throwing the exception from Goto and
related functions:
    dcode_sContinue:
                mov  ecx, EXC_TYPE_sContinue
                jmp  dcode_sReturn.do_it
    dcode_sBreak:
                mov  ecx, EXC_TYPE_sBreak
                jmp  dcode_sReturn.do_it
    dcode_sGoto:
                mov  ecx, EXC_TYPE_sGoto
                jmp  dcode_sReturn.do_it
    dcode_sReturn:
                mov  ecx, EXC_TYPE_sReturn
    .do_it:
               push  rbx rsi rbx
                mov  rbx, rax
                mov  esi, ecx
                mov  eax, sizeof.exc_with_ex
               call  exc_alloc
             ExCopy  rbx
                mov  qword[rax + exc_with_ex.type], rsi
                mov  qword[rax + exc_with_ex.data], rbx
                pop  rdi rsi rbx
                jmp  unwind_resume
Writing code like this is fun at first, but admittedly can get tiresome. As such
asmcastle is not very far along: The parser is working and exceptions
are used to signal a syntax error, but there is no evaluation nor is there any
arithmetic beyond integers.
Pattern matching
Mathematica patterns are essentially regular expressions applied to trees
instead of plain old strings. My approach is described here,
and this is evidently completely different from WRI's approach.
Expression Compiler
Castle has had a compiler targeting its own internal byte code. This means that the code is still
interpreted but that the interpretation is orders of magnitude faster than the main interpreter.
Directly targeting x86-64 can give more speed. It is certainly possible to use llvm here, but it is
much more fun to emit the machine code oneself. Since the compiler has to do type inference on
Mathematica code anyways, much of SSA-based infrastructure for which one would use llvm already
must be present.
 
 
 
 Here is a simple example of the internal byte code and two machine code targets. The "d3" and "d3diff"
are calling conventions used by the ray tracer to evaluate f(x,y,z) and its gradient. The later
is generated from the byte code rather than from the result of symbolic differentiation (D).
While this gives slightly larger code than expected in this simple example, it easily scales
to loops and other control flow constructs on which D does not operate.
Here is a simple example of the internal byte code and two machine code targets. The "d3" and "d3diff"
are calling conventions used by the ray tracer to evaluate f(x,y,z) and its gradient. The later
is generated from the byte code rather than from the result of symbolic differentiation (D).
While this gives slightly larger code than expected in this simple example, it easily scales
to loops and other control flow constructs on which D does not operate.
 
Screenshots
 
The very first version had poor man's bold and a rather primitive repl compared to the current
version. The current version also treats spaces as multiplication, which is a big no-no, but
for compatibility what are you going to do.
 
At some point it was pointed out to me that Mathematica orders a sum and a non-sum
as if the non-sum was a sum of one term. With this ordering, we seem to match
perfectly, though I decided that strings should be compared with std::strcmp.
 
The syntax highlighting in the GUI is fully recursive as well as the rotation
boxes.
 
The rotation boxes can be edited in real time, and of course,
if the input is upside down, then the output is too.
 
Since pattern matching is at the core of Mathematica, the pattern matcher is one
of the first things that must be right.
 
 
The font system is completely custom-made and independent from the OS. The only
OS-specific code is used for the piping to connect the kernel and GUI, and for
the actual display of the bitmap to the screen.
