How Argentum makes fast dynamic_cast and interface method dispatch with 4 CPU instructions

Original link: reddit

TL;DR The Argentum programming language dispatches an interface method in just 4 instructions and performs a dynamic_cast in just 8 CPU instructions - orders of magnitude faster than its rivals.
Unlike most modern languages, it doesn't use double-sized pointers for interfaces, doesn't require this-pointer adjustments, and adds no extra vptrs to objects.
Its code is smaller and faster.

Why Dispatch Methods at Runtime

Polymorphism, one of the three pillars of object-oriented programming, requires objects of different classes to respond differently to the same requests. For example, calling the to_string method on an Animal object and an Engine object would yield dramatically different results. In general, when having a reference to an object, we don't know the exact code that will respond to the to_string call for that object reference.

So the application code, together with the language runtime, must find the right entry point: the implementation of the method that corresponds to the object's actual class.

A Necessary Digression on Devirtualization. Naturally, modern compilers attempt to avoid this costly method lookup operation whenever possible. If the object's type is known at compile time, the compiler can eliminate the dispatch and directly call the required method of the specific class. This opens up possibilities for inlining the method body at the call site and further numerous optimizations. Unfortunately, there are situations — quite numerous indeed — when compile-time devirtualization is not feasible, and runtime dispatch must be used.

How Virtual Methods are Invoked

To invoke a virtual method, the compiler constructs tables of method pointers and applies various algorithms to compute indices in these tables. The case of single inheritance and tree-like class hierarchies is the fastest and most straightforward.

Let's consider a pseudo-C++ example:

struct Base {
    virtual void a_method() { puts("hello from Base::a_method"); }
};

struct Derived : Base {
    void a_method() override {
        puts("hello from Derived::a_method");
    }
    virtual void additional_method() { puts("hello from additional_method"); }
};

void some_function(Base& b) {
    b.a_method(); // {1}
}

In the line {1}, the compiler does magic: Depending on the actual type of the object that the variable b references, either the Base::a_method or the Derived::a_method method can be called. This is achieved through method pointer tables and a couple of processor instructions. For instance, on an x86-64 processor using the Windows ABI, the code may look like this (pardon my Intel syntax):

mov rcx, b
mov rax, [rcx + offset_to_vmt_ptr]  ; {2} offset_to_vmt_ptr is typically 0
call [rax + offset_to_a_method]     ; {3}

This code works because inside the object referenced by b, there exists an invisible field, usually called vmt_ptr. This is a pointer to a static data structure that contains pointers to the virtual methods of that class.

In line {2}, we retrieve the pointer to the virtual method table (VMT), and in line {3}, we load the address of the entry point of the method and call it.
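These two instructions can be modeled in plain C++ by building the table by hand. This is an illustrative sketch with made-up names, not the compiler's actual output:

```cpp
#include <cassert>
#include <cstring>

// One hidden field per object: a pointer to a static table of method
// entry points (the "vmt_ptr"). All names here are illustrative.
struct Object;
using AMethod = const char* (*)(Object*);

struct VMT { AMethod a_method; };      // one slot per virtual method

struct Object { const VMT* vmt; };     // the invisible first field

const char* base_a_method(Object*)    { return "hello from Base::a_method"; }
const char* derived_a_method(Object*) { return "hello from Derived::a_method"; }

const VMT Base_VMT    = { base_a_method };
const VMT Derived_VMT = { derived_a_method };

// the {2}+{3} pair: load vmt_ptr, then call through the slot
const char* call_a_method(Object* o) { return o->vmt->a_method(o); }
```

The caller never branches on the type; the object's own vmt pointer selects the implementation.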

To make everything work, we also need two tables (one for each class) with method pointers:

Base_VMT    dq Base_a_method
Derived_VMT dq Derived_a_method, Derived_additional_method

In the diagram: if the variable b references a Base, the `mov` and `call` instructions follow the red path, and if it references a Derived, they follow the blue one.

This method of invocation is straightforward, convenient to implement, consumes very little memory, provides free casting to base classes, and has negligible overhead during calls. It's used in Java for class inheritance, in C++ when there's no multiple inheritance, and generally wherever applicable.

Complex Inheritance

Unfortunately, in real-world applications, each class breaks down into several orthogonal roles (serializable, drawable, physical, collidable, evaluable). Sometimes roles form groups with other roles (SceneItem: drawable, collidable). All these roles, classes, and groups don't fit neatly into a single tree-like hierarchy of class inheritance. Not every graphical element is serializable, but some are. Not every element with collisions works with physics, but some do. That's why all modern programming languages have somehow allowed different forms of multiple inheritance in their class hierarchies.

In Java, Swift, C#, you can inherit implementation from one class and implement multiple interfaces. C++ permits multiple inheritance, though it introduces additional complexity when the same base class is inherited through different branches, leading to the introduction of virtual base classes. Rust implements multiple inheritance through trait implementation. Go formally avoids the term inheritance and replaces it with interface delegation and state composition. However, if it quacks like inheritance... In short, today we can say that all modern programming languages have moved away from the principle of single inheritance and tree-like class organization.

How complex inheritance affects method invocation

It varies across different programming languages:

Swift and Rust use references to a protocol/trait implementation, which are structures containing two raw pointers — one points to the object data, and the other to the virtual method table (witness table in Swift, vtable in Rust). By doubling the size of each reference, Rust and Swift enable interface method calls to be as fast as regular class virtual method calls.

Go also stores each interface reference as two pointers, but dynamically constructs method tables upon first use and stores the results in a global hashmap.
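The two-word interface reference used by Rust, Swift, and Go can be sketched in C++ (illustrative names; each language's real layout differs in details):

```cpp
#include <cassert>
#include <cstddef>

// A "fat" interface reference: data pointer + method-table pointer.
struct VTable;                       // per (type, interface) method table
struct FatRef {
    void*         data;              // the object itself
    const VTable* vtable;            // its methods for this interface
};

// Twice the footprint of a plain reference: passing one consumes two
// parameter registers instead of one.
static_assert(sizeof(FatRef) == 2 * sizeof(void*),
              "an interface reference costs two machine words");
```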

In Java and Kotlin, upon the first method call, a linear search is performed in the list of implemented interfaces, and the result is cached. If a small number (1-2) of different classes are encountered at a call site, the JIT compiler generates a specially optimized dispatcher code, but if a new class emerges, it reverts to linear search.

C++ utilizes a rather intricate approach: each class instance contains multiple pointers to virtual method tables. Every cast to a base or derived class, if they cannot be simplified into a tree, leads to a this-pointer movement in memory. This ensures that the object pointer cast to type T points to the virtual method table of type T in that object. This allows virtual methods to be called at the same speed for both single and multiple inheritance.

Each approach is a trade-off between memory usage, method invocation complexity, and pointer manipulation complexity.

The Java/Kotlin approach optimizes benchmarks by caching recently called methods and performing "runtime devirtualization" where possible. For highly polymorphic interface methods, the general dynamic dispatch essentially boils down to linear searches in lists of interface names. This makes sense if a class implements one or two interfaces, but can be costly overall.

Go, Rust, and Swift enable fast method calls, but the doubled pointer size can quickly deplete the register file when passing parameters and working with local/temporary references, resulting in register spills to memory. It also complicates reference casts between types (traits/protocols/interfaces): Swift inherits its cast machinery from Objective-C (with dictionary-based protocol identifier lookup), while Rust lacks such casts entirely, forcing the programmer to manually write as_trait_NNN methods. Swift includes a mechanism to suppress virtualization by instantiating template functions for each protocol implementation (the some vs any keywords), but it doesn't work for truly polymorphic containers. In Rust, this suppression is the default and is turned off with the keyword dyn.

C++ doesn't consume additional memory in each raw pointer, and its method invocation is fast for both single and multiple inheritance. However, the complexity doesn't disappear: this approach significantly complicates object layout and method code. It introduces thunk functions and complicates constructors and all type-casting operations. These operations are relatively infrequent in the C++ paradigm, so their cost isn't that important. But if this approach were transferred to a system with introspection or automatic memory management, where every operation needs access to the object's header for markers, counters, and flags, each such access would require a cast to the start of the most-derived object. This cast isn't free in C++, and a static cast is incompatible with virtual inheritance. Such an operation would be required for every reference copy and deletion, or for every object scan in the case of GC. This is why smart pointers in C++ store a separate pointer to the counters and markers, consuming memory much like Rust/Swift references. By the way, a safe dynamic_cast in C++ requires an RTTI data lookup, making its cost comparable to Swift/Java/Go.
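Both the this-adjustment and the "cast to the start of the object" can be observed in standard C++. On typical ABIs the second base subobject sits at a nonzero offset, and dynamic_cast<void*> consults RTTI to recover the most-derived object's address:

```cpp
#include <cassert>

struct A { virtual ~A() = default; long a = 1; };
struct B { virtual ~B() = default; long b = 2; };
struct C : A, B {};                  // B subobject is not at offset 0

// the implicit upcast to the second base adjusts the pointer
bool upcast_moves_pointer() {
    C c;
    B* as_b = &c;
    return static_cast<void*>(as_b) != static_cast<void*>(&c);
}

// dynamic_cast<void*> walks RTTI to find the start of the whole object
bool dynamic_cast_finds_start() {
    C c;
    B* as_b = &c;
    return dynamic_cast<void*>(as_b) == static_cast<void*>(&c);
}
```

The offset check is ABI-dependent rather than guaranteed by the standard, but it holds on the common Itanium and MSVC ABIs.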

In conclusion, there are multiple problems with multiple inheritance method dispatch, and existing solutions leave room for improvement.

Argentum Approach

Each Argentum class can inherit the implementation of another class and implement a number of additional interfaces, like in Java, Kotlin, Swift and some other languages.

Here is an example Argentum program with classes and interfaces (Java-resembling syntax):

// Declare some interface
interface Drawable {
   width() int;
   height() int;
   paint(c Canvas);
}

// Implement this interface in some class
class Rectangle {
   + Drawable {
        width() int { right - left }
        height() int { bottom - top }
        paint(c Canvas) {
             c.drawRect(left, top, right, bottom);
        }
   }
   + Serializable { ... }  // Implement more...
   + Focusable { ... }     // ...interfaces
   left = 0;           // Fields
   right = 0;
   top = 0;
   bottom = 0;
}

// Create an instance.
r = Rectangle;

// Call interface methods
w = Window.initCentered("Hello", r.width(), r.height());

At the call site r.paint(w), the compiler generates this code:

; rdx - pointer to `w`
; rcx - pointer to `r`
mov rax, Drawable_interface_id + Drawable_paint_method_index
call [rcx]

For each class, the first field is a pointer to the dispatch function. For our Rectangle, this function would look something like this:

   movzx r10, ax
   pext rax, rax, Rectangle_magic_number
   mov rax, Rectangle_interface_table[rax*8]
   jmp [rax + r10*8]

Some processors implement the pext instruction poorly or lack it entirely, so for now we replace it with bextr or with this pair:

and rax, Rectangle_magic_mask
shr rax, Rectangle_magic_offset

Besides the dispatch function, Rectangle will have a set of tables:

Rectangle_interface_table           dq Rectangle_Drawable_method_table
                                    dq empty
                                    dq Rectangle_Serializable_method_table
                                    dq Rectangle_Focusable_method_table
Rectangle_Drawable_method_table     dq Rectangle_Drawable_width   ; method entry points
                                    dq Rectangle_Drawable_height
                                    dq Rectangle_Drawable_paint
Rectangle_Serializable_method_table dq ...
Rectangle_Focusable_method_table    dq ...

How and Why It Works

During compilation, each interface is assigned a randomly chosen 48-bit identifier (stored in the highest 48 bits of a 64-bit machine word with zeros in the lower 16 bits for method indexing).

When invoking an interface method, the caller invokes the dispatcher of the target class, passing the interface identifier and a 16-bit method index within the interface as parameters.

The dispatcher must distinguish the interfaces implemented in the given class based on these identifiers. The total number of classes and interfaces in the application doesn't matter; there can be hundreds or even thousands of them. What matters is distinguishing the interfaces of this particular class, which might be just a few or, in the worst case, a few dozen. Thanks to strong typing, we have the guarantee that only valid interface identifiers will be passed (for dynamic_cast, which provides this guarantee at runtime, see below).
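A sketch of this ID scheme (the names are mine): a random 48-bit identifier occupies the top 48 bits of a 64-bit word, and the caller's `mov rax, id + index` is just an addition into the zeroed low 16 bits:

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// random 48-bit interface id in the top bits, low 16 bits zero
uint64_t fresh_interface_id(std::mt19937_64& rng) {
    return rng() << 16;               // shift keeps 48 random high bits
}

// the word the caller loads into rax before calling the dispatcher
uint64_t dispatch_word(uint64_t iface_id, uint16_t method_index) {
    return iface_id + method_index;   // '+' acts as '|': low bits are zero
}
```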

If there's only one interface, the dispatcher bypasses the interface selection and directly transfers control to the method using the method index.

    movzx rax, ax
    jmp MyClass1_InterfaceX_methods[rax*8]

If a class implements two interfaces, their identifiers are distinct and therefore differ in at least one of the 48 bit positions. The compiler's task is to find such a position and construct a dispatcher that checks this bit:

    movzx r10, ax              ; retrieve the method index in r10
    shr rax, BIT_POSITION  ; this can be done in one..
    and rax, 1             ; ...instruction like pext/bextr
    ; load the pointer to one of the two method tables
    mov rax, MyClass2_itable[rax*8]
    jmp [rax + r10*8]      ; jump to the method

In the case of a class implementing three interfaces, we need a two-bit selector. When selecting three random 48-bit numbers, on average around 17.6 of the adjacent two-bit positions yield unique selectors, so this approach works with very high probability. A larger number of interfaces requires a larger selector.

Example: Let's assume we have a class that implements five different interfaces. The identifiers of these interfaces have a unique sequence of three bits with an offset of 4.

Interface  ID (hex)       ID (bin, N least significant bits)
IA         f78745bed893   1110110110001 001 0011
IB         9b5aed351b36   0101000110110 011 0110
IC         08d460f812a6   1000000100101 010 0110
ID         6d0a3a225df6   0010010111011 111 0110
IE         54d4c7d9bd0f   1001101111010 000 1111

This class dispatcher looks like:

    movzx r10, ax
    shr rax, 4
    and rax, 7
    mov rax, class_N_itable[rax*8]
    jmp [rax + r10*8]
class_N_itable dq clN_IE, clN_IA, clN_IC, clN_IB, empty, empty, empty, clN_ID
clN_IE dq classN_interfaceIE_method1...
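We can check this worked example numerically: extracting bits 4..6 of each id reproduces the table order above (IE→0, IA→1, IC→2, IB→3, ID→7):

```cpp
#include <cassert>
#include <cstdint>

// the five interface ids from the example table
constexpr uint64_t IA = 0xf78745bed893, IB = 0x9b5aed351b36,
                   IC = 0x08d460f812a6, ID = 0x6d0a3a225df6,
                   IE = 0x54d4c7d9bd0f;

// the dispatcher's `shr rax, 4; and rax, 7` as an expression
constexpr unsigned selector(uint64_t id) { return (id >> 4) & 7; }
```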

Of course, it might be impossible to find a selector with unique values for each interface among randomly encountered interface identifiers. What are the success probabilities for the basic dispatcher algorithm?

(Table: number of interfaces in a class / selector width in bits / unused slots in the interface table / average number of unique selectors in 48-bit interface identifiers.)

Starting from seven interfaces in a single class, the probability of finding a continuous group of selector bits significantly decreases. We can address this by:

  • Using wider tables (+1 bit)
  • Allowing selectors to not be contiguous
  • Introducing new levels of tables.
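The compiler-side search for a contiguous selector can be sketched as a brute-force scan (my own sketch; widths up to 6 bits, so the set of seen selector values fits in one 64-bit word):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// find an offset such that (id >> offset) & mask is unique across all ids
std::optional<int> find_selector(const std::vector<uint64_t>& ids, int width) {
    const uint64_t mask = (1ull << width) - 1;
    for (int off = 0; off + width <= 48; ++off) {
        uint64_t seen = 0;                     // bitset over 2^width slots
        bool unique = true;
        for (uint64_t id : ids) {
            uint64_t sel = (id >> off) & mask;
            if (seen >> sel & 1) { unique = false; break; }
            seen |= 1ull << sel;
        }
        if (unique) return off;
    }
    return std::nullopt;   // fall back to wider tables, gapped selectors, ...
}
```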

Wide tables

Example of a class with eight interfaces:

Interface  ID (hex)       ID (bin, N least significant bits)
IA         36d9b3d6c5ad   011011000101101 0110 1
IB         6a26145ca3bf   110010100011101 1111 1
IC         c4552089b037   100110110000001 1011 1
ID         917286d627e4   011000100111111 0010 0
IE         889a043c83da   110010000011110 1101 0
IF         6b30d1399472   100110010100011 1001 0
IG         5939e20bb90b   101110111001000 0101 1
IH         850d80997bcf   100101111011110 0111 1

Among the interface identifiers, there isn't a unique 3-bit selector, but there is a 4-bit one in position 1.

By increasing the size of the table by an average of 15 machine words, we achieve significantly better probabilities of finding suitable selectors, even up to cases where a class implements 13 interfaces.

Allowing Gaps in Selectors

Often, 48-bit interface identifiers contain the necessary selectors, but they might not be in contiguous bits. The ideal solution would be to use the pext instruction, which can extract arbitrary bits from a register based on a mask. However, this instruction is not available on all processors and might take an impractical 300 cycles in some cases. Hence, let's explore a more cost-effective and widely applicable approach: N contiguous bits + one standalone bit. Such a sequence can be achieved by adding just one add operation:

Expression                        Binary value
interface id                      xABxxxxCxx
mask                              0110000100
id & mask                         0AB0000C00
adder                             0000111100
(id & mask) + adder               0ABCxxxx00
((id & mask) + adder) >> offset   0000000ABC
In binary value A-B-C - desired bits, x - garbage

Code that performs this bit extraction:

movzx r10, ax
and rax, 184h  ; mask  0110000100
add rax, 3ch   ; adder 0000111100 - the extra `add` instruction
shr rax, 6
mov rax, class_N_itable[rax*8]
jmp [rax + r10*8]
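A numeric check of the mask-and-add trick from the table above (mask 0110000100 = 184h, adder 0000111100 = 3ch, shift 6): the addition carries bit C from position 2 up to position 6, right next to A and B.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t MASK  = 0x184;   // selects A (bit 8), B (bit 7), C (bit 2)
constexpr uint64_t ADDER = 0x03c;   // ones in bits 2..5 carry C up to bit 6

// ((id & mask) + adder) >> 6  ==>  0b0...0ABC
constexpr uint64_t extract_abc(uint64_t id) {
    return ((id & MASK) + ADDER) >> 6;
}
```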

By simultaneously using the additional add instruction and the +1-bit table width, we can confidently construct dispatchers for classes with 20+ interfaces, covering all practical dispatch scenarios. Utilizing the pext instruction would further improve the probabilities and reduce table sizes, staying within the limit of four instructions.

In general, the problem of finding a perfect hash function that is cheap to compute has multiple solutions; bit-mask extraction is the simplest among them.

How This Approach Accelerates dynamic_cast in Argentum

// Speaking of Syntax
// In Argentum, dynamic_cast takes the form:
// expression ~ Type, and returns optional(Type),
// for example:
someRandomObject ~ MyPreciousType ? _.callItsMethod();
// Read as: cast 'someRandomObject' to type 'MyPreciousType',
// and if successful, call the interface method 'callItsMethod' on it.

In Argentum, each method table has a special method at index 0. This method compares the dispatching interface identifier with the actual interface implemented by this table, and returns either the this-object or null.

When we need to check if an object has interface X, we call the method at index 0 with the 48-bit identifier of interface X for that object.

  • If the interface is implemented in the class, selector extraction and the interface-table lookup lead to the method at index 0 of that interface's table, where identifier X matches the constant encoded in the method. Hence, the method at index 0 returns this.
  • Otherwise, if interface X is not implemented, selector extraction leads to the only interface table it could have resided in. A single comparison between identifier X and the identifier of the interface actually associated with that method table detects the failure of the dynamic_cast.

By the way, because of dynamic_cast, the unused entries in interface tables are filled with references to a special method table with a single element that always returns nullptr.

Hence, dynamic_cast to an interface in Argentum always takes 3 machine instructions at the call site and executes in 10 machine instructions:

  • 3 instructions for calling method 0 of the specified interface with parameter passing (can be reduced to 2).
  • 4 dispatcher instructions.
  • 3 instructions in method 0: compare, cmov, ret (can be reduced to 2 if we accept a zero flag instead of a pointer).
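The whole scheme - dispatch plus cast check - fits in a few lines of C++. This is an illustrative model with made-up ids and a single-interface class, not the generated code:

```cpp
#include <cassert>
#include <cstdint>

struct Obj;
using Slot       = void* (*)(Obj*, uint64_t);  // generic method entry point
using Dispatcher = void* (*)(Obj*, uint64_t);

struct Obj { Dispatcher dispatch; };           // first field of every object

// toy interface ids: high 48 bits, low 16 bits zeroed for the method index
constexpr uint64_t DRAWABLE_ID  = 0x25ull << 16;
constexpr uint64_t PRINTABLE_ID = 0x18ull << 16;  // not implemented below

// slot 0 of every table: compare the queried id, return this or null
void* drawable_cast_check(Obj* self, uint64_t word) {
    return (word & ~0xFFFFull) == DRAWABLE_ID ? self : nullptr;
}
void* drawable_paint(Obj* self, uint64_t) { return self; }  // stand-in body

const Slot drawable_table[] = { drawable_cast_check, drawable_paint };

// per-class dispatcher; one interface here, so no selector bits are needed
void* rect_dispatch(Obj* self, uint64_t word) {
    return drawable_table[word & 0xFFFF](self, word);
}

// caller side: a method call and a dynamic_cast are the same machinery
void* call_method(Obj* o, uint64_t iface, uint16_t index) {
    return o->dispatch(o, iface + index);
}
void* dyn_cast(Obj* o, uint64_t iface) {
    return o->dispatch(o, iface);              // method index 0
}
```

Casting to an interface the class does implement routes to its cast-check slot and returns the object; casting to any other interface routes to the same table and fails with a single comparison.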

Comparison with Existing Languages

In Argentum, every reference is just a single pointer to the beginning of an object. A single machine word.

  • Compared to Swift/Rust/Go, Argentum has no overhead from double-width pointers and no register file spills. For instance, the x86-64 Windows ABI assigns only 4 registers for passing parameters - two Swift references would consume them all, and a third parameter would have to go through memory.
  • Compared to C++ static_cast, Argentum doesn't have the overhead of moving this-pointer within an object (with nullptr checks).

Each object has only one dispatch-related field: a pointer to the dispatcher.

  • Compared to C++, Argentum has no overhead of multiple pointers to various VMTs and virtual base offsets within object data, and doesn't have overhead during object creation.

Compared to a simple virtual method call with single inheritance, we have four dispatcher instructions.

  • This is orders of magnitude cheaper than Java dispatch.
  • This is close to C++, where in multiple inheritance cases, we often encounter the need to adjust this-pointer by the offset stored in VMT. In C++, such correction is automatically done by the thunk code, which is comparable in complexity to our dispatcher's four instructions.
  • In Rust, Go, and Swift the interface method invocation is faster by these four instructions, but they lose two instructions in each operation of passing, saving, and loading references due to their doubled size, and these operations happen more frequently than calls.

Argentum supports dynamic_cast to interfaces, which takes three machine instructions in the program code and is executed in 10 instructions.

  • This is multiple orders of magnitude cheaper than in Swift, Java, Go, and dynamic_cast in C++.
  • Rust doesn't have such an operation at all.

By the way, this dispatch method is suitable for the case of dynamically loading modules that bring new classes and interfaces to AppDomains:

  • When adding a new interface, it gets a randomly generated unique 48-bit identifier. Existing dispatchers and tables won't need to be rebuilt.
  • The same applies to classes. Adding a class to the application only requires generating its own dispatcher and tables, without affecting existing ones.

Unlike many other Argentum features determined by the language's architecture (absence of memory leaks, absence of GC, absence of shared mutable state, races, deadlocks, etc.), the dispatch method described here can be borrowed and applied in other languages.
