Strings and Cursors

In Argentum, strings are sequences of Unicode code points, stored internally as UTF-8.

Strings are immutable and not indexable. The only way to access a string's characters is to read them one by one, from start to end.

Previously, the sys_String class played two roles:

  • when it was mutable, it allowed reading string characters like an input stream or a cursor,
  • when it was frozen, it just held the characters, and to actually access them you needed to make a mutable copy.

This was odd and inefficient: internally, String was a class instance holding a shared pointer to a variable-size string buffer, so Argentum made two allocations per string, even when the string was immutable.

In the redesigned version, the internal representation of strings is simpler and their behavior is less surprising:

  • The sys_String class became strictly immutable. You can no longer modify its characters, and you can no longer move a "position" within it.
  • Argentum got a handy alias, str, for a shared pointer to an immutable sys_String.
  • To access string characters, you acquire a sys_Cursor object. It holds the current position in the string and extracts code points while moving that position from start to end, as in an input stream. So the cursor is mutable, but the underlying string is not.

The following example shows typical cursor usage:
s = "Hello";   // s is of type `str` aka `*sys_String` - a shared pointer to characters

// Acquire a cursor
c = s.cursor();  // c is a cursor pointing to the first character of the string

// Read characters one by one
c.getCh()  // Returns a Unicode code-point of character `H`
c.getCh()  // Returns a Unicode code-point of character `e`

// Make a copy of the cursor
c1 = @c;   // Creates a separate cursor pointing at the first `l`

// Skip code-points in a loop
loop c.getCh() == 'o';   // skip all characters up to and including 'o'

// Attempts to read past the end of the string
c.getCh()  // Returns 0 as the end-of-string indicator
c.getCh()  // This and all subsequent calls will return 0

// Cursor assignments and resets
c.set(s);  // Resets the cursor to the beginning of the string "Hello"
c := c1;    // Makes `c` and `c1` reference the same cursor pointing at 'l'
c := @c1;   // Makes `c` a distinct cursor pointing at 'l'
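
To make the semantics above concrete, here is a minimal sketch in Python rather than Argentum. It is only a model under stated assumptions, not the runtime's implementation: the `Cursor` class and the names `get_ch` and `copy` are hypothetical stand-ins for `sys_Cursor`, `getCh()` and `@c`, and the input is assumed to be valid UTF-8.

```python
class Cursor:
    """Hypothetical Python model of a cursor over an immutable UTF-8 string."""

    def __init__(self, data: bytes, pos: int = 0):
        self._data = data   # immutable bytes, shared between cursors
        self._pos = pos     # byte offset of the next code point

    def get_ch(self) -> int:
        """Return the next code point and advance, or 0 at end of string."""
        if self._pos >= len(self._data):
            return 0
        lead = self._data[self._pos]
        # Sequence length from the UTF-8 lead byte (input assumed valid).
        if lead < 0x80:
            size = 1
        elif lead < 0xE0:
            size = 2
        elif lead < 0xF0:
            size = 3
        else:
            size = 4
        ch = self._data[self._pos:self._pos + size].decode("utf-8")
        self._pos += size
        return ord(ch)

    def copy(self) -> "Cursor":
        """Model of `@c`: an independent cursor at the same position."""
        return Cursor(self._data, self._pos)

    def set(self, data: bytes) -> None:
        """Model of `c.set(s)`: rewind to the start of the given string."""
        self._data = data
        self._pos = 0


s = "Hello".encode("utf-8")
c = Cursor(s)
first = chr(c.get_ch())      # 'H'
second = chr(c.get_ch())     # 'e'
c1 = c.copy()                # independent cursor at the first 'l'
# Skip up to and including 'o'; the 0 check guards against a missing 'o'.
while c.get_ch() not in (ord("o"), 0):
    pass
at_end = c.get_ch()          # 0: end-of-string indicator
third = chr(c1.get_ch())     # 'l': moving c did not move the copy
```

The point of the sketch is that all mutable state is a single position, while the bytes themselves are never modified and can be shared by any number of cursors.
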

Both the String and Cursor classes can be extended with methods that perform parsing. For example, there is a module that adds the following methods:

String.tokenize(char) // splits the string by char and returns a SharedArray(String)
Cursor.getTill(char)  // extracts a substring of the string up to the given char
Cursor.peekCh()       // returns the next code point without removing it from the stream

This set of helper functions covers the needs of tests, examples and demos, and can easily be extended as needed.
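
As an illustration of how such parsing helpers can be layered on a single read-one-code-point primitive, here is a hedged Python sketch. It models the idea only: the class and the names `get_ch`, `peek_ch`, `get_till` and `tokenize` are hypothetical analogues of the methods listed above, the cursor walks a Python `str` for brevity, and the choice to consume the delimiter in `get_till` without returning it is an assumption, not documented Argentum behavior.

```python
class Cursor:
    """Hypothetical model: a read position over an immutable Python str."""

    def __init__(self, text: str, pos: int = 0):
        self.text = text
        self.pos = pos

    def get_ch(self) -> int:
        """Consume and return the next code point, or 0 at end of string."""
        if self.pos >= len(self.text):
            return 0
        ch = ord(self.text[self.pos])
        self.pos += 1
        return ch

    def peek_ch(self) -> int:
        """Return the next code point without consuming it."""
        return Cursor(self.text, self.pos).get_ch()

    def get_till(self, stop: str) -> str:
        """Collect characters up to `stop` or end of string.

        Assumption: the delimiter is consumed but not returned.
        """
        out = []
        while True:
            ch = self.get_ch()
            if ch == 0 or ch == ord(stop):
                return "".join(out)
            out.append(chr(ch))


def tokenize(text: str, sep: str) -> list:
    """Split `text` by `sep` using only cursor operations."""
    c = Cursor(text)
    tokens = []
    while c.peek_ch() != 0:
        tokens.append(c.get_till(sep))
    return tokens


c = Cursor("key=value")
key = c.get_till("=")         # text before '='
peeked = chr(c.peek_ch())     # next character, not consumed
value = c.get_till("\n")      # no newline present, so reads to the end
tokens = tokenize("a,b,c", ",")
```

Note that in this sketch a trailing separator does not produce a final empty token; whether the real `tokenize` behaves the same way is not stated in the text.
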

Put differently, a string was previously represented internally by a string buffer plus a string object. These two roles are now split between String and Cursor instances.

For completeness: the opposite task, string synthesis, is handled by two other runtime library classes:

  • sys_Blob - a generic, variable-size byte array that also allows 8-, 16-, 32- and 64-bit integer access and supports manipulation of UTF-8 characters (runes). It can produce strings from byte ranges.
  • sys_StrBuilder - a descendant of sys_Blob that exposes put* methods for different data types. Together with string interpolation, they make it possible to build strings and format user data types with a lightweight syntax.
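
The relationship between the two classes can be modelled roughly as follows. This is a Python sketch under loud assumptions: `Blob`, `StrBuilder` and every method name here (`put_rune`, `make_str`, `put_str`, `put_int`, `to_str`) are hypothetical and chosen for illustration; the actual method names and signatures of sys_Blob and sys_StrBuilder are not given in the text.

```python
class Blob:
    """Hypothetical model of a growable byte array."""

    def __init__(self):
        self.data = bytearray()

    def put_rune(self, code_point: int) -> None:
        """Append one code point encoded as UTF-8."""
        self.data += chr(code_point).encode("utf-8")

    def make_str(self, start: int, end: int) -> str:
        """Produce an immutable string from the byte range [start, end)."""
        return bytes(self.data[start:end]).decode("utf-8")


class StrBuilder(Blob):
    """Adds typed put_* helpers on top of Blob; each returns self for chaining."""

    def put_str(self, s: str) -> "StrBuilder":
        self.data += s.encode("utf-8")
        return self

    def put_int(self, n: int) -> "StrBuilder":
        return self.put_str(str(n))

    def to_str(self) -> str:
        """Freeze the whole buffer into an immutable string."""
        return self.make_str(0, len(self.data))


b = StrBuilder().put_str("x = ").put_int(42)
b.put_rune(0x21)        # append '!'
result = b.to_str()
```

The design mirrors the reading side: mutation is confined to a buffer object, and the only way to obtain a string from it is to produce a new immutable value.
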
