Multiline raw string constants

String literals are bad but necessary

In general, the use of literal strings in the program source code is considered a bad practice. There are two reasons for this:

  • Firstly, changing a user-facing message should not require changing the source code of the program.
  • Secondly, messages have to be displayed in different languages for different locales. Therefore, for reasons of configurability and internationalization, it is considered good practice to keep all texts outside the program, say in configuration files.

Unfortunately, in the modern world, a huge number of programs are forced to work with texts. And not with texts that are shown to users, but with technical texts:

  • Most network communication between clients and servers is performed in JSON or XML, which are textual formats.
  • Internet resource addresses are texts.
  • Filenames are also texts.

Even in the ideal case it is impossible to eliminate texts from program source code. For example, if a program takes all its texts from a configuration file, the name of this file must still be present in the program so that it can open it at startup. And this filename is also a text.

Therefore, string constants still have their place in programming languages.

How it works in today's languages

Many modern languages have added the ability to specify multi-line text constants without the need to escape special characters.

Raw string literals appeared in C++11. JavaScript template literals, delimited by the backtick character "`", can also span multiple lines.

To support multiline text constants, a language has to solve a number of technical problems:

  1. How to define the end of a constant.
  2. How to encode line endings.
  3. What to do with white space at the beginning of each line (indentation).

Let's explore these issues in more detail.

1. String constant separator

Since the string can contain absolutely any characters in any combination, we must somehow choose a special character or combination of characters that tells the compiler that the string constant has ended and the usual program text continues. In JavaScript, the same backtick character is used to end the constant. Perhaps the authors of the language decided that this character is used extremely rarely. Actually, it is not:

  • Firstly, this character marks inline code in Markdown.
  • Secondly, the very choice of this character as a delimiter means that it appears in any text that quotes JavaScript code. All this makes the character quite common.

In contrast, C++ lets you choose your own delimiter sequence (up to 16 characters). This is much better, but the programmer still has to look through the entire text of the string constant to make sure that the closing sequence does not occur in it.
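
The cost of the C++ approach can be illustrated with a small Python sketch that generates a raw string literal for arbitrary text. The helper names `pick_raw_delimiter` and `cpp_raw_literal` are made up for this illustration, and the search is limited to lowercase delimiters of at most two letters to keep it short. The point is that the whole text has to be scanned before a delimiter can be trusted.

```python
import string


def pick_raw_delimiter(text: str) -> str:
    """Return a short delimiter d such that the closing sequence )d" is absent from text."""
    letters = string.ascii_lowercase
    # Try the empty delimiter first, then one-letter and two-letter ones.
    candidates = [""] + list(letters) + [a + b for a in letters for b in letters]
    for d in candidates:
        if f'){d}"' not in text:  # a scan over the entire text for every candidate
            return d
    raise ValueError("no free delimiter of length <= 2")


def cpp_raw_literal(text: str) -> str:
    """Wrap text as a C++ raw string literal of the form R"d(...)d"."""
    d = pick_raw_delimiter(text)
    return f'R"{d}({text}){d}"'


print(cpp_raw_literal('say "hi"'))  # R"(say "hi")"
print(cpp_raw_literal('f(x)"'))     # R"a(f(x)")a"
```

The second call shows the problem: the text contains `)"`, so the default empty delimiter would end the literal too early and a different one has to be chosen.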

2. Line endings

There are three common ways in the wild to encode a newline.

  • The CR (\r, 0x0D) character in classic Mac OS (and many other older systems).
  • The LF (\n, 0x0A) character in Unix-like operating systems.
  • The CR LF combination in operating systems of the Windows family.
    (There are others, mostly historical.)

The source text of the program can have any encoding of line endings. A literal string constant can also require any line ending encoding. And these two encodings do not have to match. For example, HTTP protocol headers are required to have CRLF line endings, while most Unix text editors go crazy if they see non-Unix line endings. Therefore, in the same stream of characters transmitted over HTTP, very often the header is encoded in one way and the body of the message in another.

In modern languages such as C++, Python and JavaScript, line breaks inside multiline constants are always encoded with the LF character, regardless of the line endings used in the source file. This covers most cases, but does not solve all problems: when the text needs a different line ending, it has to be converted or escaped by hand, which is far from being the optimal solution.
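
As an illustration of the manual work this leaves to the programmer, here is a minimal Python sketch. A triple-quoted literal yields LF line breaks, so an HTTP request head has to be converted to CRLF explicitly. The helper name `with_line_ending` is an assumption made for this example, not part of any library.

```python
def with_line_ending(text: str, eol: str) -> str:
    """Re-encode every line break in text with the given line ending."""
    # Normalize CRLF and lone CR to LF first, so mixed input is handled too.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", eol)


request_head = """GET / HTTP/1.1
Host: localhost

"""

wire = with_line_ending(request_head, "\r\n")
print(repr(wire))  # 'GET / HTTP/1.1\r\nHost: localhost\r\n\r\n'
```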

3. Indentation

In all modern programming languages, program structure is indented.

String constants rarely end up at the very left margin. Most modern programming languages do nothing about indentation in multiline string literals. This indentation is simply transferred from the program text to the text of string constants. As a result, the programmer has three options:

  • Either mutilate the source code of the program by shifting all string constants to the left margin, breaking the indentation system,
  • Or put up with a lot of unwanted whitespace in the constants,
  • Or give up multiline constants altogether.
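
The leak is easy to reproduce in Python, whose triple-quoted strings carry the source indentation into the value. The sketch below also shows the usual library workaround, `textwrap.dedent` from the standard library, which has to be applied by hand at run time. The function name `render` is just a placeholder for some nested code.

```python
import textwrap


def render():
    if True:  # stands in for any nesting level in real code
        raw = """
            <ul>
               <li>item</li>
            </ul>
        """
        # dedent removes the common leading whitespace; strip drops the
        # line breaks next to the opening and closing quotes.
        return raw, textwrap.dedent(raw).strip("\n")


raw, cleaned = render()
print(repr(raw.splitlines()[1]))  # '            <ul>'
print(cleaned)
```

The raw value keeps twelve spaces of program indentation in front of `<ul>`, while the cleaned value starts at the left margin and keeps only the three extra spaces before `<li>`.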

For a comprehensive solution to this problem, it can be divided into three parts:

  • 3.a. Should the indentation of the original program text be removed, and if so, how do we determine how much to remove?
  • 3.b. At the request of the programmer, the compiler needs to add a certain amount of indentation inside the constant. This can be useful if the constant encodes a fragment of a JSON or XML file, or the source code of another program, that has its own indentation system which does not match the indentation of the string constant in the generating program.
  • 3.c. Sometimes the resulting string constant must be indented not with space characters but with tab characters (for example, in Makefiles). So, generally speaking, the encoding of indents in the generating program and in the string constant do not have to match.

All of these considerations become especially important because modern text editors and version control systems can quite freely convert spaces to tabs, tabs to spaces, and line breaks from one format to another.

Argentum implementation

A multiline string constant does not use any special characters like backtick or their combinations. A multi-line text constant begins with a "quote" character and differs from a normal constant in that the "quote" is followed by a newline character. Previously, such lines were considered erroneous. Now these are multi-line string constants.

After the newline character comes the first line of the actual string constant. This line must be indented.

The multiline constant continues through all following lines that have the same or greater indent, and it ends at the first line with a smaller indent. Therefore, no delimiter character is needed: the constant is bounded by indentation. This solves problem #1 of multi-line text constants (see above). The outdented line that ends the constant must start with a "quote" character. This lets many text editors highlight the text properly.

Example:

log("
   This is a multiline
   constant.
   "Hello"
");

All spaces that make up the first indent are excluded from all lines of the string constant. This solves problem #3a.

So the above example will print:

This is a multiline
constant.
"Hello"

All indentation beyond the base one becomes part of the string constant. By default, no transformation is performed on it. According to the Argentum language standard, indentation can only be made of spaces, so by default these spaces from the program text migrate to the string constant.

sys_log("
   <ul>
      <li>
         constant
      </li>
   </ul>
");

prints
<ul>
   <li>
      constant
   </li>
</ul>
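
The rules described so far (the first line sets the base indent, deeper indentation is kept, and the constant ends at an outdented line that starts with a quote) can be sketched in Python. This is only an illustration of the described behavior, not the Argentum compiler's code: the function name `read_multiline` is invented, and empty lines inside the constant are not handled because the text above does not say how they are treated.

```python
def read_multiline(lines: list[str]) -> str:
    """Extract a constant from the source lines that follow the opening quote."""
    base = len(lines[0]) - len(lines[0].lstrip(" "))
    if base == 0:
        raise SyntaxError("the first line of the constant must be indented")
    body = []
    for line in lines:
        indent = len(line) - len(line.lstrip(" "))
        if indent < base:
            # An outdented line ends the constant and must open with a quote.
            if not line.lstrip(" ").startswith('"'):
                raise SyntaxError("the outdented line must start with a quote")
            return "\n".join(body)
        body.append(line[base:])  # drop the base indent, keep any extra spaces
    raise SyntaxError("unterminated multiline constant")


source = [
    "   <ul>",
    "      <li>",
    "         constant",
    "      </li>",
    "   </ul>",
    '");',
]
print(read_multiline(source))
```

A quote inside the constant, as in the "Hello" example, does not end it, because only a line with a smaller indent is checked for the closing quote.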

This default behavior can be changed: at the very beginning of the declaration of a string constant, a number can be added between the quote character and the end of the line. In this case, the compiler:

  • will require that all additional indents inside the string constant be a multiple of this number,
  • will replace each such group of spaces with one tab character. This solves problem #3c.

sys_log("4
   <ul>
       <li>
           constant
       </li>
   </ul>
");

prints
<ul>
\t<li>
\t\tconstant
\t</li>
</ul>

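The text between the quote character and the end of the line acts as a format string. In it you can also tell the compiler to insert a certain number of spaces or tabs at the beginning of each line, which solves problem #3b. Each "." stands for one space and each "t" for one tab: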

sys_log("....
   <ul>
      <li>
         constant
      </li>
   </ul>
");

prints 4 spaces at the beginning of each line (as specified by "...."):
    <ul>
       <li>
          constant
       </li>
    </ul>

The default newline character is LF. But this can be altered using the formatting code, which is written in the same place as the tab width: between the quote character and the end of the line. You can write "n", "r", "rn" or "nr" there, and the compiler will insert the corresponding characters "\n" and "\r" at the end of each line. This solves problem #2.

sys_log("rn
  GET / HTTP/1.1
  Host: localhost
");

prints
GET / HTTP/1.1<CR><LF>
Host: localhost

Sometimes it is useful to add one or more newline characters to the end of a string constant. Of course, you can add empty indented lines at the end of the string constant, but such lines are invisible (i.e. they consist only of spaces), which adversely affects readability. Therefore, a trailing newline can be forced using the same format string: if it ends with backslashes "\", the compiler will add one line ending at the end of the string constant for each backslash, and if you also add a plus sign "+", the compiler will add the indentation after the last line ending, the same as on all other lines.

sys_log("rn\\
  GET / HTTP/1.1
  Host: localhost
");

prints
GET / HTTP/1.1<CR><LF>
Host: localhost<CR><LF><CR><LF>

As a result, the format string:

  • turns spaces into tabs if needed
  • removes the original indentation and replaces it with the specified indents
  • sets the line endings
  • adds a final new line(s) and a final indentation.
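
To make the combined effect concrete, here is a hedged Python sketch of a format string interpreter. The field order (indent characters, tab width, line ending, backslashes, plus sign) is inferred from the examples in this article and is an assumption, as are the names `Format`, `parse_format` and `render`. The sketch takes lines whose base indent has already been removed.

```python
import re
from dataclasses import dataclass


@dataclass
class Format:
    prefix: str         # inserted at the start of every line ("t" = tab, "." = space)
    tab_width: int      # 0 keeps extra spaces as they are
    eol: str            # line ending, LF by default
    extra_eols: int     # one additional line ending per trailing backslash
    final_indent: bool  # "+": repeat the prefix after the last line ending


def parse_format(fmt: str) -> Format:
    m = re.fullmatch(r"([t.]*)(\d*)([nr]*)(\\*)(\+?)", fmt)
    if m is None:
        raise SyntaxError(f"bad format string: {fmt!r}")
    indent, width, eol, slashes, plus = m.groups()
    return Format(
        prefix=indent.replace("t", "\t").replace(".", " "),
        tab_width=int(width) if width else 0,
        eol=eol.replace("n", "\n").replace("r", "\r") or "\n",
        extra_eols=len(slashes),
        final_indent=bool(plus),
    )


def render(fmt: str, lines: list[str]) -> str:
    f = parse_format(fmt)
    out = []
    for line in lines:
        body = line.lstrip(" ")
        extra = len(line) - len(body)
        if f.tab_width:
            if extra % f.tab_width:
                raise SyntaxError("indent is not a multiple of the tab width")
            lead = "\t" * (extra // f.tab_width)
        else:
            lead = " " * extra
        out.append(f.prefix + lead + body)
    text = f.eol.join(out) + f.eol * f.extra_eols
    return text + (f.prefix if f.final_indent else "")


print(repr(render(r"rn\\", ["GET / HTTP/1.1", "Host: localhost"])))
# 'GET / HTTP/1.1\r\nHost: localhost\r\n\r\n'
print(repr(render(r"tt.2r\\+", ["aaa", "  ccc"])))
# '\t\t aaa\r\t\t \tccc\r\r\t\t '
```

Keeping the parsed fields in a small record separates the syntax of the format string from its effect, so each rule can be checked on its own.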

The most complex example, combining every option (never actually used in full):

log("tt.2r\\+
     aaa
       ccc
");
where:
  tt.  add tab-tab-space to the beginning of each line
  2    convert extra spaces into tabs (2 spaces for 1 tab)
  r    make line endings \r aka CR
  \    add an extra empty line at the end
  \+   add a second extra line and an indentation at the end of string

This prints
<TAB><TAB> aaa<CR>       (all "<TAB><TAB> "s from "tt."), (all <CR>s from "r")
<TAB><TAB> <TAB>ccc<CR>  (extra <TAB> from "2") (CR from "\")
<CR>                     (from the second "\")
<TAB><TAB>               (from "+")

Conclusion: Argentum multiline strings solve the common multiline string problems that remain unsolved in other programming languages. And they do it in a simple and optimal way.

BTW, multi-line string constants also support string interpolation.
