\documentclass[12pt,oneside]{article}
\usepackage[T1]{fontenc}
%\usepackage{euler}
\usepackage{amssymb, amsmath, amsfonts, stmaryrd}
\usepackage[mathscr]{euscript}
\usepackage{mathrsfs}
\usepackage{theorem}
\usepackage[english]{babel}
\usepackage{bm}
\usepackage[all]{xy}
\usepackage{array}
\usepackage{multirow}
%\usepackage{chngcntr}
%\CompileMatrices
\usepackage[bookmarks=false,pdfauthor={Nikolai Durov},pdftitle={Telegram Open Network Virtual Machine}]{hyperref}
\usepackage{fancyhdr}
\usepackage{caption}
%
\setlength{\headheight}{15.2pt}
\pagestyle{fancy}
\renewcommand{\headrulewidth}{0.5pt}
%
\def\makepoint#1{\medbreak\noindent{\bf #1.\ }}
\def\zeropoint{\setcounter{subsection}{-1}}
\def\zerosubpoint{\setcounter{subsubsection}{-1}}
\def\nxpoint{\refstepcounter{subsection}%
\smallbreak\makepoint{\thesubsection}}
\def\nxsubpoint{\refstepcounter{subsubsection}%
\smallbreak\makepoint{\thesubsubsection}}
\def\nxsubsubpoint{\refstepcounter{paragraph}%
\makepoint{\paragraph}}
%\setcounter{secnumdepth}{4}
%\counterwithin{paragraph}{subsubsection}
\def\refpoint#1{{\rm\textbf{\ref{#1}}}}
\let\ptref=\refpoint
\def\embt(#1.){\textbf{#1.}}
\def\embtx(#1){\textbf{#1}}
\def\emb#1{\textbf{#1.}}
\long\def\nodo#1{}
%
%\def\markbothsame#1{\markboth{#1}{#1}}
\fancyhf{}
\fancyfoot[C]{\thepage}
\def\markbothsame#1{\fancyhead[C]{#1}}
\def\mysection#1{\section{#1}\fancyhead[C]{\textsc{Chapter \textbf{\thesection.} #1}}}
\def\mysubsection#1{\subsection{#1}\fancyhead[C]{\small{\textsc{\textrm{\thesubsection.} #1}}}}
\def\myappendix#1{\section{#1}\fancyhead[C]{\textsc{Appendix \textbf{\thesection.} #1}}}
%
\let\tp=\textit
\let\vr=\textit
\def\workchainid{\vr{workchain\_id\/}}
\def\shardpfx{\vr{shard\_prefix}}
\def\accountid{\vr{account\_id\/}}
\def\currencyid{\vr{currency\_id\/}}
\def\uint{\tp{uint}}
\def\opsc#1{\operatorname{\textsc{#1}}}
\def\blkseqno{\opsc{blk-seqno}}
\def\blkprev{\opsc{blk-prev}}
\def\blkhash{\opsc{blk-hash}}
\def\Hash{\opsc{Hash}}
\def\Sha{\opsc{sha256}}
\def\CellRepr{\opsc{CellRepr}}
\def\Lvl{\opsc{Lvl}}
\def\height{\opsc{height}}
\def\len{\opsc{len}}
\def\leaf{\opsc{Leaf}}
\def\node{\opsc{Node}}
\def\root{\opsc{Root}}
\def\emptyroot{\opsc{EmptyRoot}}
\def\code{\opsc{code}}
\def\Ping{\opsc{Ping}}
\def\Store{\opsc{Store}}
\def\FindNode{\opsc{Find\_Node}}
\def\FindValue{\opsc{Find\_Value}}
\def\Bytes{\tp{Bytes}}
\def\Transaction{\tp{Transaction}}
\def\Account{\tp{Account}}
\def\State{\tp{State}}
\def\Maybe{\opsc{Maybe}}
\def\List{\opsc{List}}
\def\Block{\tp{Block}}
\def\Blockchain{\tp{Blockchain}}
\def\isValidBc{\tp{isValidBc}}
\def\evtrans{\vr{ev\_trans}}
\def\evblock{\vr{ev\_block}}
\def\Hashmap{\tp{Hashmap}}
\def\HashmapE{\tp{HashmapE}}
\def\Type{\tp{Type}}
\def\nat{\tp{nat\/}}
\def\hget{\vr{hget\/}}
\def\bbB{{\mathbb{B}}}
\def\st#1{{\mathbf{#1}}}
\def\sgn{\operatorname{sgn}}
\def\caret{\^{}}
%
\hfuzz=0.8pt
\title{Telegram Open Network Virtual Machine}
\author{Nikolai Durov}
\begin{document}
%\pagestyle{myheadings}
\maketitle
\begin{abstract}
The aim of this text is to provide a description of the Telegram Open Network Virtual Machine (TON VM or TVM), used to execute smart contracts in the TON Blockchain.
\end{abstract}

\section*{Introduction}
\markbothsame{Introduction}

The primary purpose of the Telegram Open Network Virtual Machine (TON VM or TVM) is to execute smart-contract code in the TON Blockchain. TVM must support all operations required to parse incoming messages and persistent data, and to create new messages and modify persistent data.
Additionally, TVM must meet the following requirements: \begin{itemize} \item It must provide for possible future extensions and improvements while retaining backward compatibility and interoperability, because the code of a smart contract, once committed into the blockchain, must continue working in a predictable manner regardless of any future modifications to the VM. \item It must strive to attain high ``(virtual) machine code'' density, so that the code of a typical smart contract occupies as little persistent block\-chain storage as possible. \item It must be completely deterministic. In other words, each run of the same code with the same input data must produce the same result, regardless of specific software and hardware used.\footnote{For example, there are no floating-point arithmetic operations (which could be efficiently implemented using hardware-supported {\em double\/} type on most modern CPUs) present in TVM, because the result of performing such operations is dependent on the specific underlying hardware implementation and rounding mode settings. Instead, TVM supports special integer arithmetic operations, which can be used to simulate fixed-point arithmetic if needed.} \end{itemize} The design of TVM is guided by these requirements. While this document describes a preliminary and experimental version of TVM,\footnote{The production version will likely require some tweaks and modifications prior to launch, which will become apparent only after using the experimental version in the test environment for some time.} the backward compatibility mechanisms built into the system allow us to be relatively unconcerned with the efficiency of the operation encoding used for TVM code in this preliminary version. TVM is not intended to be implemented in hardware (e.g., in a specialized microprocessor chip); rather, it should be implemented in software running on conventional hardware. This consideration lets us incorporate some high-level concepts and operations in TVM that would require convoluted microcode in a hardware implementation but pose no significant problems for a software implementation. Such operations are useful for achieving high code density and minimizing the byte (or storage cell) profile of smart-contract code when deployed in the TON Blockchain. \clearpage \tableofcontents \clearpage \mysection{Overview} This chapter provides an overview of the main features and design principles of TVM. More detail on each topic is provided in subsequent chapters. \zeropoint\mysubsection{Notation for bitstrings}\label{p:bitstring.hex} The following notation is used for bit strings (or {\em bitstrings\/})---i.e., finite strings consisting of binary digits (bits), \texttt{0} and \texttt{1}---throughout this document. \nxsubpoint\emb{Hexadecimal notation for bitstrings} When the length of a bitstring is a multiple of four, we subdivide it into groups of four bits and represent each group by one of sixteen hexadecimal digits \texttt{0}--\texttt{9}, \texttt{A}--\texttt{F} in the usual manner: $\texttt{0}_{16}\leftrightarrow\texttt{0000}$, $\texttt{1}_{16}\leftrightarrow\texttt{0001}$, \dots, $\texttt{F}_{16}\leftrightarrow\texttt{1111}$. The resulting hexadecimal string is our equivalent representation for the original binary string. 
\nxsubpoint\emb{Bitstrings of lengths not divisible by four} If the length of a binary string is not divisible by four, we augment it by one \texttt{1} and several (maybe zero) \texttt{0}s at the end, so that its length becomes divisible by four, and then transform it into a string of hexadecimal digits as described above. To indicate that such a transformation has taken place, a special ``completion tag'' \texttt{\_} is added to the end of the hexadecimal string. The reverse transformation (applied if the completion tag is present) consists in first replacing each hexadecimal digit by four corresponding bits, and then removing all trailing zeroes (if any) and the last \texttt{1} immediately preceding them (if the resulting bitstring is non-empty at this point). Notice that there are several admissible hexadecimal representations for the same bitstring. Among them, the shortest one is considered canonical; it can be deterministically obtained by the above procedure. For example, \texttt{8A} corresponds to binary string \texttt{10001010}, while \texttt{8A\_} and \texttt{8A0\_} both correspond to \texttt{100010}. An empty bitstring may be represented by either `', `\texttt{8\_}', `\texttt{0\_}', `\texttt{\_}', or `\texttt{00\_}'.

\nxsubpoint\label{sp:hex.bitst}\emb{Emphasizing that a string is a hexadecimal representation of a bitstring} Sometimes we need to emphasize that a string of hexadecimal digits (with or without a \texttt{\_} at the end) is the hexadecimal representation of a bitstring. In such cases, we either prepend \texttt{x} to the resulting string (e.g., \texttt{x8A}), or prepend \texttt{x\{} and append \texttt{\}} (e.g., \texttt{x\{2D9\_\}}, which is \texttt{00101101100}). This should not be confused with hexadecimal numbers, usually prepended by \texttt{0x} (e.g., \texttt{0x2D9} or \texttt{0x2d9}, which is the integer 729).

\nxsubpoint\emb{Serializing a bitstring into a sequence of octets} When a bitstring needs to be represented as a sequence of 8-bit bytes (octets), taking integer values in the range $0\ldots255$, this is achieved essentially in the same fashion as above: we split the bitstring into groups of eight bits and interpret each group as the binary representation of an integer $0\ldots255$. If the length of the bitstring is not a multiple of eight, the bitstring is augmented by a binary \texttt{1} and up to seven binary \texttt{0}s before being split into groups. The fact that such a completion has been applied is usually reflected by a ``completion tag'' bit. For instance, \texttt{00101101100} corresponds to the sequence of two octets $(\texttt{0x2d}, \texttt{0x90})$ (hexadecimal), or $(45,144)$ (decimal), along with a completion tag bit equal to \texttt{1} (meaning that the completion has been applied), which must be stored separately. In some cases, it is more convenient to assume the completion is enabled by default rather than store an additional completion tag bit separately. Under such conventions, $8n$-bit strings are represented by $n+1$ octets, with the last octet always equal to $\texttt{0x80}=128$.
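To make the above conventions concrete, the following short Python sketch (our illustration, not part of TVM; the function names are ours) performs the transformations of the previous subpoints between bitstrings and their hexadecimal representations with completion tags:
\begin{verbatim}
def bits_to_hex(bits):
    # bits is a string of '0'/'1' characters; the result is the
    # canonical hexadecimal representation with a completion tag.
    if len(bits) % 4 == 0:
        tag = ''
    else:                        # augment by '1' and up to two '0's
        bits = bits + '1' + '0' * (3 - len(bits) % 4)
        tag = '_'
    hexstr = ''.join('{:X}'.format(int(bits[i:i+4], 2))
                     for i in range(0, len(bits), 4))
    return hexstr + tag

def hex_to_bits(s):
    tagged = s.endswith('_')
    if tagged:
        s = s[:-1]
    bits = ''.join('{:04b}'.format(int(c, 16)) for c in s)
    if tagged:
        bits = bits.rstrip('0')  # drop the trailing zeroes ...
        if bits:                 # ... and the final '1', if non-empty
            bits = bits[:-1]
    return bits

assert bits_to_hex('10001010') == '8A'
assert bits_to_hex('100010') == '8A_'    # the canonical representation
assert hex_to_bits('8A0_') == '100010'   # a non-canonical one
assert hex_to_bits('8_') == hex_to_bits('0_') == ''  # empty bitstring
\end{verbatim}
Note that \texttt{bits\_to\_hex} always produces the canonical (shortest) representation, while \texttt{hex\_to\_bits} accepts non-canonical representations as well.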
\mysubsection{TVM is a stack machine}\label{p:tvm.stack}
First of all, {\em TVM is a stack machine.} This means that, instead of keeping values in some ``variables'' or ``general-purpose registers'', they are kept in a (LIFO) {\em stack}, at least from the ``low-level'' (TVM) perspective.\footnote{A high-level smart-contract language might create the illusion of variables for ease of programming; however, the high-level source code working with variables will be translated into TVM machine code keeping all the values of these variables in the TVM stack.}

Most operations and user-defined functions take their arguments from the top of the stack, and replace them with their result. For example, the integer addition primitive (built-in operation) \texttt{ADD} does not take any arguments describing which registers or immediate values should be added together and where the result should be stored. Instead, the two top values are taken from the stack, added together, and their sum is pushed into the stack in their place.

\nxsubpoint\label{sp:tvm.val}\emb{TVM values} The entities that can be stored in the TVM stack will be called {\em TVM values}, or simply {\em values\/} for brevity. They belong to one of several predefined {\em value types}. Each value belongs to exactly one value type. The values are always kept on the stack along with tags uniquely determining their types, and all built-in TVM operations (or {\em primitives}) only accept values of predefined types.

For example, the integer addition primitive \texttt{ADD} accepts only two integer values, and returns one integer value as a result. One cannot supply \texttt{ADD} with two strings instead of two integers expecting it to concatenate these strings or to implicitly transform the strings into their decimal integer values; any attempt to do so will result in a run-time type-checking exception.

\nxsubpoint\emb{Static typing, dynamic typing, and run-time type checking} In some respects TVM performs a kind of dynamic typing using run-time type checking. However, this does not make the TVM code a ``dynamically typed language'' like PHP or JavaScript, because all primitives accept values and return results of predefined (value) types, each value belongs to strictly one type, and values are never implicitly converted from one type to another. If, on the other hand, one compares the TVM code to the conventional microprocessor machine code, one sees that the TVM mechanism of value tagging prevents, for example, using the address of a string as a number---or, potentially even more disastrously, using a number as the address of a string---thus eliminating the possibility of all sorts of bugs and security vulnerabilities related to invalid memory accesses, usually leading to memory corruption and segmentation faults. This property is highly desirable for a VM used to execute smart contracts in a blockchain. In this respect, TVM's insistence on tagging all values with their appropriate types, instead of reinterpreting the bit sequence in a register depending on the needs of the operation it is used in, is just an additional run-time type-safety mechanism. An alternative would be to somehow analyze the smart-contract code for type correctness and type safety before allowing its execution in the VM, or even before allowing it to be uploaded into the blockchain as the code of a smart contract.
Such a static analysis of code for a Turing-complete machine appears to be a time-consuming and non-trivial problem (likely to be equivalent to the halting problem for Turing machines), something we would rather avoid in a blockchain smart-contract context.

One should bear in mind that one can always implement compilers from statically typed high-level smart-contract languages into the TVM code (and we do expect that most smart contracts for TON will be written in such languages), just as one can compile statically typed languages into conventional machine code (e.g., for the x86 architecture). If the compiler works correctly, the resulting machine code will never generate any run-time type-checking exceptions. All type tags attached to values processed by TVM will always have expected values and may be safely ignored during the analysis of the resulting TVM code, apart from the fact that the run-time generation and verification of these type tags by TVM will slightly slow down the execution of the TVM code.

\nxsubpoint\label{sp:val.types}\emb{Preliminary list of value types} A preliminary list of value types supported by TVM is as follows:
\begin{itemize}
\item {\em Integer\/} --- Signed 257-bit integers, representing integer numbers in the range $-2^{256}\ldots2^{256}-1$, as well as a special ``not-a-number'' value \texttt{NaN}.
\item {\em Cell\/} --- A {\em TVM cell\/} consists of at most 1023 bits of data, and of at most four references to other cells. All persistent data (including TVM code) in the TON Blockchain is represented as a collection of TVM cells (cf.~\cite[2.5.14]{TON}).
\item {\em Tuple\/} --- An ordered collection of up to 255 components, having arbitrary value types, possibly distinct. May be used to represent non-persistent values of arbitrary algebraic data types.
\item {\em Null\/} --- A type with exactly one value~$\bot$, used for representing empty lists, empty branches of binary trees, absence of return value in some situations, and so on.
\item {\em Slice\/} --- A {\em TVM cell slice}, or {\em slice\/} for short, is a contiguous ``sub-cell'' of an existing cell, containing some of its bits of data and some of its references. Essentially, a slice is a read-only view for a subcell of a cell. Slices are used for unpacking data previously stored (or serialized) in a cell or a tree of cells.
\item {\em Builder\/} --- A {\em TVM cell builder}, or {\em builder\/} for short, is an ``incomplete'' cell that supports fast operations of appending bitstrings and cell references at its end. Builders are used for packing (or serializing) data from the top of the stack into new cells (e.g., before transferring them to persistent storage).
\item {\em Continuation\/} --- Represents an ``execution token'' for TVM, which may be invoked (executed) later. As such, it generalizes function addresses (i.e., function pointers and references), subroutine return addresses, instruction pointer addresses, exception handler addresses, closures, partial applications, anonymous functions, and so on.
\end{itemize}
This list of value types is incomplete and may be extended in future revisions of TVM without breaking the old TVM code, due mostly to the fact that all originally defined primitives accept only values of types known to them and will fail (generate a type-checking exception) if invoked on values of new types.
Furthermore, existing value types themselves can also be extended in the future: for example, 257-bit {\em Integer\/} might become 513-bit {\em LongInteger\/}, with originally defined arithmetic primitives failing if either of the arguments or the result does not fit into the original subtype {\em Integer}. Backward compatibility with respect to the introduction of new value types and extension of existing value types will be discussed in more detail later (cf.~\ptref{sp:old.op.change}). \mysubsection{Categories of TVM instructions} TVM {\em instructions}, also called {\em primitives\/} and sometimes {\em (built-in) operations}, are the smallest operations atomically performed by TVM that can be present in the TVM code. They fall into several categories, depending on the types of values (cf.~\ptref{sp:val.types}) they work on. The most important of these categories are: \begin{itemize} \item {\em Stack (manipulation) primitives\/} --- Rearrange data in the TVM stack, so that the other primitives and user-defined functions can later be called with correct arguments. Unlike most other primitives, they are polymorphic, i.e., work with values of arbitrary types. \item {\em Tuple (manipulation) primitives\/} --- Construct, modify, and decompose {\em Tuple\/}s. Similarly to the stack primitives, they are polymorphic. \item {\em Constant\/} or {\em literal primitives\/} --- Push into the stack some ``constant'' or ``literal'' values embedded into the TVM code itself, thus providing arguments to the other primitives. They are somewhat similar to stack primitives, but are less generic because they work with values of specific types. \item {\em Arithmetic primitives\/} --- Perform the usual integer arithmetic operations on values of type {\em Integer}. \item {\em Cell (manipulation) primitives\/} --- Create new cells and store data in them ({\em cell creation primitives}) or read data from previously created cells ({\em cell parsing primitives}). Because all memory and persistent storage of TVM consists of cells, these cell manipulation primitives actually correspond to ``memory access instructions'' of other architectures. Cell creation primitives usually work with values of type {\em Builder}, while cell parsing primitives work with {\em Slice\/}s. \item {\em Continuation\/} and {\em control flow primitives\/} --- Create and modify {\em Continuation\/}s, as well as execute existing {\em Continuation\/}s in different ways, including conditional and repeated execution. \item {\em Custom\/} or {\em application-specific primitives\/} --- Efficiently perform specific high-level actions required by the application (in our case, the TON Blockchain), such as computing hash functions, performing elliptic curve cryptography, sending new blockchain messages, creating new smart contracts, and so on. These primitives correspond to standard library functions rather than microprocessor instructions. \end{itemize} \mysubsection{Control registers} While TVM is a stack machine, some rarely changed values needed in almost all functions are better passed in certain special registers, and not near the top of the stack. Otherwise, a prohibitive number of stack reordering operations would be required to manage all these values. To this end, the TVM model includes, apart from the stack, up to 16 special {\em control registers}, denoted by \texttt{c0} to \texttt{c15}, or $\texttt{c}(0)$ to $\texttt{c}(15)$. The original version of TVM makes use of only some of these registers; the rest may be supported later. 
\nxsubpoint\emb{Values kept in control registers} The values kept in control registers are of the same types as those kept on the stack. However, some control registers accept only values of specific types, and any attempt to load a value of a different type will lead to an exception. \nxsubpoint\label{sp:cr.list}\emb{List of control registers} The original version of TVM defines and uses the following control registers: \begin{itemize} \item \texttt{c0} --- Contains the {\em next continuation} or {\em return continuation} (similar to the subroutine return address in conventional designs). This value must be a {\em Continuation}. \item \texttt{c1} --- Contains the {\em alternative (return) continuation}; this value must be a {\em Continuation}. It is used in some (experimental) control flow primitives, allowing TVM to define and call ``subroutines with two exit points''. \item \texttt{c2} --- Contains the {\em exception handler}. This value is a {\em Continuation}, invoked whenever an exception is triggered. \item \texttt{c3} --- Contains the {\em current dictionary}, essentially a hashmap containing the code of all functions used in the program. For reasons explained later in~\ptref{p:func.rec.dict}, this value is also a {\em Continuation}, not a {\em Cell\/} as one might expect. \item \texttt{c4} --- Contains the {\em root of persistent data}, or simply the {\em data}. This value is a {\em Cell}. When the code of a smart contract is invoked, \texttt{c4} points to the root cell of its persistent data kept in the blockchain state. If the smart contract needs to modify this data, it changes \texttt{c4} before returning. \item \texttt{c5} --- Contains the {\em output actions}. It is also a {\em Cell\/} initialized by a reference to an empty cell, but its final value is considered one of the smart contract outputs. For instance, the {\tt SENDMSG} primitive, specific for the TON Blockchain, simply inserts the message into a list stored in the output actions. \item \texttt{c7} --- Contains the {\em root of temporary data}. It is a {\em Tuple}, initialized by a reference to an empty {\em Tuple\/} before invoking the smart contract and discarded after its termination.\footnote{In the TON Blockchain context, \texttt{c7} is initialized with a singleton {\em Tuple}, the only component of which is a {\em Tuple\/} containing blockchain-specific data. The smart contract is free to modify \texttt{c7} to store its temporary data provided the first component of this {\em Tuple\/} remains intact.} \end{itemize} More control registers may be defined in the future for specific TON Block\-chain or high-level programming language purposes, if necessary. \mysubsection{Total state of TVM (SCCCG)}\label{p:tvm.state} The total state of TVM consists of the following components: \begin{itemize} \item {\em Stack} (cf.~\ptref{p:tvm.stack}) --- Contains zero or more {\em values\/} (cf.~\ptref{sp:tvm.val}), each belonging to one of {\em value types} listed in~\ptref{sp:val.types}. \item {\em Control registers \texttt{c0}--\texttt{c15}} --- Contain some specific values as described in \ptref{sp:cr.list}. (Only seven control registers are used in the current version.) \item {\em Current continuation \texttt{cc}} --- Contains the current continuation (i.e., the code that would be normally executed after the current primitive is completed). This component is similar to the instruction pointer register (\texttt{ip}) in other architectures. 
\item {\em Current codepage \texttt{cp}} --- A special signed 16-bit integer value that selects the way the next TVM opcode will be decoded. For example, future versions of TVM might use different codepages to add new opcodes while preserving backward compatibility. \item {\em Gas limits \texttt{gas}} --- Contains four signed 64-bit integers: the current gas limit $g_l$, the maximal gas limit $g_m$, the remaining gas $g_r$, and the gas credit $g_c$. Always $0\leq g_l\leq g_m$, $g_c\geq0$, and $g_r\leq g_l+g_c$; $g_c$ is usually initialized by zero, $g_r$ is initialized by $g_l+g_c$ and gradually decreases as the TVM runs. When $g_r$ becomes negative or if the final value of $g_r$ is less than $g_c$, an {\em out of gas\/} exception is triggered. \end{itemize} Notice that there is no ``return stack'' containing the return addresses of all previously called but unfinished functions. Instead, only control register \texttt{c0} is used. The reason for this will be explained later in~\ptref{sp:call.sw}. Also notice that there are no general-purpose registers, because TVM is a stack machine (cf.~\ptref{p:tvm.stack}). So the above list, which can be summarized as ``stack, control, continuation, codepage, and gas'' (SCCCG), similarly to the classical SECD machine state (``stack, environment, control, dump''), is indeed the {\em total\/} state of TVM.\footnote{Strictly speaking, there is also the current {\em library context}, which consists of a dictionary with 256-bit keys and cell values, used to load library reference cells of~\ptref{sp:exotic.cell.types}.} \mysubsection{Integer arithmetic} All arithmetic primitives of TVM operate on several arguments of type {\em Integer}, taken from the top of the stack, and return their results, of the same type, into the stack. Recall that {\em Integer\/} represents all integer values in the range $-2^{256}\leq x<2^{256}$, and additionally contains a special value \texttt{NaN} (``not-a-number''). If one of the results does not fit into the supported range of integers---or if one of the arguments is a \texttt{NaN}---then this result or all of the results are replaced by a \texttt{NaN}, and (by default) an integer overflow exception is generated. However, special ``quiet'' versions of arithmetic operations will simply produce \texttt{NaN}s and keep going. If these \texttt{NaN}s end up being used in a ``non-quiet'' arithmetic operation, or in a non-arithmetic operation, an integer overflow exception will occur. \nxsubpoint\label{sp:int.no.autoconv}\emb{Absence of automatic conversion of integers} Notice that TVM {\em Integer\/}s are ``mathematical'' integers, and not 257-bit strings interpreted differently depending on the primitive used, as is common for other machine code designs. For example, TVM has only one multiplication primitive \texttt{MUL}, rather than two (\texttt{MUL} for unsigned multiplication and \texttt{IMUL} for signed multiplication) as occurs, for example, in the popular x86 architecture. \nxsubpoint\emb{Automatic overflow checks} Notice that all TVM arithmetic primitives perform overflow checks of the results. If a result does not fit into the {\em Integer} type, it is replaced by a \texttt{NaN}, and (usually) an exception occurs. In particular, the result is {\em not\/} automatically reduced modulo $2^{256}$ or $2^{257}$, as is common for most hardware machine code architectures. 
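These overflow semantics may be illustrated by the following Python model (a simplified sketch of our own; an actual TVM implementation need not look like this):
\begin{verbatim}
NAN = object()                     # models the special value NaN
MIN, MAX = -2**256, 2**256 - 1     # the supported Integer range

class IntegerOverflow(Exception):
    pass

def chk257(x, quiet=False):
    # The overflow check performed on every arithmetic result:
    # out-of-range values become NaN; non-quiet operations also
    # generate an integer overflow exception.
    if x is NAN or not (MIN <= x <= MAX):
        if not quiet:
            raise IntegerOverflow()
        return NAN
    return x

def add(x, y, quiet=False):
    if x is NAN or y is NAN:       # NaN arguments propagate
        return chk257(NAN, quiet)
    return chk257(x + y, quiet)    # NOT reduced modulo 2**256

assert add(2**255, 2**255 - 1) == MAX    # still fits: no exception
assert add(MAX, 1, quiet=True) is NAN    # quiet version: NaN result
try:
    add(MAX, 1)                          # non-quiet version: throws
except IntegerOverflow:
    pass
\end{verbatim}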
\nxsubpoint\emb{Custom overflow checks} In addition to automatic overflow checks, TVM includes custom overflow checks, performed by primitives \texttt{FITS}~$n$ and \texttt{UFITS}~$n$, where $1\leq n\leq256$. These primitives check whether the value on (the top of) the stack is an integer $x$ in the range $-2^{n-1}\leq x<2^{n-1}$ or $0\leq x<2^n$, respectively, and replace the value with a \texttt{NaN} and (optionally) generate an integer overflow exception if this is not the case. This greatly simplifies the implementation of arbitrary $n$-bit integer types, signed or unsigned: the programmer or the compiler must insert appropriate \texttt{FITS} or \texttt{UFITS} primitives either after each arithmetic operation (which is more reasonable, but requires more checks) or before storing computed values and returning them from functions. This is important for smart contracts, where unexpected integer overflows happen to be among the most common sources of bugs.

\nxsubpoint\emb{Reduction modulo $2^n$} TVM also has a primitive \texttt{MODPOW2}~$n$, which reduces the integer at the top of the stack modulo $2^n$, with the result ranging from $0$ to $2^n-1$.

\nxsubpoint\emb{{\em Integer\/} is 257-bit, not 256-bit} One can now understand why TVM's {\em Integer\/} is (signed) 257-bit, not 256-bit. The reason is that it is the smallest integer type containing both signed 256-bit integers and unsigned 256-bit integers, which does not require automatic reinterpreting of the same 256-bit string depending on the operation used (cf.~\ptref{sp:int.no.autoconv}).

\nxsubpoint\label{sp:div.round}\emb{Division and rounding} The most important division primitives are \texttt{DIV}, \texttt{MOD}, and \texttt{DIVMOD}. All of them take two numbers from the stack, $x$ and $y$ ($y$ is taken from the top of the stack, and $x$ is originally under it), compute the quotient $q$ and remainder $r$ of the division of $x$ by $y$ (i.e., two integers such that $x=yq+r$ and $|r|<|y|$), and return either $q$, $r$, or both of them. If $y$ is zero, then all of the expected results are replaced by \texttt{NaN}s, and (usually) an integer overflow exception is generated. The implementation of division in TVM somewhat differs from most other implementations with regard to rounding. By default, these primitives round to $-\infty$, meaning that $q=\lfloor x/y\rfloor$, and $r$ has the same sign as~$y$. (Most conventional implementations of division use ``rounding to zero'' instead, meaning that $r$ has the same sign as~$x$.) Apart from this ``floor rounding'', two other rounding modes are available, called ``ceiling rounding'' (with $q=\lceil x/y\rceil$, and $r$ and $y$ having opposite signs) and ``nearest rounding'' (with $q=\lfloor x/y+1/2\rfloor$ and $|r|\leq|y|/2$). These rounding modes are selected by using other division primitives, with letters \texttt{C} and \texttt{R} appended to their mnemonics. For example, \texttt{DIVMODR} computes both the quotient and the remainder using rounding to the nearest integer.

\nxsubpoint\emb{Combined multiply-divide, multiply-shift, and shift-divide operations} To simplify implementation of fixed-point arithmetic, TVM supports combined multiply-divide, multiply-shift, and shift-divide operations with double-length (i.e., 514-bit) intermediate product. For example, \texttt{MULDIVMODR} takes three integer arguments from the stack, $a$, $b$, and $c$, first computes $ab$ using a 514-bit intermediate result, and then divides $ab$ by $c$ using rounding to the nearest integer.
If $c$ is zero or if the quotient does not fit into {\em Integer}, either two \texttt{NaN}s are returned, or an integer overflow exception is generated, depending on whether a quiet version of the operation has been used. Otherwise, both the quotient and the remainder are pushed into the stack. \clearpage \mysection{The stack} This chapter contains a general discussion and comparison of register and stack machines, expanded further in Appendix~\ptref{app:code.density}, and describes the two main classes of stack manipulation primitives employed by TVM: the {\em basic\/} and the {\em compound stack manipulation primitives}. An informal explanation of their sufficiency for all stack reordering required for correctly invoking other primitives and user-defined functions is also provided. Finally, the problem of efficiently implementing TVM stack manipulation primitives is discussed in~\ptref{p:eff.stack.manip}. \mysubsection{Stack calling conventions}\label{p:stack.conv} A stack machine, such as TVM, uses the stack---and especially the values near the top of the stack---to pass arguments to called functions and primitives (such as built-in arithmetic operations) and receive their results. This section discusses the TVM stack calling conventions, introduces some notation, and compares TVM stack calling conventions with those of certain register machines. \nxsubpoint\emb{Notation for ``stack registers''} Recall that a stack machine, as opposed to a more conventional register machine, lacks general-purpose registers. However, one can treat the values near the top of the stack as a kind of ``stack registers''. We denote by \texttt{s0} or $\texttt{s}(0)$ the value at the top of the stack, by \texttt{s1} or $\texttt{s}(1)$ the value immediately under it, and so on. The total number of values in the stack is called its {\em depth}. If the depth of the stack is $n$, then $\texttt{s}(0)$, $\texttt{s}(1)$, \dots, $\texttt{s}(n-1)$ are well-defined, while $\texttt{s}(n)$ and all subsequent $\texttt{s}(i)$ with $i>n$ are not. Any attempt to use $\texttt{s}(i)$ with $i\geq n$ should produce a stack underflow exception. A compiler, or a human programmer in ``TVM code'', would use these ``stack registers'' to hold all declared variables and intermediate values, similarly to the way general-purpose registers are used on a register machine. \nxsubpoint\emb{Pushing and popping values} When a value $x$ is {\em pushed\/} into a stack of depth $n$, it becomes the new \texttt{s0}; at the same time, the old \texttt{s0} becomes the new \texttt{s1}, the old \texttt{s1}---the new \texttt{s2}, and so on. The depth of the resulting stack is $n+1$. Similarly, when a value $x$ is {\em popped\/} from a stack of depth $n\geq1$, it is the old value of \texttt{s0} (i.e., the old value at the top of the stack). After this, it is removed from the stack, and the old \texttt{s1} becomes the new \texttt{s0} (the new value at the top of the stack), the old \texttt{s2} becomes the new \texttt{s1}, and so on. The depth of the resulting stack is $n-1$. If originally $n=0$, then the stack is {\em empty}, and a value cannot be popped from it. If a primitive attempts to pop a value from an empty stack, a {\em stack underflow\/} exception occurs. 
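The following toy Python model (our illustration only) summarizes these stack conventions; it also shows a \texttt{DIVMOD}-like primitive (cf.~\ptref{sp:div.round}) operating on such a stack, taking advantage of the fact that Python's built-in \texttt{divmod} uses the same rounding to $-\infty$:
\begin{verbatim}
class StackUnderflow(Exception):
    pass

class Stack:
    def __init__(self):
        self._items = []     # _items[-1] is s0, _items[-2] is s1, ...

    def push(self, x):       # old s0 becomes s1, old s1 becomes s2, ...
        self._items.append(x)

    def pop(self):           # removes and returns the old s0
        if not self._items:
            raise StackUnderflow()
        return self._items.pop()

    def s(self, i):          # reads "stack register" s(i)
        if i >= len(self._items):
            raise StackUnderflow()
        return self._items[-1 - i]

def divmod_prim(st):
    # Models DIVMOD: takes y from s0 and x from s1, then pushes the
    # floor-rounded quotient q and the remainder r, in that order.
    y, x = st.pop(), st.pop()
    q, r = divmod(x, y)
    st.push(q)
    st.push(r)

st = Stack()
st.push(-7)                  # the first argument x ...
st.push(2)                   # ... then the second argument y on top
divmod_prim(st)
assert st.s(1) == -4 and st.s(0) == 1   # -7 = 2 * (-4) + 1
\end{verbatim}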
\nxsubpoint\emb{Notation for hypothetical general-purpose registers} In order to compare stack machines with sufficiently general register machines, we will denote the general-purpose registers of a register machine by \texttt{r0}, \texttt{r1}, and so on, or by $\texttt{r}(0)$, $\texttt{r}(1)$, \ldots, $\texttt{r}(n-1)$, where $n$ is the total number of registers. When we need a specific value of $n$, we will use $n=16$, corresponding to the very popular x86-64 architecture.

\nxsubpoint\emb{The top-of-stack register \texttt{s0} vs.\ the accumulator register \texttt{r0}} Some register machine architectures require one of the arguments for most arithmetic and logical operations to reside in a special register called the {\em accumulator}. In our comparison, we will assume that the accumulator is the general-purpose register \texttt{r0}; otherwise we could simply renumber the registers. In this respect, the accumulator is somewhat similar to the top-of-stack ``register'' \texttt{s0} of a stack machine, because virtually all operations of a stack machine both use \texttt{s0} as one of their arguments and return their result as \texttt{s0}.

\nxsubpoint\emb{Register calling conventions} When compiled for a register machine, high-level language functions usually receive their arguments in certain registers in a predefined order. If there are too many arguments, these functions take the remainder from the stack (yes, a register machine usually has a stack, too!). Some register calling conventions pass no arguments in registers at all, however, and only use the stack (for example, the original calling conventions used in implementations of Pascal and C, although modern implementations of C use some registers as well). For simplicity, we will assume that up to $m\leq n$ function arguments are passed in registers, and that these registers are $\texttt{r0}$, $\texttt{r1}$, \dots, $\texttt{r}(m-1)$, in that order (if some other registers are used, we can simply renumber them).\footnote{Our inclusion of $\texttt{r0}$ here creates a minor conflict with our assumption that the accumulator register, if present, is also \texttt{r0}; for simplicity, we will resolve this problem by assuming that the first argument to a function is passed in the accumulator.}

\nxsubpoint\label{sp:func.arg.ord}\emb{Order of function arguments} If a function or primitive requires $m$ arguments $x_1$, \dots, $x_m$, they are pushed by the caller into the stack in the same order, starting from $x_1$. Therefore, when the function or primitive is invoked, its first argument $x_1$ is in $\texttt{s}(m-1)$, its second argument $x_2$ is in $\texttt{s}(m-2)$, and so on. The last argument $x_m$ is in $\texttt{s0}$ (i.e., at the top of the stack). It is the called function or primitive's responsibility to remove its arguments from the stack. In this respect the TVM stack calling conventions---obeyed, at least, by TVM primitives---match those of Pascal and Forth, and are the opposite of those of C (in which the arguments are pushed into the stack in the reverse order, and are removed by the caller after it regains control, not the callee). Of course, an implementation of a high-level language for TVM might choose some other calling conventions for its functions, different from the default ones. This might be useful for certain functions---for instance, if the total number of arguments depends on the value of the first argument, as happens for ``variadic functions'' such as \texttt{scanf} and \texttt{printf}.
In such cases, the first one or several arguments are better passed near the top of the stack, not at some unknown location deep in the stack.

\nxsubpoint\label{sp:reg.op.arg}\emb{Arguments to arithmetic primitives on register machines} On a stack machine, built-in arithmetic primitives (such as \texttt{ADD} or \texttt{DIVMOD}) follow the same calling conventions as user-defined functions. In this respect, user-defined functions (for example, a function computing the square root of a number) might be considered as ``extensions'' or ``custom upgrades'' of the stack machine. This is one of the clearest advantages of stack machines (and of stack programming languages such as Forth) compared to register machines. In contrast, arithmetic instructions (built-in operations) on register machines usually get their parameters from general-purpose registers encoded in the full opcode. A binary operation, such as \texttt{SUB}, thus requires two arguments, $\texttt{r}(i)$ and $\texttt{r}(j)$, with $i$ and $j$ specified by the instruction. A register $\texttt{r}(k)$ for storing the result must also be specified. Arithmetic operations can take several possible forms, depending on whether $i$, $j$, and $k$ are allowed to take arbitrary values:
\begin{itemize}
\item {Three-address form} --- Allows the programmer to arbitrarily choose not only the two source registers $\texttt{r}(i)$ and $\texttt{r}(j)$, but also a separate destination register $\texttt{r}(k)$. This form is common for most RISC processors, and for the XMM and AVX SIMD instruction sets in the x86-64 architecture.
\item {Two-address form} --- Uses one of the two operand registers (usually $\texttt{r}(i)$) to store the result of an operation, so that $k=i$ is never indicated explicitly. Only $i$ and $j$ are encoded inside the instruction. This is the most common form of arithmetic operations on register machines, and is quite popular on microprocessors (including the x86 family).
\item {One-address form} --- Always takes one of the arguments from the accumulator \texttt{r0}, and stores the result in \texttt{r0} as well; then $i=k=0$, and only $j$ needs to be specified by the instruction. This form is used by some simpler microprocessors (such as the Intel 8080).
\end{itemize}
Note that this flexibility is available only for built-in operations, but not for user-defined functions. In this respect, register machines are not as easily ``upgradable'' as stack machines.\footnote{For instance, if one writes a function for extracting square roots, this function will always accept its argument and return its result in the same registers, in contrast with a hypothetical built-in square root instruction, which could allow the programmer to arbitrarily choose the source and destination registers. Therefore, a user-defined function is tremendously less flexible than a built-in instruction on a register machine.}

\nxsubpoint\emb{Return values of functions} In stack machines such as TVM, when a function or primitive needs to return a result value, it simply pushes it into the stack (from which all arguments to the function have already been removed). Therefore, the caller will be able to access the result value through the top-of-stack ``register'' \texttt{s0}. This is in complete accordance with Forth calling conventions, but differs slightly from Pascal and C calling conventions, where the accumulator register \texttt{r0} is normally used for the return value.
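To illustrate the two preceding subpoints, the square-root function mentioned above can be sketched in Python on a stack modeled by a list (Python's \texttt{math.isqrt} stands in for an actual square-root implementation); note that the user-defined function is invoked in exactly the same way as a built-in primitive:
\begin{verbatim}
import math

def sqrt_func(stack):           # a hypothetical user-defined function
    x = stack.pop()             # takes its argument from s0 ...
    stack.append(math.isqrt(x)) # ... and returns its result in s0

def add_prim(stack):            # the built-in ADD, for comparison
    y, x = stack.pop(), stack.pop()
    stack.append(x + y)

stack = [9, 16]                 # the last element plays the role of s0
sqrt_func(stack)                # called exactly like a primitive ...
add_prim(stack)                 # ... and freely composed with them
assert stack == [13]            # 9 + isqrt(16) = 13
\end{verbatim}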
\nxsubpoint\emb{Returning several values} Some functions might want to return several values $y_1$, \dots, $y_k$, with $k$ not necessarily equal to one. In these cases, the $k$ return values are pushed into the stack in their natural order, starting from $y_1$. For example, the ``divide with remainder'' primitive \texttt{DIVMOD} needs to return two values, the quotient $q$ and the remainder $r$. Therefore, \texttt{DIVMOD} pushes $q$ and $r$ into the stack, in that order, so that the quotient is available thereafter at \texttt{s1} and the remainder at \texttt{s0}. The net effect of \texttt{DIVMOD} is to divide the original value of \texttt{s1} by the original value of \texttt{s0}, and return the quotient in \texttt{s1} and the remainder in \texttt{s0}. In this particular case the depth of the stack and the values of all other ``stack registers'' remain unchanged, because \texttt{DIVMOD} takes two arguments and returns two results. In general, the values of other ``stack registers'' that lie in the stack below the arguments passed and the values returned are shifted according to the change of the depth of the stack.

In principle, some primitives and user-defined functions might return a variable number of result values. In this respect, the remarks above about variadic functions (cf.~\ptref{sp:func.arg.ord}) apply: the total number of result values and their types should be determined by the values near the top of the stack. (For example, one might push the return values $y_1$, \dots, $y_k$, and then push their total number $k$ as an integer. The caller would then determine the total number of returned values by inspecting \texttt{s0}.) In this respect TVM, again, faithfully observes Forth calling conventions.

\nxsubpoint\label{sp:stack.notat}\emb{Stack notation} When a stack of depth $n$ contains values $z_1$, \dots, $z_n$, in that order, with $z_1$ the deepest element and $z_n$ the top of the stack, the contents of the stack are often represented by a list $z_1$ $z_2$ \dots $z_n$, in that order. When a primitive transforms the original stack state $S'$ into a new state $S''$, this is often written as $S'$ -- $S''$; this is the so-called {\em stack notation}. For example, the action of the division primitive \texttt{DIV} can be described by $S$ $x$ $y$ -- $S$ $\lfloor x/y\rfloor$, where $S$ is any list of values. This is usually abbreviated as $x$ $y$ -- $\lfloor x/y\rfloor$, tacitly assuming that all other values deeper in the stack remain intact. Alternatively, one can describe \texttt{DIV} as a primitive that runs on a stack $S'$ of depth $n\geq2$, divides \texttt{s1} by \texttt{s0}, and returns the floor-rounded quotient as \texttt{s0} of the new stack $S''$ of depth $n-1$. The new value of $\texttt{s}(i)$ equals the old value of $\texttt{s}(i+1)$ for $1\leq i<n-1$.

\clearpage

\mysection{Cells, memory, and persistent storage}

This chapter discusses TVM {\em cells}, used to represent all data structures inside the TVM memory and its persistent storage, including the {\em exotic\/} cells employed to represent pruned branches, library references, Merkle proofs, and Merkle updates.

\mysubsection{Generalities on cells}

Apart from its representation hash $\Hash(c)$, every cell $c$ has a {\em level\/} $l=\Lvl(c)$, with $0\leq l\leq3$, and $l$ {\em higher hashes\/} $\Hash_1(c)$, \dots, $\Hash_l(c)$; it is convenient to set $\Hash_i(c)=\Hash(c)$ for $i>l$.\footnote{From a theoretical perspective, we might say that a cell $c$ has an infinite sequence of hashes $\bigl(\Hash_i(c)\bigr)_{i\geq1}$, which eventually stabilizes: $\Hash_i(c)\to\Hash_\infty(c)$. Then the level $l$ is simply the largest index~$i$, such that $\Hash_i(c)\neq\Hash_\infty(c)$.} The level of an ordinary cell is the maximum of the levels of its references, or zero if it has none; the levels of exotic cells are determined by their types, as explained below.

\nxsubpoint\label{sp:exotic.cell.types}\emb{Types of exotic cells} TVM currently supports the following cell types:
\begin{itemize}
\item Type $-1$: {\em Ordinary cell} --- Contains up to 1023 bits of data and up to four cell references.
\item Type 1: {\em Pruned branch cell~$c$} --- May have any level $1\leq l\leq 3$.
It contains exactly $8+256l$ data bits: first an 8-bit integer equal to 1 (representing the cell's type), then its $l$ higher hashes $\Hash_1(c)$, \dots, $\Hash_l(c)$. The level $l$ of a pruned branch cell may be called its {\em de Bruijn index}, because it determines the outer Merkle proof or Merkle update during the construction of which the branch has been pruned. An attempt to load a pruned branch cell usually leads to an exception.
\item Type 2: {\em Library reference cell} --- Always has level 0, and contains $8+256$ data bits, including its 8-bit type integer 2 and the representation hash $\Hash(c')$ of the library cell being referred to. When loaded, a library reference cell may be transparently replaced by the cell it refers to, if found in the current {\em library context}.
\item Type 3: {\em Merkle proof cell $c$} --- Has exactly one reference $c_1$ and level $0\leq l\leq 3$, which must be one less than the level of its only child $c_1$:
\begin{equation}
\Lvl(c)=\max(\Lvl(c_1)-1,0)
\end{equation}
The $8+256$ data bits of a Merkle proof cell contain its 8-bit type integer 3, followed by $\Hash_1(c_1)$ (assumed to be equal to $\Hash(c_1)$ if $\Lvl(c_1)=0$). The higher hashes $\Hash_i(c)$ of $c$ are computed similarly to the higher hashes of an ordinary cell, but with $\Hash_{i+1}(c_1)$ used instead of $\Hash_i(c_1)$. When loaded, a Merkle proof cell is replaced by $c_1$.
\item Type 4: {\em Merkle update cell $c$} --- Has two children $c_1$ and $c_2$. Its level $0\leq l\leq 3$ is given by
\begin{equation}
\Lvl(c)=\max(\Lvl(c_1)-1,\Lvl(c_2)-1,0)
\end{equation}
A Merkle update behaves like a Merkle proof for both $c_1$ and $c_2$, and contains $8+256+256$ data bits with $\Hash_1(c_1)$ and $\Hash_1(c_2)$. However, an extra requirement is that {\em all pruned branch cells $c'$ that are descendants of $c_2$ and are bound by $c$ must also be descendants of $c_1$.}\footnote{A pruned branch cell $c'$ of level $l$ is {\em bound\/} by a Merkle (proof or update) cell $c$ if there are exactly $l$ Merkle cells on the path from $c$ to its descendant $c'$, including~$c$.} When a Merkle update cell is loaded, it is replaced by $c_2$.
\end{itemize}

\nxsubpoint\label{sp:data.boc}\emb{All values of algebraic data types are trees of cells} Arbitrary values of arbitrary algebraic data types (e.g., all types used in functional programming languages) can be serialized into trees of cells (of level 0), and such representations are used for representing such values within TVM. The copy-on-write mechanism (cf.~\ptref{sp:cow}) allows TVM to identify cells containing the same data and references, and to keep only one copy of such cells. This actually transforms a tree of cells into a directed acyclic graph (with the additional property that all its vertices are accessible from a marked vertex called the ``root''). However, this is a storage optimization rather than an essential property of TVM. From the perspective of a TVM code programmer, one should think of TVM data structures as trees of cells.

\nxsubpoint\label{sp:code.boc}\emb{TVM code is a tree of cells} The TVM code itself is also represented by a tree of cells. Indeed, TVM code is simply a value of some complex algebraic data type, and as such, it can be serialized into a tree of cells. The exact way in which the TVM code (e.g., TVM assembly code) is transformed into a tree of cells is explained later (cf.~\ptref{sp:ord.cont.exec} and~\ptref{p:instr.encode}), in sections discussing control flow instructions, continuations, and TVM instruction encoding.
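Returning to the exotic cell types of~\ptref{sp:exotic.cell.types}, their level computation rules can be summarized by a short illustrative Python sketch (ours, not a TVM implementation); the rule for ordinary cells, whose level is the maximum of the levels of their references, follows the general definition of cell levels given above:
\begin{verbatim}
ORDINARY, PRUNED, LIBREF, MPROOF, MUPDATE = -1, 1, 2, 3, 4

def level(cell_type, child_levels=(), pruned_level=None):
    if cell_type == ORDINARY:   # maximum of the children's levels
        return max(child_levels, default=0)
    if cell_type == PRUNED:     # stored explicitly, 1 <= l <= 3
        return pruned_level
    if cell_type == LIBREF:     # library reference cells: always 0
        return 0
    if cell_type == MPROOF:     # one child c1
        return max(child_levels[0] - 1, 0)
    if cell_type == MUPDATE:    # two children c1 and c2
        return max(child_levels[0] - 1, child_levels[1] - 1, 0)

assert level(PRUNED, pruned_level=2) == 2
assert level(MPROOF, [1]) == 0      # the proof hides one pruning level
assert level(MUPDATE, [1, 2]) == 1
assert level(ORDINARY, [0, 2]) == 2
\end{verbatim}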
\nxsubpoint\emb{``Everything is a bag of cells'' paradigm} As described in \cite[2.5.14]{TON}, all the data used by the TON Blockchain, including the blocks themselves and the blockchain state, can be represented---and are represented---as collections, or ``bags'', of cells. We see that TVM's structure of data (cf.~\ptref{sp:data.boc}) and code (cf.~\ptref{sp:code.boc}) nicely fits into this ``everything is a bag of cells'' paradigm. In this way, TVM can naturally be used to execute smart contracts in the TON Blockchain, and the TON Blockchain can be used to store the code and persistent data of these smart contracts between invocations of TVM. (Of course, both TVM and the TON Blockchain have been designed so that this would become possible.) \mysubsection{Data manipulation instructions and cells}\label{p:cell.manip} The next large group of TVM instructions consists of {\em data manipulation instructions}, also known as {\em cell manipulation instructions\/} or simply {\em cell instructions}. They correspond to memory access instructions of other architectures. \nxsubpoint\emb{Classes of cell manipulation instructions} The TVM cell instructions are naturally subdivided into two principal classes: \begin{itemize} \item {\em Cell creation instructions\/} or {\em serialization instructions}, used to construct new cells from values previously kept in the stack and previously constructed cells. \item {\em Cell parsing instructions\/} or {\em deserialization instructions}, used to extract data previously stored into cells by cell creation instructions. \end{itemize} Additionally, there are {\em exotic cell instructions\/} used to create and inspect exotic cells (cf.~\ptref{sp:exotic.cells}), which in particular are used to represent pruned branches of Merkle proofs and Merkle proofs themselves. \nxsubpoint\label{sp:builder.slice.val}\emb{{\em Builder\/} and {\em Slice\/} values} Cell creation instructions usually work with {\em Builder\/} values, which can be kept only in the stack (cf.~\ptref{sp:val.types}). Such values represent partially constructed cells, for which fast operations for appending bitstrings, integers, other cells, and references to other cells can be defined. Similarly, cell parsing instructions make heavy use of {\em Slice\/} values, which represent either the remainder of a partially parsed cell, or a value (subcell) residing inside such a cell and extracted from it by a parsing instruction. \nxsubpoint\emb{{\em Builder\/} and {\em Slice\/} values exist only as stack values} Notice that {\em Builder\/} and {\em Slice\/} objects appear only as values in a TVM stack. They cannot be stored in ``memory'' (i.e., trees of cells) or ``persistent storage'' (which is also a bag of cells). In this sense, there are far more {\em Cell\/} objects than {\em Builder\/} or {\em Slice\/} objects in a TVM environment, but, somewhat paradoxically, a TVM program sees {\em Builder\/} and {\em Slice\/} objects in its stack more often than {\em Cell\/}s. In fact, a TVM program does not have much use for {\em Cell\/} values, because they are immutable and opaque; all cell manipulation primitives require that a {\em Cell\/} value be transformed into either a {\em Builder\/} or a {\em Slice\/} first, before it can be modified or inspected. \nxsubpoint\emb{TVM has no separate {\em Bitstring\/} value type} Notice that TVM offers no separate bitstring value type. Instead, bitstrings are represented by {\em Slice\/}s that happen to have no references at all, but can still contain up to 1023 data bits. 
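For illustration, here is a deliberately simplified Python model of {\em Builder\/} and {\em Slice\/} values (our sketch: only unsigned big-endian integers and the 1023-bit limit are modeled, while references, exotic cells, and most other primitives are omitted). The serialization primitives it mimics, such as \texttt{STU} and \texttt{LDU}, are described in the following subsections:
\begin{verbatim}
class CellOverflow(Exception):
    pass

class Builder:                       # a partially constructed cell
    def __init__(self):
        self.bits = ''

    def store_uint(self, x, n):      # roughly, STU n
        if not 0 <= x < 2**n:
            raise ValueError('range check')
        if len(self.bits) + n > 1023:
            raise CellOverflow()     # cell overflow exception
        self.bits += format(x, '0{}b'.format(n))
        return self

    def end_cell(self):              # roughly, ENDC
        return self.bits             # an immutable "Cell" (data only)

class Slice:                         # a read-only view of a cell
    def __init__(self, cell_bits):   # roughly, CTOS
        self.bits = cell_bits

    def load_uint(self, n):          # roughly, LDU n (a fetch operation)
        x, self.bits = int(self.bits[:n], 2), self.bits[n:]
        return x

cell = Builder().store_uint(1000000, 20).end_cell()  # like STU 20; ENDC
assert Slice(cell).load_uint(20) == 1000000          # like CTOS; LDU 20
\end{verbatim}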
\nxsubpoint\label{sp:cells.of.bits} \emb{Cells and cell primitives are bit-oriented, not byte-oriented} An important point is that {\em TVM regards data kept in cells as sequences (strings, streams) of (up to 1023) bits, not of bytes}. In other words, TVM is a {\em bit-oriented machine}, not a byte-oriented machine. If necessary, an application is free to use, say, 21-bit integer fields inside records serialized into TVM cells, thus using fewer persistent storage bytes to represent the same data.

\nxsubpoint\label{sp:cc.taxonomy} \emb{Taxonomy of cell creation (serialization) primitives} Cell creation primitives usually accept a {\em Builder\/} argument and an argument representing the value to be serialized. Additional arguments controlling some aspects of the serialization process (e.g., how many bits should be used for serialization) can also be provided, either in the stack or as an immediate value inside the instruction. The result of a cell creation primitive is usually another {\em Builder}, representing the concatenation of the original builder and the serialization of the value provided.

Therefore, one can suggest a classification of cell serialization primitives according to the answers to the following questions:
\begin{itemize}
\item What is the type of the values being serialized?
\item How many bits are used for serialization? If this is a variable number, does it come from the stack, or from the instruction itself?
\item What happens if the value does not fit into the prescribed number of bits? Is an exception generated, or is a success flag equal to zero silently returned on the top of the stack?
\item What happens if there is insufficient space left in the {\em Builder}? Is an exception generated, or is a zero success flag returned along with the unmodified original {\em Builder}?
\end{itemize}
The mnemonics of cell serialization primitives usually begin with \texttt{ST}. Subsequent letters describe the following attributes:
\begin{itemize}
\item The type of values being serialized and the serialization format (e.g., \texttt{I} for signed integers, \texttt{U} for unsigned integers).
\item The source of the field width in bits to be used (e.g., \texttt{X} for integer serialization instructions means that the bit width $n$ is supplied in the stack; otherwise it has to be embedded into the instruction as an immediate value).
\item The action to be performed if the operation cannot be completed (by default, an exception is generated; ``quiet'' versions of serialization instructions are marked by a \texttt{Q} letter in their mnemonics).
\end{itemize}
This classification scheme is used to create a more complete taxonomy of cell serialization primitives, which can be found in~\ptref{sp:prim.ser}.

\nxsubpoint\emb{Integer serialization primitives} Integer serialization primitives can be classified according to the above taxonomy as well. For example:
\begin{itemize}
\item There are signed and unsigned (big-endian) integer serialization primitives.
\item The size $n$ of the bit field to be used ($1\leq n\leq 257$ for signed integers, $0\leq n\leq 256$ for unsigned integers) can either come from the top of the stack or be embedded into the instruction itself.
\item If the integer $x$ to be serialized is not in the range $-2^{n-1}\leq x<2^{n-1}$ (for signed integer serialization) or $0\leq x<2^n$ (for unsigned integer serialization), a range check exception is usually generated, and if $n$ bits cannot be stored into the provided {\em Builder}, a cell overflow exception is generated.
\item Quiet versions of serialization instructions do not throw exceptions; instead, they push \texttt{-1} on top of the resulting {\em Builder} upon success, or return the original {\em Builder} with \texttt{0} on top of it to indicate failure.
\end{itemize}
Integer serialization instructions have mnemonics like \texttt{STU 20} (``store an unsigned 20-bit integer value'') or \texttt{STIXQ} (``quietly store an integer value of variable length provided in the stack''). The full list of these instructions---including their mnemonics, descriptions, and opcodes---is provided in~\ptref{sp:prim.ser}.

\nxsubpoint\emb{Integers in cells are big-endian by default} Notice that the default order of bits in {\em Integer\/}s serialized into {\em Cell\/}s is {\em big-endian}, not little-endian.\footnote{Negative numbers are represented using two's complement. For instance, integer $-17$ is serialized by instruction {\tt STI 8} into bitstring {\tt xEF}.} In this respect {\em TVM is a big-endian machine}. However, this affects only the serialization of integers inside cells. The internal representation of the {\em Integer\/} value type is implementation-dependent and irrelevant for the operation of TVM. Besides, there are some special primitives such as {\tt STULE} for (de)serializing little-endian integers, which must be stored into an integral number of bytes (otherwise ``little-endianness'' does not make sense, unless one is also willing to reverse the order of bits inside octets). Such primitives are useful for interfacing with the little-endian world---for instance, for parsing custom-format messages arriving at a TON Blockchain smart contract from the outside world.

\nxsubpoint\emb{Other serialization primitives} Other cell creation primitives serialize bitstrings (i.e., cell slices without references), either taken from the stack or supplied as literal arguments; cell slices (which are concatenated to the cell builder in an obvious way); other {\em Builder\/}s (which are also concatenated); and cell references (\texttt{STREF}).

\nxsubpoint\emb{Other cell creation primitives} In addition to the cell serialization primitives for certain built-in value types described above, there are simple primitives that create a new empty {\em Builder\/} and push it into the stack (\texttt{NEWC}), or transform a {\em Builder} into a {\em Cell} (\texttt{ENDC}), thus finishing the cell creation process. An \texttt{ENDC} can be combined with a \texttt{STREF} into a single instruction \texttt{ENDCST}, which finishes the creation of a cell and immediately stores a reference to it in an ``outer'' {\em Builder}. There are also primitives that obtain the quantity of data bits or references already stored in a {\em Builder}, and check how many data bits or references can be stored.

\nxsubpoint\label{sp:cd.taxonomy}\emb{Taxonomy of cell deserialization primitives} Cell parsing, or deserialization, primitives can be classified as described in~\ptref{sp:cc.taxonomy}, with the following modifications:
\begin{itemize}
\item They work with {\em Slice\/}s (representing the remainder of the cell being parsed) instead of {\em Builder\/}s.
\item They return deserialized values instead of accepting them as arguments.
\item They may come in two flavors, depending on whether they remove the deserialized portion from the {\em Slice\/} supplied (``fetch operations'') or leave it unmodified (``prefetch operations'').
\item Their mnemonics usually begin with \texttt{LD} (or \texttt{PLD} for prefetch operations) instead of \texttt{ST}.
\end{itemize}
For example, an unsigned big-endian 20-bit integer previously serialized into a cell by a \texttt{STU 20} instruction is likely to be deserialized later by a matching \texttt{LDU 20} instruction. Again, more detailed information about these instructions is provided in~\ptref{sp:prim.deser}.

\nxsubpoint\emb{Other cell slice primitives} In addition to the cell deserialisation primitives outlined above, TVM provides some obvious primitives for initializing and completing the cell deserialization process. For instance, one can convert a {\em Cell\/} into a {\em Slice\/} (\texttt{CTOS}), so that its deserialisation might begin; or check whether a {\em Slice\/} is empty, and generate an exception if it is not (\texttt{ENDS}); or deserialize a cell reference and immediately convert it into a {\em Slice\/} (\texttt{LDREFTOS}, equivalent to two instructions \texttt{LDREF} and \texttt{CTOS}).

\nxsubpoint\label{sp:mod.val.cell}\emb{Modifying a serialized value in a cell} The reader might wonder how the values serialized inside a cell may be modified. Suppose a cell contains three serialized 29-bit integers, $(x,y,z)$, representing the coordinates of a point in space, and we want to replace $y$ with $y'=y+1$, leaving the other coordinates intact. How would we achieve this? TVM does not offer any ways to modify existing values (cf.~\ptref{sp:no.refs} and \ptref{sp:no.cyclic}), so our example can only be accomplished with a series of operations as follows:
\begin{enumerate}
\item Deserialize the original cell into three {\em Integer\/}s $x$, $y$, $z$ in the stack (e.g., by \texttt{CTOS; LDI 29; LDI 29; LDI 29; ENDS}).
\item Increase $y$ by one (e.g., by \texttt{SWAP; INC; SWAP}).
\item Finally, serialize the resulting {\em Integer\/}s into a new cell (e.g., by \texttt{XCHG s2; NEWC; STI 29; STI 29; STI 29; ENDC}).
\end{enumerate}

\nxsubpoint\emb{Modifying the persistent storage of a smart contract} If the TVM code wants to modify its persistent storage, represented by the tree of cells rooted at {\tt c4}, it simply needs to rewrite control register {\tt c4} by the root of the tree of cells containing the new value of its persistent storage. (If only part of the persistent storage needs to be modified, cf.~\ptref{sp:mod.val.cell}.)

\mysubsection{Hashmaps, or dictionaries}\label{p:hashmaps}

{\em Hashmaps\/}, or {\em dictionaries}, are a specific data structure represented by a tree of cells. Essentially, a hashmap represents a map from {\em keys}, which are bitstrings of either fixed or variable length, into {\em values\/} of an arbitrary type~$X$, in such a way that fast lookups and modifications are possible. While any such structure might be inspected or modified with the aid of generic cell serialization and deserialization primitives, TVM introduces special primitives to facilitate working with these hashmaps.

\nxsubpoint\emb{Basic hashmap types} The two most basic hashmap types predefined in TVM are $\HashmapE\ n\ X$ or $\HashmapE(n,X)$, which represents a partially defined map from $n$-bit strings (called {\em keys}), for some fixed $0\leq n\leq 1023$, into {\em values\/} of some type $X$, and $\Hashmap(n,X)$, which is similar to $\HashmapE(n,X)$ but is not allowed to be empty (i.e., it must contain at least one key-value pair). Other hashmap types are also available---for example, one with keys of arbitrary length up to some predefined bound (up to 1023 bits).
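The abstract structure behind these types is described in the next subsection. As a concrete preview, here is a minimal Python sketch of lookup in a compact binary trie (purely illustrative, with invented names, and assuming keys of fixed length), instantiated for the three-key dictionary used as a running example later in this section:
\begin{verbatim}
# Lookup in a compact binary trie (Patricia tree) with fixed-length
# bitstring keys. Leaves hold values; every fork has exactly two
# children, and the first bit of each child's edge label is implicit
# (the left child starts with 0, the right child with 1).
class Edge:
    def __init__(self, label, node):
        self.label, self.node = label, node  # node: value or (left, right)

def lookup(edge, key):
    if not key.startswith(edge.label):
        return None                          # key absent from dictionary
    key = key[len(edge.label):]
    if key == "":
        return edge.node                     # leaf: the stored value
    left, right = edge.node                  # fork
    child = left if key[0] == "0" else right
    return lookup(child, key[1:])            # first label bit is implicit

# The dictionary with 16-bit keys 13, 17, and 239 from the example below:
D = Edge("0" * 8, (Edge("00", (Edge("1101", 169), Edge("0001", 289))),
                   Edge("1101111", 57121)))
assert lookup(D, format(13, "016b")) == 169
assert lookup(D, format(239, "016b")) == 57121
\end{verbatim}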
\nxsubpoint\label{sp:hm.patricia}\emb{Hashmaps as Patricia trees} The abstract representation of a hashmap in TVM is a {\em Patricia tree}, or a {\em compact binary trie}. It is a binary tree with edges labelled by bitstrings, such that the concatenation of all edge labels on a path from the root to a leaf equals a key of the hashmap. The corresponding value is kept in this leaf (for hashmaps with keys of fixed length), or optionally in the intermediate vertices as well (for hashmaps with keys of variable length). Furthermore, any intermediate vertex must have two children, and the label of the left child must begin with a binary zero, while the label of the right child must begin with a binary one. This enables us not to store the first bit of the edge labels explicitly. It is easy to see that any collection of key-value pairs (with distinct keys) is represented by a unique Patricia tree.

\nxsubpoint\label{sp:hm.tlb}\emb{Serialization of hashmaps} The serialization of a hashmap into a tree of cells (or, more generally, into a {\em Slice\/}) is defined by the following TL-B scheme:\footnote{A description of an older version of TL may be found at~\url{https://core.telegram.org/mtproto/TL}.}
\begin{verbatim}
bit#_ _:(## 1) = Bit;

hm_edge#_ {n:#} {X:Type} {l:#} {m:#} label:(HmLabel ~l n)
          {n = (~m) + l} node:(HashmapNode m X) = Hashmap n X;

hmn_leaf#_ {X:Type} value:X = HashmapNode 0 X;
hmn_fork#_ {n:#} {X:Type} left:^(Hashmap n X)
           right:^(Hashmap n X) = HashmapNode (n + 1) X;

hml_short$0 {m:#} {n:#} len:(Unary ~n) s:(n * Bit) = HmLabel ~n m;
hml_long$10 {m:#} n:(#<= m) s:(n * Bit) = HmLabel ~n m;
hml_same$11 {m:#} v:Bit n:(#<= m) = HmLabel ~n m;

unary_zero$0 = Unary ~0;
unary_succ$1 {n:#} x:(Unary ~n) = Unary ~(n + 1);

hme_empty$0 {n:#} {X:Type} = HashmapE n X;
hme_root$1 {n:#} {X:Type} root:^(Hashmap n X) = HashmapE n X;

true#_ = True;
_ {n:#} _:(Hashmap n True) = BitstringSet n;
\end{verbatim}

\nxsubpoint\label{sp:tlb.brief}\emb{Brief explanation of TL-B schemes} A TL-B scheme, like the one above, includes the following components.

The right-hand side of each ``equation'' is a {\em type}, either simple (such as {\tt Bit} or {\tt True}) or parametrized (such as {\tt Hashmap $n$ $X$}). The parameters of a type must be either natural numbers (i.e., non-negative integers, which are required to fit into 32 bits in practice), such as $n$ in {\tt Hashmap $n$ $X$}, or other types, such as $X$ in {\tt Hashmap $n$ $X$}.

The left-hand side of each equation describes a way to define, or even to serialize, a value of the type indicated in the right-hand side. Such a description begins with the name of a {\em constructor}, such as {\tt hm\_edge} or {\tt hml\_long}, immediately followed by an optional {\em constructor tag}, such as {\tt \#\_} or {\tt \$10}, which describes the bitstring used to encode (serialize) the constructor in question. Such tags may be given in either binary (after a dollar sign) or hexadecimal notation (after a hash sign), using the conventions described in \ptref{p:bitstring.hex}. If a tag is not explicitly provided, TL-B computes a default 32-bit constructor tag by hashing the text of the ``equation'' defining this constructor in a certain fashion. Therefore, empty tags must be explicitly provided by {\tt \#\_} or {\tt \$\_}. All constructor names must be distinct, and constructor tags for the same type must constitute a prefix code (otherwise the deserialization would not be unique). The constructor and its optional tag are followed by {\em field definitions}.
Each field definition is of the form ${\textit{ident}}:{\textit{type-expr}}$, where ${\textit{ident}}\/$ is an identifier with the name of the field\footnote{The field's name is useful for representing values of the type being defined in human-readable form, but it does not affect the binary serialization.} (replaced by an underscore for anonymous fields), and {\textit{type-expr}} is the field's type. The type provided here is a {\em type expression}, which may include simple types or parametrized types with suitable parameters. {\em Variables}---i.e., the (identifiers of the) previously defined fields of types $\#$ (natural numbers) or $\Type$ (type of types)---may be used as parameters for the parametrized types. The serialization process recursively serializes each field according to its type, and the serialization of a value ultimately consists of the concatenation of bitstrings representing the constructor (i.e., the constructor tag) and the field values. Some fields may be {\em implicit}. Their definitions are surrounded by curly braces, which indicate that the field is not actually present in the serialization, but that its value must be deduced from other data (usually the parameters of the type being serialized). Some occurrences of ``variables'' (i.e., already-defined fields) are prefixed by a tilde. This indicates that the variable's occurrence is used in the opposite way of the default behavior: in the left-hand side of the equation, it means that the variable will be deduced (computed) based on this occurrence, instead of substituting its previously computed value; in the right-hand side, conversely, it means that the variable will not be deduced from the type being serialized, but rather that it will be computed during the deserialization process. In other words, a tilde transforms an ``input argument'' into an ``output argument'', and vice versa.\footnote{This is the ``linear negation'' operation $(-)^\perp$ of linear logic, hence our notation \texttt{\~}.} Finally, some equalities may be included in curly brackets as well. These are certain ``equations'', which must be satisfied by the ``variables'' included in them. If one of the variables is prefixed by a tilde, its value will be uniquely determined by the values of all other variables participating in the equation (which must be known at this point) when the definition is processed from the left to the right. A caret (\texttt{\caret}) preceding a type $X$ means that instead of serializing a value of type $X$ as a bitstring inside the current cell, we place this value into a separate cell, and add a reference to it into the current cell. Therefore \texttt{\caret$X$} means ``the type of references to cells containing values of type $X$''. Parametrized type \texttt{\#<= $p$} with $p:\texttt{\#}$ (this notation means ``$p$ of type \texttt{\#}'', i.e., a natural number) denotes the subtype of the natural numbers type $\#$, consisting of integers $0\ldots p$; it is serialized into $\lceil\log_2(p+1)\rceil$ bits as an unsigned big-endian integer. Type \texttt{\#} by itself is serialized as an unsigned 32-bit integer. Parametrized type \texttt{\#\# $b$} with $b:\texttt{\#<=}31$ is equivalent to \texttt{\#<= $2^b-1$} (i.e., it is an unsigned $b$-bit integer). \nxsubpoint\emb{Application to the serialization of hashmaps} Let us explain the net result of applying the general rules described in \ptref{sp:tlb.brief} to the TL-B scheme presented in \ptref{sp:hm.tlb}. 
Suppose we wish to serialize a value of type $\HashmapE$ $n$ $X$ for some integer $0\leq n\leq 1023$ and some type $X$ (i.e., a dictionary with $n$-bit keys and values of type~$X$, admitting an abstract representation as a Patricia tree (cf.~\ptref{sp:hm.patricia})). First of all, if our dictionary is empty, it is serialized into a single binary {\tt 0}, which is the tag of nullary constructor {\tt hme\_empty}. Otherwise, its serialization consists of a binary {\tt 1} (the tag of {\tt hme\_root}), along with a reference to a cell containing the serialization of a value of type $\Hashmap$ $n$ $X$ (i.e., a necessarily non-empty dictionary). The only way to serialize a value of type $\Hashmap$ $n$ $X$ is given by the {\tt hm\_edge} constructor, which instructs us to serialize first the label {\tt label} of the edge leading to the root of the subtree under consideration (i.e., the common prefix of all keys in our (sub)dictionary). This label is of type {\tt HmLabel $l^\perp$ $n$}, which means that it is a bitstring of length at most $n$, serialized in such a way that the true length $l$ of the label, $0\leq l\leq n$, becomes known from the serialization of the label. (This special serialization method is described separately in~\ptref{sp:hm.label.ser}.) The label must be followed by the serialization of a {\tt node} of type {\em Hashmap\-Node $m$ $X$}, where $m=n-l$. It corresponds to a vertex of the Patricia tree, representing a non-empty subdictionary of the original dictionary with $m$-bit keys, obtained by removing from all the keys of the original subdictionary their common prefix of length~$l$. If $m=0$, a value of type {\tt HashmapNode $0$ $X$} is given by the {\tt hmn\_leaf} constructor, which describes a leaf of the Patricia tree---or, equivalently, a subdictionary with $0$-bit keys. A leaf simply consists of the corresponding {\tt value} of type $X$ and is serialized accordingly. On the other hand, if $m>0$, a value of type {\tt HashmapNode $m$ $X$} corresponds to a fork (i.e., an intermediate node) in the Patricia tree, and is given by the {\tt hmn\_fork} constructor. Its serialization consists of {\tt left} and {\tt right}, two references to cells containing values of type {\tt Hashmap $m-1$ $X$}, which correspond to the left and the right child of the intermediate node in question---or, equivalently, to the two subdictionaries of the original dictionary consisting of key-value pairs with keys beginning with a binary {\tt 0} or a binary {\tt 1}, respectively. Because the first bit of all keys in each of these subdictionaries is known and fixed, it is removed, and the resulting (necessarily non-empty) subdictionaries are recursively serialized as values of type {\tt Hashmap $m-1$ $X$}. \nxsubpoint\label{sp:hm.label.ser}\emb{Serialization of labels} There are several ways to serialize a label of length at most $n$, if its exact length is $l\leq n$ (recall that the exact length must be deducible from the serialization of the label itself, while the upper bound $n$ is known before the label is serialized or deserialized). These ways are described by the three constructors {\tt hml\_short}, {\tt hml\_long}, and {\tt hml\_same} of type {\tt HmLabel $l^\perp$ $n$}: \begin{itemize} \item {\tt hml\_short} --- Describes a way to serialize ``short'' labels, of small length $l\leq n$. 
Such a serialization consists of a binary {\tt 0} (the constructor tag of {\tt hml\_short}), followed by $l$ binary {\tt 1}s and one binary {\tt 0} (the unary representation of the length $l$), followed by $l$ bits comprising the label itself.
\item {\tt hml\_long} --- Describes a way to serialize ``long'' labels, of arbitrary length $l\leq n$. Such a serialization consists of a binary {\tt 10} (the constructor tag of {\tt hml\_long}), followed by the big-endian binary representation of the length $0\leq l\leq n$ in $\lceil\log_2(n+1)\rceil$ bits, followed by $l$ bits comprising the label itself.
\item {\tt hml\_same} --- Describes a way to serialize ``long'' labels, consisting of $l$ repetitions of the same bit $v$. Such a serialization consists of {\tt 11} (the constructor tag of {\tt hml\_same}), followed by the bit $v$, followed by the length $l$ stored in $\lceil\log_2(n+1)\rceil$ bits as before.
\end{itemize}
Each label can always be serialized in at least two different fashions, using {\tt hml\_short} or {\tt hml\_long} constructors. Usually the shortest serialization (and in the case of a tie---the lexicographically smallest among the shortest) is preferred and is generated by TVM hashmap primitives, while the other variants are still considered valid. This label encoding scheme has been designed to be efficient for dictionaries with ``random'' keys (e.g., hashes of some data), as well as for dictionaries with ``regular'' keys (e.g., big-endian representations of integers in some range).
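This choice rule is easy to express in code. The following Python sketch (illustrative only; the function names are invented) encodes a label of known exact length under an upper bound $n\geq1$ and picks the variant that TVM hashmap primitives would generate:
\begin{verbatim}
# Encoders for the three HmLabel constructors; labels are plain Python
# bitstrings, and n >= 1 is the known upper bound on the label length.
# For n >= 1, ceil(log2(n+1)) equals n.bit_length().
def hml_short(s):
    return "0" + "1" * len(s) + "0" + s    # tag, unary length, label

def hml_long(s, n):
    return "10" + format(len(s), "0%db" % n.bit_length()) + s

def hml_same(s, n):
    return "11" + s[0] + format(len(s), "0%db" % n.bit_length())

def encode_label(s, n):
    # shortest serialization wins; ties go to the lexicographically
    # smallest variant, as generated by TVM hashmap primitives
    variants = [hml_short(s), hml_long(s, n)]
    if s and s == s[0] * len(s):           # all bits equal: hml_same applies
        variants.append(hml_same(s, n))
    return min(variants, key=lambda bits: (len(bits), bits))

assert encode_label("0" * 8, 16) == "11" + "0" + "01000"
assert encode_label("00", 7) == "0" + "110" + "00"
\end{verbatim}
Both assertions match cells $A.0$ and $A.0.0$ of the worked example in the next subsection.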
\nxsubpoint\emb{An example of dictionary serialization} Consider a dictionary with three 16-bit keys $13$, $17$, and $239$ (considered as big-endian integers) and corresponding 16-bit values $169$, $289$, and $57121$. In binary form:
\begin{verbatim}
0000000000001101 => 0000000010101001
0000000000010001 => 0000000100100001
0000000011101111 => 1101111100100001
\end{verbatim}
The corresponding Patricia tree consists of a root $A$, two intermediate nodes $B$ and $C$, and three leaf nodes $D$, $E$, and $F$, corresponding to 13, 17, and 239, respectively. The root $A$ has only one child, $B$; the label on the edge $AB$ is $00000000=0^8$. The node $B$ has two children: its left child is an intermediate node $C$ with the edge $BC$ labelled by $(0)00$, while its right child is the leaf $F$ with $BF$ labelled by $(1)1101111$. Finally, $C$ has two leaf children $D$ and $E$, with $CD$ labelled by $(0)1101$ and $CE$ by $(1)0001$.

The corresponding value of type {\tt HashmapE 16 (\#\# 16)} may be written in human-readable form as:
\begin{verbatim}
(hme_root$1
  root:^(hm_edge label:(hml_same$11 v:0 n:8) node:(hmn_fork
    left:^(hm_edge label:(hml_short$0 len:$110 s:$00) node:(hmn_fork
      left:^(hm_edge label:(hml_long$10 n:4 s:$1101)
        node:(hmn_leaf value:169))
      right:^(hm_edge label:(hml_long$10 n:4 s:$0001)
        node:(hmn_leaf value:289))))
    right:^(hm_edge label:(hml_long$10 n:7 s:$1101111)
      node:(hmn_leaf value:57121)))))
\end{verbatim}
The serialization of this data structure into a tree of cells consists of six cells with the following binary data contained in them:
\begin{verbatim}
A       := 1
A.0     := 11 0 01000
A.0.0   := 0 110 00
A.0.0.0 := 10 100 1101 0000000010101001
A.0.0.1 := 10 100 0001 0000000100100001
A.0.1   := 10 111 1101111 1101111100100001
\end{verbatim}
Here $A$ is the root cell, $A.0$ is the cell at the first reference of $A$, $A.1$ is the cell at the second reference of $A$, and so on.

This tree of cells can be represented more compactly using the hexadecimal notation described in~\ptref{p:bitstring.hex}, using indentation to reflect the tree-of-cells structure:
\begin{verbatim}
C_
 C8
  62_
   A68054C_
   A08090C_
  BEFDF21
\end{verbatim}
A total of 93 data bits and 5 references in 6 cells have been used to serialize this dictionary. Notice that a straightforward representation of three 16-bit keys and their corresponding 16-bit values would already require 96 bits (albeit without any references), so this particular serialization turns out to be quite efficient.

\nxsubpoint\emb{Ways to describe the serialization of type~$X$} Notice that the built-in TVM primitives for dictionary manipulation need to know something about the serialization of type $X$; otherwise, they would not be able to work correctly with $\Hashmap$ $n$ $X$, because values of type~$X$ are immediately contained in the Patricia tree leaf cells. There are several options available to describe the serialization of type~$X$:
\begin{itemize}
\item The simplest case is when $X=\texttt{\caret} Y$ for some other type~$Y$. In this case the serialization of $X$ itself always consists of one reference to a cell, which in fact must contain a value of type~$Y$, something that is not relevant for dictionary manipulation primitives.
\item Another simple case is when the serialization of any value of type $X$ always consists of $0\leq b\leq 1023$ data bits and $0\leq r\leq 4$ references. Integers $b$ and $r$ can then be passed to a dictionary manipulation primitive as a simple description of~$X$. (Notice that the previous case corresponds to $b=0$, $r=1$.)
\item A more sophisticated case can be described by four integers $1\leq b_0,b_1\leq 1023$, $0\leq r_0,r_1\leq 4$, with $b_i$ and $r_i$ used when the first bit of the serialization equals~$i$. When $b_0=b_1$ and $r_0=r_1$, this case reduces to the previous one.
\item Finally, the most general description of the serialization of a type~$X$ is given by a {\em splitting function\/} $\textit{split}_X$ for~$X$, which accepts one {\em Slice\/} parameter $s$, and returns two {\em Slice\/}s, $s'$ and $s''$, where $s'$ is the only prefix of $s$ that is the serialization of a value of type $X$, and $s''$ is the remainder of $s$. If no such prefix exists, the splitting function is expected to throw an exception. Notice that a compiler for a high-level language, which supports some or all algebraic TL-B types, is likely to automatically generate splitting functions for all types defined in the program.
\end{itemize}
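For instance, splitting functions for the two simplest cases above might look as follows in Python (an illustrative sketch with invented names, modelling {\em Slice\/}s as plain bitstrings without references):
\begin{verbatim}
# split_uint16 handles the fixed-width case b = 16, r = 0;
# split_first_bit handles the case where b0 bits are consumed if the
# first bit of the serialization is 0, and b1 bits if it is 1.
def split_uint16(s):
    if len(s) < 16:
        raise ValueError("no valid prefix of type X")
    return s[:16], s[16:]

def split_first_bit(s, b0, b1):
    b = b0 if s and s[0] == "0" else b1
    if len(s) < b:
        raise ValueError("no valid prefix of type X")
    return s[:b], s[b:]

prefix, rest = split_uint16("0000000010101001" + "101")
assert int(prefix, 2) == 169 and rest == "101"
\end{verbatim}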
\nxsubpoint\emb{A simplifying assumption on the serialization of~$X$} One may notice that values of type $X$ always occupy the remaining part of an {\tt hm\_edge}/{\tt hmn\_leaf} cell inside the serialization of a $\HashmapE$ $n$ $X$. Therefore, if we do not insist on strict validation of all dictionaries accessed, we may assume that everything left unparsed in an {\tt hm\_edge}/{\tt hmn\_leaf} cell after deserializing its {\tt label} is a value of type~$X$. This greatly simplifies the creation of dictionary manipulation primitives, because in most cases they turn out not to need any information about~$X$ at all.

\nxsubpoint\label{sp:dict.ops}\emb{Basic dictionary operations} Let us present a classification of basic operations with dictionaries (i.e., values $D$ of type $\HashmapE$ $n$ $X$):
\begin{itemize}
\item $\textsc{Get}(D,k)$ --- Given $D:\HashmapE(n,X)$ and a key $k:n\cdot{\tt bit}$, returns the corresponding value $D[k]:X^?$ kept in~$D$.
\item $\textsc{Set}(D,k,x)$ --- Given $D:\HashmapE(n,X)$, a key $k:n\cdot{\tt bit}$, and a value $x:X$, sets $D'[k]$ to $x$ in a copy $D'$ of~$D$, and returns the resulting dictionary $D'$ (cf.~\ptref{sp:no.refs}).
\item $\textsc{Add}(D,k,x)$ --- Similar to {\sc Set}, but adds the key-value pair $(k,x)$ to $D$ only if key $k$ is absent in~$D$.
\item $\textsc{Replace}(D,k,x)$ --- Similar to {\sc Set}, but changes $D'[k]$ to $x$ only if key $k$ is already present in~$D$.
\item {\sc GetSet}, {\sc GetAdd}, {\sc GetReplace} --- Similar to \textsc{Set}, \textsc{Add}, and \textsc{Replace}, respectively, but also return the old value of $D[k]$.
\item $\textsc{Delete}(D,k)$ --- Deletes key $k$ from dictionary $D$, and returns the resulting dictionary~$D'$.
\item $\textsc{GetMin}(D)$, $\textsc{GetMax}(D)$ --- Get the minimal or maximal key~$k$ from dictionary~$D$, along with the associated value $x:X$.
\item $\textsc{RemoveMin}(D)$, $\textsc{RemoveMax}(D)$ --- Similar to \textsc{GetMin} and \textsc{GetMax}, but also remove the key in question from dictionary~$D$, and return the modified dictionary $D'$. May be used to iterate over all elements of~$D$, effectively using (a copy of)~$D$ itself as an iterator.
\item $\textsc{GetNext}(D,k)$ --- Computes the minimal key $k'>k$ (or $k'\geq k$ in a variant) and returns it along with the corresponding value $x':X$. May be used to iterate over all elements of~$D$.
\item $\textsc{GetPrev}(D,k)$ --- Computes the maximal key $k'<k$ (or $k'\leq k$ in a variant) and returns it along with the corresponding value $x':X$. May be used to iterate over all elements of~$D$ in reverse order.
\end{itemize}

\nxsubpoint\emb{Example: setting the number of arguments to a function in its code} The primitive {\tt LEAVEARGS $n$} demonstrates another application of continuations in an operation: it leaves only the top $n$ values of the current stack, and moves the remainder to the stack of the continuation in {\tt c0}. This primitive enables a called function to ``return'' unneeded arguments to its caller's stack, which is useful in some situations (e.g., those related to exception handling).

\nxsubpoint\emb{Boolean circuits} A continuation $c$ may be thought of as a piece of code with two optional exit points kept in the savelist of~$c$: the principal exit point given by $c.\texttt{c0}:=c.\texttt{save}(\texttt{c0})$, and the auxiliary exit point given by $c.\texttt{c1}:=c.\texttt{save}(\texttt{c1})$. If executed, a continuation performs whatever action it was created for, and then (usually) transfers control to the principal exit point, or, on some occasions, to the auxiliary exit point. We sometimes say that a continuation $c$ with both exit points $c.{\tt c0}$ and $c.{\tt c1}$ defined is a {\em two-exit continuation}, or a {\em boolean circuit}, especially if the choice of the exit point depends on some internally-checked condition.

\nxsubpoint\emb{Composition of continuations} One can {\em compose\/} two continuations $c$ and $c'$ simply by setting $c.\texttt{c0}$ or $c.\texttt{c1}$ to $c'$. This creates a new continuation denoted by $c\circ_0c'$ or $c\circ_1c'$, which differs from $c$ in its savelist. (Recall that if the savelist of $c$ already has an entry corresponding to the control register in question, such an operation silently does nothing as explained in~\ptref{sp:op.cont}.) By composing continuations, one can build chains or other graphs, possibly with loops, representing the control flow. In fact, the resulting graph resembles a flow chart, with the boolean circuits corresponding to the ``condition nodes'' (containing code that will transfer control either to {\tt c0} or to {\tt c1} depending on some condition), and the one-exit continuations corresponding to the ``action nodes''.
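The following Python sketch gives an informal model of this composition operation (the names and the representation are invented; actual TVM continuations are richer). It illustrates both the transfer of control through {\tt c0} and the rule that an existing savelist entry is left untouched:
\begin{verbatim}
# A continuation is modelled as code (a Python callable) plus a
# savelist; compose(c, c2) returns c o_0 c2 (reg="c1" gives o_1).
class Cont:
    def __init__(self, code, save=None):
        self.code, self.save = code, dict(save or {})

def compose(c, c2, reg="c0"):
    if reg in c.save:
        return c                    # entry already present: no effect
    return Cont(c.code, {**c.save, reg: c2})

def run(c):
    while c is not None:
        c.code()
        c = c.save.get("c0")        # jump to the principal exit point

first = Cont(lambda: print("first action"))
then = Cont(lambda: print("second action"))
run(compose(first, then))           # prints both lines, in order
\end{verbatim}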
\nxsubpoint\emb{Basic continuation composition primitives} Two basic primitives for composing continuations are {\tt COMPOS} (also known as {\tt SETCONT c0} and {\tt BOOLAND}) and {\tt COMPOSALT} (also known as {\tt SETCONT c1} and {\tt BOOLOR}), which take $c$ and $c'$ from the stack, set $c.\texttt{c0}$ or $c.\texttt{c1}$ to $c'$, and return the resulting continuation $c''=c\circ_0c'$ or $c\circ_1c'$. All other continuation composition operations can be expressed in terms of these two primitives.

\nxsubpoint\emb{Advanced continuation composition primitives} However, TVM can compose continuations not only taken from the stack, but also taken from {\tt c0} or {\tt c1}, or from the current continuation {\tt cc}; likewise, the result may be pushed into the stack, stored into either {\tt c0} or {\tt c1}, or used as the new current continuation (i.e., the control may be transferred to it). Furthermore, TVM can define conditional composition primitives, performing some of the above actions only if an integer value taken from the stack is non-zero. For instance, {\tt EXECUTE} can be described as ${\tt cc}\leftarrow c\circ_0\tt{cc}$, with continuation $c$ taken from the original stack. Similarly, {\tt JMPX} is ${\tt cc}\leftarrow c$, and {\tt RET} (also known as {\tt RETTRUE} in a boolean circuit context) is ${\tt cc}\leftarrow\tt{c0}$. Other interesting primitives include {\tt THENRET} ($c'\leftarrow c\circ_0{\tt c0}$) and {\tt ATEXIT} (${\tt c0}\leftarrow c\circ_0{\tt c0}$). Finally, some ``experimental'' primitives also involve {\tt c1} and $\circ_1$. For example:
\begin{itemize}
\item {\tt RETALT} or {\tt RETFALSE} does ${\tt cc}\leftarrow{\tt c1}$.
\item Conditional versions of {\tt RET} and {\tt RETALT} may also be useful: {\tt RETBOOL} takes an integer $x$ from the stack, and performs {\tt RETTRUE} if $x\neq0$, {\tt RETFALSE} otherwise.
\item {\tt INVERT} does ${\tt c0}\leftrightarrow{\tt c1}$; if the two continuations in {\tt c0} and {\tt c1} represent the two branches we should select depending on some boolean expression, {\tt INVERT} negates this expression on the outer level.
\item {\tt INVERTCONT} does $c.{\tt c0}\leftrightarrow c.{\tt c1}$ to a continuation $c$ taken from the stack.
\item Variants of {\tt ATEXIT} include {\tt ATEXITALT} (${\tt c1}\leftarrow c\circ_1{\tt c1}$) and {\tt SETEXITALT} (${\tt c1}\leftarrow (c\circ_0{\tt c0})\circ_1{\tt c1}$).
\item {\tt BOOLEVAL} takes a continuation $c$ from the stack and does ${\tt cc}\leftarrow \bigl((c\circ_0({\tt PUSH -1}))\circ_1({\tt PUSH 0})\bigr)\circ_0{\tt cc}$. If $c$ represents a boolean circuit, the net effect is to evaluate it and push either $-1$ or $0$ into the stack before continuing.
\end{itemize}

\mysubsection{Continuations as objects}

\nxsubpoint\label{sp:cont.obj}\emb{Representing objects using continuations} Object-oriented programming in Small\-talk (or Objective C) style may be implemented with the aid of continuations. For this, an object is represented by a special continuation $o$. If it has any data fields, they can be kept in the stack of $o$, making $o$ a partial application (i.e., a continuation with a non-empty stack).
When somebody wants to invoke a method $m$ of $o$ with arguments $x_1$, $x_2$, \dots, $x_n$, she pushes the arguments into the stack, then pushes a magic number corresponding to the method $m$, and then executes $o$ passing $n+1$ arguments (cf.~\ptref{sp:callx.num.args}). Then $o$ uses the top-of-stack integer $m$ to select the branch with the required method, and executes it. If $o$ needs to modify its state, it simply computes a new continuation $o'$ of the same sort (perhaps with the same code as $o$, but with a different initial stack). The new continuation $o'$ is returned to the caller along with whatever other return values need to be returned. \nxsubpoint\emb{Serializable objects} Another way of representing Smalltalk-style objects as continuations, or even as trees of cells, consists in using the {\tt JMPREFDATA} primitive (a variant of {\tt JMPXDATA}, cf.~\ptref{sp:call.cc}), which takes the first cell reference from the code of the current continuation, transforms the cell referred to into a simple ordinary continuation, and transfers control to it, first pushing the remainder of the current continuation as a {\em Slice} into the stack. In this way, an object might be represented by a cell $\tilde o$ that contains {\tt JMPREFDATA} at the beginning of its data, and the actual code of the object in the first reference (one might say that the first reference of cell $\tilde o$ is the {\em class} of object $\tilde o$). Remaining data and references of this cell will be used for storing the fields of the object. Such objects have the advantage of being trees of cells, and not just continuations, meaning that they can be stored into the persistent storage of a TON smart contract. \nxsubpoint\emb{Unique continuations and capabilities} It might make sense (in a future revision of TVM) to mark some continuations as {\em unique}, meaning that they cannot be copied, even in a delayed manner, by increasing their reference counter to a value greater than one. If an opaque continuation is unique, it essentially becomes a {\em capability}, which can either be used by its owner exactly once or be transferred to somebody else. For example, imagine a continuation that represents the output stream to a printer (this is an example of a continuation used as an object, cf.~\ptref{sp:cont.obj}). When invoked with one integer argument $n$, this continuation outputs the character with code $n$ to the printer, and returns a new continuation of the same kind reflecting the new state of the stream. Obviously, copying such a continuation and using the two copies in parallel would lead to some unintended side effects; marking it as unique would prohibit such adverse usage. \mysubsection{Exception handling} TVM's exception handling is quite simple and consists in a transfer of control to the continuation kept in control register {\tt c2}. \nxsubpoint\emb{Two arguments of the exception handler: exception parameter and exception number} Every exception is characterized by two arguments: the {\em exception number\/} (an {\em Integer}) and the {\em exception parameter\/} (any value, most often a zero {\em Integer}). Exception numbers 0--31 are reserved for TVM, while all other exception numbers are available for user-defined exceptions. \nxsubpoint\emb{Primitives for throwing an exception} There are several special primitives used for throwing an exception. 
The most general of them, {\tt THROWANY}, takes two arguments, $v$ and $0\leq n<2^{16}$, from the stack, and throws the exception with number $n$ and value $v$. There are variants of this primitive that assume $v$ to be a zero integer, store $n$ as a literal value, and/or are conditional on an integer value taken from the stack. User-defined exceptions may use arbitrary values as $v$ (e.g., trees of cells) if needed. \nxsubpoint\emb{Exceptions generated by TVM} Of course, some exceptions are generated by normal primitives. For example, an arithmetic overflow exception is generated whenever the result of an arithmetic operation does not fit into a signed 257-bit integer. In such cases, the arguments of the exception, $v$ and $n$, are determined by TVM itself. \nxsubpoint\emb{Exception handling} The exception handling itself consists in a control transfer to the exception handler---i.e., the continuation specified in control register {\tt c2}, with $v$ and $n$ supplied as the two arguments to this continuation, as if a {\tt JMP} to {\tt c2} had been requested with $n''=2$ arguments (cf.~\ptref{sp:jmp.sw.n} and~\ptref{sp:jmp.sw}). As a consequence, $v$ and $n$ end up in the top of the stack of the exception handler. The remainder of the old stack is discarded. Notice that if the continuation in {\tt c2} has a value for {\tt c2} in its savelist, it will be used to set up the new value of {\tt c2} before executing the exception handler. In particular, if the exception handler invokes \texttt{THROWANY}, it will re-throw the original exception with the restored value of {\tt c2}. This trick enables the exception handler to handle only some exceptions, and pass the rest to an outer exception handler. \nxsubpoint\emb{Default exception handler} When an instance of TVM is created, {\tt c2} contains a reference to the ``default exception handler continuation'', which is an {\tt ec\_fatal} extraordinary continuation (cf.~\ptref{sp:extraord.cont}). Its execution leads to the termination of the execution of TVM, with the arguments $v$ and $n$ of the exception returned to the outside caller. In the context of the TON Blockchain, $n$ will be stored as a part of the transaction's result. \nxsubpoint\emb{{\tt TRY} primitive} A {\tt TRY} primitive can be used to implement C++-like exception handling. This primitive accepts two continuations, $c$ and $c'$. It stores the old value of {\tt c2} into the savelist of $c'$, sets {\tt c2} to $c'$, and executes $c$ just as {\tt EXECUTE} would, but additionally saving the old value of {\tt c2} into the savelist of the new {\tt c0} as well. Usually a version of the {\tt TRY} primitive with an explicit number of arguments $n''$ passed to the continuation $c$ is used. The net result is roughly equivalent to C++'s {\tt try \{ $c$ \} catch(...) \{ $c'$ \}} operator. \nxsubpoint\label{sp:exc.list}\emb{List of predefined exceptions} Predefined exceptions of TVM correspond to exception numbers $n$ in the range 0--31. They include: \begin{itemize} \item {\em Normal termination} ($n=0$) --- Should never be generated, but it is useful for some tricks. \item {\em Alternative termination} ($n=1$) --- Again, should never be generated. \item {\em Stack underflow} ($n=2$) --- Not enough arguments in the stack for a primitive. \item {\em Stack overflow} ($n=3$) --- More values have been stored on a stack than allowed by this version of TVM. \item {\em Integer overflow} ($n=4$) --- Integer does not fit into $-2^{256}\leq x<2^{256}$, or a division by zero has occurred. 
\item {\em Range check error} ($n=5$) --- Integer out of expected range. \item {\em Invalid opcode} ($n=6$) --- Instruction or its immediate arguments cannot be decoded. \item {\em Type check error} ($n=7$) --- An argument to a primitive is of incorrect value type. \item {\em Cell overflow} ($n=8$) --- Error in one of the serialization primitives. \item {\em Cell underflow} ($n=9$) --- Deserialization error. \item {\em Dictionary error} ($n=10$) --- Error while deserializing a dictionary object. \item {\em Unknown error} ($n=11$) --- Unknown error, may be thrown by user programs. \item {\em Fatal error} ($n=12$) --- Thrown by TVM in situations deemed impossible. \item {\em Out of gas} ($n=13$) --- Thrown by TVM when the remaining gas ($g_r$) becomes negative. This exception usually cannot be caught and leads to an immediate termination of TVM. \end{itemize} Most of these exceptions have no parameter (i.e., use a zero integer instead). The order in which these exceptions are checked is outlined below in~\ptref{sp:exc.check.order}. \nxsubpoint\label{sp:exc.check.order} \emb{Order of stack underflow, type check, and range check exceptions} All TVM primitives first check whether the stack contains the required number of arguments, generating a stack underflow exception if this is not the case. Only then are the type tags of the arguments and their ranges (e.g., if a primitive expects an argument not only to be an {\em Integer}, but also to be in the range from 0 to 256) checked, starting from the value in the top of the stack (the last argument) and proceeding deeper into the stack. If an argument's type is incorrect, a type-checking exception is generated; if the type is correct, but the value does not fall into the expected range, a range check exception is generated. Some primitives accept a variable number of arguments, depending on the values of some small fixed subset of arguments located near the top of the stack. In this case, the above procedure is first run for all arguments from this small subset. Then it is repeated for the remaining arguments, once their number and types have been determined from the arguments already processed. \mysubsection{Functions, recursion, and dictionaries}\label{p:func.rec.dict} \nxsubpoint\emb{The problem of recursion} The conditional and iterated execution primitives described in~\ptref{p:cond.iter.exec}---along with the unconditional branch, call, and return primitives described in~\ptref{p:cont.subr}--- enable one to implement more or less arbitrary code with nested loops and conditional expressions, with one notable exception: one can only create new constant continuations from parts of the current continuation. (In particular, one cannot invoke a subroutine from itself in this way.) Therefore, the code being executed---i.e., the current continuation---gradually becomes smaller and smaller.\footnote{An important point here is that the tree of cells representing a TVM program cannot have cyclic references, so using {\tt CALLREF} along with a reference to a cell higher up the tree would not work.} \nxsubpoint\emb{$Y$-combinator solution: pass a continuation as an argument to itself} One way of dealing with the problem of recursion is by passing a copy of the continuation representing the body of a recursive function as an extra argument to itself. 
Consider, for example, the following code for a factorial function:
\begin{verbatim}
71 PUSHINT 1
9C PUSHCONT {
  22 PUSH s2
  72 PUSHINT 2
  B9 LESS
  DC IFRET
  59 ROTREV
  21 PUSH s1
  A8 MUL
  01 SWAP
  A5 DEC
  02 XCHG s2
  20 DUP
  D9 JMPX
}
20 DUP
D8 EXECUTE
30 DROP
31 NIP
\end{verbatim}
This roughly corresponds to defining an auxiliary function $\textit{body}$ with three arguments $n$, $x$, and $f$, such that $\textit{body}(n,x,f)$ equals $x$ if $n<2$ and $f(n-1,nx,f)$ otherwise, then invoking $\textit{body}(n,1,\textit{body})$ to compute the factorial of~$n$. The recursion is then implemented with the aid of the {\tt DUP}; {\tt EXECUTE} construction, or {\tt DUP}; {\tt JMPX} in the case of tail recursion. This trick is equivalent to applying the $Y$-combinator to the function $\textit{body}$.

\nxsubpoint\emb{A variant of the $Y$-combinator solution} Another way of recursively computing the factorial, more closely following the classical recursive definition
\begin{equation}
\textit{fact}(n):=
\begin{cases}
1&\quad\text{if $n<2$},\\
n\cdot\textit{fact}(n-1)&\quad\text{otherwise}
\end{cases}
\end{equation}
is as follows:
\begin{verbatim}
9D PUSHCONT {
  21 OVER
  C102 LESSINT 2
  92 PUSHCONT {
    5B 2DROP
    71 PUSHINT 1
  }
  E0 IFJMP
  21 OVER
  A5 DEC
  01 SWAP
  20 DUP
  D8 EXECUTE
  A8 MUL
}
20 DUP
D9 JMPX
\end{verbatim}
This definition of the factorial function is two bytes shorter than the previous one, but it uses general recursion instead of tail recursion, so it cannot be easily transformed into a loop.

\nxsubpoint\emb{Comparison: non-recursive definition of the factorial function} Incidentally, a non-recursive definition of the factorial with the aid of a {\tt REPEAT} loop is also possible, and it is much shorter than both recursive definitions:
\begin{verbatim}
71 PUSHINT 1
01 SWAP
20 DUP
94 PUSHCONT {
  66 TUCK
  A8 MUL
  01 SWAP
  A5 DEC
}
E4 REPEAT
30 DROP
\end{verbatim}

\nxsubpoint\emb{Several mutually recursive functions} If one has a collection $f_1$, \dots, $f_n$ of mutually recursive functions, one can use the same trick by passing the whole collection of continuations $\{f_i\}$ in the stack as an extra $n$ arguments to each of these functions. However, as $n$ grows, this becomes more and more cumbersome, since one has to reorder these extra arguments in the stack to work with the ``true'' arguments, and then push their copies into the top of the stack before any recursive call.

\nxsubpoint\emb{Combining several functions into one tuple} One might also combine a collection of continuations representing functions $f_1$, \dots, $f_n$ into a ``tuple'' ${\mathbf f}:=(f_1,\ldots,f_n)$, and pass this tuple as one stack element ${\mathbf f}$. For instance, when $n\leq4$, each function can be represented by a cell $\tilde f_i$ (along with the tree of cells rooted in this cell), and the tuple may be represented by a cell $\tilde{\mathbf f}$, which has references to its component cells $\tilde f_i$. However, this would lead to the necessity of ``unpacking'' the needed component from this tuple before each recursive call.

\nxsubpoint\emb{Combining several functions into a selector function} Another approach is to combine several functions $f_1$, \dots, $f_n$ into one ``selector function'' $f$, which takes an extra argument $i$, $1\leq i\leq n$, from the top of the stack, and invokes the appropriate function~$f_i$. Stack machines such as TVM are well-suited to this approach, because they do not require the functions~$f_i$ to have the same number and types of arguments. Using this approach, one would need to pass only one extra argument, $f$, to each of these functions, and push into the stack an extra argument~$i$ before each recursive call to $f$ to select the correct function to be called.
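As an informal Python rendering of this pattern (the functions are invented for the example), two mutually recursive functions are combined into a single selector function dispatching on an extra integer argument:
\begin{verbatim}
# A selector function combining mutually recursive functions into one;
# the extra argument i plays the role of the value taken from the top
# of the stack.
def selector(i, *args):
    if i == 1:
        return is_even(*args)
    if i == 2:
        return is_odd(*args)
    raise ValueError("invalid selector")

def is_even(n):
    return True if n == 0 else selector(2, n - 1)    # calls is_odd
def is_odd(n):
    return False if n == 0 else selector(1, n - 1)   # calls is_even

assert selector(1, 10) is True and selector(1, 7) is False
\end{verbatim}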
\nxsubpoint\emb{Using a dedicated register to keep the selector function} However, even if we use one of the two previous approaches to combine all functions into one extra argument, passing this argument to all mutually recursive functions is still quite cumbersome and requires a lot of additional stack manipulation operations. Because this argument changes very rarely, one might use a dedicated register to keep it and transparently pass it to all functions called. This is the approach used by TVM by default.

\nxsubpoint\emb{Special register {\tt c3} for the selector function} In fact, TVM uses a dedicated register {\tt c3} to keep the continuation representing the current or global ``selector function'', which can be used to invoke any of a family of mutually recursive functions. Special primitives {\tt CALL $nn$} or {\tt CALLDICT $nn$} (cf.~\ptref{sp:prim.dict.calls}) are equivalent to {\tt PUSHINT $nn$}; {\tt PUSH c3}; {\tt EXECUTE}, and similarly {\tt JMP $nn$} or {\tt JMPDICT $nn$} are equivalent to {\tt PUSHINT $nn$}; {\tt PUSH c3}; {\tt JMPX}. In this way a TVM program, which ultimately is a large collection of mutually recursive functions, may initialize {\tt c3} with the correct selector function representing the family of all the functions in the program, and then use {\tt CALL $nn$} to invoke any of these functions by its index (sometimes also called the {\em selector\/} of a function).

\nxsubpoint\emb{Initialization of {\tt c3}} A TVM program might initialize {\tt c3} by means of a {\tt POP c3} instruction. However, because this usually is the very first action undertaken by a program (e.g., a smart contract), TVM makes some provisions for the automatic initialization of {\tt c3}. Namely, {\tt c3} is initialized by the code (the initial value of {\tt cc}) of the program itself, and an extra zero (or, in some cases, some other predefined number $s$) is pushed into the stack before the program's execution. This is approximately equivalent to invoking {\tt JMPDICT 0} (or {\tt JMPDICT $s$}) at the very beginning of a program---i.e., the function with index zero is effectively the {\tt main()} function for the program.

\nxsubpoint\emb{Creating selector functions and {\tt switch} statements} TVM makes special provisions for simple and concise implementation of selector functions (which usually constitute the top level of a TVM program) or, more generally, arbitrary {\tt switch} or {\tt case} statements (which are also useful in TVM programs). The most important primitives included for this purpose are {\tt IFBITJMP}, {\tt IFNBITJMP}, {\tt IFBITJMPREF}, and {\tt IFNBITJMPREF} (cf.~\ptref{sp:prim.cond.flow}). They effectively enable one to combine subroutines, kept either in separate cells or as subslices of certain cells, into a binary decision tree with decisions made according to the indicated bits of the integer passed in the top of the stack. Another instruction, useful for the implementation of sum-product types, is {\tt PLDUZ} (cf.~\ptref{sp:prim.deser}). This instruction preloads the first several bits of a {\em Slice\/} into an {\em Integer}, which can later be inspected by {\tt IFBITJMP} and other similar instructions.
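In the same informal Python style (illustrative only; actual TVM code would keep the subroutines in cells and use {\tt IFBITJMP} directly), the effect of such a binary decision tree may be sketched as follows:
\begin{verbatim}
# Dispatch through a binary decision tree: inner nodes are pairs
# (branch if the inspected bit is 0, branch if it is 1), leaves are
# callables; bits 0, 1, ... of the selector are inspected in turn.
def dispatch(selector, tree, bit=0):
    while isinstance(tree, tuple):
        tree = tree[(selector >> bit) & 1]
        bit += 1
    return tree(selector)

handlers = ((lambda s: (0, 0), lambda s: (0, 1)),
            (lambda s: (1, 0), lambda s: (1, 1)))
assert dispatch(2, handlers) == (0, 1)  # 2 = ...10: bit 0 is 0, bit 1 is 1
\end{verbatim}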
\nxsubpoint\emb{Alternative: using a hashmap to select the correct function} Yet another alternative is to use a {\em Hashmap\/} (cf.~\ptref{p:hashmaps}) to hold the ``collection'' or ``dictionary'' of the code of all functions in a program, and use the hashmap lookup primitives (cf.~\ptref{p:prim.dict}) to select the code of the required function, which can then be {\tt BLESS}ed into a continuation (cf.~\ptref{sp:prim.bless.cont}) and executed. Special combined ``lookup, bless, and execute'' primitives, such as {\tt DICTIGETJMP} and {\tt DICTIGETEXEC}, are also available (cf.~\ptref{sp:prim.dict.get.spec}). This approach may be more efficient for larger programs and {\tt switch} statements. \clearpage \mysection{Codepages and instruction encoding} This chapter describes the codepage mechanism, which allows TVM to be flexible and extendable while preserving backward compatibility with respect to previously generated code. We also discuss some general considerations about instruction encodings (applicable to arbitrary machine code, not just TVM), as well as the implications of these considerations for TVM and the choices made while designing TVM's (experimental) codepage zero. The instruction encodings themselves are presented later in Appendix~\ptref{app:opcodes}. \mysubsection{Codepages and interoperability of different TVM versions}\label{p:codepages} The {\em codepages\/} are an essential mechanism of backward compatibility and of future extensions to TVM. They enable transparent execution of code written for different revisions of TVM, with transparent interaction between instances of such code. The mechanism of the codepages, however, is general and powerful enough to enable some other originally unintended applications. \nxsubpoint\emb{Codepages in continuations} Every ordinary continuation contains a 16-bit {\em codepage} field {\tt cp} (cf.~\ptref{sp:ord.cont}), which determines the codepage that will be used to execute its code. If a continuation is created by a {\tt PUSHCONT} (cf.~\ptref{sp:lit.cont}) or similar primitive, it usually inherits the current codepage (i.e., the codepage of {\tt cc}).\footnote{This is not exactly true. A more precise statement is that usually the codepage of the newly-created continuation is a known function of the current codepage.} \nxsubpoint\emb{Current codepage} The current codepage {\tt cp} (cf.~\ptref{p:tvm.state}) is the codepage of the current continuation {\tt cc}. It determines the way the next instruction will be decoded from {\tt cc.code}, the remainder of the current continuation's code. Once the instruction has been decoded and executed, it determines the next value of the current codepage. In most cases, the current codepage is left unchanged. On the other hand, all primitives that switch the current continuation load the new value of {\tt cp} from the new current continuation. In this way, all code in continuations is always interpreted exactly as it was intended to be. \nxsubpoint\emb{Different versions of TVM may use different codepages} Different versions of TVM may use different codepages for their code. For example, the original version of TVM might use codepage zero. A newer version might use codepage one, which contains all the previously defined opcodes, along with some newly defined ones, using some of the previously unused opcode space. A subsequent version might use yet another codepage, and so on. However, a newer version of TVM will execute old code for codepage zero exactly as before. 
If the old code contained an opcode used for some new operations that were undefined in the original version of TVM, it will still generate an invalid opcode exception, because the new operations are absent in codepage zero. \nxsubpoint\label{sp:old.op.change}\emb{Changing the behavior of old operations} New codepages can also change the effects of some operations present in the old codepages while preserving their opcodes and mnemonics. For example, imagine a future 513-bit upgrade of TVM (replacing the current 257-bit design). It might use a 513-bit {\em Integer} type within the same arithmetic primitives as before. However, while the opcodes and instructions in the new codepage would look exactly like the old ones, they would work differently, accepting 513-bit integer arguments and results. On the other hand, during the execution of the same code in codepage zero, the new machine would generate exceptions whenever the integers used in arithmetic and other primitives do not fit into 257 bits.\footnote{This is another important mechanism of backward compatibility. All values of newly-added types, as well as values belonging to extended original types that do not belong to the original types (e.g., 513-bit integers that do not fit into 257 bits in the example above), are treated by all instructions (except stack manipulation instructions, which are naturally polymorphic, cf.~\ptref{sp:stack.prim.poly}) in the old codepages as ``values of incorrect type'', and generate type-checking exceptions accordingly.} In this way, the upgrade would not change the behavior of the old code. \nxsubpoint\emb{Improving instruction encoding} Another application for codepages is to change instruction encodings, reflecting improved knowledge of the actual frequencies of such instructions in the code base. In this case, the new codepage will have exactly the same instructions as the old one, but with different encodings, potentially of differing lengths. For example, one might create an experimental version of the first version of TVM, using a (prefix) bitcode instead of the original bytecode, aiming to achieve higher code density. \nxsubpoint\label{sp:context.instr.enc} \emb{Making instruction encoding context-dependent} Another way of using codepages to improve code density is to use several codepages with different subsets of the whole instruction set defined in each of them, or with the whole instruction set defined, but with different length encodings for the same instructions in different codepages. Imagine, for instance, a ``stack manipulation'' codepage, where stack manipulation primitives have short encodings at the expense of all other operations, and a ``data processing'' codepage, where all other operations are shorter at the expense of stack manipulation operations. If stack manipulation operations tend to come one after another, we can automatically switch to ``stack manipulation'' codepage after executing any such instruction. When a data processing instruction occurs, we switch back to ``data processing'' codepage. If conditional probabilities of the class of the next instruction depending on the class of the previous instruction are considerably different from corresponding unconditional probabilities, this technique---automatically switching into stack manipulation mode to rearrange the stack with shorter instructions, then switching back---might considerably improve the code density. 
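A back-of-the-envelope computation shows the kind of gain one might expect; all numbers below are hypothetical and serve only to illustrate the mechanism:
\begin{verbatim}
# Toy estimate with invented probabilities and encoding lengths.
# Suppose the next instruction belongs to the same class as the
# previous one with probability 0.8, the specialized codepage encodes
# the likely class in 8 bits and the other class in 16 bits (including
# the implicit switch), while a single general-purpose codepage spends
# 12 bits on either class.
p = 0.8
specialized = p * 8 + (1 - p) * 16
print(specialized, "expected bits versus", 12)   # 9.6 versus 12
\end{verbatim}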
\nxsubpoint\emb{Using codepages for status and control flags} Another potential application of multiple codepages inside the same revision of TVM consists in switching between several codepages depending on the result of the execution of some instructions. For example, imagine a version of TVM that uses two new codepages, 2 and 3. Most operations do not change the current codepage. However, the integer comparison operations will switch to codepage 2 if the condition is false, and to codepage 3 if it is true. Furthermore, a new operation {\tt ?EXECUTE}, similar to {\tt EXECUTE}, will indeed be equivalent to {\tt EXECUTE} in codepage 3, but will instead be a {\tt DROP} in codepage 2. Such a trick effectively uses bit 0 of the current codepage as a status flag. Alternatively, one might create a couple of codepages---say, 4 and 5---which differ only in their cell deserialisation primitives. For instance, in codepage 4 they might work as before, while in codepage 5 they might deserialize data not from the beginning of a {\em Slice}, but from its end. Two new instructions---say, {\tt CLD} and {\tt STD}---might be used for switching to codepage 4 or codepage 5. Clearly, we have now described a status flag, affecting the execution of some instructions in a certain new manner. \nxsubpoint\emb{Setting the codepage in the code itself}\label{sp:setcp.opc} For convenience, we reserve some opcode in all codepages---say, {\tt FF $n$}---for the instruction {\tt SETCP $n$}, with $n$ from 0 to 255 (cf.~\ptref{p:prim.codepage}). Then by inserting such an instruction into the very beginning of (the main function of) a program (e.g., a TON Blockchain smart contract) or a library function, we can ensure that the code will always be executed in the intended codepage. \mysubsection{Instruction encoding}\label{p:instr.encode} This section discusses the general principles of instruction encoding valid for all codepages and all versions of TVM. Later, \ptref{p:cp0.instr.enc} discusses the choices made for the experimental ``codepage zero''. \nxsubpoint\emb{Instructions are encoded by a binary prefix code} All complete instructions (i.e., instructions along with all their parameters, such as the names of stack registers $s(i)$ or other embedded constants) of a TVM codepage are encoded by a {\em binary prefix code}. This means that a (finite) binary string (i.e., a bitstring) corresponds to each complete instruction, in such a way that binary strings corresponding to different complete instructions do not coincide, and no binary string among the chosen subset is a prefix of another binary string from this subset. \nxsubpoint\emb{Determining the first instruction from a code stream} As a consequence of this encoding method, any binary string admits at most one prefix, which is an encoding of some complete instruction. In particular, the code {\tt cc.code} of the current continuation (which is a {\em Slice}, and thus a bitstring along with some cell references) admits at most one such prefix, which corresponds to the (uniquely determined) instruction that TVM will execute first. After execution, this prefix is removed from the code of the current continuation, and the next instruction can be decoded. \nxsubpoint\emb{Invalid opcode} If no prefix of {\tt cc.code} encodes a valid instruction in the current codepage, an {\em invalid opcode exception\/} is generated (cf.~\ptref{sp:exc.list}). 
However, the case of an empty {\tt cc.code} is treated separately as explained in~\ptref{sp:ord.cont.exec} (the exact behavior may depend on the current codepage).

\nxsubpoint\emb{Special case: end-of-code padding} As an exception to the above rule, some codepages may accept some values of {\tt cc.code} that are too short to be valid instruction encodings as additional variants of {\tt NOP}, thus effectively using the same procedure for them as for an empty {\tt cc.code}. Such bitstrings may be used for padding the code near its end. For example, if binary string {\tt 00000000} (i.e., {\tt x00}, cf.~\ptref{sp:hex.bitst}) is used in a codepage to encode {\tt NOP}, its proper prefixes cannot encode any instructions. So this codepage may accept {\tt 0}, {\tt 00}, {\tt 000}, \dots, {\tt 0000000} as variants of {\tt NOP} if this is all that is left in {\tt cc.code}, instead of generating an invalid opcode exception. Such a padding may be useful, for example, if the {\tt PUSHCONT} primitive (cf.~\ptref{sp:lit.cont}) creates only continuations with code consisting of an integral number of bytes, but not all instructions are encoded by an integral number of bytes.

\nxsubpoint\label{sp:bitcode} \emb{TVM code is a bitcode, not a bytecode} Recall that TVM is a bit-oriented machine in the sense that its {\em Cell\/}s (and {\em Slice\/}s) are naturally considered as sequences of bits, not just of octets (bytes), cf.~\ptref{sp:cells.of.bits}. Because the TVM code is also kept in cells (cf.~\ptref{sp:code.boc} and~\ptref{sp:ord.cont.exec}), there is no reason to use only bitstrings of length divisible by eight as encodings of complete instructions. In other words, generally speaking, {\em the TVM code is a bitcode, not a bytecode}. That said, some codepages (such as our experimental codepage zero) may opt to use a bytecode (i.e., to use only encodings consisting of an integral number of bytes)---either for simplicity, or for the ease of debugging and of studying memory (i.e., cell) dumps.\footnote{If the cell dumps are hexadecimal, encodings consisting of an integral number of hexadecimal digits (i.e., having length divisible by four) might be equally convenient.}

\nxsubpoint\emb{Opcode space used by a complete instruction} Recall from coding theory that the lengths of bitstrings $l_i$ used in a binary prefix code satisfy the Kraft--McMillan inequality $\sum_i2^{-l_i}\leq1$. This is applicable in particular to the (complete) instruction encoding used by a TVM codepage. We say that {\em a particular complete instruction} (or, more precisely, {\em the encoding of a complete instruction}) {\em utilizes the portion $2^{-l}$ of the opcode space}, if it is encoded by an $l$-bit string. One can see that all complete instructions together utilize at most $1$ (i.e., ``at most the whole opcode space'').

\nxsubpoint\emb{Opcode space used by an instruction, or a class of instructions} The above terminology is extended to instructions (considered with all admissible values of their parameters), or even classes of instructions (e.g., all arithmetic instructions). We say that an (incomplete) instruction, or a class of instructions, occupies the portion $\alpha$ of the opcode space, if $\alpha$ is the sum of the portions of the opcode space occupied by all complete instructions belonging to that class.

\nxsubpoint\emb{Opcode space for bytecodes} A useful approximation of the above definitions is as follows: Consider all 256 possible values for the first byte of an instruction encoding.
Suppose that $k$ of these values correspond to the specific instruction or class of instructions we are considering. Then this instruction or class of instructions occupies approximately the portion $k/256$ of the opcode space. This approximation shows why all instructions together cannot occupy more than the portion $256/256=1$ of the opcode space, at least without compromising the uniqueness of instruction decoding.

\nxsubpoint\label{sp:opcode.sp.distr}\emb{Almost optimal encodings} Coding theory tells us that in an optimally dense encoding, the portion of the opcode space used by a complete instruction ($2^{-l}$, if the complete instruction is encoded in $l$ bits) should be approximately equal to the probability or frequency of its occurrence in real programs.\footnote{Notice that it is the probability of occurrence in the code that counts, not the probability of being executed. An instruction occurring in the body of a loop executed a million times is still counted only once.} The same should hold for (incomplete) instructions, or primitives (i.e., generic instructions without specified values of parameters), and for classes of instructions.

\nxsubpoint\label{sp:stk.opcode.distr}\emb{Example: stack manipulation primitives} For instance, if stack manipulation instructions constitute approximately half of all instructions in a typical TVM program, one should allocate approximately half of the opcode space for encoding stack manipulation instructions. One might reserve the first bytes (``opcodes'') {\tt 0x00}--{\tt 0x7f} for such instructions. If a quarter of these instructions are {\tt XCHG}, it would make sense to reserve {\tt 0x00}--{\tt 0x1f} for {\tt XCHG}s. Similarly, if half of all {\tt XCHG}s involve the top of stack {\tt s0}, it would make sense to use {\tt 0x00}--{\tt 0x0f} to encode {\tt XCHG s0,s$(i)$}.

\nxsubpoint\label{sp:simple.enc}\emb{Simple encodings of instructions} In most cases, {\em simple\/} encodings of complete instructions are used. Simple encodings begin with a fixed bitstring called the {\em opcode} of the instruction, followed by, say, 4-bit fields containing the indices $i$ of stack registers {\tt s$(i)$} specified in the instruction, followed by all other constant (literal, immediate) parameters included in the complete instruction. While simple encodings may not be exactly optimal, they admit short descriptions, and their decoding and encoding can be easily implemented. If a (generic) instruction uses a simple encoding with an $l$-bit opcode, then the instruction will utilize the portion $2^{-l}$ of the opcode space. This observation might be useful for considerations described in~\ptref{sp:opcode.sp.distr} and~\ptref{sp:stk.opcode.distr}.

\nxsubpoint\emb{Optimizing code density further: Huffman codes} One might construct an optimally dense binary code for the set of all complete instructions, provided their probabilities or frequencies in real code are known. This is the well-known Huffman code (for the given probability distribution). However, such a code would be highly unsystematic and hard to decode.

\nxsubpoint\label{sp:pract.enc}\emb{Practical instruction encodings} In practice, instruction encodings used in TVM and other virtual machines offer a compromise between code density and ease of encoding and decoding.
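Whatever compromise is chosen, the Kraft--McMillan inequality above provides a simple sanity check that a proposed allocation actually fits into the opcode space. The following minimal sketch performs this bookkeeping; the sample allocation is hypothetical, loosely following the heuristics of~\ptref{sp:stk.opcode.distr}.
\begin{verbatim}
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    // {bits per complete instruction, count of such instructions}:
    // a hypothetical allocation with one-byte opcodes 0x00-0x7F for
    // stack manipulation and two-byte encodings for everything else.
    std::vector<std::pair<int, long long>> alloc = {
        {8, 128},    // 128 one-byte instructions: 128/256 = 1/2
        {16, 8192},  // 8192 two-byte instructions: 8192/65536 = 1/8
    };
    double used = 0.0;
    for (const auto& a : alloc)
        used += static_cast<double>(a.second) * std::ldexp(1.0, -a.first);
    std::printf("opcode space used: %.3f (must not exceed 1)\n", used);
    return used <= 1.0 ? 0 : 1;
}
\end{verbatim}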
Such a compromise may be achieved by selecting simple encodings (cf.~\ptref{sp:simple.enc}) for all instructions (maybe with separate simple encodings for some often used variants, such as {\tt XCHG s0,s$(i)$} among all {\tt XCHG s$(i)$,s$(j)$}), and allocating opcode space for such simple encodings using the heuristics outlined in \ptref{sp:opcode.sp.distr} and~\ptref{sp:stk.opcode.distr}; this is the approach currently used in TVM. \mysubsection{Instruction encoding in codepage zero}\label{p:cp0.instr.enc} This section provides details about the experimental instruction encoding for codepage zero, as described elsewhere in this document (cf.\ Appendix~\ptref{app:opcodes}) and used in the preliminary test version of TVM. \nxsubpoint\emb{Upgradability} First of all, even if this preliminary version somehow gets into the production version of the TON Blockchain, the codepage mechanism (cf.~\ptref{p:codepages}) enables us to introduce better versions later without compromising backward compatibility.\footnote{Notice that any modifications after launch cannot be done unilaterally; rather they would require the support of at least two-thirds of validators.} So in the meantime, we are free to experiment. \nxsubpoint\emb{Choice of instructions} We opted to include many ``experimental'' and not strictly necessary instructions in codepage zero just to see how they might be used in real code. For example, we have both the basic (cf.~\ptref{sp:stack.basic}) and the compound (cf.~\ptref{sp:stack.comp}) stack manipulation primitives, as well as some ``unsystematic'' ones such as {\tt ROT} (mostly borrowed from Forth). If such primitives are rarely used, their inclusion just wastes some part of the opcode space and makes the encodings of other instructions slightly less effective, something we can afford at this stage of TVM's development. \nxsubpoint\emb{Using experimental instructions} Some of these experimental instructions have been assigned quite long opcodes, just to fit more of them into the opcode space. One should not be afraid to use them just because they are long; if these instructions turn out to be useful, they will receive shorter opcodes in future revisions. Codepage zero is not meant to be fine-tuned in this respect. \nxsubpoint\emb{Choice of bytecode} We opted to use a bytecode (i.e., to use encodings of complete instructions of lengths divisible by eight). While this may not produce optimal code density, because such a length restriction makes it more difficult to match portions of opcode space used for the encoding of instructions with estimated frequencies of these instructions in TVM code (cf.~\ptref{sp:simple.enc} and \ptref{sp:opcode.sp.distr}), such an approach has its advantages: it admits a simpler instruction decoder and simplifies debugging (cf.~\ptref{sp:bitcode}). After all, we do not have enough data on the relative frequencies of different instructions right now, so our code density optimizations are likely to be very approximate at this stage. The ease of debugging and experimenting and the simplicity of implementation are more important at this point. \nxsubpoint\emb{Simple encodings for all instructions} For similar reasons, we opted to use simple encodings for all instructions (cf.~\ptref{sp:simple.enc} and \ptref{sp:pract.enc}), with separate simple encodings for some very frequently used subcases as outlined in~\ptref{sp:pract.enc}. 
That said, we tried to distribute opcode space using the heuristics described in~\ptref{sp:opcode.sp.distr} and~\ptref{sp:stk.opcode.distr}.

\nxsubpoint\emb{Lack of context-dependent encodings} This version of TVM also does not use context-dependent encodings (cf.~\ptref{sp:context.instr.enc}). They may be added at a later stage, if deemed useful.

\nxsubpoint\emb{The list of all instructions} The list of all instructions available in codepage zero, along with their encodings and (in some cases) short descriptions, may be found in Appendix~\ptref{app:opcodes}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%      BIBLIOGRAPHY
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\clearpage
\markbothsame{\textsc{References}}
\begin{thebibliography}{2}
%\bibitem{Birman}
%  {\sc K.~Birman}, {\sl Reliable Distributed Systems: Technologies, Web Services and Applications}, Springer, 2005.
\bibitem{TON}
  {\sc N.~Durov}, {\sl Telegram Open Network}, 2017.
\end{thebibliography}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%      APPENDICES
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\clearpage
\appendix
\myappendix{Instructions and opcodes}\label{app:opcodes}

This appendix lists all instructions available in the (experimental) codepage zero of TVM, as explained in~\ptref{p:cp0.instr.enc}.

We list the instructions in lexicographical opcode order. However, the opcode space is distributed in such a way as to make all instructions in each category (e.g., arithmetic primitives) have neighboring opcodes. So we first list a number of stack manipulation primitives, then constant primitives, arithmetic primitives, comparison primitives, cell primitives, continuation primitives, dictionary primitives, and finally application-specific primitives.

We use hexadecimal notation (cf.~\ptref{p:bitstring.hex}) for bitstrings. Stack registers {\tt s$(i)$} usually have $0\leq i\leq 15$, and $i$ is encoded in a 4-bit field (or, on a few rare occasions, in an 8-bit field). Other immediate parameters are usually 4-bit, 8-bit, or variable length.

The stack notation described in~\ptref{sp:stack.notat} is extensively used throughout this appendix.

\mysubsection{Gas prices}

\def\gas#1{{\em ($#1$)}}
The gas price for most primitives equals the {\em basic gas price}, computed as $P_b:=10+b+5r$, where $b$ is the instruction length in bits and $r$ is the number of cell references included in the instruction. For instance, an ordinary one-byte instruction without cell references has a basic gas price of $10+8+5\cdot0=18$ gas units. When the gas price of an instruction differs from this basic price, it is indicated in parentheses after its mnemonics, either as \gas{x}, meaning that the total gas price equals $x$, or as \gas{+x}, meaning $P_b+x$.

Apart from integer constants, the following expressions may appear:
\begin{itemize}
\item $C_r$ --- The total price of ``reading'' cells (i.e., transforming cell references into cell slices). Currently equal to 20 gas units per cell.
\item $L$ --- The total price of loading cells. Depends on the loading action required.
\item $B_w$ --- The total price of creating new {\em Builder\/}s. Currently equal to 100 gas units per builder.
\item $C_w$ --- The total price of creating new {\em Cell\/}s from {\em Builder\/}s. Currently equal to 100 gas units per cell.
\end{itemize}

\mysubsection{Stack manipulation primitives}

This section includes both the basic (cf.~\ptref{sp:stack.basic}) and the compound (cf.~\ptref{sp:stack.comp}) stack manipulation primitives, as well as some ``unsystematic'' ones.
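Before listing them, it may help to keep a minimal executable model of the stack notation in mind. The C++ sketch below is illustrative only (it uses 64-bit integers where TVM manipulates arbitrary stack values): it implements {\tt XCHG}, {\tt PUSH}, and {\tt POP}, and shows how a compound primitive such as {\tt XCPU s$(i)$,s$(j)$} reduces to {\tt XCHG s$(i)$} followed by {\tt PUSH s$(j)$}.
\begin{verbatim}
#include <cassert>
#include <utility>
#include <vector>

// Toy model of TVM stack registers: s(0) is the top of the stack.
using Stack = std::vector<long long>;
long long& s(Stack& st, int i) { return st[st.size() - 1 - i]; }

void XCHG(Stack& st, int i, int j) { std::swap(s(st, i), s(st, j)); }
void PUSH(Stack& st, int i) { st.push_back(s(st, i)); }
// POP s(i): store the old s0 into the old s(i), then drop s0.
void POP(Stack& st, int i) { s(st, i) = st.back(); st.pop_back(); }

// A compound primitive expressed through basic ones.
void XCPU(Stack& st, int i, int j) { XCHG(st, 0, i); PUSH(st, j); }

int main() {
    Stack st = {1, 2, 3};   // s2=1, s1=2, s0=3
    PUSH(st, 1);            // 1 2 3 2
    XCHG(st, 0, 3);         // 2 2 3 1
    assert(s(st, 0) == 1 && s(st, 3) == 2);
    return 0;
}
\end{verbatim}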
Some compound stack manipulation primitives, such as {\tt XCPU} or {\tt XCHG2}, turn out to have the same length as an equivalent sequence of simpler operations. We have included these primitives regardless, so that they can easily be allocated shorter opcodes in a future revision of TVM---or removed for good.

Some stack manipulation instructions have two mnemonics: one Forth-style (e.g., {\tt -ROT}), the other conforming to the usual rules for identifiers (e.g., {\tt ROTREV}). Whenever a stack manipulation primitive (e.g., {\tt PICK}) accepts an integer parameter $n$ from the stack, it must be within the range $0\dots255$; otherwise a range check exception happens before any further checks.

\nxsubpoint\emb{Basic stack manipulation primitives}
\begin{itemize}
\item {\tt 00} --- {\tt NOP}, does nothing.
\item {\tt 01} --- {\tt XCHG s1}, also known as {\tt SWAP}.
\item {\tt 0$i$} --- {\tt XCHG s$(i)$} or {\tt XCHG s0,s$(i)$}, interchanges the top of the stack with {\tt s$(i)$}, $1\leq i\leq 15$.
\item {\tt 10$ij$} --- {\tt XCHG s$(i)$,s$(j)$}, $1\leq i<j\leq 15$, interchanges {\tt s$(i)$} with {\tt s$(j)$}.
\item {\tt 11$ii$} --- {\tt XCHG s0,s$(ii)$}, with $0\leq ii\leq 255$.
\item {\tt 1$i$} --- {\tt XCHG s1,s$(i)$}, $2\leq i\leq 15$.
\item {\tt 2$i$} --- {\tt PUSH s$(i)$}, $0\leq i\leq 15$, pushes a copy of the old {\tt s$(i)$} into the stack.
\item {\tt 20} --- {\tt PUSH s0}, also known as {\tt DUP}.
\item {\tt 21} --- {\tt PUSH s1}, also known as {\tt OVER}.
\item {\tt 3$i$} --- {\tt POP s$(i)$}, $0\leq i\leq 15$, pops the old top-of-stack value into the old {\tt s$(i)$}.
\item {\tt 30} --- {\tt POP s0}, also known as {\tt DROP}, discards the top-of-stack value.
\item {\tt 31} --- {\tt POP s1}, also known as {\tt NIP}.
\end{itemize}

\mysubsection{Comparison primitives}

\nxsubpoint\emb{Integer comparison} All integer comparison primitives listed below return integer $-1$ (``true'') or $0$ (``false'') to indicate the result of the comparison.
\begin{itemize}
\item {\tt B8} --- {\tt SGN} ($x$ -- $\sgn(x)$), computes the sign of an integer $x$: $-1$ if $x<0$, $0$ if $x=0$, $1$ if $x>0$.
\item {\tt B9} --- {\tt LESS} ($x$ $y$ -- $x<y$).
\item {\tt BA} --- {\tt EQUAL} ($x$ $y$ -- $x=y$).
\item {\tt BB} --- {\tt LEQ} ($x$ $y$ -- $x\leq y$).
\item {\tt BC} --- {\tt GREATER} ($x$ $y$ -- $x>y$).
\item {\tt BD} --- {\tt NEQ} ($x$ $y$ -- $x\neq y$), equivalent to {\tt EQUAL}; {\tt NOT}.
\item {\tt BE} --- {\tt GEQ} ($x$ $y$ -- $x\geq y$), equivalent to {\tt LESS}; {\tt NOT}.
\item {\tt BF} --- {\tt CMP} ($x$ $y$ -- $\sgn(x-y)$), computes the sign of $x-y$: $-1$ if $x<y$, $0$ if $x=y$, $1$ if $x>y$. No integer overflow can occur here unless $x$ or $y$ is a {\tt NaN}.
\item {\tt C0$yy$} --- {\tt EQINT $yy$} ($x$ -- $x=yy$) for $-2^7\leq yy<2^7$.
\item {\tt C000} --- {\tt ISZERO}, checks whether an integer is zero. Corresponds to Forth's {\tt 0=}.
\item {\tt C1$yy$} --- {\tt LESSINT $yy$} ($x$ -- $x<yy$) for $-2^7\leq yy<2^7$.
\item {\tt C2$yy$} --- {\tt GTINT $yy$} ($x$ -- $x>yy$) for $-2^7\leq yy<2^7$.
\item {\tt C200} --- {\tt ISPOS}, checks whether an integer is positive. Corresponds to Forth's {\tt 0>}.
\item {\tt C2FF} --- {\tt ISNNEG}, checks whether an integer is non-negative.
\item {\tt C3$yy$} --- {\tt NEQINT $yy$} ($x$ -- $x\neq yy$) for $-2^7\leq yy<2^7$.
\item {\tt C4} --- {\tt ISNAN} ($x$ -- $x={\tt NaN}$), checks whether $x$ is a {\tt NaN}.
\item {\tt C5} --- {\tt CHKNAN} ($x$ -- $x$), throws an arithmetic overflow exception if $x$ is a {\tt NaN}.
\item {\tt C6} --- reserved for integer comparison.
\end{itemize}

\nxsubpoint\emb{Other comparison} Most of these ``other comparison'' primitives actually compare the data portions of {\em Slice\/}s as bitstrings.
\begin{itemize}
\item {\tt C700} --- {\tt SEMPTY} ($s$ -- $s=\emptyset$), checks whether a {\em Slice\/}~$s$ is empty (i.e., contains no bits of data and no cell references).
\item {\tt C701} --- {\tt SDEMPTY} ($s$ -- $s\approx\emptyset$), checks whether {\em Slice\/}~$s$ has no bits of data.
\item {\tt C702} --- {\tt SREMPTY} ($s$ -- $r(s)=0$), checks whether {\em Slice\/} $s$ has no references.
\item {\tt C703} --- {\tt SDFIRST} ($s$ -- $s_0=1$), checks whether the first bit of {\em Slice\/} $s$ is a one.
\item {\tt C704} --- {\tt SDLEXCMP} ($s$ $s'$ -- $c$), compares the data of $s$ lexicographically with the data of $s'$, returning $-1$, 0, or 1 depending on the result.
\item {\tt C705} --- {\tt SDEQ} ($s$ $s'$ -- $s\approx s'$), checks whether the data parts of $s$ and $s'$ coincide, equivalent to {\tt SDLEXCMP}; {\tt ISZERO}.
\item {\tt C708} --- {\tt SDPFX} ($s$ $s'$ -- $?$), checks whether $s$ is a prefix of $s'$.
\item {\tt C709} --- {\tt SDPFXREV} ($s$ $s'$ -- $?$), checks whether $s'$ is a prefix of $s$, equivalent to {\tt SWAP}; {\tt SDPFX}.
\item {\tt C70A} --- {\tt SDPPFX} ($s$ $s'$ -- $?$), checks whether $s$ is a proper prefix of $s'$ (i.e., a prefix distinct from $s'$). \item {\tt C70B} --- {\tt SDPPFXREV} ($s$ $s'$ -- $?$), checks whether $s'$ is a proper prefix of $s$. \item {\tt C70C} --- {\tt SDSFX} ($s$ $s'$ -- $?$), checks whether $s$ is a suffix of $s'$. \item {\tt C70D} --- {\tt SDSFXREV} ($s$ $s'$ -- $?$), checks whether $s'$ is a suffix of $s$. \item {\tt C70E} --- {\tt SDPSFX} ($s$ $s'$ -- $?$), checks whether $s$ is a proper suffix of $s'$. \item {\tt C70F} --- {\tt SDPSFXREV} ($s$ $s'$ -- $?$), checks whether $s'$ is a proper suffix of $s$. \item {\tt C710} --- {\tt SDCNTLEAD0} ($s$ -- $n$), returns the number of leading zeroes in $s$. \item {\tt C711} --- {\tt SDCNTLEAD1} ($s$ -- $n$), returns the number of leading ones in $s$. \item {\tt C712} --- {\tt SDCNTTRAIL0} ($s$ -- $n$), returns the number of trailing zeroes in $s$. \item {\tt C713} --- {\tt SDCNTTRAIL1} ($s$ -- $n$), returns the number of trailing ones in $s$. \end{itemize} \mysubsection{Cell primitives} The cell primitives are mostly either {\em cell serialization primitives}, which work with {\em Builder\/}s, or {\em cell deserialization primitives}, which work with {\em Slice\/}s. \nxsubpoint\emb{Cell serialization primitives}\label{sp:prim.ser} All these primitives first check whether there is enough space in the Builder, and only then check the range of the value being serialized. \begin{itemize} \item {\tt C8} --- {\tt NEWC} ( -- $b$), creates a new empty {\em Builder}. \item {\tt C9} --- {\tt ENDC} ($b$ -- $c$), converts a {\em Builder} into an ordinary {\em Cell}. \item {\tt CA$cc$} --- {\tt STI $cc+1$} ($x$ $b$ -- $b'$), stores a signed $cc+1$-bit integer $x$ into {\em Builder\/} $b$ for $0\leq cc\leq 255$, throws a range check exception if $x$ does not fit into $cc+1$ bits. \item {\tt CB$cc$} --- {\tt STU $cc+1$} ($x$ $b$ -- $b'$), stores an unsigned $cc+1$-bit integer $x$ into {\em Builder\/} $b$. In all other respects it is similar to {\tt STI}. \item {\tt CC} --- {\tt STREF} ($c$ $b$ -- $b'$), stores a reference to {\em Cell\/} $c$ into {\em Builder\/} $b$. \item {\tt CD} --- {\tt STBREFR} or {\tt ENDCST} ($b$ $b''$ -- $b$), equivalent to {\tt ENDC}; {\tt SWAP}; {\tt STREF}. \item {\tt CE} --- {\tt STSLICE} ($s$ $b$ -- $b'$), stores {\em Slice\/} $s$ into {\em Builder\/} $b$. \item {\tt CF00} --- {\tt STIX} ($x$ $b$ $l$ -- $b'$), stores a signed $l$-bit integer $x$ into $b$ for $0\leq l\leq 257$. \item {\tt CF01} --- {\tt STUX} ($x$ $b$ $l$ -- $b'$), stores an unsigned $l$-bit integer $x$ into $b$ for $0\leq l\leq 256$. \item {\tt CF02} --- {\tt STIXR} ($b$ $x$ $l$ -- $b'$), similar to {\tt STIX}, but with arguments in a different order. \item {\tt CF03} --- {\tt STUXR} ($b$ $x$ $l$ -- $b'$), similar to {\tt STUX}, but with arguments in a different order. \item {\tt CF04} --- {\tt STIXQ} ($x$ $b$ $l$ -- $x$ $b$ $f$ or $b'$ $0$), a quiet version of {\tt STIX}. If there is no space in $b$, sets $b'=b$ and $f=-1$. If $x$ does not fit into $l$ bits, sets $b'=b$ and $f=1$. If the operation succeeds, $b'$ is the new {\em Builder\/} and $f=0$. However, $0\leq l\leq 257$, with a range check exception if this is not so. \item {\tt CF05} --- {\tt STUXQ} ($x$ $b$ $l$ -- $b'$ $f$). \item {\tt CF06} --- {\tt STIXRQ} ($b$ $x$ $l$ -- $b$ $x$ $f$ or $b'$ $0$). \item {\tt CF07} --- {\tt STUXRQ} ($b$ $x$ $l$ -- $b$ $x$ $f$ or $b'$ $0$). \item {\tt CF08$cc$} --- a longer version of {\tt STI $cc+1$}. 
\item {\tt CF09$cc$} --- a longer version of {\tt STU $cc+1$}. \item {\tt CF0A$cc$} --- {\tt STIR $cc+1$} ($b$ $x$ -- $b'$), equivalent to {\tt SWAP}; {\tt STI $cc+1$}. \item {\tt CF0B$cc$} --- {\tt STUR $cc+1$} ($b$ $x$ -- $b'$), equivalent to {\tt SWAP}; {\tt STU $cc+1$}. \item {\tt CF0C$cc$} --- {\tt STIQ $cc+1$} ($x$ $b$ -- $x$ $b$ $f$ or $b'$ $0$). \item {\tt CF0D$cc$} --- {\tt STUQ $cc+1$} ($x$ $b$ -- $x$ $b$ $f$ or $b'$ $0$). \item {\tt CF0E$cc$} --- {\tt STIRQ $cc+1$} ($b$ $x$ -- $b$ $x$ $f$ or $b'$ $0$). \item {\tt CF0F$cc$} --- {\tt STURQ $cc+1$} ($b$ $x$ -- $b$ $x$ $f$ or $b'$ $0$). \item {\tt CF10} --- a longer version of {\tt STREF} ($c$ $b$ -- $b'$). \item {\tt CF11} --- {\tt STBREF} ($b'$ $b$ -- $b''$), equivalent to {\tt SWAP}; {\tt STBREFREV}. \item {\tt CF12} --- a longer version of {\tt STSLICE} ($s$ $b$ -- $b'$). \item {\tt CF13} --- {\tt STB} ($b'$ $b$ -- $b''$), appends all data from {\em Builder\/} $b'$ to {\em Builder\/} $b$. \item {\tt CF14} --- {\tt STREFR} ($b$ $c$ -- $b'$). \item {\tt CF15} --- {\tt STBREFR} ($b$ $b'$ -- $b''$), a longer encoding of {\tt STBREFR}. \item {\tt CF16} --- {\tt STSLICER} ($b$ $s$ -- $b'$). \item {\tt CF17} --- {\tt STBR} ($b$ $b'$ -- $b''$), concatenates two {\em Builder\/}s, equivalent to {\tt SWAP}; {\tt STB}. \item {\tt CF18} --- {\tt STREFQ} ($c$ $b$ -- $c$ $b$ $-1$ or $b'$ $0$). \item {\tt CF19} --- {\tt STBREFQ} ($b'$ $b$ -- $b'$ $b$ $-1$ or $b''$ $0$). \item {\tt CF1A} --- {\tt STSLICEQ} ($s$ $b$ -- $s$ $b$ $-1$ or $b'$ $0$). \item {\tt CF1B} --- {\tt STBQ} ($b'$ $b$ -- $b'$ $b$ $-1$ or $b''$ $0$). \item {\tt CF1C} --- {\tt STREFRQ} ($b$ $c$ -- $b$ $c$ $-1$ or $b'$ $0$). \item {\tt CF1D} --- {\tt STBREFRQ} ($b$ $b'$ -- $b$ $b'$ $-1$ or $b''$ $0$). \item {\tt CF1E} --- {\tt STSLICERQ} ($b$ $s$ -- $b$ $s$ $-1$ or $b''$ $0$). \item {\tt CF1F} --- {\tt STBRQ} ($b$ $b'$ -- $b$ $b'$ $-1$ or $b''$ $0$). \item {\tt CF20} --- {\tt STREFCONST}, equivalent to {\tt PUSHREF}; {\tt STREFR}. \item {\tt CF21} --- {\tt STREF2CONST}, equivalent to {\tt STREFCONST}; {\tt STREFCONST}. \item {\tt CF23} --- {\tt ENDXC} ($b$ $x$ -- $c$), if $x\neq0$, creates a {\em special\/} or {\em exotic\/} cell (cf.~\ptref{sp:exotic.cells}) from {\em Builder\/} $b$. The type of the exotic cell must be stored in the first 8 bits of~$b$. If $x=0$, it is equivalent to {\tt ENDC}. Otherwise some validity checks on the data and references of $b$ are performed before creating the exotic cell. \item {\tt CF28} --- {\tt STILE4} ($x$ $b$ -- $b'$), stores a little-endian signed 32-bit integer. \item {\tt CF29} --- {\tt STULE4} ($x$ $b$ -- $b'$), stores a little-endian unsigned 32-bit integer. \item {\tt CF2A} --- {\tt STILE8} ($x$ $b$ -- $b'$), stores a little-endian signed 64-bit integer. \item {\tt CF2B} --- {\tt STULE8} ($x$ $b$ -- $b'$), stores a little-endian unsigned 64-bit integer. \item {\tt CF31} --- {\tt BBITS} ($b$ -- $x$), returns the number of data bits already stored in {\em Builder\/} $b$. \item {\tt CF32} --- {\tt BREFS} ($b$ -- $y$), returns the number of cell references already stored in $b$. \item {\tt CF33} --- {\tt BBITREFS} ($b$ -- $x$ $y$), returns the numbers of both data bits and cell references in $b$. \item {\tt CF35} --- {\tt BREMBITS} ($b$ -- $x'$), returns the number of data bits that can still be stored in $b$. \item {\tt CF36} --- {\tt BREMREFS} ($b$ -- $y'$). \item {\tt CF37} --- {\tt BREMBITREFS} ($b$ -- $x'$ $y'$). 
\item {\tt CF38$cc$} --- {\tt BCHKBITS $cc+1$} ($b$ --), checks whether $cc+1$ bits can be stored into $b$, where $0\leq cc\leq 255$. \item {\tt CF39} --- {\tt BCHKBITS} ($b$ $x$ -- ), checks whether $x$ bits can be stored into $b$, $0\leq x\leq 1023$. If there is no space for $x$ more bits in $b$, or if $x$ is not within the range $0\ldots1023$, throws an exception. \item {\tt CF3A} --- {\tt BCHKREFS} ($b$ $y$ -- ), checks whether $y$ references can be stored into $b$, $0\leq y\leq 7$. \item {\tt CF3B} --- {\tt BCHKBITREFS} ($b$ $x$ $y$ -- ), checks whether $x$ bits and $y$ references can be stored into $b$, $0\leq x\leq 1023$, $0\leq y\leq 7$. \item {\tt CF3C$cc$} --- {\tt BCHKBITSQ $cc+1$} ($b$ -- $?$), checks whether $cc+1$ bits can be stored into $b$, where $0\leq cc\leq 255$. \item {\tt CF3D} --- {\tt BCHKBITSQ} ($b$ $x$ -- $?$), checks whether $x$ bits can be stored into $b$, $0\leq x\leq 1023$. \item {\tt CF3E} --- {\tt BCHKREFSQ} ($b$ $y$ -- $?$), checks whether $y$ references can be stored into $b$, $0\leq y\leq 7$. \item {\tt CF3F} --- {\tt BCHKBITREFSQ} ($b$ $x$ $y$ -- $?$), checks whether $x$ bits and $y$ references can be stored into $b$, $0\leq x\leq 1023$, $0\leq y\leq 7$. \item {\tt CF40} --- {\tt STZEROES} ($b$ $n$ -- $b'$), stores $n$ binary zeroes into {\em Builder} $b$. \item {\tt CF41} --- {\tt STONES} ($b$ $n$ -- $b'$), stores $n$ binary ones into {\em Builder} $b$. \item {\tt CF42} --- {\tt STSAME} ($b$ $n$ $x$ -- $b'$), stores $n$ binary $x$es ($0\leq x\leq1$) into {\em Builder} $b$. \item {\tt CFC0\_$xysss$} --- {\tt STSLICECONST $sss$} ($b$ -- $b'$), stores a constant subslice $sss$ consisting of $0\leq x\leq 3$ references and up to $8y+1$ data bits, with $0\leq y\leq 7$. Completion bit is assumed. \item {\tt CF81} --- {\tt STSLICECONST `0'} or {\tt STZERO} ($b$ -- $b'$), stores one binary zero. \item {\tt CF83} --- {\tt STSLICECONST `1'} or {\tt STONE} ($b$ -- $b'$), stores one binary one. \item {\tt CFA2} --- equivalent to {\tt STREFCONST}. \item {\tt CFA3} --- almost equivalent to {\tt STSLICECONST `1'}; {\tt STREFCONST}. \item {\tt CFC2} --- equivalent to {\tt STREF2CONST}. \item {\tt CFE2} --- {\tt STREF3CONST}. \end{itemize} \nxsubpoint\emb{Cell deserialization primitives}\label{sp:prim.deser} \begin{itemize} \item {\tt D0} --- {\tt CTOS} ($c$ -- $s$), converts a {\em Cell\/} into a {\em Slice}. Notice that $c$ must be either an ordinary cell, or an exotic cell (cf.~\ptref{sp:exotic.cells}) which is automatically {\em loaded\/} to yield an ordinary cell $c'$, converted into a {\em Slice} afterwards. \item {\tt D1} --- {\tt ENDS} ($s$ -- ), removes a {\em Slice\/} $s$ from the stack, and throws an exception if it is not empty. \item {\tt D2$cc$} --- {\tt LDI $cc+1$} ($s$ -- $x$ $s'$), loads (i.e., parses) a signed $cc+1$-bit integer $x$ from {\em Slice\/} $s$, and returns the remainder of $s$ as $s'$. \item {\tt D3$cc$} --- {\tt LDU $cc+1$} ($s$ -- $x$ $s'$), loads an unsigned $cc+1$-bit integer $x$ from {\em Slice\/} $s$. \item {\tt D4} --- {\tt LDREF} ($s$ -- $c$ $s'$), loads a cell reference $c$ from $s$. \item {\tt D5} --- {\tt LDREFRTOS} ($s$ -- $s'$ $s''$), equivalent to {\tt LDREF}; {\tt SWAP}; {\tt CTOS}. \item {\tt D6$cc$} --- {\tt LDSLICE $cc+1$} ($s$ -- $s''$ $s'$), cuts the next $cc+1$ bits of $s$ into a separate {\em Slice\/} $s''$. \item {\tt D700} --- {\tt LDIX} ($s$ $l$ -- $x$ $s'$), loads a signed $l$-bit ($0\leq l\leq 257$) integer $x$ from {\em Slice\/} $s$, and returns the remainder of $s$ as~$s'$. 
\item {\tt D701} --- {\tt LDUX} ($s$ $l$ -- $x$ $s'$), loads an unsigned $l$-bit integer $x$ from (the first $l$ bits of) $s$, with $0\leq l\leq 256$.
\item {\tt D702} --- {\tt PLDIX} ($s$ $l$ -- $x$), preloads a signed $l$-bit integer from {\em Slice\/} $s$, for $0\leq l\leq 257$.
\item {\tt D703} --- {\tt PLDUX} ($s$ $l$ -- $x$), preloads an unsigned $l$-bit integer from $s$, for $0\leq l\leq 256$.
\item {\tt D704} --- {\tt LDIXQ} ($s$ $l$ -- $x$ $s'$ $-1$ or $s$ $0$), quiet version of {\tt LDIX}: loads a signed $l$-bit integer from $s$ similarly to {\tt LDIX}, but returns a success flag, equal to $-1$ on success or to $0$ on failure (if $s$ does not have $l$ bits), instead of throwing a cell underflow exception.
\item {\tt D705} --- {\tt LDUXQ} ($s$ $l$ -- $x$ $s'$ $-1$ or $s$ $0$), quiet version of {\tt LDUX}.
\item {\tt D706} --- {\tt PLDIXQ} ($s$ $l$ -- $x$ $-1$ or $0$), quiet version of {\tt PLDIX}.
\item {\tt D707} --- {\tt PLDUXQ} ($s$ $l$ -- $x$ $-1$ or $0$), quiet version of {\tt PLDUX}.
\item {\tt D708$cc$} --- {\tt LDI $cc+1$} ($s$ -- $x$ $s'$), a longer encoding for {\tt LDI}.
\item {\tt D709$cc$} --- {\tt LDU $cc+1$} ($s$ -- $x$ $s'$), a longer encoding for {\tt LDU}.
\item {\tt D70A$cc$} --- {\tt PLDI $cc+1$} ($s$ -- $x$), preloads a signed $cc+1$-bit integer from {\em Slice\/} $s$.
\item {\tt D70B$cc$} --- {\tt PLDU $cc+1$} ($s$ -- $x$), preloads an unsigned $cc+1$-bit integer from $s$.
\item {\tt D70C$cc$} --- {\tt LDIQ $cc+1$} ($s$ -- $x$ $s'$ $-1$ or $s$ $0$), a quiet version of {\tt LDI}.
\item {\tt D70D$cc$} --- {\tt LDUQ $cc+1$} ($s$ -- $x$ $s'$ $-1$ or $s$ $0$), a quiet version of {\tt LDU}.
\item {\tt D70E$cc$} --- {\tt PLDIQ $cc+1$} ($s$ -- $x$ $-1$ or $0$), a quiet version of {\tt PLDI}.
\item {\tt D70F$cc$} --- {\tt PLDUQ $cc+1$} ($s$ -- $x$ $-1$ or $0$), a quiet version of {\tt PLDU}.
\item {\tt D714\_$c$} --- {\tt PLDUZ $32(c+1)$} ($s$ -- $s$ $x$), preloads the first $32(c+1)$ bits of {\em Slice\/} $s$ into an unsigned integer $x$, for $0\leq c\leq 7$. If $s$ is shorter than necessary, missing bits are assumed to be zero. This operation is intended to be used along with {\tt IFBITJMP} and similar instructions.
\item {\tt D718} --- {\tt LDSLICEX} ($s$ $l$ -- $s''$ $s'$), loads the first $0\leq l\leq 1023$ bits from {\em Slice\/} $s$ into a separate {\em Slice\/} $s''$, returning the remainder of $s$ as $s'$.
\item {\tt D719} --- {\tt PLDSLICEX} ($s$ $l$ -- $s''$), returns the first $0\leq l\leq 1023$ bits of $s$ as $s''$.
\item {\tt D71A} --- {\tt LDSLICEXQ} ($s$ $l$ -- $s''$ $s'$ $-1$ or $s$ $0$), a quiet version of {\tt LDSLICEX}.
\item {\tt D71B} --- {\tt PLDSLICEXQ} ($s$ $l$ -- $s''$ $-1$ or $0$), a quiet version of {\tt PLDSLICEX}.
\item {\tt D71C$cc$} --- {\tt LDSLICE $cc+1$} ($s$ -- $s''$ $s'$), a longer encoding for {\tt LDSLICE}.
\item {\tt D71D$cc$} --- {\tt PLDSLICE $cc+1$} ($s$ -- $s''$), returns the first $0<cc+1\leq 256$ bits of $s$ as $s''$.
\end{itemize}

\mysubsection{Application-specific primitives}

\nxsubpoint\emb{Message and address manipulation primitives} The message and address manipulation primitives listed below serialize and deserialize values according to the following TL-B scheme (cf.~\cite{TON}):
\begin{verbatim}
addr_none$00 = MsgAddressExt;
addr_extern$01 len:(## 9) external_address:(bits len)
             = MsgAddressExt;
anycast_info$_ depth:(#<= 30) { depth >= 1 }
   rewrite_pfx:(bits depth) = Anycast;
addr_std$10 anycast:(Maybe Anycast)
   workchain_id:int8 address:bits256 = MsgAddressInt;
addr_var$11 anycast:(Maybe Anycast) addr_len:(## 9)
   workchain_id:int32 address:(bits addr_len) = MsgAddressInt;
_ _:MsgAddressInt = MsgAddress;
_ _:MsgAddressExt = MsgAddress;

int_msg_info$0 ihr_disabled:Bool bounce:Bool bounced:Bool
  src:MsgAddress dest:MsgAddressInt
  value:CurrencyCollection ihr_fee:Grams fwd_fee:Grams
  created_lt:uint64 created_at:uint32 = CommonMsgInfoRelaxed;
ext_out_msg_info$11 src:MsgAddress dest:MsgAddressExt
  created_lt:uint64 created_at:uint32 = CommonMsgInfoRelaxed;
\end{verbatim}
A deserialized {\tt MsgAddress} is represented by a {\em Tuple\/}~$t$ as follows:
\begin{itemize}
\item {\tt addr\_none} is represented by $t=(0)$, i.e., a {\em Tuple\/} containing exactly one {\em Integer\/} equal to zero.
\item {\tt addr\_extern} is represented by $t=(1,s)$, where {\em Slice\/}~$s$ contains the field {\tt external\_address}. In other words, $t$ is a pair (a {\em Tuple\/} consisting of two entries), containing an {\em Integer\/} equal to one and {\em Slice}~$s$.
\item {\tt addr\_std} is represented by $t=(2,u,x,s)$, where $u$ is either a {\em Null\/} (if {\tt anycast} is absent) or a {\em Slice\/}~$s'$ containing {\tt rewrite\_pfx} (if {\tt anycast} is present). Next, {\em Integer\/}~$x$ is the {\tt workchain\_id}, and {\em Slice\/}~$s$ contains the {\tt address}.
\item {\tt addr\_var} is represented by $t=(3,u,x,s)$, where $u$, $x$, and $s$ have the same meaning as for {\tt addr\_std}.
\end{itemize}
The following primitives, which use the above conventions, are defined:
\begin{itemize}
\item {\tt FA40} --- {\tt LDMSGADDR} ($s$ -- $s'$ $s''$), loads from {\em CellSlice\/}~$s$ the only prefix that is a valid {\tt MsgAddress}, and returns both this prefix $s'$ and the remainder $s''$ of~$s$ as {\em CellSlice\/}s.
\item {\tt FA41} --- {\tt LDMSGADDRQ} ($s$ -- $s'$ $s''$ $-1$ or $s$ $0$), a quiet version of {\tt LDMSGADDR}: on success, pushes an extra $-1$; on failure, pushes the original~$s$ and a zero.
\item {\tt FA42} --- {\tt PARSEMSGADDR} ($s$ -- $t$), decomposes {\em CellSlice\/}~$s$ containing a valid {\tt MsgAddress} into a {\em Tuple\/}~$t$ with separate fields of this {\tt MsgAddress}. If $s$ is not a valid {\tt MsgAddress}, a cell deserialization exception is thrown.
\item {\tt FA43} --- {\tt PARSEMSGADDRQ} ($s$ -- $t$ $-1$ or $0$), a quiet version of {\tt PARSEMSGADDR}: returns a zero on error instead of throwing an exception.
\item {\tt FA44} --- {\tt REWRITESTDADDR} ($s$ -- $x$ $y$), parses {\em CellSlice\/}~$s$ containing a valid {\tt MsgAddressInt} (usually a {\tt msg\_addr\_std}), applies rewriting from the {\tt anycast} (if present) to the same-length prefix of the address, and returns both the workchain $x$ and the 256-bit address $y$ as {\em Integer\/}s. If the address is not 256-bit, or if $s$ is not a valid serialization of {\tt MsgAddressInt}, throws a cell deserialization exception.
\item {\tt FA45} --- {\tt REWRITESTDADDRQ} ($s$ -- $x$ $y$ $-1$ or $0$), a quiet version of primitive {\tt REWRITESTDADDR}.
\item {\tt FA46} --- {\tt REWRITEVARADDR} ($s$ -- $x$ $s'$), a variant of {\tt REWRITESTDADDR} that returns the (rewritten) address as a {\em Slice\/}~$s'$, even if it is not exactly 256 bits long (as is the case for a {\tt msg\_addr\_var}).
\item {\tt FA47} --- {\tt REWRITEVARADDRQ} ($s$ -- $x$ $s'$ $-1$ or $0$), a quiet version of primitive {\tt REWRITEVARADDR}.
\item {\tt FA48}--{\tt FA5F} --- Reserved for message and address manipulation primitives.
\end{itemize}

\nxsubpoint\emb{Outbound message and output action primitives}
\begin{itemize}
\item {\tt FB00} --- {\tt SENDRAWMSG} ($c$ $x$ -- ), sends a raw message contained in {\em Cell\/}~$c$, which should contain a correctly serialized object {\tt Message $X$}, with the only exception that the source address is allowed to have the dummy value {\tt addr\_none} (to be automatically replaced with the current smart-contract address), and the {\tt ihr\_fee}, {\tt fwd\_fee}, {\tt created\_lt}, and {\tt created\_at} fields can have arbitrary values (to be rewritten with correct values during the action phase of the current transaction). Integer parameter $x$ contains the flags. Currently $x=0$ is used for ordinary messages, and $x=128$ for messages that are to carry all the remaining balance of the current smart contract (instead of the value originally indicated in the message); $x=64$ is used for messages that carry all the remaining value of the inbound message in addition to the value initially indicated in the new message (if bit 0 is not set, the gas fees are deducted from this amount). Adding $1$ to $x$ means that the sender wants to pay transfer fees separately; adding $2$ means that any errors arising while processing this message during the action phase should be ignored.
\item {\tt FB02} --- {\tt RAWRESERVE} ($x$ $y$ -- ), creates an output action which would reserve exactly $x$ nanograms (if $y=0$), at most $x$ nanograms (if $y=2$), or all but $x$ nanograms (if $y=1$ or $y=3$) from the remaining balance of the account. It is roughly equivalent to creating an outbound message carrying $x$ nanograms (or $b-x$ nanograms, where $b$ is the remaining balance) to oneself, so that the subsequent output actions would not be able to spend more money than the remainder. Bit $+2$ in $y$ means that the external action does not fail if the specified amount cannot be reserved; instead, all of the remaining balance is reserved. Currently $x$ must be a non-negative integer, and $y$ must be in the range $0\ldots 3$.
\item {\tt FB03} --- {\tt RAWRESERVEX} ($s$ $y$ -- ), similar to {\tt RAWRESERVE}, but accepts a {\em Slice\/}~$s$ with a {\em CurrencyCollection\/} as an argument. In this way currencies other than Grams can be reserved.
\item {\tt FB04} --- {\tt SETCODE} ($c$ -- ), creates an output action that would change this smart contract's code to that given by {\em Cell\/}~$c$. Notice that this change will take effect only after the successful termination of the current run of the smart contract.
\item {\tt FB05}--{\tt FB3F} --- Reserved for output action primitives.
\end{itemize}

\mysubsection{Debug primitives}\label{p:prim.debug}

Opcodes beginning with {\tt FE} are reserved for the {\em debug primitives}. These primitives have known fixed operation length, and behave as (multibyte) {\tt NOP} operations. In particular, they never change the stack contents, and never throw exceptions, unless there are not enough bits to completely decode the opcode. However, when invoked in a TVM instance with debug mode enabled, these primitives can produce specific output in the text debug log of the TVM instance, never affecting the TVM state (so that, from the perspective of TVM, the behavior of debug primitives is exactly the same whether or not debug mode is enabled).
For instance, a debug primitive might dump all or some of the values near the top of the stack, display the current state of TVM, and so on.

\nxsubpoint\emb{Debug primitives as multibyte NOPs}
\begin{itemize}
\item {\tt FE$nn$} --- {\tt DEBUG $nn$}, for $0\leq nn<240$, is a two-byte NOP.
\item {\tt FEF$nssss$} --- {\tt DEBUGSTR $ssss$}, for $0\leq n<16$, is an $(n+3)$-byte NOP, with the $(n+1)$-byte ``contents string'' $ssss$ skipped as well.
\end{itemize}

\nxsubpoint\emb{Debug primitives as operations without side-effect} Next we describe the debug primitives that might be (and actually are) implemented in a version of TVM. Notice that another TVM implementation is free to use these codes for other debug purposes, or to treat them as multibyte NOPs. Whenever these primitives need some arguments from the stack, they inspect these arguments, but leave them intact in the stack. If there are insufficient values in the stack, or they have incorrect types, debug primitives may output error messages into the debug log, or behave as NOPs, but they cannot throw exceptions.
\begin{itemize}
\item {\tt FE00} --- {\tt DUMPSTK}, dumps the stack (at most the top 255 values) and shows the total stack depth.
\item {\tt FE0$n$} --- {\tt DUMPSTKTOP $n$}, $1\leq n<15$, dumps the top $n$ values from the stack, starting from the deepest of them. If there are $d<n$ values in the stack, dumps all $d$ of them.
\end{itemize}

\clearpage
\myappendix{Code density of stack and register machines}

This appendix compares the machine code that might be generated by an optimizing compiler for several hypothetical register machines and stack machines from the same source files.

\mysubsection{Sample leaf function}

\nxsubpoint\label{sp:cmp1.source}\emb{Sample source file for a leaf function} We start with a simple ``leaf'' function (i.e., a function that does not call any other functions), which takes six integer arguments, $a$, $b$, $c$, $d$, $e$, $f$, and returns two integer values, $x$ and $y$---the solution of the system of linear equations $ax+by=e$, $cx+dy=f$, computed by Cramer's rule:
\begin{verbatim}
(int, int) f(int a, int b, int c, int d, int e, int f) {
  int D = a*d - b*c;
  int Dx = e*d - b*f;
  int Dy = a*f - e*c;
  return (Dx / D, Dy / D);
}
\end{verbatim}

\nxsubpoint\label{sp:cmp1.3addr}\emb{Three-address register machine} A three-address register machine has sixteen registers {\tt r0}--{\tt r15} and arithmetic instructions of the form {\em op\/} {\tt r$(i)$,r$(j)$,r$(k)$}, which set {\tt r$(i)$} to the result of applying {\em op\/} to {\tt r$(j)$} and {\tt r$(k)$}. Because the destination register may always be chosen freely, its code for the function above consists of the eleven arithmetic operations required and a final {\tt RET}: twelve operations in total, occupying 23 bytes under the ``optimistic'' encoding (two-byte encodings for all arithmetic instructions) or 31 bytes under the ``realistic'' encoding discussed in~\ptref{sp:cmp1.summary} (cf.~Table~\ptref{tab:cmp1.code}).

\nxsubpoint\label{sp:cmp1.2addr}\emb{Two-address register machine} A two-address register machine uses arithmetic instructions of the form {\em op\/} {\tt r$(i)$,r$(j)$}, which replace {\tt r$(i)$} with the result of applying {\em op\/} to {\tt r$(i)$} and {\tt r$(j)$}. Its code for the same function has to employ four additional two-byte {\tt MOV} instructions to save values that would otherwise be overwritten, leading to 16 operations and 31 code bytes in total (cf.~Table~\ptref{tab:cmp1.code}).

\nxsubpoint\label{sp:cmp1.1addr}\emb{One-address register machine} A one-address register machine uses an accumulator {\tt r0}: each arithmetic instruction mentions only one other register explicitly, taking its second argument from, and returning its result into, the accumulator (e.g., {\tt IMUL r2} sets {\tt r0} to {\tt r0}${}\times{}${\tt r2}). Its machine code for the same function might look as follows:
\begin{verbatim}
MOV r8,r0  // r8 := a
XCHG r1    // r0 <-> r1; r0 := b, r1 := a
MOV r6,r0  // r6 := b
IMUL r2    // r0 := r0*r2; r0 := bc
MOV r7,r0  // r7 := bc
MOV r0,r8  // r0 := a
IMUL r3    // r0 := ad
SUB r7     // r0 := ad-bc = D
XCHG r1    // r1 := D, r0 := b
IMUL r5    // r0 := bf
XCHG r3    // r0 := d, r3 := bf
IMUL r4    // r0 := de
SUB r3     // r0 := de-bf = Dx
IDIV r1    // r0 := Dx/D = x
XCHG r2    // r0 := c, r2 := x
IMUL r4    // r0 := ce
XCHG r5    // r0 := f, r5 := ce
IMUL r8    // r0 := af
SUB r5     // r0 := af-ce = Dy
IDIV r1    // r0 := Dy/D = y
MOV r1,r0  // r1 := y
MOV r0,r2  // r0 := x
RET
\end{verbatim}
We have used 23 operations; if we assume one-byte encoding for all arithmetic operations and \texttt{XCHG}, and two-byte encodings for \texttt{MOV}, the total size of the code will be 29 bytes. Notice, however, that to obtain the compact code shown above we had to choose a specific order of computation, and made heavy use of the commutativity of multiplication. (For example, we compute $bc$ before $ad$, and $ad-bc$ immediately after $ad$.) It is not clear whether a compiler would be able to make all such optimizations by itself.

\nxsubpoint\label{sp:cmp1.stack.base} \emb{Stack machine with basic stack primitives} The machine code for a stack machine equipped with basic stack manipulation primitives described in~\ptref{sp:stack.basic} might look as follows:
\begin{verbatim}
PUSH s5    // a b c d e f a
PUSH s3    // a b c d e f a d
IMUL       // a b c d e f ad
PUSH s5    // a b c d e f ad b
PUSH s5    // a b c d e f ad b c
IMUL       // a b c d e f ad bc
SUB        // a b c d e f ad-bc
XCHG s3    // a b c ad-bc e f d
PUSH s2    // a b c ad-bc e f d e
IMUL       // a b c ad-bc e f de
XCHG s5    // a de c ad-bc e f b
PUSH s1    // a de c ad-bc e f b f
IMUL       // a de c ad-bc e f bf
XCHG s1,s5 // a f c ad-bc e de bf
SUB        // a f c ad-bc e de-bf
XCHG s3    // a f de-bf ad-bc e c
IMUL       // a f de-bf ad-bc ec
XCHG s3    // a ec de-bf ad-bc f
XCHG s1,s4 // ad-bc ec de-bf a f
IMUL       // D ec Dx af
XCHG s1    // D ec af Dx
XCHG s2    // D Dx af ec
SUB        // D Dx Dy
XCHG s1    // D Dy Dx
PUSH s2    // D Dy Dx D
IDIV       // D Dy x
XCHG s2    // x Dy D
IDIV       // x y
RET
\end{verbatim}
We have used 29 operations; assuming one-byte encodings for all stack operations involved (including \texttt{XCHG s1,s$(i)$}), we have used 29 code bytes as well.
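The listing above can also be checked mechanically. The following C++ sketch (a hypothetical test harness, not part of any TVM implementation; divisions are exact for the chosen inputs, so rounding conventions do not matter) executes the same sequence of operations on the system $x+y=5$, $x+2y=8$ and verifies the result against Cramer's formulas:
\begin{verbatim}
#include <cassert>
#include <utility>
#include <vector>

using Stack = std::vector<long long>;
long long& s(Stack& st, int i) { return st[st.size() - 1 - i]; }
void PUSH(Stack& st, int i) { st.push_back(s(st, i)); }
void XCHG(Stack& st, int i, int j) { std::swap(s(st, i), s(st, j)); }
// IMUL/SUB/IDIV pop s0 and s1, push (s1 op s0).
void BINOP(Stack& st, char op) {
    long long y = st.back(); st.pop_back();
    long long& x = st.back();
    x = (op == '*') ? x * y : (op == '-') ? x - y : x / y;
}

int main() {
    Stack st = {1, 1, 1, 2, 5, 8};  // a b c d e f; solution x=2, y=3
    PUSH(st,5); PUSH(st,3); BINOP(st,'*');                  // ad
    PUSH(st,5); PUSH(st,5); BINOP(st,'*'); BINOP(st,'-');   // D = ad-bc
    XCHG(st,0,3); PUSH(st,2); BINOP(st,'*');                // de
    XCHG(st,0,5); PUSH(st,1); BINOP(st,'*');                // bf
    XCHG(st,1,5); BINOP(st,'-');                            // Dx = de-bf
    XCHG(st,0,3); BINOP(st,'*'); XCHG(st,0,3);              // ec
    XCHG(st,1,4); BINOP(st,'*');                            // af
    XCHG(st,0,1); XCHG(st,0,2); BINOP(st,'-');              // Dy = af-ec
    XCHG(st,0,1); PUSH(st,2); BINOP(st,'/');                // x = Dx/D
    XCHG(st,0,2); BINOP(st,'/');                            // y = Dy/D
    assert(s(st,1) == 2 && s(st,0) == 3);                   // x y
    return 0;
}
\end{verbatim}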
Notice that with one-byte encoding, the ``unsystematic'' operation \texttt{ROT} (equivalent to \texttt{XCHG s1; XCHG s2}) would reduce the operation and byte count to 28. This shows that such ``unsystematic'' operations, borrowed from Forth, may indeed reduce the code size on some occasions.

Notice as well that we have implicitly used the commutativity of multiplication in this code, computing $de-bf$ instead of $ed-bf$ as specified in the high-level language source code. If we were not allowed to do so, an extra \texttt{XCHG s1} would need to be inserted before the third \texttt{IMUL}, increasing the total size of the code by one operation and one byte.

The code presented above might have been produced by a rather unsophisticated compiler that simply computed all expressions and subexpressions in the order they appear, then rearranged the arguments near the top of the stack before each operation as outlined in~\ptref{sp:basic.stack.suff}. The only ``manual'' optimization done here involves computing $ec$ before $af$; one can check that the other order would lead to slightly shorter code of 28 operations and bytes (or 29, if we are not allowed to use the commutativity of multiplication), but the \texttt{ROT} optimization would not be applicable.

\nxsubpoint\emb{Stack machine with compound stack primitives} A stack machine with compound stack primitives (cf.~\ptref{sp:stack.comp}) would not significantly improve the density of the code presented above, at least in terms of bytes used. The only difference is that, if we were not allowed to use commutativity of multiplication, the extra \texttt{XCHG s1} inserted before the third \texttt{IMUL} might be combined with two previous operations \texttt{XCHG s3}, \texttt{PUSH s2} into one compound operation \texttt{PUXC s2,s3}; we provide the resulting code below. To make the comparison less redundant, we show a version of the code that computes subexpression $af$ before $ec$, as specified in the original source file. We see that this replaces six operations (starting from line 15) with five other operations, and disables the \texttt{ROT} optimization:
\begin{verbatim}
PUSH s5    // a b c d e f a
PUSH s3    // a b c d e f a d
IMUL       // a b c d e f ad
PUSH s5    // a b c d e f ad b
PUSH s5    // a b c d e f ad b c
IMUL       // a b c d e f ad bc
SUB        // a b c d e f ad-bc
PUXC s2,s3 // a b c ad-bc e f e d
IMUL       // a b c ad-bc e f ed
XCHG s5    // a ed c ad-bc e f b
PUSH s1    // a ed c ad-bc e f b f
IMUL       // a ed c ad-bc e f bf
XCHG s1,s5 // a f c ad-bc e ed bf
SUB        // a f c ad-bc e ed-bf
XCHG s4    // a ed-bf c ad-bc e f
XCHG s1,s5 // e Dx c D a f
IMUL       // e Dx c D af
XCHG s2    // e Dx af D c
XCHG s1,s4 // D Dx af e c
IMUL       // D Dx af ec
SUB        // D Dx Dy
XCHG s1    // D Dy Dx
PUSH s2    // D Dy Dx D
IDIV       // D Dy x
XCHG s2    // x Dy D
IDIV       // x y
RET
\end{verbatim}
We have used a total of 27 operations and 28 bytes, the same as the previous version (with the \texttt{ROT} optimization). However, we did not use the commutativity of multiplication here, so we can say that compound stack manipulation primitives enable us to reduce the code size from 29 to 28 bytes.

Yet again, notice that the above code might have been generated by an unsophisticated compiler. Manual optimizations might lead to more compact code; for instance, we could use compound operations such as \texttt{XCHG3} to prepare in advance not only the correct values of \texttt{s0} and \texttt{s1} for the next arithmetic operation, but also the value of \texttt{s2} for the arithmetic operation after that.
The next section provides an example of such an optimization.

\nxsubpoint\label{sp:cmp1.stack.comp} \emb{Stack machine with compound stack primitives and manually optimized code} The previous version of code for a stack machine with compound stack primitives can be manually optimized as follows.

By interchanging \texttt{XCHG} operations with preceding \texttt{XCHG}, \texttt{PUSH}, and arithmetic operations whenever possible, we obtain the code fragment \texttt{XCHG s2,s6}; \texttt{XCHG s1,s0}; \texttt{XCHG s0,s5}, which can then be replaced by the compound operation \texttt{XCHG3 s6,s0,s5}. This compound operation would admit a two-byte encoding, thus leading to 27-byte code using only 21 operations:
\begin{verbatim}
PUSH2 s5,s2    // a b c d e f a d
IMUL           // a b c d e f ad
PUSH2 s5,s4    // a b c d e f ad b c
IMUL           // a b c d e f ad bc
SUB            // a b c d e f ad-bc
PUXC s2,s3     // a b c ad-bc e f e d
IMUL           // a b c D e f ed
XCHG3 s6,s0,s5 // (same as XCHG s2,s6; XCHG s1,s0; XCHG s0,s5)
               // e f c D a ed b
PUSH s5        // e f c D a ed b f
IMUL           // e f c D a ed bf
SUB            // e f c D a ed-bf
XCHG s4        // e Dx c D a f
IMUL           // e Dx c D af
XCHG2 s4,s2    // D Dx af e c
IMUL           // D Dx af ec
SUB            // D Dx Dy
XCPU s1,s2     // D Dy Dx D
IDIV           // D Dy x
XCHG s2        // x Dy D
IDIV           // x y
RET
\end{verbatim}
It is interesting to note that this version of stack machine code contains only 9 stack manipulation primitives for 11 arithmetic operations. It is not clear, however, whether an optimizing compiler would be able to reorganize the code in such a manner by itself.

\mysubsection{Comparison of machine code for sample leaf function}\label{sp:cmp1.summary}

Table \ptref{tab:cmp1.code} summarizes the properties of machine code corresponding to the same source file described in \ptref{sp:cmp1.source}, generated for a hypothetical three-address register machine (cf.~\ptref{sp:cmp1.3addr}), with both ``optimistic'' and ``realistic'' instruction encodings; a two-address machine (cf.~\ptref{sp:cmp1.2addr}); a one-address machine (cf.~\ptref{sp:cmp1.1addr}); and a stack machine, similar to TVM, using either only the basic stack manipulation primitives (cf.~\ptref{sp:cmp1.stack.base}) or both the basic and the composite stack primitives (cf.~\ptref{sp:cmp1.stack.comp}).

The meaning of the columns in Table \ptref{tab:cmp1.code} is as follows:
\begin{itemize}
\item ``Operations'' --- The quantity of instructions used, split into ``data'' (i.e., register move and exchange instructions for register machines, and stack manipulation instructions for stack machines) and ``arithmetic'' (instructions for adding, subtracting, multiplying and dividing integer numbers). The ``total'' is one more than the sum of these two, because there is also a one-byte \texttt{RET} instruction at the end of machine code.
\item ``Code bytes'' --- The total amount of code bytes used.
\item ``Opcode space'' --- The portion of ``opcode space'' (i.e., of possible choices for the first byte of the encoding of an instruction) used by data and arithmetic instructions in the assumed instruction encoding. For example, the ``optimistic'' encoding for the three-address machine assumes two-byte encodings for all arithmetic instructions {\em op\/} \texttt{r$(i)$, r$(j)$, r$(k)$}. Each arithmetic instruction would then consume portion $16/256=1/16$ of the opcode space. Notice that for the stack machine we have assumed one-byte encodings for \texttt{XCHG s$(i)$}, \texttt{PUSH s$(i)$} and \texttt{POP s$(i)$} in all cases, augmented by \texttt{XCHG s1,s$(i)$} for the basic stack instructions case only.
As for the compound stack operations, we have assumed two-byte encodings for \texttt{PUSH3}, \texttt{XCHG3}, \texttt{XCHG2}, \texttt{XCPU}, \texttt{PUXC}, \texttt{PUSH2}, but not for \texttt{XCHG s1,s$(i)$}. \end{itemize} \begin{table}\captionsetup{font=footnotesize}{\footnotesize \setlength{\tabcolsep}{5pt} \begin{tabular}{|l|ccc|cc>{\bfseries}c|c>{\bfseries}cc|} \hline &\multicolumn{3}{|c|}{Operations}&\multicolumn{3}{|c|}{Code bytes}&\multicolumn{3}{|c|}{Opcode space}\\ Machine&data&arith&total&data&arith&total&data&arith&total\\ \hline 3-addr. (opt.)& 0 & 11 & 12 & 0 & 22 & 23 & 0/256 & 64/256 & 65/256 \\ 3-addr. (real.)& 0 & 11 & 12 & 0 & 30 & 31 & 0/256 & 34/256 & 35/256 \\ \hline 2-addr. & 4 & 11 & 16 & 8 & 22 & 31 & 1/256 & 4/256 & 6/256 \\ \hline 1-addr. & 11 & 11 & 23 & 17 & 11 & 29 & 17/256 & 64/256 & 82/256 \\ \hline stack (basic)& 16 & 11 & 28 & 16 & 11 & 28 & 64/256 & 4/256 & 69/256 \\ stack (comp.)& 9 & 11 & 21 & 15 & 11 & 27 & 84/256 & 4/256 & 89/256 \\ \hline \end{tabular} } \caption{A summary of machine code properties for hypothetical 3-address, 2-address, 1-address, and stack machines, generated for a sample leaf function (cf.~\ptref{sp:cmp1.source}). The two most important columns, reflecting {\bf code density} and {\bf extendability} to other operations, are marked by bold font. Smaller values are better in both of these columns.}\label{tab:cmp1.code} \end{table} The ``code bytes'' column reflects the density of the code for the specific sample source. However, ``opcode space'' is also important, because it reflects the extendability of the achieved density to other classes of operations (e.g., if one were to complement arithmetic operations with string manipulation operations and so on). Here the ``arithmetic'' subcolumn is more important than the ``data'' subcolumn, because no further data manipulation operations would be required for such extensions. We see that the three-address register machine with the ``optimistic'' encoding, assuming two-byte encodings for all three-register arithmetic operations, achieves the best code density, requiring only 23 bytes. However, this comes at a price: each arithmetic operation consumes 1/16 of the opcode space, so the four operations already use a quarter of the opcode space. At most 11 other operations, arithmetic or not, might be added to this architecture while preserving such high code density. On the other hand, when we consider the ``realistic'' encoding for the three-address machine, using two-byte encodings only for the most frequently used addition/subtraction operations (and longer encodings for less frequently used multiplication/division operations, reflecting the fact that the possible extension operations would likely fall in this class), then the three-address machine ceases to offer such attractive code density. In fact, the two-address machine becomes equally attractive at this point: it is capable of achieving the same code size of 31 bytes as the three-address machine with the ``realistic'' encoding, using only 6/256 of the opcode space for this! However, 31 bytes is the worst result in this table. The one-address machine uses 29 bytes, slightly less than the two-address machine. However, it utilizes a quarter of the opcode space for its arithmetic operations, hampering its extendability. In this respect it is similar to the three-address machine with the ``optimistic'' encoding, but requires 29 bytes instead of 23! 
So there is no reason to use the one-address machine at all: it gains little in code density while giving up much in extendability (reflected by the opcode space used for arithmetic operations). Finally, the stack machine wins the competition in terms of code density (27 or 28 bytes), losing only to the three-address machine with the ``optimistic'' encoding (which, however, is terrible in terms of extendability).

To summarize: the two-address machine and stack machine achieve the best extendability with respect to additional arithmetic or data processing instructions (using only 1/256 of code space for each such instruction), while the stack machine additionally achieves the best code density by a small margin. The stack machine utilizes a significant part of its code space (more than a quarter) for data (i.e., stack) manipulation instructions; however, this does not seriously hamper extendability, because the stack manipulation instructions occupy a constant part of the opcode space, regardless of all other instructions and extensions.

While one might still be tempted to use a two-address register machine, we will explain shortly (cf.~\ptref{sect:sample.nonleaf}) why the two-address register machine offers worse code density and extendability in practice than it appears based on this table.

As for the choice between a stack machine with only basic stack manipulation primitives or one supporting compound stack primitives as well, the case for the more sophisticated stack machine appears to be weaker: it offers only one or two fewer bytes of code at the expense of using considerably more opcode space for stack manipulation, and the optimized code using these additional instructions is hard for programmers to write and for compilers to automatically generate.

\nxsubpoint\emb{Register calling conventions: some registers must be preserved by functions} Up to this point, we have considered the machine code of only one function, without taking into account the interplay between this function and other functions in the same program.

Usually a program consists of more than one function, and when a function is not a ``simple'' or ``leaf'' function, it must call other functions. Therefore, it becomes important whether a called function preserves all or at least some registers. If it preserves all registers except those used to return results, the caller can safely keep its local and temporary variables in certain registers; however, the callee needs to save all the registers it will use for its temporary values somewhere (usually into the stack, which also exists on register machines), and then restore the original values. On the other hand, if the called function is allowed to destroy all registers, it can be written in the manner described in \ptref{sp:cmp1.3addr}, \ptref{sp:cmp1.2addr}, and \ptref{sp:cmp1.1addr}, but the caller will now be responsible for saving all its temporary values into the stack before the call, and restoring these values afterwards.

In most cases, calling conventions for register machines require preservation of some but not all registers. We will assume that $m\leq n$ registers will be preserved by functions (unless they are used for return values), and that these registers are $\texttt{r}(n-m)\dots\texttt{r}(n-1)$. Case $m=0$ corresponds to the case ``the callee is free to destroy all registers'' considered so far; it is quite painful for the caller. Case $m=n$ corresponds to the case ``the callee must preserve all registers''; it is quite painful for the callee, as we will see in a moment.
Usually a value of $m$ around $n/2$ is used in practice. The following sections consider cases $m=0$, $m=8$, and $m=16$ for our register machines with $n=16$ registers. \nxsubpoint\emb{Case $m=0$: no registers to preserve} This case has been considered and summarized in \ptref{sp:cmp1.summary} and Table~\ptref{tab:cmp1.code} above. \nxsubpoint\label{sp:cmp1.16}\emb{Case $m=n=16$: all registers must be preserved} This case is the most painful one for the called function. It is especially difficult for leaf functions like the one we have been considering, which do not benefit at all from the fact that other functions preserve some registers when called---they do not call any functions, but instead must preserve all registers themselves. In order to estimate the consequences of assuming $m=n=16$, we will assume that all our register machines are equipped with a stack, and with one-byte instructions \texttt{PUSH r$(i)$} and \texttt{POP r$(i)$}, which push or pop a register into/from the stack. For example, the three-address machine code provided in \ptref{sp:cmp1.3addr} destroys the values in registers \texttt{r2}, \texttt{r3}, \texttt{r6}, and \texttt{r7}; this means that the code of this function must be augmented by four instructions \texttt{PUSH r2}; \texttt{PUSH r3}; \texttt{PUSH r6}; \texttt{PUSH r7} at the beginning, and by four instructions \texttt{POP r7}; \texttt{POP r6}; \texttt{POP r3}; \texttt{POP r2} right before the \texttt{RET} instruction, in order to restore the original values of these registers from the stack. These four additional \texttt{PUSH}/\texttt{POP} pairs would increase the operation count and code size in bytes by $4\times 2=8$. A similar analysis can be done for other register machines as well, leading to Table~\ptref{tab:cmp1.code.b}. \begin{table}\captionsetup{font=footnotesize}{\footnotesize \setlength{\tabcolsep}{4pt} \begin{tabular}{|l|>{\textit\bgroup}c<{\egroup}|ccc|cc>{\bfseries}c|c>{\bfseries}cc|} \hline &&\multicolumn{3}{|c|}{Operations}&\multicolumn{3}{|c|}{Code bytes}&\multicolumn{3}{|c|}{Opcode space}\\ Machine&$r$&data&arith&total&data&arith&total&data&arith&total\\ \hline 3-addr. (opt.)& 4 & 8 & 11 & 20 & 8 & 22 & 31 & 32/256 & 64/256 & 97/256 \\ 3-addr. (real.)& 4 & 8 & 11 & 20 & 8 & 30 & 39 & 32/256 & 34/256 & 67/256 \\ \hline 2-addr. & 5 & 14 & 11 & 26 & 18 & 22 & 41 & 33/256 & 4/256 & 38/256 \\ \hline 1-addr. & 6 & 23 & 11 & 35 & 29 & 11 & 41 & 49/256 & 64/256 & 114/256 \\ \hline stack (basic)& 0 & 16 & 11 & 28 & 16 & 11 & 28 & 64/256 & 4/256 & 69/256 \\ stack (comp.)& 0 & 9 & 11 & 21 & 15 & 11 & 27 & 84/256 & 4/256 & 89/256 \\ \hline \end{tabular} }\caption{A summary of machine code properties for hypothetical 3-address, 2-address, 1-address, and stack machines, generated for a sample leaf function (cf.~\ptref{sp:cmp1.source}), assuming all of the 16 registers must be preserved by called functions ($m=n=16$). The new column labeled $r$ denotes the number of registers to be saved and restored, leading to $2r$ more operations and code bytes compared to Table~\ptref{tab:cmp1.code}. Newly-added \texttt{PUSH} and \texttt{POP} instructions for register machines also utilize $32/256$ of the opcode space. The two rows corresponding to stack machines remain unchanged.}\label{tab:cmp1.code.b} \end{table} We see that under these assumptions the stack machines are the obvious winners in terms of code density, and are in the winning group with respect to extendability. 
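The effect of the saved registers on the figures of Table~\ptref{tab:cmp1.code} is purely additive, so the rows of Table~\ptref{tab:cmp1.code.b} can be derived mechanically. The C++ sketch below is an illustrative computation only; the figures are copied from the two tables, and saving and restoring $r$ registers costs $r$ \texttt{PUSH}es plus $r$ \texttt{POP}s, i.e., $2r$ extra one-byte data operations.
\begin{verbatim}
#include <cstdio>

struct Row { const char* machine; int r, ops, bytes; };

int main() {
    const Row base[] = {   // m = 0: no registers preserved
        {"3-addr. (opt.)", 4, 12, 23}, {"3-addr. (real.)", 4, 12, 31},
        {"2-addr.", 5, 16, 31}, {"1-addr.", 6, 23, 29},
        {"stack (basic)", 0, 28, 28}, {"stack (comp.)", 0, 21, 27},
    };
    for (const Row& m : base)  // m = n = 16: all registers preserved
        std::printf("%-16s %2d ops, %2d bytes\n",
                    m.machine, m.ops + 2 * m.r, m.bytes + 2 * m.r);
    return 0;
}
\end{verbatim}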
\nxsubpoint\label{sp:cmp1.8}\emb{Case $m=8$, $n=16$: registers \texttt{r8}\dots\texttt{r15} must be preserved} The analysis of this case is similar to the previous one. The results are summarized in Table~\ptref{tab:cmp1.code.c}. \begin{table}\captionsetup{font=footnotesize}{\footnotesize \setlength{\tabcolsep}{4pt} \begin{tabular}{|l|>{\textit\bgroup}c<{\egroup}|ccc|cc>{\bfseries}c|c>{\bfseries}cc|} \hline &&\multicolumn{3}{|c|}{Operations}&\multicolumn{3}{|c|}{Code bytes}&\multicolumn{3}{|c|}{Opcode space}\\ Machine&$r$&data&arith&total&data&arith&total&data&arith&total\\ \hline 3-addr. (opt.)& 0 & 0 & 11 & 12 & 0 & 22 & 23 & 32/256 & 64/256 & 97/256 \\ 3-addr. (real.)& 0 & 0 & 11 & 12 & 0 & 30 & 31 & 32/256 & 34/256 & 67/256 \\ \hline 2-addr. & 0 & 4 & 11 & 16 & 8 & 22 & 31 & 33/256 & 4/256 & 38/256 \\ \hline 1-addr. & 1 & 13 & 11 & 25 & 19 & 11 & 31 & 49/256 & 64/256 & 114/256 \\ \hline stack (basic)& 0 & 16 & 11 & 28 & 16 & 11 & 28 & 64/256 & 4/256 & 69/256 \\ stack (comp.)& 0 & 9 & 11 & 21 & 15 & 11 & 27 & 84/256 & 4/256 & 89/256 \\ \hline \end{tabular} }\caption{A summary of machine code properties for hypothetical 3-address, 2-address, 1-address and stack machines, generated for a sample leaf function (cf.~\ptref{sp:cmp1.source}), assuming that only the last 8 of the 16 registers must be preserved by called functions ($m=8$, $n=16$). This table is similar to Table~\ptref{tab:cmp1.code.b}, but has smaller values of $r$.}\label{tab:cmp1.code.c} \end{table} Notice that the resulting table is very similar to Table~\ptref{tab:cmp1.code}, apart from the ``Opcode space'' columns and the row for the one-address machine. Therefore, the conclusions of \ptref{sp:cmp1.summary} still apply in this case, with some minor modifications. We must emphasize, however, that {\em these conclusions are valid only for leaf functions, i.e., functions that do not call other functions}. Any program aside from the very simplest will have many non-leaf functions, especially if we are minimizing resulting machine code size (which prevents inlining of functions in most cases). \nxsubpoint\label{sp:cmp1.fair} \emb{A fairer comparison using a binary code instead of a byte code} The reader may have noticed that our preceding discussion of $k$-address register machines and stack machines depended very much on our insistence that complete instructions be encoded by an integer number of bytes. If we had been allowed to use a ``bit'' or ``binary code'' instead of a byte code for encoding instructions, we could more evenly balance the opcode space used by different machines. For instance, the opcode of {\tt SUB} for a three-address machine had to be either 4-bit (good for code density, bad for opcode space) or 12-bit (very bad for code density), because the complete instruction has to occupy a multiple of eight bits (e.g., 16 or 24 bits), and $3\cdot 4=12$ of those bits have to be used for the three register names. Therefore, let us get rid of this restriction. Now that we can use any number of bits to encode an instruction, we can choose all opcodes of the same length for all the machines considered. For instance, all arithmetic instructions can have 8-bit opcodes, as the stack machine does, using $1/256$ of the opcode space each; then the three-address register machine will use 20 bits to encode each complete arithmetic instruction. 
All {\tt MOV}s, {\tt XCHG}s, {\tt PUSH}es, and {\tt POP}s on register machines can be assumed to have 4-bit opcodes, because this is what we do for the most common stack manipulation primitives on a stack machine. The results of these changes are shown in Table~\ptref{tab:cmp1.code.z}. We can see that the performance of the various machines is much more balanced, with the stack machine still the winner in terms of code density, but with the three-address machine enjoying the second place it really merits. If we were to consider the decoding speed and the possibility of parallel execution of instructions, we would have to choose the three-address machine, because it uses only 12 instructions instead of 21.
\begin{table}\captionsetup{font=footnotesize}{\footnotesize \setlength{\tabcolsep}{4pt}
\begin{tabular}{|l|>{\textit\bgroup}c<{\egroup}|ccc|cc>{\bfseries}c|c>{\bfseries}cc|}
\hline
&&\multicolumn{3}{|c|}{Operations}&\multicolumn{3}{|c|}{Code bytes}&\multicolumn{3}{|c|}{Opcode space}\\
Machine&$r$&data&arith&total&data&arith&total&data&arith&total\\
\hline
3-addr.& 0 & 0 & 11 & 12 & 0 & 27.5 & 28.5 & 64/256 & 4/256 & 69/256 \\
\hline
2-addr. & 0 & 4 & 11 & 16 & 6 & 22 & 29 & 64/256 & 4/256 & 69/256 \\
\hline
1-addr. & 1 & 13 & 11 & 25 & 16 & 16.5 & 33.5 & 64/256 & 4/256 & 69/256 \\
\hline
stack (basic)& 0 & 16 & 11 & 28 & 16 & 11 & 28 & 64/256 & 4/256 & 69/256 \\
stack (comp.)& 0 & 9 & 11 & 21 & 15 & 11 & 27 & 84/256 & 4/256 & 89/256 \\
\hline
\end{tabular}
}\caption{A summary of machine code properties for hypothetical 3-address, 2-address, 1-address and stack machines, generated for a sample leaf function (cf. \ptref{sp:cmp1.source}), assuming that only 8 of the 16 registers must be preserved by called functions ($m=8$, $n=16$). This time we can use fractions of bytes to encode instructions, so as to match the opcode space used by the different machines. All arithmetic instructions have 8-bit opcodes; all data/stack manipulation instructions have 4-bit opcodes. In other respects this table is similar to Table~\ptref{tab:cmp1.code.c}.}\label{tab:cmp1.code.z}
\end{table}
\mysubsection{Sample non-leaf function}\label{sect:sample.nonleaf} This section compares the machine code for different register machines for a sample non-leaf function. Again, we assume that either $m=0$, $m=8$, or $m=16$ registers are preserved by called functions, with $m=8$ representing the compromise made by most modern compilers and operating systems. \nxsubpoint\label{sp:cmp2.source}\emb{Sample source code for a non-leaf function} A sample source file may be obtained by replacing the built-in integer type with a custom {\em Rational\/} type, represented by a pointer to an object in memory, in our function for solving systems of two linear equations (cf.~\ptref{sp:cmp1.source}):
\begin{verbatim}
struct Rational;
typedef struct Rational *num;
extern num r_add(num, num);
extern num r_sub(num, num);
extern num r_mul(num, num);
extern num r_div(num, num);

(num, num) r_f(num a, num b, num c, num d, num e, num f) {
  num D = r_sub(r_mul(a, d), r_mul(b, c));    // a*d-b*c
  num Dx = r_sub(r_mul(e, d), r_mul(b, f));   // e*d-b*f
  num Dy = r_sub(r_mul(a, f), r_mul(e, c));   // a*f-e*c
  return (r_div(Dx, D), r_div(Dy, D));        // Dx/D, Dy/D
}
\end{verbatim}
We will ignore all questions related to allocating new objects of type \textit{Rational\/} in memory (e.g., in heap), and to preventing memory leaks.
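Nevertheless, to fix ideas, we provide a minimal C sketch of what the \textit{Rational\/} type and one of its arithmetic helpers might look like under the allocation strategy assumed in the next paragraph. The field layout, the buffer size, and the absence of normalization are illustrative assumptions only, not part of the code being compiled:
\begin{verbatim}
#include <stddef.h>

struct Rational {
    long num;                  /* numerator (illustrative layout)    */
    long den;                  /* denominator                        */
};
typedef struct Rational *num;

/* Pre-allocated buffer; objects are never freed here, but are
   assumed to be reclaimed later by an external garbage collector. */
static struct Rational pool[1024];
static size_t pool_top;

static num r_alloc(void) {
    return &pool[pool_top++];  /* allocate by advancing a pointer    */
}

num r_mul(num x, num y) {      /* one of the extern helpers above    */
    num z = r_alloc();
    z->num = x->num * y->num;  /* no normalization or overflow       */
    z->den = x->den * y->den;  /*   checks in this sketch            */
    return z;
}
\end{verbatim}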
We may assume that the called subroutines \texttt{r\_sub}, \texttt{r\_mul}, and so on allocate new objects simply by advancing some pointer in a pre-allocated buffer, and that unused objects are later freed by a garbage collector, external to the code being analysed. Rational numbers will now be represented by pointers, addresses, or references, which will fit into the registers of our hypothetical register machines or into the stack of our stack machines. If we want to use TVM as an instance of these stack machines, we should use values of type \textit{Cell} to represent such references to objects of type \textit{Rational} in memory. We assume that subroutines (or functions) are called by a special \texttt{CALL} instruction, which is encoded by three bytes, including the specification of the function to be called (e.g., the index in a ``global function table'').

\nxsubpoint\label{sp:cmp2.2addr.0}\emb{Three-address and two-address register machines, $m=0$ preserved registers} Because our sample function does not use built-in arithmetic instructions at all, compilers for our hypothetical three-address and two-address register machines will produce the same machine code. Apart from the previously introduced \texttt{PUSH r$(i)$} and \texttt{POP r$(i)$} one-byte instructions, we assume that our two- and three-address machines support the following two-byte instructions: \texttt{MOV r$(i)$,s$(j)$}, \texttt{MOV s$(j)$,r$(i)$}, and \texttt{XCHG r$(i)$,s$(j)$}, for $0\leq i,j\leq 15$. Such instructions occupy only 3/256 of the opcode space, so their addition seems quite natural. We first assume that $m=0$ (i.e., that all subroutines are free to destroy the values of all registers). In this case, our machine code for \texttt{r\_f} does not have to preserve any registers, but has to save all registers containing useful values into the stack before calling any subroutines. A size-optimizing compiler might produce the following code (the arguments $a$, \dots, $f$ arrive in registers \texttt{r0}--\texttt{r5}):
\begin{verbatim}
PUSH r4     // STACK: e
PUSH r1     // STACK: e b
PUSH r0     // .. e b a
PUSH r5     // .. e b a f
PUSH r2     // .. e b a f c
PUSH r3     // .. e b a f c d
MOV r0,r1   // b
MOV r1,r2   // c
CALL r_mul  // bc
PUSH r0     // .. e b a f c d bc
MOV r0,s4   // a
MOV r1,s1   // d
CALL r_mul  // ad
POP r1      // bc; .. e b a f c d
CALL r_sub  // D:=ad-bc
XCHG r0,s4  // b ; .. e D a f c d
MOV r1,s2   // f
CALL r_mul  // bf
POP r1      // d ; .. e D a f c
PUSH r0     // .. e D a f c bf
MOV r0,s5   // e
CALL r_mul  // ed
POP r1      // bf; .. e D a f c
CALL r_sub  // Dx:=ed-bf
XCHG r0,s4  // e ; .. Dx D a f c
POP r1      // c ; .. Dx D a f
CALL r_mul  // ec
XCHG r0,s1  // a ; .. Dx D ec f
POP r1      // f ; .. Dx D ec
CALL r_mul  // af
POP r1      // ec; .. Dx D
CALL r_sub  // Dy:=af-ec
XCHG r0,s1  // Dx; .. Dy D
MOV r1,s0   // D
CALL r_div  // x:=Dx/D
XCHG r0,s1  // Dy; .. x D
POP r1      // D ; .. x
CALL r_div  // y:=Dy/D
MOV r1,r0   // y
POP r0      // x ; ..
RET
\end{verbatim}
We have used 41 instructions: 17 one-byte (eight \texttt{PUSH}/\texttt{POP} pairs and one \texttt{RET}), 13 two-byte (\texttt{MOV} and \texttt{XCHG}; out of them 10 ``new'' ones, involving the stack), and 11 three-byte (\texttt{CALL}), for a total of $17\cdot1+13\cdot2+11\cdot3=76$ bytes.\footnote{Code produced for this function by an optimizing compiler for the x86-64 architecture with size-optimization enabled actually occupied 150 bytes, due mostly to the fact that actual instruction encodings are about twice as long as we had optimistically assumed.}

\nxsubpoint\label{sp:cmp2.2addr.8}\emb{Three-address and two-address register machines, $m=8$ preserved registers} Now we have eight registers, \texttt{r8} to \texttt{r15}, that are preserved by subroutine calls. We might keep some intermediate values there instead of pushing them into the stack. However, the penalty for doing so consists in a \texttt{PUSH}/\texttt{POP} pair for every such register that we choose to use, because our function is also required to preserve its original value. It seems that using these registers under such a penalty does not improve the density of the code, so the optimal code for three- and two-address machines for $m=8$ preserved registers is the same as that provided in \ptref{sp:cmp2.2addr.0}, with a total of 41 instructions and 76 code bytes.

\nxsubpoint\label{sp:cmp2.2addr.16}\emb{Three-address and two-address register machines, $m=16$ preserved registers} This time {\em all\/} registers must be preserved by the subroutines, excluding those used for returning the results. This means that our code must preserve the original values of \texttt{r2} to \texttt{r5}, as well as any other registers it uses for temporary values. A straightforward way of writing the code of our subroutine would be to push registers \texttt{r2} up to, say, \texttt{r8} into the stack, then perform all the operations required, using \texttt{r6}--\texttt{r8} for intermediate values, and finally restore the registers from the stack. However, this would not optimize code size. We choose another approach:
\begin{verbatim}
PUSH r0     // STACK: a
PUSH r1     // STACK: a b
MOV r0,r1   // b
MOV r1,r2   // c
CALL r_mul  // bc
PUSH r0     // .. a b bc
MOV r0,s2   // a
MOV r1,r3   // d
CALL r_mul  // ad
POP r1      // bc; .. a b
CALL r_sub  // D:=ad-bc
XCHG r0,s0  // b; .. a D
MOV r1,r5   // f
CALL r_mul  // bf
PUSH r0     // .. a D bf
MOV r0,r4   // e
MOV r1,r3   // d
CALL r_mul  // ed
POP r1      // bf; .. a D
CALL r_sub  // Dx:=ed-bf
XCHG r0,s1  // a ; .. Dx D
MOV r1,r5   // f
CALL r_mul  // af
PUSH r0     // .. Dx D af
MOV r0,r4   // e
MOV r1,r2   // c
CALL r_mul  // ec
MOV r1,r0   // ec
POP r0      // af; .. Dx D
CALL r_sub  // Dy:=af-ec
XCHG r0,s1  // Dx; .. Dy D
MOV r1,s0   // D
CALL r_div  // x:=Dx/D
XCHG r0,s1  // Dy; .. x D
POP r1      // D ; .. x
CALL r_div  // y:=Dy/D
MOV r1,r0   // y
POP r0      // x
RET
\end{verbatim}
We have used 39 instructions: 11 one-byte, 17 two-byte (among them 6 ``new'' instructions), and 11 three-byte, for a total of $11\cdot1+17\cdot2+11\cdot3=78$ bytes. Somewhat paradoxically, the resulting code is slightly longer in bytes (78 versus 76) than in the previous case (cf.~\ptref{sp:cmp2.2addr.0}), contrary to what one might have expected. This is partially due to the fact that we have assumed two-byte encodings for the ``new'' \texttt{MOV} and \texttt{XCHG} instructions involving the stack, similarly to the ``old'' instructions.
Most existing architectures (such as x86-64) use longer encodings (maybe even twice as long) for their counterparts of our ``new'' move and exchange instructions, compared to the ``usual'' register-register ones. Taking this into account, we see that we would have obtained here $78+6=84$ bytes (versus $76+10=86$ for the code in~\ptref{sp:cmp2.2addr.0}) assuming three-byte encodings of the new operations, and $78+12=90$ bytes (versus $76+20=96$) assuming four-byte encodings. This shows that, for two-address architectures without optimized encodings for register-stack move and exchange operations, $m=16$ preserved registers might result in slightly shorter code for some non-leaf functions, at the expense of leaf functions (cf.~\ptref{sp:cmp1.16} and~\ptref{sp:cmp1.8}), which would become considerably longer.

\nxsubpoint\label{sp:cmp2.1addr.0}\emb{One-address register machine, $m=0$ preserved registers} For our one-address register machine, we assume that the new register-stack instructions work through the accumulator only. Therefore, we have three new instructions, \texttt{LD s$(j)$} (equivalent to \texttt{MOV r0,s$(j)$} of two-address machines), \texttt{ST s$(j)$} (equivalent to \texttt{MOV s$(j)$,r0}), and \texttt{XCHG s$(j)$} (equivalent to \texttt{XCHG r0,s$(j)$}). To make the comparison with two-address machines more interesting, we assume one-byte encodings for these new instructions, even though this would consume $48/256=3/16$ of the opcode space. By adapting the code provided in~\ptref{sp:cmp2.2addr.0} to the one-address machine, we obtain the following:
\begin{verbatim}
PUSH r4     // STACK: e
PUSH r1     // STACK: e b
PUSH r0     // .. e b a
PUSH r5     // .. e b a f
PUSH r2     // .. e b a f c
PUSH r3     // .. e b a f c d
LD s1       // r0:=c
XCHG r1     // r0:=b, r1:=c
CALL r_mul  // bc
PUSH r0     // .. e b a f c d bc
LD s1       // d
XCHG r1     // r1:=d
LD s4       // a
CALL r_mul  // ad
POP r1      // bc; .. e b a f c d
CALL r_sub  // D:=ad-bc
XCHG s4     // b ; .. e D a f c d
XCHG r1     // r1:=b
LD s2       // f
XCHG r1     // r0:=b, r1:=f
CALL r_mul  // bf
POP r1      // d ; .. e D a f c
PUSH r0     // .. e D a f c bf
LD s5       // e
CALL r_mul  // ed
POP r1      // bf; .. e D a f c
CALL r_sub  // Dx:=ed-bf
XCHG s4     // e ; .. Dx D a f c
POP r1      // c ; .. Dx D a f
CALL r_mul  // ec
XCHG s1     // a ; .. Dx D ec f
POP r1      // f ; .. Dx D ec
CALL r_mul  // af
POP r1      // ec; .. Dx D
CALL r_sub  // Dy:=af-ec
XCHG s1     // Dx; .. Dy D
POP r1      // D ; .. Dy
PUSH r1     // .. Dy D
CALL r_div  // x:=Dx/D
XCHG s1     // Dy; .. x D
POP r1      // D ; .. x
CALL r_div  // y:=Dy/D
XCHG r1     // r1:=y
POP r0      // r0:=x ; ..
RET
\end{verbatim}
We have used 45 instructions: 34 one-byte and 11 three-byte, for a total of $34\cdot1+11\cdot3=67$ bytes. Compared to the 76 bytes used by the two- and three-address machines in \ptref{sp:cmp2.2addr.0}, we see that, again, the one-address register machine code may be denser than that of the two-address machines, at the expense of utilizing more opcode space (just as shown in~\ptref{sp:cmp1.summary}). However, this time the extra 3/16 of the opcode space was used for data manipulation instructions, which do not depend on specific arithmetic operations or user functions invoked.

\nxsubpoint\label{sp:cmp2.1addr.8}\emb{One-address register machine, $m=8$ preserved registers} As explained in~\ptref{sp:cmp2.2addr.8}, the preservation of \texttt{r8}--\texttt{r15} between subroutine calls does not improve the size of our previously written code, so the one-address machine will use for $m=8$ the same code provided in~\ptref{sp:cmp2.1addr.0}.
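To make the semantics of these accumulator-based instructions completely explicit, here is a minimal C sketch of how an interpreter for our hypothetical one-address machine might implement \texttt{LD}, \texttt{ST}, and \texttt{XCHG}. The concrete word width and the representation of the registers and stack are illustrative assumptions, not part of the machine definition:
\begin{verbatim}
#include <stdint.h>

typedef struct {
    uint64_t r[16];       /* r[0] serves as the accumulator       */
    uint64_t stack[256];  /* grows upward; sp points past the top */
    unsigned sp;
} Machine;

/* s(j) denotes the j-th stack entry counting from the top (s0). */
#define S(m, j) ((m)->stack[(m)->sp - 1 - (j)])

static void op_ld(Machine *m, unsigned j) {
    m->r[0] = S(m, j);            /* LD s(j):   r0 := s(j)        */
}
static void op_st(Machine *m, unsigned j) {
    S(m, j) = m->r[0];            /* ST s(j):   s(j) := r0        */
}
static void op_xchg(Machine *m, unsigned j) {
    uint64_t t = m->r[0];         /* XCHG s(j): swap r0 and s(j)  */
    m->r[0] = S(m, j);
    S(m, j) = t;
}
\end{verbatim}
For instance, the step \texttt{XCHG s4} from the listing above corresponds to the call \texttt{op\_xchg(\&m, 4)}.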
\nxsubpoint\label{sp:cmp2.1addr.16}\emb{One-address register machine, $m=16$ preserved registers} We simply adapt the code provided in~\ptref{sp:cmp2.2addr.16} to the one-address register machine:
\begin{verbatim}
PUSH r0     // STACK: a
PUSH r1     // STACK: a b
MOV r0,r1   // b
MOV r1,r2   // c
CALL r_mul  // bc
PUSH r0     // .. a b bc
LD s2       // a
MOV r1,r3   // d
CALL r_mul  // ad
POP r1      // bc; .. a b
CALL r_sub  // D:=ad-bc
XCHG s0     // b; .. a D
MOV r1,r5   // f
CALL r_mul  // bf
PUSH r0     // .. a D bf
MOV r0,r4   // e
MOV r1,r3   // d
CALL r_mul  // ed
POP r1      // bf; .. a D
CALL r_sub  // Dx:=ed-bf
XCHG s1     // a ; .. Dx D
MOV r1,r5   // f
CALL r_mul  // af
PUSH r0     // .. Dx D af
MOV r0,r4   // e
MOV r1,r2   // c
CALL r_mul  // ec
MOV r1,r0   // ec
POP r0      // af; .. Dx D
CALL r_sub  // Dy:=af-ec
XCHG s1     // Dx; .. Dy D
POP r1      // D ; .. Dy
PUSH r1     // .. Dy D
CALL r_div  // x:=Dx/D
XCHG s1     // Dy; .. x D
POP r1      // D ; .. x
CALL r_div  // y:=Dy/D
MOV r1,r0   // y
POP r0      // x
RET
\end{verbatim}
We have used 40 instructions: 18 one-byte, 11 two-byte, and 11 three-byte, for a total of $18\cdot1+11\cdot2+11\cdot3=73$ bytes.

\nxsubpoint\label{sp:cmp2.stack.base} \emb{Stack machine with basic stack primitives} We reuse the code provided in~\ptref{sp:cmp1.stack.base}, simply replacing the arithmetic primitives (VM instructions) with subroutine calls. The only substantive modification is the insertion of the previously optional \texttt{XCHG s1} before the third multiplication, because even an optimizing compiler cannot now know whether \texttt{CALL r\_mul} implements a commutative operation. We have also used the ``tail recursion optimization'' by replacing the final \texttt{CALL r\_div} followed by \texttt{RET} with \texttt{JMP r\_div}.
\begin{verbatim}
PUSH s5     // a b c d e f a
PUSH s3     // a b c d e f a d
CALL r_mul  // a b c d e f ad
PUSH s5     // a b c d e f ad b
PUSH s5     // a b c d e f ad b c
CALL r_mul  // a b c d e f ad bc
CALL r_sub  // a b c d e f ad-bc
XCHG s3     // a b c ad-bc e f d
PUSH s2     // a b c ad-bc e f d e
XCHG s1     // a b c ad-bc e f e d
CALL r_mul  // a b c ad-bc e f ed
XCHG s5     // a ed c ad-bc e f b
PUSH s1     // a ed c ad-bc e f b f
CALL r_mul  // a ed c ad-bc e f bf
XCHG s1,s5  // a f c ad-bc e ed bf
CALL r_sub  // a f c ad-bc e ed-bf
XCHG s3     // a f ed-bf ad-bc e c
CALL r_mul  // a f ed-bf ad-bc ec
XCHG s3     // a ec ed-bf ad-bc f
XCHG s1,s4  // ad-bc ec ed-bf a f
CALL r_mul  // D ec Dx af
XCHG s1     // D ec af Dx
XCHG s2     // D Dx af ec
CALL r_sub  // D Dx Dy
XCHG s1     // D Dy Dx
PUSH s2     // D Dy Dx D
CALL r_div  // D Dy x
XCHG s2     // x Dy D
JMP r_div   // x y
\end{verbatim}
We have used 29 instructions; assuming one-byte encodings for all stack operations, and three-byte encodings for the \texttt{CALL} and \texttt{JMP} instructions, we end up with $18\cdot1+11\cdot3=51$ bytes.
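The compound stack primitives used in the next listing may be understood as fixed short sequences of basic primitives. The following equivalences are inferred from the stack effects annotated in that listing (the \texttt{XCHG3} expansion is stated there explicitly); they are a summary sufficient for reading this sample, not a complete specification of these primitives:
\begin{verbatim}
PUSH2 s(i),s(j)       same as  PUSH s(i); PUSH s(j+1)
XCHG2 s(i),s(j)       same as  XCHG s1,s(i); XCHG s(j)
XCHG3 s(i),s(j),s(k)  same as  XCHG s2,s(i); XCHG s1,s(j); XCHG s(k)
PUXC  s(i),s(j)       same as  PUSH s(i); XCHG s1; XCHG s(j+1)
XCPU  s(i),s(j)       same as  XCHG s(i); PUSH s(j)
\end{verbatim}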
\nxsubpoint\label{sp:cmp2.stack.comp} \emb{Stack machine with compound stack primitives} We again reuse the code provided in~\ptref{sp:cmp1.stack.comp}, replacing the arithmetic primitives with subroutine calls and making the tail recursion optimization:
\begin{verbatim}
PUSH2 s5,s2     // a b c d e f a d
CALL r_mul      // a b c d e f ad
PUSH2 s5,s4     // a b c d e f ad b c
CALL r_mul      // a b c d e f ad bc
CALL r_sub      // a b c d e f ad-bc
PUXC s2,s3      // a b c ad-bc e f e d
CALL r_mul      // a b c D e f ed
XCHG3 s6,s0,s5  // (same as XCHG s2,s6; XCHG s1,s0; XCHG s0,s5)
                // e f c D a ed b
PUSH s5         // e f c D a ed b f
CALL r_mul      // e f c D a ed bf
CALL r_sub      // e f c D a ed-bf
XCHG s4         // e Dx c D a f
CALL r_mul      // e Dx c D af
XCHG2 s4,s2     // D Dx af e c
CALL r_mul      // D Dx af ec
CALL r_sub      // D Dx Dy
XCPU s1,s2      // D Dy Dx D
CALL r_div      // D Dy x
XCHG s2         // x Dy D
JMP r_div       // x y
\end{verbatim}
This code uses only 20 instructions, 9 stack-related and 11 control flow-related (\texttt{CALL} and \texttt{JMP}), for a total of $15+33=48$ bytes.
\mysubsection{Comparison of machine code for sample non-leaf function}\label{sp:cmp2.summary} Table~\ptref{tab:cmp2.code} summarizes the properties of the machine code corresponding to the same source file provided in \ptref{sp:cmp2.source}. We consider only the ``realistically'' encoded three-address machines. Three-address and two-address machines have the same code density properties, but differ in the utilization of opcode space. The one-address machine, somewhat surprisingly, managed to produce shorter code than the two-address and three-address machines, at the expense of using up more than half of all opcode space. The stack machine is the obvious winner in this code density contest, without compromising its excellent extendability (measured in the opcode space used for arithmetic and other data transformation instructions).
\def\tworow#1{\multirow{2}{*}{#1}}
\begin{table}\captionsetup{font=footnotesize}{\footnotesize \setlength{\tabcolsep}{4pt}
\begin{tabular}{|l|>{\textit\bgroup}c<{\egroup}|ccc|cc>{\bfseries}c|c>{\bfseries}cc|}
\hline
&&\multicolumn{3}{|c|}{Operations}&\multicolumn{3}{|c|}{Code bytes}&\multicolumn{3}{|c|}{Opcode space}\\
Machine&$m$&data&cont.&total&data&cont.&total&data&arith&total\\
\hline
\tworow{3-addr.}& 0,8 & 29 & 12 & 41 & 42 & 34 & 76 & \tworow{35/256} & \tworow{34/256} & \tworow{72/256} \\
& 16 & 27 & 12 & 39 & 44 & 34 & 78 &&& \\
\hline
\tworow{2-addr.}& 0,8 & 29 & 12 & 41 & 42 & 34 & 76 & \tworow{37/256} & \tworow{4/256} & \tworow{44/256} \\
& 16 & 27 & 12 & 39 & 44 & 34 & 78 &&&\\
\hline
\tworow{1-addr.}& 0,8 & 33 & 12 & 45 & 33 & 34 & 67 & \tworow{97/256} & \tworow{64/256} & \tworow{164/256} \\
& 16 & 28 & 12 & 40 & 39 & 34 & 73 &&& \\
\hline
stack (basic)& $-$ & 18 & 11 & 29 & 18 & 33 & 51 & 64/256 & 4/256 & 71/256 \\
stack (comp.)& $-$ & 9 & 11 & 20 & 15 & 33 & 48 & 84/256 & 4/256 & 91/256 \\
\hline
\end{tabular}
}\caption{A summary of machine code properties for hypothetical 3-address, 2-address, 1-address, and stack machines, generated for a sample non-leaf function (cf.
\ptref{sp:cmp2.source}), assuming $m$ of the 16 registers must be preserved by called subroutines.}\label{tab:cmp2.code}
\end{table}

\nxsubpoint\emb{Combining with results for leaf functions} It is instructive to compare this table with the results in \ptref{sp:cmp1.summary} for a sample leaf function, summarized in Table~\ptref{tab:cmp1.code} (for $m=0$ preserved registers) and the very similar Table~\ptref{tab:cmp1.code.c} (for $m=8$ preserved registers), and, if one is still interested in the case $m=16$ (which turned out to be worse than $m=8$ in almost all situations), also with Table~\ptref{tab:cmp1.code.b}. We see that the stack machine beats all register machines on non-leaf functions. As for the leaf functions, only the three-address machine with the ``optimistic'' encoding of arithmetic instructions was able to beat the stack machine, winning by 15\%, at the price of compromising its extendability. However, the same three-address machine produces almost 60\% longer code for non-leaf functions (76 bytes versus 48, cf.~Table~\ptref{tab:cmp2.code}). If a typical program consists of a mixture of leaf and non-leaf functions in approximately equal proportion, the stack machine will still win.

\nxsubpoint\emb{A fairer comparison using a binary code instead of a byte code}\label{sp:cmp2.fair} Similarly to~\ptref{sp:cmp1.fair}, we may offer a fairer comparison of the different register machines and the stack machine by using arbitrary binary codes instead of byte codes to encode instructions, and matching the opcode space used for data manipulation and arithmetic instructions by the different machines. The results of this modified comparison are summarized in Table~\ptref{tab:cmp2.code.z}. We see that the stack machines still win by a large margin, while using less opcode space for stack/data manipulation.
\begin{table}\captionsetup{font=footnotesize}{\footnotesize \setlength{\tabcolsep}{4pt}
\begin{tabular}{|l|>{\textit\bgroup}c<{\egroup}|ccc|cc>{\bfseries}c|c>{\bfseries}cc|}
\hline
&&\multicolumn{3}{|c|}{Operations}&\multicolumn{3}{|c|}{Code bytes}&\multicolumn{3}{|c|}{Opcode space}\\
Machine&$m$&data&cont.&total&data&cont.&total&data&arith&total\\
\hline
\tworow{3-addr.}& 0,8 & 29 & 12 & 41 & 35.5 & 34 & 69.5 & \tworow{110/256} & \tworow{4/256} & \tworow{117/256} \\
& 16 & 27 & 12 & 39 & 35.5 & 34 & 69.5 &&& \\
\hline
\tworow{2-addr.}& 0,8 & 29 & 12 & 41 & 35.5 & 34 & 69.5 & \tworow{110/256} & \tworow{4/256} & \tworow{117/256} \\
& 16 & 27 & 12 & 39 & 35.5 & 34 & 69.5 &&&\\
\hline
\tworow{1-addr.}& 0,8 & 33 & 12 & 45 & 33 & 34 & 67 & \tworow{112/256} & \tworow{4/256} & \tworow{119/256} \\
& 16 & 28 & 12 & 40 & 33.5 & 34 & 67.5 &&& \\
\hline
stack (basic)& $-$ & 18 & 11 & 29 & 18 & 33 & 51 & 64/256 & 4/256 & 71/256 \\
stack (comp.)& $-$ & 9 & 11 & 20 & 15 & 33 & 48 & 84/256 & 4/256 & 91/256 \\
\hline
\end{tabular}
}\caption{A summary of machine code properties for hypothetical 3-address, 2-address, 1-address, and stack machines, generated for a sample non-leaf function (cf. \ptref{sp:cmp2.source}), assuming $m$ of the 16 registers must be preserved by called subroutines. This time we use fractions of bytes to encode instructions, enabling a fairer comparison.
Otherwise, this table is similar to Table~\ptref{tab:cmp2.code}.}\label{tab:cmp2.code.z}
\end{table}

\nxsubpoint\emb{Comparison with real machines} Note that our hypothetical register machines have been considerably optimized to produce shorter code than actually existing register machines; the latter are subject to other design considerations apart from code density and extendability, such as backward compatibility, faster instruction decoding, parallel execution of neighboring instructions, ease of automatically producing optimized code by compilers, and so on. For example, the very popular two-address register architecture x86-64 produces code that is approximately twice as long as our ``ideal'' results for the two-address machines. On the other hand, our results for the stack machines are directly applicable to TVM, which has been explicitly designed with the considerations presented in this appendix in mind. Furthermore, the actual TVM code is even {\em shorter\/} (in bytes) than shown in Table~\ptref{tab:cmp2.code} because of the presence of the two-byte \texttt{CALL} instruction, allowing TVM to call up to 256 user-defined functions from the dictionary at \texttt{c3}. This means that one should subtract 10 bytes from the results for stack machines in Table~\ptref{tab:cmp2.code} if one wants to specifically consider TVM, rather than an abstract stack machine; this produces a code size of approximately 40 bytes (or shorter), almost half that of an abstract two-address or three-address machine.

\nxsubpoint\emb{Automatic generation of optimized code} An interesting point is that the stack machine code in our samples might have been generated automatically by a very simple optimizing compiler, which rearranges values near the top of the stack appropriately before invoking each primitive or calling a function, as explained in~\ptref{sp:basic.stack.suff} and~\ptref{sp:sem.comp.stk}. The only exception is the unimportant ``manual'' \texttt{XCHG3} optimization described in~\ptref{sp:cmp1.stack.comp}, which enabled us to shorten the code by one more byte. By contrast, the heavily optimized (with respect to size) code for register machines shown in \ptref{sp:cmp2.2addr.0} and \ptref{sp:cmp2.2addr.16} is unlikely to be produced automatically by an optimizing compiler. Therefore, if we had compared compiler-generated code instead of manually-generated code, the advantages of stack machines with respect to code density would have been even more striking.
\end{document}