More Magic

FOSS for digital sovereignty in the EU

2026-01-21T11:48:15Z

The European Commission has posted a "call for evidence" on open source for digital sovereignty. It seeks feedback from the public on how the EU can reduce its dependency on software from non-EU companies through Free and Open Source Software (FOSS).

This is my response, with proper formatting (the web form replies all seem to have gotten their spaces collapsed) and for future reference.

The added value of FOSS

In times when international relations are tense, it is wise to invest in digital sovereignty. For example, recently there was a controversy surrounding the International Criminal Court losing access to e-mail hosted by Microsoft, a US company, for political reasons.

A year earlier, a faulty CrowdStrike update caused the largest IT outage in history. This was an accident, but it was a good reminder of the power that rests in foreign hands. We have to consider the possibility of a foreign government pressuring a company to issue a malicious update on purpose. This update could target only specific countries.

Bringing essential infrastructure into EU hands makes sense. But why does this have to be FOSS? For instance, the CrowdStrike incident could also have happened with FOSS.

With FOSS, one does not have to trust a single company to maintain high code quality and security. Independent security researchers and programmers will be looking at this code with a fresh perspective. It is also an industry truism that FOSS code tends to be of higher quality, simply because releasing bad code is too embarrassing.

FOSS also reduces vendor lock-in. One can switch vendors and keep using the same product when for example the vendor:

goes bankrupt,
drops support for the product,
drastically increases prices,
decides on a different direction for the product than the user wants,
or gets acquired by a foreign company.

Therefore, FOSS brings sovereignty by not being at the mercy of a single vendor.

Public sector and consultancies

The EU can set a good example by starting in the public sector: government EU organisations and those of the member states, as well as semi-government organisations like universities and libraries. Closed source software still reigns supreme there. Only "established" companies may apply to tenders. These often employ professionals certified in proprietary tech. This encourages vendor lock-in. The existing dependency ensures lock-in for future projects, as compatibility is often a key requirement.

These same vendors are ruthless and have repeatedly sabotaged FOSS migrations. Microsoft was involved in multiple bribery scandals in the Netherlands, Romania, Italy and Hungary, for example. There have also been allegations of illegal deals that were never investigated, such as with the LiMux project in Munich.

How the EU can help:

Fully commit to FOSS. Set a date by which all software used by the public sector must be FOSS and running on hardware within the EU, at fully EU-owned companies. No compromises, no excuses and no easy outs - those were the bane of previous efforts.
Map out missing requirements and pay EU consultancy firms to improve FOSS where it is lacking. This will also make said software more attractive for large private organisations that provide essential services in the EU.

Concrete examples:

Many EU and member state institutes rely on American services for hosting or securing their e-mail. E-mail software is a complete commodity, for which there are good European alternatives, based on FOSS. It should be easy to switch.
Workstations for public servants typically run on Windows and use Microsoft Office. Switch these to a proven open operating system like Linux and office suite like LibreOffice.

In schools, informatics is typically taught using proprietary software. This is often cloud software. Schools do not have the expertise or funds to run their own servers. Therefore, they use the easy option that teachers are familiar with: "free" online offerings from US Big Tech. Network effects ensure deeper entrenchment. Big Tech offers steep discounts for educational licenses for these exact reasons.

Vocational schools focus on proprietary tech most used in industry. This goes beyond IT studies. For example, statistics and psychology courses use SPSS over PSPP or R. Mathematics and physics courses use MATLAB over GNU Octave. Engineering courses use AutoCAD instead of FreeCAD or LibreCAD.

A focus on the impact of tech choices in education could change the situation from the ground up. In high school, there could be a place (e.g. in civic education class) to focus on the impact of tech choices on society. This goes beyond domestic versus foreign "cloud" hosting and open versus proprietary code. For example, studies show that social media can have harmful effects on mental well-being, societal cohesion and even democracy.

How the EU can help:

Provide funding for course material, and/or create a certification programme for suitable course material to wean schools off of Big Tech software.
Start an education campaign aimed at the broader public in order to explain why closed software and the non-EU cloud are harmful. For example, it could focus on concrete issues that affect anyone like data protection, privacy and resistance against "enshittification" such as unwanted ads, price hikes and feature removal.
For the existing work force, the EU can fund training in open alternatives so that people feel confident with these alternatives. Such training should include a theoretical component to discuss the benefits of using open alternatives to ensure people are fully on board.

Existing FOSS companies and economic situation

The EU has plenty of FOSS businesses already. A handful of examples: SUSE was one of the first companies to provide FOSS server and desktop operating systems for the enterprise. Tuta and Proton Mail provide innovative secure e-mail solutions. Nextcloud offers cloud-based content collaboration tools. GitLab and Codeberg offer code hosting platforms.

These companies are innovative and profitable, but small in the global market place. Competitors from the US benefit from economies of scale. The initial US market is a large country with a single language and minimal legislation. This allows for quick domestic growth followed by global expansion. The EU market is more fragmented so it is harder to gain a foothold, requiring more up front investment to e.g. support the languages spoken in the EU.

Venture capital is also less likely to invest in the EU because of stricter legislation. Because FOSS solutions give competing companies a chance to offer the product, the returns on investment are lower than with proprietary software where a single company has a monopoly on the software.

Some EU companies have realised that this legislation is an asset: it allows for differentiation from US-based offerings. EU software can compete in the global market place on its own merits.

How the EU can help:

Promote tech sovereignty to countries across the world. Start with countries who are not formally allied to the US. This could help EU companies to expand into the global market.
Help EU companies become more well-known by organising trade shows exhibiting only FOSS EU companies.
Provide funding to organisations like the FSF Europe to run awareness campaigns about FOSS alternatives.
Perhaps controversial: heavily tax proprietary, non-EU software or provide tax breaks for FOSS EU software to level the playing field.
Even more controversially: prevent foreign-owned companies from operating data centers in the EU. Make it as hard as possible for them to offer high-speed cloud software here. These data centers are already unpopular, as they use precious water and land, and they only make foreign companies more powerful.

Conclusion

The reasons for dependency on foreign proprietary solutions are systemic. The causes are various: from inertia and ignorance to market effects and bribery. The solutions must be equally systemic: from education to policy and funding, all points must be attacked in order to succeed. This is the only way we can get rid of our dependency on non-EU software.

Trustworthy software through non-profits?

2025-12-04T05:52:32Z

I feel a change is happening in how people produce and (want to) consume software, and I want to give my two cents on the matter.

It has become more mainstream to see people critical of "Big Tech". Enshittification has become a familiar term even outside the geek community. Obnoxious "AI" features that nobody asked for get crammed into products. Software that spies on its users is awfully common. Software updates have started crippling existing features, or have deliberately stopped being available, so more new devices can be sold. Finally, it is increasingly common to get obnoxious ads shoved in your face, even in software you have already paid for.

In short, it has become hard to really trust software. It often does not act in the user's best interest. At the same time, we are entrusting software with more and more of our lives.

Thankfully, new projects are springing up which are using a different governance model. Instead of a for-profit commercial business, there is a non-profit backing them. Some examples of more or less popular projects:

Signal and Matrix for instant messaging,
Bluesky and Mastodon for social media,
Mozilla for web browser, e-mail client and more,
Proton for e-mail hosting, VPN and more,
Codeberg for code repository hosting,
Wikipedia and Internet Archive for all the world's knowledge.

Some of these are older projects, but there seems to be something in the air that is causing more projects to move to non-profit governance, and for people to choose these.

As I was preparing this article, I saw an announcement that ghostty now has a non-profit organisation behind it. At the same time, I see more reports from developers leaving GitHub for Codeberg, and in the mainstream more and more people are switching to Signal.

Why free and open source software is not enough

From a user perspective, free software and open source software (FOSS) has advantages over proprietary software. For instance, you can study the code to see what it does. This alone can deter manufacturers from putting in user-hostile features. You can also remove or change what you dislike or add features you would like to see. If you are unable to code, you can usually find someone else to do it for you.

Unfortunately, this is not enough. Simply having the ability to see and change the code does not help when the program is a web service. Network effects will ensure that the "main instance" is the only viable place to use this; you have all your data there, and all your friends are there. And hosting the software yourself is hard for non-technical people. Even highly technical people often find it too much of a hassle.

Also, code can be very complex! Often, only the team behind it can realistically further develop it. This means you can run it yourself, but still are dependent on the manufacturer for the direction of the product. This is how you get, for example, AI features in GitLab and ads in Ubuntu Linux. One can technically remove or disable those features, but it is hard to keep such a modified version (a fork) up to date with the manufacturer's more desirable changes.

The reason is that the companies creating these products are still motivated by profit and increasing shareholder value. As long as the product still provides (enough) value, users will put up with misfeatures. The (perceived) cost of switching is too high.

Non-profit is not a panacea

Let us say a non-profit is behind the software. It is available under a 100% FOSS license. Then there are still ways things can go downhill. I think this happens most commonly if the funding is not in order.

For example, Mozilla is often criticised for receiving funding from Google. In return, it uses Google as the default search. To make it less dependent on Google, Mozilla acquired Pocket and integrated it into the browser. It also added ads on the home screen. Both of these actions have also been criticised. I do not want to pick on Mozilla (I use Firefox every day). It has clearly been struggling to make ends meet in a way that is consistent with its goals and values.

I think the biggest risk factor is (ironically) if the non-profit does not have a sustainable business model and has to rely on funding from other groups. This can compromise the vision, like in Mozilla's case. For web software, the obvious business model is a SaaS platform that offers the software. This allows the non-profit to make money from the convenience of not having to administer it yourself.

There is another, probably even better, way to ensure the non-profit will make good decisions. If the organisation is democratically led and open for anyone to become a member like Codeberg e.V. is, it can be steered by the very users it serves. This means there is no top-down leadership that may make questionable decisions. Many thanks to Technomancy for pointing this out.

What about volunteer driven efforts?

Ah, good old volunteer driven FOSS. Personally, I prefer using such software in general. There is no profit motive in sight and the developers are just scratching their own itch. Nobody is focused on growth and attracting more customers. Instead, the software does only what it has to do with a minimum of fuss.

I love that aspect, but it is also a problem. Developers often do not care about ease of use for beginners. Software like this is often a power tool for power users, with lots of sharp edges. Perfect for developers, not so much for the general public.

More importantly, volunteer driven FOSS has other limits. Developer burn-out happens more than we would like to admit, and for-profit companies tend to strip-mine the commons.

There are some solutions available for volunteer-driven projects. For example Clojurists Together, thanks.dev, the Apache Software Foundation, the Software Freedom Conservancy and NLnet all financially support volunteer-driven projects. But it is not easy to apply to these, and volunteer-driven projects are often simply not organised in a way to receive money.

Conclusion

With a non-profit organisation employing the maintainers of a project, there is more guarantee of continuity. It also can ensure that the "boring" but important work gets done. Good interface design, documentation, customer support. All that good stuff. If there are paying users, I expect that you get some of the benefits of corporate-driven software and less of the drawbacks.

That is why I believe these types of projects will be the go-to source for sustainable, trustworthy software for end-users. I think it is important to increase awareness about such projects. They offer alternatives to Big Tech software that are palatable to non-technical users.

Let's CRUNCH!

2024-12-16T12:44:16Z

NOTE: This is another guest post by Felix Winkelmann, the founder and one of the current maintainers of CHICKEN Scheme.

Introduction

Hi! This post is about a new project of mine, called "CRUNCH", a compiler for a statically typed subset of the programming language Scheme, specifically, the R7RS (small) standard.

The compiler runs on top of the CHICKEN Scheme system and produces portable C99 that can then be compiled and executed on any platform that has a decent C compiler.

So, why another Scheme implementation, considering that there already exists such a large number of interpreters and compilers for this language? What motivated me was the emergence of the PreScheme restoration project, a modernisation of "PreScheme", a statically typed compiler for Scheme that is used in the Scheme48 implementation. The original PreScheme was embedded into S48 and was used to generate the virtual machine that is targeted by the latter system. Andrew Whatson courageously started a project to port PreScheme to modern R7RS Scheme systems (PreScheme is written in Scheme, of course) with the intention of extending it and keeping the quite sophisticated and interesting compiler alive.

The announcement of the project and some of the reactions that it spawned made me realize that there seems to be a genuine demand for a statically typed high-performance compiler for Scheme (even if just for a subset) that would close a gap in the spectrum of Scheme systems currently available.

There are compilers and interpreters for all sorts of platforms, ranging from tiny, minimal interpreters to state-of-the-art compilers, targeting about every imaginable computer system. But most Schemes need relatively complex runtime systems, have numerous dependencies, or have slow performance, which is simply due to the powerful semantics of the language: dynamic typing, automatic memory management (garbage collection), first class continuations, etc. which all have a cost in terms of overhead.

What is needed is a small, portable compiler that generates more or less "natural" C code with minimal dependencies and runtime system that supports at least the basic constructs of the language and that puts an emphasis on producing efficient code, even if some of the more powerful features of Scheme are not available. Such a system would be perfect for writing games, virtual machines, or performance-sensitive libraries for other programs where you still want to use a high-level language to master the task of implementing complex algorithms, while keeping as close to C/C++ as possible. Another use is as a tool to write bare-metal code for embedded systems, device drivers and kernels for operating systems.

There are some high-performance compilers like Bigloo or Stalin. But the former still needs a non-trivial runtime-system and the latter is brittle and not actively maintained. Also, one doesn't necessarily need support for the full Scheme language and if one is willing to drop the requirement of dynamic typing, a lot of performance can be gained while still having a relatively simple compiler implementation. Even without continuations, dynamic typing, the full numeric tower and general tail call optimization, the powerful metaprogramming facilities of Scheme and the clear and simple syntax make it a useful notation for many uses that require a high level of abstraction. Using type inference mostly avoids having to annotate a source program with type information and thus allows creating code which still is to a large part standard Scheme code that can (with a little care) be tested on a normal Scheme system before compiling it to more efficient native code.

History

There was a previous extension for CHICKEN, also called "crunch", that compiled to C++, used a somewhat improvised type-inferencing algorithm and was severely restricted. It was used to allow embedding statically typed code into normal CHICKEN Scheme programs. The new CRUNCH picks up this specific way of use, but is a complete reimplementation that targets C99, has a more sophisticated type system, offers some powerful optimizations and has the option to create standalone programs or separately compilable C modules.

Installation

CRUNCH is only available for the new major release of CHICKEN (version 6). You will need to build and install a development snapshot containing the sources of this release, which is still unofficial and under development:

 $ wget https://code.call-cc.org/dev-snapshots/2024/12/09/chicken-6.0.0pre1.tar.gz
 $ tar xfz chicken-6.0.0pre1.tar.gz
 $ cd chicken-6.0.0pre1
 $ ./configure --prefix <install location>
 $ make
 $ make install
 $ <install location>/bin/chicken-install -test crunch

CHICKEN has minimal dependencies (a C compiler, sh(1) and GNU make(1)), so don't hesitate to give it a try.

Basic Operation and Usage

CRUNCH can be used as a batch compiler, translating Scheme to standalone C programs or can be used at compile time for embedded fragments of Scheme code, automatically creating the necessary glue to use the compiled code from CHICKEN Scheme. The compiler itself is also exposed as a library function, making various scenarios possible where you want to programmatically convert Scheme into native code.

There are four modes of using CRUNCH:

1. Embedding:

;; embed compiled code into Scheme (called using the foreign function interface):
(import crunch)
(crunch
  (define (stuff arg) ...) )
(stuff 123)

2. Standalone:

 $ cat hello.scm
 (define (main) (display "Hello world\n"))
 $ chicken-crunch hello.scm -o hello.c
 $ cc hello.c $(chicken-crunch -cflags -libs)
 $ ./a.out

3. Wrap compiled code in Scheme stubs to use it from CHICKEN:

 $ cat fast-stuff.scm
 (module fast-stuff (do-something)
   (import (scheme base))
   (define (do-something) ...))

 $ cat use-fast-stuff.scm
 (import fast-stuff)
 (do-something)

 $ chicken-crunch -emit-wrappers wrap.scm -J fast-stuff.scm -o fast-stuff.c
 $ csc -s wrap.scm fast-stuff.c -o wrap.so
 $ csc use-fast-stuff.scm -o a.out

4. Using CRUNCH as a library:

#;1> (import (crunch compiler))
#;2> (crunch
       '(begin (define (main) (display "Hello world\n")))
       '(output-file "out.c") )

Module system and integration into CHICKEN

CRUNCH uses the module system and syntactic metaprogramming facilities of CHICKEN. Syntax defined in CHICKEN modules can be used in CRUNCH code and vice versa. CRUNCHed code can produce "import libraries", like in CHICKEN to provide separate compilation of modules.

Modules compiled by CRUNCH may only export procedures and a standalone program is expected to export a procedure called main. This simplifies interfacing to C and makes callbacks from C into Scheme straightforward.

As in PreScheme, toplevel code is evaluated at compile time. Most assigned values can be accessed in compiled code.

;; build a table of sine values at compile time
(define sines
  (list->f64vector
    (list-tabulate 360
      (lambda (n) (sin (/ (* n π) 180))) ) ) )

Restrictions

A number of significant restrictions apply to Scheme code compiled with CRUNCH:

No support for multiple values
No support for first class continuations
Tail calls can only be optimized into loops for local procedure calls or calls that can be inlined
Closures (procedures capturing free variables) are not supported
Procedures can have no "rest" argument
Imported global variables can not be modified
Currently only 2-argument arithmetic and comparison operators are supported
It must be possible to eliminate all free variables via inlining and lambda-lifting

This list looks quite severe, but it should be noted that a large amount of idiomatic Scheme code can still be compiled that way. Also, CRUNCH does not attempt to be a perfect replacement for a traditional Scheme system; it merely tries to provide an efficient programming system for domains where performance and interoperability with native code are of high importance.

Datums are restricted to the following types:

basic types: integer, float, complex, boolean, char, pointer
procedure types
strings
vectors of any of the basic types, and vectors for specific numeric types
structs and unions

Note the absence of pairs, lists and symbols. Structures and unions are representations of the equivalent C object and can be passed by value or by pointer.

The Runtime System

The runtime system required to run compiled code is minimal and contained in a single C header file. CRUNCH supports UNICODE and the code for UNICODE-aware case conversions and some other non-trivial operations is provided in a separate C file. UNICODE support is optional and can be disabled.

No garbage collector is needed. Non-atomic data like strings and vectors are managed using reference counting without any precautions taken to avoid circular data, which is something that is unlikely to happen by accident with the data types currently supported.
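To make the idea of reference counting concrete, here is a minimal C sketch. It is not CRUNCH's actual runtime code: the type `rc_f64vector` and the functions `rc_f64vector_new`, `rc_retain` and `rc_release` are hypothetical names chosen for illustration. Each object carries a counter that is incremented when a new reference is taken and decremented when one is dropped, and the storage is freed as soon as the counter reaches zero.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted vector of doubles (not CRUNCH's real layout). */
typedef struct {
    long refcount;   /* number of live references to this vector */
    size_t length;   /* number of elements */
    double *items;   /* heap-allocated element storage */
} rc_f64vector;

/* Allocate a zero-filled vector holding one reference; NULL on failure. */
static rc_f64vector *rc_f64vector_new(size_t length) {
    rc_f64vector *v = malloc(sizeof *v);
    if (v == NULL) return NULL;
    v->items = calloc(length ? length : 1, sizeof *v->items);
    if (v->items == NULL) { free(v); return NULL; }
    v->refcount = 1;
    v->length = length;
    return v;
}

/* Take an additional reference, e.g. when the vector is stored somewhere. */
static rc_f64vector *rc_retain(rc_f64vector *v) {
    assert(v->refcount > 0);
    v->refcount++;
    return v;
}

/* Drop a reference; frees the storage when the last reference goes away.
   Returns 1 if the vector was freed, 0 otherwise. */
static int rc_release(rc_f64vector *v) {
    assert(v->refcount > 0);
    if (--v->refcount > 0) return 0;
    free(v->items);
    free(v);
    return 1;
}
```

A cycle (an object that directly or indirectly refers to itself) would never reach a count of zero under this scheme, which is why the absence of pairs and lists makes leaks of this kind unlikely.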

Optimizations

CRUNCH provides a small number of powerful optimizations to ensure decent performance and to allow more or less idiomatic Scheme code to be compiled. The type system is not fully polymorphic, but allows overloading of many standard procedures to handle generic operations that accept a number of different argument types. Additionally, a "monomorphization" optimization is provided that clones user procedures that are called with different argument types. Standard procedures that accept procedures are often expanded inline, which further increases the opportunities for inlining of procedure calls. This reduces the chance of having "free" variables, which the compiler must be able to eliminate as it doesn't support closures. Aggressively moving lexically bound variables to toplevel (making them globals) can further reduce the number of free variables.

Procedures that are called only once are inlined at the call site ("integrated"). Fully general inlining is not supported; we leave that to the C compiler. Integrated procedures that call themselves recursively in tail position are turned into loops.

A crucial transformation to eliminate free variables is "lambda lifting", which passes free variables as extra arguments to procedures that do not escape and whose argument list can be modified by the compiler without interfering with user code:

(let ((x 123))
  ; ... more code ...
  (define (foo y) (+ x y))
  ; ... more code ...
  (foo 99) )

  ~>

(let ((x 123))
  ; ... more code ...
  (define (foo y x) (+ x y))
  ; ... more code ...
  (foo 99 x) )
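The reason lambda lifting matters for a C target is that the lifted procedure needs no closure record at all. The following is a hand-written sketch of roughly what the lifted example corresponds to in C; it is not actual CRUNCH output, and the name `lifted_example` is made up for illustration.

```c
/* Lifted version of (define (foo y) (+ x y)): the free variable x
   has become an ordinary second parameter, so no closure is needed. */
static long foo(long y, long x) {
    return x + y;
}

/* Stands in for the enclosing (let ((x 123)) ... (foo 99 x)) body. */
static long lifted_example(void) {
    long x = 123;
    return foo(99, x);   /* the call site now passes x explicitly */
}
```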

Monomorphization duplicates procedures called with arguments of (potentially) different types:

(define (inc x) (+ x 1))
(foo (inc 123) (inc 99.0))

~>

;; a "variant" represents several instantiations of the original procedure
(define inc
  (%variant
    (lambda (x'int) (+ x'int 1))        ; "+" will be specialized to integer
    (lambda (x'float) (+ x'float 1))))  ; ... and here to float
(foo (inc'int 123) (inc'float 99.0))
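In C terms, the two instantiations of a variant can be pictured as two separate functions, one per argument type. This is only an illustrative sketch with assumed names (`inc_int`, `inc_float`) and assumed C types (`long`, `double`); the identifiers and types CRUNCH really emits may differ.

```c
/* inc'int: "+" specialized to integer arithmetic */
static long inc_int(long x) {
    return x + 1;
}

/* inc'float: "+" specialized to floating-point arithmetic */
static double inc_float(double x) {
    return x + 1.0;
}
```

Each call site then refers directly to the instance matching its argument type, so no type dispatch happens at run time.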

Certain higher-order primitives are expanded inline:

(vector-for-each
  (lambda (x) ...)
  v)

~>   ; (roughly)

(let loop ((i 0))
  (unless (>= i (vector-length v))
    (let ((x (vector-ref v i))) ... (loop (+ i 1))) ) )
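Once the higher-order call has been expanded into a named-let loop, a translation to C is straightforward. As a hedged sketch (the function `sum_f64` and the summing lambda body are assumptions made up for this example, not CRUNCH output), expanding `(vector-for-each (lambda (x) (set! sum (+ sum x))) v)` could end up looking roughly like this:

```c
#include <stddef.h>

/* The lambda body sits directly inside the loop, so there is no
   procedure call per element and no free variable left to capture. */
static double sum_f64(const double *v, size_t length) {
    double sum = 0.0;
    for (size_t i = 0; i < length; i++) {
        double x = v[i];   /* (vector-ref v i) */
        sum += x;          /* inlined lambda body */
    }
    return sum;
}
```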

A final pass removes unused variables and procedure arguments and code that has no side effects and has unused results.

Together these transformations can get you far enough to write relatively complex Scheme programs while ensuring the generated C code is tight, and with a little effort, easy to understand (in case you need to verify the translation) and (hopefully) does what it is intended to do.

Performance

Code compiled with CRUNCH should be equivalent to a straightforward translation of the Scheme code to C. Scalar values are not tagged nor boxed and are represented with the most fitting underlying C type. There is no extra overhead introduced by the translation, with the following exceptions:

Vector- and string accesses perform bound checks (these can be disabled)
Using vectors and strings will add some reference counting overhead
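A bounds check that can be compiled out might be sketched as follows. The macro `CHECK_INDEX`, the switch `NO_BOUNDS_CHECKS` and the accessor `f64vector_ref` are hypothetical names for illustration only; the manual documents how CRUNCH actually disables the checks.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical switch: compiling with -DNO_BOUNDS_CHECKS drops the checks. */
#ifndef NO_BOUNDS_CHECKS
#define CHECK_INDEX(i, len)                           \
    do {                                              \
        if ((i) >= (len)) {                           \
            fprintf(stderr, "index out of range\n");  \
            abort();                                  \
        }                                             \
    } while (0)
#else
#define CHECK_INDEX(i, len) ((void)0)
#endif

/* Checked element access; with the checks disabled this is a plain load. */
static double f64vector_ref(const double *items, size_t length, size_t i) {
    CHECK_INDEX(i, length);
    return items[i];
}
```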

If you study the generated code you will encounter many useless variable assignments and some unused values in statement position. These will be removed by the C compiler. Unexported procedures are also declared static, so they can very often be inlined by the C compiler, leading to little or no overhead.

The Debugger

For analyzing type errors, a static debugger with a graphical user interface is included. When the -debug option is given, a Tcl/Tk script is invoked in a subprocess that shows the internal node tree and can be used to examine the transformed code and the types of sub-expressions, together with the corresponding source code line (if available). Should the compilation abort with an error, the shown node-tree is the state of the program at the point where the error occurred.

Differences to PreScheme

CRUNCH is inspired by and very similar to PreScheme, but has a number of noteworthy differences. CRUNCH tries to be as conformant to R7RS (small) as possible and handles UNICODE characters and strings. It also is tightly integrated into CHICKEN, allowing nearly seamless embedding of high-performance code sections. Macros and top-level code can take full advantage of the full CHICKEN Scheme language and its ecosystem of extension libraries.

PreScheme supports multiple values, while CRUNCH currently does not.

PreScheme uses explicit allocation and deallocation for compound data objects, while CRUNCH utilizes reference counting, removing the need to manually clean up resources.

I'm not too familiar with the PreScheme compiler itself, but I assume it provides more sophisticated optimizations, as it does convert to Static Single Assignment form (SSA), so it is to be expected that the effort to optimise the code is quite high. On the other hand, modern C compilers already provide a multitude of powerful optimizations, so it is not clear how many advantages lower-level optimizations will bring.

Future Plans

There is a lot of room for improvement. Support for multiple values would be nice, and not too hard to implement, but will need to follow a convention that should not be too awkward to use on the C side. Also, the support for optional arguments is currently quite weak; the ability to specify default values is something that needs to be added.

Primitives for many POSIX libc system calls and library functions should be straightforward to use in CRUNCH code, at least the operations provided by the (chicken file posix) module.

What would be particularly nice would be if the compiler detects state machines - mutually recursive procedures that call each other in tail position.
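To illustrate what such a state-machine transformation could produce, take the classic pair of mutually recursive procedures where `even?` tail-calls `odd?` on `n - 1` and vice versa. The C sketch below is an assumption about one possible target shape, not something CRUNCH currently generates; `is_even` and the `state` enum are made-up names. Each procedure becomes a state, and each tail call becomes an update of the arguments plus a state change, so the stack does not grow.

```c
#include <stdbool.h>

/* Each mutually tail-recursive procedure becomes one state. */
enum state { STATE_EVEN, STATE_ODD };

static bool is_even(unsigned long n) {
    enum state s = STATE_EVEN;
    for (;;) {
        switch (s) {
        case STATE_EVEN:
            if (n == 0) return true;
            n -= 1;            /* tail call to odd? with n - 1 ... */
            s = STATE_ODD;     /* ... becomes a state change */
            break;
        case STATE_ODD:
            if (n == 0) return false;
            n -= 1;            /* tail call to even? with n - 1 */
            s = STATE_EVEN;
            break;
        }
    }
}
```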

Other targets are possible, like GPUs. I don't know anything about that, so if you are interested and think you can contribute, please don't hesitate to contact me.

Disclaimer

CRUNCH is currently alpha-software. It certainly contains numerous bugs and shortcomings that will hopefully be found and corrected as the compiler is used. If you are interested, I invite you to give it a try. Contact me directly or join the #chicken IRC channel on Libera.chat, if you have questions, want to report bugs, if you would like to suggest improvements or if you just want to know more about it.

All feedback is very welcome!

Links

The CRUNCH manual can be found at the CHICKEN wiki, the source code repository is here.

What to expect from CHICKEN 6

2024-11-18T14:27:11Z

NOTE: This is a guest post by Felix Winkelmann, the founder and one of the current maintainers of CHICKEN Scheme.

Introduction

This article is about the changes in the next major version of CHICKEN Scheme.

The current version is 5.4.0. The next major version, 6.0.0, is mostly implemented. Currently we are adding final touches before preparing the next release. If you are already familiar with CHICKEN, this article will help you get a more detailed view of what has been done. If you are not into CHICKEN or even Scheme, this article may still give you an insight into what's involved in the development and maintenance of a programming language project. You may also find it interesting to know how we address issues like portability and backwards-compatibility. There are also some juicy details on implementation techniques.

No previous knowledge of CHICKEN Scheme is required, but it will help if you are familiar with common terms used in programming language design and Lisp-y languages.

Versions

CHICKEN uses the usual major.minor.patch versioning scheme, where we bump major versions only for significant changes that break compatibility with older versions. CHICKEN has a relatively large community for a non-mainstream language and has lots of contributed extension libraries called "eggs". Breaking backwards compatibility in non-major versions adds implementation effort for users who just want to keep their CHICKEN up to date; any eggs they're using should also keep working.

The process of avoiding breakage can sometimes become challenging. We may, for example, provide two different implementations for one and the same thing, to allow adoption of the new feature while keeping old code working. Typically this happens when the new feature is safer, easier to use, faster or more standards compliant than the old. We also remove features in stages. First we mark a feature as deprecated, and delete it later. This allows users to upgrade their code gradually.

On major version changes, we create a new mirror of the egg "repository", where most of the extensions are stored. Externally contributed eggs can choose when and how to publish an extension for specific versions. This is described in a previous post on this blog.

Major version changes also usually bump the "binary version". This is a suffix to the shared runtime library name to avoid intermixing tools and programs written for one version with the libraries for another.

We started CHICKEN major version 6 to introduce new features that were dearly required but were impossible to add without changing external interfaces, in particular what modules exist and what they export. Specifically, we needed full support for UNICODE strings and compliance with the R7RS (small) language standard.

Both features were already available in a rudimentary manner as external libraries. They were not fully integrated, a fact that showed in various places and could lead to subtle bugs as core and extension code assumed differences in string representation and the behaviour of standard procedures. The same approach was used for integrating the "full numeric tower" (the full set of numeric types, including rationals, big integers and complex numbers), which was formerly a library and was moved into the core system with the 5.0.0 release.

We'll now describe the most noteworthy changes made during this major version transition.

UNICODE

A major shortcoming of CHICKEN up to now was that it didn't support full UNICODE strings. All string data was assumed to consist of 8-bit characters in the ASCII or Latin-1 range. Internationalisation of source code and applications is unavoidable and libraries and operating systems have moved towards UTF-8 as a standard character encoding. So we saw no choice but to find a reasonably efficient way of using the full range of possible characters in all parts of the system.

There is an open-ended space of design choices regarding how to efficiently implement UNICODE strings. The Sword of Damocles of unanticipated performance degradation constantly looms over the head of the language implementer, so finding the ideal solution in terms of memory use, speed and simplicity is not easy. Some implementations go to great lengths by providing complex storage schemes or multiple representations depending on a string's content. In the end, we think a simple representation serves everybody better by being easier to understand and maintain.

It is also not entirely certain that (say) a dynamic multi-representation approach pays off sufficiently. Too many factors come into play, like string usage patterns in applications, memory management and implementation overhead. Data exchange at the boundaries of operating system and foreign code also has to be taken into account. You want to avoid unnecessary conversions and copying, especially because CHICKEN is geared towards easy interfacing to external libraries and OS functionality.

We decided to use a UTF-8 representation internally. Many operating systems and applications already support UTF-8, which removes the need for costly conversions. UTF-8 is also backwards compatible with ASCII and keeps storage requirements at a minimum.

Since UTF-8 is a multi-byte encoding, character lookup in a string is of linear complexity. Characters have to be scanned when addressing a specific index in a string. To avoid repeatedly scanning the same section of a string when iterating over it, we use a simple cache slot in the string that maps byte-offsets to code point indices and holds the offset/index pair of the last access.

String representation

A quick recap regarding how strings are currently stored in CHICKEN is in order. If you are interested, there's also an older post on this blog with a more detailed explanation of the data representation used by CHICKEN.

Characters are stored as immediates with the special bit pattern 1110 (type code 111):

This gives us enough bits (even on 32-bit platforms) to hold all code points in the Basic Multilingual Plane (code points in the range 32 - 0x1fffff). CHICKEN 5 already supported characters in that range, but strings were still restricted to having 8-bit elements.

Up to CHICKEN 5 a string was represented by a single byteblock:

In CHICKEN 6 we add an indirection:

As you can see, the string itself is a normal block containing a pointer to a byteblock holding the UTF-8 byte sequence and a trailing zero byte. The "Count" slot is a fixnum with the number of code points in the string (the length). "Offset" and "Index" act as a cache for the byte-offset and code point-index of the last access. They are reset to zero if the string changes its length due to a character's width changing at an offset before the cached point.
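The effect of the offset/index cache can be sketched in C. This is only an illustration under simplifying assumptions (well-formed UTF-8, a made-up `u8str` struct and made-up function names); it is not the code used in the CHICKEN runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical string layout mirroring the slots described above. */
typedef struct {
  const unsigned char *bytes; /* UTF-8 data, zero-terminated */
  size_t count;               /* number of code points */
  size_t offset;              /* cached byte offset ... */
  size_t index;               /* ... of this code point index */
} u8str;

/* Byte length of the sequence starting with lead byte b (valid UTF-8 assumed). */
size_t u8_width(unsigned char b)
{
  if (b < 0x80) return 1;
  if ((b & 0xE0) == 0xC0) return 2;
  if ((b & 0xF0) == 0xE0) return 3;
  return 4;
}

/* Return the byte offset of code point i and remember it in the cache. */
size_t u8_offset_of(u8str *s, size_t i)
{
  size_t off = 0, idx = 0;
  assert(i < s->count);
  if (i >= s->index) {        /* resume scanning from the last access */
    off = s->offset;
    idx = s->index;
  }
  while (idx < i) {
    off += u8_width(s->bytes[off]);
    idx++;
  }
  s->offset = off;
  s->index = idx;
  return off;
}
```

Iterating over a string from front to back then scans each character only once, because every lookup resumes from the previous one. A lookup before the cached index falls back to scanning from the start in this sketch.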

An indirection is necessary regardless of the other details of the representation: as UTF-8 is a multi-byte encoding, destructively modifying characters in a string may grow or shrink the total byte-sequence. The string must still keep its identity and be indistinguishable from the original string (in terms of the primitive eq? procedure).

One obvious optimisation applies to strings where the length ("Count") equals the length of the character data (encoded in the header of the byteblock) minus one: every character then occupies exactly one byte, so we can simply index by byte offset.

Alternative representations are possible. I believe the following is used in Chibi Scheme: keep a list of "chunks", i.e. multiple byte-offset/code point-index pairs per string. You can traverse this list to obtain the first chunk of string data containing the index you want to access.

In CHICKEN this would probably be represented thus:

This way we can find the offset for the index right at or before the location addressed. This reduces the amount of scanning to the part inside a chunk pointed to by the offset/index pair.

Such an approach of maintaining the offset/index list is more complex and keeping it up to date is more effort than the simple offset/index cache. Should the performance impact of the simpler representation turn out to be too large, the alternative approach may still be an option to try.

This leads me to the subject of benchmarks. We do have a benchmark suite holding a number of micro-benchmarks and CHICKEN 6 has been compared to previous versions using this suite. Nevertheless, the results of micro-benchmarking have to be taken with a grain of salt.

The behaviour of real-world applications may differ considerably depending on the memory consumption, memory address patterns and the type and amount of data processed. Reasoning about performance is especially difficult due to the complex caching mechanisms of contemporary hardware and operating systems. Therefore we will try to use the simplest approach that has a good chance of providing sufficient performance while keeping the implementation effort and complexity at a minimum.

Dealing with the outside world

There's another point we have to address when considering strings that are received from external sources like OS APIs, from files and from the foreign function interface. What to do when encountering invalid UTF-8 byte sequences, for example when reading file or directory names? Signal an error? Convert to some other representation or mark the location in the byte sequence using some special pattern?

We decided to basically ignore the problem. Strings are accepted whether valid or not. Only when decoding a string do we distinguish between valid and invalid sequences. When indexing and scanning a string's byte sequence, we return the invalid byte as a "surrogate pair end" code point that has the value 0xDCxx. This approach allows us to store the byte in the lower 8 bits of the code point. When inserting a character in a string that has such a value, we simply copy the byte. As I understand it, this is the approach used by the "surrogateescape" error handler in PEP 383. However, we do this masking every time we decode an unexpected byte (and do the inverse when encoding).

Say we received a string from an external source containing the following byte sequence:

This is invalid UTF-8. Extracting character by character with the described method would yield the following code point values:

Inserting such a surrogate pair end code point will inject the value of the lower byte. For example, calling (string-set! str #\xdcfe), where str is the above invalid UTF-8 encoded string, would yield the following byte sequence:

The advantage is that we can simply accept and pass strings containing invalid sequences as if nothing happened. We don't need to do any conversions and checks. We can also access items in the character sequence and store them back without having to worry about the validity of the sequence with respect to the UTF-8 encoding. The process is "lossless".
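To make the escaping scheme concrete, here is a rough C sketch of a decoder and encoder that follow the 0xDCxx convention described above. The function names and the exact validity checks are assumptions made for this example; the actual CHICKEN implementation may treat some edge cases differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from p[0..len). The number of bytes consumed is
   stored in *used. A byte that does not start a well-formed sequence is
   returned as 0xDC00 | byte and consumes exactly one byte. */
uint32_t decode_lossless(const unsigned char *p, size_t len, size_t *used)
{
  unsigned char b = p[0];
  size_t n, k;
  uint32_t cp, min;

  assert(len > 0);
  *used = 1;
  if (b < 0x80) return b;
  if ((b & 0xE0) == 0xC0)      { n = 2; cp = b & 0x1F; min = 0x80; }
  else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; min = 0x800; }
  else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; min = 0x10000; }
  else return 0xDC00u | b;             /* stray continuation or invalid lead */
  if (n > len) return 0xDC00u | b;     /* truncated sequence */
  for (k = 1; k < n; k++) {
    if ((p[k] & 0xC0) != 0x80) return 0xDC00u | b;
    cp = (cp << 6) | (p[k] & 0x3F);
  }
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0xDC00u | b;                /* overlong, out of range or surrogate */
  *used = n;
  return cp;
}

/* Encode cp into out (room for 4 bytes); returns the number of bytes written.
   Escaped bytes (0xDC80..0xDCFF) are written back as the original byte. */
size_t encode_lossless(uint32_t cp, unsigned char *out)
{
  if (cp >= 0xDC80 && cp <= 0xDCFF) { out[0] = (unsigned char)(cp & 0xFF); return 1; }
  if (cp < 0x80) { out[0] = (unsigned char)cp; return 1; }
  if (cp < 0x800) {
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  out[0] = (unsigned char)(0xF0 | (cp >> 18));
  out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char)(0x80 | (cp & 0x3F));
  return 4;
}
```

Decoding a byte sequence and re-encoding the resulting code points reproduces the original bytes, whether or not they were valid UTF-8. This is the "lossless" property mentioned above.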

The disadvantage is that we have to perform an additional check when encoding or decoding UTF-8 sequences. Again we decided to reduce copying and transcoding overhead and rather have complete transparency regardless of the source of the string. Furthermore no validation is performed for incoming strings. The R7RS standard procedure utf8->string, which converts bytevectors into strings, does validate its input, though, to at least ensure that strings created from binary data are always correctly encoded.

A final issue is UNICODE handling on Windows platforms. There, the OS API entry-points that are UNICODE aware use UTF-16 for strings like filenames or the values of environment variables. On Windows there is no choice but to encode and decode from UTF-8 to UTF-16 and back when interfacing with the OS.

Port encodings

When accessing files, we still want to be able to read and write data in an 8-bit encoding or in binary. We may also want to support additional encodings, even though these are currently not provided by default. So we'll need to associate an "encoding" property with "ports". Ports are the Scheme objects that provide access to files and other streams of data, like traffic received from a network. The encoding is selected when opening a port using additional arguments to standard procedures that create ports for textual input and output like open-input-file/open-output-file.

Ports internally hold a table of "port methods", roughly similar to how object-oriented programming systems attach behaviour to data objects of a particular class. The port methods in former versions of CHICKEN included the low-level counterpart of the standard operations peek-char, read-char and read-string for input ports and write-char/write-string (among others) for output ports. The major change here is to replace the string I/O methods with methods that act upon bytevectors.

An internal mechanism delegates the operations for encoding and decoding or scanning for the next character in a stream to an internal hook procedure. Additional encodings can be registered by using a built-in procedure to extend the hook. Currently supported are binary, UTF-8 and Latin-1 encodings. Support for a larger set of encodings can be done through extensions and thus can be developed separately from the core system. Port encodings can be accessed using port-encoding. They can also be changed using the SRFI-17 setter for port-encoding, because encodings sometimes need to be changed while the port is still open.
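Since strings are UTF-8 internally, reading from a Latin-1 port means transcoding every byte above 0x7F into a two-byte sequence. The following C function is a sketch of that conversion only. Its name is made up, and it says nothing about how the encoding hook is actually structured in CHICKEN.

```c
#include <stddef.h>

/* Transcode n Latin-1 bytes from src into UTF-8 in dst, which must have
   room for 2 * n bytes. Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *src, size_t n, unsigned char *dst)
{
  size_t i, o = 0;
  for (i = 0; i < n; i++) {
    unsigned char b = src[i];
    if (b < 0x80) {
      dst[o++] = b;                                  /* ASCII passes through */
    } else {
      dst[o++] = (unsigned char)(0xC0 | (b >> 6));   /* lead byte: 0xC2 or 0xC3 */
      dst[o++] = (unsigned char)(0x80 | (b & 0x3F)); /* continuation byte */
    }
  }
  return o;
}
```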

R7RS does not define whether binary I/O standard procedures are allowed to operate on textual ports and vice versa. In CHICKEN we do not restrict the set of operations depending on the port type, so you can read and write bytevectors to and from a textual port and the other way round. We think this is more practical and intuitive - it makes more sense to read and write binary data as bytevectors and have dedicated operations for this.

R7RS support

The second big change for CHICKEN 6 is proper R7RS (small) compliance. As with the UTF-8 support, this was previously provided by an extension library, but using it needed some minor extra steps to set up. Now, all syntactic definitions of the (scheme base) module are available by default (most notably define-library) without requiring any further action.

Deciding what is available by default in a compilation unit or interpretation environment is a bit of a problem: to make it easy to get going when experimenting or scripting, we defaulted to having all standard R5RS procedures and macros available in the base environment of the interpreter (csi), together with the imports from the (chicken base) module. Compiled code was slightly more restricted but defaulted also to R5RS.

In CHICKEN 5 the scheme module referred to the R5RS standard procedures. To avoid breaking too much existing code this is still the case. So now, scheme is an alias for the R7RS (scheme r5rs) library module that exports all bindings of the former language standard. But to avoid duplicating the set of exported identifiers over several core modules, certain functionality has been moved from (chicken base) to (scheme base) and is thus not available in the default toplevel environment.

To make a long story short, the switch makes it necessary to access certain standard bindings like open-input-string by importing additional modules like (scheme base). This is not necessarily a bad thing, as it incentivises the user to write code in a more standards-compliant way. But it all feels a bit clunky and may have been a suboptimal choice. Note that just adding an (import (scheme base)) is usually enough to make existing code run. We will see how that works out.

All required (scheme ...) modules are implemented and can be imported using their full functionality. Two notable changes that influence backwards compatibility are generative record types and hexadecimal escape sequences in string literals.

Formerly record types defined with define-record-type were non-generative: re-defining a record type with the same name, even if the type definition is completely different, would not create a new type. Instances of the former definition would also be of the newly defined type, provided the type name is the same. Now every definition of a record type creates a completely new type, regardless of the name. This is of course more correct and much safer as it doesn't invalidate previously defined instances.

A second (and much more annoying) change is that R7RS requires hex escape sequences in string literals to be terminated by a semicolon. Unfortunately the change is incompatible with the convention used in most programming languages, including many existing Lisp and Scheme implementations.

What in CHICKEN 5 would have looked like this:

   "the color \x1b[31mRED\x1b[0m"

in CHICKEN 6 (and R7RS) must now be (note the semicolon):

   "the color \x1b;[31mRED\x1b;[0m"

The motivation here was probably to avoid having dedicated escape sequences for values that require more than 2 hex digits (e.g. \uNNNN). The change is particularly bad from a compatibility point of view. All string literals that contain the old style of escape sequences must be changed. To keep code working in both CHICKEN 5 and 6 you can use the 4-digit \uNNNN escape sequence which is still valid in all versions.

Foreign Function Interface changes

CHICKEN has a very easy to use foreign function interface, which mostly derives from the fact that the compiler generates C. The alternative approach is a binary FFI that uses native code to interface with C code, like libffi. Such an FFI must reproduce a lot of ABI details to safely interface with C libraries: things like struct alignment and padding, how arguments of various lengths are passed on the stack, etc.

The most notable FFI-related change for CHICKEN 6 is that it allows passing and returning C structs and unions to and from foreign code by value. The contents are copied in and out of bytevectors and wrapped in a block. The components of the struct or union cannot be directly manipulated in Scheme but can be passed on to other foreign functions. Additionally, for completeness, you can now also directly pass C99 complex numbers. Note that complex numbers in the FFI are always passed as inexact (i.e., floating-point), as opposed to Scheme complex numbers that may have exact (integer or even rational) real and imaginary components.

Platform support and build system

Here two major changes have been introduced. First, we now have a proper configuration ("configure") script. Formerly, all build parameters were passed as extra arguments to make(1), like this:

   make PREFIX=$HOME/.local ...

This required that for every make(1) invocation, the same set of parameters had to be given to avoid inconsistencies. A configuration script written in portable POSIX sh(1) notation is now provided to perform the configuration step once before building. It also follows the de facto convention used in many GNU programs, where the usual incantation is:

   ./configure; make; make install

Note that we don't use the dreaded "autotools" (autoconf, automake and libtool), which have arcane syntax, are very brittle and produce more problems than they are trying to solve. They were originally designed to port code to now dead or obscure UNIX derivatives, yet fail to provide a straightforward and easy to use configuration tool for modern systems (Linux/BSD, Mac, Windows, mostly). Our configure script is hand written instead of auto-generated, and only checks for the platform differences that are relevant to those platforms that CHICKEN actually supports.

The old style of passing variables to make(1) should still work, but is deprecated.

The second change in the build system is that we cleaned up the confusion about toolchains on Windows systems. There is a bunch of half-maintained flavors of GNU-based C development environments for Windows systems ("MinGW", "MinGW-w64", "MSYS2", etc.) and it is a constant source of pain to support and test all these variants.

There is now one official "blessed" toolchain that we support. Specifically, we recommend Chris Wellons' excellent w64devkit. It contains compilers, build tools and enough of a POSIX environment for building CHICKEN on Windows. We also require a POSIX sh(1) (contained in w64devkit) and have dropped all support for building on a non-POSIX shell, i.e. cmd.exe. This simplifies the build and package management tools considerably. It also ensures we have fewer environments to test.

Building for Windows Subsystem for Linux (WSL) and Cygwin is of course still supported, but those use the UNIX build setup and need no or very little specific platform support.

Minor changes

Quite a number of minor changes have taken place that either increase the robustness or standards compliance of the system.

syntax-error is now a macro, as required by R7RS. Previously there was a syntax-error procedure in (chicken syntax) with the same argument signature. So where the error was previously raised at runtime, it is now done at expansion time. This is something to take note of when porting code to CHICKEN 6. The invocation is still the same, so the difference can at least be identified easily and corrected, and (scheme base) needs to be imported, of course.

The csc compiler driver now passes arguments properly to subprocesses via execve(2) instead of system(3), which reduces shell quoting headaches.
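The difference is easy to see in a small C sketch. With system(3) the whole command line is re-parsed by a shell, so every argument needs quoting. With the exec family each argument is handed to the child as a separate string. The helper below is hypothetical and uses execvp(3), a thin wrapper around execve(2); it is not taken from csc.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] with the given NULL-terminated argument vector and return its
   exit status, or -1 on failure. No shell is involved, so arguments that
   contain spaces or quotes arrive in the child exactly as given. */
int run_argv(char *const argv[])
{
  int status;
  pid_t pid;

  assert(argv != NULL && argv[0] != NULL);
  pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    execvp(argv[0], argv);
    _exit(127);               /* only reached if exec failed */
  }
  if (waitpid(pid, &status, 0) < 0) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A file name such as "my file.c" can be placed in the argument vector as it is, whereas a system(3) call would need it escaped for the shell.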

The chicken-install package ("egg") manager locks the build cache directory to avoid conflicts when running multiple installations in parallel. Also, a custom-config feature has been added to places in the package description (.egg) file that specify compile and link options provided by external tools like pkg-config for more portable configuration of native libraries that packages use. The configuration script is expected to be Scheme code. Other eggs can extend the set of possible customisation facilities by providing library code to access them.

The feathers debugger has been removed from the core system and re-packaged as an egg, allowing separate development or replacement. It always was a bit of a proof-of-concept thing that lacks the robustness and flexibility of a real debugger. Some users found it helpful, so we keep it for the time being.

Future directions

Every major release is a chance to fix long-standing problems with the codebase and address bad design decisions. CHICKEN is now nearly 25 years old and we had many major overhauls of the system. Sometimes these caused a lot of pain, but still we always try to improve things and hopefully make it more enjoyable and practical for our users. There are places in the code that are messy, too complex, or that require cleanup or rewrite, always sitting there waiting to be addressed. On the other hand CHICKEN has been relatively stable compared to many other language implementations and has a priceless community of users that help us improve it. Our users never stop reminding us of what could be better, where the shortcomings are, where things are hard to use or inefficient.

So the final goal is and will always be to make it more robust and easier to use. Performance improvements are always good but are secondary. Strong standards compliance is a requirement, unless it interferes with practicality. We also try to avoid dependencies in the core system at all costs. This eases porting and avoids friction that is very often introduced by inadequate tools or overdesigned development environments.

Some moody ramblings about Scheme standards

The switch to seamless support for R7RS (small) was due for quite a while. R7RS is by now the generally accepted Scheme standard that implementations try to follow. How the further development of R7RS (large) will turn out remains to be seen, but I'm not very confident that it will result in anything more than just a shapeless agglomeration of features. The current direction seems to be oriented towards creating something that includes everything but the "kitchen-sink" - too ambitious and too much driven by the urge to produce something big and comprehensive.

What always distinguished Scheme from other languages and Lisp dialects was its elegance and minimalism. This resulted in a large number of experimental implementations, the investigation of various directions of programming language semantics and in highly interesting advanced implementation techniques that are still in use today. The expressiveness and the small number of core concepts made it also perfect for teaching computing. It still is great for teaching, even if the tendency to address perceived requirements of the "market" replaces the academic use of Scheme with languages that belong more to the "mainstream". This strikes me as strange, as if learning multiple languages and studying different programming paradigms couldn't help in obtaining a broader view of software development; or as if one is somehow wasting time by exploring the world of programming in a non-mainstream language.

Repeatedly the small size and limited scope of Scheme have driven a number of enthusiasts dedicated to favouring a broader use in the industry to demand "bigness". They dream of a comprehensive standard that will support everything, please everyone and make it easy to write portable and non-trivial applications in Scheme and have them run on a large number of implementations. With this comes the tacit expectation that implementers will just follow such a large standard and implement it faithfully.

But what made Scheme successful (and beautiful) was the small size, the small number of powerful concepts, the minimalism, the joy in experimentation. Trying to create the Comprehensive Mother of all Schemes is in my opinion a waste of time. In fact, it makes Scheme less relevant. Large languages inevitably die - the warts and inconsistencies that crop up when you try to design too much up front will just get more blatant as the language ages and will annoy users, constrain implementers and make people search for alternatives.

You can already write serious programs and use a large number of libraries in Scheme: just install CHICKEN, Guile or (even) Racket and get going. The code will to a certain part not be portable across implementations, that's unavoidable, but to use dynamic languages effectively and successfully, some experience with and a certain understanding of the design of the underlying compiler or interpreter is necessary anyway.

Acknowledgements

I would like to thank my employer, bevuta IT GmbH, which sponsored part of the effort of preparing a new major release of CHICKEN.

Adding weak references to CHICKEN

2023-06-27T11:48:10Z

Recently, a user was asking on IRC why CHICKEN doesn't give proper line numbers for error call traces in the interpreter. This is indeed rather odd, because the compiler gives numbered traces just fine.

After a brief discussion, we figured that this was probably because line numbers are stored in a "database" (hash table) which maps the source expression to a line number. Because you can keep calling (eval) with fresh input, the number of expressions evaluated is potentially infinite.

This would lead to unbounded growth of the line number database, eventually eating up all available memory.

We'd like to fix this. A great solution for this problem would be weak references. A weak reference is a reference to a value that does not cause the garbage collector to hold on to that value. If other things still refer to the value, it is preserved and the weak reference is maintained. If only weak references refer to a value, it may be collected.

The line number database could then consist of such weak references to source expressions. If an expression is no longer held onto, it can be collected and the line number eventually removed. This would turn the database from a regular hash table into a weak hash table.

This requires hooking into the garbage collector so that it knows about these references and can drop them. Since our collector is a copying collector, addresses of objects change over time, and a weak reference needs to be updated if something is still hanging onto it, without itself causing the weakly held object to stick around.

A quick recap of garbage collection

To explain how we changed our garbage collector, I will start with a quick high-level recap of how the garbage collection works. I'll explain just enough to cover what is needed to understand weak references. For a more in-depth look at CHICKEN's garbage collection, see my post on the GC.

First, we have to realise that CHICKEN stores all its values in a machine word. It distinguishes between "immediate" values and non-immediate or "block" values. Immediate values can be either fixnums (bit 0 is set) or other "small" values (bit 1 is set). Non-immediates are recognised because the lower two bits are zero, which is very convenient, as a pointer to a word-aligned address happens to have the lower two bits cleared as well!

In other words, non-immediate values are simply C pointers, whereas the immediate values are encoded long integers. So, non-immediates are allocated in memory and represented by a pointer to them. At the pointer's address we find a header which encodes what exactly is stored there, and (optionally) the data, where we store the contents of compound objects.

  typedef struct
  {
    C_word header;
    C_word data[];    /* Variable-length array: header determines length */
  } C_SCHEME_BLOCK;
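The tagging scheme described above can be sketched with a few helper functions. The names below are made up for this example (the real runtime uses its own macros), and `word` merely stands in for `C_word`.

```c
#include <stdint.h>

typedef intptr_t word;   /* stand-in for C_word: one machine word */

/* Fixnums have bit 0 set; the integer itself lives in the remaining bits. */
int is_fixnum(word w) { return (w & 1) != 0; }

word make_fixnum(intptr_t n) { return (word)(((uintptr_t)n << 1) | 1); }

/* Assumes an arithmetic right shift for negative values, which is what
   mainstream compilers provide. */
intptr_t fixnum_value(word w) { return w >> 1; }

/* Any value with one of the two low bits set is an immediate. */
int is_immediate(word w) { return (w & 3) != 0; }

/* Otherwise the word is a pointer to a word-aligned block in memory. */
int is_block(word w) { return (w & 3) == 0; }
```

Because a word-aligned address always has its two low bits cleared, telling a pointer from an immediate costs only a mask and a comparison.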

In typical Scheme objects, the data consists of slots. For example, the car and the cdr of a pair are slots, the elements of a vector are slots, etc. Each slot contains a Scheme value. In a graphical representation, a block object looks like this:

The data representation will be important to understand how we implement weak pairs, and to follow the garbage collection process.

If you want more details, see my post about data representation in CHICKEN.

A tale of two spaces

CHICKEN divides the heap up in two halves. While the program is running, only one half is in use. When this half fills up, we start the garbage collection process. Starting from the GC "roots", we trace all the data that's still considered "live" and copy it over to the other half.

Let's look at an example:

  (cons 5 '())  ; unreferenced, may be collected
  (define x (cons (vector 42) '())) ; sticks around

The half that was full is called the fromspace and the half that starts out empty is called the tospace. Initially, the tospace is empty, and the data is only in the fromspace:

Then, we move the objects, one at a time, starting at the roots. So in the example, the pair that we named "x" gets moved first to tospace, while its slots still point into fromspace:

The changes are in yellow. After tracing the roots, we trace all the contents of the objects in fromspace and copy over the pointed-to objects:

Finally, we're done. Any remaining objects in fromspace are garbage and can be ignored, effectively "clearing" fromspace, and we flip the two spaces so that fromspace becomes tospace and vice versa:

Onward! I mean, forward!

This was a very simple case. A more complex case is when there are several objects with mutual references. We must somehow keep track of which objects got moved where, so that we don't copy the same object more than once.

Another example:

  (define shr (vector 42))
  (define x (cons shr 1))
  (define y (cons shr 2))

This would look like the following when visualized:

As I've explained, roots are copied first, so let's say shr gets copied first:

As you can see, this introduces something new. There's a forwarding pointer stored at the address where we originally found the header of the vector known as shr. This is done so that we don't have to traverse all the objects pointing to shr (which could even be cyclic!). Instead, whenever we find an object that holds an address where there now is a forwarding pointer, we dereference the pointer and change the object to point to the target address.

So, if x is copied next, we get the following picture:

We actually *always* leave behind a forwarding pointer when an object is copied, because the GC does not "know" whether anything references the object. I skipped that in the initial examples to keep it simple. And finally, y can be copied:

Now you can see, the forwarding pointers are still there in fromspace, but nothing references them anymore. The GC is now done. Of course, fromspace and tospace will be flipped (but I'm not showing that here).
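The copying process with forwarding pointers can be condensed into a toy collector in C. This is a deliberately simplified model and not CHICKEN's collector: all names are invented, every slot holds either a fixnum-like immediate (bit 0 set) or a pointer to a block, and a header whose bit 0 is clear is interpreted as a forwarding pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word;

#define HEAP_WORDS 64
static word space_a[HEAP_WORDS], space_b[HEAP_WORDS];
static word *fromspace = space_a, *tospace = space_b;
static size_t from_top, to_top;

#define FIX(n)      ((((word)(n)) << 1) | 1)  /* immediates have bit 0 set */
#define IS_BLOCK(w) (((w) & 1) == 0)          /* pointers are word-aligned */
#define HEADER(n)   FIX(n)                    /* header: slot count, tagged odd */
#define SLOTS(h)    ((size_t)((h) >> 1))

/* Allocate a block with n slots in fromspace; the caller fills the slots. */
word *alloc(size_t n)
{
  word *p = &fromspace[from_top];
  assert(from_top + n + 1 <= HEAP_WORDS);
  p[0] = HEADER(n);
  from_top += n + 1;
  return p;
}

/* Copy one value to tospace unless it was moved already; return its new location. */
word forward(word v)
{
  word *old, *copy;
  size_t i, n;
  if (!IS_BLOCK(v)) return v;           /* immediates are not moved */
  old = (word *)v;
  if (IS_BLOCK(old[0])) return old[0];  /* header slot holds a forwarding pointer */
  n = SLOTS(old[0]);
  copy = &tospace[to_top];
  to_top += n + 1;
  for (i = 0; i <= n; i++) copy[i] = old[i];
  old[0] = (word)copy;                  /* leave a forwarding pointer behind */
  return (word)copy;
}

/* Forward the roots, then scan tospace until no unscanned objects remain. */
void collect(word *roots[], size_t nroots)
{
  size_t scan = 0, i, n;
  word *tmp;
  to_top = 0;
  for (i = 0; i < nroots; i++) *roots[i] = forward(*roots[i]);
  while (scan < to_top) {
    n = SLOTS(tospace[scan]);
    for (i = 1; i <= n; i++) tospace[scan + i] = forward(tospace[scan + i]);
    scan += n + 1;
  }
  tmp = fromspace; fromspace = tospace; tospace = tmp;  /* flip the spaces */
  from_top = to_top;
}
```

With shr, x and y as roots, the shared vector is copied exactly once. When x and y are scanned later, their car slots are redirected through the forwarding pointer left in the old header.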

How weak pairs work in the GC

Let's make a few changes to our example:

  (define zero (weak-cons (vector 0) 0))  ;; new - car may be collected
  (weak-cons (vector 9) 9) ;; new - may be collected
  (define shr (vector 42))
  (define x (cons shr 1))
  (define y (weak-cons shr 2)) ;; changed to be weak
  (set! shr #f) ;; new - dropping root for shr
  (define z (weak-cons (vector #f) 3)) ;; new - car may be collected

Now, y is changed to be a weak pair holding shr in its car, while x is still a normal pair with the same object in its car. The shr variable is also cleared, so that only x holds onto shr's old value strongly. We've added a few more weak pairs while we're at it.

This weak-cons procedure does not exist in CHICKEN 5.3.0; this is what we'll be adding. It creates a pair whose car field is a weak reference. The cdr field is a regular reference. The reason for having the cdr be a regular reference is that this allows us to build up lists of items which may be collected. The "spine" of the list will remain intact, but the list entries will be cleared (see below).

Let's take a look at our new initial situation:

The GC will start with the live objects, as usual. Let's start at the top, copying zero:

Since zero is a weak pair, we won't traverse the car slot, keeping the vector it points to in fromspace. This is intentional, as the GC treats weak pairs differently; only the cdr slot gets scanned, while the car gets skipped and left as a "dangling" pointer into fromspace.

We'll fix that later, as we don't want to keep this situation after the garbage collection process is done.

Normally we'd pick up shr next, but that is not a root anymore. Let's continue with x instead:

Then, as we scan over the contents of x, we encounter what used to be shr and move it, updating the pointer in x:

Next, we move y:

You'll notice that because it's a weak pair, its car slot does not get updated to the new location of what used to be shr, even though that vector has already been copied. As mentioned before, we'll fix this up later.

Finally, z gets moved:

I'm sure you've noticed that now we're left with a big spaghetti bowl of pointers, some of which are still pointing into fromspace. So let's investigate how we can fix this mess.

Fixing up the dangling weak pairs, naively

It just so happens that CHICKEN internally already has support for weak pairs. Those are pairs whose car field holds objects that may be collected. These are used solely in the symbol table, so that symbols aren't maintained indefinitely. This is important because it stops symbol table stuffing attacks. Those are a form of Denial of Service (DoS) by forcing the system to eat up all memory.

In this current implementation, the only weak pairs in existence are those in the symbol table's hash bucket chains. So, the solution there is very simple: traverse all the symbol table buckets and update weak pointers.

So we traverse the slots like during normal GC's mark operation, but with two exceptions:

When we mark the slot, we don't cause the contents to be copied over into tospace, but only chase any forwarding pointers.
When, after that, the slot still holds a pointer into fromspace (meaning the object was never copied), we replace the slot with a special "broken weak pointer" marker.

That would yield this final picture:

As you can see, all the weak pairs have their car slots updated, even the ones that we could consider as garbage, like the two at the top. Those get the special value #!bwp, or "broken weak pointer" to indicate that the value originally stored there is no longer live.
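The naive fix-up can be sketched in a toy Python model. This is an illustration under simplifying assumptions, not CHICKEN's code: Obj, copy and collect are invented names, and the all_weak_pairs list plays the role of the symbol table's bucket chains, that is, a place where every weak pair can be found.

```python
BWP = "#!bwp"  # marker for a broken weak pointer

class Obj:
    def __init__(self, *slots, weak=False):
        self.slots = list(slots)
        self.weak = weak        # weak pair: slot 0 (the car) is not traced
        self.forward = None     # forwarding pointer, set when copied

def copy(obj, tospace):
    if not isinstance(obj, Obj):
        return obj
    if obj.forward is None:
        obj.forward = Obj(*obj.slots, weak=obj.weak)
        tospace.append(obj.forward)
    return obj.forward

def collect(roots, all_weak_pairs):
    tospace = []
    new_roots = [copy(root, tospace) for root in roots]
    scan = 0
    while scan < len(tospace):
        obj = tospace[scan]
        first = 1 if obj.weak else 0          # skip the car of weak pairs
        for i in range(first, len(obj.slots)):
            obj.slots[i] = copy(obj.slots[i], tospace)
        scan += 1
    # Naive fix-up: visit every weak pair we know about, dead or alive.
    visited = 0
    for old in all_weak_pairs:
        pair = old.forward if old.forward is not None else old
        car = pair.slots[0]
        if isinstance(car, Obj):
            # Chase a forwarding pointer, but never copy anything here.
            pair.slots[0] = car.forward if car.forward is not None else BWP
        visited += 1
    return new_roots, visited

zero = Obj(Obj(0), 0, weak=True)
garbage = Obj(Obj(9), 9, weak=True)           # no root: the pair itself is dead
shr = Obj(42)
x = Obj(shr, 1)
y = Obj(shr, 2, weak=True)
z = Obj(Obj(False), 3, weak=True)
(new_zero, new_x, new_y, new_z), visited = collect(
    [zero, x, y, z], [zero, garbage, y, z])
```

Note that the dead pair is visited and patched too, which is exactly the wasted work discussed in the text.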

Smarter fixing up of weak pairs

Traversing all the weak pairs is fine when we know every single one in the system. When only hash table buckets in the symbol table may be weak, that's the case. But now that we want to expose weak pairs to the user, what do we do?

The simplest solution would be to add a table or list of all the weak pairs as they're constructed, or as the GC encounters them. This has a few problems:

It is wasteful of memory, as the extra pointers in this table would be redundant.
If done during garbage collection, we'll have to dynamically allocate during a GC, when memory is already tight (and it's slow).
We'd end up traversing potentially loads and loads of dead weak pairs that would be themselves collected, as in the example above.

Alternatively, you could traverse the entire heap again after GC and mark the weak pairs. In fact, that's what Chez Scheme does. However, Chez has separate heaps for different types of objects (known as a BIBOP scheme). This has the advantage of traversing only the live weak pairs, instead of every single live object.

In CHICKEN, we'd be traversing every single object, as we store everything on a single heap. If the heap is large, this would be wasteful. But at least we wouldn't have to traverse dead weak pairs!

When looking into this problem, I decided to study the Schemes which have weak pointers. Those were Chez Scheme, Chibi Scheme and MIT Scheme. Of those, only MIT Scheme has a Cheney-style copying garbage collector with a single heap like CHICKEN does. The sources to MIT Scheme are well-documented, so I quickly found how they do it, and boy is it clever!

MIT Scheme's solution:

Traverses only the still-live weak pairs,
introduces only a single word of memory overhead,
and is brilliantly simple once you get how it works.

Garbage collecting weak pairs, MIT style

Let's take a closer look at that end state of that last GC run. Remember that a forwarding pointer overwrites the original header of the object that used to live there. But the memory of the slots is still sitting there, allocated but unused!

The great idea in MIT Scheme is that we can "recycle" that unused memory and store a linked list of still-live weak pairs in that space. The extra word of overhead I mentioned before acts as the head of that list. Let's look at how that looks during the GC, by replaying that last run, but with MIT Scheme's algorithm for tracking weak pairs:

The initial state is the same, except we have the additional weak pair chain head. The GC again starts at the top, copying zero, same as before, but with one addition. When the forwarding pointer gets created, we notice that it's a weak pair. This causes the weak pair chain head to get modified so it points to the new forwarding pointer. The original value of the chain head (the empty list marker) gets stored where the old weak pair's car slot used to be.

Again, we continue with x, which is a regular pair, so nothing interesting happens to the space behind the forwarding pointer:

Then, as we scan over the contents of x, we again encounter what used to be shr and move it, updating the pointer in x. The forwarding pointer for shr is also not put in the chain because shr is not a weak pair:

Next, we move y. Because it is a weak pair, we again link up its forwarding pointer into the chain that we're building up. This means the pointer in the weak chain head (which pointed to zero's original address) gets put in the old car field of y, and the head itself now points to y's forwarding pointer:

Finally, z gets moved. Since it is also a weak pair, when making the forwarding pointer we also link it up into the chain:

At this point, we only need to traverse the weak pairs by following the trail of pointers starting at the head of the live weak pair chain. For each of the forwarding pointers in the chain:

Follow the forwarding pointer to its corresponding live weak pair,
take that pair's car,
check if the car refers to a forwarding pointer. If it does:
update the car to the forwarding pointer's destination.
Otherwise, if the car still points into fromspace, replace it with #!bwp.

I'm not going to do this one by one as it's pretty obvious. I will show you the result of this walk:

You can see that we've only updated three out of four weak pairs - exactly the number of weak pairs that got copied during this GC and are still considered live. Because only weak pairs that exist in tospace have to be updated, we are doing the absolute minimum work necessary.
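The chain-building trick can be replayed in a toy Python model. This is a sketch of the idea as described above, not MIT Scheme's or CHICKEN's real code: Obj and Collector are invented names, and None stands in for the empty list marker that terminates the chain.

```python
BWP = "#!bwp"

class Obj:
    def __init__(self, *slots, weak=False):
        self.slots = list(slots)
        self.weak = weak
        self.forward = None

class Collector:
    def __init__(self):
        self.tospace = []
        self.weak_chain = None   # the single extra word: head of the chain

    def copy(self, obj):
        if not isinstance(obj, Obj):
            return obj
        if obj.forward is None:
            obj.forward = Obj(*obj.slots, weak=obj.weak)
            self.tospace.append(obj.forward)
            if obj.weak:
                # Recycle the now-unused car slot in fromspace as a "next" link.
                obj.slots[0] = self.weak_chain
                self.weak_chain = obj
        return obj.forward

    def collect(self, roots):
        new_roots = [self.copy(root) for root in roots]
        scan = 0
        while scan < len(self.tospace):
            obj = self.tospace[scan]
            for i in range(1 if obj.weak else 0, len(obj.slots)):
                obj.slots[i] = self.copy(obj.slots[i])
            scan += 1
        return new_roots, self.fix_weak_pairs()

    def fix_weak_pairs(self):
        visited = 0
        old = self.weak_chain
        while old is not None:
            pair = old.forward               # the live copy in tospace
            car = pair.slots[0]
            if isinstance(car, Obj):
                pair.slots[0] = car.forward if car.forward is not None else BWP
            visited += 1
            old = old.slots[0]               # follow the link kept in fromspace
        return visited

nine = Obj(9)
zero = Obj(Obj(0), 0, weak=True)
garbage = Obj(nine, 9, weak=True)            # never copied, never in the chain
shr = Obj(42)
x = Obj(shr, 1)
y = Obj(shr, 2, weak=True)
z = Obj(Obj(False), 3, weak=True)
(new_zero, new_x, new_y, new_z), visited = Collector().collect([zero, x, y, z])
```

Only the three surviving weak pairs are visited; the dead one is never touched, and no memory beyond the chain head is needed.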

The nice thing about this is that even if you produce lots of garbage weak pairs, they don't have any additional impact on GC performance. Only those weak pairs that survive garbage collection will be scanned and updated.

When implementing this improvement I noticed that our performance on benchmarks improved a little bit, even though they don't use weak pairs directly. This was all due to only scanning copied weak pairs and not having to scan the entire symbol table on every GC.

With the support for user-facing weak pairs in place, supporting line numbers in the interpreter was relatively easy. This means that debugging your code will be easier starting with the upcoming 5.4 release of CHICKEN!

Clojure from a Schemer's perspective

2021-03-03T15:31:50Z

Recently I joined bevuta IT, where I am now working on a big project written in Clojure. I'm very fortunate to be working in a Lisp for my day job!

As I've mostly worked with Scheme and have used other Lisps here and there, I would like to share my perspective on the language.

Overall design

At first glance, it is pretty clear that Clojure has been designed from scratch by (mostly) one person who is experienced with Lisps and as a language designer. It is quite clean and has a clear vision. Most of the standard library has a very consistent API. It's also nice that it's a Lisp-1, which obviously appeals to me as a Schemer.

My favourite aspect of the language is that everything is designed with a functional-first mindset. This means I can program in the same functional style as I tend to do in Scheme. Actually, it's even more functional, because for example its maps (what would be hash tables in Scheme) are much less clunky to deal with. In Scheme, SRFI-69 hash tables are quite imperative, with hash-table-set! and hash-table-update! being the ways to insert new entries, which of course mutate the existing object. Similarly, Clojure vectors can easily be extended (on either end!) functionally.

The underlying design of Clojure's data structures must be different. It needs to efficiently support functional updates; you don't want to fully copy a hash table or vector whenever you add a new entry. I am not sure how efficient everything is, because the system I'm working on isn't in production yet. A quick look at the code implies that various data structures are used under the hood for what looks like one data structure in the language. That's a lot of complexity! I'm not sure that's a tradeoff I'd be happy to make. It makes it harder to reason about performance. You might just be using a completely different underlying data structure than expected, depending on which operations you've performed.

(non) Lispiness

To a seasoned Lisp or Scheme programmer, Clojure can appear positively bizarre. For example, while there is a cons function, there are no cons cells, and car and cdr don't exist. Instead, it has first and rest, which are definitely saner names for a language designed from scratch. It has "persistent lists", which are immutable lists, but in most day to day programming you will not even be using lists, as weird as that sounds!

Symbols and keywords

One thing that really surprised me is that symbols are not interned. This means that two symbols which are constructed on the fly, or when read from the same REPL, are not identical (as in eq or eq?) to one another:

user> (= 'foo 'foo)
true
user> (identical? 'foo 'foo)
false

Keywords seem to fulfil most "symbolic programming" use cases. For example, they're almost always used as "keys" in maps or when specifying options for functions. Keywords are interned:

user> (= :foo :foo)
true
user> (identical? :foo :foo)
true
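The difference between equality and identity comes down to whether construction goes through a shared table. Here is a minimal Python sketch of both behaviours; Symbol and Keyword are illustrative classes, not a description of how Clojure implements them.

```python
class Symbol:
    """Not interned: every construction yields a fresh object."""
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, Symbol) and self.name == other.name

    def __hash__(self):
        return hash(("symbol", self.name))

class Keyword:
    """Interned: one shared object per name, looked up in a table."""
    _table = {}

    def __new__(cls, name):
        existing = cls._table.get(name)
        if existing is None:
            existing = super().__new__(cls)
            existing.name = name
            cls._table[name] = existing
        return existing

symbols_equal = Symbol("foo") == Symbol("foo")        # compares names
symbols_identical = Symbol("foo") is Symbol("foo")    # two distinct objects
keywords_identical = Keyword("foo") is Keyword("foo") # same table entry
```

With interning, identity checks are enough to compare two keywords, which is what makes them cheap to use as map keys.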

Code is still (mostly) expressed as lists of symbols, though. When you're writing macros you'll deal with them a lot. But in "regular" code you will deal more with keywords, maps and vectors than lists and symbols.

Numeric tower

A favourite gotcha of mine is that integers are not automatically promoted to bignums like in most Lisps that support bignums. If you need bignums, you have to use special-purpose operators like +' and -':

user> (* (bit-shift-left 1 62) 2)
Execution error (ArithmeticException) at user/eval51159 (REPL:263).
integer overflow
user> (*' (bit-shift-left 1 62) 2)
9223372036854775808N

user> (* (bit-shift-left 1 62) 2N) ; regular * supports BigInt inputs, though
9223372036854775808N
user> (* 1N 1) ; but small BigInts aren't normalized to Java Longs
1N

This could lead to better performance at the cost of more headaches when dealing with accidentally large numbers in code that was not prepared for them.
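The two behaviours can be contrasted in a few lines of Python, whose integers are arbitrary precision by default. This sketch only models multiplying two 64-bit values; checked_mul and promoting_mul are hypothetical names standing in for * and *'.

```python
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1

def checked_mul(a, b):
    """Like * on two longs: refuse to leave the 64-bit range."""
    result = a * b
    if not LONG_MIN <= result <= LONG_MAX:
        raise OverflowError("integer overflow")
    return result

def promoting_mul(a, b):
    """Like *': fall back to an arbitrary-precision result."""
    return a * b  # Python ints never overflow

big = promoting_mul(1 << 62, 2)
try:
    checked_mul(1 << 62, 2)
    overflowed = False
except OverflowError:
    overflowed = True
```

The checked version fails loudly at the point of overflow, which is the trade-off described above: predictable types and speed in exchange for errors in code that did not anticipate large values.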

What about rationals, you ask? Well, those are just treated as "the unusual, slow case". So even though they do normalize to regular integers when simplifying, operations on those always return BigInts:

user> (+ 1/2 1/4)
3/4
user> (+ 1/2 1/2)
1N
user> (/ 1 2) ; division is the odd one out
1/2
user> (/ 4 2) ; it doesn't just punt and always produce bignums, either:
2

The sad part is, bitwise operators do not support bignums, at all:

user> (bit-shift-right 9223372036854775808N 62)
Execution error (IllegalArgumentException) at user/eval51167 (REPL:273).
bit operation not supported for: class clojure.lang.BigInt
user> (bit-shift-right' 9223372036854775808N 62) ; does not exist
Syntax error compiling at (*cider-repl test:localhost:46543(clj)*:276:7).
Unable to resolve symbol: bit-shift-right' in this context

There's one benefit to all of this: if you know the types of something going into numeric operators, you will typically know the type that comes out, because there is no automatic coercion. Like I mentioned, this may provide a performance benefit, but it also simplifies reasoning about types. Unfortunately, this does not work as well as you would hope because division may change the type, depending on whether the result divides cleanly or not.

Syntax

For many Lispers, this is the elephant in the room. Clojure certainly qualifies as a Lisp, but it is much heavier on syntax than most other Lisps. Let's look at a small contrived example:

(let [foo-value (+ 1 2)
      bar-value (* 3 4)]
  {:foo foo-value
   :bar bar-value})

This is a let just like in Common Lisp or Scheme. The bindings are put inside square brackets, which is literal syntax for vectors. Inside this vector, key-value pairs are interleaved, like in a Common Lisp property list.

The lack of extra sets of "grouping" parentheses is a bit jarring at first, but you get used to it rather quickly. I still mess up occasionally when I accidentally get an odd number of entries in a binding vector. Now, the {:foo foo-value :bar bar-value} syntax is a map, which acts like a hash table (more on that below).

There doesn't seem to be a good rationale about why vectors are used instead of regular lists, though. What I do really like is that all the binding forms (even function signatures!) support destructuring. The syntax for destructuring maps is a bit ugly, but having it available is super convenient.

What I regard as a design mistake is the fact that Clojure allows for optional commas in lists and function calls. Commas are just whitespace to the reader. For example:

(= [1, 2, 3, 4] [1 2 3 4]) => true
(= '(1, 2, 3, 4) '(1 2 3 4)) => true
(= {:foo 1, :bar 2, :qux 3} {:foo 1 :bar 2 :qux 3}) => true
(= (foo 1, 2, 3, 4) (foo 1 2 3 4)) => true
;; A bit silly:
(= [,,,,,,1,,,2,3,4,,,,,,] [1 2 3 4]) => true

Maybe this is to make up for removing the extra grouping parentheses in let, cond and map literal syntax? With commas you can add back some clarity about which items belong together. Hardly anybody uses commas in real code, though. And since they're optional, they don't make much sense.
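"Commas are whitespace" is easy to picture as a tokenizer rule. Here is a deliberately tiny Python sketch that reads a flat vector of integers; read_vector is a made-up helper and handles nothing beyond this one case.

```python
import re

def read_vector(text):
    """Read a flat vector literal of integers, treating commas as whitespace."""
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a vector literal")
    # Commas and whitespace are one and the same separator class.
    tokens = re.split(r"[\s,]+", body[1:-1].strip(", \t\n"))
    return [int(token) for token in tokens if token]

plain = read_vector("[1 2 3 4]")
with_commas = read_vector("[1, 2, 3, 4]")
silly = read_vector("[,,,,,,1,,,2,3,4,,,,,,]")
```

All three inputs produce the same list, since the reader never sees the commas as tokens.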

This has an annoying ripple effect on quasiquotation. Due to this decision, a different character has to be used for unquote, because the comma was already taken:

`(1 2 ~(+ 1 2)) => (1 2 3)
`(1 2 ~@(list 3 4)) => (1 2 3 4)

This might seem like a small issue, but it is an unnecessary and stupid distraction.

Minimalism

One of the main reasons I enjoy Scheme so much is its goal of minimalism. This is achieved through elegant building blocks. This is embodied by the Prime Clingerism:

  Programming languages should be designed not by piling feature on
  top of feature, but by removing the weaknesses and restrictions
  that make additional features appear necessary.

Let's check the size of the clojure.core library. It clocks in at 640 identifiers (v1.10.1), which is a lot more than R5RS Scheme's 218 identifiers. It's not an entirely fair comparison as Scheme without SRFI-1 or SRFI-43 or an FFI has much less functionality as well. Therefore, I think Clojure's core library is fairly small but not exactly an exercise in minimalism.

Clojure reduces its API size considerably by having a "sequence abstraction". This is similar to Common Lisp's sequences: you can call map, filter or length on any sequence-type object: lists, vectors, strings and even maps (which are treated as key/value pairs). However, it is less hacky than in Common Lisp because for example with map you don't need to specify which kind of sequence you want to get back. I get the impression that in Common Lisp this abstraction is not very prominent or used often but in Clojure everything uses sequences. What I also liked is that sequences can be lazy, which removes the need for special operators as well.

If you compare this to Scheme, you have special-purpose procedures for every concrete type: length, vector-length, string-length etc. And there's no vector-map in the standard, so you need vector-map from SRFI 43. Lazy lists are a separate type with its own set of specialized operators. And so on and so forth. Using concrete types everywhere makes for code that is less abstract and less confusing, and the performance characteristics of an algorithm tend to be clearer, but it also leads to a massive growth in library size.

After a while I really started noticing mistakes that make additional features appear necessary. For example, there's a special macro called loop to make tail recursive calls. This uses a special form, recur, to call back into the loop. In Scheme, you would do that with a named let, where you can choose your own identifier to recur on. Because the identifier is hardcoded, it's also not possible to jump back to an outer loop from inside a nested one. So, this called for adding another feature, which is currently in proposal. Speaking of recur, it is also used for tail recursive self-calls. It relies on the programmer rather than the compiler to mark calls as tail recursive. I find this a bit of a cop-out, especially in a language that is so heavily functional, and all the more so because it doesn't work for mutually tail-recursive functions. The official way to do those is even more of a crutch.
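For mutually recursive functions on a runtime without tail call elimination, a common workaround is a trampoline: each function returns a thunk instead of making the tail call, and a driver loop keeps calling thunks until a plain value comes out. Clojure's core library has a trampoline function along these lines; the Python below is only a sketch of the idea, with is_even and is_odd as illustrative names.

```python
def trampoline(thunk):
    """Keep calling zero-argument functions until a non-callable value appears."""
    result = thunk
    while callable(result):
        result = result()
    return result

def is_even(n):
    # Return a thunk for the "tail call" instead of calling is_odd directly.
    return True if n == 0 else (lambda: is_odd(n - 1))

def is_odd(n):
    return False if n == 0 else (lambda: is_even(n - 1))

# 100,000 bounces would blow the default recursion limit if written
# as direct mutual recursion.
answer = trampoline(lambda: is_even(100_000))
```

The stack stays flat because every "call" returns to the driver loop first, but the functions have to be written in this special thunk-returning style, which is what makes it feel like a crutch.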

I find the special syntax for one-off lambdas #(foo %) just as misguided as SRFI 26 (cut and cute). You often end up needing to tweak the code in such a way that you have to transform the lambda to a proper fn. And just like cut, it doesn't save that many characters anyway and makes the code less readable.

The -> macro is a clever hack which allows you to "thread" values through expressions. It implicitly adds the value as the first argument to the first form, the result of that form as the first argument for the next, etc. Because the core library is quite well-designed, this works 90% of the time. Then the other 10% you need ->> which does the same but adds the implicit argument at the end of the forms. And that's not always enough either, so they decided to add a generic version called as-> which binds the value to a name so you can put it at any place in the forms. These macros also don't compose well. For example, sometimes you need a let in a -> chain to have a temporary binding. That doesn't work because you can't randomly insert forms into let, so you have to split things up again.
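The difference between the two threading positions is easiest to see when they are written as ordinary functions. This Python sketch only imitates the argument placement: the real Clojure forms are macros that rewrite code, and thread_first, thread_last, assoc and dissoc here are hypothetical helpers.

```python
from functools import reduce

def thread_first(value, *forms):
    """Rough analogue of ->: the running value becomes the first argument."""
    return reduce(lambda acc, form: form[0](acc, *form[1:]), forms, value)

def thread_last(value, *forms):
    """Rough analogue of ->>: the running value becomes the last argument."""
    return reduce(lambda acc, form: form[0](*form[1:], acc), forms, value)

def assoc(m, key, val):
    return {**m, key: val}           # functional update: returns a new dict

def dissoc(m, key):
    return {k: v for k, v in m.items() if k != key}

result = thread_first({"key-1": "value 1", "key-2": "value 2"},
                      (dissoc, "key-2"),
                      (assoc, "foo", "bar"))

total = thread_last(range(5),
                    (filter, lambda n: n % 2 == 0),
                    (map, lambda n: n * n),
                    (sum,))
```

Map-style functions take the collection first and sequence-style functions take it last, which is why two variants are needed at all.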

And as I note below, the minimalism is kind of "fake" because some essentials simply aren't provided; you have to rely on Java for that.

Java integration

Clojure was originally designed as a "hosted language", so it leverages the JVM. It does this admirably well; Java classes can be seamlessly invoked through Clojure, without any ceremony:

user> (java.util.UUID/randomUUID)
#uuid "bb788bae-5099-4a64-9c37-f6219d40a47f"

;; alternatively:
user> (import 'java.util.UUID)
java.util.UUID
user> (UUID/randomUUID)
#uuid "0bfd2092-14e1-4b88-a465-18698943ea4e"

The downside is that the above is the way to generate a random UUID. So even though uuids have literal syntax in Clojure (as #uuid "..."), there is no Lispy API for them in the Clojure standard library. This can be pretty frustrating, especially in the beginning. There's no clear indication where to look; sometimes you'll be poring over Java language docs for random stuff you thought would have a Clojure interface (like, say, creating temporary files or dealing with byte arrays). At those moments, you're basically programming Java with parentheses.

Having said that, there will often be community-provided nicer APIs for many of those things, but then you need to decide whether a slightly nicer syntax is worth adding an extra dependency.

Development style

REPL-driven development

Speaking of Java, one thing that constantly bothers me is the slow startup times of the REPL. In my current project, it takes almost 30 seconds to boot up a development REPL. Half a minute!

Luckily, there's great Slime-like Emacs integration with CIDER. Basically, the only sane way to do iterative development is by connecting to a REPL first thing you do and then sending your code to it all the time.

Now, this may sound weird from a Scheme programmer, but I never fully bought into the REPL style of developing. Sure, I experiment all the time in the REPL to try out a new API design or to quickly iterate on some function I'm writing. But my general development style tends more towards the "save and then run the test suite from an xterm". Relying solely on the REPL just "feels" jarring to me. I also constantly run into issues where re-evaluating a buffer doesn't get rid of global state that was built up on a previous run. When this happens, I'm testing an old version of some function without realising it. Keeping track of the "live" state versus the textual code I'm looking at is a total mind fuck for me. I don't understand how others can do this.

Another thing I seem to constantly do is write some code, have the tests go all green, only to see the CI crash on some cyclic dependency in my namespaces. The REPL does not always see those, because reloading a buffer with a namespace declaration works just fine when you loaded the imported namespaces before, even though they refer to the namespace being re-evaluated.

One thing I really find very nice when you're using CIDER is that everything (and I do mean everything) from Clojure is just a "jump to source" away. Most of the builtin functions seem to be written in Clojure itself. For example, if you want to know how map is implemented, you can just press M-. to see it.

Maps and keywords for everything

One thing you'll really notice is that in idiomatic Clojure code, maps are used for everything. A map is a functionally updateable hash table. It looks like this:

{:key-1 "value 1"
 :key-2 "value 2"}

This lends itself to a very dynamic style of programming, very much like you would use in (dare I say it?) PHP. A bit of a strange comparison, but PHP also makes dealing with arrays (which double as maps in a weird way) extremely ergonomic. There, missing nested keys are automatically created on the fly and because of a strange quirk in its developmental history, arrays are the only objects which are passed by value. This means you can program in a referentially transparent way, while still mutating them inside functions at will. Not exactly the same mechanism, but the end effect on programming style feels very similar: you reach for them whenever you want to bunch some stuff together. It is the go-to data structure when you need flexibility.

In other Lisps you'd use alists (or plists, or SRFI-69 hash tables) for this, but they don't deal so well with nested maps and the library is not as convenient. For example, you can easily select, drop and rename keys in a map:

(require '[clojure.set :as set]) ; rename-keys lives in clojure.set

(-> {:key-1 "value 1" :key-2 "value 2"}
    (set/rename-keys {:key-1 :key})
    (dissoc :key-2)
    (assoc :foo "bar")) => {:key "value 1" :foo "bar"}

This -> notation took me a while to get used to by the way, and I'm still not entirely comfortable with it. I explained how it works above. It's a macro for "threading" expressions. In Scheme, you'd probably use a let* for this, or something. In Clojure that would look like this:

(let [map {:key-1 "value 1" :key-2 "value 2"}
      map (set/rename-keys map {:key-1 :key})
      map (dissoc map :key-2)
      map (assoc map :foo "bar")]
  map) => {:key "value 1" :foo "bar"}

As you can see, the version with -> is much more convenient and less repetitive. Unfortunately, it doesn't compose that well (duh, it's a macro), but because of the way the standard library is designed it is more useful than it would seem at first glance.

Anyway, the way maps are typically used everywhere in a project means that there's a lot less "structure" to your data structures. It is extremely convenient to use maps, even though there are also things like records and protocols. Because of their convenience, you'll end up using maps for everything. As I've noticed in my refactorings, when you change the structure of maps, a lot of code is going to break without a clear indication of where it went wrong.

This is made extra painful by "nil punning". For example, when you look up something in a map that doesn't exist, nil is returned. In Clojure, many operations (like first or rest) on nil just return nil instead of raising an error. So, when you think you are looking up something in a map, but the "map" is actually nil, it will not give an error, but it will return nil.

Now like I said, sometimes you may get an error on nil. It's a bit unclear which operations are nil-punning and which will give a proper error. So when you finally get a nil error, you will have a hell of a time trying to trace back where this nil got generated, as that may have been several function calls ago. This is an example where I really like the strictness of Scheme as compared to some other Lisps, as nil-punning is traditionally a dynamic Lisp thing; it's not unique to Clojure.
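To illustrate how a nil can travel silently through nested lookups, here is a small Python sketch that contrasts a nil-punning lookup with a strict one. The get and get_in helpers are written for this example and only loosely mirror the Clojure functions of the same name.

```python
def get(m, key):
    """Lookup that treats None like an empty map instead of failing."""
    return None if m is None else m.get(key)

def get_in(m, keys):
    """Follow a path of keys, quietly passing None along."""
    for key in keys:
        m = get(m, key)
    return m

config = {"db": {"host": "localhost"}}

found = get_in(config, ["db", "host"])
typo = get_in(config, ["database", "host"])   # misspelled key, no error

try:
    config["database"]["host"]                # strict lookup fails right here
    strict_failed = False
except KeyError:
    strict_failed = True
```

The punning version returns None for the misspelled path, so the mistake only surfaces wherever that None is eventually used, possibly several calls later.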

Multimethods with keywords

Initially, I was quite impressed by the way multimethods work; they're super simple and clean, yet powerful. First, you declare the multimethod and a "decision procedure", which returns a value that can be compared:

(defmulti say-hi :kind)

(defmethod say-hi :default [animal]
  (println (:name animal) "says hello"))

(defmethod say-hi :duck [animal]
  (println (:name animal) "says quack"))

(defmethod say-hi :dog [animal]
  (println (:name animal) "says woof"))

(say-hi {:name "Daffy" :kind :duck})  => "Daffy says quack"
(say-hi {:name "Pluto" :kind :dog})   => "Pluto says woof"
(say-hi {:name "Peter" :kind :human}) => "Peter says hello"

Using multimethods takes some care and taste, because it splits up your logic. So instead of having one place where you have decisions made with an if or cond tree, you have a function call and then depending on how the multimethod was defined, a different function will be called. This is basically what makes C++ so difficult to deal with in large projects: when people use function overloading, it can get really messy. You need to figure out which of the many things called "say-hi" is actually called in a situation, before you can dive into that implementation.
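The mechanism itself is small: a dispatch function computes a value, and a table maps that value to an implementation. Here is a minimal Python sketch of that idea; MultiMethod is a hypothetical class, the "default" key imitates :default, and the methods return strings rather than printing.

```python
class MultiMethod:
    def __init__(self, dispatch_fn):
        self.dispatch_fn = dispatch_fn
        self.methods = {}

    def method(self, dispatch_value):
        """Decorator that registers an implementation for one dispatch value."""
        def register(fn):
            self.methods[dispatch_value] = fn
            return fn
        return register

    def __call__(self, *args):
        value = self.dispatch_fn(*args)
        fn = self.methods.get(value) or self.methods.get("default")
        if fn is None:
            raise LookupError(f"no method for dispatch value {value!r}")
        return fn(*args)

say_hi = MultiMethod(lambda animal: animal["kind"])

@say_hi.method("default")
def hi_default(animal):
    return f"{animal['name']} says hello"

@say_hi.method("duck")
def hi_duck(animal):
    return f"{animal['name']} says quack"

@say_hi.method("dog")
def hi_dog(animal):
    return f"{animal['name']} says woof"

greeting = say_hi({"name": "Daffy", "kind": "duck"})
```

The registrations can live anywhere in the code base, which is both the appeal and the reason the call site alone doesn't tell you which implementation runs.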

Compared to the insane amount of customizability that e.g. CLOS offers you, the design restraint shown in Clojure multimethods was nice to see, but then I realised this simplicity can be completely defeated by building hierarchies. That is, Clojure allows you to define a hierarchy on keywords. This was a huge wtf for me, because to me, keywords are just static entities that are unrelated to each other.

When you realise how Clojure keywords can be namespaced, it makes slightly more sense: this gives them some separation.

A keyword can appear in "bare" form like :foo. This is a globally scoped keyword that belongs to no particular code. It's definitely not smart to hang a hierarchy onto such a keyword, and you're also better off not adding any "meta attributes" to them.

The other form is ::foo, which puts the keyword in the current namespace. It is shorthand for :more-magic.net/foo if you are in the more-magic.net namespace.
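To see why a hierarchy on keywords changes the picture for dispatch, here is a small Python sketch in which method lookup falls back to ancestors. The names derive and isa echo Clojure's derive and isa?, but this is a simplified guess at the behaviour rather than a port: it assumes the hierarchy has no cycles, ignores ambiguous matches, and uses plain strings such as "zoo/duck" in place of namespaced keywords.

```python
parents = {}  # child keyword -> set of direct parent keywords

def derive(child, parent):
    parents.setdefault(child, set()).add(parent)

def isa(child, parent):
    """True if child equals parent or reaches it by following derive links."""
    if child == parent:
        return True
    return any(isa(p, parent) for p in parents.get(child, ()))

def find_method(methods, value, default=None):
    """Prefer an exact match, then any method registered for an ancestor."""
    if value in methods:
        return methods[value]
    for key, fn in methods.items():
        if isa(value, key):
            return fn
    return default

derive("zoo/duck", "zoo/bird")
derive("zoo/bird", "zoo/animal")

methods = {"zoo/bird": lambda: "tweet"}
duck_sound = find_method(methods, "zoo/duck")()
dog_method = find_method(methods, "zoo/dog")
```

With this in place, which implementation runs depends on derive calls that may sit far away from both the method definitions and the call site.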

Conclusion

All in all, Clojure is a well-designed language with neat features and it's certainly a lot better than most other JVM languages. There are things in it that I wish Scheme had, and it's certainly functional and modern. As a general programming language, though, it loses me: I just can't get over the JVM and all its Java trappings, which are just not my cup of tea.

Apart from the JVM, there are some gratuitous departures from traditional Lisps, especially the "rich syntax" and the extreme reliance on, and overloading of, keywords and maps.

As always, such things are a matter of taste, so take my opinion with a large grain of salt.

An appeal to the WHATWG

2018-09-11T20:09:20Z

As you may know, I co-maintain the uri-generic egg, together with Ivan Raikov. We had just been working on fixing a bug and porting it to CHICKEN 5 when I stumbled across the WHATWG URL specification, an evolution over RFC 3986. I found it hard to believe they dropped the formal grammar from the RFC, so I checked the issue queue and found a closed ticket from 2015.

They replaced the BNF with a series of steps which is several pages long and overly concerned with implementation-specific details.

It really got to me that such an important and basic part of the web stack is so informally specified. So I wrote an appeal to them to restore a formal grammar in this ticket. I think the reasons are worth being spread more widely, so I'm reproducing it here on my blog.

My request

I would like to offer my opinion from an implementor's perspective and hopefully convince the WG to restore a formal grammar. Let me start by providing some background on where I'm coming from. Feel free to skip this next section.

My background

I am the co-maintainer of the uri-generic egg for CHICKEN Scheme. This implementation attempts to follow RFC 3986 to the letter, and this has resulted in what IMO is a very high-quality implementation (at least, as far as parsing is concerned; URL construction still has some known issues). Oftentimes when we ran into issues, we've compared it with other implementations. It turns out that many of these are lacking in some way or another. I think the main reason is that they're not attempting to really implement the formal grammar (even if they claim to be RFC compliant), while we do. We even have a growing repository of alternative implementations using different parser generators which all pass the same test suite! (feel free to now call me a smug Lisp/Scheme weenie :) )

I wasn't aware of the WHATWG spec until I saw it mentioned in a libcurl post. It piqued my interest because I'm always looking for more test cases. The web platform test suite looks like a big, juicy set to start using in our egg's tests. I'd also consider implementing the WHATWG spec if this increases compatibility with other implementations.

What I expect from a spec

As an implementor, I routinely check the RFC's ABNF as a guide to determine what a valid URL should look like. If someone finds a certain URL our implementation doesn't parse, or if it parses an URL that it shouldn't, the first thing I do is go back to the ABNF in the RFC to verify the behaviour. It is compact, to the point and, for a trained eye, it is trivial to quickly determine if a parser should accept a given (sub)string or not.

The collected ABNF of RFC 3986 is a brief three screenfuls. In contrast, the algorithm in the WHATWG spec is roughly eighteen screenfuls. It is an overly detailed and nonstandard way of defining a grammar. This makes it harder to determine which language is accepted by this algorithm. It also makes it hard for me to determine what the changes are, compared to the RFC. Implementing the WHATWG spec would (for me) involve a complete rewrite.

The specification is so focused on the mechanics of a specific manual parsing technique that it almost precludes parser generators or other implementations. Parser generators have a long tradition in theory and practice, and can generate efficient language recognisers. Even today, it is an active research field; PEG grammars for example have been "discovered" as recently as 2004.

The way I think about it is that the purpose of this spec is to define what a URL "officially" looks like. So, as an implementor, I don't understand the hesitation to supply a formal grammar. Not having one will likely result in different people interpreting the spec differently. This results in _less_ interoperability, which defeats the point of a spec.

Other reasons why I think a formal grammar is important

Finally, I would like to emphasise the importance of parsers based on formal grammars over ad hoc ones for security reasons. Let's say you have a pipeline of multiple processors which use different URL parsers. For example, you might have an HTML parser on a comment form which cleans URLs by dropping JavaScript and data URLs, among other things, or a mail client which blocks intranet or file system-local URLs before invoking an HTML viewer. If these are all ad hoc "informal" parsers that try to "fix" syntactically invalid URLs, it is nigh-impossible to verify that filtering them for "safe" URLs is correct. That's because it's impossible to decide which language is really accepted by an ad hoc implementation. An implementation further down the stack might interpret an URL (radically) differently from one up the stack and you have a nice little exploit in the making.
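A contrived Python sketch shows how two plausible-looking ad hoc parsers can disagree on the same input. Both functions are invented for this illustration and are not taken from any real library; the host names are placeholders.

```python
def host_seen_by_filter(url):
    """Ad hoc parser A: the host is whatever sits between '//' and the next '/'."""
    rest = url.split("//", 1)[1]
    authority = rest.split("/", 1)[0]
    return authority.rsplit("@", 1)[-1]       # drop any userinfo part

def host_seen_by_fetcher(url):
    """Ad hoc parser B: cut off fragment, query and path before finding the host."""
    rest = url.split("//", 1)[1]
    for delimiter in "#?/":
        rest = rest.split(delimiter, 1)[0]
    return rest.rsplit("@", 1)[-1]

url = "http://evil.example#@trusted.example/"

filter_host = host_seen_by_filter(url)    # what the safety check believes
fetcher_host = host_seen_by_fetcher(url)  # where the request would really go
```

The filter approves the URL because it sees trusted.example, while the component that fetches it ends up at evil.example. RFC 3986 terminates the authority at the first "/", "?" or "#", so only the second parser matches the grammar here, but without a grammar to check against there is no way to say which one is "right".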

If you're not convinced by my measly attempts at explaining this idea, please watch the talk "The Science of Insecurity". Meredith Patterson states the case much more eloquently than I ever could. This talk was an absolute eye-opener for me.

With this context, it baffled me to read the statement that "there are several large parts of the spec that cannot be captured by any kind of grammar". This is literally equivalent to saying "we can't know if a URL will be valid without evaluating the algorithm". This means you cheerfully drag the halting problem into what should be a simple, straightforward notation (come on, URLs aren't that ill-defined!). As far as I can tell, the RFC defines a regular grammar. The decision to go from a regular to an unrestricted grammar should not be taken lightly!

What to expect from CHICKEN 5

2018-08-11T09:44:27Z

We're getting close to a CHICKEN 5 release, so let's take a look at the cool new stuff!

Overhaul of built-in modules

The biggest change you'll notice when you fire up CHICKEN and start to use it is that the modules that come shipped with core are completely different from CHICKEN 4. The functionality is mostly the same, but we moved things around (a lot!) to make things more logical.

This is also the main reason we decided to bump the major version number: the modules have different names, procedures have been renamed, merged or dropped.

You can take a look at the complete list in the CHICKEN 5 manual. We've taken the module layout from R7RS small as inspiration, but since CHICKEN is still an R5RS Scheme first (with r7rs being an optional extension) we had to make some changes.

So, we define a scheme module which contains the entire R5RS language. For everything that is a CHICKEN-specific extension to standard R5RS Scheme, we put it under a (chicken ...) name, which tries to follow the R7RS naming conventions.

For example, R7RS defines a (scheme process-context) module with the following procedures:

command-line
exit
emergency-exit
get-environment-variable
get-environment-variables

Likewise, CHICKEN defines a (chicken process-context) module, which is a superset of the corresponding R7RS module. Take a look at its manual page; you can see that it defines many more procedures, but it includes all the standard ones too.

By using the R7RS names but with scheme replaced by chicken, the new modules should be easy to remember for anyone used to R7RS. Of course, you can still write portable standard R7RS programs via the r7rs egg, which defines a 100% compatible (scheme process-context) module with only the R7RS identifiers.

There is one important caveat: Because our scheme module exports everything from R5RS Scheme, we don't provide, say, a (chicken cxr) module for all the cadadr, caddar and so on, because those are all in scheme. This also means that the (chicken load) module does not export load; that's already in scheme. Instead, it defines various non-standard CHICKEN extensions like load-relative and such.

Saner module imports

Speaking of modules, we've improved the way modules are linked into user code. In CHICKEN 4, there's a very strict distinction between modules and (compilation) units. This was an endless source of confusion for beginners. For example, why did (import foo) give an error when you tried to actually refer to an identifier from the foo module? That's because import didn't actually load the code, just the import library. To actually load the code and import the library, you needed (use foo). You could also load the code without importing it via (require-library foo). This was supposed to help with cross-compilation. The idea was that you would only need to load the import library on the host, and have the library itself compiled for the target, but in practice you needed to compile the library twice anyway (once on the host, once for the target).

We got rid of this mess: now the canonical way to import the foo library is simply (import foo). For more info, see this post by Felix outlining how to improve imports.

Full numeric tower

Of course, support for the full numeric tower is a personal favorite of mine, having spent a lot of time to perfect this stuff!

Most importantly, this means you no longer need to worry about integer computations over- or underflowing into a flonum and all the weird floating-point problems that entails. Bignums are also a necessity when dealing with 64-bit numeric C types in the FFI. For example, we finally support the size_t type correctly. To me, complex numbers and exact fractions (aka rational numbers) are a nice added bonus, as you could already get them before with the numbers egg. However, by having these types built-in, they're more efficient and you don't have to worry about passing these numbers to code that can't handle them because support happened not to be compiled in.

Take some time to read my blog series about the numeric tower if you're interested in the details.

Declarative egg description language

The chicken-install program to install eggs was rewritten along with all the surrounding tools. The main reason to do this was to make the life of package maintainers easier.

The old version of chicken-install would download, build, install and (optionally) run the unit tests as part of one command. If any dependencies were missing, it would also recursively download, build, install and run tests for those as well. The new version cleanly separates these steps, by generating shell scripts (batch files on Windows) that can do the necessary actions to build and install.

To make this easier, we also had to re-think the egg "language". In CHICKEN 4, a .setup file was simply a Scheme program in which a few helper procedures were available for calling the compiler. This made it impossible to create a simple shell script that would separate the build and install steps. That's why we now have a separate, declarative file which describes the components of an egg. See the .egg file documentation for a concrete example.

The rewritten chicken-install will now also cache eggs to avoid re-downloading the same eggs again and again. By default the cache is stored in a dot-directory under the user's home directory. This can be overridden with the CHICKEN_EGG_CACHE environment variable, which might also help package maintainers take the distributed files from another location.

See these design notes for more information about the goals and motivations behind the rewrite.

Improved support for static compilation

In principle, CHICKEN 4 has good support for static compilation. In practice, egg authors would not include the necessary commands for building their libraries statically. Most people don't have a real need for static linking, which means they tend not to make an effort to support it just in case someone else might need it.

The upshot of this was that you could only really compile programs statically when they didn't use any eggs, or if you created a custom build script that would compile the eggs manually with the required -static option. With the new chicken-install, you get static compilation support automatically, for free.

Note that in CHICKEN 4, you could also build eggs and programs using the so-called deployment mode. This allowed shipping a program with all its libraries in one directory. This worked quite well if your target platform supported it, but not all platforms did. Static compilation covers all the use cases that deployment supported and works reliably on all platforms, so we decided to drop deployment mode with all the complexity it brings.

Other noteworthy things

But wait, there's more!

Code generation is now fully deterministic, making builds reproducible. This allows you to verify that any given file of generated C code corresponds to the Scheme source code by recompiling it with the same CHICKEN version, both for user code and for CHICKEN core itself. As an added bonus, because the generated C output is deterministic, ccache can be used to get much faster builds (before, it would invalidate the cache as each file would be different).
We've improved how symbols are garbage collected, which was optional and somewhat broken in CHICKEN 4. This will speed up code that generates many symbols, and stops symbol table stuffing attacks from being a threat.
We have removed quite a bit of bloat: The srfi-1, srfi-13, srfi-14, srfi-69 and srfi-18 libraries have been removed from core! Not to worry though; they are now available as eggs. This will both allow faster development and encourage innovation and competition from alternatives to these non-essential libraries (especially R7RS-large seems to be geared towards renewal of some of these). We've also moved several non-SRFI procedures from core: object-evict, compile-file, binary-search, procedures for dealing with queues, scan-input-lines and POSIX group-information have all been moved to eggs. Support for SWIG has been removed, as it was bit-rotting and nobody seemed to be using it anyway.
Ports can now be bi-directional, so there's no more unnecessary distinction between input ports and output ports. This maps more cleanly to file descriptors, which can also be opened for both reading and writing.
Random number generation has been completely replaced. Before, we used libc's rand(), which produces very low quality random numbers. CHICKEN 5 uses the WELL512 PRNG to generate random integers, and it provides access to the system entropy pool for generating cryptographically secure streams of random bytes (using /dev/urandom on *nix, and RtlGenRandom on Windows).

Conclusion

There's a lot to like about the new CHICKEN, so go ahead and give it a spin! Release candidate 1 was made available today for you to try. The full list of changes can of course be found in the NEWS file. If you're already a happy CHICKEN 4 user, we've created a porting guide for you, to make it easier to make the transition from 4 to 5. If you need more help, you can of course contact the always friendly CHICKEN community.

CHICKEN's numeric tower: part 5

2016-10-20T18:01:20Z

Now that we have covered the most important algorithms, it's time to take a look at the internal data representation of extended numerals. This will be the final part in this series of posts.

Ratnums and cplxnums

Recall from my data representations article that fixnums are represented as immediate values, directly within a machine word. Flonums are represented in boxed form, by putting them in an opaque bytevector-like structure.

The data representations of complex numbers and rational numbers are pretty simple. Each have their own type tag, and they both contain two slots: the numerator and denominator in case of rational numbers, and the real and imaginary parts in case of complex numbers.

As you can see in the above diagram, the representations of ratnums and cplxnums are very similar. In the example, the slots contain just fixnums. Rational numbers are the simplest here: they can only contain integers (bignums or fixnums). Complex numbers can consist of any number type except other complex numbers, but the exactness of the real and imaginary components must match. This means you can't have 1.5+2/3i, for example.

In its most complex (haha!) form, a complex number contains a rational number in both the real and the imaginary parts, and these rational numbers both contain bignums as their numerator and denominator. In this situation, the entire complex number takes up a whopping minimum of 29 machine words: 3 words for the wrapper complex number, 2 times 3 words for each of the ratnums, and 4 times 5 words for the bignums inside the ratnums.
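The tally above can be double-checked with a few constants. This is only a worked calculation; the names are made up for illustration, and the 3-word size is taken to be one header word plus the two slots.

```c
#include <assert.h>

enum {
    CPLXNUM_WORDS    = 3, /* header + real slot + imaginary slot */
    RATNUM_WORDS     = 3, /* header + numerator slot + denominator slot */
    MIN_BIGNUM_WORDS = 5  /* smallest possible bignum, as stated in the text */
};

/* Minimum size of a complex number whose two parts are both ratnums
 * that in turn hold bignums as numerator and denominator. */
int largest_form_min_words(void) {
    return CPLXNUM_WORDS          /* the complex number itself */
         + 2 * RATNUM_WORDS       /* real and imaginary parts  */
         + 4 * MIN_BIGNUM_WORDS;  /* two bignums per ratnum    */
}
```

Here largest_form_min_words() evaluates to 3 + 6 + 20 = 29.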

We'll now look into why bignums require at least 5 words.

Bignums

Initially I tried to represent bignums as a kind of opaque bytevector, much like how flonums are stored. Memory-wise this is the best representation as it has no unnecessary overhead: only 2 extra words; one for the header and one for the sign. On a 32-bit machine it would look like this:

This representation is somewhat wasteful, because it uses a full machine word to represent the sign, which is only one bit of information! This is done to keep the bignum digits word-aligned, which is important for performance. The sign could be shoved into the header if we really wanted to be frugal on memory, but doing so would also complicate type detection. Alternatively, we could store the bignum's digits in 2s complement form so the sign is simply the high bit of the top digit, but that complicates several algorithms.

Regarding the "bytevector" part: because the limbs are word-aligned, it makes more sense to represent the size in words rather than bytes. Unfortunately, there's no way to do this with the current data representation of CHICKEN. This was the direct cause of the following bug: Someone tried to represent the largest known prime number in CHICKEN, and it failed to behave correctly because we didn't have enough header bits to represent its size. This was just for fun, so no harm was done, but when someone actually needs such numbers in practice, they're out of luck. One of these days we're going to have to tackle this problem...

Performance takes a hit

When I first integrated the "numbers" egg into CHICKEN 5, I also did some benchmarking. It turned out that my initial version made some benchmarks up to 8 times slower, though on average it would slow things down by a factor of 2. As pointed out by Alex Shinn and Felix Winkelmann, the reason it impacts some benchmarks so badly has to do with allocation.

Let's compile a loop going from zero to n, like so:

;; Very silly code, calculates 100 * 100 in a stupid way
(let lp ((i 0))
  (if (= i 100)
      (* i i)
      (lp (add1 i))))

Originally, in CHICKEN 4 without the full numeric tower, the compiled C code looked like this:

/* lp in k207 in k204 in k201 */
static void C_fcall f_220(C_word t0,C_word t1,C_word t2){
  C_word tmp;
  C_word t3;
  C_word t4;
  C_word t5;
  C_word t6;
  C_word *a;
loop:
  C_check_for_interrupt;
  if(!C_demand(C_calculate_demand(4, 0, 2))) {
    C_save_and_reclaim_args((void *)trf_220, 3, t0, t1, t2);
  }
  a=C_alloc(4); /* Allocate flonum for overflow situation */
  if(C_truep(C_i_nequalp(t2, C_fix(100)))) {
    t3=t1;
    {
      C_word av2[2];
      av2[0] = t3;
      av2[1] = C_a_i_times(&a, 2, t2, t2);
      ((C_proc)(void*)(*((C_word*)t3+1)))(2, av2);
    }
  } else {
    t3 = C_a_i_plus(&a, 2, t2, C_fix(1));
    C_trace("test.scm:4: lp");
    t5=t1;
    t6=t3;
    t1=t5;
    t2=t6;
    goto loop;
  }
}

It's not much to look at, but this is very close to optimal code: It's a C loop, which allocates a fixed size of memory from the stack/nursery into which it can write the result of + or *, in case they would overflow.

The compiler knows how it can compile + and * to "inlineable" C functions. Many of the most performance-critical functions are built into the compiler like that. But because the compiler (currently) doesn't perform range analysis, it's not smart enough to figure out that none of these operators in this example can cause an overflow. This bites us especially hard when introducing bignums: because we need to assume that any operator may overflow, we must be able to allocate a bignum. And assuming the result of these operators may be a bignum, the value fed into the next iteration of the loop may be a bignum as well. Adding two bignums of unknown sizes together results in another bignum of unknown size.

Because of the above, we can't pre-allocate in a tight C loop. Instead, we must split our loop in two. This is needed to allow the garbage collector to kick in: if you'll recall from the garbage collector post, we need a continuation both for liveness analysis and as a place to jump back to after GC.

One part of our lp calls an allocating procedure, wrapping up the other part in a continuation:

/* k251 in lp in k223 in k220 in k217 in k214 (first part of our "lp") */
static void C_ccall f_253(C_word c, C_word *av) {
  C_word tmp;
  C_word t0 = av[0];
  C_word t1 = av[1];
  C_word t2;
  C_word *a;
  C_check_for_interrupt;
  if(!C_demand(C_calculate_demand(0, c, 2))) {
    C_save_and_reclaim((void *)f_253, 2, av);
  }
  C_trace("test.scm:6: lp");
  t2 = ((C_word*)((C_word*)t0)[2])[1];
  f_236(t2, ((C_word*)t0)[3], t1);
}

/* lp in k223 in k220 in k217 in k214 (second part of our "lp") */
static void C_fcall f_236(C_word t0, C_word t1, C_word t2){
  C_word tmp;
  C_word t3;
  C_word *a;
  C_check_for_interrupt;
  if(!C_demand(C_calculate_demand(4, 0, 3))) {
    C_save_and_reclaim_args((void *)trf_236, 3, t0, t1, t2);
  }
  a=C_alloc(4);
  if(C_truep(C_i_nequalp(t2, C_fix(100)))) {
    C_trace("test.scm:5: *");
    {
      C_word av2[4];
      av2[0] = C_SCHEME_UNDEFINED;
      av2[1] = t1;
      av2[2] = t2;
      av2[3] = t2;
      C_2_basic_times(4,av2);
    }
  } else {
    /* Allocate continuation of (add1 i), which is the first part (f_253) */
    t3=(*a = C_CLOSURE_TYPE|3, a[1] = (C_word)f_253,
        a[2] = ((C_word*)t0)[2], a[3] = t1, tmp = (C_word)a, a += 4, tmp);
    C_trace("test.scm:6: add1");
    {
      C_word av2[4];
      av2[0] = C_SCHEME_UNDEFINED;
      av2[1] = t3;
      av2[2] = t2;
      av2[3] = C_fix(1);
      C_2_basic_plus(4,av2);
    }
  }
}

As you can imagine, allocating a continuation on the stack every time is pretty heavy, and function calling isn't as cheap as a goto loop either. The first part of the loop doesn't even do anything. It just acts as a continuation to be received by the plus call. You can probably imagine how terrible the code would look if we compiled something like (/ (* (+ a b) (+ c d)) 2). That's at least 4 continuations, instead of a few simple statements.

For this reason, my patch was rejected (and rightly so!). The message was clear: code that doesn't use bignums should never pay a performance penalty just because bignums exist.

In order to fix this situation, I had to come up with a radical change to how bignums worked, or face the possibility that a full numeric tower would not make it into CHICKEN 5.

Adding a new "scratch space" memory region

If we want to make the extended numeric operators as fast as the originals, we must be able to inline them. This prevents garbage collection, because we don't get the continuation for an inlined call. But what if they allocate some unknown quantity of memory? We can't allocate on the stack or heap, because that could cause either to fill up, requiring a GC.

So, the obvious solution is to allocate these objects elsewhere. A separate memory space in which bignums can be stored. But what if that space fills up? Don't we need to initiate a GC then? But this is where we're in luck: bignums are not compound objects! They are huge slabs of opaque data, much like strings. Because they can't refer to other objects, we are dealing with a simplified garbage collection problem: only the objects pointing to a bignum need to be updated.

Unfortunately, finding all the live objects that point to a bignum would be quite difficult. Luckily, like many problems in computer science, this can be easily solved by adding another level of indirection. While we're calling inline functions, we can allocate small objects on the stack, which will remain there, never moving until the next GC. We can use this to our advantage: whenever a bignum is needed, we allocate a fixed-size wrapper object on the stack. This object points into the scratch space, where the actual bignum data lives. See the following diagram:

In the diagram, we have a bignum representing the number 24386824307922, which we've put in a list and a vector, and we also have the rational number 1/24386824307922, which refers to the same bignum in its denominator. All these objects can be on the stack or on the heap. We have no control over them; the user can set any object slot to hold the bignum. We do have control over the wrapper object, and only the wrapper object directly points into scratch space. Because bignums are opaque objects in Scheme, the wrapper is invisible. Thus, user code is (in principle) unable to access the wrapper's data slot, so there will be no direct references to the bignum data portion. This means we're free to move it around without updating anything but the wrapper object's slot.

Note that in the scratch space, we also store a back-pointer to the wrapper object's slot. This allows us to update the wrapper object after moving its matching bignum data blob around. This way, we can reallocate the scratch space when more space is needed.

Some of the native functions like Karatsuba multiplication or Burnikel-Ziegler division generate many temporary values. All such hand-written code has been tuned to erase a bignum's back-pointer when that bignum is no longer needed. It makes the code quite a bit hairier, but it allows (limited) garbage collection to be done when reallocating the scratch space.
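The mechanism can be sketched in a few dozen lines of C. This is a toy model under simplifying assumptions, not CHICKEN's actual code: the type and function names are invented, each blob is laid out as [back-pointer][length][payload], and the scratch space is reallocated with malloc whenever it runs out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uintptr_t word;

struct wrapper { word *data; };  /* fixed-size object that never moves */

struct scratch {
    word  *base;      /* start of the region (NULL until first use) */
    size_t used;      /* words currently in use */
    size_t capacity;  /* words available in total */
};

/* Move all live blobs into a larger region, updating each wrapper
 * through the blob's back-pointer.  Dead blobs are simply dropped. */
static void scratch_grow(struct scratch *s, size_t needed) {
    size_t new_cap = (s->capacity + needed) * 2;
    word *fresh = malloc(new_cap * sizeof(word));
    size_t out = 0;
    if (fresh == NULL) abort();
    for (size_t in = 0; in < s->used; ) {
        word *blob = s->base + in;
        size_t total = (size_t)blob[1] + 2;
        if (blob[0] != 0) {                    /* back-pointer set: live */
            memcpy(fresh + out, blob, total * sizeof(word));
            *(word **)blob[0] = fresh + out + 2;  /* fix up the wrapper */
            out += total;
        }
        in += total;
    }
    free(s->base);
    s->base = fresh;
    s->used = out;
    s->capacity = new_cap;
}

/* Allocate len payload words for wrapper w; never touches other objects. */
word *scratch_alloc(struct scratch *s, struct wrapper *w, size_t len) {
    size_t total = len + 2;
    if (s->capacity - s->used < total) scratch_grow(s, total);
    word *blob = s->base + s->used;
    blob[0] = (word)&w->data;  /* back-pointer to the wrapper's slot */
    blob[1] = (word)len;
    s->used += total;
    w->data = blob + 2;
    return w->data;
}

/* Mark a blob as garbage by erasing its back-pointer. */
void scratch_release(struct wrapper *w) {
    assert(w->data != NULL);
    w->data[-2] = 0;
    w->data = NULL;
}
```

Anything that refers to the bignum holds a pointer to the wrapper, so growing the scratch space only has to rewrite one slot per live blob. A blob whose back-pointer was erased with scratch_release is skipped during the copy, which is the limited form of garbage collection described above.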

With this setup, all numeric operations only need to allocate memory to hold a bignum wrapper object. This is a fixed size, much like in CHICKEN 4, and it means numeric operations can once again be inlined!

Oh, and why does a bignum take up 5 words? Well, sometimes we know that a procedure receives 2 fixnums. In that case, we can pre-allocate a bignum for the case when it overflows. Because we know its maximum size in advance, there's no need for the scratch space; we can just allocate it in the nursery. For uniformity reasons, such a bignum still requires a wrapper object (2 words) and a bignum data blob (3 words: its header, the sign and one limb). This sounds complicated, but it shortens the specialised code for two fixnums, and allocating only in the nursery is also faster.
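As a rough picture of those 5 words, the two parts could be modelled like this. The struct and field names are illustrative only and do not mirror CHICKEN's real headers; the sketch also assumes that a pointer occupies exactly one machine word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word;   /* one machine word */

/* Fixed-size wrapper: the object that Scheme values refer to. */
struct bignum_wrapper {
    word  header;         /* type tag and size */
    word *data;           /* points at the data blob */
};

/* Smallest possible data blob: enough for one limb. */
struct min_bignum_blob {
    word header;          /* marks the blob as opaque data */
    word sign;            /* a full word, to keep the limbs word-aligned */
    word limb[1];
};

size_t min_bignum_words(void) {
    return (sizeof(struct bignum_wrapper) + sizeof(struct min_bignum_blob))
           / sizeof(word);
}
```

With these definitions min_bignum_words() comes out as 2 + 3 = 5 on the usual 32-bit and 64-bit platforms.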

Some parting thoughts

Adding full numeric tower support has been extremely educational for me. I'm not really a math person, but having a clear goal like this motivated me to dive deep into the literature. Overall, I'm happy with how it turned out, but there are always improvements.

For example, instead of doing everything in C it would (of course!) be preferable to do it all in Scheme. Unfortunately, CHICKEN's design makes it hard to do this in an efficient way: it's currently impossible to export Scheme procedures that can be called inline (i.e., non-CPS calls) without copying their full code into the call site. If we can find a way, it would be possible to do 90% of the algorithms in Scheme. The principle on which this would be based can be found in an old paper about the Lucid Common Lisp implementation. Basically, you implement a handful of primitives natively, and everything else can be done in Lisp. For example, SBCL is implemented this way too.

As far as I can tell, of the more popular Scheme implementations, Gambit is the only one that actually does this. I've been very impressed with Gambit in general. Besides having pretty readable Scheme code for bignum algorithms, Gambit has some superior bignum algorithms, most of which get close to (and in rare cases even surpass) GMP performance. This is mostly due to the hard work of Bradley Lucier, a numerical analyst who has also provided useful feedback on some of my work on the numbers egg, and this series of blog posts. He really knows this stuff! Most other Scheme implementations are in C and still pretty slow due to the algorithms they use, unless of course they use GMP.

In CHICKEN, there is a lot of room for optimisations. But I also think we shouldn't try to implement every algorithm under the sun. Things should generally be fast enough to serve the most common cases. Typical code doesn't use bignums, and if it does it's only small bignums (for instance, when using 64-bit integers with the FFI), which is why I think we should optimise for these cases. For example, my implementations of Karatsuba and Burnikel-Ziegler aren't great, so if anyone feels like having a stab at improving these things we already have (or simply replacing them with a better algorithm), please do!

References

Jon L White, Reconfigurable, Retargetable Bignums: A Case Study in Efficient, Portable Lisp System Building (or via Sci-Hub). This is a wonderful 1986(!) paper on how to elegantly have bignum algorithms in pure Lisp with a minimal amount of native code to get good performance.
After writing the scratch space stuff, I sent a mail to chicken-hackers explaining how it works.
There's a Larceny note that mentions how numbers are represented in Larceny, and a note that briefly mentions that it also uses the "retargetable" bignum approach.
Gambit's source code is extensively commented on representation and algorithms used.

CHICKEN's numeric tower: part 4

2016-10-18T17:42:03Z

In this instalment of the blog series, we'll take a look at how string->number and number->string are implemented for bignums.

Performing base conversions

Performing calculations is all nice and useful, but you eventually want to print the results back to the user. And of course the user needs a way to enter such large numbers into the system. So, converting between numbers and strings is an essential operation. The Scheme standard requires support for conversion between bases 2, 8, 10 and 16, but many practical implementations support conversion between arbitrary bases, and why not? It doesn't really require more effort.

Converting strings to numbers

Let's start with the simpler of the two directions: string->number. The naive way of converting a string in base n to a number is to scan the string from left to right (high to low), multiplying the result so far by n and then adding the current digit:

/* "result" is a pre-allocated, zeroed out bignum of the right size */
while (*str != '\0') /* Assuming NUL-terminated string */
{
  int digit = hex_char_to_digit((int)*str++);
  bignum_destructive_scale_up(result, radix);
  bignum_destructive_add(result, digit);
}

This is very simple and elegant, but also quite slow. The Scheme48 code also checks for invalid characters in this loop, while in CHICKEN, the reader performs this check. So, when converting a digit stream to a bignum, we already know the digit stream contains only valid digits.

A simple improvement can be made to this algorithm: we can avoid traversing the entire bignum for every string digit. Instead, we collect multiple digits in a register until we fill up a halfword. Then, we scale up the bignum, adding the collected halfword:

do {
  C_word big_digit = 0;  /* Collected digit */
  C_word factor = radix; /* Multiplication factor for bignum */

  /* Keep collecting from str while factor fits a half-digit */
  while (str < str_end && C_fitsinbignumhalfdigitp(factor)) {
    str_digit = hex_char_to_digit((int)*str++);
    factor *= radix;
    big_digit = radix * big_digit + str_digit;
  }

  /* Scaling up with carry avoids traversing bignum twice */
  big_digit = bignum_digits_destructive_scale_up_with_carry(
                digits, last_digit, factor / radix, big_digit);

  if (big_digit) { /* If there was a carry, increment size of bignum */
    (*last_digit++) = big_digit;
  }
} while (str < str_end);

Remember the previous post in this series? Now you know where the design behind bignum_digits_destructive_scale_up_with_carry comes from: we can scale up the bignum with an initial "carry" value that's our collected digit. The return value is the resulting carry (if any), so we know when to increase the bignum's size by moving the last_digit pointer. This pointer makes it easier to detect the final length of the bignum, which can be hard to predict precisely. We can't predict the exact size, but we can calculate the maximum size. Because we allocate this maximum size, this moving of the end pointer is safe.
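For readers who want to run this, here is a self-contained sketch of the same idea on plain arrays of 32-bit limbs (least significant limb first). It is a simplified variation rather than CHICKEN's code: the function names are invented, the factor starts at 1 instead of radix so no division is needed afterwards, and the caller is assumed to pass pre-validated digits and a large enough limb array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply the limbs in [start, end) by factor, adding carry to the
 * lowest limb.  Returns the carry that falls off the top. */
uint32_t scale_up_with_carry(uint32_t *start, uint32_t *end,
                             uint32_t factor, uint32_t carry) {
    for (uint32_t *p = start; p < end; p++) {
        uint64_t t = (uint64_t)*p * factor + carry;
        *p = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
    return carry;
}

/* Assumes ASCII and an already validated digit character. */
static uint32_t char_to_digit(char c) {
    if (c >= '0' && c <= '9') return (uint32_t)(c - '0');
    if (c >= 'a' && c <= 'z') return (uint32_t)(c - 'a' + 10);
    return (uint32_t)(c - 'A' + 10);
}

/* Parse a NUL-terminated digit string into limbs; returns limbs used. */
size_t parse_digits(const char *str, uint32_t radix, uint32_t *limbs) {
    uint32_t *last = limbs;   /* one past the highest limb in use */
    assert(radix >= 2 && radix <= 36);
    while (*str != '\0') {
        uint32_t chunk = 0;   /* value of the collected string digits */
        uint32_t factor = 1;  /* radix^(number of collected digits) */

        /* Keep collecting while the factor still fits a half-digit */
        while (*str != '\0' && factor * radix <= 0xFFFF) {
            chunk = chunk * radix + char_to_digit(*str++);
            factor *= radix;
        }

        /* One pass over the bignum: multiply and add the chunk */
        uint32_t carry = scale_up_with_carry(limbs, last, factor, chunk);
        if (carry != 0) *last++ = carry;  /* the number grew by a limb */
    }
    return (size_t)(last - limbs);
}
```

For example, parsing "18446744073709551617" (2^64 + 1) in base 10 takes five passes over the bignum instead of twenty, and leaves the limbs 1, 0, 1.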

If the string's base is a power of two, we can perform an even better optimisation: We don't need to multiply or add to the bignum, we can just write straight to the bignum's digits!

int radix_shift = C_ilen(radix) - 1;  /* Integer length (the power of two) */
C_uword big_digit = 0;       /* The current bignum digit being constructed */
int n = 0;                /* The number of bits read so far into big_digit */

/* Read from least to most significant digit.  This is much easier! */
while (str_end > str_start) {
  str_digit = hex_char_to_digit((int)*--str_end);

  big_digit |= (C_uword)str_digit << n;
  n += radix_shift;                             /* Processed n bits so far */

  if (n >= C_BIGNUM_DIGIT_LENGTH) {             /* Filled up the digit? */
    n -= C_BIGNUM_DIGIT_LENGTH;                 /* Number of remainder bits */
    *digits++ = big_digit;
    big_digit = str_digit >> (radix_shift - n); /* Keep only the remainder */
  }
}
/* If radix isn't an exact divisor of digit length, write final remainder */
if (n > 0) *digits++ = big_digit;

From my benchmarks it looks like CHICKEN's string->number implementation is among the fastest of the popular Scheme implementations for power of two bases, due to this bit banging loop.

Converting numbers to strings

The naive way of converting a number to a string in base n is to do the opposite of converting a string in base n to a number: we repeatedly divide by the target base and prepend the remainder (as a digit character) to the string.

char *characters = "0123456789abcdef";

/* This fills the "buf" array *back to front*, so index starts at the
 * end of "buf".  counter represents # of characters written.
 */
do {
  digit = bignum_destructive_scale_down(working_copy, radix);
  *index-- = characters[digit];

  /* If we reached the current string's length, reallocate in
   * increments of BIGNUM_STR_BLOCK_SIZE.
   */
  if (++counter == len) {
    char *newbuf = C_malloc(len + BIGNUM_STR_BLOCK_SIZE);
    if (newbuf == NULL) return ERR;

    C_memcpy(newbuf + BIGNUM_STR_BLOCK_SIZE, buf, len);
    C_free(buf);
    buf = newbuf;
    index = newbuf + BIGNUM_STR_BLOCK_SIZE - 1;
    len += BIGNUM_STR_BLOCK_SIZE;
  }
} while(bignum_length(working_copy) > 0);

This is the original version as provided by Scheme48. Again, it's very short and clean. It operates on a copy of the bignum, which it destructively scales down by dividing it by radix. The remainder digit is written to the string after conversion to a character. Many implementations use this algorithm, but it can be improved pretty easily, in basically the same way we improved the reverse operation. Instead of dividing by the radix on every loop iteration, you can chop off a big lump. This is the remainder of dividing by a large number. Then, you divide this remainder in a loop while emitting string digits until you hit zero, then repeat until the bignum is zero.

Another improvement over the Scheme48 code is that you can pre-calculate a (pessimistic) upper bound on the number of digits, so you can avoid the reallocation (which implies a memory copy). For powers of two, this can be done precisely. For other radixes you can shorten the buffer only once at the end, and only if it turns out to be necessary.

This is really very simple, except for the finishing up part, where we shorten the buffer:

int steps;
C_uword base;         /* This is the "lump" we cut off every time (divisor) */
C_uword *scan = start + C_bignum_size(bignum); /* Start scanning at the end */

/* Calculate the largest power of radix (string base) that fits a halfdigit.
 * If radix is 10, steps = log10(2^halfdigit_bits), base = 10^steps
 */
for(steps = 0, base = radix; C_fitsinbignumhalfdigitp(base); base *= radix)
  steps++;

base /= radix; /* Back down: we always overshoot in the loop by one step */

while (scan > start) {
  /* Divide by base. This chops "steps" string digits off of the bignum */
  big_digit = bignum_digits_destructive_scale_down(start, scan, base);

  if (*(scan-1) == 0) scan--; /* Adjust if we exhausted the highest digit */

  for(i = 0; i < steps && index >= buf; ++i) {      /* Emit string digits */
    C_uword tmp = big_digit / radix;
    *index-- = characters[big_digit - (tmp*radix)]; /* big_digit % radix */
    big_digit = tmp;
  }
}

/* Move index onto first nonzero digit.  We're writing a bignum
   here: it can't consist of only zeroes. */
while(*++index == '0');

if (negp) *--index = '-';

/* Shorten with distance between start and index. */
if (buf != index) {
  i = C_header_size(string) - (index - buf);
  C_memmove(buf, index, i); /* Move start of number to beginning. */
  C_block_header(string) = C_STRING_TYPE | i; /* Mutate strlength. */
}
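The pessimistic upper bound on the number of digits mentioned earlier can be sketched in a few lines of C. This is only an illustration under my own assumptions: the helper names bit_length and max_string_digits are hypothetical and are not CHICKEN's. The idea is that every string digit carries at least floor(log2(radix)) bits, so dividing the bit length by that and rounding up can never underestimate.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bits needed to represent x (0 for x == 0).
 * Hypothetical helper, not part of CHICKEN. */
static unsigned bit_length(unsigned x)
{
  unsigned len = 0;
  while (x != 0) { len++; x >>= 1; }
  return len;
}

/* Pessimistic upper bound on the number of string digits needed to
 * print an unsigned number of `nbits` bits in the given radix.
 * The sign character is not counted. */
static size_t max_string_digits(size_t nbits, unsigned radix)
{
  unsigned bits_per_digit;

  assert(radix >= 2);
  bits_per_digit = bit_length(radix) - 1;      /* floor(log2(radix)) */
  return (nbits + bits_per_digit - 1) / bits_per_digit;  /* Round up */
}
```

For a 32-bit number in radix 10 this gives 11, one more than the 10 digits that are really needed, which is why the buffer may have to be shortened at the end. For radix 16 it gives exactly 8.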

Finally, if the radix is a power of two, we can do a straight bit-to-bit extraction like we did with string->number:

int radix_shift = C_ilen(radix) - 1;    /* Integer length (the power of two) */
int radix_mask = radix - 1;                /* Bitmask of N ones (radix = 2ᴺ) */

/* Again, we go from least significant to most significant digit */
while (scan < end) {
  C_uword big_digit = *scan++;
  int big_digit_len = C_BIGNUM_DIGIT_LENGTH;

  while(big_digit_len > 0 && index >= buf) {
    int radix_digit = big_digit & radix_mask;    /* Extract one string digit */
    *index-- = characters[radix_digit]; /* Write it (as character) to string */
    big_digit >>= radix_shift;                            /* Drop this digit */
    big_digit_len -= radix_shift;
  }
}

Unfortunately, there are some caveats that make this slightly trickier than you would guess. The above code is simplified, and only works for some radixes. If your radix is 2ⁿ, each string digit takes n bits, so this code only works if the bignum digit size in bits is a multiple of n. Otherwise, you'll need to take care of overlaps. Octal numbers are the biggest problem here, because they're 3 bits per string digit and the bignum digit sizes of 32 or 64 bits don't divide cleanly by 3.

Getting this right complicates the algorithm enough to make it slightly too hairy to present here (there's some more shifting and if checks involved). If you're interested in how to handle this, you can always study the CHICKEN sources.
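To give an idea of what handling the overlap involves, here is one possible approach in C. It is not how CHICKEN does it. It assumes 32-bit bignum digits stored least significant first, and it uses a 64-bit accumulator to hold the bits that straddle two bignum digits. The function name pow2_radix_to_string is made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes digits[0..ndigits) (least significant digit first) into buf
 * as a NUL-terminated string in radix 2^shift, for 1 <= shift <= 5.
 * Returns a pointer to the first character of the result inside buf. */
static char *pow2_radix_to_string(const uint32_t *digits, size_t ndigits,
                                  unsigned shift, char *buf, size_t buflen)
{
  static const char characters[] = "0123456789abcdefghijklmnopqrstuvwxyz";
  uint64_t acc = 0;        /* Bits that have not been emitted yet */
  unsigned acc_bits = 0;   /* How many bits in acc are valid */
  uint32_t mask;
  char *index;
  size_t i;

  assert(ndigits >= 1);
  assert(shift >= 1 && shift <= 5);
  assert(buflen >= (32 * ndigits) / shift + 2);

  mask = (UINT32_C(1) << shift) - 1;
  index = buf + buflen - 1;
  *index = '\0';

  for (i = 0; i < ndigits; i++) {
    /* Put the next bignum digit above the leftover bits */
    acc |= (uint64_t)digits[i] << acc_bits;
    acc_bits += 32;
    while (acc_bits >= shift) {          /* Emit all complete string digits */
      *--index = characters[acc & mask];
      acc >>= shift;
      acc_bits -= shift;
    }
  }
  if (acc_bits > 0)                      /* Leftover high bits: one last digit */
    *--index = characters[acc & mask];

  /* Skip leading zeroes, but keep at least one character */
  while (*index == '0' && index[1] != '\0') index++;
  return index;
}
```

With the two digits 0xFFFFFFFF and 0x1 (that is, 2³³ - 1) and shift 3, the 11th octal digit is built from two bits of the first bignum digit and one bit of the second, which is exactly the overlap case the simplified loop cannot handle.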

Divide and conquer

Because of the many divisions, number->string is much slower than string->number. Luckily, we can speed up the former by relying once again on a recursive divide and conquer style algorithm.

This requires you to know the string's expected size in the target base, and divides the number by the base raised to half that size. For example, if you wish to convert the number 12345678 to a decimal string, you can decide to split it in two. If you had a perfect guess of the string's length (which is 8), you can split the expected string in two halves by dividing the number by 10⁴, or 10000, giving us the quotient 1234 and the remainder 5678. These can recursively be converted to a string and finally appended together. Note that if you have a pessimistic upper limit of the final string length, it'll be slower, but will still produce a correct result. The code for this is quite straightforward:

(define (integer->string/recursive n base expected-string-size)
  (let*-values (((halfsize) (fxshr (fx+ expected-string-size 1) 1))
                ((b^M/2) (integer-power base halfsize))
                ((hi lo) (quotient&remainder n b^M/2))
                ((strhi) (number->string hi base))
                ((strlo) (number->string (abs lo) base)))
    (string-append strhi
                   ;; Fix up any leading zeroes that were stripped from strlo
                   (make-string (fx- halfsize (string-length strlo)) #\0)
                   strlo)))

Because the number 120034, when split in two, generates 120 and 34, we need the make-string call to add back the leading zeroes, otherwise we would get 12034 as the output. This can be omitted if you have a more low-level number->string implementation which doesn't truncate leading zeroes.
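The "low-level" variant that never strips zeroes in the first place can be sketched in C on plain 64-bit integers. This is only a toy model of the idea, not the bignum code: the names emit_digits and u64_to_string are hypothetical, the radix is fixed at 10, and 20 digits is used as the pessimistic size estimate.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* base^exp; the caller makes sure the result fits in 64 bits */
static uint64_t integer_power(uint64_t base, unsigned exp)
{
  uint64_t result = 1;
  while (exp-- > 0) result *= base;
  return result;
}

/* Write n as exactly `width` decimal digits into out, zero-padded on
 * the left.  n must be smaller than 10^width. */
static void emit_digits(uint64_t n, unsigned width, char *out)
{
  if (width <= 4) {                        /* Small enough: plain loop */
    unsigned i;
    for (i = width; i-- > 0; n /= 10)
      out[i] = (char)('0' + n % 10);
  } else {
    unsigned lo_width = width / 2;
    unsigned hi_width = width - lo_width;
    uint64_t split = integer_power(10, lo_width);
    emit_digits(n / split, hi_width, out);             /* High half */
    emit_digits(n % split, lo_width, out + hi_width);  /* Low half */
  }
}

/* Convert n to a NUL-terminated decimal string; out needs 21 bytes. */
static void u64_to_string(uint64_t n, char *out)
{
  char tmp[20];
  unsigned start = 0;

  emit_digits(n, 20, tmp);     /* 20 digits always suffice for 64 bits */
  while (start < 19 && tmp[start] == '0')
    start++;                   /* Strip the padding once, at the very end */
  memcpy(out, tmp + start, 20 - start);
  out[20 - start] = '\0';
}
```

Because every recursive call writes a fixed-width field, the low half of 120034 comes out as 034 without any fix-up, and the only zeroes that need stripping are the ones at the very front.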

While I was researching this, I found out about a technique called the "scaled remainder tree" algorithm. This algorithm is supposedly even faster than the simple recursive algorithm I just showed you. Unfortunately, I was unable to wrap my head around it. Maybe you will have better luck!

Reading list

There doesn't seem to be that much information on how to efficiently perform base conversion, even though there are quite a few clever implementation techniques out there. If you spend the time searching, you'll be sure to find some gems.

The y-cruncher page on radix conversion is full of ideas. Y-cruncher is a program to calculate digits of pi by efficiently using multiple CPU cores. This page has several algorithms, including recursive string splitting and scaled remainder tree. Unfortunately, the program itself is proprietary so you can't study its implementations of these algorithms.
Modern Computer Arithmetic, which has quickly become my favourite bignum textbook, has an interesting alternative string to number technique based on iterative multiplication of the input string. I still want to try that out some day. It also hints at scaled remainder tree for number to string conversion, but doesn't explain how it works.
Division-Free Binary-to-Decimal Conversion by Cyril Bouvier and Paul Zimmermann explains a number to string conversion based on the scaled remainder tree. Unfortunately, this paper only confused me.
It looks like Gambit has a recursive divide-and-conquer implementation of string->number that's slightly different from ours. Their recursive number->string implementation is rather interesting too.

CHICKEN's numeric tower: part 3

2016-10-15T18:08:42Z

Now that you understand the basic bignum algorithms, let's look at various tricks to speed up these operations.

Faster multiplication

Like I mentioned in the previous part of this series, the primary school method for addition and subtraction is the fastest known, for bignums of any size. And you can't really get better than O(n). However, primary school multiplication is O(n²), and there are several better algorithms than that.

Multiplication by a fixnum

As you now know, multiplication is done by looping over the half-digits of the two bignum arguments in a nested loop. Nested loops are something to be avoided as much as possible, because this means you're looking at a quadratic time algorithm, in other words it will perform O(n²) operations.

When multiplying a bignum by a fixnum, the naive implementation is to "promote" the fixnum into a bignum, and then perform a standard bignum multiplication. However, if the fixnum fits in a half-digit, you can avoid the nested loop. Complexity-wise, this isn't a great improvement, as the outer loop is only run once anyway. But, because the small value fits in a machine word, you only read from one bignum address instead of two on every iteration of the inner loop. Now that makes a big difference! In Scheme48, this improved algorithm looks like this:

static void
bignum_destructive_scale_up(bignum_type bignum, bignum_digit_type factor)
{
  bignum_digit_type carry = 0;
  bignum_digit_type * scan = (BIGNUM_START_PTR (bignum));
  bignum_digit_type two_digits;
  bignum_digit_type product_low;
#define product_high carry
  bignum_digit_type * end = (scan + (BIGNUM_LENGTH (bignum)));
  BIGNUM_ASSERT ((factor > 1) && (factor < BIGNUM_RADIX_ROOT));
  while (scan < end)
    {
      two_digits = (*scan);
      product_low = ((factor * (HD_LOW (two_digits))) + (HD_LOW (carry)));
      product_high =
        ((factor * (HD_HIGH (two_digits))) +
         (HD_HIGH (product_low)) +
         (HD_HIGH (carry)));
      (*scan++) = (HD_CONS ((HD_LOW (product_high)), (HD_LOW (product_low))));
      carry = (HD_HIGH (product_high));
    }
  /* A carry here would be an overflow, i.e. it would not fit.
     Hopefully the callers allocate enough space that this will
     never happen.
   */
  BIGNUM_ASSERT (carry == 0);
  return;
#undef product_high
}

The current version of this algorithm in CHICKEN is a bit shorter and slightly more versatile because it returns the carry if the result doesn't fit:

static C_uword bignum_digits_destructive_scale_up_with_carry(
    C_uword *start, C_uword *end, C_uword factor, C_uword carry)
{
  C_uword digit, p;

  assert(C_fitsinbignumhalfdigitp(carry));
  assert(C_fitsinbignumhalfdigitp(factor));

  while (start < end) {
    digit = (*start);

    p = factor * C_BIGNUM_DIGIT_LO_HALF(digit) + carry;
    carry = C_BIGNUM_DIGIT_LO_HALF(p);

    p = factor * C_BIGNUM_DIGIT_HI_HALF(digit) + C_BIGNUM_DIGIT_HI_HALF(p);
    (*start++) = C_BIGNUM_DIGIT_COMBINE(C_BIGNUM_DIGIT_LO_HALF(p), carry);
    carry = C_BIGNUM_DIGIT_HI_HALF(p);
  }
  return carry;
}

This is a destructive operation, which means it doesn't operate on an "empty" target bignum. Instead, you copy the original bignum, which is then mutated in-place. That makes it faster because you're only reading and writing to a single digit array, so it's much more localised in memory. In the next post, we'll see why returning the carry can be helpful.

Strictly speaking, you could do the same for addition and subtraction: if one of the arguments is a bignum and the other a fixnum, you could destructively add or subtract it. In fact, Scheme48 does this, but the CHICKEN implementation does not. If you'll recall, the CHICKEN bignum implementation already copies the larger of the two numbers into a new bignum and modifies it in-place while adding the smaller number. The effect of this is almost the same as what Scheme48 does, while also improving the default case of adding two bignums.

There's a second optimisation that can be done, which Scheme48 does not do at all. If the factor by which we're multiplying is a power of two, you can simply shift the bignum left by the base-2 logarithm of the factor! The code for detecting this is pretty ugly, so I'm not going to show it. Just remember that this makes a huge difference in practice.
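For the curious, the idea itself (as opposed to CHICKEN's actual detection code) fits in a small C sketch. It assumes 32-bit digits stored least significant first and a factor that fits in one digit. Both function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* If factor is a power of two, return its base-2 logarithm, else -1. */
static int power_of_two_log(uint32_t factor)
{
  int shift = 0;

  /* A power of two has exactly one bit set, so clearing the lowest
   * set bit with factor & (factor - 1) must leave zero. */
  if (factor == 0 || (factor & (factor - 1)) != 0) return -1;
  while ((factor >>= 1) != 0) shift++;
  return shift;
}

/* Shift digits[0..n) left by `shift` bits in place, 0 <= shift < 32.
 * Returns the bits that were shifted out at the top (the carry). */
static uint32_t shift_left_in_place(uint32_t *digits, size_t n, unsigned shift)
{
  uint32_t carry = 0;
  size_t i;

  assert(shift < 32);
  if (shift == 0) return 0;
  for (i = 0; i < n; i++) {
    uint32_t d = digits[i];
    digits[i] = (d << shift) | carry;  /* Low bits come from the digit below */
    carry = d >> (32 - shift);         /* High bits move to the digit above */
  }
  return carry;
}
```

Multiplying by 16 then becomes a call to power_of_two_log(16), which returns 4, followed by a single pass of shifts and ors with no multiplication at all.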

Karatsuba's multiplication

Multiplying two bignums can be done faster, too. There are two important algorithms: Karatsuba multiplication and FFT-based multiplications like Schönhage-Strassen. I must confess that my understanding of the FFT is too weak to implement, let alone explain the Schönhage-Strassen algorithm, so if you want to read about that, check the reading list at the end of this post.

Luckily, the Karatsuba algorithm is easy to explain. It was named after the Russian mathematician Anatoly Karatsuba, who was the first to reject the idea that O(n²) is the best we can do. His approach is based on divide and conquer. It uses a simple algebraic trick to reduce work on each step.

Let's say we want to multiply two bignums of two limbs, x[1,2] and y[1,2] in base B. The result is of course xy. Rephrased as an equation:

 xy = (x₁·B + x₂) · (y₁·B + y₂)

This can be written out as:

 xy = (x₁·y₁)·B² + (x₁·y₂ + x₂·y₁)·B + x₂·y₂

If we call these three components a, b and c, then xy = a·B² + b·B + c.

Now, you can calculate all three components separately, but this requires exactly as many steps as the "school book" algorithm we already have, namely O(n²). However, the crucial insight that Karatsuba had is that you can derive b from a and c with a simpler multiplication. He actually did this by first making it more complex. This shows the man's genius: complicating things to simplify them isn't exactly intuitive! So we have a, b and c:

a = x₁·y₁
b = x₁·y₂ + x₂·y₁
c = x₂·y₂

We first complicate b by adding (a-a) and (c-c), then expand:

b = x₁·y₂ + x₂·y₁ + (a - a) + (c - c)
b = x₁·y₂ + x₂·y₁ + ((x₁·y₁) - (x₁·y₁)) + ((x₂·y₂) - (x₂·y₂))

Next, we remove the parentheses and reorder by moving one a and one c to the left, and finally we can re-group the moved variables:

b = x₁·y₂ + x₂·y₁ + x₁·y₁ - x₁·y₁ + x₂·y₂ - x₂·y₂
b = x₁·y₁ + x₂·y₂ + x₂·y₁ + x₁·y₂ - x₁·y₁ - x₂·y₂
b = (x₁ + x₂)·(y₁ + y₂) - x₁·y₁ - x₂·y₂

And ta-da, as if by magic, we have:

b = (x₁ + x₂)·(y₁ + y₂) - a - c

If you compare this with the original definition of b, you'll notice that the new definition has one multiplication instead of two. The multiplication factors are slightly larger than in the original formula, but because there's only one, we end up doing less work. Of course, we'll still have to do all those additions and subtractions, but with big enough numbers, it's faster than multiplying twice. That's because the complexity of this algorithm has dropped from O(n²) to something less. The exact complexity turns out to be O(n^log₂3), which is roughly O(n^1.585).

Now, this is the asymptotic complexity. You can easily tell from the new formula of b that it uses several more operations and it requires splitting up x and y in two parts, which requires allocating memory for them and copying out the parts. This is a lot of overhead, which means that Karatsuba's algorithm only becomes more efficient than the "school book" algorithm for (very) large numbers. The same is true for the FFT-based algorithms mentioned earlier: those improve performance only for even larger numbers, because they split up the numbers into even more pieces. This is a (better) reason why it's probably not worth adding the extra code and complexity of an FFT implementation to CHICKEN core.
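To make the algebra concrete, here is a toy version in C that applies the formula for b exactly once, to two 32-bit numbers split into 16-bit halves (so B = 2¹⁶). It is only meant to show that three multiplications are enough. A real bignum implementation recurses on the halves, and the function name karatsuba32 is made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 32-bit numbers with three multiplications instead of
 * four, following xy = a*B^2 + b*B + c with B = 2^16. */
static uint64_t karatsuba32(uint32_t x, uint32_t y)
{
  uint64_t x1 = x >> 16, x2 = x & 0xFFFF;   /* x = x1*B + x2 */
  uint64_t y1 = y >> 16, y2 = y & 0xFFFF;   /* y = y1*B + y2 */

  uint64_t a = x1 * y1;
  uint64_t c = x2 * y2;
  /* The single "slightly larger" multiplication: 17-bit factors */
  uint64_t b = (x1 + x2) * (y1 + y2) - a - c;

  return (a << 32) + (b << 16) + c;
}
```

None of the intermediate values can overflow 64 bits here, because a and c are below 2³² and b is below 2³⁴.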

In code, a naive implementation of Karatsuba's algorithm is quite simple as well:

;; Where this is called, we ensure that |x| <= |y|
(define (karatsuba-multiply x y)
  (let* ((rs (fx* (signum x) (signum y)))     ; Result's sign (-1 or +1)
         (x (abs x))
         (n (bignum-digit-count y))
         (n/2 (fxshr n 1))                    ; (floor (/ n 2))
         (bits (fx* n/2 *bignum-digit-bits*))
         (x-hi (extract-digits x n/2 #f))     ; x[n..n/2]
         (x-lo (extract-digits x 0 n/2)))     ; x[n/2..0]
    (let* ((y (abs y))
           (y-hi (extract-digits y n/2 #f))   ; y[n..n/2]
           (y-lo (extract-digits y 0 n/2))    ; y[n/2..0]
           (a  (* x-hi y-hi))
           (b  (* x-lo y-lo))
           (c  (* (- x-hi x-lo)
                  (- y-hi y-lo))))
      (* rs (+ (arithmetic-shift a (fx* bits 2))
               (+ (arithmetic-shift (+ b (- a c)) bits)
                  b))))))

Especially helpful here is that the operators prefixed by fx indicate clearly when we're working with fixnums. This uses absolute numbers because that's simpler to deal with, especially when using extract-digits to quickly get a range of bignum digits. It might be possible to make this faster by operating directly on the signed numbers.

In CHICKEN 5, this algorithm has been translated to C for stupid reasons which I will explain in a later post. But this Scheme code is taken directly from the numbers egg and cleaned up only slightly.

According to MCA, an optimal implementation should avoid allocating the intermediate calculations of (- x-hi x-lo) and (- y-hi y-lo). In a low level C-based implementation like CHICKEN 5's, it might be easier to perform in-place modification of these numbers, but so far I haven't been successful in doing this. Nevertheless, our Karatsuba implementation is efficient enough for now.

Sometimes, the Karatsuba algorithm is referred to as the Toom-Cook algorithm. That's because Toom and Cook figured out a way to generalise the algorithm. This way, you can split the numbers into any number of components, instead of two components like Karatsuba did. Apparently there's a sweet spot in number sizes where a 3-way split is faster than a 2-way split, but pretty soon after that, the numbers get big enough and the FFT-based algorithms overtake Toom-Cook in efficiency.

Faster division

Faster multiplication is interesting, but division is the real tortoise in this race, so let's see how we can speed that up. It turns out that the approaches are rather similar to those of multiplication.

Division by a fixnum

Recall that the division algorithm needs to "guess" how many times the denominator fits in the numerator based on the first half-digit (plus some magic surrounding the second half-digit). If the denominator itself is only a half-digit, there's no need to guess.

So, just like when multiplying by a small fixnum, we have a destructive division algorithm that operates on a copy of the numerator. The Scheme48 version I started with:

/* Given (denominator > 1), it is fairly easy to show that
   (quotient_high < BIGNUM_RADIX_ROOT), after which it is easy to see
   that all digits are < BIGNUM_RADIX. */

static bignum_digit_type
bignum_destructive_scale_down(bignum_type bignum, bignum_digit_type denominator)
{
  bignum_digit_type numerator;
  bignum_digit_type remainder = 0;
  bignum_digit_type two_digits;
#define quotient_high remainder
  bignum_digit_type * start = (BIGNUM_START_PTR (bignum));
  bignum_digit_type * scan = (start + (BIGNUM_LENGTH (bignum)));
  BIGNUM_ASSERT ((denominator > 1) && (denominator < BIGNUM_RADIX_ROOT));
  while (start < scan)
    {
      two_digits = (*--scan);
      numerator = (HD_CONS (remainder, (HD_HIGH (two_digits))));
      quotient_high = (numerator / denominator);
      numerator = (HD_CONS ((numerator % denominator), (HD_LOW (two_digits))));
      (*scan) = (HD_CONS (quotient_high, (numerator / denominator)));
      remainder = (numerator % denominator);
    }
  return (remainder);
#undef quotient_high
}

And the smaller version based, again, on the division algorithm from Hacker's Delight:

static C_uword bignum_digits_destructive_scale_down(
  C_uword *start, C_uword *end, C_uword denominator)
{
  C_uword digit, k = 0;
  C_uhword q_j_hi, q_j_lo;

  /* Single digit divisor case from Hacker's Delight, Figure 9-1,
   * adapted to modify u[] in-place instead of writing to q[].
   */
  while (start < end) {
    digit = (*--end);

    k = C_BIGNUM_DIGIT_COMBINE(k, C_BIGNUM_DIGIT_HI_HALF(digit)); /* j */
    q_j_hi = k / denominator;
    k -= q_j_hi * denominator;

    k = C_BIGNUM_DIGIT_COMBINE(k, C_BIGNUM_DIGIT_LO_HALF(digit)); /* j-1 */
    q_j_lo = k / denominator;
    k -= q_j_lo * denominator;
    
    *end = C_BIGNUM_DIGIT_COMBINE(q_j_hi, q_j_lo);
  }
  return k; /* The remainder */
}

And, just like with multiplication, if you're dividing by a power of two, there's an easy optimisation: you can simply shift the numerator right by as many bits as the base-2 logarithm of the denominator. The remainder is formed by the bits that were shifted out.
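A minimal sketch of that shift in C, assuming 32-bit digits stored least significant first and a shift count smaller than one digit. The function name shift_right_in_place is hypothetical and this is not CHICKEN's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Divide digits[0..n) by 2^shift in place, for 0 <= shift < 32.
 * Returns the remainder, which is simply the bits shifted out. */
static uint32_t shift_right_in_place(uint32_t *digits, size_t n, unsigned shift)
{
  uint32_t carry = 0;      /* Bits moving down from the digit above */
  uint32_t remainder;
  size_t i;

  assert(shift < 32);
  if (shift == 0 || n == 0) return 0;

  remainder = digits[0] & ((UINT32_C(1) << shift) - 1);
  for (i = n; i-- > 0; ) {             /* Most significant digit first */
    uint32_t d = digits[i];
    digits[i] = (d >> shift) | carry;
    carry = d << (32 - shift);
  }
  return remainder;
}
```

Like the scale-down functions above, this walks from the most significant digit to the least significant one, but it needs no division instructions at all.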

Burnikel-Ziegler division

And here too, there's a divide and conquer algorithm named after the mathematician (or, in this case, mathematicians) who discovered it. This algorithm is a lot more complicated than Karatsuba's relatively simple algebraic trick, and a lot harder to implement correctly. The paper is long and still not very helpful when it comes down to the crucial details. I found the super-elegant presentation in MCA to be more helpful in figuring out details. Especially the algorithm's start-up procedure is tricky to get right. I will use the explanation style from MCA because it is simpler than the original paper.

The algorithm is similar to the classical "school book" division algorithm, but "in the large". The basic idea is that we operate on partial bignums at a time instead of on half-digits. The core algorithm handles only 2n/n division. This means that the numerator must be twice the size of the denominator. It splits the numerator in two halves. Each half is then divided by the entire denominator and finally recombined to form the result. These two divisions are themselves also done in two steps, thereby making the numbers smaller in the recursion.

Because the intermediate division only divides by the first half of the denominator, the result may end up negative. So, like the schoolbook method, this algorithm also makes an adjustment when re-joining the two intermediate results. The core algorithm is as follows:

If the denominator is smaller than some limit, fall back to "primary school" algorithm, otherwise:
Split the denominator B in two: B₁·β + B₂. So, if B is a bignum of n limbs, the base β is half that.
Next, split the numerator A into four such parts: A₁·β³ + A₂·β² + A₃·β + A₄.
First half:
- Divide A₁·β + A₂ by B₁, yielding the guessed quotient Q̂₁ and remainder R₁ (the recursive step).
- Combine the remainder R₁ with A₃ and subtract Q̂₁·B₂, yielding R̂₁ = R₁·β + A₃ - Q̂₁·B₂.
- While R̂₁ < 0, adjust the guess; Q̂₁:=Q̂₁-1 and R̂₁:=R̂₁+B.
Second half:
- Divide R̂₁ by B₁, yielding the guessed quotient Q̂₂ and remainder R (another recursive step).
- Combine the remainder R with A₄ and subtract Q̂₂·B₂, yielding R̂ = R·β + A₄ - Q̂₂·B₂.
- While R̂ < 0, adjust the guess; Q̂₂:=Q̂₂-1 and R̂:=R̂+B.
Recombination of quotient after division:
- The final quotient is Q̂₁·β + Q̂₂, the final remainder is just R̂.

The interesting part is that B₂ is only ever used for checking the guess. It is not involved in any division. Of course, in the recursion B₂ is also split in two parts, so the high half will be used in the next division, and so on.

In the diagram you can see how it works on a (very) small sample:

In the diagram, B₁ = 31 is shown in white on the left in the first and fourth rows. B₂ = 21 is shown in green on the left in the second and final rows. In the first row you also see highlighted in white A₁·β + A₂ = 3456 and R₁ = 15. In the second row, A₃ = 78 is shown in white, as it drops down to form R̂₁ = R₁·β + A₃ = 1578 with Q̂₁ = 111.

Between the second and third rows, Q̂₁ = 111 is adjusted to 110 and R̂₁ = -753 is adjusted to 2368 by adding the denominator B = 3121.

In the third row we continue with the second half, dividing R̂₁ by B₁ in the same manner and then recombining R = 43 with A₄ = 67 and subtracting Q̂₂·B₂ = 1575. As no more adjustments are needed, we're done, with R̂ = 2792 and Q̂₂ = 75. Combine Q̂₁,Q̂₂ into Q = 11075 and we have our result!
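The same sample can be replayed with ordinary machine integers. The C sketch below follows the two "halves" literally, with β = 100, A = 34567867 and B = 3121. It ignores everything that makes the real algorithm hard (recursion, normalisation, the maximum-guess shortcut), and the function names are invented for this example. Note that with plain truncating division the second half first guesses 76 and needs one adjustment, but it arrives at the same Q̂₂ = 75 and R̂ = 2792.

```c
#include <assert.h>
#include <stdint.h>

/* One "half" of the algorithm: divide [hi, lo] by B = b1*beta + b2,
 * where hi is two beta-digits wide and lo is one beta-digit wide. */
static void divide_step(int64_t hi, int64_t lo, int64_t b1, int64_t b2,
                        int64_t beta, int64_t *q, int64_t *r)
{
  int64_t b = b1 * beta + b2;
  int64_t q_guess = hi / b1;            /* Guess with the high half of B only */
  int64_t r_guess = (hi % b1) * beta + lo - q_guess * b2;

  while (r_guess < 0) {                 /* Guess was too high: adjust */
    q_guess--;
    r_guess += b;
  }
  *q = q_guess;
  *r = r_guess;
}

/* Divide a (four beta-digits) by b (two beta-digits) in two steps. */
static void divide_2n_by_n(int64_t a, int64_t b, int64_t beta,
                           int64_t *q, int64_t *r)
{
  int64_t b1 = b / beta, b2 = b % beta;
  int64_t a12 = a / (beta * beta);      /* A1*beta + A2 */
  int64_t a3 = (a / beta) % beta;
  int64_t a4 = a % beta;
  int64_t q1, r1, q2;

  divide_step(a12, a3, b1, b2, beta, &q1, &r1);   /* First half */
  divide_step(r1, a4, b1, b2, beta, &q2, r);      /* Second half */
  *q = q1 * beta + q2;                            /* Recombine quotient */
}
```

Calling divide_2n_by_n(34567867, 3121, 100, &q, &r) yields q = 11075 and r = 2792, and indeed 3121 · 11075 + 2792 = 34567867.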

Burnikel and Ziegler present the algorithm in their paper in a bit of a roundabout way that didn't make sense to me at first. It requires understanding the big picture, which they don't really explain up front. So it's best to read the paper entirely, and then go back and re-read it to grasp the details. It's a bit bottom-up in a sense, because they refactor it into two algorithms; one for dividing numbers 2n/n, and one for dividing numbers 3n/2n. This confused me no end, as it resulted in a bit of a cyclic definition.

In the explanation I gave above, 2n/n is the overall algorithm as a whole. The first and second "halves" of the algorithm are really identical, and represented by Burnikel and Ziegler as two calls to the 3n/2n algorithm. This can be seen in the Scheme code below:

(define (digit-bits n) (fx* *bignum-digit-bits* n))    ; Small helper

;; Here and in 2n/1n we pass both b and [b1, b2] to avoid splitting
;; up the number more than once.  This is a helper function for 2n/n.
(define (burnikel-ziegler-3n/2n a12 a3 b b1 b2 n)
  (receive (q^ r1)
      (if (< (arithmetic-shift a12 (fxneg (digit-bits n))) b1)
          (let* ((n/2 (fxshr n 1))                     ; (floor (/ n 2))
                 (b11 (extract-digits b1 n/2 #f))      ; b1[n..n/2]
                 (b12 (extract-digits b1 0 n/2)))      ; b1[n/2..0]
            (burnikel-ziegler-2n/1n a12 b1 b11 b12 n))
          ;; Don't bother dividing if a1 is a larger number than b1.
          ;; We use a maximum guess instead (see paper for proof).
          (let ((base*n (digit-bits n)))
            (values (- (arithmetic-shift 1 base*n) 1)  ; B^n-1
                    (+ (- a12 (arithmetic-shift b1 base*n)) b1))))
    (let ((r1a3 (+ (arithmetic-shift r1 (digit-bits n)) a3)))
      (let lp ((r^ (- r1a3 (* q^ b2)))
               (q^ q^))
        (if (negative? r^)
            (lp (+ r^ b) (- q^ 1))                     ; Adjust!
            (values q^ r^))))))

;; The main 2n/n algorithm which calls 3n/2n twice.  Here, a is the
;; numerator, b the denominator, n is the length of b (in digits) and
;; b1 and b2 are the two halves of b (these never change).
(define (burnikel-ziegler-2n/1n a b b1 b2 n)
  (if (or (fxodd? n) (fx< n DIV-LIMIT))                ; Can't recurse?
      (quotient&remainder a b)                         ; Use school division
      (let* ((n/2 (fxshr n 1))
             ;; Split a and b into n-sized parts [a1, ..., a4] and [b1, b2]
             (a12 (extract-digits a n #f))             ; a[m..n]
             (a3  (extract-digits a n/2 n))            ; a[n..n/2]
             (a4  (extract-digits a 0 n/2)))           ; a[n/2..0]
        ;; Calculate high quotient and intermediate remainder (first half)
        (receive (q1 r1) (burnikel-ziegler-3n/2n a12 a3 b b1 b2 n/2)
          ;; Calculate low quotient and final remainder (second half)
          (receive (q2 r) (burnikel-ziegler-3n/2n r1 a4 b b1 b2 n/2)
            ;; Recombine quotient parts as q = [q1,q2]
            (values (+ (arithmetic-shift q1 (digit-bits n/2)) q2) r))))))

The reason b1, b2 are passed in but not a1, ..., a4 has to do with the "full" algorithm, which deals with unbalanced division where a may be bigger than 2n, given b of size n. There, b is constant, so it pays off to "cache" b1 and b2. Because a keeps changing, we don't cache it.

This full algorithm for dividing two numbers x and y of arbitrary lengths is as follows: If the denominator y is of size n, we can simply chop up the numerator x into n-sized pieces. We then perform a division algorithm on those pieces, using a sort of "sliding window" over x. This passes [x_{i+1},x_i] and y to 2n/n, and recombines the remainder r_i with x_{i-1} to get [r_i,x_{i-1}], which is used for 2n/n in the next iteration, and so on.

Well, in theory it's simple...

(define (quotient&remainder/burnikel-ziegler x y return-quot? return-rem?)
  ;; The caller will already have verified that abs(x) > abs(y), but we
  ;; need to remember the signs of the input and make them absolute.
  (let* ((q-neg? (if (negative? y) (not (negative? x)) (negative? x)))
         (r-neg? (negative? x))
         (x (abs x))
         (y (abs y))
         (s (bignum-digit-count y))
         ;; Define m as min{2^k|(2^k)*DIV_LIMIT > s}.
         ;; This ensures we shift as little as possible (less pressure
         ;; on the GC) while maintaining a power of two until we drop
         ;; below the threshold, so we can always split N in half.
         (m (fxshl 1 (integer-length (fx/ s DIV-LIMIT))))
         (j (fx/ (fx+ s (fx- m 1)) m))  ; j = s/m, rounded up
         (n (fx* j m))
         ;; Normalisation, just like with normal school division
         (norm-shift (fx- (digit-bits n) (integer-length y)))
         (x (arithmetic-shift x norm-shift))
         (y (arithmetic-shift y norm-shift))
         ;; l needs to be the smallest value so that a < base^{l*n}/2
         (l (fx/ (fx+ (bignum-digit-count x) n) n))
         (l (if (fx= (digit-bits l) (integer-length x)) (fx+ l 1) l))
         (t (fxmax l 2))
         (y-hi (extract-digits y (fxshr n 1) #f))   ; y[n..n/2]
         (y-lo (extract-digits y 0 (fxshr n 1))))   ; y[n/2..0]
    (let lp ((zi (arithmetic-shift x (fxneg (digit-bits (fx* (fx- t 2) n)))))
             (i (fx- t 2))
             (quot 0))
      (receive (qi ri) (burnikel-ziegler-2n/1n zi y y-hi y-lo n)
        (let ((quot (+ (arithmetic-shift quot (digit-bits n)) qi)))
          (if (fx> i 0)
              (let ((zi-1 (let* ((base*n*i-1 (fx* n (fx- i 1)))
                                 (base*n*i   (fx* n i))
                                 ;; x_{i-1} = x[n*i..n*(i-1)]
                                 (xi-1 (extract-digits x base*n*i-1 base*n*i)))
                            (+ (arithmetic-shift ri (digit-bits n)) xi-1))))
                (lp zi-1 (fx- i 1) quot))
              ;; De-normalise remainder, but only if necessary
              (let ((rem (if (or (not return-rem?) (eq? 0 norm-shift))
                             ri
                             (arithmetic-shift ri (fxneg norm-shift)))))
                ;; Return values (quot, rem or both) with correct sign:
                (cond ((and return-quot? return-rem?)
                       (values (if q-neg? (- quot) quot)
                               (if r-neg? (- rem) rem)))
                      (return-quot? (if q-neg? (- quot) quot))
                      (else (if r-neg? (- rem) rem))))))))))

As you can see, this procedure is extremely hairy. The trickery is in how the bignums are chopped up into n-sized pieces. The sizes we use should be nice powers of two, which improves the algorithm's effectiveness. Notice the (or (fxodd? n) (fx< n DIV-LIMIT)) check in 2n/1n; whenever that is true, we fall back to the school division. This has to be avoided as much as possible, so that's why we try to scale up the number x to nicely rounded multiples of n. At the same time, you have to make sure that the numbers don't grow too large, because that would create more work for the algorithm!

The particular calculation is tricky, but the idea is simple: you want to scale up both numbers to the closest power of two that's larger than the cutoff size. Then, the numerator is scaled up so that it is a size that's a multiple of n, the final size of the denominator. No doubt my implementation of this part of the algorithm can be simplified.
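The size calculation is easier to follow with concrete numbers. Here is a C sketch of just the m, j and n computation, following the comment in the code above (m = min{2^k | 2^k · DIV_LIMIT > s}). The cutoff value of 100 digits is an arbitrary assumption for this example and not CHICKEN's actual setting. The struct and function names are hypothetical.

```c
#include <assert.h>

#define DIV_LIMIT 100   /* Hypothetical cutoff in digits, for illustration */

/* Number of bits needed to represent x (0 for x == 0) */
static unsigned integer_length(unsigned x)
{
  unsigned len = 0;
  while (x != 0) { len++; x >>= 1; }
  return len;
}

struct bz_sizes { unsigned m, j, n; };

/* Given a denominator of s digits, pick the padded size n = j * m,
 * where m is a power of two, so n can be halved until it is below
 * DIV_LIMIT without ever becoming odd along the way. */
static struct bz_sizes bz_block_sizes(unsigned s)
{
  struct bz_sizes r;

  r.m = 1u << integer_length(s / DIV_LIMIT);  /* Smallest 2^k with 2^k * DIV_LIMIT > s */
  r.j = (s + r.m - 1) / r.m;                  /* s / m, rounded up */
  r.n = r.j * r.m;                            /* Padded denominator size */
  return r;
}
```

For a denominator of 1000 digits this gives m = 16, j = 63 and n = 1008. The number is padded by only 8 digits, and 1008 can be halved four times (504, 252, 126, 63) before it drops below the cutoff.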

Reading list

First, start with the reading list of the previous post, because most of those references discuss advanced algorithms as well. The ones below are either more specific or more advanced than the descriptions you'll find in the standard references.

The GMP manual has a chapter on Karatsuba Multiplication.
Gaudry et al, A GMP-based Implementation of Schönhage-Strassen’s Large Integer Multiplication Algorithm. The title says it all.
Martin Fürer, Faster Integer Multiplication. This paper describes what is currently the fastest known algorithm for multiplying extremely large numbers (based on FFT). This is asymptotically fastest, but for practical bignum sizes, Schönhage-Strassen remains king.
Yan-Bin Jia, Polynomial Multiplication and Fast Fourier Transform. Lab notes for a CS course on problem solving techniques. Pretty terse, but might be helpful in understanding how FFT works.
Burnikel & Ziegler, Fast Recursive Division. The original paper on this algorithm.
Karl Hasselstrom, Fast Division of Large Integers. This paper compares Newton's algorithm, the Burnikel-Ziegler algorithm and something called "Barrett's Algorithm". The conclusion seems to be that Burnikel-Ziegler is usually fastest.
Deamentiaemundi's blog post about implementing Burnikel-Ziegler in JavaScript, including code.

CHICKEN's numeric tower: part 2

2016-10-13T17:35:29Z

This is the second part documenting my journey to add full numeric tower support to CHICKEN core. In this post I explain some of the basic algorithms. You'll need to understand these before going on to the next part, which deals with fancier versions of these algorithms.

Classical numerical algorithms

Like I mentioned in my previous post, the Scheme48 numerical code used only the so-called "classical" algorithms. Comments in the Scheme48 code refer to Donald Knuth's seminal work, The Art of Computer Programming, Volume 2, chapter 4. Interestingly, after these classical algorithms, Knuth explains a few faster algorithms, but Scheme48 did not implement these.

Addition and subtraction

Addition and subtraction are extremely simple algorithms: you simply loop over the limbs of both numbers simultaneously, and add them together, taking care to propagate the carry or borrow from the previous position. This is the same algorithm you learned in primary school. The difference is that the computer can add a whole machine word, while at school you would handle one decimal position at a time. This is O(n) in complexity.

For subtraction the algorithm is the same, except it uses borrowing instead of carrying. You might wonder what happens if the value being subtracted is bigger than the one being subtracted from. If those numbers are both positive, that results in a negative number, but when subtracting a negative number from a smaller positive number, its result would be positive.

The solution is simple in case you're using unsigned representation with explicit sign: You compare the absolute values first. If the second value is larger than the first, you swap them first. Then you subtract them and toggle the sign of the result: If a - b = x, then multiplying both sides by -1 gives: -a + b = -x, or simply b - a = -x.
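As a rough illustration of the compare, swap and toggle approach, here is a C sketch for two non-negative magnitudes. It makes some simplifying assumptions: 32-bit limbs stored least significant first, and both numbers padded with zeroes to the same length n. The function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare two magnitudes of n limbs each; returns -1, 0 or 1. */
static int compare_magnitude(const uint32_t *x, const uint32_t *y, size_t n)
{
  while (n-- > 0) {                       /* Most significant limb first */
    if (x[n] != y[n]) return x[n] < y[n] ? -1 : 1;
  }
  return 0;
}

/* r = x - y, assuming x >= y so the final borrow is always zero. */
static void subtract_magnitude(const uint32_t *x, const uint32_t *y,
                               uint32_t *r, size_t n)
{
  uint32_t borrow = 0;
  size_t i;

  for (i = 0; i < n; i++) {
    uint32_t xi = x[i], yi = y[i];
    r[i] = xi - yi - borrow;              /* Unsigned wraparound is fine */
    borrow = (xi < yi) || (xi == yi && borrow);
  }
  assert(borrow == 0);
}

/* r = |x - y|; returns 1 if x - y is negative, 0 otherwise. */
static int subtract(const uint32_t *x, const uint32_t *y,
                    uint32_t *r, size_t n)
{
  if (compare_magnitude(x, y, n) < 0) {
    subtract_magnitude(y, x, r, n);       /* Swap: b - a = -(a - b) */
    return 1;                             /* ...and toggle the sign */
  }
  subtract_magnitude(x, y, r, n);
  return 0;
}
```

So subtracting 7 from 5 computes 7 - 5 = 2 and reports the result as negative, and the borrow loop itself never has to deal with a result that dips below zero.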

As far as I'm aware, the primary school algorithm is it. There are no shortcuts, and no quicker ways around it. However, Scheme48 used a surprising representation for their bignums: the limbs inside the bignum did not make use of the top two bits in the machine word. Presumably they did this for portability and correctness. You see, in C, signed overflow is undefined, just so compilers can eke out a little more performance. I think this is completely ridiculous, and it's another source of security issues, but that's what life with C is like.

However, CHICKEN uses the -fwrapv compiler option to enforce sane overflow behaviour. That means CHICKEN bignums are free to use all available bits in a machine word. This representation will also use slightly less memory for really large bignums, especially on 32-bit systems. But, more importantly, it's faster because there's less masking and checking going on. Here's the heart of Scheme48's bignum addition:

while (scan_y < end_y)
{
  sum = ((*scan_x++) + (*scan_y++) + carry);
  if (sum < BIGNUM_RADIX)     /* No overflow */
    {
      (*scan_r++) = sum;
      carry = 0;
    }
  else                        /* Overflow, adjust and set carry */
    {
      (*scan_r++) = (sum - BIGNUM_RADIX);   /* sum modulo radix */
      carry = 1;
    }
}

And here is CHICKEN's:

while (scan_y < end_y) {
  digit = *scan_r;
  if (carry) {
    sum = digit + *scan_y++ + 1;
    carry = sum <= digit; /* Overflow if wrapped result is smaller or equal */
  } else {
    sum = digit + *scan_y++;
    carry = sum < digit;  /* Overflow if wrapped result is smaller */
  }
  (*scan_r++) = sum;
}

Aside from the difference in coding style, you can see that Scheme48 needs to adjust the result if we got a carry. The BIGNUM_RADIX is the maximum bignum digit value plus one. In terms of instructions, this masking and checking doesn't make that much of a difference, surprisingly enough.

But while tweaking this algorithm, I discovered that a nice performance improvement could be gained: First, copy the larger bignum to the result bignum, and then you loop over the second bignum, adding its limbs to the result's limbs, modifying it in-place. I suppose this is faster because you're only manipulating two pointers at a time rather than three. This is why scan_x is not used in the CHICKEN version. This requires memcpy to be fast, so on some systems, the CHICKEN approach can potentially be slower than the Scheme48 one.
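Here's a self-contained sketch of that copy-then-add-in-place approach. It is not the actual CHICKEN code: it assumes 32-bit limbs stored least significant first, a result buffer with room for one extra limb, and a made-up function name, add_in_place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* r = x + y, where x has nx limbs and y has ny <= nx limbs.
 * r must have room for nx + 1 limbs. */
void add_in_place(uint32_t *r, const uint32_t *x, size_t nx,
                  const uint32_t *y, size_t ny)
{
  uint32_t carry = 0;
  size_t i;

  assert(ny <= nx);
  memcpy(r, x, nx * sizeof *r);  /* copy the larger number first */
  r[nx] = 0;                     /* room for a final carry */

  /* Add the smaller number's limbs into the result, in place */
  for (i = 0; i < ny; i++) {
    uint32_t digit = r[i];
    uint32_t sum = digit + y[i] + carry;  /* wraps around on overflow */
    /* With an incoming carry, a wrapped sum may also equal digit */
    carry = carry ? (sum <= digit) : (sum < digit);
    r[i] = sum;
  }

  /* Propagate the last carry through the remaining limbs */
  for (; carry && i <= nx; i++) {
    r[i] += 1;
    carry = (r[i] == 0);
  }
}
```

Because the limbs are unsigned, the wraparound here is well-defined in C, so this particular sketch doesn't depend on -fwrapv.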

Multiplication

Multiplication is where things start to get more interesting. The classical algorithm is still pretty basic, but much slower because it's O(n²) in complexity. This is because in this algorithm, we multiply each position in the first number by every position in the other number, in a nested loop:

As the graphic attempts to clarify, we take only half-digits when multiplying, because the result must fit a single digit. This slows things down even further, because we can only iterate over the limbs at half speed. In Scheme48's code, it looked like this:

#define x_digit x_digit_high
#define y_digit y_digit_high
#define product_high carry
while (scan_x < end_x)
  {
    x_digit = (*scan_x++);
    x_digit_low = (HD_LOW (x_digit));
    x_digit_high = (HD_HIGH (x_digit));
    carry = 0;
    scan_y = start_y;
    scan_r = (start_r++);
    while (scan_y < end_y)
      {
        y_digit = (*scan_y++);
        y_digit_low = (HD_LOW (y_digit));
        y_digit_high = (HD_HIGH (y_digit));
        product_low =
          ((*scan_r) +
           (x_digit_low * y_digit_low) +
           (HD_LOW (carry)));
        product_high =
          ((x_digit_high * y_digit_low) +
           (x_digit_low * y_digit_high) +
           (HD_HIGH (product_low)) +
           (HD_HIGH (carry)));
        (*scan_r++) =
          (HD_CONS ((HD_LOW (product_high)), (HD_LOW (product_low))));
        carry =
          ((x_digit_high * y_digit_high) +
           (HD_HIGH (product_high)));
      }
    (*scan_r) += carry;
  }

The #define statements at the start are rather interesting, and seem to have been meticulously chosen to maximise re-use of variables. This was probably done to cajole inefficient compilers into re-using registers. Some of the bignum code is originally from 1986, when C compilers weren't very sophisticated! The HD_CONS macro combines two halfwords together, while the HD_LOW and HD_HIGH extract the low and high halfword from a machine word, respectively:

#define HD_LOW(digit) ((digit) & BIGNUM_HALF_DIGIT_MASK)
#define HD_HIGH(digit) ((digit) >> BIGNUM_HALF_DIGIT_LENGTH)
#define HD_CONS(high, low) (((high) << BIGNUM_HALF_DIGIT_LENGTH) | (low))

Remember, Scheme48 bignum digits use only 30 bits on a 32-bit machine and 62 bits on a 64-bit machine, so the masking and shifting is required. Because CHICKEN bignum digits now used the full machine word, I was able to replace it with another, much shorter implementation, which relies on "automatic" truncation of machine words:

/* From Hacker's Delight, Figure 8-1 (top part) */
for (j = 0; j < length_y; ++j) {
  yj = C_uhword_ref(yd, j);
  if (yj == 0) continue;
  carry = 0;
  for (i = 0; i < length_x; ++i) {
    product = (C_uword)C_uhword_ref(xd, i) * yj +
              (C_uword)C_uhword_ref(rd, i + j) + carry;
    C_uhword_set(rd, i + j, product);
    carry = C_BIGNUM_DIGIT_HI_HALF(product);
  }
  C_uhword_set(rd, j + length_x, carry);
}

As the comment says, this code is adapted from the fantastic booklet "Hacker's Delight" by Henry S. Warren, so any elegance you see in this code is not due to me! The original code is even more elegant, but it assumes little-endian order of bignum digits and the halfwords within these digits. On big endian machines the halfwords will be swapped within their machine words, so I introduced C_uhword_ref and C_uhword_set, which are ugly macros to select the higher/lower halfword of the relevant machine word:

/* The bignum digit representation is fullword- little endian, so on
 * LE machines the halfdigits are numbered in the same order.  On BE
 * machines, we must swap the odd and even positions.
 */
#ifdef C_BIG_ENDIAN
#define C_uhword_ref(x, p)           ((C_uhword *)(x))[(p)^1]
#else
#define C_uhword_ref(x, p)           ((C_uhword *)(x))[(p)]
#endif
#define C_uhword_set(x, p, d)        (C_uhword_ref(x,p) = (d))

The (C_uhword *) casts here ensure that only a halfword is extracted. Most machines have an instruction to fetch a halfword into a register, which is much faster than masking it out. So, even if it's ugly and hacky, I vastly prefer this over the Scheme48 code.
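If you want to play with the halfdigit idea outside of CHICKEN, the following sketch does the same thing with standard C types. It assumes the number is already stored as an array of 16-bit halfdigits, least significant first, so the endianness macros aren't needed. The name multiply_halfdigits is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* r = x * y, classical O(n^2) multiplication.
 * x has nx halfdigits, y has ny; r must have room for nx + ny. */
void multiply_halfdigits(uint16_t *r, const uint16_t *x, size_t nx,
                         const uint16_t *y, size_t ny)
{
  size_t i, j;

  for (i = 0; i < nx + ny; i++) r[i] = 0;

  for (j = 0; j < ny; j++) {
    uint32_t carry = 0;
    for (i = 0; i < nx; i++) {
      /* Halfdigit times halfdigit, plus two halfdigit-sized values,
       * always fits in a full 32-bit word. */
      uint32_t product = (uint32_t)x[i] * y[j] + r[i + j] + carry;
      r[i + j] = (uint16_t)product;  /* low half, by truncation */
      carry = product >> 16;         /* high half */
    }
    r[j + nx] = (uint16_t)carry;
  }
}
```

The worst case shows why halfdigits are enough: 0xFFFF * 0xFFFF + 0xFFFF + 0xFFFF is exactly 0xFFFFFFFF, so the intermediate product can never overflow the full word.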

Division

Oh boy, where to start? The above algorithms are so simple, but division, now that's quite a can of worms. To make things worse, many textbooks (including Knuth) gloss over important details, assuming that readers can figure it out on their own.

The first problem is that, unlike the above algorithms, the traditional pen and paper-algorithm doesn't translate well to the computer. Let's look at an example division, performed by hand as you would have learned it in primary school. Here, we divide 543456 (the dividend or numerator) by 344 (the divisor or denominator):

The notation might differ slightly from what you're used to (different schools use different notations, apparently), but the algorithm should be familiar: Given a denominator of n digits, you take n+1 digits from the numerator (but n in the first step!), then divide them by the denominator. You write the quotient on the right. Then you multiply that quotient digit by the denominator, subtract the product from the digits you took from the numerator, and continue with the next digit, until you hit the last digit of the numerator. The final subtraction gives you the remainder at the bottom, and the digits you wrote on the right together form the quotient.

There is a problem with this "algorithm", though: it requires you to divide each numerator part by the entire denominator. If the denominator is a bignum, you're still dividing one bignum by another! Using this algorithm recursively won't work either, because it doesn't reduce the denominator's size.

However, it turns out that you can guess the results of these intermediate divisions, based on the first few digits of both numbers. Intuitively, you can get a pretty good guess of how many times a number fits in another by doing a trial division of their leading digits.

For example, a number like 3xx can fit about 2x times in a number like 7xxx. In other words, our guess is 7/3 = 2. For example, the number 300 will fit 23 times in 7000. This guess isn't completely accurate: for example, the number 399 will fit only 17 times in 7000. Note that the leading digit is now a 1 instead of a 2, which means our guess was bad. So in some cases we need to correct the guess. Note that a guess may never exceed 9, because we're calculating one decimal position of the quotient. All this leads to the following relatively simple algorithm:

Make a guess based on trial division of the leading digits as described above;
Multiply the denominator by the guess, to get a result;
Subtract this result from the numerator, but:
If the subtraction goes below zero, add back the denominator and adjust the guess.
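The steps above can be sketched in C, using decimal digits so the numbers stay readable. This is only an illustration under simplifying assumptions: the denominator and the running remainder are plain machine integers rather than bignums, and the names quotient_digit and long_divide are made up. The point is that the guess only looks at the leading digits, and is then corrected downwards.

```c
#include <assert.h>
#include <stddef.h>

/* Guess one quotient digit of rem / den from the leading digits only,
 * then correct the guess.  Requires den > 0 and rem < 10 * den. */
static unsigned quotient_digit(unsigned long rem, unsigned long den,
                               unsigned *corrections)
{
  unsigned long rem_lead = rem, den_lead = den;
  unsigned guess;

  assert(den > 0 && rem / 10 < den);

  /* Strip trailing digits until one denominator digit is left */
  while (den_lead >= 10) { den_lead /= 10; rem_lead /= 10; }

  guess = (unsigned)(rem_lead / den_lead);
  if (guess > 9) guess = 9;  /* one decimal position at most */

  /* The guess can only be too high, so adjust downwards */
  while ((unsigned long)guess * den > rem) {
    guess--;
    (*corrections)++;
  }
  return guess;
}

/* Divide a string of decimal digits by den.  Returns the quotient and
 * stores the remainder and the total number of corrected guesses. */
unsigned long long_divide(const char *digits, unsigned long den,
                          unsigned long *remainder, unsigned *corrections)
{
  unsigned long quotient = 0, rem = 0;
  unsigned q;

  *corrections = 0;
  for (; *digits != '\0'; digits++) {
    rem = rem * 10 + (unsigned long)(*digits - '0');
    q = quotient_digit(rem, den, corrections);
    rem -= q * den;
    quotient = quotient * 10 + q;
  }
  *remainder = rem;
  return quotient;
}
```

Dividing "543456" by 344 this way gives a quotient of 1579 and a remainder of 280, after three corrections of a guess that was too high.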

This algorithm would work, but it takes many iterations. It can be improved by taking into account two leading digits of the denominator, instead of one. This improves the accuracy of the guess, and it can be done easily if we only use halfdigits in our calculation (which we'll have to do anyway to avoid overflow when multiplying). In the picture below, for simplicity and brevity, each digit represents one halfword.

The picture above is pretty complicated! I hope it clarifies the algorithm a little. The picture clearly shows two places where this algorithm guessed wrong, in which case we need to adjust some values (shown in red).

To understand the algorithm, first note the highlighted quotient digits with a question mark below them. These indicate that the quotient digit is a guess.

We tentatively multiply this guess by the first halfdigit of the denominator, and subtract it from the current remainder, giving a result in green. Then, we append the next digit from the denominator (in blue) to the result we just got. Finally, we multiply the next digit from the numerator (yellow) by its first digit, and see if the number is no larger than the combined intermediate remainder. If so, the guess was correct; otherwise the guess is incorrect, because the remainder would be negative.

If the guess was wrong, we need to adjust the guess by subtracting one and performing the check again until the guess is correct. You can see this happening near the bottom of the first column in the above picture.

Once we have a correct guess based on the first two halfdigits, we go ahead and calculate the remainder. To do this, we multiply the full n digits of the denominator by the guess, and subtract the result from the first n digits of the remaining numerator. All this can be done simultaneously, in O(n), even!

Unfortunately, after having calculated the remainder, it can turn out negative. This means the original guess was bad after all! In this case we must make a last-minute adjustment, by subtracting one from the quotient, and then adding the denominator to the remainder. This is shown in the picture in the first two steps of the second column.

The actual implementation of this horribly complicated algorithm in Scheme48 was also very complex and extremely long (it's all of the stuff between lines 1045 and 1383). So, instead of attempting to understand and rework this to be faster and more consistent with CHICKEN core, once again I opted to steal an implementation from Hacker's Delight. It looks like this:

static C_regparm void
bignum_destructive_divide_normalized(C_word big_u, C_word big_v, C_word big_q)
{
  C_uword *v = C_bignum_digits(big_v),
          *u = C_bignum_digits(big_u),
          *q = big_q == C_SCHEME_UNDEFINED ? NULL : C_bignum_digits(big_q),
           p,               /* product of estimated quotient & "denominator" */
           hat, qhat, rhat, /* estimated quotient and remainder digit */
           vn_1, vn_2;      /* "cached" values v[n-1], v[n-2] */
  C_word t, k;              /* Two helpers: temp/final remainder and "borrow" */
  /* We use plain ints here, which theoretically may not be enough on
   * 64-bit for an insanely huge number, but it is a _lot_ faster.
   */
  int n = C_bignum_size(big_v) * 2,       /* in halfwords */
      m = (C_bignum_size(big_u) * 2) - 2; /* Correct for extra digit */
  int i, j;                               /* Just two loop variables */

  /* Part 2 of Gauche's aforementioned trick: */
  if (C_uhword_ref(v, n-1) == 0) n--;

  /* These won't change during the loop, but are used in every step. */
  vn_1 = C_uhword_ref(v, n-1);
  vn_2 = C_uhword_ref(v, n-2);

  /* See also Hacker's Delight, Figure 9-1.  This is almost exactly that. */
  for (j = m - n; j >= 0; j--) {
    /* First, determine the initial guess: */
    hat = C_BIGNUM_DIGIT_COMBINE(C_uhword_ref(u, j+n), C_uhword_ref(u, j+n-1));
    if (hat == 0) {
      if (q != NULL) C_uhword_set(q, j, 0);
      continue;
    }
    qhat = hat / vn_1;
    rhat = hat % vn_1;

    /* Next, keep making early adjustments to the guess
     * until it matches the first two digits:
     */

    /* Two whiles is faster than one big check with an OR.  Thanks, Gauche! */
    while(qhat >= (1UL << C_BIGNUM_HALF_DIGIT_LENGTH)) { qhat--; rhat += vn_1; }
    while(qhat * vn_2 > C_BIGNUM_DIGIT_COMBINE(rhat, C_uhword_ref(u, j+n-2))
          && rhat < (1UL << C_BIGNUM_HALF_DIGIT_LENGTH)) {
      qhat--;
      rhat += vn_1;
    }

    /* Finally, multiply and subtract: */
    k = 0;
    for (i = 0; i < n; i++) {
      p = qhat * C_uhword_ref(v, i);
      t = C_uhword_ref(u, i+j) - k - C_BIGNUM_DIGIT_LO_HALF(p);
      C_uhword_set(u, i+j, t);
      k = C_BIGNUM_DIGIT_HI_HALF(p) - (t >> C_BIGNUM_HALF_DIGIT_LENGTH);
    }
    t = C_uhword_ref(u,j+n) - k;
    C_uhword_set(u, j+n, t);

    /* Subtracted too much?
     * Make a late adjustment by adding back the denominator:
     */
    if (t < 0) {
      qhat--;
      k = 0;
      for (i = 0; i < n; i++) {
        t = (C_uword)C_uhword_ref(u, i+j) + C_uhword_ref(v, i) + k;
        C_uhword_set(u, i+j, t);
        k = t >> C_BIGNUM_HALF_DIGIT_LENGTH;
      }
      C_uhword_set(u, j+n, (C_uhword_ref(u, j+n) + k));
    }
    if (q != NULL) C_uhword_set(q, j, qhat);
  } /* end of "j" loop */
}

There are some shoutouts to Gauche, which is a beautifully-crafted Scheme implementation in C. The particular "trick" referred to here simplifies the calculation of our allocation sizes a little bit by ensuring we never shift more than a halfdigit when normalising (see next section).

As you can see from the implementation, the "multiply and subtract" is actually done in one loop which scans over the remainder u and denominator v at the same time, so this is not "magic"; we can perform the multiply and subtract steps over the entire bignum in one efficient O(n) loop. Perhaps surprisingly, the overall algorithm is O(n²), just like multiplication. Division is still much slower than multiplication because each "step" performs more operations (just look at the algorithms!).

Normalisation

A real-world implementation of the above division algorithm will try to reduce the number of guess adjustments. This is done by first normalising or scaling the numbers: both the numerator and denominator are multiplied by the same power of two before starting the division. Afterwards, the remainder must be scaled back by dividing by that power of two. Instead of multiplying and dividing, you can of course just shift the numbers.

The scaling factor depends on the denominator's first digit; it must be scaled up to be at least half of the base. In base 10, you need to scale it up to at least 5, while in a "full machine word" base it's even easier: you simply shift the entire number so that the highest bit of the most significant limb is set. Here is how Scheme48 did this:

bignum_digit_type v1 = (BIGNUM_REF ((denominator), (length_d - 1)));
while (v1 < (BIGNUM_RADIX / 2))  /* Is the high bit set yet? */
  {
    v1 <<= 1;
    shift += 1;
  }

In the CHICKEN version, we take a simpler approach by subtracting the integer length from the digit length, which effectively is the same as counting the number of leading zeroes ("nlz"):

C_uword v1 = *(C_bignum_digits(denominator) + length - 1); 
shift = C_BIGNUM_DIGIT_LENGTH - C_ilen(v1); /* nlz */

Then, both numbers are copied into temporary buffers which are shifted left in-place by the number of bits calculated here.

Normalisation works by preventing the algorithm from overshooting. Think about it: a guess can be too high, but never too low! So if you scale the denominator's first digit to be as high as possible, you can't so easily make a guess that is too high. It's weird, but the math seems to work out.
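For completeness, here's a rough sketch of the whole normalisation step in portable C. It assumes 32-bit limbs stored least significant first, and it counts leading zeroes with a simple loop instead of CHICKEN's C_ilen. The names leading_zeroes, shift_left and normalise are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of leading zero bits ("nlz") in a non-zero 32-bit limb. */
static unsigned leading_zeroes(uint32_t limb)
{
  unsigned count = 0;

  assert(limb != 0);
  while ((limb & 0x80000000u) == 0) { limb <<= 1; count++; }
  return count;
}

/* Shift n limbs left in place by shift bits (0 <= shift < 32).
 * Returns the bits that were shifted out at the top. */
static uint32_t shift_left(uint32_t *limbs, size_t n, unsigned shift)
{
  uint32_t carry = 0, next;
  size_t i;

  if (shift == 0) return 0;
  for (i = 0; i < n; i++) {
    next = limbs[i] >> (32 - shift);  /* bits moving into the next limb */
    limbs[i] = (limbs[i] << shift) | carry;
    carry = next;
  }
  return carry;
}

/* Normalise the denominator so its highest bit is set.  Returns the
 * shift, which must also be applied to the numerator before dividing,
 * and undone on the remainder afterwards. */
unsigned normalise(uint32_t *den, size_t n)
{
  unsigned shift = leading_zeroes(den[n - 1]);

  shift_left(den, n, shift);
  return shift;
}
```

Nothing can be shifted out of the denominator, because the shift is exactly the number of leading zeroes of its top limb. The numerator is a different story: it may grow by one limb, which is why shift_left returns the overflowing bits.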

A reading list for beginners

I am writing this blog post series mostly as a quick overview and introduction to the struggles and approaches taken in CHICKEN's bignum implementation. It is not intended as a full-on tutorial. If you are serious about implementing a full numeric tower (good for you!) or diving deeper into the CHICKEN code, you'll need more. Unfortunately, good and easy to understand documentation is surprisingly hard to find, so here's a reading list to save you some effort.

Knuth's The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. The definitive reference. Many will say that every self-respecting hacker should have read these books, but truth be told they're rather tough to get through. But even if you do give up working through the books, they serve as great reference material. Why are these books so tough? They're math-heavy (especially the first book), and Knuth uses his own "hypothetical" MIX architecture for all code examples and exercises. Yes, everything is in assembly language! Nevertheless, the books are very thorough, and they're obviously written out of love for the craft.
Tom St Denis's book Multi-Precision Math is much more gentle than Knuth's books. This is the companion book for LibTomMath, a public domain, well-commented bignum library, explicitly written to be easy to understand. The book and library cover mostly classic algorithms, but there are also a handful of "advanced" algorithms, and several special-purpose optimised versions.
Per Brinch Hansen's Multiple-Length Division Revisited: A Tour of the Minefield. This little gem is helpful if you are having trouble following textbook explanations of the classical division algorithm. It was written out of frustration with the poor quality of existing explanations.
MpNT: A Multi-Precision Number Package by Tiplea et al. is another overview of a library's algorithms. This is a bit terser and more math-heavy than the LibTom book, but also covers several more advanced algorithms. This is a very good and complete reference.
Finally, Modern Computer Arithmetic by Richard Brent and Paul Zimmermann is probably the tersest, but also the most complete guide to efficient algorithms that I've found so far. These guys know what they're talking about: this book truly covers the "state of the art". Only for advanced students of numerics :)
As a bonus, if you're serious about efficiency: The "algorithms" section of the GMP manual. These are terse and incomplete, and you usually won't get a complete understanding just by reading them. However, since GMP is the most popular implementation, it is also the fastest: Researchers usually create a proof of concept implementation for GMP and compare it to the existing algorithms. So, it is important to know which algorithms GMP is currently using, and then try to find better papers that explain them.

CHICKEN's numeric tower: part 1

2016-10-10T19:36:11Z

Originally, CHICKEN only supported fixnums (small integers) and flonums (floating-point numbers). The upcoming CHICKEN 5 will support a full numeric tower, adding arbitrary-length integers, rational numbers and complex numbers. This is the first in a series of blog posts about my journey to make this a reality. We'll start with a bit of background information. Later parts will dive into the technical details.

In the beginning, there were two numerical types

Like I mentioned, CHICKEN originally only supported fixnums and flonums. This is still the case in CHICKEN 4. When a fixnum overflows, it is coerced into a flonum. On 32-bit systems, this buys us 52 bits of precision, which is more than the 30 bits of precision fixnums offer:

 #;1> most-positive-fixnum
 1073741823
 #;2> (+ most-positive-fixnum 1)
 1073741824.0

This works reasonably well, and is well-behaved until you go beyond the 52 bits supported by the floating-point representation:

 #;3> (flonum-print-precision 100)
 #;4> (expt 2 53)
 9007199254740992.0
 #;5> (+ (expt 2 53) 1)
 9007199254740992.0
 #;6> (= (expt 2 53) (+ (expt 2 53) 1))
 #t

On a 64-bit machine, overflow of the 62 bits of a fixnum to the 52 bits of a flonum is rather weird:

 #;1> (= most-positive-fixnum (- (+ most-positive-fixnum 1) 2))
 #t
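Both of these surprises can be reproduced in plain C with doubles, which is what flonums are underneath. The sketch below is not CHICKEN code; the function names are made up, and the volatile qualifiers are only there to force every intermediate value to be rounded to a real 64-bit double.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors (= (expt 2 53) (+ (expt 2 53) 1)) */
int two_to_53_absorbs_one(void)
{
  volatile double big = 9007199254740992.0;  /* 2^53 */
  volatile double sum = big + 1.0;           /* not representable */

  return sum == big;
}

/* Mirrors (= most-positive-fixnum (- (+ most-positive-fixnum 1) 2))
 * on a 64-bit system, where fixnum overflow continues in flonums. */
int fixnum_overflow_roundtrip(void)
{
  const int64_t most_positive_fixnum = INT64_C(4611686018427387903); /* 2^62 - 1 */
  volatile double as_flonum = (double)most_positive_fixnum; /* rounds up to 2^62 */
  volatile double overflowed = as_flonum + 1.0;             /* still 2^62 */
  volatile double back = overflowed - 2.0;                  /* still 2^62 */

  return as_flonum == back;
}
```

Both functions return 1. Near 2^62, neighbouring doubles are 512 apart, so adding 1 or subtracting 2 simply disappears in the rounding.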

Since we only have fixnums and flonums, any attempt to enter a rational number will result in a flonum:

 #;1> 1/2
 0.5
 #;2> 1/3
 0.333333333333333

Complex numbers are not supported at all:

 #;1> 1+2i
 
 Error: unbound variable: 1+2i

Of course, some people still needed to work with complex numbers, so a long time ago, Thomas Chust created the "complex" egg. This added complex number support to the reader, and the basic numeric operators were overridden to support complex numbers. About a year later, Felix Winkelmann created the original version of the "numbers" egg, using Thomas's code for complex numbers. This added arbitrarily large integers ("bignums" in Lisp parlance) and rational number support via the GNU MP library. Thus, CHICKEN finally had a full numeric tower, and it was completely optional, too. Pretty awesome!

Cracks start to show

Unfortunately, it's not as awesome as it sounds. There are some problems with having parts of the numeric tower as an add-on, instead of having it all in core:

In a Scheme with modules, + from the scheme module should always refer to the same procedure. So if a module imports that + instead of the one from the numbers module, it will not understand extended numeric types. This means that you can't easily combine a library that uses numbers with one that doesn't. If you pass a bignum to the library that does not use numbers, it will raise an exception. This is mostly a problem with Scheme itself, which doesn't have a clean way to define polymorphic procedures. This makes the numeric tower a built-in special case. It is possible to mutate procedures, but allowing for that implies a big performance hit on all code, even if you don't use the numbers egg.
The numbers egg extends the reader to support extended numeric literals. This means that if some code somewhere loads the numbers egg, the reader extension is active even though you didn't load numbers yourself. This can cause confusion because normal numeric operations don't accept these numbers. For an example, see this bug report.
Speaking of extended numeric literals: the compiler doesn't know how to serialise those into the generated C code. This means you can't compile Scheme code containing such literals. You'd have to use string->number everywhere, instead. I found a clever hack to make this work with the numbers egg, but it isn't fool-proof. For instance, it doesn't work when cross-compiling to a platform with different endianness, or if one platform is 32-bit and the other is 64-bit.
The compiler can optimise tight loops by using inline C functions for primitive operations such as the built-in numerical procedures. A current weak spot of CHICKEN is that (as far as I know), eggs can't add such inline C function replacements. So, any code that uses the numbers egg is doomed to have bad performance in critical loops. I think making inlining of C functions available for user code would be a great project (hint, hint!).
Because the FFI (foreign function interface) is built into the compiler, it doesn't support bignums. This means 64-bit integers returned from C are converted to flonums, losing precision. Eggs can't hook into the FFI deeply enough to override this.

One could argue that these are all language or implementation limitations. On the one hand, that's a fair argument. On the other hand, keeping everything "open" so it can be tweaked by the user prevents many optimisations. It also makes the implementation more complex. For instance, there are hooks in core specifically for the numbers egg, to support reading and writing extended literals. The numeric tower needs deeper integration than most other things because numbers are a basic type, much like symbols, strings or lists. So, it makes more sense to have this in the core system.

The start of my quest

Traditionally, Lisps have supported a full numeric tower. At least since the MacLISP days (the early 1970s; see also The History of Lisp), bignums have been pretty standard. Scheme formalises this in the standard, but it does not require full support for all numeric types. Still, in my opinion any serious Lisp or Scheme implementation should support the full numeric tower. It's one of those things that make Lisp unique and more cause for that famous smugness of us Lisp weenies.

It is fantastic when a language supports arbitrarily large integers. Not having to worry about overflows helps prevent various nasty security bugs (luckily, overflowing into flonums, like CHICKEN, mitigates most of these). Bignums can also make it much easier to interact with native code, because integer width is never a problem. It basically frees the programmer from having to think about "unimportant" low-level details. Rational numbers (i.e., fractions like 1/2 or 3/5) and complex numbers are just icing on the cake that add a real feeling of "thoroughness" to Lisp.

This idea, and the fact that other "proper" Scheme implementations support the full numeric tower out of the box, always frustrated me. I believe people are less likely to take CHICKEN seriously as a full Scheme implementation without it. New users especially are often surprised when CHICKEN does not work as expected. Tutorials don't mention that the numeric tower is partly optional!

More experienced users were also frustrated with the limitations of having numbers as a separate egg, like you can see for example in this thread. In it, some of the problems are indicated, and it is also made clear why a GNU MP-based implementation should not be part of CHICKEN.

From all of this, I decided that the best way to get bignums into core would be to start with finding a good BSD-licensed implementation. Then I could replace GMP with this new implementation in the numbers egg, tweak it to use CHICKEN's naming conventions and finally integrate the new code into core. How hard could it be, really? Little did I suspect that 5 years later, the code would finally be introduced to core!

A very slow, but BSD-licensed implementation

Finding a BSD-licensed bignum implementation is not very difficult, and I quickly settled on the Scheme48 implementation, which was originally taken from MIT Scheme. I've always admired Scheme48 for its extremely clean and easy to understand code base, and CHICKEN core already used the syntax-rules implementation from Scheme48, so it made a lot of sense to use their code. Unfortunately, it turned out that the implementation was extremely inefficient, especially when dealing with rational numbers ("ratnums"). After a few weeks of intensive hacking to fix the worst problems, it was finally ready.

This new implementation was much more efficient than the GMP-based numbers egg, but that's only because the GMP-based version relied heavily on finalizers to clean up memory. The new version integrated properly with the CHICKEN garbage collector. This reduced a whole lot of overhead. Having said that, GMP itself is the fastest bignum implementation you'll ever find, so if you can at all get away with using it in your project, do so!

CHICKEN 5 is announced

The CHICKEN core team (of which I'm a member) decided that CHICKEN 5 should be a clean break, with no backwards compatibility. We wanted to finally restructure the core libraries, which had become rather messy, and change a few confusing aspects about modules. Doing this with backwards compatibility would sap too much development energy and possibly result in an even bigger mess. When this decision was made, I decided that this would be the perfect opportunity to finally integrate the numbers egg into core.

I had been working on the numbers egg on and off over the past years, hoping for a good moment to add it to core. When the opportunity presented itself, at first I naively thought a few tweaks would suffice to integrate it. I thought I only had to make some name changes and rearrange some functions. The Scheme48 code base used very descriptive and highly abstract naming, whereas CHICKEN uses terse names and has both inline and CPS variants for primitive operations. Besides, quite a bit of code in the numbers egg was purely in Scheme, whereas CHICKEN has a more-or-less official C API. So, I had to convert some of the functions to C. This would probably also result in some performance improvements.

Small changes lead to a total rewrite

During the conversion to C, I noticed various opportunities for performance improvements. For instance, the Scheme48 code still relied on malloc() to allocate temporary numbers in several places. Where this was done, the final result of an operation would then be allocated into GC-managed memory and the temporary buffer was immediately freed.

Rewriting the code to allocate directly in GC-able memory resulted in quite the restructuring of the code, because we'd need to have a restartable continuation at every point where an allocation would take place. For example, here's the code for negating a bignum:

static void big_neg(C_word c, C_word self, C_word k, C_word x)
{
  bignum_type big = big_of(x); /* Extract bignum data */
  C_word negated_big = bignum_new_sign(big, !(BIGNUM_NEGATIVE_P (big)));
  C_return_bignum(k, negated_big);
}

static bignum_type bignum_new_sign(bignum_type bignum, int negative_p)
{
  bignum_type result =
    (bignum_allocate ((BIGNUM_LENGTH (bignum)), negative_p));  /* mallocs */
  bignum_destructive_copy (bignum, result);  /* basically a manual memcpy */
  return (result);
}

It looks very simple, but a lot is going on under the hood. The C_return_bignum function contained all the hairy complexity; it would either convert the bignum to a fixnum, deallocate the bignum and call the passed continuation, or it would set up a continuation that would copy the bignum into a heap-allocated copy and deallocate the original bignum, and pass that to an allocation function.

This was changed into the following, which uses the core's _u_ naming convention to indicate that the function is unsafe, i.e. it doesn't check its arguments:

void C_ccall C_u_bignum_negate(C_word c, C_word self, C_word k, C_word x)
{
  C_word kab[C_SIZEOF_CLOSURE(3)], *ka = kab, k2, negp, size;

  /* Create continuation k2, to call after allocation */
  k2 = C_closure(&ka, 3, (C_word)bignum_negate_2, k, x);
  
  negp = C_i_not(C_u_i_bignum_negativep(x)); /* Toggle sign */
  size = C_u_i_bignum_size(x);
  C_allocate_bignum(3, (C_word)NULL, k2, size, negp, C_SCHEME_FALSE);
}

static void bignum_negate_2(C_word c, C_word self, C_word new_big)
{
  C_word k = C_block_item(self, 1), /* Extract original continuation */
         old_big = C_block_item(self, 2); /* Extract original bignum */

  /* Copy old bignum digits to newly allocated (negated) bignum */
  C_memcpy(C_bignum_digits(new_big), C_bignum_digits(old_big),
           C_header_size(C_internal_bignum(old_big))-C_wordstobytes(1));

  C_kontinue(k, new_big); /* "Return" the new bignum by calling k with it */
}

The new version looks hairier but does less, because it allocates the bignum directly into the nursery or the heap. Because this may require a GC, it needs to have a continuation, which can be invoked from the GC's trampoline. That's the reason this has to be cut into two separate C functions. There are functions that allocate 2 bignums or even more, which I had to cut up into 3 or more functions!
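If the shape of this split is hard to see through all the CHICKEN macros, here is a stripped-down toy version of the same pattern in portable C. Everything in it is hypothetical: toy_bignum, toy_allocate and friends are not part of CHICKEN, the "nursery" is just a static array, and unlike real CHICKEN code these functions do return to their caller. It only shows how the work after the allocation moves into a second function, which receives its saved state through an environment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_DIGITS 4
#define TOY_NURSERY_SIZE 8

typedef struct {
  int negative;
  size_t length;
  unsigned long digits[TOY_MAX_DIGITS];
} toy_bignum;

/* A continuation: a function plus the environment it needs */
typedef void (*toy_continuation)(void *env, toy_bignum *fresh);

static toy_bignum toy_nursery[TOY_NURSERY_SIZE];
static size_t toy_nursery_used = 0;

/* Stand-in for an allocator that might have to collect garbage first.
 * It hands the new object to the continuation instead of returning it. */
static void toy_allocate(size_t length, int negative,
                         toy_continuation k, void *env)
{
  toy_bignum *fresh;

  assert(toy_nursery_used < TOY_NURSERY_SIZE && length <= TOY_MAX_DIGITS);
  fresh = &toy_nursery[toy_nursery_used++];
  fresh->negative = negative;
  fresh->length = length;
  k(env, fresh);
}

/* State that must survive across the allocation */
typedef struct {
  const toy_bignum *original;
  toy_bignum **result;
} toy_negate_env;

/* Part two: runs once the memory is available */
static void toy_negate_2(void *env, toy_bignum *fresh)
{
  toy_negate_env *e = env;

  memcpy(fresh->digits, e->original->digits,
         e->original->length * sizeof fresh->digits[0]);
  *e->result = fresh;  /* "return" by passing the result along */
}

/* Part one: decide on sign and size, then request the allocation */
void toy_negate(const toy_bignum *x, toy_bignum **result)
{
  toy_negate_env env;

  env.original = x;
  env.result = result;
  toy_allocate(x->length, !x->negative, toy_negate_2, &env);
}
```

An operation that needs two allocations would need a third function in this style, because each allocation is a point where the computation has to be resumable.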

Besides using "native" naming conventions, this new version also gets rid of the unnecessary, un-CHICKENish bignum_type abstraction. Instead, it uses only C_word as its type. This also removed the need for some questionable type casts. Luckily, the final negating version that ended up in CHICKEN 5 is a lot simpler and again only one function, but that required a breakthrough in thinking that I hadn't had at this point yet. I will discuss this breakthrough in the final post in this series.

After having taken care of all the functions, very little remained of the original Scheme48 code. It was completely mutilated! Earlier I had to rewrite some of the Scheme code to improve performance, and now I was restructuring the C code. To top it off, after studying other bignum implementations, it became clear that the Scheme48 code was pretty slow when compared to other Schemes. It only implemented the "classical" algorithms, and it was optimised for readability, not speed.

So, I studied up on better algorithms to make it perform more acceptably. In the next few parts, I'll share with you a few things that I've learned.

Self hosting with cgit using Spiffy

2016-03-07T19:48:52Z

The recent trouble at GitHub, both cultural changes within the company and criticism from the community, reminded me how unstable the whole "free as in beer code hosting for the public good" idea really is. The good part is that it motivated me to finally look into setting up personal hosting for my own projects, because how hard can it be, really?

History shows: code hosting is unreliable

GitHub isn't unique in its problems, and switching away to a competitor with fewer problems isn't going to help in the long run. Besides, dormant projects will be irretrievably lost if GitHub ever shuts down unless someone happens to have a recent clone. To show that the problem is bigger than GitHub, let's look at some events that I remember:

Ages ago in internet time, in 2005, the Dutch government set up a forge website to foster open source usage within the government. In 2009, it went offline. Most projects crawled off to SourceForge to die (including mine), but some survive to this day.
The year 2012 is actually not that long ago, and I clearly remember when BerliOS shut down. I had used it for a project or two back when Subversion was brand new. They offered Subversion hosting when SourceForge and Savannah only offered CVS.
In 2014, I found out the hard way that RubyForge had shut down a few months earlier. I lost the complete commit history for a maintenance-mode project at work.
In 2015, Gitorious got assimilated and subsequently shut down by GitLab. At least, projects that didn't opt-in to a GitLab migration are available in read-only from their archive.
Also in 2015, it became known that SourceForge was adding malware to popular free software downloads and lost all remaining goodwill from the community. It's probably a matter of time before it dies completely, taking down with it an Alexandrian wealth of source code.
At the beginning of 2016, Google Code shut down. Tarballs of archived projects will stay available until the end of the year, but after that the code will probably be gone forever.
In 2017, Gna! shut down. It was lesser-known but still relatively popular in some circles (especially in France).
In 2020, Bitbucket axed Mercurial support, simply deleting all Mercurial repositories (instead of, say, converting them to git). Some projects in maintenance mode where the author moved on to other hosting sites for their projects got their public code (and issue tracker!) removed.

Except for Gitorious/GitLab, I have used and relied on every single one of these code hosting sites, either for personal or work-related projects, or as a contributor to someone else's project.

Is your code for the public good?

As a community, we take too much for granted: code hosting, free of charge, is regarded almost as a public utility. But in reality, it's far from that: we are relying on untrustworthy companies, assuming they won't tamper with our code and will keep their servers up and running forever, free of cost.

And to top it all off, there's the irony, or should I say hypocrisy, of the free software community's dependence on proprietary software for critical project infrastructure. At the same time, some of us are trying to explain to people why proprietary software is harmful to society. Most people just choose what's most convenient when deciding where to host code. We need to realise that this decision can be a political, philosophical and ethical choice. This is my main motivation to move all my personal projects away from Bitbucket.

There are basically two ways to achieve code hosting freedom. The first is to entrust your code to a nonprofit organisation which is committed to supporting free software projects without commercial interference. For instance, the Free Software Foundation offers Savannah. There are more specific hosting sites, like those from the Debian, Apache, GNOME and KDE foundations, but they don't accept all projects. The second is to host the code yourself, which is probably the easier option for small and personal projects.

Now, I know I can't completely avoid proprietary code hosting sites due to the network effect of contributing to free software projects, but for projects I control, I can at least do better. With CHICKEN we're already hosting our own code (with mailing lists provided by Savannah). I decided to host my personal projects on a VPS of my own, which is a good dog fooding opportunity: I found and fixed a bug in the spiffy-cgi-handlers egg while setting this up.

The rest of this blog post will explain how I set up code.more-magic.net. It's not difficult, so hopefully I can inspire you to consider hosting your own code too.

Installing Git, cgit and CHICKEN

I knew right away that I didn't need all the bells and whistles that GitLab or Phabricator provide. I just want to host a few small personal projects, and if ever one really did become popular (ha ha), it would make sense to set up a dedicated server like we have for CHICKEN.

I also decided to convert my Mercurial repositories to Git, to consolidate my VCS usage: At work we're using it, CHICKEN is using it, and so are other projects I contribute to. I'm tired of context-switching all the time, and I'm finally acclimated to magit.

Since I'm only using Git, I don't need to worry about VCS independence of the code hosting tool. Preferably it shouldn't need much RAM, to keep hosting costs down. I narrowed it down to gitweb or cgit. I chose the latter because its UI is less messy and confusing than the former (to me, at least).

Installing cgit is easy as 1, 2, 3:

$ sudo apt-get install git cgit

It is possible to install CHICKEN from its Debian package, but as a core developer I always want the latest version. Besides, CHICKEN only depends on libc, so it's no big deal:

$ sudo apt-get install gcc make libc-dev
$ wget https://code.call-cc.org/releases/4.10.0/chicken-4.10.0.tar.gz
$ tar xzf chicken-4.10.0.tar.gz
$ cd chicken-4.10.0

By installing it into /usr/local/chickens/4.10.0, you can have multiple versions of CHICKEN installed at the same time:

$ make PLATFORM=linux PREFIX=/usr/local/chickens/4.10.0
$ sudo make PLATFORM=linux PREFIX=/usr/local/chickens/4.10.0 install

A nice trick to help us remember which CHICKEN is being used for Spiffy is to symlink it by usage:

$ sudo ln -s /usr/local/chickens/4.10.0 /usr/local/chickens/spiffy

Setting up Spiffy under systemd

Let's start by installing the Spiffy egg. We'll also need a CGI handler to use cgit from Spiffy:

$ /usr/local/chickens/spiffy/bin/chicken-install -s spiffy spiffy-cgi-handlers

First, we must create a small script to run Spiffy. Put this in /usr/local/libexec/spiffy.scm and make it executable:

#!/usr/local/chickens/spiffy/bin/csi -s

(use data-structures spiffy uri-common intarweb cgi-handler)

(spiffy-user "www-data")
(spiffy-group "www-data")
(server-port 80)

(root-path "/usr/share/cgit")
(error-log "/var/log/spiffy/error.log")
(access-log "/var/log/spiffy/access.log")
;(debug-log "/var/log/spiffy/debug.log")

(define cgit (cgi-handler* "/usr/lib/cgit/cgit.cgi"))

;; cgit expects its PATHINFO to contain the full request URI path.
;; However, this is a 404 handler, so we haven't resolved the path
;; to a final file.  This means we don't know what part of the URI
;; is the "script path" and which is the remainder (the pathinfo).
(handle-not-found
  (lambda (p)
    (let* ((uri (request-uri (current-request)))
           (uri-path-rest (cdr (uri-path uri)))
           (path (string-intersperse uri-path-rest "/")))
      (parameterize ((current-pathinfo uri-path-rest))
        (cgit path)))))

;; For the root request (otherwise you'll get 403 forbidden)
(handle-directory cgit)

(start-server)

Now, teach logrotate about the log files we configured, by saving this as /etc/logrotate.d/spiffy:

/var/log/spiffy/access.log
/var/log/spiffy/error.log
/var/log/spiffy/debug.log {
    daily
    missingok
    rotate 10
    compress
    delaycompress
    notifempty
    # If you're in the adm group, you can read logs without sudo
    create 640 www-data adm
}

This rotates logs daily, going back 10 days. Spiffy won't create the directory, and needs to be able to write to the file as www-data, so let's create the files and the directory:

$ sudo mkdir /var/log/spiffy
$ sudo touch /var/log/spiffy/{access,error,debug}.log
$ sudo chown -R www-data:adm /var/log/spiffy

The systemd script from our wiki is a bit too complicated, so I based mine on a simpler example from Python's Gunicorn documentation.

Put the following in /etc/systemd/system/multi-user.target.wants/spiffy.service:

[Unit]
Description=Spiffy the web server
After=network-online.target

[Service]
User=root
Group=www-data
WorkingDirectory=/usr/share/cgit/
ExecStart=/usr/local/libexec/spiffy.scm
ExecStop=/bin/kill -s TERM $MAINPID

[Install]
WantedBy=multi-user.target

Note that we need to run it as root so that it can bind to port 80. It will drop the privileges itself. To register this unit file immediately, you'll need to reload systemd:

$ sudo systemctl daemon-reload

Now you can start Spiffy simply by typing:

$ sudo systemctl start spiffy

If you visit the website with a browser, you'll notice that the styling doesn't work yet. To fix that, we'll turn to the cgit configuration.

Configuring cgit

The default configuration puts cgit's assets at /cgit-css. You could set up Spiffy so that this is handled from /usr/share/cgit, but it's much simpler to remove the prefix from the configuration. While we're at it, let's add some Git repositories as well. Open up /etc/cgitrc and put this in there:

# cgit config, see cgitrc(5) for details

css=/cgit.css
logo=/cgit.png

repo.url=testrepo1
repo.path=/srv/git/test1
repo.desc=This is my first git test repository

section=A section for repo 2
repo.url=testrepo2
repo.path=/srv/git/test2
repo.desc=This is my second git test repository

Let's make sure the git repos exist:

$ sudo mkdir /srv/git
$ sudo chown user:user /srv/git
$ sudo chmod 755 /srv/git
$ git init --bare /srv/git/test1
$ git init --bare /srv/git/test2

If you now visit the website (no reload/restart necessary), you should see a fully functional cgit installation.

Improving the cgit configuration

The above is a simple configuration. My configuration currently looks more like the following:

# cgit config, see cgitrc(5) for details

css=/cgit.css
logo=/cgit.png

root-title=My repositories
root-desc=This is my repo browser. There are many like it, but this one is mine.

# This will show a "clone" section at the bottom of each repo.
clone-prefix=http://code.example.com ssh://code.example.com

# If you don't want clones to be made over HTTP, you must disable it!
#enable-http-clone=0

# When you want to serve eggs from cgit, snapshot links are helpful.
# Note that snapshots can be downloaded even when links are not shown!
snapshots=tar.gz

# Show readme files in "about" tab.  The colon tells cgit to take the
# file from the default branch (usually master). IMPORTANT: See below!
readme=:README
readme=:readme
readme=:readme.txt
readme=:README.txt
readme=:readme.md
readme=:README.md

# Process readme files with a file extension-specific formatter.
# Be *very* careful with this!  The default filter allows arbitrary
# HTML which means XSS, cookie hijacking and other tricks, so either
# run this on a sand-boxed domain or be careful who gets commit access.
about-filter=/usr/lib/cgit/filters/about-formatting.sh

# Highlight source files.  This requires the "python-pygments" package.
# For maximum dog fooding, I should use the colorize egg here :)
source-filter=/usr/lib/cgit/filters/syntax-highlighting.py

# Automatically scan /srv/git for repos.  If you want to de-list some,
# simply make them unreadable for the www-data user.  Important: this
# must be the last statement: everything after it is ignored!
section-from-path=1
scan-path=/srv/git

Many thanks to this guide for pointing out the paths and configuration settings that cgit uses on Debian. There are two follow-up posts that are useful too. There's one with tips on how to use cgit in practice and one about how to tweak the layout.

Now go forth and host your own code!

Let's add a statistical profiler to CHICKEN!

2016-01-04T19:14:26Z

I just submitted a patch to add a statistical profiler to CHICKEN. In this post, I'll explain how it works. It's easier than you'd think!

Instrumentation-based profiling

There are two major ways to profile a program. The first way has been supported by CHICKEN as long as I can remember: You add instrumentation to each procedure. This counts how often the procedure is called, and how much time is spent in it.

You've probably done this by hand: You check the clock before and after calling a procedure, and print the difference. This can be useful when whittling down a specific procedure's run time. But when you want to know where the bottlenecks are in the first place, it's less practical. You don't want to manually add this stuff to all your procedures!

In order to easily instrument each procedure, you'll need language support, either in the compiler or in the run time. Unfortunately, the instrumentation itself will cause your program to slow down: all this tracking takes some time! That's why CHICKEN's profiler is part of the compiler: Instrumentation is emitted only when you compile with -profile. This option adds wrappers around each procedure, which look like this:

;; Original source code:
(define (foo a b c)
  (print (+ a b) c))

;; Instrumented version created by -profile:
(define foo
  (lambda args
    (dynamic-wind  ; Explained later
      (lambda () (##sys#profile-entry 0 profile-info-1234.4567))
      (lambda () (apply (lambda (a b c) (print (+ a b) c)) args))
      (lambda () (##sys#profile-exit 0 profile-info-1234.4567)) ) ) )

In the above code, ##sys#profile-entry starts the clock and increments the call count for this procedure, and ##sys#profile-exit stops the clock. The profile-info-1234.4567 is a global vector in which each procedure of this compilation unit is assigned a unique position.

To create the vector and assign the positions, a prelude is added to the compilation unit. This defines the vector and registers a position for each procedure:

;; Prelude for entire program:
(define profile-info-1234.4567 (##sys#register-profile-info 1 #t))
(##sys#set-profile-info-vector! profile-info-1234.4567 0 'foo)

;; Not exactly true, but let's pretend because it's close enough:
(on-exit ##sys#finish-profile)

This simply creates a vector of size one, and assigns the foo procedure to position zero. Then it requests ##sys#finish-profile to run when the program exits. This will write profile information to disk on exit.
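To make the bookkeeping concrete, here is a minimal sketch in C of what such entry and exit hooks could look like. It is not CHICKEN's actual code: the struct layout, the fixed vector size and all function names are assumptions made for illustration, and it uses the standard clock() function as the timer.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical per-procedure record: one slot per instrumented procedure. */
struct profile_slot {
  const char *name;     /* Procedure name registered by the prelude */
  unsigned long calls;  /* Incremented on every entry */
  clock_t started;      /* Clock reading taken at entry */
  clock_t total;        /* Accumulated ticks between entry and exit */
};

#define PROFILE_SLOTS 1  /* The example unit has only one procedure, foo */
static struct profile_slot profile_info[PROFILE_SLOTS];

/* Prelude step: assign a procedure name to a position in the vector. */
void set_profile_info(size_t index, const char *name)
{
  assert(index < PROFILE_SLOTS);
  profile_info[index].name = name;
  profile_info[index].calls = 0;
  profile_info[index].total = 0;
}

/* "Before" thunk: count the call and start the clock. */
void profile_entry(size_t index)
{
  assert(index < PROFILE_SLOTS);
  profile_info[index].calls++;
  profile_info[index].started = clock();
}

/* "After" thunk: stop the clock and add the elapsed ticks. */
void profile_exit(size_t index)
{
  assert(index < PROFILE_SLOTS);
  profile_info[index].total += clock() - profile_info[index].started;
}

/* Read back the call count, as a report writer would do at exit. */
unsigned long profile_calls(size_t index)
{
  assert(index < PROFILE_SLOTS);
  return profile_info[index].calls;
}
```

Calling set_profile_info(0, "foo") once and then wrapping every call to foo in profile_entry(0) and profile_exit(0) gives both numbers the report needs: the call count and the accumulated time.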

We need `dynamic-wind`, but it creates problems

If you're not familiar with it, dynamic-wind is a sort of "guard" or "try/finally". Whenever the second lambda is entered, the first lambda ("before") is invoked, and when it is left, the third lambda ("after") is invoked. To understand why we need a dynamic-wind in the code presented earlier, consider the naive, incorrect implementation:

;; INCORRECT expansion:
(define foo
  (lambda args
    (begin
      (##sys#profile-entry 0 profile-info-1234.4567)
      (apply (lambda (a b c) (print (+ a b) c)) args)
      ;; Not reached if + throws an exception.  This
      ;; will happen when a or b are not numbers!
      (##sys#profile-exit 0 profile-info-1234.4567) ) ) )

Here, the "after" part will be skipped if the procedure raises an exception. Furthermore, a continuation might be captured or called in the procedure, even multiple times. This would cause the code to jump in and out of the procedure without neatly going over the before/after bits every time. Because of this, the profiler would miss these exits and re-entries, and hence it would not be able to accurately keep track of the time actually spent in this procedure.
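C has no continuations, but the skipped "after" step can be mimicked with setjmp() and longjmp(), which also perform a non-local exit. The following is only an analogy, with made-up names throughout; it shows the naive wrapper leaving the clock running when the body jumps away.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

static jmp_buf error_handler;  /* Stand-in for an exception handler */
static int clock_running = 0;  /* 1 while the profiler's clock is "ticking" */

static void profile_entry(void) { clock_running = 1; }
static void profile_exit(void)  { clock_running = 0; }

/* A procedure body that "raises an exception" via a non-local exit. */
void failing_body(void) { longjmp(error_handler, 1); }

/* A procedure body that returns normally. */
void normal_body(void) { }

/* Naive instrumentation: entry, body, exit, with no guard around it. */
void naive_wrapper(void (*body)(void))
{
  assert(body != NULL);
  profile_entry();
  body();
  profile_exit();  /* Skipped when body() jumps away */
}

/* Runs body under the naive wrapper.
   Returns 1 if the clock was left running afterwards, 0 otherwise. */
int clock_left_running(void (*body)(void))
{
  clock_running = 0;
  if (setjmp(error_handler) == 0)
    naive_wrapper(body);
  return clock_running;
}
```

With normal_body the clock is stopped again, but with failing_body it keeps "ticking" after control has left the procedure, so any time measured from that point on would be attributed to the wrong place.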

The code with dynamic-wind will take care of non-local exits, stopping the clock whenever we jump out of the procedure, and starting it when we jump in again via a captured continuation.

While dynamic-wind is necessary, it also implies quite a heavy hit on performance: dynamic-wind isn't cheap. Even worse is the fact that it prevents the compiler from inlining small procedures in larger ones. Furthermore, the use of apply implies we'll need to cons up an argument list. Normally, arguments aren't put into a Scheme list, because doing so results in more garbage being created. This means that the performance shape of the profiled application can be quite different from the original, non-profiled application!

Statistical profiling

I've always wanted to look into fixing the profiler, but never had the energy to do so. Now that Felix wrote an entire graphical debugger for CHICKEN, I thought maybe I could lean on the debugger's infrastructure to make a better profiler. But it turned out I didn't have to!

First, I should explain statistical profiling. The basic idea is that the process is periodically sampled while it's running. These samples are taken by inspecting the instruction pointer and mapping it to a procedure. If you do this often enough (every 10 ms or so), you can get a pretty good idea of where the program is spending most of its time.

CHICKEN's trace buffer

Looking at the instruction pointer or C function is not very useful in CHICKEN, unless you like to pore over endless piles of machine-generated C code and to mentally map it back to Scheme. It can be educational and even fun, just like it can be fun and educational to read the assembly output of a C compiler, but it is generally unproductive and gets frustrating quickly.

So how can we take a snapshot of what a Scheme program is doing at any given time? It turns out that CHICKEN already does this: when a Scheme program raises an exception, the runtime will show you a call trace. This is a bit like a stack trace in a "traditional" language, only it shows a trace of the execution flow that led to the error. Let's look at a contrived example:

(define (fib n)
  (if (< n 2)
      n
      (+ (fib (- n 1)) (fib (- n 2))) ) )

(define (run-fib)
  (let* ((n (string->number (car (command-line-arguments))))
         (result (fib n)))
    ;; This line is wrong: It tries to append a number to a string
    (print (string-append "Result: " result)) ) )

(run-fib)

When you compile the program and run it, you'll see the output:

 $ csc -O3 fib.scm
 $ ./fib 3
 
 Error: (string-append) bad argument type - not a string: 2
 
         Call history:
         
         fib.scm:12: run-fib
         fib.scm:7: command-line-arguments
         fib.scm:8: fib
         fib.scm:4: fib
         fib.scm:4: fib
         fib.scm:4: fib
         fib.scm:4: fib
         fib.scm:10: string-append               <--

In a stack trace you would see only two lines, mentioning run-fib and string-append. Here we can see the trace of execution through the program, where it entered run-fib, then called command-line-arguments, proceeded to invoke fib five times in total, and finally it invoked string-append with the result.

There are advantages and disadvantages to either approach, but a trace makes more sense in languages with full continuations. It is the natural thing to do in a compiler that converts all programs to continuation-passing style. It does tend to confuse beginners, though!

So, these trace points are already inserted into every program by default. This happens even in programs optimised with -O3, because trace points have very little overhead: It's just a pointer into a ring buffer that gets updated to point to the procedure's name. You can choose to omit traces completely via -no-trace or -d0, but that's not the default.

Trace points are a good fit for our profiler: when taking a sample, we can simply take a look at the newest entry in the trace buffer, which will always reflect the procedure that's currently running!
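As a rough illustration of the idea, a trace buffer can be modelled as a small ring of name pointers. This is a sketch under assumptions, not CHICKEN's real data structure: the buffer size and the function names trace and newest_trace_entry are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TRACE_SIZE 4  /* Tiny on purpose; a hypothetical size */

static const char *trace_buffer[TRACE_SIZE];
static size_t trace_next = 0;   /* Slot the next trace point writes to */
static size_t trace_count = 0;  /* Entries written, capped at TRACE_SIZE */

/* A trace point: store a pointer to the name, then advance and wrap. */
void trace(const char *name)
{
  assert(name != NULL);
  trace_buffer[trace_next] = name;
  trace_next = (trace_next + 1) % TRACE_SIZE;
  if (trace_count < TRACE_SIZE)
    trace_count++;
}

/* What a sampler would read: the most recently written entry,
   or NULL if no trace point has been hit yet. */
const char *newest_trace_entry(void)
{
  if (trace_count == 0)
    return NULL;
  return trace_buffer[(trace_next + TRACE_SIZE - 1) % TRACE_SIZE];
}
```

Each trace point costs one pointer store and one index update, which is why leaving them enabled is cheap. Old entries are simply overwritten once the ring wraps around, and the sampler only ever needs the newest one.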

Setting up a sampler

So how can we interrupt the program at a point in time, without messing with the program's state? This is simple: we ask for a signal to be delivered periodically. There's even a dedicated signal reserved for this very task: SIGPROF.

We'll use setitimer() to set up the timer which causes the signal to be delivered, even though POSIX says it's obsolete. It is much more convenient and more widely supported than the alternative, timer_create() plus timer_settime(). We can always switch when setitimer() is removed from an actual POSIX implementation.

The following setup code is simplified a bit (most importantly, error handling is omitted):

C_word CHICKEN_run(void *toplevel) { /* Initialisation code for CHICKEN */
  /* ... a lot more code ... */

  if (profiling) {
    struct itimerval itv;
    time_t freq = 10000;                   /* 10 msec (in usec) */

    itv.it_value.tv_sec = freq / 1000000;
    itv.it_value.tv_usec = freq % 1000000;
    itv.it_interval.tv_sec = itv.it_value.tv_sec;
    itv.it_interval.tv_usec = itv.it_value.tv_usec;

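    /* Install the handler before arming the timer: the default
       action for SIGPROF terminates the process. */
    signal(SIGPROF, profile_signal_handler);
    setitimer(ITIMER_PROF, &itv, NULL);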
  }

  /* ... a lot more code ... */
}

This sets up a profile timer. When such a timer expires, the kernel will send SIGPROF to the process. The code also registers a signal handler that will be invoked when this happens. It looks like this, much simplified:

struct profile_item           /* Item in our profiling hash table */
{
  char *key;                  /* Procedure name, taken from trace buffer */
  unsigned int sample_count;  /* Times this procedure was seen */
  struct profile_item *next;  /* Next bucket chain */
};

void profile_signal_handler(int signum)
{
  struct profile_item *pi;
  char *procedure_name;

  procedure_name = get_topmost_trace_entry();

  pi = profile_table_lookup(procedure_name);

  if (pi == NULL) {
    pi = profile_table_insert(procedure_name);
    pi->sample_count = 1;
  } else {
    pi->sample_count++;
  }
}
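The handler above relies on two helpers that are not shown, profile_table_lookup and profile_table_insert. One possible way to write them is sketched below as a chained hash table. The table size, the hash function and the preallocated pool are my assumptions for this example, not the CHICKEN implementation; the pool is there because malloc() is not async-signal-safe and should not be called from a signal handler. The key is declared const here, and it is stored by pointer, so the name must outlive the table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUCKETS 64     /* Hypothetical number of hash buckets */
#define MAX_ITEMS 256  /* Hypothetical pool size, avoids malloc() */

struct profile_item {
  const char *key;            /* Procedure name (not copied) */
  unsigned int sample_count;  /* Times this procedure was seen */
  struct profile_item *next;  /* Next item in the same bucket */
};

static struct profile_item *buckets[BUCKETS];
static struct profile_item pool[MAX_ITEMS];
static size_t pool_used = 0;

/* Simple multiplicative string hash, reduced to a bucket index. */
static size_t hash_name(const char *s)
{
  size_t h = 5381;
  while (*s != '\0')
    h = h * 33 + (unsigned char)*s++;
  return h % BUCKETS;
}

/* Returns the item for name, or NULL if it has not been seen yet. */
struct profile_item *profile_table_lookup(const char *name)
{
  struct profile_item *pi;
  assert(name != NULL);
  for (pi = buckets[hash_name(name)]; pi != NULL; pi = pi->next)
    if (strcmp(pi->key, name) == 0)
      return pi;
  return NULL;
}

/* Adds a new item at the head of its bucket chain.
   Returns NULL when the pool is exhausted. */
struct profile_item *profile_table_insert(const char *name)
{
  struct profile_item *pi;
  size_t b;
  assert(name != NULL);
  if (pool_used == MAX_ITEMS)
    return NULL;
  b = hash_name(name);
  pi = &pool[pool_used++];
  pi->key = name;
  pi->sample_count = 0;
  pi->next = buckets[b];
  buckets[b] = pi;
  return pi;
}

/* Same logic as the handler body, plus a check for a full pool. */
void record_sample(const char *name)
{
  struct profile_item *pi = profile_table_lookup(name);
  if (pi == NULL) {
    pi = profile_table_insert(name);
    if (pi != NULL)
      pi->sample_count = 1;
  } else {
    pi->sample_count++;
  }
}
```

Calling record_sample("fib") twice and record_sample("run-fib") once leaves two items in the table, with sample counts of 2 and 1.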

Maybe you are wondering why we're doing this in C rather than Scheme. After all, Scheme code is more readable and easier to maintain. There are a few reasons for that:

Scheme signal handlers are blocked in some critical sections of the run time library. This would delay profiling until the program is back in user code, skewing the results.
Core is compiled with -no-trace by default, but this can be turned off. Doing so can be useful when profiling core procedures, but not with a signal handler in Scheme. It would see its own code in the trace buffer, instead of what we want to trace!
The profiling code should be as low-overhead as possible, to avoid affecting the results. Remember, this is one of the main problems with the instrumenting profiler! While CHICKEN produces fast code, it is faster if we do it directly in C, and it will trigger no garbage collections.

After the program has completed its run, we must write the profile to a file. I won't show it here, but the CHICKEN implementation simply writes each key in the hash table with its call count and the time spent in that procedure. The time is estimated by multiplying the sampling interval (10 ms here) by the number of samples in which the procedure was seen.

This means we'll miss some calls, and therefore we'll under-represent the time taken by those procedures. On the other hand, some procedures are over-represented: if a sample is taken for a very fast procedure, we'll assign 10ms to it, even if it runs in a fraction of that. This is the essence of the statistical approach: if a program runs long enough, these measurement errors should balance out.
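To see the whole mechanism in one place, here is a small self-contained sketch that arms a profiling timer, counts the SIGPROF signals that arrive while it burns CPU, and converts a sample count into an estimated time. It uses the POSIX calls sigaction() and setitimer(); the function names, the five second safety limit and the busy loop are assumptions for the demonstration and have nothing to do with CHICKEN's own code.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700  /* Expose sigaction() and setitimer() */
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

#define SAMPLE_INTERVAL_USEC 10000L  /* 10 ms between samples */

static volatile sig_atomic_t samples_seen = 0;

/* Signal handler: a real profiler would inspect the trace buffer here. */
static void count_sample(int signum)
{
  (void)signum;
  samples_seen++;
}

/* Estimated run time in seconds: number of samples times the interval. */
double estimated_seconds(long samples, long interval_usec)
{
  return (double)samples * (double)interval_usec / 1e6;
}

/* Burns CPU until `wanted` samples have arrived, or until five seconds
   of CPU time have passed. Returns the number of samples collected,
   or -1 if the handler or the timer could not be set up. */
long sample_busy_loop(long wanted)
{
  struct sigaction sa;
  struct itimerval itv, off;
  clock_t deadline;

  samples_seen = 0;
  memset(&sa, 0, sizeof sa);
  sa.sa_handler = count_sample;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGPROF, &sa, NULL) != 0)
    return -1;

  itv.it_value.tv_sec = 0;
  itv.it_value.tv_usec = SAMPLE_INTERVAL_USEC;
  itv.it_interval = itv.it_value;
  if (setitimer(ITIMER_PROF, &itv, NULL) != 0)
    return -1;

  deadline = clock() + 5 * CLOCKS_PER_SEC;
  while (samples_seen < wanted && clock() < deadline) {
    /* Busy loop: ITIMER_PROF only counts down while we use CPU time */
  }

  memset(&off, 0, sizeof off);
  setitimer(ITIMER_PROF, &off, NULL);  /* Disarm the timer again */
  return samples_seen;
}
```

For example, estimated_seconds(583, SAMPLE_INTERVAL_USEC) yields 5.83: a procedure that was seen in 583 samples gets 5.83 seconds attributed to it, no matter how many calls those samples were spread over.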

Comparing the two profilers

It's hard to come up with a small but representative example which is self-contained, so I'll use a few benchmarks from Mario's collection.

Low-level vs high-level

The first benchmark we'll profile is the kernwyk-array benchmark. It's taken from a historical set of benchmarks by Kernighan and Van Wyk, designed to compare the performance of various "scripting languages". This particular benchmark creates an array containing a million numbers by destructively initialising it. After that, it creates a second array into which the first is copied. This is repeated 100 times.

If we compile this with csc -O3 -profile and run it, the original profiler gives us the following breakdown:

procedure	calls	seconds	average	percent
`my-try`	100	4.008	0.040	100.000
`go`	1	4.008	4.008	100.000
`create-y`	100	2.311	0.023	57.684
`create-x`	100	1.696	0.016	42.315

If we compile this with csc -O3 and run it with -:p under the new profiler, we'll get a radically different result, even though the total run time did not change much:

procedure	calls	seconds	average	percent
`kernwyk-array.scm:12:make-vector`	100	2.400	0.024	58.679
`kernwyk-array.scm:5:make-vector`	100	1.690	0.016	41.320

The statistical profiler is a bit more "low-level": It tells you exactly the procedure call and line that is taking the most time. On the other hand, the instrumentation-based profiler shows us a per-procedure breakdown of where the time is spent. The percentages and "seconds" column are also different: the original profiler shows us the cumulative time each procedure takes up. Thus, a main entry point will always be at 100% at the top.

But the most significant difference is in what this tells us about where the time is spent: the original profiler tells us that create-y is a little slower than create-x. Reading such output would lead me to think that probably vector-ref and vector-set! take the most time. If we remove all calls to those, the program takes 2.6 seconds, and the profiler output looks more or less the same, so they're not the biggest contributor to the total run time. Instead, make-vector is, due to the fact that it allocates, which will cause garbage collections. And garbage collections are the real time consumers in this benchmark!

Precision of the two profilers

The next benchmark we'll look at is the nfa benchmark. I'm not sure about the origins of this benchmark. It emulates a depth-first NFA search for the regular expression ((a | c)* b c d) | (a* b c). This is matched against a string containing 133 "a" characters followed by "bc".

The output of the instrumenting profiler is completely useless, because only toplevel procedures are instrumented:

procedure	calls	seconds	average	percent
`recursive-nfa`	150000	8.071	0.000	100.000

This is another reason for wanting to "fix" the profiler: it doesn't give an inside view of where large procedures or closures are spending their time. You can manually tweak the programs by lifting all the inner procedures up to the toplevel. If these procedures close over some variables, you must turn those into extra arguments and pass them along when calling these procedures. It's a bit tedious, but if we do this for the benchmark, we'll get output that is more useful:

procedure	calls	seconds	average	percent
`recursive-nfa`	150000	30.687	0.00020458	100.000
`state0`	150000	30.559	0.00020373	99.582
`state1`	20100000	23.372	0.00000116	76.160
`state2`	20100000	8.423	0.00000041	27.450
`state3`	20100000	7.080	0.00000035	23.070
`state4`	150000	0.132	0.00000088	0.430

Now, if we take the original, untweaked program, and run it through the statistical profiler, we'll immediately get useful output:

procedure	calls	seconds	average	percent
`nfa.scm:14:state1`	146	5.830	0.039	72.602
`nfa.scm:31:state3`	141	2.160	0.015	26.899
`nfa.scm:9:state1`	2	0.020	0.010	0.249
`nfa.scm:44:##sys#display-times`	1	0.010	0.010	0.124
`nfa.scm:9:state3`	1	0.010	0.010	0.124

This also shows a disadvantage of the statistical profiler: the call counts are all wrong! That's because the state procedures are extremely fast: in the original profiler you can see that they run 20 million times in 8 seconds or so. Because they're so fast, the average time per call is close to zero. This results in the timer being too slow for sampling each procedure while it is running. It's so slow we only see an extremely small fraction of all calls!

Nevertheless, we can clearly tell that most time is spent in state1 and state3. The calls for state2 are not even registering, because this state will return much sooner: there are almost no b, c or d characters in the input string, so it will just quickly "fall through" this procedure without a match. The reason it shows up in the original profiler is because the instrumentation itself is interfering with an accurate reading of time spent in the procedure.

The total run time of the instrumented and tweaked version is almost 31 seconds, while the total run time of the version with statistical profiling is about 8 seconds on my laptop! Let's take a closer look at that overhead.

Instrumentation overhead

Having two distinct ways of gathering profile data opens up a really cool opportunity. We can measure the overhead introduced by the instrumentation profiler, by running it under the statistical profiler!

Let's do that on the "tweaked" NFA benchmark again:

 $ csc -O3 -profile nfa.scm
 $ ./nfa -:p
 30.88s CPU time, 2.404s GC time (major),
 163050000/149475510 mutations (total/tracked),
 5972/113664 GCs (major/minor)

And let's look at the result:

procedure	calls	seconds	average	percent
`##sys#profile-entry`	701	17.000	0.024	55.051
`##sys#profile-exit`	633	10.710	0.016	34.682
`##sys#dynamic-wind`	145	1.660	0.011	5.375
`nfa.scm:15:state2`	61	0.670	0.010	2.169
`nfa.scm:29:state3`	41	0.440	0.010	1.424
`nfa.scm:12:state1`	31	0.340	0.010	1.101
`nfa.scm:41:state0`	3	0.040	0.013	0.129
`nfa.scm:49:recursive-nfa`	2	0.020	0.010	0.064

This clearly shows that the instrumentation machinery completely dominates the profile: A stunning 27 seconds are being soaked up by ##sys#profile-entry and ##sys#profile-exit!

Of course, this is just a contrived example of profiling a benchmark. An experienced Chickeneer would have been able to just tell from the output of the benchmark with and without profiling:

 $ csc -O3 nfa.scm
 $ ./nfa
 8.156s CPU time, 0.012s GC time (major), 32/6849 GCs (major/minor)
 $ csc -O3 -profile nfa.scm
 $ ./nfa
 30.628s CPU time, 2.316s GC time (major),
 163050000/149475510 mutations (total/tracked),
 5970/113666 GCs (major/minor)

Aside from the obvious increase in CPU time, you can see that the number of mutations went from zero to more than a hundred million. The number of garbage collections is also many times higher. This would spell certain doom if you saw it in a real program!

Conclusion

It's too early to be sure, but it looks like a statistical profiler is a useful alternative to instrumentation. On the other hand, for some programs the situation is reversed:

Programs blocking SIGPROF offer fewer sampling opportunities, resulting in incomplete profiles.
If you need the exact number of procedure calls, instrumentation is your only real option.
When there are lots of smallish procedures called by a few toplevel procedures, the noisy low-level output of the statistical profiler can drown out the useful information.

I think if we can push down the performance bottleneck caused by the instrumentation, it'll become a lot more useful again. This won't be easy, because some of the overhead is fundamental to its use of dynamic-wind and the way inlining is prevented by it. In the meantime, please test the statistical profiler and let me know how it works for you!

The Flattr "experiment"

2015-10-25T18:35:04Z

Maybe you've noticed the little Flattr icons that I've added at the bottom of each post, and now you're wondering if, and why, you should support my blogging efforts with a donation.

Let me state this first and foremost: I enjoy writing posts about subjects that interest me and I learn a lot from the research I put into my posts. So, even without donations I'd continue doing so. But, it's also a fact that many of my posts demand a lot of effort. Researching, writing and illustrating an in-depth post can take weeks, depending on my energy levels and other activities.

I hope you now understand why this blog is updated so slowly. I'm convinced that a few informative, long posts are better than frequent 100-word status updates of my projects. Those would be a waste of your time! But because it takes so much time and effort, sometimes I struggle to find the motivation for writing new posts.

This is where Flattr comes in: a small donation is a wonderful way to express how much you liked a post, and it can really lift my spirits and motivate me to write more!

As an added advantage, by seeing which posts get Flattr'd the most I can get a better idea of what you, the readership of my blog, like and what you don't like. This will help me choose which topics to write about more.

Update: You can now also donate via Bitcoin, as several people indicated that they wouldn't like to sign up to Flattr. The bitcoin address you can use is 19uPJ7BVcJea4iFtUiMcFGTEJG8pEbFmDC.

Update 2: After more than 3 years I got only a handful of donations, so I decided to get rid of these buttons. I've learned from this experiment that (in general) people aren't interested in tipping random strangers for good articles. Don't worry, I still won't put ads on my blog :)

Why I refuse to put ads on my blog

Of course, I know that placing ads on my blog would be the easiest way to get some returns from the hard work that I put into my posts. This would easily repay my hosting costs. However, in my opinion, the advertising industry is the most toxic influence on the web today. It is completely absurd that it's considered normal for companies (and individuals with blogs...) to be rewarded for filling up our screens with obnoxious garbage, and for eroding our privacy through various tracking mechanisms. They don't care that all this digital pollution is only going to be acted on by a tiny fraction of all visitors (if any). It is downright disrespectful; personally I think it's as bad as spam e-mail.

In my opinion, Flattr is a much more benign way of receiving money for my efforts. The company was created by a founder of The Pirate Bay, as a better and more direct alternative for rewarding creators than the commercial exploitation of copyright law by middle men. It's also unobtrusive: just a small icon in each post. The integration that I'm using is a simple hyperlink, so no nasty tracking cookies or web beacons; I'm even hosting the icon image myself.

Besides that, Flattr is completely voluntary on the visitor's part. This means that any rewards are directly linked to people's enjoyment of my posts. This encourages informative posts of substance rather than click-bait titles with disappointing content which would draw more traffic, just to get more ad impressions.

These are my views on the matter. If you're using ads on your blog, I hope I've managed to convince you to switch away from ads to visitor-supported donations. And if you like my posts, I hope you'll consider making a small donation.

Integrating Flattr into Hyde

To make this post at least a little bit on-topic, I'd like to share how I added the Flattr button to my Hyde configuration. I added the following definition to hyde.scm (slightly modified):

(use hyde uri-common)

(define (flattr-button #!optional (page (current-page)))
  (let* ((base-uri (uri-reference "https://flattr.com/submit/auto"))
         (attrs `((user_id . "YOUR FLATTR ACCOUNT NAME")
                  ;; A bit ugly, but Hyde does it like this, too
                  (url . ,(string-append ($ 'base-uri page) (page-path page)))
                  (category . "text")
                  (language . ,(or ($ 'lang page) "en_GB"))
                  (title . ,($ 'title page))))
         ;; Flattr doesn't grok semicolon as query param separator...
         (flattr-page-uri
          (parameterize ((form-urlencoded-separator "&"))
            (uri->string (update-uri base-uri query: attrs)))))

    `(a (@ (class "flattr-button")
           (href ,flattr-page-uri))
        (img (@ (src "/pics/flattr-button.svg")
                (alt "Flattr this!")
                (title "Flattr this post!"))))))

Then, in my SXML definition for article layouts (which I call, surprise, surprise, article.sxml), I invoke it right after the contents:

;; -*- Scheme -*-
()

`(div (@ (class "article"))

        ,@(prev/next-navigation)

        (h1 (@ (class "article-title"))
            ,($ 'title) " "
            (small (@ (class "date"))
                   "Posted on " ,(format-seconds (page-updated))))

        (div (@ (class "article-body"))
             (inject ,contents)
             ,(flattr-button))

        ,@(prev/next-navigation))

The button immediately follows the contents of the blog post, which makes sense because only after reading a post will you know whether it was worth something to you.

CHICKEN internals: data representation

2015-08-16T12:19:13Z

In my earlier post about the garbage collector, I lied a little bit about the data representation that CHICKEN uses. At the end of the post I briefly mentioned how CHICKEN really stores objects. If you want to fully understand the way CHICKEN works, it is important to have a good grasp on how it stores data internally.

Basic idea

CHICKEN attempts to store data in the most "native" way it can. Even though it's written in C, it tries hard to use machine words everywhere. So on a 32-bit machine, the native code that's eventually generated will use 32-bit wide integers and pointers. On a 64-bit machine it will use 64-bit wide integers and pointers.

This is known as a C_word, which is usually defined as an int or a long, depending on the platform. By the way, the C_ prefix stands for CHICKEN, not the C language. Every Scheme value is represented as a C_word internally. To understand how this can work, you need to know that there are roughly two kinds of objects.

Immediate values

First, there are the immediate values. These are the typical "atomic" values that come up a lot in computations. It is important to represent these as efficiently as possible, so they are packed directly in a C_word. This includes booleans, the empty list, small integers (these are called fixnums), characters and a few other special values.

Because these values are represented directly by a C_word, they can be compared in one instruction: eq? in Scheme. These values do not need to be heap-allocated: they fit directly in a register, and can be passed around "by value" in C. This also means they don't need to be tracked by the garbage collector!

At a high enough level, these values simply look like this:

This doesn't really show anything, does it? Well, bear with me...

Block objects

The other kind of value is the block object. This is a value that is represented as a pointer to a structure that contains a header and a variable-length data block.

The data block is a pointer which can conceptually be one of two types. In case of a string or srfi-4 object, the data block is simply an opaque "blob" or byte-vector. In most other cases, the block is a compound value consisting of other Scheme objects. Typical examples are pairs, vectors and records.

Because these values are heap-allocated, two distinct objects are not stored at the same memory address, even if they store the same value. That's why comparing their values is a complex operation. This operation is either equal? for deep structural comparison, or eqv? for value comparisons of numbers and symbols.

The R5RS specification explains that the difference between eq? and eqv? is not necessarily the same across Scheme implementations. For example, in CHICKEN, eq? can be used to compare characters and fixnums, because they are stored as immediate values. Portable programs should not rely on that. If you use eq? on block objects, their pointers will be compared. That means it checks whether they are one and the same object. This can be a useful operation in its own right.

Objects represented by data blocks also have to be tracked by the garbage collector: if there are still references to the block, its data must be copied (recursively) to keep it alive across GC events.

Here are some "high-level" examples of block objects:

This picture should look somewhat familiar to students of SICP: it is reminiscent of the box-and-pointer notation used to illustrate the structure of lists. The boxes containing green text represent the object headers. The header indicates the type of object and the object's size. It also determines whether the object's data block is a byte block or a block containing Scheme objects: if it contains Scheme objects, the header tells us how many slots (locations for storing Scheme objects) the object has. Byte blocks, on the other hand, are opaque and can contain any data. Their size is stored as a byte count.

From top to bottom, left to right, these represent the following values:

(#\a . #\b) is a pair containing the character "a" in its car and "b" in its cdr.
#(#f 123 456 #f 42) is a regular Scheme vector containing fixnums and false values.
"hello" is a string consisting of 5 characters (strings are treated as byte vectors in CHICKEN).
12.5 is an inexact representation of the number twelve and a half (a "flonum"). This is a byte block storing the raw byte value of a C double.
("hello" . (12.5 . ())) is the first pair of a proper list which contains a string and a flonum.
(12.5 . ()) is the cdr of that list; a pair containing a number and the end-of-list marker.

The final two pair objects show that slots (like any C_word) can hold not only immediate values, but also pointers to block objects. This leads us to the question: how to differentiate between a pointer to an object and an immediate object?

Bit fiddling

Most platforms require pointers to words to be aligned on a word boundary. Thus, on a 32-bit machine, memory addresses will always have zero in the lower 2 bits, because we can only point to multiples of 4 bytes. On a 64-bit machine, word addresses will have zero in the lower 3 bits.

Because the lower two bits are never used, we can perform a simple trick: any value that has either of the lower two bits set cannot be a word pointer, so we enforce immediate objects to have either bit set. It may feel like a gross hack to people who are used to working with "clean", high-level C code, but it is a technique which goes back a long way: Orbit, one of the earliest optimising compilers for Scheme, did exactly the same thing. Other modern Schemes like Larceny and Gambit do the same thing. Even Scheme48, which is probably the cleanest Scheme implementation, uses tagged words. Other Lisps use this representation as well. See Steel Bank Common Lisp, for example.
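To make the trick concrete, here is a minimal C sketch. The names word_t, is_immediate and word_from_pointer are made up for illustration and are not CHICKEN's actual macros. It only assumes what the text above states: word pointers are aligned on at least 4 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for C_word: an integer as wide as a pointer. */
typedef intptr_t word_t;

/* Either of the lower two bits set: cannot be an aligned word pointer. */
static int is_immediate(word_t w)
{
  return (w & 3) != 0;
}

/* Reinterpret a pointer to a word as a word_t, checking its alignment. */
static word_t word_from_pointer(const word_t *p)
{
  word_t w = (word_t)p;
  assert((w & 3) == 0);   /* word pointers have zero in the low two bits */
  return w;
}
```

Any address of a word_t passes through word_from_pointer() unchanged and is reported as "not immediate", while small odd or 2-mod-4 values such as 1, 2 or 0x16 are reported as immediate.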

Many other dynamic languages don't use a packed data representation like this. Many prefer the simpler but bulkier struct representation. At the other end of the spectrum, we have statically typed, non-garbage collected languages. They generally don't need to store the type of a value along with it. Instead, they can directly store the "un-boxed" value in memory. This, and the relation to garbage collection, is explained rather well in Appel's 1989 paper "Runtime Tags Aren't Necessary".

Representation of objects

We've learned how CHICKEN distinguishes between pointers to (block) objects and immediate values. Now we will look into the nitty-gritty details of the object representation.

We can make the following breakdown of bit patterns (assuming a 32-bit platform):

This shows that the lower two bits can be used to distinguish between block objects (zero) and immediate objects (nonzero). For immediate objects, the low bit can be used to distinguish between fixnum objects and other kinds of immediate objects. The colouring indicates which bits are used for tagging objects of that kind. The uncoloured bits are used for representing the object being stored.

Fixnums are distinguished from "other immediate" values because fixnums are so incredibly common: they are used for indexing into strings, loop counters and many calculations. These have to be represented as efficiently as possible while storing the widest possible range of values. Run time type checking for fixnums should use as few CPU instructions as possible.

The "other immediate" types are further differentiated through the top two bits of the lower nibble:

The unused "other immediate" type of 0010 is reserved for future use. To get a good feel for the representation of immediates, let us look at a few example bit patterns. I'll also show you how to construct them in C.

Bit patterns of immediate values

Fixnums

These small integer values are stored in regular old two's complement representation, like the CPU uses. The lowest bit is always 1, due to the fixnum tag bit. The highest bit is used to determine the sign of the number.

The C_fix() macro shifts its argument one bit to the left, and sets the lower bit through a bit-wise OR with 1. To convert a Scheme fixnum back to a C integer, you can use the C_unfix() macro. This shifts its argument one bit to the right.
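The shifting described above can be sketched in a few lines of C. The sketch_ names are hypothetical; the real C_fix() and C_unfix() are macros in chicken.h and may differ in detail (for example in how they check types).

```c
#include <assert.h>
#include <stdint.h>

typedef intptr_t word_t;            /* hypothetical stand-in for C_word */

/* Tag a C integer as a fixnum: shift left one bit, set the low bit. */
static word_t sketch_fix(word_t n)
{
  /* Shift as unsigned: left-shifting a negative signed value is undefined. */
  return (word_t)(((uintptr_t)n << 1) | 1);
}

static int sketch_is_fixnum(word_t w)
{
  return (w & 1) != 0;
}

/* Untag: shift right one bit.  Relies on >> being an arithmetic shift for
   negative values, which is implementation-defined but near-universal. */
static word_t sketch_unfix(word_t w)
{
  assert(sketch_is_fixnum(w));
  return w >> 1;
}
```

With these definitions, sketch_fix(42) yields 85 (binary 1010101), and sketch_unfix(85) gives back 42. One bit of range is lost to the tag, which is why very large integers cannot be fixnums.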

You might wonder what happens when you calculate or enter a very large integer. In CHICKEN 4, it will be coerced to a flonum. In CHICKEN 5, it will be stored as a bignum. Bignums are block objects, not immediates, because they may be arbitrarily large.

Booleans

That's a very large bit space for only two values. However, reserving a special type tag just for booleans simplifies type detection code: we only have to compare the lower four bits with 0110 to check whether an object is a boolean.
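A rough C sketch of that boolean check follows. Only the 0110 tag comes from the text above. The constant names are invented, and the choice of which bit separates true from false is an assumption made for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef intptr_t word_t;             /* hypothetical stand-in for C_word */

#define SKETCH_TAG_MASK     0x0f     /* lower four bits hold the type tag */
#define SKETCH_BOOLEAN_TAG  0x06     /* 0110, as described above */
/* Assumption: the bit just above the tag nibble separates #t from #f. */
#define SKETCH_FALSE  ((word_t)(SKETCH_BOOLEAN_TAG | 0x00))
#define SKETCH_TRUE   ((word_t)(SKETCH_BOOLEAN_TAG | 0x10))

/* One mask and one compare is all the type check needs. */
static int sketch_is_boolean(word_t w)
{
  return (w & SKETCH_TAG_MASK) == SKETCH_BOOLEAN_TAG;
}

/* Convert a C truth value to a tagged boolean word. */
static word_t sketch_make_boolean(int truth)
{
  return truth ? SKETCH_TRUE : SKETCH_FALSE;
}

/* Convert a tagged boolean word back to 0 or 1. */
static int sketch_truth(word_t w)
{
  assert(sketch_is_boolean(w));
  return w == SKETCH_TRUE;
}
```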

Characters

Characters do not make full use of the available bits, because the lower byte's high nibble is always 0000. This means that only 24 bits are available for representing the character on 32-bit platforms. Luckily, this is enough for representing the full Unicode range. If Unicode ever starts using up a bigger code space, we can always sneak in 4 more bits.
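As an illustration, character tagging could look roughly like this in C. The shift of 8 follows from the low byte being reserved for the tag. The tag value 1010 is an assumption (the text only tells us that booleans use 0110 and that 0010 is unused), and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef intptr_t word_t;             /* hypothetical stand-in for C_word */

#define SKETCH_TAG_MASK       0x0f
#define SKETCH_CHARACTER_TAG  0x0a   /* assumed: 1010 in the low nibble */
#define SKETCH_CHAR_SHIFT     8      /* code point sits above the low byte */

static int sketch_is_character(word_t w)
{
  return (w & SKETCH_TAG_MASK) == SKETCH_CHARACTER_TAG;
}

/* Pack a Unicode code point into a tagged word. */
static word_t sketch_make_character(uint32_t code_point)
{
  assert(code_point <= 0x10ffff);    /* Unicode fits easily in 24 bits */
  return ((word_t)code_point << SKETCH_CHAR_SHIFT) | SKETCH_CHARACTER_TAG;
}

/* Recover the code point by shifting the tag byte away. */
static uint32_t sketch_character_code(word_t w)
{
  assert(sketch_is_character(w));
  return (uint32_t)((uintptr_t)w >> SKETCH_CHAR_SHIFT);
}
```

Under these assumptions the character "a" (code point 0x61) is stored as 0x610a.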

Special objects

This list is exhaustive: currently there are only four special objects. There is a lot of room for adding other special objects, if that ever becomes necessary.

The "unbound variable" representation cannot be captured by a program: when it is evaluated, it immediately raises an exception. This is its intended function.

A closer look at block objects

Now that we know all about immediate values, let's turn to block objects. These are represented by a pointer to a C structure with a header and a data block. Slightly simplified, it looks like this:

#define C_uword  unsigned C_word
#define C_header C_uword

typedef struct
{
  C_header header;
  C_word data[];    /* Variable-length array: header determines length */
} C_SCHEME_BLOCK;

The header's bit pattern is broken up into three parts:

The bottom 24 bits encode the size of the object. On 64-bit machines, the bottom 56 bits are used for the size. The middle 4 bits encode the type of the object. The top 4 bits encode special properties to make the garbage collector's work easier:

C_GC_FORWARDING_BIT indicates this object has been forwarded elsewhere. To find the object at its new location, the entire header is shifted to the left (which shifts out this bit). Then, the value is reinterpreted as a pointer. Remember, the lowest two bits of word pointers are always zero, so we can do this with impunity!
C_BYTEBLOCK_BIT indicates this is a byte blob (size bits are interpreted in bytes, not words).
C_SPECIALBLOCK_BIT indicates that the first slot is special and should be skipped by the GC.
C_8ALIGN_BIT indicates that for this object, alignment must be maintained at an 8-byte boundary.
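The forwarding trick can be sketched as follows. This is a simplified model of the description above, not the macros runtime.c actually uses; header_t and the sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t header_t;          /* hypothetical stand-in for C_header */

/* Top bit of the header word, whatever the word size is. */
#define SKETCH_FORWARDING_BIT  (~(header_t)0 ^ (~(header_t)0 >> 1))

static int sketch_is_forwarded(header_t h)
{
  return (h & SKETCH_FORWARDING_BIT) != 0;
}

/* Header value to store in an object that has moved to new_address. */
static header_t sketch_forward_to(header_t new_address)
{
  assert((new_address & 3) == 0);    /* word-aligned: low bits are zero */
  return (new_address >> 1) | SKETCH_FORWARDING_BIT;
}

/* Shift left: the forwarding bit falls off, the address reappears. */
static header_t sketch_forwarded_address(header_t h)
{
  assert(sketch_is_forwarded(h));
  return h << 1;
}
```

Shifting the address right by one bit loses only its lowest bit, which is known to be zero for a word pointer, so the round trip is lossless.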

The type bits are assigned incrementally. There is room for 16 types, only 2 of which are currently unused. Let's look at the definitions, which should also help to explain the practical use of the latter 3 GC bits:

#define C_SYMBOL_TYPE            (0x01000000L)
#define C_STRING_TYPE            (0x02000000L | C_BYTEBLOCK_BIT)
#define C_PAIR_TYPE              (0x03000000L)
#define C_CLOSURE_TYPE           (0x04000000L | C_SPECIALBLOCK_BIT)
#define C_FLONUM_TYPE            (0x05000000L | C_BYTEBLOCK_BIT | C_8ALIGN_BIT)
/*      unused                   (0x06000000L ...) */
#define C_PORT_TYPE              (0x07000000L | C_SPECIALBLOCK_BIT)
#define C_STRUCTURE_TYPE         (0x08000000L)
#define C_POINTER_TYPE           (0x09000000L | C_SPECIALBLOCK_BIT)
#define C_LOCATIVE_TYPE          (0x0a000000L | C_SPECIALBLOCK_BIT)
#define C_TAGGED_POINTER_TYPE    (0x0b000000L | C_SPECIALBLOCK_BIT)
#define C_SWIG_POINTER_TYPE      (0x0c000000L | C_SPECIALBLOCK_BIT)
#define C_LAMBDA_INFO_TYPE       (0x0d000000L | C_BYTEBLOCK_BIT)
/*      unused                   (0x0e000000L ...) */
#define C_BUCKET_TYPE            (0x0f000000L)

Most of the types should be self-explanatory to a seasoned Schemer, but a few things deserve further explanation.

You'll note that in the STRING type tag, C_BYTEBLOCK_BIT is also set, for obvious reasons: strings do not consist of slots containing Scheme values, but of bytes, which are opaque. Because the header's size bits store the length in bytes instead of in words, we can spot a very important limitation: CHICKEN strings can only hold 16 MiB of data on a 32-bit machine (on a 64-bit machine, strings are "limited" to 65536 TiB).
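To see where the 16 MiB limit comes from, here is a small sketch that takes a 32-bit header apart. The type values are the ones from the definitions above. The masks and the position of the byte-block bit are assumptions (GC bits counted from the top bit down, in the order they were listed), and the sketch_ names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit layout only; on 64-bit machines the size field is 56 bits wide. */
typedef uint32_t header32_t;

#define SKETCH_SIZE_MASK      0x00ffffffUL   /* bottom 24 bits */
#define SKETCH_TYPE_MASK      0x0f000000UL   /* middle 4 bits */
/* Assumption: second-highest bit, following the order of the GC bit list. */
#define SKETCH_BYTEBLOCK_BIT  0x40000000UL
#define SKETCH_STRING_TYPE    (0x02000000UL | SKETCH_BYTEBLOCK_BIT)
#define SKETCH_PAIR_TYPE      0x03000000UL

/* Combine type (including GC bits) and size into one header word. */
static header32_t sketch_make_header(header32_t type, header32_t size)
{
  assert(size <= SKETCH_SIZE_MASK);  /* anything bigger would clobber the type */
  return type | size;
}

static header32_t sketch_header_size(header32_t h)
{
  return h & SKETCH_SIZE_MASK;
}

static header32_t sketch_header_type(header32_t h)
{
  return h & SKETCH_TYPE_MASK;
}

static int sketch_is_byteblock(header32_t h)
{
  return (h & SKETCH_BYTEBLOCK_BIT) != 0;
}
```

Under these assumptions the header of "hello" would be 0x42000005: string type, byte-block bit, size 5. The largest size that fits is 0x00ffffff bytes, one byte short of 16 MiB.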

The CLOSURE type uses C_SPECIALBLOCK_BIT. This indicates to the garbage collector that the first slot contains a raw non-Scheme value. In the case of a closure, it contains a pointer to a C function. The other slots contain free variables that were closed over ("captured") by the lambda, which are normal Scheme objects. The compiled C function "knows" which variable lives in which slot.

The FLONUM type uses C_BYTEBLOCK_BIT, because an un-boxed C double value is not a Scheme object: we want to treat the data as an opaque blob. On a 32-bit system, the double will take up two machine words, so we can't use C_SPECIALBLOCK_BIT. The header will therefore hold the value 8 as its size. It also has another GC bit: C_8ALIGN_BIT. This ensures that the 64-bit double is aligned on an 8-byte boundary, to avoid unaligned access on 32-bit systems. This adds some complexity to garbage collection and memory allocation.

The STRUCTURE type refers to a SRFI-9 type of record object. Its slots hold the record's fields, and the accessors and constructors "know" which field is stored at which index.

The POINTER type holds a raw C pointer inside a Scheme object. Again, because C pointers are not Scheme objects, the object's first (and only) slot is treated specially, via C_SPECIALBLOCK_BIT.

The LOCATIVE type represents a rather complicated object. It acts a bit like a pointer into a slab of memory. You can use it as a single value which represents a location inside another block object. This can then be used as an argument to a foreign function that expects a pointer. Its first slot holds a raw pointer. The other slots hold the offset, the type of pointer (encoded as fixnum) and the original object, unless it is a weak reference.

The TAGGED_POINTER type is exactly like POINTER, but it has an extra user-defined tag. This can make it easier for code to identify the pointer's type. The tag is a Scheme value held in its second slot.

The SWIG_POINTER has been removed in CHICKEN 5 and was used for compatibility with SWIG. It is basically the same as POINTER, with additional SWIG data added to it.

The LAMBDA_INFO type stores procedure introspection information (mostly for debugging).

The BUCKET type is a special internal pair-like object which is used in the linked list of symbols under a hash table bucket in the symbol table. It does not count as a reference, so that symbols can be garbage collected when only the symbol table still refers to them.

So far, the only numeric types we've seen are fixnums and flonums. What about the other numeric types? After all, CHICKEN 5 will (finally) have a full numeric tower!

In CHICKEN 5, rational and complex numbers are viewed as two simpler numbers stuck together. They're stored as records with a special tag, which the run-time system recognises. Bignums are a different story altogether. When I first implemented them, they used one of the two unused header types in the list above. For various reasons I won't go into now, they are now also represented as a record with a special tag and a slot that refers to the byte blob containing the actual bignum value. Perhaps this is something for a later blog post.

Putting it all together in the garbage collector

So far, all of this perhaps sounds rather arbitrary and complex. The data representation is finely tuned to fit the garbage collector, and vice versa, so it may help to see how this simplifies the garbage collector.

The way the data representation is set up, the garbage collector only has to perform a few very basic checks. It does not need to know about any of the data types at all: it only needs to look at the special GC bits and the size of an object!

Now we're finally ready to understand the heart of the garbage collector, which scans the live data and marks nested objects. This part of CHICKEN implements the Cheney algorithm. It's only 22 lines of code, without any simplifications. This is taken directly from runtime.c, with comments added for exposition:

/* Mark nested values in already moved (marked) blocks
   in breadth-first manner: */
while(heap_scan_top < (gc_mode == GC_MINOR ? C_fromspace_top : tospace_top)) {
  bp = (C_SCHEME_BLOCK *)heap_scan_top; /* Get next object from queue */

  /* If this word is an alignment hole marker, skip it */
  if(*((C_word *)bp) == ALIGNMENT_HOLE_MARKER)
    bp = (C_SCHEME_BLOCK *)((C_word *)bp + 1);

  n = C_header_size(bp);  /* Extract size bits from header */
  h = bp->header;         /* Remember header for masking other bits */
  bytes = (h & C_BYTEBLOCK_BIT) ? n : n * sizeof(C_word);  /* Size in bytes */
  p = bp->data;           /* Data block (first slot) */

  if(n > 0 && (h & C_BYTEBLOCK_BIT) == 0) { /* Contains slots, not bytes? */
    if(h & C_SPECIALBLOCK_BIT) { /* Skip first word (not a Scheme object) */
      --n;
      ++p;
    }

    while(n--) mark(p++); /* Mark Scheme objects in data slots */
  }

  /* Advance onto next word just after object */
  heap_scan_top = (C_byte *)bp + C_align(bytes) + sizeof(C_word);
}

The comment at the start refers to the fact that the "tip of the iceberg" of live data has already been copied; this code scans that set for nested objects referred to by those live objects. See my post about the garbage collector for more about how the GC and Cheney's algorithm work.

If we're in a minor GC, this code scans over the fromspace, which is the memory area into which the nursery objects will be copied. If we're in a major GC, we're scanning over tospace, which is the other half of the heap, to which the fromspace will be copied.

The code above simply advances the heap_scan_top pointer over the objects we need to look at until we hit the end of this space. It then checks for an ALIGNMENT_HOLE_MARKER, which is a magic value that gets used as a placeholder to indicate that this machine word should be skipped. This placeholder may get inserted when allocating a C_8ALIGN_BIT object, to avoid unaligned access.

Next, the size (in bytes) of the object is determined, based on the C_BYTEBLOCK_BIT. Finally, if it's a data block (C_BYTEBLOCK_BIT is not set), we loop over the data slots. The first word is skipped if it's indicated as "special" via C_SPECIALBLOCK_BIT.
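The same walk can be tried out on a toy heap. The following is a simplified, standalone model of the loop above, assuming a 32-bit layout: it counts slots instead of calling mark(), ignores alignment hole markers, and uses invented names and assumed bit positions for the GC bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t word32_t;           /* toy 32-bit heap word */

#define SKETCH_SIZE_MASK         0x00ffffffUL
#define SKETCH_BYTEBLOCK_BIT     0x40000000UL   /* assumed bit position */
#define SKETCH_SPECIALBLOCK_BIT  0x20000000UL   /* assumed bit position */

/* Round a byte count up to a whole number of words. */
static size_t sketch_align(size_t bytes)
{
  return (bytes + sizeof(word32_t) - 1) & ~(sizeof(word32_t) - 1);
}

/* Walk objects laid out back to back in heap[0..words), counting the
   slots that a real collector would hand to mark().  Returns the number
   of objects seen. */
static size_t sketch_scan(const word32_t *heap, size_t words, size_t *slots)
{
  size_t pos = 0, objects = 0;

  *slots = 0;
  while (pos < words) {
    word32_t h = heap[pos];
    size_t n = h & SKETCH_SIZE_MASK;
    size_t bytes = (h & SKETCH_BYTEBLOCK_BIT) ? n : n * sizeof(word32_t);

    if (n > 0 && (h & SKETCH_BYTEBLOCK_BIT) == 0) {
      if (h & SKETCH_SPECIALBLOCK_BIT) --n;   /* first word is raw data */
      *slots += n;
    }

    /* Step over the header word plus the (aligned) data block. */
    pos += 1 + sketch_align(bytes) / sizeof(word32_t);
    ++objects;
  }
  assert(pos == words);   /* the last object must end exactly at the top */
  return objects;
}
```

Fed a heap holding a pair, the 5-byte string "hello" (padded to two words) and a closure with one captured variable, this visits three objects and counts three slots: two from the pair and one from the closure. The string's bytes are never looked at.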

The mark() call hides the hairy part. It performs the following steps:

Check that the word contains a block object. Otherwise, return because it's an immediate value.
Check that the word points to memory that's being moved, otherwise return. This avoids copying already copied or evicted data.
If the object has the C_GC_FORWARDING_BIT set, just update the marked slot with the new location the object was forwarded to, and return.
If we're on a 32-bit machine, the object to be copied has the C_8ALIGN_BIT set, and the current top of the target heap area is not aligned, insert an ALIGNMENT_HOLE_MARKER.
In case the target area is too small to hold the object, interrupt the current GC and trigger the "next" GC type. This will be a major collection if we're currently doing a minor collection, or a heap reallocating major collection if we're in a regular major collection.
Finally, copy the object via a simple memcpy().

Because this is done by mark() and not by the scanning code shown above, all this is only performed if the object in question is a block object which needs to be copied (the mark() macro inlines the first check). Just scanning the live data is extremely fast. We can thank the data representation's simplicity for that speed!

A (mostly) comprehensive guide to calling C from Scheme and vice versa

2014-10-16T19:05:25Z

When you're writing Scheme code in CHICKEN it's sometimes necessary to make a little excursion to C. For example, you're trying to call a C library, you're writing extremely performance-critical code, or you're working on something that's best expressed in C, such as code that requires a lot of bit-twiddling.

This post contains a lot of code, including generated C code. If you get too tired to absorb it, it's probably best to stop reading and pick it up again later.

A basic example of invoking C code from CHICKEN

This is one of CHICKEN's strengths: the ability to quickly drop down to C for a small bit of code, and return its result to Scheme:

(import foreign)

(define ilen
  (foreign-lambda* long ((unsigned-long x))
    "unsigned long y;\n"
    "long n = 0;\n"
    "#ifdef C_SIXTY_FOUR\n"
    "y = x >> 32; if (y != 0) { n += 32; x = y; }\n"
    "#endif\n"
    "y = x >> 16; if (y != 0) { n += 16; x = y; }\n"
    "y = x >>  8; if (y != 0) { n +=  8; x = y; }\n"
    "y = x >>  4; if (y != 0) { n +=  4; x = y; }\n"
    "y = x >>  2; if (y != 0) { n +=  2; x = y; }\n"
    "y = x >>  1; if (y != 0) C_return(n + 2);\n"
    "C_return(n + x);"))

(print "Please enter a number")
(print "The length of your integer in bits is " (ilen (read)))

This example is taken from a wonderful little book called "Hacker's Delight", by Henry S. Warren. It calculates the number of bits required to represent an unsigned integer (its "length"). By the way, this procedure is provided by the numbers egg as integer-length. The algorithm is implementable in Scheme, but at least a direct translation to Scheme is nowhere near as readable as it is in C:
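For experimenting with the algorithm outside of CHICKEN, the same steps can be written as a plain C function. This is only a sketch: the name sketch_ilen is made up, and the test on ULONG_MAX stands in for CHICKEN's C_SIXTY_FOUR.

```c
#include <assert.h>
#include <limits.h>

/* Number of bits needed to represent x (0 for x == 0). */
static long sketch_ilen(unsigned long x)
{
  unsigned long y;
  long n = 0;

  /* The halving steps below cover at most 64 bits. */
  assert(sizeof(unsigned long) * CHAR_BIT <= 64);

#if ULONG_MAX > 0xffffffffUL
  y = x >> 32; if (y != 0) { n += 32; x = y; }
#endif
  y = x >> 16; if (y != 0) { n += 16; x = y; }
  y = x >>  8; if (y != 0) { n +=  8; x = y; }
  y = x >>  4; if (y != 0) { n +=  4; x = y; }
  y = x >>  2; if (y != 0) { n +=  2; x = y; }
  y = x >>  1; if (y != 0) return n + 2;
  return n + (long)x;
}
```

Each step asks whether anything is left in the upper half of the remaining bits. If so, it records the width of the lower half and continues with the upper half. For example, 255 needs 8 bits and 256 needs 9.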

(define (ilen x)
  (let ((y 0) (n 0))
    (cond-expand
      (64bit
       (set! y (arithmetic-shift x -32))
       (unless (zero? y) (set! n (+ n 32)) (set! x y)))
      (else))
    (set! y (arithmetic-shift x -16))
    (unless (zero? y) (set! n (+ n 16)) (set! x y))
    (set! y (arithmetic-shift x -8))
    (unless (zero? y) (set! n (+ n 8)) (set! x y))
    (set! y (arithmetic-shift x -4))
    (unless (zero? y) (set! n (+ n 4)) (set! x y))
    (set! y (arithmetic-shift x -2))
    (unless (zero? y) (set! n (+ n 2)) (set! x y))
    (set! y (arithmetic-shift x -1))
    (if (not (zero? y)) (+ n 2) (+ n x))))

The Scheme version is also going to be slower than the C version. All in all, plenty of good reasons to prefer integration with C. There's no shame in that: most fast languages forgo "pure" implementations in favour of C for performance reasons. The only difference is that calling C in other languages is often a bit more work.

Analysing the generated code

In this section we'll unveil the internal magic which makes C so easily integrated with Scheme. You can skip this section if you aren't interested in low-level details.

As you might have noticed, the C code in the example above contains one unfamiliar construct: It uses C_return() to return the result. If you inspect the code generated by CHICKEN after compiling it via csc -k test.scm, you'll see that it inserts some magic to convert the C number to a Scheme object. I've added some annotations and indented for readability:

/* Local macro definition to convert returned long to a Scheme object. */
#define return(x) \
  C_cblock C_r = (C_long_to_num(&C_a,(x))); goto C_ret; C_cblockend

/* Prototype declaring the stub procedure as static, returning a
 * C_word (Scheme object) and passing arguments through registers.
 * It's not strictly necessary in this case.
 */
static C_word C_fcall stub7(C_word C_buf, C_word C_a0) C_regparm;

/* The stub function: it gets passed a buffer in which Scheme objects get
 * allocated (C_buf) and the numbered arguments C_a0, C_a1, ... C_an.
 */
C_regparm static C_word C_fcall stub7(C_word C_buf, C_word C_a0)
{
  C_word C_r = C_SCHEME_UNDEFINED, /* Return value, mutated by return() macro */
        *C_a=(C_word*)C_buf;     /* Allocation pointer used by return() macro */

  /* Conversion of input argument from Scheme to C */
  unsigned long x = (unsigned long )C_num_to_unsigned_long(C_a0);

  /* Start of our own code from the foreign-lambda* body, as-is */
  unsigned long y;
  long n = 0;
#ifdef C_SIXTY_FOUR
  y = x >> 32; if (y != 0) { n += 32; x = y; }
#endif
  y = x >> 16; if (y != 0) { n += 16; x = y; }
  y = x >>  8; if (y != 0) { n +=  8; x = y; }
  y = x >>  4; if (y != 0) { n +=  4; x = y; }
  y = x >>  2; if (y != 0) { n +=  2; x = y; }
  y = x >>  1; if (y != 0) C_return(n + 2);
  C_return(n + x);

C_ret: /* Label for goto in the return() macro */
#undef return
  return C_r; /* Regular C return */
}

/* chicken.h contains the following: */
#define C_return(x)              return(x)
#define C_cblock                 do{
#define C_cblockend              }while(0)

In the foreign-lambda*, I used C_return for clarity: I could have just used return with parentheses, which will get expanded by the C preprocessor. This is somewhat confusing: return n + x; will result in an error, whereas return(n+x); will do the same as C_return(n+x);.
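The preprocessor trick is easier to see in isolation. Below is a standalone sketch of the same pattern with hypothetical names (sketch_stub, SKETCH_RETURN, word_t). It tags the result as a fixnum by hand instead of calling C_long_to_num, so it shows only the macro mechanics, not CHICKEN's number conversion.

```c
#include <assert.h>   /* standard headers first: we redefine a keyword below */

typedef long word_t;                 /* hypothetical stand-in for C_word */

/* Plays the role of C_return: expands to a call of whatever function-like
   macro named "return" is in effect at the point of use. */
#define SKETCH_RETURN(x) return(x)

static word_t sketch_stub(long x)
{
  word_t result = 0;

/* Inside the stub, return(...) tags its argument as a fixnum and jumps to
   the exit label instead of leaving the function directly. */
#define return(x) \
  do { result = (word_t)(((unsigned long)(x) << 1) | 1); goto done; } while (0)

  if (x < 0) SKETCH_RETURN(-x);      /* expands via the macro above */
  return(x);                         /* also expands: "return" followed by "(" */

done:
#undef return
  assert(result & 1);                /* every exit path went through the macro */
  return result;                     /* the real C return statement */
}
```

Because the macro is function-like, it only fires when the word return is directly followed by an opening parenthesis. A plain "return x;" inside the stub body would be left alone by the preprocessor and bypass the conversion entirely.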

The return macro calls C_long_to_num, which will construct a Scheme object, which is either a fixnum (small exact integer) or a flonum (floating-point inexact number), depending on the platform and the size of the returned value. Hopefully, in CHICKEN 5 it will be either a fixnum or a bignum - that way, it'll always be an exact integer.

Because these number objects need to get allocated on the stack to integrate with the garbage collector, the calling code needs to set aside enough memory on the stack to fit these objects. That's what the C_buf argument is for: it's a pointer to this area. In CHICKEN, a whole lot of type punning is going on, so it's passed as a regular C_word rather than as a proper pointer, but let's ignore that for now.
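The idea of allocating into a caller-provided buffer can be sketched like this. It is a simplified model with invented names: the header holds only the byte size (the real one also carries type and GC bits), and the 8-byte alignment handling needed on 32-bit systems is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef intptr_t word_t;             /* hypothetical stand-in for C_word */

/* Words needed for a boxed double: one header word plus the raw bytes. */
#define SKETCH_FLONUM_WORDS \
  (1 + (sizeof(double) + sizeof(word_t) - 1) / sizeof(word_t))

/* Carve a boxed double out of the buffer at *a and bump the allocation
   pointer past it.  limit points just past the end of the buffer. */
static word_t *sketch_alloc_flonum(word_t **a, word_t *limit, double d)
{
  word_t *obj = *a;

  assert(obj + SKETCH_FLONUM_WORDS <= limit);   /* caller reserved enough */
  obj[0] = (word_t)sizeof(double);              /* simplified header */
  memcpy(&obj[1], &d, sizeof d);                /* raw bytes of the double */
  *a = obj + SKETCH_FLONUM_WORDS;
  return obj;
}

/* Read the double back out of a boxed object. */
static double sketch_flonum_value(const word_t *obj)
{
  double d;
  memcpy(&d, &obj[1], sizeof d);
  return d;
}
```

A caller would declare something like word_t ab[6], *a = ab; on its own stack and pass &a along, so that the object lives in the caller's frame and a stays pointing at the next free word.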

The stub function above is used to do the actual work, but in order to integrate it into CHICKEN's calling conventions and garbage collector, an additional wrapper function is generated. It corresponds to the actual Scheme "ilen" procedure, and looks like this:

/* ilen in k197 in k194 in k191 */
static void C_ccall f_201(C_word c, C_word t0, C_word t1, C_word t2)
{
  C_word tmp /* Unused */; C_word t3; C_word t4; C_word t5;  /* Temporaries */
  C_word ab[6], *a=ab;     /* Memory area set aside on stack for allocation */

  if(c != 3) C_bad_argc_2(c, 3, t0);     /* Check argument count is correct */

  C_check_for_interrupt; /* Check pending POSIX signals, and thread timeout */

  if(!C_stack_probe(&a)) {   /* Stack full?  Then perform GC and try again. */
    C_save_and_reclaim((void*)tr3, (void*)f_201, 3, t0, t1, t2);
  }
  t3 = C_a_i_bytevector(&a,1,C_fix(4));   /* Needed to have a proper object */
  t4 = C_i_foreign_unsigned_integer_argumentp(t2);   /* Check argument type */
  t5 = t1;                          /* The continuation of the call to ilen */
  /* Call stub7 inline, and pass result to continuation: */
  ((C_proc2)(void*)(*((C_word*)t5+1)))(2, t5, stub7(t3, t4));
}

The comment at the top indicates the name of the Scheme procedure and its location in the CPS-converted Scheme code. The k197 in k194 etc indicate the nesting in the generated continuations, which can sometimes be useful for debugging. These continuations can be seen in the CPS-converted code by compiling with csc -debug 3 test.scm.

Much of the code you might sort-of recognise from the code in my article about the CHICKEN garbage collector: The C_stack_probe() corresponds to that post's fits_on_stack(), and C_save_and_reclaim() combines that post's SCM_save_call() and SCM_minor_GC().

All Scheme procedures get compiled down to C functions which receive their argument count (c) and the closure/continuation from which they're invoked (t0). The closure is passed so they can access local closure variables (not used here), and so that the closure can be re-invoked after a GC. Finally, they receive the continuation of the call (t1) and any procedure arguments (everything after it, here only t2). If a procedure has a variable number of arguments, that will use C's varargs mechanism, which is why passing the argument count to every function is important. If a function is called with too many or too few arguments, this will "just work", even if the arguments are declared in the function prototype like here: the function is invoked correctly, but the stack will contain rubbish instead of the expected arguments. That's why it's important to first check the argument count, and then check whether a GC needs to be performed; otherwise, this rubbish gets saved by C_save_and_reclaim and the GC will attempt to traverse it as if it contained proper Scheme objects, resulting in segfaults or other nasty business.
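The role of the argument count is easy to see in a plain C sketch. The function below is a hypothetical stand-in (it is not CHICKEN code and returns normally instead of calling a continuation), but it shows that with varargs the count is the only thing telling the callee how many arguments it may read:

```c
#include <stdarg.h>

/* Hypothetical stand-in for a procedure with a variable number of
 * arguments.  All arguments after the count must be of type long.
 * Returns -1 when the count itself is bogus. */
long sum_args(int c, ...)
{
    va_list ap;
    long total = 0;
    int i;

    if (c < 0)
        return -1;          /* check the count before touching anything */

    va_start(ap, c);
    for (i = 0; i < c; i++)
        total += va_arg(ap, long);   /* reads exactly c arguments */
    va_end(ap);

    return total;
}
```

If the caller lies about the count, the loop happily reads whatever happens to be there, which is the "rubbish" mentioned above.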

The variable t3 will contain the buffer in which the return value is stored. It is wrapped in a byte vector, because this makes it a first-class object understood by the garbage collector. That's not necessary here, but this code is pretty generic and is also used in cases where it is necessary. The C_word ab[6] declaration sets aside enough memory space to hold a flonum or a fixnum, which need at most 4 words, plus 2 words for the bytevector wrapper. I will explain these details later in a separate post, but let's assume it's OK for now.

The argument type gets checked just before calling the C function. If the argument is not of the correct type, an error is signalled and the function will be aborted. Otherwise, the check simply returns its input, so t4 will contain the same value as t2. Similarly, t1 gets copied as-is to t5. Finally, the continuation gets cast to the correct procedure type (again: a lot of type punning. I will explain this in another post), and invoked with the correct argument count (2), the continuation closure itself, and the return value of the stub function.

Returning complex Scheme objects from C

I've tried to explain above how the basic C types get converted to Scheme objects, but what if we want to get crazy and allocate Scheme objects in C? A simple foreign-lambda* won't suffice, because the compiler has no way of knowing how large a buffer to allocate, and the C function will return, so we'll lose what's on the stack.

To fix that, we have foreign-safe-lambda*, which will allow us to allocate any object on the stack. Before such a function is invoked, a minor garbage collection is triggered to clean the stack and ensure we have plenty of allocation room. Let's look at a simple example. This program displays the list of available network interfaces on a UNIX-like system:

(import foreign)

(foreign-declare "#include \"sys/types.h\"")
(foreign-declare "#include \"sys/socket.h\"")
(foreign-declare "#include \"ifaddrs.h\"")

(define interfaces
  (foreign-safe-lambda* scheme-object ()
    "C_word lst = C_SCHEME_END_OF_LIST, len, str, *a;\n"
    "struct ifaddrs *ifa, *i;\n"
    "\n"
    "if (getifaddrs(&ifa) != 0)\n"
    "  C_return(C_SCHEME_FALSE);\n"
    "\n"
    "for (i = ifa; i != NULL; i = i->ifa_next) {\n"
    "  len = strlen(i->ifa_name);\n"
    "  a = C_alloc(C_SIZEOF_PAIR + C_SIZEOF_STRING(len));\n"
    "  str = C_string(&a, len, i->ifa_name);\n"
    "  lst = C_a_pair(&a, str, lst);\n"
    "}\n"
    "\n"
    "freeifaddrs(ifa);\n"
    "C_return(lst);\n"))

(print "The following interfaces are available: " (interfaces))

This functionality is not available in CHICKEN because it's not very portable (it's not in POSIX), so it's a good example of something you might want to use C for. Please excuse the unSchemely way of error handling by returning #f for now. We'll fix that in the next chapter.

Looking at our definition, the interfaces procedure has no arguments, and it returns a scheme-object. This type indicates to CHICKEN that the returned value is not to be converted but simply used as-is: we'll handle its creation ourselves.

We declare the return value lst, which gets initialised to the empty list, and two temporary variables: len and str, to keep an intermediate string length and to hold the actual CHICKEN string. The variable a is an allocation pointer. Then we have the two variables which hold the start of the linked list of interfaces, ifa, and the current iterator through this list, i.

We retrieve the linked list (returning #f if that fails), and scan through it until we hit the end. For each entry, we take the length of the interface name string and allocate enough room on the stack to hold a pair and a CHICKEN string of that length (C_alloc() is really just alloca()). The C_SIZEOF... macros are very convenient to help us calculate the size of an object without having to know its exact representation in memory. We then create the CHICKEN string using C_string, which is put into the allocated space stored in a, and we create a pair which holds the string in the car and the previous list as its cdr.

These allocating C_a_pair and C_string functions accept a pointer to the allocated space (which itself is a pointer). This means they can advance the pointer's value beyond the object, to the next free position. This is quite nice, because it allows us to call several allocating functions in a row, with the same pointer, and at the end the pointer points past the object that was allocated last. Finally, we release the memory used by the linked list and return the constructed list.
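Here is a small model of that pointer-advancing convention in plain C. The object layout, the SIZEOF macros and the function names are all invented for this sketch; CHICKEN's real representation differs, but the calling pattern is the same:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef intptr_t word;                  /* stand-in for C_word */

/* Made-up layout: one header word, then the payload. */
#define WORDS_FOR_BYTES(n) (((n) + sizeof(word) - 1) / sizeof(word))
#define SIZEOF_STRING(n)   (1 + WORDS_FOR_BYTES(n))
#define SIZEOF_PAIR        3            /* header + car + cdr */

/* Both constructors place the object at *a and then move *a past it. */
word *make_string(word **a, size_t len, const char *s)
{
    word *obj = *a;
    obj[0] = (word)len;                 /* header: length in bytes */
    memcpy(&obj[1], s, len);
    *a += SIZEOF_STRING(len);
    return obj;
}

word *make_pair(word **a, word *car, word *cdr)
{
    word *obj = *a;
    obj[0] = 2;                         /* header: two slots */
    obj[1] = (word)car;
    obj[2] = (word)cdr;
    *a += SIZEOF_PAIR;
    return obj;
}
```

Calling make_string(&a, ...) followed by make_pair(&a, ...) on one buffer lays the two objects out back to back, with a ending up just past the pair.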

Analysing the generated code

Like before, if you're not interested in the details, feel free to skip this section.

The interfaces foreign code itself compiles down to this function:

/* Like before, but no conversion because we "return" a native object: */
#define return(x) C_cblock C_r = (((C_word)(x))); goto C_ret; C_cblockend

/* The prototype _is_ necessary in this case: it declares the function
 * as never returning via C_noret, which maps to __attribute__((noreturn)).
 */
static void C_ccall stub6(C_word C_c, C_word C_self,
                          C_word C_k, C_word C_buf) C_noret;

/* The arguments to the stub function now include the argument count,
 * the closure itself and the continuation in addition to the buffer
 * and arguments (none here).  This is a truly "native" CHICKEN function!
 */
static void C_ccall stub6(C_word C_c, C_word C_self, C_word C_k, C_word C_buf)
{
  C_word C_r = C_SCHEME_UNDEFINED,
        *C_a = (C_word *)C_buf;

  /* Save callback depth; needed if we want to call Scheme functions */
  int C_level = C_save_callback_continuation(&C_a, C_k);

  /* Start of our own code, as-is: */
  struct ifaddrs *ifa, *i;
  C_word lst = C_SCHEME_END_OF_LIST, len, str, *a;

  if (getifaddrs(&ifa) != 0)
    C_return(C_SCHEME_FALSE);

  for (i = ifa; i != NULL; i = i->ifa_next) {
    len = strlen(i->ifa_name);
    a = C_alloc(C_SIZEOF_PAIR + C_SIZEOF_STRING(len));
    str = C_string(&a, len, i->ifa_name);
    lst = C_a_pair(&a, str, lst);
  }

  freeifaddrs(ifa);
  C_return(lst);

C_ret:
#undef return

  /* Pop continuation off callback stack. */
  C_k = C_restore_callback_continuation2(C_level);

  C_kontinue(C_k, C_r); /* Pass return value to continuation. */
}

This is not much different from the foreign-lambda* example, but notice that the arguments are different. This stub looks exactly like the C code generated from an actual Scheme continuation: it gets passed the argument count, its own closure and its continuation. Instead of ending with a regular return from C, it invokes a continuation. This is the crucial difference which integrates our code with the garbage collector: by passing it to the next continuation's C function, the "returned" value is preserved on the stack. In other words, it is allocated directly in the nursery.

Even though the stub is a "native" Scheme procedure, a wrapper is still generated: if the foreign-safe-lambda is defined to accept C arguments, it'll still need to convert from Scheme objects, it needs to check the argument count, and it needs to invoke the GC before the procedure can be called:

/* interfaces in k197 in k194 in k191 */
static void C_ccall f_201(C_word c, C_word t0, C_word t1){
  /* This is the function that corresponds to the Scheme procedure.
   * This is the first stage of the procedure: we invoke the GC with
   * a continuation which will do conversions and call the C stub.
   */
  C_word tmp; C_word t2; C_word t3;
  C_word ab[3], *a = ab;

  /* As before: */
  if (c!=2) C_bad_argc_2(c, 2, t0);

  C_check_for_interrupt;

  if (!C_stack_probe(&a)) {
    C_save_and_reclaim((void*)tr2,(void*)f_201,2,t0,t1);
  }

  /* Create the continuation which will be invoked after GC: */
  t2 = (*a = C_CLOSURE_TYPE|2, /* A closure of size two: */
        a[1] = (C_word)f_205,  /* Second stage function of our wrapper, */
        a[2] = t1,             /* and continuation of call to (interfaces). */
        tmp = (C_word)a,       /* Current value of "a" must be stored in t2...*/
        a += 3,                /* ... but "a" itself gets advanced... */
        tmp);                  /* ... luckily tmp holds original value of a. */

  C_trace("test.scm:8: ##sys#gc"); /* Trace call chain */

  /* lf[1] contains the symbol ##sys#gc.  This invokes its procedure. */
  ((C_proc3)C_fast_retrieve_symbol_proc(lf[1]))(3, *((C_word*)lf[1]+1),
                                                t2, C_SCHEME_FALSE);
}

/* k203 in interfaces in k197 in k194 in k191 */
static void C_ccall f_205(C_word c, C_word t0, C_word t1)
{
  /* This function gets invoked from the GC triggered by the above function,
   * and is the second stage of our wrapper function.  It is similar to the
   * wrapper from the first example of a regular foreign-lambda.
   */
  C_word tmp; C_word t2; C_word t3; C_word t4;
  /* Enough room for a closure of 2 words (total size 3) and a bytevector
   * of 3 words (total size 4).  This adds up to 7; the remaining 1 is to
   * make room for a possible alignment of the bytevector on 32-bit platforms.
   */
  C_word ab[8], *a=ab;

  C_check_for_interrupt;

  if (!C_stack_probe(&a)) {
    C_save_and_reclaim((void*)tr2, (void*)f_205, 2, t0, t1);
  }

  t2 = C_a_i_bytevector(&a, 1, C_fix(3)); /* Room for one pair */

  t3 = (*a = C_CLOSURE_TYPE|2, /* Create a closure of size 2: */
        a[1] = (C_word)stub6,  /* Our foreign-safe-lambda stub function, */
        a[2] = ((C_word)li0),  /* and static lambda-info for same (unused). */
        tmp = (C_word)a,       /* Update "a" and return original value, */
        a += 3,                /* exactly like we did in f_201. */
        tmp);
	
  /* Trace procedure name generated by (gensym). Kind of useless :) */
  C_trace("test.scm:8: g9");

  t4 = t3; /* Compilation artefact; don't worry about it */

  /* Retrieve procedure from closure we just created, and call it,
   * with 3 arguments: itself (t4), the continuation of the call
   * to "interfaces" (t0[2]), and the bytevector buffer (t2).
   */
  ((C_proc3)C_fast_retrieve_proc(t4))(3, t4, ((C_word*)t0)[2], t2);
}

Our foreign-safe-lambda's wrapper function now consists of two stages. The first stage creates a continuation for the usual wrapper function. Then it calls the garbage collector to clear the stack, after which this wrapper-continuation is invoked. This wrapper is the second function here, and it corresponds closely to the wrapper function we saw in the ilen example. However, this wrapper constructs a closure around the C stub function instead of simply calling it. This closure is then called: C_fast_retrieve_proc simply extracts the function from the closure object we just created, which is then cast to a 3-argument procedure type and invoked with the continuation of the interfaces call site.

You can see how closures are created in CHICKEN. I will explain this in depth in a future blog post, but the basic approach is pretty clever: the whole thing is one big C expression which stores successive words at the free slots in the allocated space a, while ensuring that after the expression a will point at the next free word. The dance with tmp ensures that the whole expression which allocates the closure results in the initial value of a. That initial value was the first free slot before we executed the expression, and afterwards it holds the closure. Don't worry if this confuses you :)
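If the comma-expression trick is hard to read in the generated code, this stripped-down version in plain C may help. The header tag, the helper name and the captured value are made up, and stuffing a function pointer into an integer word is the same kind of type punning CHICKEN does (it works on common platforms, but ISO C doesn't promise it):

```c
#include <stdint.h>

typedef intptr_t word;                   /* stand-in for C_word */
#define CLOSURE_TYPE ((word)1 << 24)     /* made-up header tag */

word add_one(word x) { return x + 1; }

/* Builds a 2-slot closure at *ap and advances *ap past it.
 * Returns the closure, i.e. the address where it starts. */
word make_closure(word **ap, word (*fn)(word), word captured)
{
    word *a = *ap, tmp, clo;

    clo = (*a = CLOSURE_TYPE | 2,  /* header: closure with 2 slots */
           a[1] = (word)fn,        /* slot 1: the C function */
           a[2] = captured,        /* slot 2: a closed-over value */
           tmp = (word)a,          /* remember where the object starts */
           a += 3,                 /* bump the allocation pointer */
           tmp);                   /* value of the whole expression */

    *ap = a;
    return clo;
}
```

The value of a comma expression is its last operand, which is why tmp has to come last.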

Calling Scheme from C

Now, with the basics out of the way, let's do something funkier: instead of calling C from Scheme, we call Scheme from C! There is a C API for embedding CHICKEN in a larger C program, but that's not what you should use when calling Scheme from C code that was itself called from Scheme.

The "easy" way

Our little interfaces-listing program has one theoretical flaw: the list of interfaces could be very long (or the names could be long), so we may theoretically run out of stack space. So, we should avoid allocating unbounded lists directly on the stack without checking for overflow. Instead, let's pass the allocated objects to a callback procedure which prints the interface, in a "streaming" fashion.

As I explained before, a regular foreign-lambda uses the C stack in the regular way: it doesn't know about continuations or the Cheney on the MTA garbage collection style. There's no way to call CHICKEN functions from there, because the GC would "collect" away the C function by longjmp()ing past it. However, the foreign-safe-lambda has a special provision for that: it can "lock" the current live data by putting a barrier between this C function and the Scheme code it calls:

(import foreign)

(foreign-declare "#include \"sys/types.h\"")
(foreign-declare "#include \"sys/socket.h\"")
(foreign-declare "#include \"ifaddrs.h\"")

(define interfaces
  (foreign-safe-lambda* scheme-object ((scheme-object receiver))
    "C_word len, str, *a;\n"
    "struct ifaddrs *ifa, *i;\n"
    "\n"
    "if (getifaddrs(&ifa) != 0)\n"
    "  C_return(C_SCHEME_UNDEFINED);\n"
    "\n"
    "for (i = ifa; i != NULL; i = i->ifa_next) {\n"
    "  len = strlen(i->ifa_name);\n"
    "  a = C_alloc(C_SIZEOF_STRING(len));\n"
    "  str = C_string(&a, len, i->ifa_name);\n"
    "  C_save(str);\n"
    "  C_callback(receiver, 1);\n"
    "}\n"
    "\n"
    "freeifaddrs(ifa);\n"
    "C_return(C_SCHEME_UNDEFINED);\n"))

(print "The following interfaces are available: ")
(interfaces print)

This will display the interfaces one line at a time, by using CHICKEN's print procedure as the callback.

We won't look at the compiled source code for this implementation, because it is identical to the earlier one, except for the changed foreign-lambda body. The implementation of C_callback() is of interest, but it is a little hairy, so I'll leave it to you to explore it yourself.

The basic idea is rather simple, though: it simply calls setjmp() to establish a new garbage collection trampoline. This means that the foreign-lambda will always remain on the stack. The callback is then invoked with a continuation which sets a flag to indicate that the callback has returned normally, in which case its result will be returned to the foreign-lambda. If it didn't return normally, we arrived at the trampoline because a GC was triggered. This means the remembered continuation will be re-invoked, as usual.

However, when the callback did return normally, we can simply return the returned value because the foreign-lambda's stack frame is still available due to the GC barrier we set up.
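As a toy model of that flag, here is a plain C sketch. It is not how C_callback() is written and all names are invented; it only shows how code sitting at a setjmp() point can tell a normal return apart from control arriving there through longjmp():

```c
#include <setjmp.h>

jmp_buf barrier;                 /* stand-in for the new GC trampoline */

void returns_normally(void) { }
void escapes(void) { longjmp(barrier, 1); }  /* like a GC jumping back */

/* Returns 1 if cb came back through a normal C return, or 0 if we
 * ended up at the barrier again because of a longjmp(). */
int call_with_barrier(void (*cb)(void))
{
    volatile int returned = 0;   /* volatile: must survive the longjmp */

    if (setjmp(barrier) == 0) {
        cb();
        returned = 1;            /* only reached on a normal return */
    }
    return returned;
}
```

In the real thing, the "0" case doesn't give up: it re-invokes the remembered continuation from the trampoline.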

The C_save macro simply saves the callback's arguments on a special stack which is read by C_do_apply. C_save is also used by callback_return_continuation, which saves the value and triggers a GC to force the returned value into the heap. That way, we can return it safely to the previous stack frame without it getting clobbered by the next allocation.

A harder way

The above code has another flaw: if the callback raises an exception, the current exception handler will be invoked with the continuation where it was established. However, that might never return to the callback, which means we have a memory leak on our hands!

If the callback doesn't return normally, the foreign-lambda will remain on the stack forever. How do we avoid that little problem? The simplest is of course to wrap the callback's code in handle-exceptions or condition-case. However, that's no fun at all.

Besides, in real-world code we want to avoid the overhead of a GC every single time we invoke a C function, so foreign-safe-lambda is not really suitable for functions that are called in a tight loop. In such cases, there is only one way: to deeply integrate in CHICKEN and write a completely native procedure! Because truly native procedures must call a continuation when they want to pass a result somewhere, we'll have to chop up the functionality into three procedures:

(import foreign)
(use lolevel)     ; For "location"

(foreign-declare "#include \"sys/types.h\"")
(foreign-declare "#include \"sys/socket.h\"")
(foreign-declare "#include \"ifaddrs.h\"")

(define grab-ifa!
  (foreign-lambda* void (((c-pointer (c-pointer "struct ifaddrs")) ifa))
    "if (getifaddrs(ifa) != 0)\n"
    "  *ifa = NULL;\n"))

(define free-ifa!
  (foreign-lambda* void (((c-pointer (c-pointer "struct ifaddrs")) ifa))
    "freeifaddrs(*ifa);\n"))

(define next-ifa
  (foreign-primitive (((c-pointer (c-pointer "struct ifaddrs")) ifa))
    "C_word len, str, *a;\n"
    "\n"
    "if (*ifa) {\n"
    "  len = strlen((*ifa)->ifa_name);\n"
    "  a = C_alloc(C_SIZEOF_STRING(len));\n"
    "  str = C_string(&a, len, (*ifa)->ifa_name);\n"
    "  *ifa = (*ifa)->ifa_next;\n"
    "  C_kontinue(C_k, str);\n"
    "} else {\n"
    "  C_kontinue(C_k, C_SCHEME_FALSE);\n"
    "}"))

(define (interfaces)
  ;; Use a pointer which the C function mutates.  We could also
  ;; return two values(!) from the "next-ifa" foreign-primitive,
  ;; but that complicates the code flow a little bit more.
  ;; Sorry about the ugliness of this!
  (let-location ((ifa (c-pointer "struct ifaddrs"))
                 (i (c-pointer "struct ifaddrs")))
    (grab-ifa! (location ifa))
    (unless ifa (error "Could not allocate ifaddrs"))
    (set! i ifa)

    (handle-exceptions exn
      (begin (free-ifa! (location ifa))      ; Prevent memory leak, and
             (signal exn))                   ; re-raise the exception
      (let lp ((result '()))
        (cond ((next-ifa (location i)) =>
               (lambda (iface)
                 (lp (cons iface result))))
              (else
               (free-ifa! (location ifa))
               result))))))

;; We're once again back to constructing a list!
(print "The following interfaces are available: " (interfaces))

This compiles to something very similar to the code behind a foreign-safe-lambda, but it's obviously going to be a lot bigger due to it being cut up, so I won't duplicate the C code here. Remember, you can always inspect it yourself with csc -k.

Anyway, this is like the foreign-safe-lambda, but without the implicit GC. Also, instead of "returning" the value through C_return() we explicitly call the continuation C_k through the C_kontinue() macro, with the value we want to pass on to the cond. We're free to do whatever Scheme can do, so we could even return multiple values by using the C_values() macro instead, as long as the continuation accepts them.
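Why multiple values come almost for free is easiest to see in a plain C sketch of continuation passing. Nothing here is CHICKEN API (the type and function names are invented, and C_values() itself is not used); the point is only that "returning" becomes "calling the continuation", so two results are just two arguments:

```c
/* Hypothetical continuation type which accepts two values. */
typedef void (*cont2)(long quotient, long remainder);

long last_quotient, last_remainder;

/* A continuation that just remembers what it was given. */
void remember(long quotient, long remainder)
{
    last_quotient = quotient;
    last_remainder = remainder;
}

/* "Returns" two values by passing both to its continuation k. */
void divmod_k(long n, long d, cont2 k)
{
    k(n / d, n % d);
}
```

After divmod_k(17, 5, remember), the two globals hold 3 and 2.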

If an exception happens anywhere in this code, we won't get a memory leak due to the stack being blown up. However, like in any C code, we need to free up the memory behind the interface addresses. So we can't really escape our cleanup duty!

You might think that there's one more problem with foreign-primitive: because it doesn't force a GC before calling the C function, there's still no guarantee about how much space you still have on the stack. Luckily, CHICKEN has a C_STACK_RESERVE, which defines how much space is guaranteed to be left on the stack after each C_demand(). Its value is currently 0x10000 (i.e., 64 KiB), which means you have some headroom to do basic allocations like we do here, but you shouldn't allocate too many huge objects. There are ways around that, but unfortunately not using the "official" FFI (that I'm aware of, anyway). For now we'll stick with the official Scheme API.

The die-hard way: calling Scheme closures from C

So far, we've discussed pretty much only things you can find in the CHICKEN manual's section on the FFI. Let's take a look at how we can do things a little differently, and instead of passing the string or #f to a continuation, we pass the callback as a procedure again, just like we did for the "easy" way:

(import foreign)
(use lolevel)

(foreign-declare "#include \"sys/types.h\"")
(foreign-declare "#include \"sys/socket.h\"")
(foreign-declare "#include \"ifaddrs.h\"")

(define grab-ifa!
  (foreign-lambda* void (((c-pointer (c-pointer "struct ifaddrs")) ifa))
    "if (getifaddrs(ifa) != 0)\n"
    "  *ifa = NULL;\n"))

(define free-ifa!
  (foreign-lambda* void (((c-pointer (c-pointer "struct ifaddrs")) ifa))
    "freeifaddrs(*ifa);\n"))

(define next-ifa
  (foreign-primitive (((c-pointer (c-pointer "struct ifaddrs")) ifa)
                      (scheme-object more) (scheme-object done))
    "C_word len, str, *a;\n"
    "\n"
    "if (*ifa) {\n"
    "  len = strlen((*ifa)->ifa_name);\n"
    "  a = C_alloc(C_SIZEOF_STRING(len));\n"
    "  str = C_string(&a, len, (*ifa)->ifa_name);\n"
    "  *ifa = (*ifa)->ifa_next;\n"
    "  ((C_proc3)C_fast_retrieve_proc(more))(3, more, C_k, str);\n"
    ;; Alternatively:
    ;; "  C_save(str); \n"
    ;; "  C_do_apply(2, more, C_k); \n"
    ;; Or, if we want to call Scheme's APPLY directly (slower):
    ;; "  C_apply(5, C_SCHEME_UNDEFINED, C_k, more, \n"
    ;; "          str, C_SCHEME_END_OF_LIST); \n"
    "} else {\n"
    "  ((C_proc2)C_fast_retrieve_proc(done))(2, done, C_k);\n"
    ;; Alternatively:
    ;; "  C_do_apply(0, done, C_k); \n"
    ;; Or:
    ;; "  C_apply(4, C_SCHEME_UNDEFINED, C_k, done, C_SCHEME_END_OF_LIST);\n"
    "}"))

(define (interfaces)
  (let-location ((ifa (c-pointer "struct ifaddrs"))
                 (i (c-pointer "struct ifaddrs")))
    (grab-ifa! (location ifa))
    (unless ifa (error "Could not allocate ifaddrs"))
    (set! i ifa)

    (handle-exceptions exn
      (begin (free-ifa! (location ifa))
             (signal exn))
      (let lp ((result '()))
        (next-ifa (location i)
                  (lambda (iface)               ; more
                    (lp (cons iface result)))
                  (lambda ()                    ; done
                    (free-ifa! (location ifa))
                    result))))))

(print "The following interfaces are available: " (interfaces))

The magic lies in the expression ((C_proc3)C_fast_retrieve_proc(more))(3, more, C_k, str). We've seen something like it before in generated C code snippets: First, it extracts the C function pointer from the closure object in more. Then, the function pointer is cast to the correct type; C_proc3 refers to a procedure which accepts three arguments. This excludes the argument count, which actually is the first argument in the call. The next argument is the closure itself, which is needed when the closure has local variables it refers to (like result and lp in the example). The argument after the closure is its continuation. We just pass on C_k: the final continuation of both more and done is the continuation of lp, which is also the continuation of next-ifa. Finally, the arguments following the continuation are the values passed as arguments: iface for the more closure.

The done closure is invoked as C_proc2 with only itself and the continuation, but no further arguments. This corresponds to the fact that done is just a thunk.
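The calling convention can be mimicked in plain C. Everything below is a made-up miniature (closures are bare arrays with the function in slot 1, and unlike real CHICKEN procedures these functions return normally), but the shape of the call is the same: argument count first, then the closure itself, then the continuation, then the arguments:

```c
#include <stdint.h>

typedef intptr_t word;                         /* stand-in for C_word */
typedef void (*proc2)(int c, word a1, word a2);
typedef void (*proc3)(int c, word a1, word a2, word a3);

#define RETRIEVE_PROC(clo) (((word *)(clo))[1])  /* slot 1: C function */

word result;             /* where the toy continuation leaves its value */

/* Toy continuation: receives itself and one value. */
void k_store(int c, word self, word value)
{
    (void)c; (void)self;
    result = value;
}

/* Toy "more" closure: adds the value captured in slot 2 to its
 * argument and passes the sum on to its continuation k. */
void more_fn(int c, word self, word k, word arg)
{
    word sum = ((word *)self)[2] + arg;
    (void)c;
    ((proc2)RETRIEVE_PROC(k))(2, k, sum);
}

void call_more(word more, word k, word arg)
{
    ((proc3)RETRIEVE_PROC(more))(3, more, k, arg);
}
```

Note how more_fn needs self to get at its captured value, which is exactly why the closure is passed to its own function.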

I've shown two alternative ways to call the closure. The first is to call the closure through the C_do_apply function. This is basically a dispatcher which checks the argument count and uses the correct C_proc<n> cast and then calls it with the arguments, taken from a temporary stack on which C_save places the arguments. The implementation behind it is positively insane, and worth checking out for the sheer madness of it.
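A much tamer sketch of the dispatcher idea, in plain C: arguments are pushed on a temporary stack, and the dispatcher picks the function pointer cast that matches the count. The names, the argument order on the stack and the fact that these functions return a value are all assumptions of this sketch, not a description of C_do_apply:

```c
typedef long word;
typedef void (*generic_fn)(void);
typedef word (*fn0)(void);
typedef word (*fn1)(word);
typedef word (*fn2)(word, word);

word temp_stack[8];      /* temporary argument stack */
int  temp_top = 0;

void save(word x) { temp_stack[temp_top++] = x; }

/* Pops n saved arguments and calls fn through the matching cast.
 * Returns -1 for argument counts this sketch doesn't support. */
word do_apply(int n, generic_fn fn)
{
    word a, b;

    switch (n) {
    case 0:
        return ((fn0)fn)();
    case 1:
        a = temp_stack[--temp_top];
        return ((fn1)fn)(a);
    case 2:
        b = temp_stack[--temp_top];   /* last saved is last argument */
        a = temp_stack[--temp_top];
        return ((fn2)fn)(a, b);
    default:
        return -1;
    }
}

word answer(void)               { return 42; }
word negate(word x)             { return -x; }
word subtract(word x, word y)   { return x - y; }
```

The real dispatcher has to cover far more argument counts than three cases, which goes some way towards explaining the madness.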

The second alternative is to use C_apply, which is the C implementation of Scheme's apply procedure. It's a bit awkward to call from C, because this procedure is a true Scheme procedure. That means it accepts an argument count, itself and its continuation and only then its arguments, which are the closure and the arguments to pass to the closure, with the final argument being a list:

(apply + 1 2 '(3 4)) => 10

In C this would be:

C_apply(6, C_SCHEME_UNDEFINED, C_k, C_closure(&a, 1, C_plus),
        C_fix(1), C_fix(2), C_list2(C_fix(3), C_fix(4)));

It also checks its arguments, so if you pass something that's not a list as its final argument, it raises a nice exception:

(import foreign)
((foreign-primitive ()
   "C_word ab[C_SIZEOF_CLOSURE(1)], *a = ab; \n"
   "C_apply(4, C_SCHEME_UNDEFINED, C_k, "
   "        C_closure(&a, 1, (C_word)C_plus), C_fix(1));"))

This program prints the following when executed:

 Error: (apply) bad argument type: 1
         Call history:
         test.scm:2: g11         <--

And this brings us to our final example, where we go absolutely crazy.

The guru way: Calling Scheme closures you didn't receive

You might have noticed that the error message above appears without us passing the error procedure to +, and if you had wrapped the call in an exception handler it would've called its continuation, without us passing it to the procedure. In some situations you might like to avoid boring the user with passing some procedure to handle some exceptional situation. Let's see if we can do something like that ourselves!

It turns out to be pretty easy:

(import foreign)
(use lolevel)

(foreign-declare "#include \"sys/types.h\"")
(foreign-declare "#include \"sys/socket.h\"")
(foreign-declare "#include \"ifaddrs.h\"")

(define grab-ifa!
  (foreign-lambda* void (((c-pointer (c-pointer "struct ifaddrs")) ifa))
    "if (getifaddrs(ifa) != 0)\n"
    "  *ifa = NULL;\n"))

(define free-ifa!
  (foreign-lambda* void (((c-pointer (c-pointer "struct ifaddrs")) ifa))
    "freeifaddrs(*ifa);\n"))

(define (show-iface-name x)
  (print x)
  #t)

(define next-ifa
  (foreign-primitive (((c-pointer (c-pointer "struct ifaddrs")) ifa))
    "C_word len, str, *a, show_sym, show_proc;\n"
    "\n"
    "if (*ifa) {\n"
    "  len = strlen((*ifa)->ifa_name);\n"
    "  a = C_alloc(C_SIZEOF_INTERNED_SYMBOL(15) + C_SIZEOF_STRING(len));\n"
    "  str = C_string(&a, len, (*ifa)->ifa_name);\n"
    "  *ifa = (*ifa)->ifa_next;\n"
    ;; The new bit:
    "  show_sym = C_intern2(&a, C_text(\"show-iface-name\"));\n"
    "  show_proc = C_block_item(show_sym, 0);\n"
    "  ((C_proc3)C_fast_retrieve_proc(show_proc))(3, show_proc, C_k, str);\n"
    "} else {\n"
    "  C_kontinue(C_k, C_SCHEME_FALSE);\n"
    "}"))

(define (interfaces)
  (let-location ((ifa (c-pointer "struct ifaddrs"))
                 (i (c-pointer "struct ifaddrs")))
    (grab-ifa! (location ifa))
    (unless ifa (error "Could not allocate ifaddrs"))
    (set! i ifa)

    (handle-exceptions exn
      (begin (free-ifa! (location ifa))
             (signal exn))
      (let lp ()
        ;; next-ifa now returns true if it printed an interface and is
        ;; ready to get the next one, or false if it reached the end.
        (if (next-ifa (location i))
            (lp)
            (free-ifa! (location ifa)))))))

(print "The following interfaces are available: ")
(interfaces)

This uses C_intern2 to look up the symbol for "show-iface-name" in the symbol table (or intern it if it didn't exist yet). We store this in show_sym. Then, we look at the symbol's first slot, where the value is stored for the global variable identified by the symbol. The value slot always exists, but if it is undefined, the value is C_SCHEME_UNDEFINED. Anyway, we assume it's defined and we call it like we did in the example before this one: extract the first slot from the closure and call it.
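The lookup-by-name idea boils down to a symbol table in which every symbol carries a value slot. Below is a deliberately naive model in plain C (a fixed-size array with linear search, and NULL playing the role of the undefined value); the names are invented and it has nothing to do with how CHICKEN's symbol table is actually laid out:

```c
#include <stddef.h>
#include <string.h>

typedef long (*proc)(long);

struct symbol {
    const char *name;    /* not copied: the caller keeps it alive */
    proc        value;   /* "value slot"; NULL means undefined */
};

#define MAX_SYMBOLS 16
struct symbol table[MAX_SYMBOLS];
size_t n_symbols = 0;

/* Returns the symbol with this name, adding it with an empty value
 * slot if it didn't exist yet.  Returns NULL when the table is full. */
struct symbol *intern(const char *name)
{
    size_t i;

    for (i = 0; i < n_symbols; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];

    if (n_symbols == MAX_SYMBOLS)
        return NULL;

    table[n_symbols].name = name;
    table[n_symbols].value = NULL;
    return &table[n_symbols++];
}

long twice(long x) { return 2 * x; }
```

Once some code has stored twice in the value slot of the symbol named "twice", any other code can call it via intern("twice")->value(21) without ever having been handed the procedure.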

This particular example isn't very useful, but the technique can be used to invoke hook procedures, and in fact the core itself uses it from barf() when it invokes ##sys#error-hook to construct and raise an exception when an error situation occurs in the C runtime.

CHICKEN internals: the garbage collector

2014-03-29T10:58:20Z

One of CHICKEN's coolest features has to be its unique approach to garbage collection. When someone asked about implementation details (hi, Arthur!), I knew this would make for an interesting blog post. This post is going to be long and technical, so hold on to your hats! Don't worry if you don't get through this in one sitting.

Prerequisites

There's a whole lot of stuff that we'll need to explain before we get to the actual garbage collector. CHICKEN's garbage collection (GC) strategy is deeply intertwined with its compilation strategy, so I'll start by explaining the basics of that, before we can continue (pun intended) with the actual GC stuff.

A short introduction to continuation-passing style

The essence of CHICKEN's design is a simple yet brilliant idea by Henry Baker, described in his paper CONS Should Not CONS Its Arguments, Part II: Cheney on the M.T.A.. The paper is pretty terse, but it's well-written, so I recommend you check it out before reading on. If you grok everything in it, you probably won't get much out of my blog post and you can stop reading now. If you don't grok it, it's probably a good idea to re-read it later.

Baker's approach assumes a Scheme to C compiler which uses continuation-passing style (CPS) as an internal representation. This is the quintessential internal representation of Scheme programs, going all the way back to the first proper Scheme compiler, RABBIT.

Guy L. Steele (RABBIT's author) did not use CPS to make garbage collection easier. In fact, RABBIT had no GC of its own, as it relied on MacLISP as a target language (which compiled to PDP-10 machine code and had its own garbage collector). Instead, continuations allowed for efficient implementation of nested procedure calls. It eliminated the need for a stack to keep track of this nesting by simply returning the "next thing to do" to a driver loop which took care of invoking it. This made it possible to write down iterative algorithms as a recursive function without causing a stack overflow.

Let's consider a silly program which sums up all the numbers in a list, and shows the result multiplied by two:

(define (calculate-sum lst result)
  (if (null? lst)
      result
      (calculate-sum (cdr lst) (+ result (car lst)))))

(define (show-sum lst)
  (print-number (* 2 (calculate-sum lst 0))))

(show-sum '(1 2 3))

A naive compilation to C would look something like this (brutally simplified):

void entry_point() {
  toplevel();
  exit(0); /* Assume exit(1) is explicitly called elsewhere in case of error. */
}

void toplevel() {
  /* SCM_make_list() & SCM_fx() allocate memory.  "fx" stands for "fixnum". */
  SCM_obj *lst = SCM_make_list(3, SCM_fx(1), SCM_fx(2), SCM_fx(3));
  show_sum(lst);
}

SCM_obj* show_sum(SCM_obj *lst) {
  SCM_obj *result = calculate_sum(lst, SCM_fx(0));
  /* SCM_fx_times() allocates memory. */
  return SCM_print_number(SCM_fx_times(SCM_fx(2), result));
}

SCM_obj* calculate_sum(SCM_obj *lst, SCM_obj *result) {
  if (lst == SCM_NIL) { /* Optimised */
    return result;
  } else {
    /* SCM_fx_plus() allocates memory. */
    SCM_obj *tmp = SCM_cdr(lst);
    SCM_obj *tmp2 = SCM_fx_plus(result, SCM_car(lst));
    return calculate_sum(tmp, tmp2); /* Recur */
  }
}

SCM_obj *SCM_print_number(SCM_obj *data) {
  printf("%d\n", SCM_fx_to_integer(data));
  return SCM_VOID;
}

This particular implementation probably can't use a copying garbage collector like CHICKEN uses, because the SCM_obj pointers which store the Scheme objects' locations would all become invalid. But let's ignore that for now.

Due to the recursive call in calculate_sum(), the stack just keeps growing, and eventually we'll get a stack overflow if the list is too long. Steele argued that this is a silly limitation which results in the proliferation of special-purpose "iteration" constructs found in most languages. Also, he was convinced that this just cramps the programmer's style: we shouldn't have to think about implementation details like the stack size. In his time people often used goto instead of function calls as a performance hack. This annoyed him enough to write a rant about it, which should be required reading for all would-be language designers!

Anyway, a compiler can transparently convert our Scheme program into CPS, which would look something like this after translation to C:

/* Set up initial continuation & toplevel call. */
void entry_point() {
  SCM_cont *cont = SCM_make_cont(0, &toplevel, SCM_exit_continuation);
  SCM_call *call = SCM_make_call(0, cont);
  SCM_driver_loop(call);
}

void SCM_driver_loop(SCM_call *call) {
  /* The trampoline to which every function returns its continuation. */
  while(true)
    call = SCM_perform_continuation_call(call);
}

SCM_call *toplevel(SCM_cont *cont) {
  SCM_cont *next = SCM_make_cont(1, &show_sum, cont);
  SCM_obj *lst = SCM_make_list(3, SCM_fx(1), SCM_fx(2), SCM_fx(3));
  return SCM_make_call(1, next, lst);
}

SCM_call *show_sum(SCM_cont *cont, SCM_obj *lst) {
  SCM_cont *next = SCM_make_cont(1, &show_sum_continued, cont);
  SCM_cont *now = SCM_make_cont(2, &calculate_sum, next);
  return SCM_make_call(2, now, lst, SCM_fx(0));
}

SCM_call *calculate_sum(SCM_cont *cont, SCM_obj *lst, SCM_obj *result) {
  if (lst == SCM_NIL) { /* Optimised */
    return SCM_make_call(1, cont, result);
  } else {
    SCM_obj *tmp = SCM_cdr(lst);
    SCM_obj *tmp2 = SCM_fx_plus(result, SCM_car(lst));
    SCM_cont *now = SCM_make_cont(2, &calculate_sum, cont);
    return SCM_make_call(2, now, tmp, tmp2); /* "Recur" */
  }
}

SCM_call *show_sum_continued(SCM_cont *cont, SCM_obj *result) {
  SCM_cont *now = SCM_make_cont(1, &SCM_print_number, cont);
  SCM_obj *tmp = SCM_fx_times(SCM_fx(2), result);
  return SCM_make_call(1, now, tmp);
}

SCM_call *SCM_print_number(SCM_cont *cont, SCM_obj *data) {
  printf("%d\n", SCM_fx_to_integer(data));
  return SCM_make_call(1, cont, SCM_VOID);
}

In the above code, there are two new data types: SCM_cont and SCM_call.

An SCM_cont represents a Scheme continuation as a C function's address, the number of arguments which it expects (minus one) and another continuation, which indicates where to continue after the C function has finished. This sounds recursive, but as you can see the very first continuation created by entry_point() is a specially prepared one which will cause the process to exit.

An SCM_call is returned to the driver loop by every generated C function: this holds a continuation and the arguments with which to invoke it. SCM_perform_continuation_call() extracts the SCM_cont from the SCM_call and invokes its C function with its continuation and the arguments from the SCM_call. We won't dwell on the details of its implementation now, but assume this is some magic which just works.
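
To make that "magic" a little less magical, here is a minimal, self-contained sketch of how such a trampoline could work. It is not CHICKEN's code: the struct layouts, the fixed two-argument array, the use of malloc() and the convention that returning NULL stops the loop are all assumptions I made to keep it short. Plain long values stand in for SCM_obj pointers, and the names sum_down() and sum_to() are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>

struct cont;
struct call;
/* Every generated function receives its continuation and its arguments,
   and returns the next call to perform (or NULL to stop, in this sketch). */
typedef struct call *(*cont_fn)(struct cont *k, const long *args);

typedef struct cont {
  cont_fn fn;          /* C function to run */
  struct cont *next;   /* where to continue once fn is done */
} cont;

typedef struct call {
  cont *k;             /* continuation to invoke */
  long args[2];        /* its arguments (at most two in this sketch) */
} call;

static long final_result;

static cont *make_cont(cont_fn fn, cont *next) {
  cont *k = malloc(sizeof *k);
  assert(k != NULL);
  k->fn = fn;
  k->next = next;
  return k;
}

static call *make_call(cont *k, long a0, long a1) {
  call *c = malloc(sizeof *c);
  assert(c != NULL);
  c->k = k;
  c->args[0] = a0;
  c->args[1] = a1;
  return c;
}

/* Stand-in for SCM_exit_continuation: record the result, stop the loop. */
static call *exit_continuation(cont *k, const long *args) {
  (void)k;
  final_result = args[0];
  return NULL;
}

/* Adds n + (n-1) + ... + 1 to acc, "recurring" via the driver loop. */
static call *sum_down(cont *k, const long *args) {
  long n = args[0], acc = args[1];
  if (n == 0)
    return make_call(k, acc, 0);          /* hand acc to our continuation */
  return make_call(make_cont(sum_down, k), n - 1, acc + n);
}

/* The trampoline: invoke the continuation's function with the
   continuation's own continuation and the arguments from the call. */
static void driver_loop(call *c) {
  while (c != NULL) {
    cont *k = c->k;
    call *next = k->fn(k->next, c->args);
    free(k);   /* each continuation is used exactly once in this demo; */
    free(c);   /* a real system would leave this to the GC */
    c = next;
  }
}

long sum_to(long n) {
  cont *done = make_cont(exit_continuation, NULL);
  driver_loop(make_call(make_cont(sum_down, done), n, 0));
  return final_result;
}
```

Calling sum_to() with a large argument uses no more C stack than calling it with a small one, because every step returns to driver_loop() before the next one starts.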

You'll also note that the primitives SCM_car(), SCM_cdr(), SCM_fx_plus() and SCM_fx_times() do not accept a continuation. This is a typical optimisation: some primitives can be inlined by the compiler. However, this is not required: you can make them accept a continuation as well, at the cost of further splintering the C functions into small sections; the calculate_sum() function would be split up into 4 separate functions if we did that.

Anyway, going back to the big picture we can see that this continuation-based approach consumes a more or less constant amount of stack space, because each function returns to driver_loop. Baker's fundamental insight was that the stack is there anyway (and it will be used by C), and if we don't need it for tracking function call nesting, why not use it for something else? He proposed to allocate all newly created objects on the stack. Because the stack would hopefully fit the CPU's cache in its entirety, this could give quite a performance benefit.

Generational collection

To understand why keeping new data together on the stack can be faster, it's important to know that most objects are quite short-lived. Most algorithms involve intermediate values, which are accessed quite a bit during a calculation but are no longer needed afterwards. These values need to be stored somewhere in memory. Normally you would store them together with all other objects in the main heap, which may cause fragmentation of said heap. Fragmentation means that memory references may cross page boundaries. This is slow, because it evicts useful data from the CPU's memory cache and, if the machine is low on memory, may even require swapping pages back in from disk.

On top of that, generating a lot of intermediate values means generating a lot of garbage, which will trigger many GCs during which a lot of these temporary objects will be cleaned up. However, during these GCs, the remaining longer-lived objects must also be analysed before it can be decided they can stick around.

This is rather wasteful, and it turns out we can avoid doing so much work by categorising objects by age. Objects that have just been created belong to the first generation and are stored in their own space (called the nursery - I'm not kidding!), while those that have survived several GC events belong to older generations, which each have their own space reserved for them. By keeping different generations separated, you do not have to examine long-lived objects of older generations (which are unlikely to be collected) when collecting garbage in a younger generation. This can save us a lot of wasted time.

Managing data on the stack

The Cheney on the M.T.A. algorithm as used by CHICKEN involves only two generations: one consisting of newly created objects and the other consisting of older objects. In this algorithm, new objects get immediately promoted (or tenured) to the old generation after a GC of the nursery (or stack). Such a GC is called a minor GC, whereas a GC of the heap is called a major GC.

This minor GC is where the novelty lies: objects are allocated on the stack. You might wonder how that can possibly work, considering the lengths I just went through to explain how CPS conversion gets rid of the stack. Besides, by returning to the trampoline function whenever a new continuation is invoked, anything you'd store on the stack would need to get purged (that's how the C calling convention works).

That's right! The way to make this work is pretty counter-intuitive: we go all the way back to the first Scheme to C conversion I showed you and make it even worse. Whenever we want to invoke a continuation, we just call its function. That means that the example program we started out with would compile to this:

/* Same as before */
void entry_point() {
  SCM_cont *cont = SCM_make_cont(0, &toplevel, SCM_exit_continuation);
  SCM_call *call = SCM_make_call(0, cont);
  SCM_driver_loop(call);
}

SCM_call *saved_cont_call; /* Set by SCM_save_call, read by driver_loop */
jmp_buf empty_stack_state; /* Set by driver_loop, read by minor_GC */

void SCM_driver_loop(SCM_call *call) {
  /* Save registers (including stack depth and address in this function) */
  if (setjmp(empty_stack_state))
    call = saved_cont_call; /* Got here via longjmp()? Use stored call */

  SCM_perform_continuation_call(call);
}

void SCM_minor_GC() {
  /* ...
     Copy live data from stack to heap, which is a minor GC.  Described later.
     ... */
  longjmp(empty_stack_state, 1); /* Restore registers (jump back to driver_loop) */
}

void toplevel(SCM_cont *cont) {
  if (!fits_on_stack(SCM_CONT_SIZE(0) + SCM_CALL_SIZE(1) +
                     SCM_FIXNUM_SIZE * 3 + SCM_PAIR_SIZE * 3)) {
    SCM_save_call(0, &toplevel, cont); /* Mutates saved_cont_call */
    SCM_minor_GC(); /* Will re-invoke this function from the start */
  } else {
    /* The below stuff will all fit on the stack, as calculated in the if() */
    SCM_cont *next = SCM_make_cont(1, &show_sum, cont);
    SCM_obj *lst = SCM_make_list(3, SCM_fx(1), SCM_fx(2), SCM_fx(3));
    SCM_call *call = SCM_make_call(1, next, lst);
    SCM_perform_continuation_call(call);
  }
}

void show_sum(SCM_cont *cont, SCM_obj *lst) {
  if (!fits_on_stack(SCM_CONT_SIZE(0) * 2 +
                     SCM_CALL_SIZE(2) + SCM_FIXNUM_SIZE)) {
    SCM_save_call(1, &show_sum, cont, lst);
    SCM_minor_GC();
  } else {
    SCM_cont *next = SCM_make_cont(1, &show_sum_continued, cont);
    SCM_cont *now = SCM_make_cont(2, &calculate_sum, next);
    SCM_call *call = SCM_make_call(2, now, lst, SCM_fx(0));
    SCM_perform_continuation_call(call);
  }
}

void calculate_sum(SCM_cont *cont, SCM_obj *lst, SCM_obj *result) {
  /* This calculation is overly pessimistic as it counts both arms
     of the if(), but this is acceptable */
  if (!fits_on_stack(SCM_CALL_SIZE(1) + SCM_FIXNUM_SIZE +
                     SCM_CONT_SIZE(1) + SCM_CALL_SIZE(2))) {
    SCM_save_call(2, &calculate_sum, cont, lst, result);
    SCM_minor_GC();
  } else {
    if (lst == SCM_NIL) {
      SCM_call *call = SCM_make_call(1, cont, result);
      SCM_perform_continuation_call(call);
    } else {
      SCM_obj *tmp = SCM_cdr(lst);
      SCM_obj *tmp2 = SCM_fx_plus(result, SCM_car(lst));
      SCM_cont *now = SCM_make_cont(2, &calculate_sum, cont);
      SCM_call *call = SCM_make_call(2, now, tmp, tmp2);
      SCM_perform_continuation_call(call); /* "Recur" */
    }
  }
}

void show_sum_continued(SCM_cont *cont, SCM_obj *result) {
  if (!fits_on_stack(SCM_CONT_SIZE(1) + SCM_CALL_SIZE(1) + SCM_FIXNUM_SIZE)) {
    SCM_save_call(1, &show_sum_continued, cont, result);
    SCM_minor_GC();
  } else {
    SCM_cont *now = SCM_make_cont(1, &SCM_print_number, cont);
    SCM_obj *tmp = SCM_fx_times(SCM_fx(2), result);
    SCM_call *call = SCM_make_call(1, now, tmp);
    SCM_perform_continuation_call(call);
  }
}

void SCM_print_number(SCM_cont *cont, SCM_obj *data) {
  if (!fits_on_stack(SCM_CALL_SIZE(1))) {
    SCM_save_call(1, &SCM_print_number, cont, data);
    SCM_minor_GC();
  } else {
    printf("%d\n", SCM_fx_to_integer(data));
    SCM_call *call = SCM_make_call(1, cont, SCM_VOID);
    SCM_perform_continuation_call(call);
  }
}

Whew! This program is quite a bit longer, but it isn't that different from the second program I showed you. The main change is that none of the continuation functions return anything. In fact, these functions, like Charlie in the M.T.A. song, never return. In the earlier version every function ended with a return statement; now they end with an invocation of SCM_perform_continuation_call().

To make things worse, allocating functions now also use alloca() to place objects on the stack. That means that the stack just keeps filling up like the first compilation I showed you, so we're back to where we started! However, this program is a lot longer due to one important thing: At the start of each continuation function we first check to see if there's enough space left on the stack to accommodate the objects this function will allocate.

If there's not enough space, we re-create the SCM_call with which this continuation function was invoked using SCM_save_call(). This differs from SCM_make_call() in that it will not allocate on the stack, but will use a separate area to set aside the call object. The pointer to that area is stored in saved_cont_call.

SCM_save_call() can't allocate on the stack for a few reasons: The first and most obvious reason is that the saved call wouldn't fit on the stack because we just concluded it is already full. Second, the arguments to the call must be kept around even when the stack is blown away after the GC has finished. Third, this stored call contains the "tip" of the iceberg of live data from which the GC will start its trace. This is described in the next section.

After the minor GC has finished, we can jump back to the trampoline again. We use the setjmp() and longjmp() functions for that. When the first call to SCM_driver_loop() is made, it will call setjmp() to save all the CPU's registers to a buffer. This includes the stack and instruction pointers. Then, when the minor GC finishes, it calls longjmp() to restore those registers. Because the stack and instruction pointer are restored, this means execution "restarts" at the place in driver_loop() where setjmp() was invoked. The setjmp() then returns again, but now with a nonzero value (it was zero the first time). The return value is checked and the call is fetched from the special save area to get back to where we were just before we performed the GC.
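
If you've never used setjmp() and longjmp() this way, the following stripped-down sketch may help. It only shows the control flow: a function that never returns keeps calling itself until a (fake) stack limit is hit, sets its arguments aside and jumps back to the driver, which restarts it from an empty stack. Everything here is an assumption made for illustration: the frame counter stands in for a real stack-depth check, two static variables stand in for saved_cont_call, and a second longjmp() value plays the role of the exit continuation.

```c
#include <setjmp.h>

#define FRAME_LIMIT 100            /* pretend the stack holds 100 frames */

static jmp_buf empty_stack_state;  /* set by the driver, used to reset */
static long saved_n, saved_acc;    /* stand-in for saved_cont_call */
static int frames_in_use;          /* stand-in for a real stack check */
static long result;

/* Never returns normally: it either calls itself or longjmp()s away. */
static void sum_down(long n, long acc) {
  if (frames_in_use >= FRAME_LIMIT) {
    saved_n = n;                   /* set the pending call aside... */
    saved_acc = acc;
    longjmp(empty_stack_state, 1); /* ...and blow away the stack */
  }
  frames_in_use++;
  if (n == 0) {
    result = acc;
    longjmp(empty_stack_state, 2); /* "exit continuation" of this sketch */
  }
  sum_down(n - 1, acc + n);
}

long sum_with_resets(long n) {
  saved_n = n;
  saved_acc = 0;
  /* setjmp() returns 0 the first time, and the value that was passed
     to longjmp() every time we get back here via a jump. */
  if (setjmp(empty_stack_state) == 2)
    return result;
  frames_in_use = 0;               /* the stack is empty again */
  sum_down(saved_n, saved_acc);    /* (re)start from the saved call */
  return result;                   /* not reached */
}
```

Note that the state which must survive the jump lives in static variables, not in locals of sum_with_resets(): locals modified between setjmp() and longjmp() have indeterminate values afterwards unless they are declared volatile.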

This is half the magic, so please make sure you understand this part!

The minor GC

The long story above served to set up all the context you need to know to dive into the GC itself, so let's take a closer look at it.

Picking the "live" data from the stack

As we've seen, the GC is invoked when the stack has completely filled up. At this point, the stack is a complete mess: it has many stack frames from all the function calls that happened between the previous GC and now. These stack frames consist of return addresses for the C function calls (which we're not even using), stack-allocated C data (which we don't need) and somewhere among that mess there are some Scheme objects. These objects themselves also belong to two separate categories: the "garbage" and the data that's still being used and needs to be kept around (the so-called live data). How on earth are we going to pick only the interesting bits from that mess?

Like I said before, the saved call contains the "tip of the iceberg" of live data. It turns out this is all we need to get at every single object which is reachable to the program. All you need to do is follow the pointers to the arguments and the continuation stored in the call. For each of these objects, you copy them to the heap and if they are compound objects you follow all the pointers to the objects stored within them, and so on. Let's take a look at a graphical representation of how this works. In the picture below I show the situation where a GC is triggered just after the second invocation of calculate-sum (i.e., the first recursive call of itself, with the list '(2 3)):

After the initial shock from seeing this cosmic horror has worn off, let's take a closer look. It's like a box of tangled cords: if you take the time to carefully untangle them, it's easy, but if you try to do it all at once, it'll leave you overwhelmed. Luckily, I'm going to talk you through it. (By the way: this is an SVG, so you can zoom in on details as far as you like using your browser's zooming functionality.)

Let's start with the big picture: On the left you see the stack, on the right the heap after copying and in the bottom centre there's a small area of statically allocated objects, which are not subject to GC. To get your bearings, check the left margin of the diagram. I have attempted to visualise C stack frames by writing each function's name above a line leading to the bottom of its frame.

Let's look at the most recently called function, at the top of the stack. This is the function which initiated the minor GC: the second call to calculate_sum(). The shaded area shows the pointers set aside by SCM_save_call(), which form the tip of the iceberg of live data. More on that later.

The next frame belongs to the first call to calculate_sum(). It has allocated a few things on the stack. The topmost element on the stack is the last thing that's allocated due to the way the stack grows upwards in this picture. This is a pointer to an SCM_call object, marked with "[call]", which is the name of the variable which is stored there. If you go back to the implementation of calculate_sum(), you can see that the last thing it does is allocate an SCM_call, and store its pointer in call. The object itself just precedes the variable on the stack, and is marked with a thick white border to group together the machine words from which it is composed. From bottom to top, these are:

A tag which indicates that this is a call containing 2 arguments,
a pointer to an SCM_cont object (taken from the now variable),
a pointer to an SCM_obj object (the cdr of lst, taken from tmp) and
a pointer to an SCM_obj object (a fixnum, taken from tmp2).

Other compound objects are indicated in the same way.

You'll also have noticed the green, white and dashed arcs with arrow tips. These connect pointers to their target addresses. The dashed ones on the right hand side of the stack indicate pointers that are used for local variables in C functions or SCM_call objects. These pointers are unimportant to the garbage collector. The ones on the left hand side of the stack are pointers from Scheme objects to other Scheme objects. These are important to the GC. The topmost pointer inside the call object we just looked at has a big dashed curve all the way down to the cdr of lst, and the one below it points at the value of result, which is the fixnum 1.

If you look further down the stack, you'll see the show_sum procedure which doesn't really allocate much: an SCM_call, the initial intermediate result (fixnum 0), and two continuations (next and now in the C code). The bulk of the allocation happens in toplevel, which contains the call to show_sum and allocates a list structure. This is on the stack in reverse order: first the pair X = (3 . ()), then the pair Y = (2 . <X>) and the pair Z = (1 . <Y>). The () is stored as SCM_NIL in the static area, to which the cdr of the bottom-most pair object on the stack points, which is represented by a long green line which swoops down to the static area.

Copying the live data to the heap

The green lines represent links from the saved call to the live data which we need to copy. You can consider the colour green "contagious": imagine everything is white initially, except for the saved call. Then, each line starting at the pointers of the call are painted green. The target object to which a line leads is also painted green. Then, we recursively follow lines from pointers in that object and paint those green, etc. The objects that were already in the heap or the static area are not traversed, so they stay white.

When an object is painted green, it is also copied to the heap, which is represented by a yellow line. The object is then overwritten by a special object which indicates that this object has been moved to the heap. This special object contains a forwarding pointer which indicates the new location of the object. This is useful when you have two objects which point to the same other object, like for example in this code:

(let* ((a (list 3 2 1))
      (b (cons 4 a))
      (c (cons 4 a)))
  ...)

Here you have two lists (4 3 2 1) which share a common tail. If both lists are live at the moment of GC, we don't want to copy this tail twice, because that would result in it being split into two distinct objects. Then, a set-car! on a might only be reflected in b but not c, for example. The forwarding pointers prevent this from happening by simply adjusting a copied object's constituent objects to point to their new locations. Finally, after all data has been copied, all the newly copied objects are checked again for references to objects which may have been relocated after the object was copied.

The precise algorithm that performs this operation is very clever. It requires only two pointers and a while loop, but it still handles cyclic data structures correctly. The idea is that you do the copying I described above in a breadth-first way: you only copy the objects stored in the saved call (without touching their pointers). Next, you loop from the start of the heap to the end, looking at each object in turn (initially, those are the objects we just copied). For these objects, you check their components, and see whether they exist in the heap or in the stack. If they exist in the stack, you copy them over to the end of the heap (again, without touching their pointers). Because they are appended to the heap, the end pointer gets moved to the end of the last object, so the while loop will also take the newly copied objects into consideration. When you reach the end of the heap, you're done. In C, that would look something like this:

SCM_obj *obj, *slot;
int i, bytes_copied;
char *scan_start = heap_start;

for(i = 0; i < saved_object_count(saved_call); ++i) {
  obj = get_saved_object(saved_call, i);
  /* copy_object() is called "mark()" in CHICKEN.
     It also sets up a forwarding pointer at the original location */
  bytes_copied = copy_object(obj, heap_end);
  heap_end += bytes_copied;
}

while(scan_start < heap_end) {
  obj = (SCM_obj *)scan_start;
  for(i = 0; i < object_size(obj); ++i) {
    slot = get_slot(obj, i);
    /* Nothing needs to be done if it's in the heap or static area */
    if (exists_in_stack(slot)) {
      if (is_forwarding_ptr(slot)) {
        set_slot(obj, i, forwarding_ptr_target(slot));
      } else {
        bytes_copied = copy_object(slot, heap_end);
        set_slot(obj, i, heap_end);
        heap_end += bytes_copied;
      }
    }
  }
  scan_start += object_size(obj);
}

This algorithm is the heart of our garbage collector. You can find it in runtime.c in the CHICKEN sources in C_reclaim(), under the rescan: label. The algorithm was invented in 1970 by C.J. Cheney, and is still used in the most "state of the art" implementations. Now you know why Henry Baker's paper is called "Cheney on the M.T.A." :)
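
The listing above leans on a lot of helper functions that I never defined. For those who'd like something they can actually compile, here is a small sketch of the same scan loop for a toy world in which pairs are the only heap objects. The representation is entirely made up for this example (a value struct instead of tagged words, a dedicated forward field instead of overwriting the object, fixed-size arrays for the nursery and the heap), so don't mistake it for what C_reclaim() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct pair pair;
typedef struct { pair *ptr; long num; } value; /* ptr == NULL: a number */
struct pair {
  pair *forward;   /* non-NULL once this pair has been copied */
  value car, cdr;
};

#define SPACE_SIZE 64
static pair nursery[SPACE_SIZE];   /* stand-in for the stack */
static pair heap[SPACE_SIZE];
static size_t nursery_used, heap_used;

static int in_nursery(const pair *p) {
  uintptr_t a = (uintptr_t)p;
  return a >= (uintptr_t)nursery && a < (uintptr_t)(nursery + SPACE_SIZE);
}

value num(long n) { value v = { NULL, n }; return v; }

value cons(value car, value cdr) {
  assert(nursery_used < SPACE_SIZE);
  pair *p = &nursery[nursery_used++];
  p->forward = NULL;
  p->car = car;
  p->cdr = cdr;
  value v = { p, 0 };
  return v;
}

/* Copy a pair to the end of the heap (once!) and leave a forwarding
   pointer behind. The copy's slots are not touched yet. */
static pair *copy_pair(pair *p) {
  if (p->forward == NULL) {
    assert(heap_used < SPACE_SIZE);
    heap[heap_used] = *p;
    p->forward = &heap[heap_used++];
  }
  return p->forward;
}

static void relocate(value *v) {
  if (v->ptr != NULL && in_nursery(v->ptr))
    v->ptr = copy_pair(v->ptr);
}

void minor_gc(value *roots, size_t nroots) {
  size_t scan = heap_used;           /* first pair copied by this GC */
  for (size_t i = 0; i < nroots; i++)
    relocate(&roots[i]);
  while (scan < heap_used) {         /* heap_used grows while we scan */
    relocate(&heap[scan].car);
    relocate(&heap[scan].cdr);
    scan++;
  }
  nursery_used = 0;                  /* everything left behind is garbage */
}

/* The shared-tail example from above: b and c both end in a. */
int shared_tail_survives(void) {
  size_t before = heap_used;
  value nil = num(0);                /* stand-in for the empty list */
  value a = cons(num(3), cons(num(2), cons(num(1), nil)));
  value roots[2];
  roots[0] = cons(num(4), a);        /* b */
  roots[1] = cons(num(4), a);        /* c */
  cons(num(99), nil);                /* garbage: nothing points at it */
  minor_gc(roots, 2);
  return roots[0].ptr != roots[1].ptr
      && roots[0].ptr->cdr.ptr == roots[1].ptr->cdr.ptr
      && !in_nursery(roots[0].ptr->cdr.ptr)
      && heap_used - before == 5;    /* b, c and the three pairs of a */
}
```

The scan index and heap_used are the "two pointers" of the algorithm: everything before scan has had its slots fixed up, everything between scan and heap_used has been copied but still needs a visit.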

After the data has been copied to the heap, the longjmp() in minor_GC() causes everything on the stack to be blown away. Then, the top stack frame is recreated from the saved call. This is illustrated below:

Everything in the shaded red area below the stack frame for driver_loop() is now unreachable because there are no more pointers from live data pointing into this region of the stack. Any live Scheme objects allocated here would have been copied to the heap, and all pointers which pointed there relayed to this new copy. Unfortunately, this stale copy of the data will permanently stick around on the stack, which means this data is forever irreclaimable. This means it is important that the entry point should consume as little stack space as possible.

The major GC

You might be wondering how garbage on the heap is collected. That's what the major GC is for. CHICKEN initially only allocates a small heap area. The heap consists of two halves: a fromspace and a tospace. The fromspace is the heap as we've seen it so far: in normal usage, this is the part that's used. The tospace is always empty.

When a minor GC is copying data from the stack to the fromspace, it may cause the fromspace to fill up. That's when a major GC is triggered: the data in the fromspace is copied to the tospace using Cheney's algorithm. Afterwards, the areas are flipped: the old fromspace is now called tospace and the old tospace is now called fromspace.

During a major GC, we have a slightly larger set of live data. It is not just the data from the saved call, because that's only the stuff directly used by the currently running continuation. We also need to consider global variables and literal objects compiled into the program, for example. These sorts of objects are also considered live data. Aside from this, a major collection is performed the same way as a minor collection.

The smart reader might have noticed a small problem here: what if the amount of garbage cleaned up is less than the data on the stack? Then, the stack data can't be copied to the new heap because it simply is too small. Well, this is when a third GC mode is triggered: a reallocating GC. This causes a new heap to be allocated, twice as big as the current heap. This is also split in from- and tospace. Then, Cheney's algorithm is performed on the old heap's fromspace, using one half of the new heap as tospace. When it's finished, the new tospace is called fromspace, and the other half of the new heap is called tospace. Then, the old heap is de-allocated.

Some practical notes

The above situation is a pretty rough sketch of the way the GC works in CHICKEN. Many details have been omitted, and the actual implementation is extremely hairy. Below I'll briefly mention how a few important things are implemented.

Object representation

You might have noticed that the stack grows pretty quickly in the CPS-converted C code I showed you. That's because the SCM_obj representation requires allocating every object and storing a pointer to it, so that's a minimum of two machine words per object.

CHICKEN avoids this overhead for small, often-used objects like characters, booleans and fixnums. It ensures all allocated objects are word-aligned, so all pointers to objects have their lower bits set to zero. This means you can easily see whether something is a pointer to an object or something else.

All objects in CHICKEN are represented by a C_word type, which is the size of a machine word. So-called immediate values are stored directly inside the machine word, with nonzero lower bits. Non-immediate values are cast to a pointer type to a C structure which contains the type tag and bits like I did in the example.
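
As a rough illustration of the idea, here's how fixnums could be packed into a machine word next to pointers. The exact bit assignments below (lowest bit set means fixnum, remaining bits hold the number) and the names word, make_fixnum() and friends are assumptions chosen for this sketch; look at chicken.h if you want to know the real encoding.

```c
#include <assert.h>
#include <stdint.h>

typedef intptr_t word;      /* hypothetical stand-in for C_word */

#define FIXNUM_BIT 1        /* assumed: lowest bit set means "fixnum" */

int is_fixnum(word w) {
  return (w & FIXNUM_BIT) != 0;
}

word make_fixnum(intptr_t n) {
  /* Shift as unsigned to avoid undefined behaviour on negative n. */
  return (word)(((uintptr_t)n << 1) | FIXNUM_BIT);
}

intptr_t fixnum_value(word w) {
  /* Relies on >> being an arithmetic shift for negative values, which
     is implementation-defined in C but true on mainstream compilers. */
  return w >> 1;
}

word make_pointer(void *p) {
  /* Word-aligned objects always have a zero lowest bit. */
  assert(((uintptr_t)p & FIXNUM_BIT) == 0);
  return (word)p;
}
```

The price you pay is one bit of range: a fixnum in this scheme holds 63 bits on a 64-bit machine. In return, small integers need no allocation at all, so the GC never has to look at them.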

Calls are not represented by objects in CHICKEN. Instead, the C function is simply invoked directly from the continuation's caller. Continuations are represented as any other object. For didactic reasons, I used a separate C type to distinguish it from SCM_obj, but in Scheme continuations can be reified as first-class objects, so they shouldn't be represented in a fundamentally different way.

Closures

You might be wondering how closures are implemented, because this hasn't been discussed at all. The answer is pretty simple: in the example code, an SCM_cont object stored a plain C function's address. We could store a closure instead: this is a new type of object which holds a C function plus the variables it closes over. Each C function receives this closure as an extra argument (in the CHICKEN sources this is called self). When it needs a closed-over value, it reads it from the closure object.
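
A bare-bones sketch of that idea, for something like (define (make-adder n) (lambda (x) (+ n x))), could look as follows. To keep it short I've left out continuations and GC entirely and used plain long values; the closure struct and the function names are invented for this example.

```c
typedef struct closure closure;
typedef long (*closure_fn)(closure *self, long arg);

struct closure {
  closure_fn fn;      /* the compiled C function */
  long captured[1];   /* the values it closed over (just one here) */
};

/* Body of (lambda (x) (+ n x)): n is fetched through self. */
static long adder_body(closure *self, long x) {
  return self->captured[0] + x;
}

closure make_adder(long n) {
  closure c = { adder_body, { n } };
  return c;
}

/* Calling a closure means calling its function with itself as argument. */
long call_closure(closure *c, long arg) {
  return c->fn(c, arg);
}
```

Two closures made by make_adder() share the same C function but carry different captured values, which is exactly the distinction between code and closure.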

Mutations

Another major oversight is the assumption that objects can only point from the stack into the heap. If Scheme were a purely functional language, this would be entirely accurate: new objects can refer to old objects, but there is no way that a preexisting object can be made to refer to a newly created object. For that, you need to support mutation.

But Scheme does support mutation! So what happens when you use vector-set! to store a newly created, stack-allocated value in an old, heap-allocated vector? If we used the above algorithm, the newly created element would either be part of the live set and get copied, but the vector's pointer would not be updated, or it wouldn't be part of the live set and the object would be lost in the stack reset.

The answer to this problem is also pretty simple: we add a so-called write barrier. Whenever a value is written to an object, it is remembered. Then, when performing a GC, these remembered values are considered to be part of the live set, just like the addresses in the saved call. This is also the reason CHICKEN always shows the number of mutations when you're asking for GC statistics: mutation may slow down a program because GCs might take longer.
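
In its simplest form, such a write barrier is just a function through which every store into an existing object has to pass. The sketch below records the address of each mutated slot in a fixed-size table; the table size and all the names are assumptions for the sake of the example. A smarter barrier could skip stores that cannot create a heap-to-stack pointer, but I've left that out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REMEMBERED 128

static void **remembered[MAX_REMEMBERED];  /* addresses of mutated slots */
static size_t remembered_count;

/* What something like vector-set! would call instead of a plain store. */
void barrier_store(void **slot, void *new_value) {
  *slot = new_value;
  assert(remembered_count < MAX_REMEMBERED);
  remembered[remembered_count++] = slot;
}

size_t mutation_count(void) {
  return remembered_count;
}

/* The minor GC would walk these slots as extra roots, so that whatever
   they point to gets copied and the slots get updated... */
void **remembered_slot(size_t i) {
  assert(i < remembered_count);
  return remembered[i];
}

/* ...and forget them afterwards, since nothing lives on the stack now. */
void clear_remembered(void) {
  remembered_count = 0;
}
```

Because the table holds slot addresses and not values, the collector can overwrite each slot with the new heap location of the object it refers to.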

Stack size

How does CHICKEN know when the stack is filled up? It turns out that there is no portable way to detect how big the stack is, or whether it has a limit at all!

CHICKEN works around this simply by limiting its stack to a predetermined size. On 64-bit systems, this is 1MB, on 32-bit systems it's 256KB. There is also no portable way of obtaining the address of the stack itself. On some systems, it uses a small bit of assembly code to check the stack pointer. On other systems, it falls back on alloca(), allocating a trivial amount of data. The address of the allocated data is the current value of the stack pointer.

When initialising the runtime, just before the entry point is called, the stack's address is taken to determine the stack's bottom address. The top address is checked in the continuation functions, and the difference between the two is the current stack size.
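
Here's a sketch of what such a check could look like. Instead of assembly or alloca() it takes the address of a local variable, which is yet another non-portable approximation of the stack pointer, and it uses the absolute difference so it doesn't matter in which direction the stack grows. The 1MB limit is the 64-bit figure mentioned above; the function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_LIMIT (1024 * 1024)   /* predetermined size, not detected */

static uintptr_t stack_bottom;

/* Approximates the stack pointer by the address of a local variable.
   Only the numeric address is used; the variable is never accessed
   after this function returns. */
static uintptr_t current_stack_address(void) {
  volatile char marker;
  return (uintptr_t)&marker;
}

/* To be called once, just before the entry point is invoked. */
void init_stack_bottom(void) {
  stack_bottom = current_stack_address();
}

size_t stack_in_use(void) {
  uintptr_t now = current_stack_address();
  return (size_t)(now > stack_bottom ? now - stack_bottom
                                     : stack_bottom - now);
}

int fits_on_stack(size_t bytes) {
  return stack_in_use() + bytes <= STACK_LIMIT;
}
```

The operating system must of course allow the stack to grow at least as large as STACK_LIMIT, otherwise the program crashes long before fits_on_stack() says no.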

A small rant

While doing the background research for this post, I wanted to read Cheney's original paper. It was very frustrating to find so many references to it, which all lead to a paywall on the ACM website.

I think it's absurd that the ACM charges $15 for a paper which is over forty years old, and only two measly pages long. What sane person would plunk down 15 bucks to read 2 pages, especially if it is possibly outdated, or not even the information they're looking for?

The ACM's motto is "Advancing Computing as a Science & Profession", but I don't see how putting essential papers behind a paywall is advancing the profession, especially considering how many innovations now happen as unpaid efforts in the open source/free software corner of the world. Putting such papers behind a paywall robs the industry of a sorely-needed historical perspective, and it stifles innovation by forcing us to keep reinventing the wheel.

Some might argue that the ACM needs to charge money to be able to host high-quality papers and maintain its high quality standard, but I don't buy it. You only need to look at USENIX, which is a similar association. They provide complete and perpetual access to all conference proceedings, and the authors maintain full rights to their work. The ACM, instead, has now come up with a new "protection" racket, requiring authors to give full control of their rights to the ACM, or pay for the privilege of keeping the rights on their own work, which is between $1,100 and $1,700 per article.

On a more positive note, authors are given permission to post drafts of their papers on their own website or through their "Author-izer" service. Unfortunately, this service only works when the link is followed directly from the domain on which the author's website is located (through the Referer header). This is not how the web works: it breaks links posted in e-mail as well as search engines.

Secondly, the ACM are also allowing their special interest groups to provide full access to conference papers of the most recent conference. However, this doesn't seem to be encouraged in any way, and only a few SIGs seem to do this.

Luckily, I found a copy of the Cheney paper on some course website. So do yourself a favour and get it before it's taken down :(

Update: If you are also concerned about this, please take a small moment to add your name to this petition.

Update 2: I've become aware of a web site called Sci-Hub that makes papers freely available to all. It bypasses paywalls through shared full-access accounts. Sadly, this is technically still illegal in many countries and some of its domains have been seized in attempts at censoring them.

VCS-independent distribution of language extensions

2013-06-04T21:30:09Z

Today I'd like to talk about how CHICKEN Scheme handles distribution of language extensions (which we call "eggs"). There are some unique features of our setup that might be interesting to users of other languages as well, and I think the way backwards compatibility was kept is rather interesting.

In the beginning

First, a little bit of history, so you know where we're coming from. CHICKEN was initially released in the year 2000, and the core system was available as a tarball on the website. In 2002 it was moved into CVS and in 2004 to Darcs (yes, there were good open source DVCSes before Git).

Throughout this period, eggs were simply stored as tarballs (curiously bearing a ".egg" extension) in some well-known directory on the CHICKEN website. The egg installation tool had this location built in. For example, the egg named foo would be fetched from http://www.call-with-current-continuation.org/eggs/foo.egg.

To contribute (or update!) an extension, you simply sent a tarball to Felix and he would upload it to the site. This was a very centralised way of working, creating a lot of work for Felix. So in 2005, he asked authors to put all eggs in a version control system: Subversion. At the time, every contributor was given write access to the entire repo! These were simpler times, when we had only a handful of contributors.

The switch to Subversion allowed for a neat trick: whenever an egg was modified, it triggered a "post-commit hook" which tarred up the egg and uploaded it to the website. This was a very simple addition which automated the work done by Felix, while ensuring the existing tools did not have to be modified. Egg authors now had the freedom to modify their code as they liked, and new releases would appear for download within seconds.

If an author used the conventional trunk/tags/branches layout, the post-commit hook automatically detected this and would upload the latest tag. In other words, we reached a level of automation where "making a release" was simply tagging your code!

Documentation for eggs originally lived on the same website as the eggs did, but this was eventually moved into svnwiki, one of the first wikis to use Subversion as a backing store. To make things even simpler, the core system was also moved into Subversion. Now everything was in one system, for everyone to hack on, using the same credentials everywhere. Life was good!

Start of the DVCS wars

This worked great for years, and the number of contributors steadily increased. Meanwhile, distributed version control systems were gaining mainstream popularity, and contributors started experimenting with Git, Mercurial, Bazaar and Fossil. People grumbled that CHICKEN was still on Subversion.

The next major release, CHICKEN 4.0, provided for a "clean slate", with the opportunity to rewrite the distribution system. This simplified things, replacing the brittle post-commit script with a CGI program called "henrietta", which would serve the eggs via HTTP. The download location for eggs was put into a configuration file, which allowed users to host their own mirror. This is useful if for example a company wants to set up a private deployment-server containing proprietary eggs. We also gained a mirror for general use, graciously provided by Alaric.

The difference was that now there was no static tarball, but when you downloaded an egg, its files would be served straight from either svn, a local directory tree or a website. If we ever decided to migrate the egg repository to a completely different version control system, we could simply add a new back-end to Henrietta. Nothing would have to be modified on the client.
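To make the back-end idea a bit more tangible, here's a minimal sketch in Python. This is not how Henrietta is actually written (it's a Scheme CGI program); the names `EggBackend`, `LocalDirBackend` and `serve_egg`, and the `root/egg/version` directory layout, are all made up for illustration.

```python
from pathlib import Path
from typing import Dict, Protocol


class EggBackend(Protocol):
    """Hypothetical interface: return an egg's files as {relative path: bytes}."""

    def fetch(self, egg: str, version: str) -> Dict[str, bytes]: ...


class LocalDirBackend:
    """Serves eggs from a directory tree laid out as root/egg/version/ (assumed layout)."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def fetch(self, egg: str, version: str) -> Dict[str, bytes]:
        base = self.root / egg / version
        return {
            path.relative_to(base).as_posix(): path.read_bytes()
            for path in sorted(base.rglob("*"))
            if path.is_file()
        }


def serve_egg(backend: EggBackend, egg: str, version: str) -> Dict[str, bytes]:
    # The client-facing side only talks to the interface, so swapping the
    # storage back-end requires no change on the client.
    return backend.fetch(egg, version)
```

A Subversion or plain-website back-end would just be another class with the same `fetch` method.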

The new system

In 2009, CHICKEN core was moved into a Git repository, as it looked like Git was winning the DVCS wars. New users were often complaining about having to use crusty old Subversion. By this time, some people even worked exclusively in a DVCS, only synchronising to the svn repo. This meant the svn repo was no longer the "canonical" repository for all eggs. It was becoming nothing but a hassle for those who preferred other VCSes.

Another issue was that we still had a maintenance problem: commit access on the svn repo is centrally managed, through one big mod_authz_svn configuration file, listing which users have access to which "sub-repositories". If someone wants to grant commit access to another developer, this has to be requested via the mailing list or the server's maintainer.

Requirements

To solve these problems, we started to consider new ways to allow users to develop their eggs using their favorite VCS. The new system had a few strict requirements:

It had to be completely backwards-compatible. No changes should be made to CHICKEN core. New eggs published through this system should be available to older CHICKENs, too.
It had to be completely VCS-independent. We want to avoid extra work when the next VCS fad comes along. Furthermore, it should work with all popular code hosters, for maximum freedom of choice. Self-hosting should explicitly be an option.
The existing workflow of egg authors should not fundamentally change; especially the release procedure of making a tag should stay.
There should be a way to avoid broken links if someone takes down their repo.
Most of all, the system had to be simple.

A simple solution

The simplest way to make the distribution system VCS-independent is to ignore VCSes altogether! Instead, we download source files over HTTP and mirror them from the CHICKEN server.

This idea was rather natural: our Subversion setup had always allowed direct access to plain files over HTTP through mod_dav_svn. Most popular code hosting sites (Github, Bitbucket, Google Code etc) also allow this, either directly or via some web repo viewer's "download raw file" link, which can be constructed from a VCS tag and file name. Also, Henrietta already supported serving eggs from a local directory tree which meant we had to make almost no modifications to our existing tool chain.

To make this work, all that's needed is:

Some daemon which periodically fetches new eggs.
A "master list" of where each egg is hosted.
For each egg, a list of released versions for that egg.
A "base URI" where the files for that release can be downloaded.
A list of files for that release (or the name of a tarball, which is equivalent).
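The ingredients above map onto a very small data model. The following Python sketch shows one way to turn a release description into download URLs. The `{egg}`/`{version}` placeholder syntax, the `Release` class and the example.org host are assumptions for the sake of the example, not the actual release-info format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Release:
    egg: str
    version: str
    base_uri: str     # template with {egg} and {version} placeholders (hypothetical syntax)
    files: List[str]  # file names relative to the expanded base URI


def download_urls(release: Release) -> List[str]:
    """Expand the base URI template and append each file name to it."""
    base = release.base_uri.format(egg=release.egg, version=release.version)
    return [base.rstrip("/") + "/" + name for name in release.files]


release = Release(
    egg="foo",
    version="1.2",
    base_uri="https://code.example.org/{egg}/raw/{version}/",
    files=["foo.scm", "foo.setup", "foo.meta"],
)
for url in download_urls(release):
    print(url)
```

Note that nothing in here knows about any VCS: a tag is just a string that ends up in a URL.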

We already had a so-called ".meta-file", which contains info about the egg (author, license, name, category etc). In an earlier incarnation of the post-commit hook this file also contained a list of the files that the egg consisted of, so it made sense to re-use this facility.

We only needed to take care of the daemon, the master egg list and a way to communicate the base URI. This was simple, and I wrote the daemon (dubbed "henrietta-cache") over a weekend during a hackathon. It really is simple and consists of only 300+ lines of (rather ugly) Scheme code. At the hackathon, Moritz helped out by moving the existing eggs to this new scheme, and testing with various hosting providers.

But not the simplest solution

The clever reader has probably already noted that the setup could be simplified by putting the henrietta-cache logic into the client program. We chose not to do this because it would break two requirements: that of backwards compatibility and that of avoiding broken links.

Strictly speaking, the backwards compatibility problem could be solved by embedding the functionality into chicken-install and eventually removing henrietta-cache from the server.

Broken links are a bigger problem, though. Currently, if a repo becomes unavailable, this is no problem; we still have a cached copy on our servers. Even if the repo goes offline forever and nobody has a copy of it anymore, we can still import the cached files into a fresh repo and take over maintenance from there.
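The "cached copy survives" behaviour boils down to something like the Python sketch below. It is an illustration of the idea only, not henrietta-cache's actual code; `refresh_cached_file` and its `download` callback are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def refresh_cached_file(
    url: str,
    cache_path: Path,
    download: Callable[[str], bytes],
) -> bytes:
    """Try to refresh one cached file; fall back to the cached copy on failure."""
    try:
        data = download(url)
    except OSError:
        # Upstream repo is down or gone: keep serving what we already have.
        if cache_path.exists():
            return cache_path.read_bytes()
        raise
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_bytes(data)
    return data
```

A client that fetched straight from the author's repo would have nothing to fall back on in the `except` branch, which is exactly the broken-link problem.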

Some incremental improvements

Unfortunately, the new system made it easier for Github and Bitbucket users than for CHICKEN Subversion users to maintain their eggs, because these sites allow tarball downloads, while the Subversion users had to list each file in their egg in the meta file. Under the old system this was not required, because it simply offered the entire svn egg directory for download.

After some people complained about having to do this extra step, I wrote another simple "helper" egg with the tongue-in-cheek name "pseudo-meta-egg-info". This is a small (80 lines) Spiffy web application which can generate "pseudo" meta files containing a full list of all the files in a Subversion subdirectory, and a list of all the tags available. This all happens on-the-fly, which means that egg authors could now revert to their old workflow of simply tagging their egg to make a release!

Technically, this helper webapp can be extended and deployed for any hosting site, so if you decide to host your own repository it could generate the list of tags and files for that, too. CHICKEN isn't big enough to ask Google, Github or Bitbucket to run this on their servers, of course, so some helper plug-ins and shell scripts for svn, hg and git were made as well. These will generate the list of tags and file names and put them in the meta- and release-info files.

Current status

The new system has been in use for over two years (since March 29th, 2011) and it has been doing a good job, requiring only very little maintenance and few modifications after the initial release. We've already reaped the benefits of our setup: Github and Bitbucket both had several periods of downtime, during which eggs were still available, even if they were hosted there.

The following graph shows the number of available CHICKEN eggs, starting with the "release 4" branch (requires an SVG-capable browser). There's a small skew because the script I used to generate the graph only checked for existence, not whether the egg was released.

As you can see, Mercurial (hg) and Git took off almost simultaneously, but where git is still steadily increasing in popularity, hg mostly stagnated. Subversion (svn) saw a few drops from eggs that were moved into hg/git. You'd guess that most git users would use Github, but it turns out that Bitbucket is reasonably popular among Chicken users too. We also have three authors who have opted to host their own repositories. You can see this in the breakdown of today's eggs by host:

Hosting site	VCSes	# of eggs
code.call-cc.org	svn	454
github.com	git	85
bitbucket.org	hg, git	41
gitorious.org	git	5
chust.org	fossil	5
kitten-technologies.co.uk	fossil	3
code.stapelberg.de	git	1

Finally, the graph shows that people are still releasing new eggs from svn, but most new development takes place in git. And yes, there are a few eggs in Fossil, too! Bazaar is currently not listed. One possible explanation is that Loggerhead (its web viewer) does not allow easy construction of stable URLs to raw files for a particular tag (or zip file/tarball), so serving up eggs straight from a repo is not possible. Another reason could be that bzr simply isn't that popular among CHICKEN users. If you're a bzr user and would like to use this distribution scheme, please have a look at Loggerhead issues #473691 and #739022. If you know a way around this, please share your knowledge on our release instructions page.

Things to improve

Needless to say, I'm rather happy that the system satisfied all the requirements we set for it, and that it saw such uptake. The majority of newly released eggs are using one of the new systems (too bad it's Git, but I guess that's inevitable).

However, as always, there is room for improvement. The current system has a few flaws, all of which are due to the fact that henrietta-cache simply copies code off an external site:

There's no "stable" tarball per egg release. This is required for OS package managers, which usually verify with a checksum that the source package has not changed. Recently, Mario improved on this situation by providing tarballs, but these are merely tarballs of the henrietta-cache mirror on that particular server. However, these should be expected to be stable...
If an egg author moves tags around, nobody will know. Different henrietta-cache mirrors may then have an inconsistent view of the distributed repository. We have two egg mirrors, and so far this has happened once or twice. This requires some manual intervention: just blow away all the cached files and wait for it to re-synch, or trigger the synch manually.
Egg authors cannot sign their eggs; each egg is downloaded from a source that may not be trustworthy. This is tricky, especially because most people don't want to mess around with PGP keys anyway. CHICKEN core releases aren't signed either, so this isn't very high on our priority list.
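The first two flaws could at least be detected by recording checksums at mirroring time. We don't currently do this; the following Python sketch only illustrates the idea, and the function names are invented for the example.

```python
import hashlib
from typing import Dict, List


def manifest(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map each file name to the SHA-256 hex digest of its contents."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def changed_files(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Names that were added, removed or modified between two manifests."""
    return sorted(name for name in old.keys() | new.keys() if old.get(name) != new.get(name))


# Two mirrors that cached the "same" release before and after a tag was moved:
mirror_a = manifest({"foo.scm": b"(define x 1)", "foo.meta": b"((license BSD))"})
mirror_b = manifest({"foo.scm": b"(define x 2)", "foo.meta": b"((license BSD))"})
print(changed_files(mirror_a, mirror_b))  # ['foo.scm']
```

This only tells you that something changed, not which version is the "right" one; it does nothing for the signing problem.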

I think some of these problems are a result of "going distributed", similar to the problem that you should not rewrite history that has already been pushed.

Random thoughts on the substring procedure

2013-02-17T10:57:03Z

Recently there was a small flame war on the Chicken-hackers mailing list. A user new to Scheme asked an innocuous question that drew some heated responses:

 Is there a good reason for this behavior?
 # perl -e 'print substr("ciao",0,10);'
 ciao 
 # ruby -e 'puts "ciao"[0..10]'
 ciao
 # python -c 'print "ciao"[0:10];'
 ciao
 # csi -e '(print (substring "ciao" 0 10))'
 Error: (substring) out of range 0 10

Some popular dynamic languages have a generic "slice" operator which allows the user to supply an end index that's beyond the end of the object, and it'll return from the start position up until the end. Instead, Chicken (and most other Schemes) will raise an error.

On the list, I argued that taking characters 0 through 10 from a 4-character string makes no bloody sense, which is why it's signalling an error. For the record: this can be caught by an exception handler, which makes it a controlled error situation, not a "crash".

Our new user retorted that it's perfectly sane to define the substring procedure as:

 Return a string consisting of the characters between the start position
 and the end position, or the end of the string, whichever comes first.

I think this is a needlessly complex definition. It breaks the rule "do one thing and do it well", from which UNIX derives its power: Conceptually crisp components ease composition.

One of the most valuable things a programming language can offer is the ability to reason about code with a minimum of extra information. This is also why most Schemers prefer a functional programming style; it's easier to reason about referentially transparent code. Let's see what useful facts we can infer from a single (substring s 0 10) call:

The variable s is a string.
The string s is at least 10 characters long.
The returned value is a string.
The returned string is exactly 10 characters long.

If either of the preconditions doesn't hold, it's an error situation and the code following the substring call will not be executed. The above guarantees also mean, for example, that if later you see (string-ref s 8) this will always return a character. In "sloppier" languages, you lose several of these important footholds. This means you can't reason so well about your code's correctness anymore, except by reading back and dragging in more context.
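For readers more at home in one of the "sloppier" languages, here's roughly what the strict contract looks like when spelled out in Python. `strict_substring` is a name I made up; it mimics the error behaviour described above, nothing more.

```python
def strict_substring(s: str, start: int, end: int) -> str:
    """Like Scheme's substring: out-of-range indices are an error, not clamped."""
    if not isinstance(s, str):
        raise TypeError("not a string")
    if not 0 <= start <= end <= len(s):
        raise IndexError(f"substring out of range: {start} {end}")
    return s[start:end]


print("ciao"[0:10])                            # lax slice: prints "ciao"
print(strict_substring("ciao, mondo", 0, 10))  # prints "ciao, mond"
try:
    strict_substring("ciao", 0, 10)
except IndexError as err:
    print("error:", err)
```

After a successful call you know the result has exactly `end - start` characters; after the plain slice you know nothing about its length without going back to check.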

Finally, it is also harder to build the "simple" version of substring on top of the complex one than it is to build the complex one as a "convenience" layer on top of the simpler one. On our list it was quickly shown that it's trivial to do so:

(define (substring/n s start n)
  (let* ((start (min start (string-length s)))
         (end (min (string-length s) (+ start n))))
    (substring s start end)))

;; Easy to use and re-use:
(substring/n "ciao" 1 10) ; => "iao"

There's even an egg for Chicken called slice which provides a generic procedure which behaves like the ranged index operator in Python/Ruby.

A tangential rant on the hidden costs of sloppiness

The difference in behaviour between these languages is not a coincidence: it's a result of deep cultural differences. The Scheme culture (and in some respects the broader Lisp culture) is one that tends to prefer correctness and precision. This appears in many forms, from Shivers' "100% correct solution" manifesto to Gabriel's Worse Is Better essay and all the verbiage dedicated to correct "hygienic" treatment of macros.

In contrast, some cultures prefer lax and "do what I mean" over rigid and predictable behaviour. This may be more convenient for writing quick one-off scripts, but in my opinion this is just asking for trouble when writing serious programs.

Let's investigate some examples of the consequences of this "lax" attitude. You're probably aware of the recent discovery of several vulnerabilities in Ruby on Rails. Two of these allowed remote code execution simply by submitting a POST request to any Rails application. As this post explains, the parser for XML requests was "enhanced" to automatically parse embedded YAML documents (which can contain arbitrary code). My position is that YAML has absolutely nothing to do with XML (or JSON), which means that if a program wants to parse YAML embedded in XML it must do that itself, or at least explicitly specify it wants automatic type conversions in XML/JSON documents. The Rails developers allowed misplaced convenience and sloppiness to trump precision and correctness, to the point that nobody knew what their code really did.

Another example would be the way PHP, Javascript, and several other languages implicitly coerce types. You can see the hilarious results of the confusion this can cause in the brilliant talk titled "Wat". There are also people filing bug reports for PHP's == operator. Its implicit type conversion is documented, intended behaviour, but it results in a lot of confusion and, again, security issues, as pointed out by PHP Sadness #47. If you allow the programmer to be sloppy and leave important details unspecified, an attacker will gladly fill in those details for you.

Some more fun can be had by looking at the MySQL database and how it mangles data. The PostgreSQL culture also strongly prefers correctness and precision, whereas MySQL's culture is more lax. The clash between these two cultures can be seen in a thread on the PostgreSQL mailing list where someone posted a video of a comparison between PostgreSQL and MySQL's behaviour. These cultural differences run deep, as you can tell by the responses of shock. And again, the lax behaviour of MySQL has security implications. The Rails folks have discovered that common practices might allow attackers to abuse MySQL's type coercion. Because Rails supports passing typed data in queries, it's possible to force an integer in a condition that expects a string. MySQL will silently coerce non-numerical strings to zero:

SELECT * FROM `users` WHERE `login_token` = 0 LIMIT 1;

This will match the first record (which usually just happens to be the administrative account). Just as with the innocent little substring behaviour we started our journey with, it is possible to work around this, but things would be a lot easier if the software behaved more rigidly and strictly, so that this kind of conversion would only be done upon explicit request of the programmer.
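To see why that query matches, it helps to model the coercion. The Python sketch below is a deliberately simplified model of the "use the leading numeric prefix, otherwise zero" rule, not MySQL's actual implementation; all function names are made up.

```python
import re

_NUMERIC_PREFIX = re.compile(r"\s*[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def lax_to_number(text: str) -> float:
    """Simplified model: use the leading numeric prefix, or 0 if there is none."""
    match = _NUMERIC_PREFIX.match(text)
    return float(match.group(0)) if match else 0.0


def lax_equals(column_value: str, parameter) -> bool:
    # With a numeric parameter, the string column is coerced before comparing.
    if isinstance(parameter, (int, float)):
        return lax_to_number(column_value) == parameter
    return column_value == parameter


def strict_equals(column_value: str, parameter) -> bool:
    # The rigid alternative: a type mismatch is an error, not a guess.
    if not isinstance(parameter, str):
        raise TypeError("login_token must be compared against a string")
    return column_value == parameter


print(lax_equals("deadbeef", 0))  # True: the secret token "equals" zero
print(lax_equals("deadbeef", "0"))  # False: string comparison is exact
```

Under the strict variant the attacker's integer `0` raises an error instead of quietly matching a token that doesn't start with a digit.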

Incidentally, it is possible to put MySQL into a stricter "SQL mode":

SET sql_mode='TRADITIONAL';

This is rarely done, probably because most software somehow implicitly relies on this broken behaviour. By the way, does anyone else think it's funny that this mode is called "traditional"? As if it were somehow old-fashioned to expect precise and correct behaviour!

Take back control

It is high time people realised that implicit behaviour and unclear specifications are a recipe for disaster. Computers are by nature rigid and exact. This is a feature we should embrace. Processes in the "real world" are often fuzzy and poorly defined, usually because they are poorly understood. As programmers, it's our job to keep digging until we have enough information to describe the task to a computer. Making APIs fuzzier is the wrong response to this problem, and a sign of weakness. Do you prefer to know exactly what your program will do, or would you rather admit defeat and allow fuzziness to creep into your programs?

In case you're wondering, this rant didn't come out of the blue. One of three reasons this blog is called more magic is as a wry reference to the trend of putting more "magic" into APIs, which makes them hard to control. This is a recurring frustration of mine and I would like to see a move towards less magic of this kind. Yeah, I'm a cynical bastard ;)

A new domain

2013-02-11T18:27:52Z

I've finally decided to get a proper domain name: http://www.more-magic.net. Please update your bookmarks and feed readers!

I used to run this blog on a hostname from the good folks at DynDNS, which I registered in my college days. DynDNS had the benefit of being 100% free (great for poor college students!), but the disadvantage of having to run a tool called ddclient. This tool is intended to update DNS entries for hosts with dynamically assigned IP addresses, and if you don't run it, your hostname will expire.

Occasionally ddclient gets "stuck", not performing updates anymore. This goes unnoticed until you get an e-mail from DynDNS stating that your domain will expire in 5 days unless you click the reactivation link and restart ddclient. The hassle of this and the risk of ddclient getting stuck at a bad time, together with the unprofessional quality of running under a domain that's obviously not your own (and harder to remember) finally got me to consider paying for a proper domain. So there you have it: more-magic.net :)

Lessons learned from NUL byte bugs

2012-12-10T21:19:52Z

Last time I explained how sloppy representations can cause various vulnerabilities. While doing some research for that post I stumbled across NUL byte injection bugs in two projects. Because both have been fixed now, I feel like I can freely talk about them with a clear conscience.

These projects are Chicken Scheme and the C implementation of Ruby. The difference in the way these systems deal with NUL bytes clearly shows the importance of handling security issues in a structural way. We'll also see the importance of truly grokking the problem when implementing a fix.

A quick recap

Remember that C uses NUL bytes to delimit strings. Many other languages store the length of the string instead. In these languages, NUL bytes can occur inside strings. This can cause unintended reinterpretation when strings cross the language border into C.
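You can watch this reinterpretation happen from Python, using nothing but the standard `ctypes` module. This is just an illustration of the two views on the same bytes:

```python
import ctypes

data = b"hello\x00there"
buf = ctypes.create_string_buffer(data)

print(len(data))  # 11: the length-counted view sees every byte
print(buf.raw)    # b'hello\x00there\x00': all bytes, plus C's terminator
print(buf.value)  # b'hello': the C-string view stops at the first NUL
```

Everything after the first NUL is still in memory, but any C function that treats the buffer as a string will never look at it.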

In my previous post I already pointed out how Chicken automatically prevents this reinterpretation in its foreign function interface (FFI). You just describe to Scheme that your C function accepts a string, and it will take care of the rest:

(define my-length (foreign-lambda int "strlen" c-string))

;; Prints 12:
(print (my-length "hello, there"))

;; Raises an exception, showing the following message:
;; Error: (##sys#make-c-string) cannot represent string with NUL
;;   bytes as C string: "hello\x00there"
(print (my-length "hello\x00there"))

The FFI's feature of automatically checking for NUL bytes in strings before passing them on to C was only added in late 2010 (Chicken 4.6.0). However, because everything uses this interface, this mismatch could easily be fixed, in a central location, securing all existing programs in one fell swoop.

Now, you may be thinking "well, that's nothing special; it's good engineering practice that there must be a single point of truth, and that you Don't Repeat Yourself". And you'd be right! In fact, this is a key insight: solid engineering is a prerequisite to secure engineering. It can prevent security bugs from happening, and help to fix them quickly once they are discovered. A core tenet of "structural security" is that without structure, there can be no security.
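In code, the "single point of truth" is nothing more than one conversion routine that every string must pass through on its way to C. Here's a minimal Python sketch of that idea; it is not Chicken's implementation, and `to_c_string` is a hypothetical name.

```python
def to_c_string(text: str) -> bytes:
    """Single choke point for every string that crosses into C."""
    encoded = text.encode("utf-8")
    if b"\x00" in encoded:
        raise ValueError(
            f"cannot represent string with NUL bytes as C string: {text!r}"
        )
    # Only now is it safe to append the terminator C expects.
    return encoded + b"\x00"


print(to_c_string("hello, there"))
try:
    to_c_string("hello\x00there")
except ValueError as err:
    print("error:", err)
```

As long as nobody hand-rolls their own conversion, adding the check here fixes every caller at once.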

When smugness backfires

To drive home the point, let's take a look at what I discovered while writing my previous blog post. After describing Chicken's Right Way solution and feeling all smug about it, I noticed an embarrassing problem: for various reasons (some good, others less so), there are places in Chicken where C functions are called without going through the FFI. Some of these contained hand-rolled string conversions!

It turns out that we overlooked these places when first introducing the NUL byte checks, and as a consequence several critical procedures (standard R5RS ones like with-input-from-file) were left vulnerable to exactly this bug:

;; This program outputs "yes" twice in Chickens < 4.8.0
(with-output-to-file "foo\x00bar" (lambda () (print "hai")))
(print (if (file-exists? "foo") "yes" "no"))
(print (if (file-exists? "foo\x00bar") "yes" "no"))

To me, this just validates the importance of approaching security measures in a structural rather than an ad-hoc way; the bug was only in those parts of the code that didn't use the FFI. Deviation from a rule is where bugs are often found!

You can also see that we fixed it as thoroughly as possible, especially given the at times awkward structure of the Chicken code. We commented every special situation extensively, assigned a new error type C_ASCIIZ_REPRESENTATION_ERROR for this particular error, and added regression tests for at least each class of functionality (string to number conversion, file port creation, process creation, environment access, and low-level messaging functionality). There's definitely room for improvement here, and I hope to one day reduce the special cases to the bare minimum. By documenting special cases it's easy to avoid introducing new problems. It also makes them easier to find when refactoring. The tests help there too, of course.

When you run the above program in a Chicken version with the fix, it behaves as expected:

 Error: cannot represent string with NUL bytes as C string: "foo\x00bar"

Another approach

The Ruby situation is a little more complicated. It has no FFI but a C API, so it works the other way around: you write C to interface "up" into Ruby. It has a StringValueCStr() macro, which is documented as follows (sic):

 You can also use the macro named StringValueCStr(). This is just
 like StringValuePtr(), but always add nul character at the end of
 the result. If the result contains nul character, this macro causes
 the ArgumentError exception.

However, this isn't consistently used in Ruby's own standard library:

File.open("foo\0bar", "w") { |f| f.puts "hai" }
puts File.exists?("foo")
puts File.exists?("foo\0bar")

In Ruby 1.9.3p194 and earlier, this shows the following output, indicating it's vulnerable:

 true
 test.rb:4:in `exists?': string contains null byte (ArgumentError)
         from test.rb:4:in `<main>'

It turns out that internally, Ruby strings are stored with a length, but also get a NUL byte tacked onto the end, to prevent copying when calling C functions. This performance hack undermines the safety of Ruby to C string conversions, and is the direct cause of these inconsistencies. True, there is a safe function that extracts the string while checking for NUL bytes, but there are also various ways to bypass this, and if you accidentally use the wrong macro to extract the (raw) string, your code won't break. Of course, this is only true for benign inputs...

The complexity of Ruby's implementation makes it hard to ensure that it's safe everywhere. Indeed, the various places where strings are passed to C all do it differently. For example, the ENV hash for manipulating the POSIX environment has its own hand-rolled test for NUL, which you can easily verify; it produces a different error message than the one exists? gave us earlier:

irb(main):001:0> ENV["foo\0bar"] = "test"
ArgumentError: bad environment variable name

There is no reason this couldn't just use StringValueCStr(). So, even though Ruby has this safe macro, which provides a mechanism to check for poisoned NUL bytes in strings, it's rarely used by Ruby's own internals. This could be fixed just like Chicken; here too, the best way to do that would be to generalize and eliminate all special cases. Simpler code is easier to secure.

A fundamental misunderstanding

When I reported the bug in the File class to the Ruby project, they quickly had a fix, but unfortunately they seemed uninterested in going through Ruby's entire code to fix all string conversions (quoting from private e-mail conversation):

 > I agree that this looks like a good place to fix the File/IO
 > class, but there are many other places where strings are passed to C.
 > Are all of those secured?
 All path names should be converted with "to_path" method if possible.
 If any methods don't obey the rule, it is another bug.  Please let us
 know if you find such case.

In retrospect, there is the possibility that I didn't quite make myself clear enough. Perhaps this person thought I was referring to other path strings in the code. However, to me it sounds a lot like they made the same conceptual mistake that the PHP team made when they "fixed" NUL injections.

The PHP solution was to add a special "p" flag for converting path strings. This happens for all PHP functions declared in C (via zend_parse_parameters()). By the way, notice how this is a new flag. There probably are tons of PHP extensions out there which aren't using this flag yet. Also, who can verify that they managed to find all the strings in PHP which represent paths?

The PHP team was completely missing the point here. This fix means that path arguments aren't allowed to have embedded NUL bytes. Other string type arguments are not checked. They are missing the fact that this isn't just a path issue. Rather, as I described before, it's a fundamental mismatch at the language boundary where strings are translated from the host language to C. However, there seems to be a widespread belief that this can only be exploited in path strings.

I'm not entirely sure why this is, but I can guess. First off, "poisoned NUL byte" attacks have been popularized by a 1999 Phrack article. This article shows a few attacks, but only the path examples are really convincing. Of course, another reason is that injecting NUL bytes in path strings really is the most obvious and practical way to exploit web scripts.

Recently, however, different NUL byte attacks have been documented. For example, they can be used to truncate LDAP and SQL queries and to bypass regular expression filters on SQL input, but you could argue these are all examples of failure to escape correctly. I found a more convincing example in the (excellent!) book The Tangled Web: it contains a one-sentence warning about using HTML sanitation C libraries from other languages. Also, NUL bytes can sometimes be used to hide attacks from log files.

However, the most impressive recent exploit is without a doubt this common vulnerability in SSL certificate verification systems. In an attack, an embedded NUL byte causes a certificate to be accepted for "www.paypal.com", when the CN (Common Name) section (that is, the server's hostname) actually contains the value "www.paypal.com\0.thoughtcrime.org". Certificate authorities generally just accepted this as a valid subdomain of "thoughtcrime.org", ignoring the NUL byte. Client programs (like web browsers) tended to use C string comparison functions, which stop at the NUL byte. Luckily, this was widely reported, and has been fixed in most programs.
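The mismatch between the two parties is easy to reproduce. The Python sketch below models the C-side behaviour with a helper (`c_string_view`, a made-up name); it is not taken from any real certificate validation code.

```python
def c_string_view(name: bytes) -> bytes:
    """What a C function such as strcmp() sees: bytes up to the first NUL."""
    return name.split(b"\x00", 1)[0]


common_name = b"www.paypal.com\x00.thoughtcrime.org"

# A CA-style check on the full, length-counted value: the name ends in the
# attacker's own domain, so it looks like a subdomain they may request.
print(common_name.endswith(b".thoughtcrime.org"))       # True

# A client-style check through C string semantics: comparison stops at NUL.
print(c_string_view(common_name) == b"www.paypal.com")  # True

# Comparing all bytes (or rejecting embedded NULs outright) closes the gap.
print(common_name == b"www.paypal.com")                 # False
```

Both checks are "correct" in isolation; the hole only exists because the two sides disagree about where the string ends.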

I believe that NUL byte mishandling represents a big and mostly untapped source of vulnerabilities. High-level languages are gaining popularity over C for client-side programs, but many crucial libraries are still written in C. This combination means that the problem will grow unless this is structurally fixed in language implementations.

Structurally fixing injection bugs

2012-09-23T14:10:49Z

The two biggest threats to the web are caused by the same underlying mistake. It is time this problem was fixed at its root. This article will attempt to provide the tools to do so.

Input sanitation, input filtering or output escaping?

The Open Web Application Security Project (OWASP) does a great job at educating people and suggesting practical solutions to avoid common weaknesses. Unfortunately, most security bloggers focus on vulnerabilities rather than the prevention of attacks, and those that do cover prevention often give bad advice. For example, common advice is to avoid XSS (cross-site scripting) and SQL injection bugs by "sanitizing" or "validating" input. Now, by itself this is good advice.

Unfortunately, the phrase "sanitize your inputs" is often misunderstood and the advice itself can be misguided. For example, Chris Shiflett says:

 If you reject [anything but alphanumerics], Berners-Lee and O'Reilly will be
 rejected, despite being valid last names.   However, this problem is easily
 resolved.  A quick change to also allow single quotes and hyphens is all you
 need to do.  Over time, your input filtering techniques will be perfected.

I think this advice is a little unhealthy. Those are valid names, and rejecting them will only scare away customers and reinforce the idea that the "security Nazis" are out to inconvenience people. I wish people would place less emphasis on filtering and sanitizing. Citing this XKCD comic has become a cliché, which (while funny) makes it worse:

Validating and sanitizing input is good when it refers to parsing input into meaningful values immediately when receiving it, so that you don't, say, get a URL when you are expecting an integer. The horror story of ROBCASHFLOW shows how important input restrictions can be (but see also this cautionary list. Tl;dr: you're doomed either way).

However, input sanitation will (in general) not prevent XSS or SQL injection. The OWASP XSS prevention "cheat-sheet" recognizes input validation and sanitation for what it is: a good secondary security measure in a broader "defense in depth" strategy.

Instead, there are only two correct ways to prevent "injection" bugs. The best is often even omitted from advice, which is to avoid the problem entirely (see below). The other is to escape output. Unfortunately, advice to escape often seems to imply that you should manually call escaping procedures; "just use mysql_real_escape_string()". This is a very bad idea; it's tedious, it's easy to forget, it makes code less readable and it requires everyone working on the code to be equally informed about security issues (a great idea, but not very realistic).

Let's investigate how we can prevent these vulnerabilities easily and automatically. This will help us secure applications in a structural rather than an ad-hoc way.

The trouble with strings

The underlying problem of all these vulnerabilities is that a tree structure (e.g., the SQL script's AST or the HTML DOM tree) is represented as a string, and user input which should be a node in the tree is inserted into this string. If this includes characters from the meta-language which describes the tree's structure, it can influence that structure. Here's an example:

<p>{username} said the following: {message}</p>

When message is "So you see, if a<b and c<a, then b>c.", you get output like this (depending on the browser, HTML version and phase of the moon):

Math teacher said the following: So you see, if ac.

This code is simply incorrect, and this bug will frustrate users like the math teacher. But this can turn into a security nightmare; any punk can make you look like a fool by making your images dance around, take over your users' sessions by stealing cookies, or do much worse. The underlying reason this nonsense is possible at all is the fact that you are mixing user input strings with HTML.

In other words, you're performing string surgery on the serialized representation of a tree structure. Just stop and think how insane that really sounds! Why don't we use real data types? While researching this topic, I found an insightful article called "Safe String Theory for the Web". The author has a good grasp on the problem and comes close to the solution, but he never transcends the idea of representing everything as a string.

Many people don't, so despite the flawed concept, there are several solutions that take string splicing as a given. For instance, some frameworks have a sort of "safe HTML buffers", which automatically HTML-escape strings. These solutions don't deal with the context problem from "Safe String Theory for the Web". There's only one built-in string type, and making it context-aware is extremely hard, maybe even impossible. Strongly typed languages have an advantage here, though!

Representing HTML as a tree helps prevent injection bugs, and has other advantages over automatic escaping. For example, we need to worry less about generating invalid HTML; our output is always guaranteed to be well-formed. The essence of an XSS attack is that it breaks up your document structure and re-shapes it. These are just two sides to the same coin: By taking control of the HTML's shape, XSS is also avoided.

There's another, more insidious problem with splicing HTML strings, which I haven't seen discussed much either. It's another context problem; if your complex web application contains many "helper" functions, it becomes very hard to keep track of which helper functions accept HTML and which accept text. For example, is the following PHP function safe?

function render_latest_topicslist() {
  $out = '';
  foreach(Forum::latestPosts(10) as $topic) {
    $link = render_link('forum/show/'.(int)$topic['id'], $topic['title']);
    $out .= "<li>{$link}</li>";
  }
  return "<ul id=\"latest-topics\">{$out}</ul>";
}

This is (of course) a trick question. Consider:

$dest_url = ... some URL ...
$dest = htmlspecialchars($dest_url, ENT_QUOTES, 'UTF-8');
echo render_link($dest_url, "<span>Go to <em>{$dest}</em> directly.</span>");

Either this second example is wrong and the tags will come out literally (i.e., as <span>...</span> in the HTML source), or the first example was wrong and you have an injection bug. You can't tell without consulting render_link's API documentation or implementation. With many helper procedures, how can you keep track of which accept fully formed HTML and which escape their input? What happens when a function which auto-encodes suddenly needs to be changed to accept HTML?

This style of programming results in ad-hoc security. You add escaping in just the right places, decided on a case-by-case basis. This is unsafe by default; you must remember to escape, which makes it error-prone. It's also hard to spot mistakes in this style. The alternative to ad-hoc security is structural security: a style which makes it virtually impossible to write insecure code by accident, thus eliminating entire classes of vulnerabilities.

For example, in PHP we could use the DOM library to represent an HTML tree:

function get_latest_topicslist($document) {
  $ul = $document->createElement('ul');
  $ul->setAttribute('id', 'latest-topics');

  foreach(Forum::latestPosts(10) as $topic) {
    $title = $document->createTextNode($topic['title']);
    $link = get_link($document, 'forum/show/'.(int)$topic['id'], $title);

    $li = $document->createElement('li');
    $li->appendChild($link);
    $ul->appendChild($li);
  }
  return $ul;
}

And the second example:

$contents = $document->createElement('span');
$contents->appendChild($document->createTextNode('Go to '));
$em = $document->createElement('em');
$em->appendChild($document->createTextNode($dest));
$contents->appendChild($em);
$contents->appendChild($document->createTextNode(' directly.'));
$link = get_link($document, $dest_url, $contents);

Unfortunately, this code is very verbose. The stuff that really matters gets lost in the noise of DOM manipulation. The advantage is that this is safe; text content cannot influence the tree structure, since the type of every function argument is enforced to be a DOM object and string contents are automatically XML-encoded on output.
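
The same idea can be sketched in Python with the standard library's ElementTree module. This is an illustration rather than a translation of the PHP code: make_link is a hypothetical stand-in for get_link, the topic data is made up, and the output is serialized as XML rather than as HTML proper.

```python
import xml.etree.ElementTree as ET


def make_link(href: str, content) -> ET.Element:
    """Hypothetical helper: content is either plain text or an Element."""
    a = ET.Element("a", {"href": href})
    if isinstance(content, ET.Element):
        a.append(content)        # already a tree node: attach it as-is
    else:
        a.text = str(content)    # plain text: encoded when serialized
    return a


def latest_topics_list(topics) -> ET.Element:
    ul = ET.Element("ul", {"id": "latest-topics"})
    for topic in topics:
        li = ET.SubElement(ul, "li")
        li.append(make_link(f"forum/show/{int(topic['id'])}", topic["title"]))
    return ul


topics = [{"id": 7, "title": "if a<b and c<a, then b>c <script>alert(1)</script>"}]
html = ET.tostring(latest_topics_list(topics), encoding="unicode")
print(html)
```

Whether an argument is text or markup is decided by its type, not by convention, so the question from the trick example no longer comes up.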

Language design to the rescue!

Language design can help a great deal to improve security. For example, domain-specific languages like SXML and SSQL can save the programmer from having to remember to escape while writing most "normal", day-to-day code. This frees precious brain cycles to think about more essential things, like the program's purpose. Here's the example again, using SXML:

(define (latest-topics-list)
  `(ul (@ (id "latest-topics"))
       ,(map (lambda (topic)
               `(li ,(make-link `("forum" "show" ,(alist-ref 'id topic))
                                (alist-ref 'title topic))))
             (forum-latest-posts 10))))

And the second example:

(make-link destination-url `(span "Go to " (em ,destination) " directly."))

This code is safe from XSS, like the PHP DOM example. However, this code is (to a Schemer) just as readable as the naive PHP version. And, most importantly, the safety is achieved without any effort from the programmer.

This shows the immense safety and security advantages that can be gained from language design. Of course, this isn't completely foolproof: We still need to ensure URIs used in href attributes have an allowed scheme like http: or ftp: and not, say, javascript:. Note that input filtering and sanitation can help in situations like these! Also, just like with automatic escaping, strings in sub-languages (like JS or CSS) aren't automatically escaped. However, there is less "magic" involved; this is a representation for HTML, so it's obvious that only HTML meta-characters will be encoded. If we're also using DSLs for sub-languages, this auto-escaping effect can be nested, solving the "context problem" in a way automatic escaping cannot.
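
The scheme check mentioned above is the kind of input validation that does help. A minimal sketch in Python follows; the allow-list and the function name are assumptions for illustration, not part of SXML or any library.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https", "ftp"}  # assumed policy, adjust as needed


def is_safe_href(uri: str) -> bool:
    """Accept relative references and URIs whose scheme is on the allow-list."""
    # Browsers may ignore control characters inside a scheme, so refuse them.
    if any(ord(ch) < 0x20 or ch == "\x7f" for ch in uri):
        return False
    scheme = urlsplit(uri.strip()).scheme.lower()
    return scheme == "" or scheme in ALLOWED_SCHEMES


print(is_safe_href("forum/show/7"))         # True
print(is_safe_href("javascript:alert(1)"))  # False
```

An allow-list is used rather than a block-list because the set of dangerous schemes is open-ended.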

SXML rewards programmers for writing safe code by making it look clean, concise, and easy to write. String splicing looks ugly and verbose in Scheme. In plain PHP this looks clean and simple, while DOM manipulation looks ugly. This subtly guides programmers into writing unsafe code. However, there are some PHP libraries that make safe code look clean. For example, Drupal has a "Forms API". It's a little ugly, but it's idiomatic in Drupal, which means code that uses it is considered cleaner than code that doesn't. Facebook is another attractive target for attackers, so they had to come up with a structural solution. Their solution is a language extension called XHP which adds native support for HTML literals.

These solutions are all specific to some codebase, not part of basic PHP. A framework or an existing codebase has "default" libraries, but when writing from scratch most programmers prefer to use what's available in the base language. This means a language should only include libraries that are safe by default. Otherwise, alternative safe libraries have to compete with the standard ones, which is an unfair disadvantage!

Sidestepping the SQL injection problem entirely

Even though it's possible to write safe code in almost any language if you try hard enough, the basic design of a language itself subtly influences how people will program in it by default. Consider the following example, using the Ruby PG gem:

# This code is vulnerable to SQL injection if the variables store user input
res = db.query("SELECT first, last FROM users " \
               "WHERE login = '#{login}' " \
               "AND customer = '#{customer}' " \
               "AND department = '#{department}'")

Here we're using string interpolation, which is the expansion of variable names within a string. We saw this before, in PHP, but in Ruby you can drop back to the full language, which makes the safe solution a little easier to write:

# This code is safe
res = db.query("SELECT first, last FROM users " \
               "WHERE login = '#{db.escape_string(login)}' " \
               "AND customer = '#{db.escape_string(customer)}' " \
               "AND department = '#{db.escape_string(department)}'")

Still, it looks uglier than the first example.

The documentation says the escape_string method is considered deprecated. That's because sidestepping the problem entirely is much smarter than escaping. This is done by passing the user-supplied values completely separate ("out of band") from the SQL command string. This way, the data can't possibly influence the structure of the command. They are kept separate even in the network protocol, so it is enforced all the way up into the server. As an added bonus, this is only slightly more verbose than the naive version:

# This code is even safer
res = db.query("SELECT first, last FROM users " \
               "WHERE login = $1 AND customer = $2 AND department = $3",
               [login, customer, department])

This scales only to about a dozen parameters. With more, it becomes hard to mentally map the correct parameter to the correct position. A DSL can do this automatically for you. For example, Microsoft's LinqToSQL language extension seems to do this. SSQL currently auto-escapes, but it could transparently be changed to use positional parameters.
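
Short of a full DSL, named placeholders already remove the need to count positions. Here is a sketch using Python's built-in sqlite3 module, not the PG gem; the table layout and sample row are invented for the example, and since SQLite runs in-process there is no wire protocol involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (first TEXT, last TEXT, login TEXT, "
    "customer TEXT, department TEXT)"
)
conn.execute(
    "INSERT INTO users VALUES (?, ?, ?, ?, ?)",
    ("Miles", "O'Brien", "mobrien", "acme", "ops"),
)


def find_user(conn, login, customer, department):
    # Values travel separately from the SQL text; names replace positions.
    return conn.execute(
        "SELECT first, last FROM users "
        "WHERE login = :login AND customer = :customer "
        "AND department = :department",
        {"login": login, "customer": customer, "department": department},
    ).fetchall()


print(find_user(conn, "mobrien", "acme", "ops"))      # [('Miles', "O'Brien")]
print(find_user(conn, "' OR '1'='1", "acme", "ops"))  # []
```

The classic injection string in the second call is simply compared as a login name and matches nothing.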

Pervasive (in)security through (bad) design

I'm not a native English speaker, so I looked up the word "interpolation" on Merriam-Webster:

 interpolate, transitive verb:
 To alter or corrupt (as a text) by inserting new or foreign matter

To corrupt, indeed!

Interpolation of user-supplied strings is rarely correct, and it puts almost any conceivable safe API at a disadvantage by making the wrong thing easier and shorter to write than the right thing. Beginners, unaware of the security risks, will try to use this "neat feature". It's put in there for a reason, right? Some people are trying to fix string interpolation, which is a noble goal but I wouldn't expect this to be adopted as the "native" string interpolation mechanism in a language any time soon.

The Ruby examples show the importance of good documentation and library design. The docs pointed us in the right direction by marking the escape_string method as deprecated. The PG gem's good design is more apparent when contrasted with the MySQL gem, which has no support for positional arguments in query, having only escape_string and prepare. The latter allows you to pass parameters separately, but it conflates value separation with statement caching and has an unwieldy API. Finally, the docs are quite sparse. Taken together, this all gently nudges developers into the direction of string interpolation by making that the easiest way to do it. Much of this is due to the design of MySQL's wire protocol, which dictates the API of the C library, which in turn guides the design of "high-level" libraries built on top of it.

I think high-level libraries should strive to abstract away unsafe or painful aspects of the lower levels. For example, the Chicken MySQL-client egg emulates value separation:

(conn (string-append "SELECT first, last FROM users "
                     "WHERE login = '?login' "
                     "AND customer = '?cust' "
                     "AND department = '?dept'")
      `((?login . ,login) (?cust . ,customer) (?dept . ,department)))

Ruby's MySQL gem could easily have done this, but they chose to restrict themselves to making a thin layer which maps closely to the C library.

Not all is lost with crappy libraries: Abstractions can solve such problems at an even higher level. Rails can safely pass query parameters via Arel, in a database-independent way, even though MySQL is one of the back-ends. The same is true for SQLAlchemy, PDO and many others.

Other examples

This section will show more examples of the same bug. They can all be structurally solved in two simple ways: Automatic escaping (by using proper data structures) or passing data separately from the control language. But let's start with one where this won't work :)

Poisoned NUL bytes

As you may know, strings in the C language are pointers to a character array terminated by a NUL (ASCII 0) byte. Many other languages represent strings as a pointer plus a length, allowing NUL "characters" to simply occur in strings, with no special meaning.

This representational mismatch can be a problem when calling C functions from these languages. In many cases, a C character array of the length of the string plus 1 is allocated, the string contents are copied from the "native" string to the array and a NUL byte is inserted at the end. This causes a reinterpretation of the string's value if it contains a NUL byte, which opens up a potential vulnerability to a "poisoned" NUL byte attack.

Let's look at a toy example in Chicken Scheme:

(define greeting "hello\x00, world!")

(define calculate-length-in-c
  (foreign-lambda int "strlen" c-string))

(print "Scheme says: " (string-length greeting))
(print "C says: " (calculate-length-in-c greeting))

As far as Scheme is concerned, the NUL byte is perfectly legal and the string's length is 14, but for C, the string ends after hello, which makes for a length of 5. There is no way in C to "escape" NUL bytes, and we can't sidestep it here, either. Our only option is to raise an error:

 Scheme says: 14
 
 Error: (##sys#make-c-string) cannot represent string with
    NUL bytes as C string: "hello\x00, world!"

This is a good example of structural security; it doesn't matter whether the programmer is caffeine-deprived, on a tight deadline or simply unaware of this particular vulnerability. He or she is protected from accidentally making this mistake because it's handled at the boundary between C and Scheme, which is exactly where it should be handled.
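
CPython takes the same approach at its own C boundary, which makes for a convenient comparison. In this small sketch the path is arbitrary and stat_is_rejected is a helper made up for the demonstration.

```python
import ctypes
import os

greeting = b"hello\x00, world!"

# Python's own view: 14 bytes, NUL included.
print(len(greeting))  # 14

# A C-style view of the same bytes stops at the first NUL.
print(ctypes.c_char_p(greeting).value)  # b'hello'


def stat_is_rejected(path: str) -> bool:
    """Return True if the call refuses a path with an embedded NUL."""
    try:
        os.stat(path)
    except ValueError:   # raised before the C function is ever called
        return True
    except OSError:      # the path reached the OS (e.g. file not found)
        return False
    return False


print(stat_is_rejected("/etc/passwd\x00.png"))  # True
```

The ctypes line shows what would happen without the check: the ".png" suffix an application might have validated is silently dropped.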

HTTP response splitting/Header injection

HTTP response splitting and HTTP header injection are two closely related attacks, based on a single underlying weakness.

The idea is simple: HTTP (response) headers are separated by a CRLF combination. If user input ends up in a header (like in a Location header for a redirect), this can allow an attacker to split a header in two by putting a separator in it. Let's say that http://example.com/foo gets redirected to http://example.com/blabla?q=foo.

An attacker can trick someone (or their browser) into following this link (%0d%0a is a URI-encoded CRLF pair):

 http://www.example.com/abc%0d%0aSet-Cookie%3a%20SESSION%3dwhatever-i-want

This could cause the victim's session cookie for example.com to be overwritten:

 Location: http://www.example.com/blabla?q=abc
 Set-Cookie: SESSION=whatever-i-want

This is a session fixation attack. For this particular bug, the real solution is of course to properly percent-encode the destination URI, but the general solution can be as simple as disallowing newlines in the header-setting mechanism (e.g., PHP has done this since 5.1.2). Doing it in the only procedure which is capable of emitting headers is a structurally secure approach, but it won't protect against all attacks.
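
Both fixes fit in a few lines. The following Python sketch assumes a framework in which every header passes through one function; set_header, redirect_to_search and the exception class are hypothetical names, not an existing API.

```python
from urllib.parse import quote


class HeaderInjectionError(ValueError):
    pass


def set_header(headers: dict, name: str, value: str) -> None:
    """Single choke point for emitting headers: refuse CR, LF and NUL."""
    for text in (name, value):
        if any(ch in text for ch in "\r\n\x00"):
            raise HeaderInjectionError(f"illegal character in header: {text!r}")
    headers[name] = value


def redirect_to_search(headers: dict, user_query: str) -> None:
    # The specific fix: percent-encode the untrusted part of the URI.
    location = "http://www.example.com/blabla?q=" + quote(user_query, safe="")
    set_header(headers, "Location", location)


headers = {}
redirect_to_search(headers, "abc\r\nSet-Cookie: SESSION=whatever-i-want")
print(headers["Location"])
```

Even if some caller forgets the percent-encoding, the check in set_header still turns the attack into an error instead of a second header.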

For example, even if we disallow newlines it is still possible to set a parameter (attribute) or a second value for a header, splitting it with a semicolon or a comma, respectively:

 Accept: text/html;q=0.5, text/{user-type}

If this is done unsafely, extra content-types can be added. They can even be given preference:

 Accept: text/html;q=0.5, text/plain;q=0.1, application/octet-stream;q=1.0

Protecting against these sorts of attacks can only be done with libraries which know each header's syntax and use rich objects to represent them. This approach is taken by intarweb and Guile's HTTP library, and is similar to representing HTML as a (DOM) tree. I'm not aware of any other libraries which use fully parsed "objects" to represent HTTP header values.
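
To give an impression of what such a rich object might look like, here is a small Python sketch for Accept-style values. It is not how intarweb or Guile do it; the class is invented for this example, it only handles the q parameter, and the character set is the "token" rule from the HTTP specification (RFC 7230).

```python
import re
from dataclasses import dataclass

# Characters allowed in an HTTP "token" (RFC 7230, section 3.2.6).
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


@dataclass(frozen=True)
class MediaRange:
    type: str
    subtype: str
    q: float = 1.0

    def __post_init__(self):
        for part in (self.type, self.subtype):
            # fullmatch: the whole string must be a token, no trailing junk
            if not TOKEN.fullmatch(part):
                raise ValueError(f"not a valid token: {part!r}")
        if not 0.0 <= self.q <= 1.0:
            raise ValueError("q must be between 0 and 1")

    def serialize(self) -> str:
        return f"{self.type}/{self.subtype};q={self.q:.3g}"


def accept_header(ranges) -> str:
    return ", ".join(r.serialize() for r in ranges)


print(accept_header([MediaRange("text", "html", 0.5),
                     MediaRange("text", "plain", 0.1)]))
```

Because the separators (semicolon, comma, slash) are only ever produced by the serializer, a user-supplied subtype cannot smuggle in an extra media range.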

Running subprocesses

For some reason, people often use a procedure like system() to invoke subprocesses. It's the most convenient way to quickly run a program just like you would from the command line. It passes its string argument to the Unix shell, which expands globs ("wildcards") and runs the program:

(system (sprintf "echo \"~A\"" input))  ;; UNSAFE:   byebye files"; rm -rf / "

Several languages have specialized syntax for invoking the shell and putting the output in a string using backticks, e.g., `echo hi`. The really bad part is that string interpolation is supported within the backtick operator, e.g., `echo Hi, "{$name}"`. This is dangerous because the shell is yet another interpreter with its own language, and we've learned by now that we shouldn't embed user input directly into a sublanguage. Here too, string interpolation makes the wrong thing very convenient, which increases the risk of abuse and bugs. After all, spaces and quotes are perfectly legal inside filenames, but when used with unsafely interpolated parameters, they will cause errors.

It is possible to escape shell arguments, but it's very tricky: no two shells provide exactly the same command language with the same meta-characters. Is your /bin/sh really bash, dash, ash, ksh or something else? It is even unspecified whether the sh used is /bin/sh.

However, a better approach is often available. Many programming languages offer an interface to one or more members of the POSIX exec() function family. These allow passing the arguments to the program in a separate array, and they don't go through the shell to invoke the program at all. This is faster and a lot more secure.

(use posix)
;; Accepts a list of arguments:
(process "echo" (list "Hello, " first-name " " last-name))

By sidestepping the problem we've made it simpler, shorter than the system call above and safer, which is our goal. In languages with string interpolation this will probably be slightly more verbose than the system() version.
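
For comparison, this is roughly what the same approach looks like in Python, using subprocess.run with an argument list. The child program is just the Python interpreter echoing its first argument, chosen so that the sketch does not depend on any particular echo binary.

```python
import subprocess
import sys


def echo_via_exec(text: str) -> str:
    """Run a child process with an argument vector; no shell is involved."""
    child_code = "import sys; sys.stdout.write(sys.argv[1])"
    result = subprocess.run(
        [sys.executable, "-c", child_code, text],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


hostile = 'byebye files"; echo pwned "'
print(echo_via_exec(hostile))
```

The quotes and the semicolon arrive in the child as ordinary characters, because nothing along the way tries to parse them.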

There is one small problem: by eliminating a call to the shell, we've also lost the ability to easily construct pipelines. This can be done by calling several procedures, but this is way more complicated than it is in the shell language. The obvious solution to that is to design a safe DSL. This is what the Scheme Shell does with its notation for UNIX pipelines:

;; This will echo back the input, regardless of "special" characters
(define output (run/string (| (echo input) (caesar 5) (caesar 21))))
(display output)

Almost as convenient as the backtick operator, but without its dangers.
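
To show what "calling several procedures" amounts to without such a DSL, here is the same two-stage pipeline wired up by hand in Python. This is a sketch: the caesar program is emulated by a small inline script, and the code ignores the deadlock that can occur when the input is larger than a pipe buffer.

```python
import subprocess
import sys

# Child program: Caesar-shift lowercase letters by the amount in argv[1].
CAESAR = (
    "import sys\n"
    "n = int(sys.argv[1])\n"
    "for ch in sys.stdin.read():\n"
    "    if 'a' <= ch <= 'z':\n"
    "        ch = chr((ord(ch) - 97 + n) % 26 + 97)\n"
    "    sys.stdout.write(ch)\n"
)


def caesar_pipeline(text: str) -> str:
    """Equivalent of `caesar 5 | caesar 21`, connected without a shell."""
    first = subprocess.Popen(
        [sys.executable, "-c", CAESAR, "5"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    )
    second = subprocess.Popen(
        [sys.executable, "-c", CAESAR, "21"],
        stdin=first.stdout, stdout=subprocess.PIPE,
    )
    first.stdout.close()  # the parent no longer needs this end of the pipe
    first.stdin.write(text.encode("utf-8"))
    first.stdin.close()
    output, _ = second.communicate()
    first.wait()
    return output.decode("utf-8")


print(caesar_pipeline('hello "; echo pwned'))
```

It is safe, but the amount of plumbing compared to the one-line scsh version makes it clear why people reach for the shell.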

Summary

Language design can help us write applications which are structurally secure. We should strive to make writing the right thing easier than the wrong thing so that even naively written, quick and dirty code has some chance of being safe. To reach this goal, we can use the following approaches, in roughly decreasing order of safety:

"Sidestep" the issue by keeping data separated from commands.
Represent data in proper data structures, not strings. On output, escape where needed.
Use "safe buffers" which auto-escape concatenated strings.
If escaping or separation is impossible, raise an error on bad data.
If all else fails you can escape manually, but use coding conventions that make unsafe code stand out.

These approaches are your first line of defense. Besides using these, you should also filter and sanitize your input. Just don't mistake that as the fix for injection vulnerabilities!

This is the positive advice I can give you. The negative advice is simply to avoid building language or library features which make unsafe code easier to write than safe code. An example of such a feature is string interpolation, which causes more harm than good.

Designing Lispy DSLs, part 4: SSQL

2012-08-20T19:19:40Z

Today we'll look at an old, experimental DSL of my own design. I've always referred to it as a failed experiment, but perhaps it's really a successful experiment, because it helped me figure out why this type of DSL doesn't work too well. Whatever the status, I'll use it as an example of what makes a bad DSL.

The DSL in question is SSQL, a way of embedding SQL as S-expressions into Scheme code. Interestingly, it seems I had a bad feeling about it from the start; the initial commit had the following message:

 Add another doomed project - ssql

It turned out not to be completely doomed, because Moritz Heidkamp has kindly taken over maintenance and has been improving and polishing the library. I might even use it again for my own projects if I ever get tired of working directly with SQL.

Scoped access

For my day job I used to write a lot of Rails code, and I got tired of the restrictions in ActiveRecord. I have to mention that this was in the days before Arel, which is a great improvement in the way you can use custom queries in Rails.

With ActiveRecord, you could write code that would automatically prevent users from accessing things they shouldn't be able to access with the scoped_access plugin. This allowed you to write things in your controller like the following:

scoped_access Customer

def method_scoping
  ScopedAccess::ClassScoping.new(Customer, :user => {:id => current_user.id})
end

I don't recall exactly how it worked, but when you had a complex query, this could cause clashes when the same table was joined in twice, especially if the condition was complex. In different situations, different queries could be generated. Back then, you also needed to know internally-generated join aliases in order to scope related tables. Remember, this was quite a while ago, and I was a bit of a newbie and had been programming Ruby and Rails for only a year or two. There may have been better ways to do this even then.

In any case, this scoping problem annoyed me no end and I knew there had to be a better way. It was obvious that if you represent the query in a more complex data structure than a simple string, you can easily fetch all the references to a particular table (even if it is aliased), and add some scoping to it. This could be done even if it required the addition of joins, and even if those tables were already joined under arbitrary names, as long as you would alpha-rename all aliases to avoid clashes with user-created aliases.

Here's an example of the SSQL syntax. This example is based on a toy data model for an IMDB-clone with films, actors and their roles in them:

'(select (columns (col actors id firstname lastname)
                  (col roles character movie_id))
         (from actors roles)
         (where (and (= (col actors firstname) "Bruce")
                     (= (col actors lastname) "Campbell")
                     (= (col actors id) (col roles actor_id)))))

The regular SQL equivalent of this:

 SELECT actors.id, actors.firstname, actors.lastname,
        roles.character, roles.movie_id
 FROM actors, roles
 WHERE actors.firstname = 'Bruce'
   AND actors.lastname = 'Campbell'
   AND actors.id = roles.actor_id;

The SSQL for column selection can be a little ugly or verbose, so it's also allowed to specify columns with a dot instead of the col form (probably a mistake, complicating the DSL design):

'(select (columns actors.id actors.firstname actors.lastname
                  roles.character roles.movie_id)
         (from actors roles)
         (where (and (= actors.firstname "Bruce")
                     (= actors.lastname "Campbell")
                     (= actors.id roles.actor_id))))

The columns "noise word" is still required, because that makes it easier to walk the expression and programmatically manipulate it. In any case, scoping a table is easy, even for arbitrarily complex cases:

(let ((query
        '(select (columns actors.firstname actors.lastname
                          roles.character movies.title)
                 (from (join left
                             (join left actors
                                   (join inner roles (as movies m2)
                                         (on (and (= m2.id roles.movie_id)
                                                  (> m2.year 2000))))
                                   (on (= roles.actor_id actors.id)))
                             movies
                             (on (= movies.id roles.movie_id)))))))
  (scope-table 'movies '(< (col movies year) 2005) query))

;; Results in the following:
(select (columns actors.firstname actors.lastname
                 roles.character movies.title)
        (from (join left
                    (join left actors
                          (join inner roles (as movies m2)
                                (on (and (= m2.id roles.movie_id)
                                         (> m2.year 2000))))
                          (on (= roles.actor_id actors.id)))
                    movies
                    (on (= movies.id roles.movie_id))))
        (where (and (< (col m2 year) 2005)
                    (< (col movies year) 2005))))

The initial query selects all the films in the database, including all actors with the roles they played in that film. However, the actors are only included for films that were released after the year 2000. Earlier films are returned without the actors.

Now, the magic happens in the call to scope-table, which returns the same query, but with all occurrences of the movies table scoped to include only films released before the year 2005. Note that this scopes both the main query and the joined table m2 even though it's aliased.
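
The core of this transformation is a simple tree walk. The Python sketch below mimics it with nested lists standing in for S-expressions. It is not SSQL's actual implementation: it ignores subqueries, dotted column names in the condition and the alpha-renaming of aliases, and all function names are made up.

```python
def table_names(node, table, in_from=False):
    """Collect every name under which `table` is joined in, in source order."""
    if isinstance(node, str):
        return [node] if in_from and node == table else []
    if not isinstance(node, (list, tuple)) or not node:
        return []
    head = node[0]
    if head == "as":                      # (as table alias)
        return [node[2]] if node[1] == table else []
    if head == "on":                      # join conditions name columns, not tables
        return []
    names = []
    for child in node[1:]:
        names.extend(table_names(child, table, in_from or head == "from"))
    return names


def rename_table(cond, table, name):
    """Rewrite (col table ...) references in a condition to use `name`."""
    if isinstance(cond, (list, tuple)):
        if len(cond) > 1 and cond[0] == "col" and cond[1] == table:
            return ["col", name, *cond[2:]]
        return [rename_table(c, table, name) for c in cond]
    return cond


def scope_table(table, condition, query):
    scoped = [rename_table(condition, table, n) for n in table_names(query, table)]
    if not scoped:
        return query
    extra = scoped[0] if len(scoped) == 1 else ["and", *scoped]
    out, found = [], False
    for clause in query:
        if isinstance(clause, (list, tuple)) and clause and clause[0] == "where":
            out.append(["where", ["and", clause[1], extra]])
            found = True
        else:
            out.append(clause)
    if not found:
        out.append(["where", extra])
    return out


query = ["select",
         ["columns", "actors.firstname", "roles.character", "movies.title"],
         ["from", ["join", "left",
                   ["join", "left", "actors",
                    ["join", "inner", "roles", ["as", "movies", "m2"],
                     ["on", ["and", ["=", "m2.id", "roles.movie_id"],
                             [">", "m2.year", 2000]]]],
                    ["on", ["=", "roles.actor_id", "actors.id"]]],
                   "movies",
                   ["on", ["=", "movies.id", "roles.movie_id"]]]]]

print(scope_table("movies", ["<", ["col", "movies", "year"], 2005], query)[-1])
```

The point is that finding every occurrence of a table, aliased or not, is a dozen lines on a tree and practically impossible on a string.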

It's all about the syntax

Okay, so it turns out that this idea works beautifully. Let's look at why I think this DSL was a failure. One reason is the fact that SQL is a huge language, especially when you consider all the extensions provided by various implementations.

You could say "but you don't have to support the full language". That's true, but the problem with a language that maps directly to SQL is that users will expect being able to do all the things they can do in regular SQL. For example, when Common Table Expressions were first introduced into PostgreSQL, I started seeing many places in my code bases at work where those would be useful. The same was true for Window Functions. These are both extremely useful extensions, and I'm now making regular use of them. I wouldn't want to miss them, so any SQL DSL really needs to support them for me to take it seriously.

The thing both extensions have in common is that they introduce completely new syntax. That's because there are absolutely no common building blocks for language constructs; every feature is a set of arbitrarily-placed keywords to help a parser make sense of it (with many optional "noise" keywords to help a human make sense of it). This means each feature has to be taught separately to SSQL, resulting in a large set of rules on how to convert them to SQL.

The SQL grammar is so complicated that its sheer size has serious performance implications on a parser, as pointed out by this blog post. Because EXPLAIN is a PostgreSQL extension, they simply decided to change this command's syntax to make it faster to parse. The old syntax is still supported for backwards-compatibility, but this change is a great illustration of how much of a moving target the SQL syntax really is. Other SQL implementations don't generally move as fast as PostgreSQL in adding features, but as I indicated earlier, I really like these features and use them on a regular basis.

Database independence with SQL-based syntax?

Another complication is supporting multiple databases. SSQL supports ANSI SQL as a baseline, with optional extensions that are available if the back-end supports it. The nice thing is that this provides a degree of database independence. All back-ends can automatically quote strings and table names correctly depending on the database, making SQL injection bugs effectively impossible. For example,

'(select (columns (col actors firstname lastname birth-date))
         (from actors)
         (where (= actors.lastname "O'Neill")))

gets output as the following in PostgreSQL and SQLite:

 SELECT actors.firstname, actors.lastname, actors."birth-date"
 FROM actors
 WHERE actors.lastname = 'O''Neill';

The MySQL back-end outputs the following:

 SELECT actors.firstname, actors.lastname, actors.`birth-date`
 FROM actors
 WHERE actors.lastname = 'O\'Neill';
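
The dialect-dependent quoting in the two outputs above could be sketched roughly as follows. This is not SSQL's actual code: escape-string, quote-string and quote-identifier are hypothetical names, and the MySQL branch assumes the server's default mode, in which backslash acts as an escape character.

```scheme
;; Hypothetical helpers for illustration; not SSQL's real API.

;; Replace each character of STR that appears in the alist ESCAPES
;; (character . replacement-string) by its replacement.
(define (escape-string str escapes)
  (apply string-append
         (map (lambda (c)
                (let ((hit (assv c escapes)))
                  (if hit (cdr hit) (string c))))
              (string->list str))))

;; Quote a string literal for the given dialect (a symbol).
(define (quote-string dialect str)
  (string-append
   "'"
   (escape-string
    str
    (case dialect
      ;; Assumes MySQL's default mode, where backslash is an escape character.
      ((mysql) '((#\' . "\\'") (#\\ . "\\\\")))
      ;; Standard SQL (PostgreSQL, SQLite): double the single quote.
      (else '((#\' . "''")))))
   "'"))

;; Quote an identifier: backticks for MySQL, double quotes otherwise.
;; The delimiter itself is escaped by doubling it.
(define (quote-identifier dialect name)
  (case dialect
    ((mysql) (string-append "`" (escape-string name '((#\` . "``"))) "`"))
    (else (string-append "\"" (escape-string name '((#\" . "\"\""))) "\""))))
```

With these definitions, (quote-string 'postgresql "O'Neill") yields 'O''Neill' and (quote-string 'mysql "O'Neill") yields 'O\'Neill', matching the outputs shown above. The point is that the caller never chooses the escaping; the back-end does.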

These differences are relatively small and don't affect the syntax of the S-expression version. However, there are other examples that do. For example, MySQL's INSERT statement allows syntax which mirrors the UPDATE statement, using SET:

 INSERT INTO movies SET title = 'Alien', year = 1979;

whereas PostgreSQL only allows the standard syntax (which MySQL also supports):

 INSERT INTO movies (title, year) VALUES ('Donnie Darko', 2001);

The question then becomes whether the (unnecessary) syntax with SET should be allowed, and, if so, whether this should be emulated in PostgreSQL by rewriting it to the standard syntax. There are dozens of such silly examples (CONCAT versus || versus logical OR, case insensitive LIKE versus ILIKE, etc), but there are a lot of more fundamental differences, too.

Using ANSI as a baseline is nice, but many of ANSI's features aren't widely implemented. Common Table Expressions are a good example; they're standardized, but neither MySQL nor SQLite supports them, and Postgres only started supporting them very recently. Oh, and fuck proprietary RDBMSes; Oracle long ignored ANSI and invented more nonstandard extensions than MySQL ever did, and as a result, their users are as clueless about ANSI SQL as the average MySQL user. Finally, there are many ANSI features that none of these databases support. This means you have to implement a feature in ANSI, then override it in every implementation that doesn't support it, so that it produces an error message saying it's unsupported in that dialect. An alternative approach is to implement no base but make everything completely implementation-specific. However, this results in a bigger risk of producing DSL inconsistencies between dialects.
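
To make the "ANSI base with per-dialect overrides" idea concrete, here is a minimal sketch. It is not how SSQL is actually structured; make-dialect, lookup-handler, unsupported and emit are hypothetical names, and the handlers work on plain strings only to keep the example short.

```scheme
;; A dialect is a pair: (parent-dialect . alist of (operator . handler)).
(define (make-dialect parent handlers)
  (cons parent handlers))

;; Look up OP in DIALECT, falling back on its ancestors.
(define (lookup-handler dialect op)
  (cond ((not dialect) #f)
        ((assq op (cdr dialect)) => cdr)
        (else (lookup-handler (car dialect) op))))

;; Build a handler that rejects OP for the named dialect.
(define (unsupported op dialect-name)
  (lambda args
    (error (string-append (symbol->string op)
                          " is not supported by " dialect-name))))

;; The ANSI base implements everything the standard describes.
(define ansi
  (make-dialect
   #f
   (list (cons 'with
               (lambda (name query body)
                 (string-append "WITH " name " AS (" query ") " body)))
         (cons 'concat
               (lambda (a b) (string-append a " || " b))))))

;; MySQL overrides what differs and disables what it lacks.
(define mysql
  (make-dialect
   ansi
   (list (cons 'with (unsupported 'with "MySQL"))
         (cons 'concat
               (lambda (a b) (string-append "CONCAT(" a ", " b ")"))))))

;; A dialect without overrides simply inherits the ANSI behaviour.
(define postgresql (make-dialect ansi '()))

(define (emit dialect op . args)
  (apply (lookup-handler dialect op) args))
```

Here (emit postgresql 'concat "a" "b") falls through to the ANSI handler, (emit mysql 'concat "a" "b") uses the override, and (emit mysql 'with ...) signals an error. Every unsupported feature needs such an explicit override in every dialect that lacks it, which is exactly the bookkeeping burden described above.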

A better approach

Recently, the relational algebra has been gaining some more interest. For example, there's Alf for Ruby and the UNIX shell, and of course Arel, which I mentioned earlier.

I think this is a better approach; relational algebra has just a handful of concepts and there's no syntax associated with it, so you can invent your own syntax to best fit your DSL. It also prevents you from getting distracted by the differences in various SQL implementations. You can see this with Alf already; it has total abstraction over the DBMS. It can use flat files or SQL, or any other back-end you'd like, as long as it fits the relational model (to be fair, so can PostgreSQL with SQL/MED foreign data wrappers). The flip side of such a high level of abstraction is that it will be harder to make use of any killer features offered by your RDBMS; you get the lowest common denominator in features.
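
To illustrate how small the core of relational algebra is, here is a toy sketch of two of its operators over in-memory data. It is unrelated to Alf's or Arel's real APIs; the representation (a relation as a list of association lists) and the names restrict, project, distinct and attr are assumptions made for this example.

```scheme
;; A relation is a list of tuples; a tuple is an alist of
;; (attribute . value) pairs.

;; Restriction: keep the tuples for which PRED holds.
(define (restrict pred relation)
  (cond ((null? relation) '())
        ((pred (car relation))
         (cons (car relation) (restrict pred (cdr relation))))
        (else (restrict pred (cdr relation)))))

;; Remove duplicate tuples, since a relation is a set.
(define (distinct relation)
  (let loop ((rest relation) (seen '()))
    (cond ((null? rest) (reverse seen))
          ((member (car rest) seen) (loop (cdr rest) seen))
          (else (loop (cdr rest) (cons (car rest) seen))))))

;; Projection: keep only the attributes listed in ATTRS.
(define (project attrs relation)
  (distinct
   (map (lambda (tuple)
          (map (lambda (attr) (assq attr tuple)) attrs))
        relation)))

;; Fetch the value of attribute NAME from TUPLE.
(define (attr name tuple)
  (cdr (assq name tuple)))

(define movies
  '(((title . "Alien") (year . 1979))
    ((title . "Donnie Darko") (year . 2001))))

;; Titles of all movies released after 1990.
(define recent-titles
  (project '(title)
           (restrict (lambda (movie) (> (attr 'year movie) 1990))
                     movies)))
```

Nothing here mentions SELECT, FROM or WHERE. A back-end is free to translate the same operator tree to SQL, or to evaluate it directly as this sketch does.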

Optimizing queries also becomes hard. You can no longer hand-optimize them when writing them, and you'd probably end up with an optimizer in your library. This is pretty insane, since there's also an SQL optimizer and query planner inside your RDBMS, so you're doing twice the work, and there's twice the opportunity for getting it wrong.

Despite these disadvantages, the "relational algebra DSL" approach is more viable than the "SQL DSL" approach. ClojureQL also initially took the approach of providing a DSL closely modeled on SQL, but later completely revised the DSL to be more abstract and closer to relational algebra than to SQL.

I think it's interesting to see what other SQL-like DSL projects will do. For example, Clojure also has Korma, which is rather close to SQL and looks like it can currently only perform a limited subset of all possible queries. I wonder what they'll do when users start clamoring for richer back-end support? Racket used to have SchemeQL, but that project seems to have vanished from the web. The website of its parent project, Schematics, doesn't mention it at all anymore. The same seems to have happened to a Common Lisp interface called CL-RDBMS (at least the "homepage" link currently points to a broken web site).

There's a popular library for Common Lisp called CLSQL. It looks like an enormous amount of engineering went into it. If that's required to get a useful SQL DSL, it might not be worth it unless the advantages outweigh the effort required. Note that even after 10 years of development, CLSQL still has no outer join support. I think that's indicative of how hard it is to properly support SQL from a DSL.

Wrap-up

The lessons I learned from the SSQL experiment are in retrospect rather simple, and seem to echo earlier blog posts:

The language you're targeting should be small and have few core concepts.
The relevant standards should be fully implemented in all back-ends you want to support.
Back-ends shouldn't have any arbitrary extensions that you're expected to support.
Look for an underlying theory; this may be a better abstraction than the target language.
Try to find examples of similar libraries. Did others try, and fail or give up? If so, why? How complex are existing implementations? Are they complete?

This post will be the last post in this series, at least for a while. There aren't that many other interesting DSLs with which I'm familiar, and I've exhausted the list of novel design concepts that I'm able to distill from existing DSLs.

Designing Lispy DSLs, part 3: SRE

2012-08-14T17:52:14Z

Today I'd like to discuss Scheme Regular Expressions (SRE). Originally introduced in a library for the Scheme Shell, this DSL has recently been gaining some popularity due to the release of Irregex, a pure R5RS Scheme regex engine with SRE as its native syntax. Irregex has been integrated as the core regex system in Chicken Scheme and Jazz Scheme, and you can easily use it from any other Scheme due to its portability.

Back in 1998, the author of the first SRE implementation (Olin Shivers, one of the funniest Schemers around) posted an announcement to several newsgroups about this new syntax. It's well worth a read; especially the preamble about 100% and 80% solutions is a very inspiring call to arms which provides a good insight into the Scheme way of thinking. By the way, if you liked this, you'll also want to read the classic essay "Worse Is Better", if you haven't already. The bit about "The Right Way" is especially relevant. Consider yourself warned, Schemer!

Figuring out the rules

The "Discussion and design notes" section from the announcement is particularly interesting as it discusses the DSL from a point of view similar to this series of blog posts. It's also the largest section, so we'll just touch upon the important points here. The first thing that really jumps out is the fact that the author has taken a look at many different regex packages for various languages, and even asked Tom Lord and Henry Spencer (both wrote their own regex engines) about obscure details. Doing this kind of in-depth research is a great way to get started when designing a DSL since it provides you with a nicely broad perspective on the various viewpoints of others who went before you. This will reduce your "blindness to complexity". Initially every target language seems simple but there are always pitfalls which, if overlooked, would result in a DSL that's hard to extend or doesn't provide all the features of the target language which a user would need. By looking at other implementations you see how they deal with the more complex nooks and crannies of the target language.

The other main point is that he drew a very clear line in the sand between the features that would go into SRE and those that wouldn't. The SRE syntax doesn't support any "extended regex" features which would force a particular implementation strategy. This makes the SRE syntax independent of the underlying regex engine, which allows for greater portability and generality, but more importantly, it leaves open the possibility of efficient implementations. This was misunderstood by many people; he had to educate Richard Stallman about why supporting back references in the general SRE syntax is a bad idea:

 My feeling about back references is as follows. Regexps are based on a deep
 theory -- regular sets and DFA's -- that has tremendous implications about
 the operations you can perform on them and the ways you can implement them.
 Back-references completely shatter this framework. They rule out certain
 extremely efficient implementations. They rule out certain operations. They
 have nothing to do with the idea of a "regular expression." They are not one
 with the deep structure of the system.

Repeat that in your mind: They are not one with the deep structure of the system. DSL design notes don't get any more philosophical than that! This points right to the core design principle of SRE (and regexes in general). If you are designing a DSL, you can consider yourself very lucky when you find a guiding principle which is that strong. You should let it inform all your design choices because it will help you achieve a good, cohesive design. This also makes it easier to defend your choices when users start complaining about missing features...

Representational issues

After my SCSS post, I was asked why you'd want to represent CSS using first-class values. I think the SXML DSL example illustrates how powerful a first-class representation can be, but I must admit, I don't see many valid use-cases for "first-classing" a CSS DSL.

However, one important lesson in programming is that you never know what clever things people are going to do with your code; clever things you only wish you had thought of. You should see first-class values as an enabler for other people to take your DSL's usefulness to new heights. The "Prime Clingerism" applies to DSLs as well; without a first-class representation, additional features will appear necessary to perform useful operations.

One interesting aspect of the design of the original SRE library for SCSH is how it deals with first-class regexps. It contains a large set of procedures to manipulate the underlying regular expression ADT (abstract data type). Olin believed a separate ADT was required for easy manipulation of regex objects, and it would also allow extension of the supported operator set. Directly operating on the SRE expressions would be harder for programmatic extensions. This distinction allows for a baroque but user-friendly SRE syntax in which it is possible to write one thing in many different ways, while also offering ease of manipulation from user code.

Olin is quick to point out that this does cause massive complexity in the code (a point also raised by Richard Stallman), but says the work is done now and anyone is free to take his code and re-use it (this fits the "100% solution" ideology mentioned at the beginning). This ADT approach is comparable to how Lisp/Scheme compilers internally rewrite the full language to a simpler to manipulate "core language", so it isn't completely unique to SRE.

At first glance, this seems a little hard to defend, especially the fact that there's also a seemingly unnecessary rx macro in his design, while Irregex gets by fine without these. I've asked Alex Shinn, the author of Irregex, about this, and he mentioned that the macro and the ADT were needed in SCSH because it depended on an underlying POSIX regex engine rather than implementing it natively in Scheme. SCSH first reads SREs at compile-time and the macro tries to compile down to the ADT as much as possible. Then, at run-time, this ADT is converted to a POSIX regex string which is compiled by the underlying C regex engine.

Because Irregex is written natively in Scheme, this extra step is not necessary and Alex decided to get rid of these distractions and implement only the SRE syntax, without the rx macro and ADT. The result is a very compact implementation, having about the same size as the SCSH package, but including a full matching engine! As far as I know there isn't any widely-used extension for SCSH which makes use of the ADT interface, and the SRE syntax hasn't been extended as much as Olin foresaw might happen. For these reasons, the choice to drop all that complexity seems like a wise one. However, only time will tell whether that's really the case.

Wrapping up

Let's see what we have learned from the design of SRE:

Do your research! Inspect as many libraries and DSLs as possible to gain a broad perspective and avoid "blindness to complexity". If in doubt, ask a domain guru.
Relentlessly strip all features that preclude efficient implementation strategies. When users request them back, resist!
Design for extensibility and programmability; strive to support a first-class representation.
When things have settled down, re-evaluate the design and drop unnecessary features.
You don't always need nitty-gritty details from code examples to analyse a DSL :)

I admit, some of these rules are not for the faint of heart, but they make for a very strong and coherent DSL which might see wider adoption than just your initial implementation.

Designing Lispy DSLs, part 2: SXML

2012-08-05T14:07:01Z

After last time's example of SCSS, I'd like to take a look at SXML, another Lispy DSL I'm using in this blog. It's more successful and more widely-used than SCSS and even has an official specification!

The observation that XML is really an obnoxiously verbose Lisp without parens is common, but the details are (of course) hairier than that. Let's look at an XHTML example:

<div>
  <span>Hello, <strong>dear</strong> friends.</span>
  <span>This is a &lt;simple&gt; example.</span>
</div>

Converting this HTML fragment to an S-expression is straightforward:

'(div
   (span "Hello, " (strong "dear") " friends.")
   (span "This is a <simple> example."))

It's a bit more cumbersome to type because you have to break up the strings for the "strong" element, but aside from that it's simpler, shorter, and less error-prone; "special" characters can be written as-is since they are automatically escaped when the XML document is written. Especially when dealing with large templates and generated content it can be a big time-saver to represent XML as S-expressions; doubly so if you're using paredit. Plus, Scheme is your templating language, and Lisps are rather good at processing lists :)

You might be wondering about XML attributes; S-expressions don't have anything that maps naturally to these. Some XML-in-Lisp variants use keywords for attributes, others use alternating symbols and strings to indicate attributes. SXML takes the more interesting approach that attributes in XML were a mistake; there should only be elements. To compensate, SXML uses a tag name that can't exist in XML (the "@"-sign) and it has the convention that this element can appear as the first child of any element. Child element names represent attribute names, their text contents represent values. The "@"-sign is particularly well-chosen because W3C also uses it elsewhere to indicate attributes (e.g. in XPath and XSLT).

<div id="welcome" class="section">
  <span>Hello, <strong class="affectionately">dear</strong> friends.</span>
  <span>This is a &lt;simple&gt; example.</span>
</div>

becomes:

'(div (@ (id "welcome") (class "section"))
  (span "Hello, " (strong (@ (class "affectionately")) "dear") " friends.")
  (span "This is a <simple> example."))

Let's look at what makes SXML such a good DSL. First, XML has a hierarchical structure, which maps well to S-expressions. It is built up out of only a handful of atoms: it has start tags with attributes, end tags, entities, and textual content in between. In SXML, tag names are mapped to symbols, which can represent any string, so this naturally extends to all possible XML tag names.

When building websites, the fact that regular HTML is less strict than XML is irrelevant; you don't need features like, say, omitting an end tag. In fact, end tags don't even exist in SXML; it models the underlying concept of elements rather than tags; it simply treats tags as artifacts of the serialized textual representation of an element. S-expressions can be seen as an alternative serialized textual representation of the same document described by the "angular brackets and tags" notation.

This is another important aspect of good DSLs; they tend to ignore surface syntax. Instead, they map the underlying tree-like structure to S-expressions. SXML uses elements instead of letting itself get distracted by tags, and it generalizes attributes to fit the tree structure. By representing the structure in S-expressions, you know what parts need to be "escaped" in order to preserve this structure. When writing SXML to XML, all string elements in an SXML document get their angular brackets < and > converted to &lt; and &gt;. The only angular brackets ending up in the output are those that result from serialization of elements to start/end tags. When reading XML, all entities are automatically converted to the characters they represent, so in Scheme you get to work directly with the text contents at the conceptual level. CDATA sections are also eliminated; they are simply represented by their string value.
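
The escaping rule is easy to see in a toy serializer. The following is only a sketch under simplifying assumptions, not the code of any SSAX tool: it handles just elements, the @ attribute list and strings, and the names escape, attribute-list?, attributes->string and sxml->string are made up for this example.

```scheme
;; Escape the characters that would otherwise be mistaken for markup.
(define (escape str)
  (apply string-append
         (map (lambda (c)
                (case c
                  ((#\<) "&lt;")
                  ((#\>) "&gt;")
                  ((#\&) "&amp;")
                  ((#\") "&quot;")
                  (else (string c))))
              (string->list str))))

;; Is X an attribute list, i.e. a list starting with the @ symbol?
(define (attribute-list? x)
  (and (pair? x) (eq? (car x) '@)))

;; Render ((name "value") ...) as  name="value" ...
(define (attributes->string attrs)
  (apply string-append
         (map (lambda (attr)
                (string-append " " (symbol->string (car attr))
                               "=\"" (escape (cadr attr)) "\""))
              attrs)))

;; Serialize an SXML node: strings are escaped, elements become tags.
(define (sxml->string node)
  (if (string? node)
      (escape node)
      (let* ((tag (symbol->string (car node)))
             (rest (cdr node))
             (has-attrs? (and (pair? rest) (attribute-list? (car rest))))
             (attrs (if has-attrs? (cdar rest) '()))
             (children (if has-attrs? (cdr rest) rest)))
        (string-append "<" tag (attributes->string attrs) ">"
                       (apply string-append (map sxml->string children))
                       "</" tag ">"))))
```

Because tags are only ever produced from the list structure and strings always pass through escape, text content has no way to introduce markup into the output.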

Some complications

XML isn't as simple as you'd think at first glance. Remember the same observation about CSS? This is a common theme with web technology. Don't even get me started about HTTP! In the words of Oleg Kiselyov, author of the SXML specification and many tools in the SSAX project:

There exists a myth that parsing of XML is easy. An article "Parsing XML"
in the January 2000 issue of Dr.Dobb's Journal states the ease of parsing
as an alleged fact. The author of that article must have overlooked that
there is more to XML than the grammar presented in the XML Recommendation.
There are attribute normalization rules, well-formedness constraints, let
alone validation constraints. XML Namespaces add another layer of complexity.

You can almost hear his frustration... Here's an example to illustrate some things that so far we have glossed over. This isn't a fragment, but a full XML document (with thanks to Jim Ursetto):

<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p>I'm invincible!</p>
      </div>
    </content>
  </entry>
</feed>

There are two new things to notice: The document starts with a "special" syntax to indicate that we're using XML version 1.0. These so-called processing instructions provide a generic way of passing (meta-)information to the application, outside the XML document itself. The second new thing is that the Atom feed in this example holds an XHTML document fragment as a sub-document. It uses a namespace declaration to indicate that the div and p tags are taken from a different XML schema than the main document.

Let's see how SXML deals with these new concepts. This document can be represented in several ways, but the following is arguably the simplest:

(*TOP* (@ (*NAMESPACES* (atom "http://www.w3.org/2005/Atom")
                        (xhtml "http://www.w3.org/1999/xhtml")))
  (*PI* xml "version=\"1.0\"")
  (atom:feed
    (atom:entry
      (atom:content (@ (type "xhtml"))
        (xhtml:div (xhtml:p "I'm invincible!"))))))

If you look carefully at the original XML document, you'll see that while feed is the document's root node, it still has a sibling: the processing instruction! There's a "virtual" root element that holds these two, which XML calls the "document entity". SXML generalizes this to an element called *TOP* (for "top-level"). The *NAMESPACES* element (an "attribute" of *TOP*) stores an association list of element name prefixes used to indicate a namespace. The *PI* element is SXML's way of representing the processing instruction (its version "attribute" isn't parsed because it isn't a true attribute in terms of XML; it just looks like one. Don't ask...).

All three pseudo-elements are represented by symbols that are invalid XML tag names due to the asterisk. Just like the @ for attributes, this ensures that they can't possibly clash with any tag name that might occur in a particular document type or future versions of XML.

One disadvantage of encoding namespaces as part of the tag name is that you can't see what namespace a particular element belongs to without first converting the symbol to a string, and then splitting it at the colon. This means namespaces aren't really first class. Most likely this is the case because namespaces were added at a later stage when most of the SXML syntax was already set in stone, and modifying it in some other way to support namespaces would be too invasive.
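
The extra work this implies looks roughly like the sketch below. The helper name split-qualified-name is hypothetical. It splits at the last colon, on the assumption that a namespace part may itself contain colons (for example when a full URI is used instead of a short prefix).

```scheme
;; Split a symbol like xhtml:div into ("xhtml" . "div").
;; Names without a colon yield (#f . "name").
(define (split-qualified-name sym)
  (let ((str (symbol->string sym)))
    (let loop ((i (- (string-length str) 1)))
      (cond ((< i 0) (cons #f str))
            ((char=? (string-ref str i) #\:)
             (cons (substring str 0 i)
                   (substring str (+ i 1) (string-length str))))
            (else (loop (- i 1)))))))
```

Every tool that cares about namespaces has to perform this string manipulation on each element name, which is what "not really first class" means in practice.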

Tool set

You only realize the absolute brilliance of SXML when you look at its tool set. The XML ecosystem is an entire zoo of mini-languages. Most of these languages (for example XPath, XLink, XSLT) have some kind of corresponding DSL in the SSAX project. This makes it a complete toolbox for anyone working with XML. Sure, the documentation is incoherent and a little on the "academic" side, and the SSAX SourceForge project is a random collection of loosely-related tools that aren't exactly idiomatic Scheme (if there even is such a thing), but go ahead and compare it with tools in other languages.

Most "stock" XML libraries are awkward contraptions. Usually they expose highly verbose object-oriented APIs based on the W3C DOM specification, where constructing even a small tree takes several lines of code. It's so awkward that many programmers will tend to prefer generating the XML manually, by writing out strings.

Dynamic languages tend to do better. First, in Perl, there's XML::Simple. This is a little awkward due to Perl's hash and array syntax, but other than that it is a lot like SXML. However, this library is deprecated in favor of one of those awkward OO libraries, XML::LibXML.

Ruby and Python have convenient "builder" objects which can really speed up generation of XML, but as the name says, these are for building. The format in which you build isn't directly the first-class representation, which makes the API slightly disparate. For both languages, these are not the default libraries either, which makes them less likely to be used by people who want to minimize dependencies.

Finally, even though it's a very static language, Haskell has a pretty good builder-like library too, which seems quite popular. If you ever need to generate XML (or "just" HTML) with one of these languages, do yourself a favor and use one of these libraries.

But back to SXML; let's see how you'd read in a document, manipulate it, and write it back out again. For a change, the code presented is a complete (Chicken) Scheme program. This will read one of the first XML documents in this post, change it, and send it to standard output.

;; First, run the following from the shell to ensure this program will work:
;; $ chicken-install ssax sxml-modifications sxml-serializer
(use ssax sxml-modifications sxml-serializer)

(define doc (ssax:xml->sxml (current-input-port) '()))

(define change
  (sxml-modify
    `("div/@id" replace (id "good-bye"))
    `("../span[1]/text()[1]" replace "Goodbye, ")
    `("../following::*" replace (span "This was more " (em "complex")))
    `("self::*" insert-into ", don't you think?")))

(serialize-sxml (change doc) output: (current-output-port))

The arguments to sxml-modify comprise a mini-DSL, representing actions to take on the XML document. Each action is a list; first an XPath expression which selects node(s) from the document, then the name of the action to perform, followed by the value to use for this action. Each action is executed in sequence, so the XPath expression is relative to the previous action's node set, a little like how chaining works in the excellent jQuery library. Actually, I think there are still some lessons in convenience to learn from jQuery, but that's a different story.

Let's invoke the program and see what happens:

$ cat welcome.xml
<div id="welcome" class="section">
 <span>Hello, <strong class="affectionately">dear</strong>friends.</span>
 <span>This is a &lt;simple&gt; example.</span>
</div>
$ csi -s convert.scm < welcome.xml
<div id="good-bye" class="section">
 <span>Goodbye, <strong class="affectionately">dear</strong>friends.</span>
 <span>This was more <em>complex</em>, don't you think?</span>
</div>

Not too shabby for a 10-line program!

Unfortunately, there's a catch. Originally, I planned to use the Atom feed from the previous section as input, but it turns out that the modifications sub-language doesn't support passing a namespace map to the underlying XPath library. Also, I was unable to use an sxpath expression instead of a standard string XPath expression. This could be bad documentation (the docs for SXML modifications are pretty sparse), or perhaps it's a lack of support for namespaces. A quick look at the source seems to confirm the latter. The lack of support for sxpath expressions is also serious and indicates how "random" the selection of tools in SSAX really is; some of these tools don't even support each other! Luckily, it looks like both limitations aren't fundamental, and could be addressed by a (small?) change in the tools.

I mentioned earlier that the SSAX project is a loose collection of tools with incoherent documentation. My failure in figuring out how to combine namespaces with SXML modifications or use the sxpath DSL from the "SXML modifications" DSL helps point out the importance of a good, robust, and well-documented tool set. This might possibly be more important than a good DSL; if nobody can use your DSL, it might just as well not exist.

Wrapping up

The following rules can be distilled from the SXML design:

Do not slavishly translate surface syntax to S-expressions, but model the structure.
Eliminate or generalize all features that are strictly unnecessary.
When generalizations demand new names, pick ones that are invalid in the source language, but try to borrow familiar conventions from the domain.
When generating output, ensure structural integrity by escaping all content.
People will avoid clumsy DSLs, to the point of falling back on string manipulation.
No matter how well-designed your DSL is, it needs good tools and documentation.
DSLs within the same domain should be mutually supportive.

There are still many aspects of XML we barely touched upon. However, this post is already long enough, and my knowledge of XML (and SXML) only goes so far, so we won't go into more detail. Of course, you can always dig in and find out more yourself; there are plenty of links in this document you can use to study the subjects.

Designing Lispy DSLs, part 1: SCSS

2012-07-28T19:39:04Z

Setting up this blog was a good excuse to try out SCSS, which I'd been meaning to do for quite a long time. Working with SCSS and exploring its limitations got me thinking about what makes a good Lispy DSL (domain specific language). This post is the first of a series. Today we'll look at SCSS; in future installments we'll explore other examples of Lispy DSLs.

The idea behind SCSS isn't unique; by generating CSS from a more powerful language you get to use the abstraction systems provided by that language. Abstractions are sorely needed when writing advanced CSS; for example, you often need to use one color in many different situations. In plain CSS, you need to repeat this color value for every usage and, if a few instances need to change, you must go find and replace them. You can imagine it's easy to forget one, or to replace too many! Where HTML makes it easy to write semantically meaningful content (by assigning IDs and classes, for example), CSS doesn't have any way to indicate how style elements logically relate.

As an interesting side note, one of the creators of CSS, Bert Bos, thinks that "real-language" features are unnecessary in CSS. He goes as far as saying constants shouldn't even be added to CSS. His main argument basically boils down to "other people are stupid, so you don't get to use advanced features, either". Luckily, many people disagree and have written their own server-side preprocessing languages.

Some of these projects (like Less and Sass) take the approach of adding their own syntax extensions to "plain" CSS, while others (like an older syntax of Sass) design their own custom language that's inspired by the concepts in CSS but quite different in syntax. All these projects are purely about generating CSS from another language. But we are smug Scheme weenies, and to us code and data are one and the same. A typical Schemer would prefer not just to generate CSS from SCSS, but to represent CSS in a first-class value, so that it can be manipulated at will. And that's exactly what SCSS offers... at first glance.

The devil is in the many, messy details

When you first look at CSS, it seems like a simple enough language. Indeed, the core syntax is rather simple. Each rule set has selectors separated by commas followed by declarations between curly braces, separated by semicolons:

#my-id, p.my-class, div {
  background-color: green;
  width: 10em;
  margin-left: 5px;
  border: 1px solid rgb(0, 128, 0);
}

There are three selectors here: the first one selects any element with the id attribute "my-id", the second one selects every p element (paragraph) which has "my-class" listed in its class attribute. The third one simply selects all div elements. The declarations are simple property/value pairs which determine how the selected elements will be displayed.

In Scheme, we can easily represent this as lists of items, where each item is a list of selectors and values, and that's exactly what SCSS does:

`(css+
   (((= id "my-id") (p (= class my-class)) div)
    (background-color "#008000") ; Should we use string values
    (color green)                ; for classes and colors, or symbols?
    (width "10em")
    (margin-left "5px")
    (border "1px solid rgb(0, 128, 0)")))

One neat feature that's added by most of these CSS preprocessors is that you can nest items. This places the full expression of their parent before the sub-item, which means that item will only match the selector within its parent:

`(css+
   (div
    (border "1px solid rgb(0, 128, 0)")
    (((// (= class "some-child")) (// (= id "some-other-child")))
     (color orange))))

This compiles to the following CSS:

div {
  border: 1px solid rgb(0, 128, 0);
}

div .some-child,
div #some-other-child {
  color: orange
}

When looking at the examples we should start to get a funny feeling. Aside from the fact that the selector syntax is rather heavy on parens which makes it hard to read even for a Schemer, there are a few problems. The first problem is the fact that we are representing the property values as flat strings (or symbols). This means you can't easily, say, find all the elements that have a particular color somewhere in their values without very heavy additional parsing (in CSS, green, rgb(0,128,0) and #008000 all mean exactly the same thing). You also can't easily compose declarations with variables without doing string manipulations, which mostly defeats the point of using a first-class representation:

(let ((company-color "#008000")
      (page-width 1000)
      (logo-size 20))
  `(css+
     ((= class "menu")
      ;; Values have to be glued together as strings by hand.
      (border-left ,(sprintf "1px solid ~A" company-color))
      (width ,(sprintf "~Apx" (- page-width logo-size))))
     ((= class "whatever")
      (background ,(sprintf "url(\"img/back.png\") no-repeat 10px 20px ~A"
                            company-color)))))
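
For contrast, here is a sketch of what structured property values might look like. This is purely hypothetical and not something SCSS supports: the forms (px n), (rgb r g b) and space-separated lists, as well as the names join and css-value->string, are assumptions made for this example.

```scheme
;; Hypothetical structured values; SCSS itself only takes strings and symbols.
;;   (px 980)                        => "980px"
;;   (rgb 0 128 0)                   => "rgb(0, 128, 0)"
;;   ((px 1) solid (rgb 0 128 0))    => parts rendered, separated by spaces

;; Concatenate STRINGS with SEPARATOR in between.
(define (join strings separator)
  (if (null? strings)
      ""
      (apply string-append
             (car strings)
             (map (lambda (s) (string-append separator s))
                  (cdr strings)))))

(define (css-value->string v)
  (cond ((string? v) v)
        ((symbol? v) (symbol->string v))
        ((number? v) (number->string v))
        ;; A length: a unit symbol followed by a number.
        ((and (pair? v) (memq (car v) '(px em pt)))
         (string-append (number->string (cadr v))
                        (symbol->string (car v))))
        ;; A color given as red, green and blue components.
        ((and (pair? v) (eq? (car v) 'rgb))
         (string-append "rgb("
                        (join (map number->string (cdr v)) ", ")
                        ")"))
        ;; Any other list is a sequence of space-separated values.
        (else (join (map css-value->string v) " "))))

(define company-color '(rgb 0 128 0))

(define border-value
  (css-value->string `((px 1) solid ,company-color)))

(define width-value
  (css-value->string `(px ,(- 1000 20))))
```

With values like these, composing a declaration is ordinary list manipulation, and code that searches for a particular color can compare data instead of parsing strings.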

The second problem is that strings, being directly injected into the CSS, don't get "escaped". This means you can't take any user input (let's say a font name, or a color value) and use this in a declaration value; this can destroy your entire layout if it contains a semicolon or curly brace - at best an annoying bug, at worst, a security issue. You might just put everything in one string for all the difference it makes:

`(css+
   (.my-class
    (color "#222; list-style-type: circle; margin-left: 5px")))

The third "problem" points us in the right direction. The border property is actually a shorthand property. The border declaration from the first example breaks down into the following full declarations:

html {
  border-top: 1px solid rgb(0, 128, 0);
  border-right: 1px solid rgb(0, 128, 0);
  border-bottom: 1px solid rgb(0, 128, 0);
  border-left: 1px solid rgb(0, 128, 0);
}

Unfortunately, this decomposition is impossible to do in SCSS without parsing the property's string values. Besides, even if we were to do that, these properties themselves are shorthands, too! For example, the border-top declaration itself breaks down into these declarations:

html {
  border-top-width: 1px;
  border-top-style: solid;
  border-top-color: rgb(0, 128, 0);
}

This is similar to how in Scheme macros can rewrite convenient notation to a simpler core language. The better approach would be to compile down to the core CSS forms rather than trying to use these complex properties directly.
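Once the three components of the shorthand are available as separate values, the expansion itself is mechanical. Here is a rough sketch of the idea; expand-border is a made-up name, and the (property value) list representation is an assumption for illustration only:

```scheme
;; Hypothetical expander: takes the three components of a `border`
;; shorthand as separate values and returns the twelve longhand
;; declarations as (property value) lists.
(define (expand-border width style color)
  (define (longhand side aspect)
    (string->symbol (string-append "border-" side "-" aspect)))
  (let loop ((sides '("top" "right" "bottom" "left"))
             (result '()))
    (if (null? sides)
        (reverse result)
        (let ((side (car sides)))
          (loop (cdr sides)
                ;; Pushed in reverse, so the final order per side is
                ;; width, style, color.
                (cons (list (longhand side "color") color)
                      (cons (list (longhand side "style") style)
                            (cons (list (longhand side "width") width)
                                  result))))))))
```

Calling (expand-border '(px 1) 'solid '(rgb 0 128 0)) yields a list that starts with (border-top-width (px 1)) and ends with (border-left-color (rgb 0 128 0)). Note that this only works because the values arrive as separate objects rather than as one flat string.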

To get this far, we'd have to decompose everything to its simplest form and assemble more complex properties in terms of simpler ones. In CSS, each property basically has its own free-form "value" syntax which can get quite complex. Some examples:

html {
  /**
   * Images can be full URIs (dragging in another pretty large RFC), which can
   * *optionally* be quoted (why all this unnecessary optional stuff?)
   */
  background-image: url("path/to/image.png");

  /* You can use named "counters" (what, there are no variables in CSS?!) */
  content: "Chapter " counter(my-chapter-counter) ". ";
  counter-increment: my-chapter-counter;      /* Add 1 to chapter */

  /**
   * Lists of font names (strings), separated by commas and possibly quoted.
   * Also, a restricted set of specially-defined "generic font families"
   * like serif, fantasy (WTF) and monospace, and even specially-defined
   * "system fonts" like status-bar, small-caption, icon, and menu.
   */
  font-family: Helvetica, "Comic Sans MS", fantasy, small-caption;

  /**
   * Different size types: em, ex, px, pt, in, cm, mm, percentage, unit-less.
   * Margins and paddings take 1, 2, 3 or 4 values which expand into -top,
   * -right, -bottom and -left.
   */
  margin: 1px 2em 30% 0;
}

Seriously, who comes up with this stuff? I'm not saying any of these things are useless, but from a language design standpoint, this seems rather excessive. CSS 3 is even more extreme; there, "image" value-types get so complex that they need their own separate document to specify. The background shorthand property grew in complexity as well. Two examples from these drafts (quick, what visual effect do these have? No cheating):

html {
  list-style-image:
      radial-gradient(circle, #006, #00a 90%, #0000af 100%, white 100%);
  background: url("chess.png") 40% / 10em gray round fixed border-box;
}

Finally, the CSS3 animations draft spec adds a completely new syntax element for key frames. Apart from at-rules like @media, this is the only place in plain CSS where curly brace sections are nested inside other curly brace sections.

This highly variable and ever-changing aspect of the syntax means that it's quite an open-ended language. This makes it quite hard to cover all future extensions. The one point that gives me hope is the fact that all this complexity is built up out of a set of core "atoms" like length units, URIs and colors. These atoms do not seem to change too much.

This observation shows us an opportunity for a better CSS DSL; we could try to map these atoms to suitable Scheme values, possibly ignoring the details of how complex values are composed out of these atoms. This is basically what the W3C did with their CSS DOM API. Taking a good look at this DOM API might help to get some inspiration, even if the API itself is unwieldy and un-Lispy (it's very OOP-ish).
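To make "mapping atoms to Scheme values" a bit more concrete, here is a small sketch of a renderer for a handful of atoms. The s-expression shapes, such as (px 1) and (rgb 0 128 0), and the name render-atom are assumptions chosen for illustration, not an existing API:

```scheme
;; Hypothetical renderer for a few CSS "atoms" written as s-expressions.
(define (render-atom atom)
  (cond
   ((symbol? atom) (symbol->string atom))          ; keywords: solid, green
   ((number? atom) (number->string atom))          ; unit-less numbers
   ((and (pair? atom) (eq? (car atom) 'rgb))       ; (rgb 0 128 0)
    (string-append "rgb("
                   (number->string (cadr atom)) ", "
                   (number->string (caddr atom)) ", "
                   (number->string (cadddr atom)) ")"))
   ((and (pair? atom) (memq (car atom) '(px em ex pt in cm mm)))
    (string-append (number->string (cadr atom))    ; (px 1) becomes "1px"
                   (symbol->string (car atom))))
   (else (error "unknown CSS atom" atom))))
```

Because a length is just a number with a tag, something like (render-atom (list 'px (- 1000 20))) produces "980px" without any string formatting on the caller's side.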

In a language without a small set of well-defined atoms, you will need special parsers and generators for each separate type. This is very confusing to people. I know, because this is exactly the approach I took for representing HTTP headers in intarweb. I don't consider intarweb to be a true DSL since it doesn't really have "native" syntax for its header values. Everything passes through construction procedures which do accept "native" values. However, it does illustrate the point: I've had several requests to explain how to do (what I thought were) common, simple things, and requests to "just give me a way to write out the raw header". That's a DSL failure; DSLs ought to be straightforward and easy to understand, yet powerful.

I like to think that intarweb isn't a complete failure, because once everything is parsed, it's often rather nice not to have to deal with parsing anymore. Things like cookies or authentication attributes are notoriously hard to parse correctly, and if everyone up the entire server-side HTTP stack needs to roll their own parser, that's a lot of wasted effort, and a lot of inconsistent implementations with their own bugs. Manipulating these values is also a breeze and never involves string manipulation.

What might a better SCSS look like?

From our new understanding of the nature of CSS, let's try improving SCSS iteratively. For starters, we would like to use parenthetical notation for everything. Plain strings should be disallowed except where they are appropriate, and where they are allowed they should always be quoted and escaped. Making this simple change gives us the following:

`(scss+
   (((= class "foo") (= class "bar"))
    (border-left-color (rgb 0 128 0))
    (border-left-width (em 1))
    ;; unsure whether we should allow this shorthand..
    (border-right (px 1) solid ,orange)
    (width (px ,(- page-width sidebar-width)))
    ((// p)
     (color green)
     (font-family #("Helvetica" "Comic Sans MS" sans-serif)))))

I've used vectors to describe sequences of things, whereas composite declarations like border-right are simply expressions with more than two subexpressions. Built-ins like sans-serif and green are symbols. As you can see, because there are no strings, lengths can be calculated without having to perform string manipulation. Another valid approach would be having a special "color" object type with associated procedures that operate on it. If we wanted to do this, SCSS could export variables with color definitions so that green is simply an alias for (rgb 0 128 0), and you could perform "color-algebraic" operations:

`(scss+
   (((= class "foo") (= class "bar"))
    (border-left-color ,(rgb 0 128 0))   ; "rgb" is a constructor procedure now
    (border-left-width ,(em 1))          ; So are "em"...
    (border-right ,(px 1) solid ,orange) ; .. and "px"
    ((// p)
     (color ,green)
     ;; A green background which is darker by 50%
     (background-color ,(darken green .5))
     (font-family #("Helvetica" "Comic Sans MS" sans-serif)))))

I can't think of any useful operations on font types, so I've kept sans-serif a symbol here. How far you want to go depends on your goals, and involves striking a balance between ease of use, safety, and power. For instance, you could define a separate type for everything, including fonts, but that would make it harder to use. It would also make it harder to introduce mistakes, especially if the CSS generator validates while rendering. However, strict validation also means allowing extensions (like those from CSS3) becomes harder!
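The color-object idea could be sketched roughly as follows. This assumes CHICKEN 5 for the import line, and the definition of "darker" (scaling each RGB channel linearly towards black) is my own simplification; color->css is a hypothetical name:

```scheme
(import scheme (chicken base))   ; assumes CHICKEN 5; provides define-record-type

;; A first-class color value; `rgb` is the constructor procedure.
(define-record-type color
  (rgb red green blue)
  color?
  (red color-red)
  (green color-green)
  (blue color-blue))

(define green (rgb 0 128 0))

;; Scale every channel towards black; amount lies between 0 and 1.
(define (darken c amount)
  (define (scale channel)
    (inexact->exact (round (* channel (- 1 amount)))))
  (rgb (scale (color-red c))
       (scale (color-green c))
       (scale (color-blue c))))

;; Hypothetical renderer that turns a color object into CSS text.
(define (color->css c)
  (string-append "rgb("
                 (number->string (color-red c)) ", "
                 (number->string (color-green c)) ", "
                 (number->string (color-blue c)) ")"))
```

Under these assumptions, (color->css (darken green .5)) gives "rgb(0, 64, 0)". A real implementation would probably do its arithmetic in a perceptual color space instead.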

The selector syntax could use some love too, but I'm less critical of that. The basic idea is fine; it can extend to include arbitrary selectors. It currently supports the + sibling and > child selectors as well as the class and id comparisons. Because these operators are in the operator position of a list, adding new ones is as simple as adding a new procedure in Scheme. A pseudo-selector like p:first-child for example could simply be translated to (: p first-child) without breaking anything else.

Right now selectors are simply grouped by adding an extra set of parens around them to put them in a list. Using a visual cue like and or or to indicate grouping might help for readability, as would getting rid of the // selector for hierarchical nesting. As long as we make sure all selectors are unused property names there's no ambiguity in simply nesting a new rule inside another one:

`(scss+
   ((= class "foo")
    (color ,orange))
   (div
    (margin-left ,(px 1))
    ((or (= class (or "foo" "bar"))
         (= id "qux"))
     (border-left-color ,(rgb 0 128 0))
     (font-family #("Helvetica" "Comic Sans MS" sans-serif)))))

Instead of repeating the class selection, we just put the (or ...) around the class, which is a nice abbreviation, but overall I'm not too happy about this version, so let's back up a step.

We can't guarantee that the selector symbols will remain unused as property names, because we don't know what property names the CSS spec might add in the future. We should strive to avoid potential clashes with future extensions. Also, dropping the // makes it harder to traverse an SCSS tree and perform manipulations since the traversal code would need a full list of all known selectors. So after all, it looks like it's better to keep the //. But we can drop some unnecessary parens by taking the previous example and just putting the // before the selector. Since it's been modified to be one s-expression, we can do that. We can also allow the = selector to accept any attribute (not just classes). While we're at it, this selector should also accept multiple values to avoid repetition:

`(scss+
   ((or (~= p class "foo")    ; Change to (has-word? p class "foo") ?
        (+ div (= p class "bar" my-attr "qux")))
    (border-left-color ,(rgb 0 128 0))
    (font-family #("Helvetica" "Comic Sans MS" sans-serif))

    (// (= * class (or "foo" "bar"))
        (color ,orange)))

   (div
      (display block)
      (// span
          (text-align left))))

The example above also shows the extensibility of operators by adding the ~= selector (a very unschemely name...). Let's see the CSS this would compile to:

p[class~="foo"],
div + p.bar[my-attr="qux"] {
  border-left-color: rgb(0, 128, 0);
  font-family: "Helvetica", "Comic Sans MS", sans-serif;
}

p[class~="foo"] *.foo,
p[class~="foo"] *.bar,
div + p.bar[my-attr="qux"] *.foo,
div + p.bar[my-attr="qux"] *.bar {
  color: orange;
}

div {
  display: block;
}

div span {
  text-align: left;
}

That's not too bad! There's a lot of redundancy in the resulting CSS that we abstracted away via the combination of shortened or-alternatives and hierarchical nesting. The original SCSS also had this hierarchical nesting, by the way, so this type of redundancy is already avoided even by using a slightly flawed DSL.
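The expansion of or-alternatives under nesting is essentially a cartesian product of selector lists. A minimal sketch, working on already-rendered selector strings and using the made-up names nest-selectors and join-selectors:

```scheme
;; Combine every parent selector with every child selector using the
;; descendant combinator (a space); parents vary slowest.
(define (nest-selectors parents children)
  (apply append
         (map (lambda (parent)
                (map (lambda (child) (string-append parent " " child))
                     children))
              parents)))

;; Join the alternatives into one CSS selector group, one per line.
(define (join-selectors selectors)
  (if (null? selectors)
      ""
      (let loop ((rest (cdr selectors))
                 (out (car selectors)))
        (if (null? rest)
            out
            (loop (cdr rest)
                  (string-append out ",\n" (car rest)))))))
```

Feeding it the two outer alternatives from the example and the two inner ones, "*.foo" and "*.bar", gives the same four selectors, in the same order, as the second rule of the compiled CSS above.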

In CSS, the #foo and .bar syntaxes are shorthands for selecting on IDs and classes, because these are so common. There is no technical need to support these shortcuts, so if this makes your design less clean, you can always drop them and opt to use the generic selectors everywhere. For IE6 and other crippled browsers, the renderer could detect class selection and rewrite it to the short syntax. You could always consider extending the Scheme reader to get the same brevity at a higher level, while keeping SCSS itself simple (not that I would recommend doing that, but the option exists).

Lessons learned

I will try to wrap up each blog post in this series by listing the general design rules that we can extract from the DSL under discussion. To wrap an existing language like CSS into a DSL, the following approach seems useful:

First, identify the atomic building blocks. If there are many, this may spell trouble.
Decide which building blocks are essential to be represented "first class" in a structured way, and which can be unstructured strings or symbols (Lisp's atoms).
Determine the combination rules of these atoms and how to translate this to s-expressions.
Think about whether you want to rely on the wrapped language and expose its shorthands and abstractions directly, or if you want to rely on Scheme's abstraction facilities.
If possible, look in what direction the language evolved, and how it has been extended in the past. Your design must be able to accommodate changes in these directions.
Finally, use parentheses and "noise symbols" sparingly, but effectively! Try striking a balance between notation and manipulation convenience.

I realize that some of the things I've said in this post might be contradictory. I might be too vague and hand-wavy in some places. Hell, many things are probably bloody obvious to some of you. But the main point is that it's important to remember that design is hard, and will always involve trade-offs.

I hope that you understand that when designing a DSL you'd better think about what use cases you want it to support before considering how to answer a particular design question. It's very easy to get carried away and overdesign a DSL, but another pitfall is to have too little design (like SCSS, in my opinion). Next time we'll look at a design that's pretty close to ideal, and show that even with that, there are some problems.

A new beginning

2012-07-22T16:14:35Z

Welcome to my renewed website. I've decided to stop procrastinating: instead of rolling my own blog software, I'm simply taking Hyde and actually starting to blog!

The quest for better blogging software

I've started writing blog software from scratch about three times by now. The first time I attempted making what one could call a "clone" of Wordpress or other similar blogging "platforms". This means that the software would be based on other people's preconceptions of what blogging software is supposed to be like:

Dynamically generated from a database.
A web-based backend interface for writing posts, with a session-based login system.
A full-fledged rights system.
HTML as markup language, maybe with a WYSIWYG editor since HTML is so painful to write manually.

I think I started with this project because I thought that having a fully featured web publishing platform written in Chicken Scheme would ensure all the prerequisites for building "large-scale" web applications would be available, thereby possibly making it more useful for my day job. There are still quite a few components that are missing, like a proper "safe HTML" filter (useful for when you have guest bloggers who shouldn't be able to post using all available HTML tags), multipart/form-data support for file uploads, a good e-mailing library, etc.

Of course this project was doomed to fail because this type of software is nowhere near my ideal workflow; I prefer tapping away at the keyboard using my current favorite text editor, Emacs. The text-input interface provided by most web browsers is pretty horrible and if you close the window or refresh the page by mistake, you lose everything you just typed in, too! Like most programming geeks, I also generally prefer storing stuff in a version control system so I can track the edit history of my writings. Of course if I was going to make a serious blogging platform, I'd have to replicate most features from a version control system too...

Return to sanity

Sick and tired of writing tedious HTML nonsense I was never going to enjoy using in the first place, I decided to get started on my second attempt, more attuned to my own preferences with severely scaled-down scope:

Dynamically assembled content from simple text files with minimal markup (using simplified svnwiki syntax).
Content stored in a version control tool like mercurial.
No comments, since on technical blogs people often just bicker and argue in comments anyway, and you also end up fighting lots of spam - that's not worth my time.

I almost managed to get this "finished" (I was at the eternal 90% done stage of development), but then I decided it's silly to write something like that when there's a perfectly useful alternative. The not invented here approach to software ends here, now.

One of the advantages of a static website generator rather than using something dynamic is that there can be no security problems related to the blog software. Hyde generates flat HTML files which are served up by the webserver. There is no dynamic content at all. This is also really fast, since all the parsing of blog posts and the conversion to HTML is done at "compilation time". And instead of pouring infinite amounts of time into writing code I'll never use, I can focus on the content (and spend way too many hours tweaking the styling...)

The current site sticks to minimalism everywhere; even the archive page is very straightforward, listing only the titles and dates of all posts. If it turns out I write so many posts that a fancier paging system is necessary, I'll write it at that moment, but not a moment sooner!
