
A
Welcome to CoRecursive, where we bring you discussions with thought leaders in the world of software development. I am Adam, your host. Scala Native takes the Scala language, which traditionally runs on the JVM, and brings it to bare metal. It is an optimizing ahead-of-time compiler as well as a lightweight managed runtime designed specifically for the Scala language. Denys Shabalin is a research assistant at EPFL and the primary creator of Scala Native. In this episode I talked to him about the motivations behind the project, how it was implemented, and future directions. One exciting thing that he mentions in this episode is an effort to bring the Scala compiler to Scala Native and how doing so sped things up. Scala is a language built on the JVM. Could you give a brief overview of Scala the language before we get into Scala Native?
B
So Scala is this pretty cool language, originally designed for the JVM, that can be described as a mix of functional and object-oriented programming. It really doesn't bias toward one style or another. It really tries to blend both together, because there is both good and bad on both ends. For example, functional programming is basically considered the better side of Scala, and object-oriented is considered more of a Java-style, old-school, less popular side, but still the language doesn't bias towards one side or the other. You can perfectly well do classical object-oriented programming and fancy functional programming at the same time, which is pretty unique because most languages are heavily on one side or the other. That is often considered a negative side of Scala, because it's very unopinionated. But anyway, that's what it is.
A
Yeah, makes sense. So Scala lets you combine a Java-style OO with an ML- or Haskell-style functional composition.
B
Absolutely, yeah, that's it.
A
What is Scala Native?
B
So Scala traditionally has been a JVM-centric language, so it used to compile only to JVM bytecode as the only target. And what that means is that it's really plug and play on the JVM. You just compile your Scala sources to bytecode and then you can run it alongside your Java application. That was the original backend, the original platform for Scala. But since then we got way more things. I think the first major experiment to do Scala outside the JVM was the .NET backend. That didn't work so well because of the differences between how the JVM and the Common Language Runtime handle generics, so it was a bit difficult. I think the first successful alternative platform for Scala is Scala.js by Sébastien Doeraene. It's really a major difference in terms of how you run your Scala apps, because you compile them to JavaScript through this very elaborate, advanced toolchain they have. And Scala Native is very much a similar project to Scala.js, but instead of compiling to JavaScript it compiles to native code. And when I say native code, I mean something like C or C++ standalone binaries that don't require any virtual machine to run. So you get x86 or ARM binaries that you can just copy onto any machine with the same architecture and run. Of course it has the good and bad of native-style development, but the core idea of the project is very simple: compile Scala to native binaries.
A
Makes sense. So Scala.js gives you the ability to run in the browser. What problem does being able to run as a native application solve?
B
One of the issues with the JVM, as I see it, is that the JVM is really heavy machinery. It requires quite a bit of footprint just to run the VM. You can see it in terms of memory used, you can see it in terms of application startup time, and you can sometimes see it in terms of the overhead of all the services, like JIT compilation, running behind the scenes. And that's because the JVM is a very, very advanced multi-stage, multi-tier VM, and it's really hard to support all this functionality without incurring some overhead. The way Scala Native is different is that we do most of the expensive parts, like compilation, ahead of time. So you already have a precompiled and pre-optimized binary, and when you start it, it just runs your app. It doesn't do the whole multi-tier VM thing. We don't have an interpreter, we don't have multiple tiers of compilation, and once we emit a binary, that's it. There is no recompilation at runtime and no tricks behind the scenes. It's a simple binary, which means we have a way lower footprint both in terms of memory and in terms of startup time. This can be useful for a number of kinds of applications, for example command-line tools. For a command-line tool it's extremely important to start up quickly, do your job, and then die. This is the area where the JVM is really bad right now, because just starting the VM is an extremely expensive operation. So you definitely see this kind of initial slowdown if your app is not long running.
A
So that is what people refer to as the JVM warmup, is that right?
B
Yeah, exactly. So it's JIT warmup. And it also has to do with the fact that when you run code on the JVM it actually goes through a number of stages. First you go through the interpreter. The interpreter is really, really slow. Code is not meant to stay there for a long time, so the JVM tries to go to compiled mode as soon as possible. And there are at least two compilers right now which are used in production JVMs, C1 and C2. C1 emits simple code to avoid the interpretation cost, and C2 is a very, very advanced optimizer that only optimizes heavily used parts. So basically you have this very elaborate machinery, which means that you don't get optimized code until your application is warmed up. In native, we have optimizations basically equivalent to something in between C1 and C2. So we already heavily pre-optimize your code before you run it. But at the same time it's not quite the same as the VM, so it has some pros and cons.
A
Yeah, so if you have a long-running Java app or Scala app, I guess the cost of this warmup maybe doesn't matter so much. But if something has to start up frequently, then re-optimizing it is a lot of overhead.
B
Absolutely. Yeah, that's it. And another area apart from command-line tools is different types of user-facing apps where you can observe and perceive the startup time, which are often simple graphical apps. For example, starting Eclipse takes, I don't know, minutes. And you never close it, because you're afraid to close it: once you close Eclipse you'll have to go through the same thing again. It's basically not an Eclipse problem. As far as I can tell it's largely a JVM problem because for example, other IDEs start much faster than Eclipse.
A
Yeah, I use IntelliJ but I find the same thing. I try to keep it running. In your talk "Scala Goes Native" you described the JVM as a golden cage. Is this sort of what you mean by that, or could you describe this concept?
B
So this metaphor was basically trying to motivate why we don't artificially limit what you can do in Scala Native the way the JVM does. In particular, we don't try to sandbox your code so that you can only write very safe code which never escapes the cage. That's what I meant by the cage. We let you use low-level, system-level tools like raw access to memory, raw pointers and stuff like that, which is potentially unsafe, but it's actually necessary in some domains like systems programming, where you want to have very low-level control of your memory. The benefit is that you get more control. As a developer, you're not limited by the language and the VM, because you can do whatever you can do in C and C++. On the other hand, you do lose some of the safety you get on the JVM. But this is the kind of trade-off we take.
A
That makes sense. So is this strictly for memory management or how about interoperability?
B
So generally, for interop, Scala Native exposes a number of language extensions aimed primarily at interop with C code. And in a way it's a bit like writing C code in Scala. We expose things like pointers and structs, so you can do familiar C-style programming. But it also means it's extremely easy to call C code. You don't need to go through a number of layers of bindings. You can just do it all in Scala without any C or C++ code in between. But it also means that if you can call arbitrary C code, you can also get the arbitrary issues that C code has, for example different types of safety issues around buffer overflows and so on and so forth. So it's definitely a trade-off. Calling C code is not free in terms of the safety guarantees you get. But again, this is the kind of trade-off we make. We don't try to be as safe as possible, we try to be as flexible as possible.
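Editor's note: the following is a minimal sketch of what calling a C function directly from Scala can look like. It assumes the `scala.scalanative.unsafe` package and the `@extern` annotation as found in more recent Scala Native releases (older releases used a different package name), and it only compiles with the Scala Native toolchain, not on the JVM. The object names `libc` and `Main` are chosen for illustration.

```scala
import scala.scalanative.unsafe._

// Binding to the C standard library function `strlen`.
// The body `extern` tells the compiler the implementation lives in C.
@extern
object libc {
  def strlen(str: CString): CSize = extern
}

object Main {
  def main(args: Array[String]): Unit = {
    // c"..." is a C string literal: a pointer to zero-terminated bytes.
    val length = libc.strlen(c"hello")
    println(length.toInt)
  }
}
```

Note that nothing checks the pointer passed to `strlen`; handing it a bad pointer has the same consequences as it would in C, which is the trade-off described above.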
A
Makes sense. You want to give people that tool even if, you know, it could go wrong. So how about memory management on the JVM?
B
So the JVM is basically a GC-only platform. There are some very well hidden but extremely well known areas like sun.misc.Unsafe, which lets you allocate unmanaged memory and manage it yourself. For example, you can allocate unmanaged memory through sun.misc.Unsafe and then do raw memory accesses on it. It's basically as unsafe as pointers in C. Only JVM people try to hide it from you, even though every major performance-centric framework actually uses it. For example, Spark uses off-heap memory to manage most of its data, because it's just too expensive to allocate everything on the GC heap. But JVM people like you to believe that you only have GC, which is the main paradigm, and actually GC on the JVM is really good. So very often it's a good idea to just use that and not do any unsafe memory management. In Scala Native we say both coexist, you have APIs for both, and they're both easy to use. Definitely unmanaged memory is a dangerous thing, and you can definitely shoot yourself in the foot. But if, for example, you want to do some domain-specific thing to optimize your memory layout and so on and so forth, you can do that without jumping through hoops.
A
Makes sense. So, you know, the JVM purports to be memory managed, but if you look at these high-performance apps, they're using backdoors to actually do manual memory management.
B
Yeah. And most of them really go a long way to try to sidestep the cage, to find these small doors to the outside to get more freedom. But on the JVM it's really hard to do those kinds of things.
A
That's where the cage metaphor comes in.
B
Yeah.
A
Okay, so now that we understand some of the motivations for, you know, getting Scala to run natively, not on the JVM, maybe let's discuss some of the implementation. So could you describe the compilation steps that it takes to get from Scala source to a native application?
B
Oh sure. So it's actually a bit involved, and there is a good reason we have every single step on the line, but it's really a multi-step process. You always start with Scala sources, which are the source of truth, and in the case of Scala Native you end up with a native binary. But it's not a one-step thing from Scala source to native binary. The first thing you do is parse and type check, basically running the pipeline from the main Scala compiler for the JVM. This contains a number of things, but the most important one, I guess, is type checking, because type checking in Scala is very involved. We don't reimplement type checking or the language. We keep the same core, the same language as Scala on the JVM. And then later, once the Scala compiler is almost done, we branch off and we emit something called NIR. NIR is short for native IR, which is our own intermediate representation. NIR is the format we work with in our toolchain, and when I say our toolchain, I mean the linker, optimizer, and codegen. They all speak this language as if it was a real language. So to get from NIR to binaries we are now one step closer, because NIR is already quite a bit more low level than Scala. Many things are gone. For example, there are no nested classes and there are no generics. The type system is much simpler, and it's actually very close to Java bytecode rather than the Scala language. The main difference from Java bytecode is that it's in SSA form, which makes it very easy to emit LLVM IR later. SSA is a form of code representation which is very nice for optimizing compilers. So from NIR we need to get to a native binary, and there are three major steps on the way. The first is linking. Linking loads a minimal subset of your classpath to satisfy your application's requirements. 
For example, an app that doesn't use regular expressions should not have regular expressions precompiled into the binary, and so on and so forth. So we really try to limit the amount of code we put in the final binary and not include every single class on the classpath, because sometimes classpaths get quite bloated even though we don't use some of the things. Sometimes people depend on a library even though they use a single function from it. So we do something called whole-program dead code elimination at link time. After that step you get a minimal subset of the classpath, which we optimize through our own optimizer, which removes common patterns that LLVM doesn't know how to optimize well. And then in the end we emit LLVM IR, which is another IR, but now for LLVM. LLVM is this project for reusable compilers, basically. It's the core of the Clang compiler, it's also used by many, many other open source languages, and it's actually very well documented and very nice to work with. From there on it's basically LLVM's job to get from LLVM IR to native code.
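Editor's note: the link-time step described above can be pictured as a reachability walk over a graph of definitions. The sketch below is plain Scala and is not Scala Native's actual linker; the definition names and the `references` map are hypothetical, standing in for what the linker would read from NIR.

```scala
object Reachability {
  // Hypothetical dependency graph: each definition maps to the definitions it references.
  val references: Map[String, Set[String]] = Map(
    "Main.main"     -> Set("Parser.parse", "Printer.show"),
    "Parser.parse"  -> Set("Printer.show"),
    "Printer.show"  -> Set.empty,
    "Regex.compile" -> Set("Regex.match"),
    "Regex.match"   -> Set.empty
  )

  // Worklist traversal: keep everything transitively referenced from the entry points.
  def reachable(entries: Set[String]): Set[String] = {
    @annotation.tailrec
    def loop(todo: List[String], seen: Set[String]): Set[String] = todo match {
      case Nil                        => seen
      case name :: rest if seen(name) => loop(rest, seen)
      case name :: rest =>
        val next = references.getOrElse(name, Set.empty).toList
        loop(next ::: rest, seen + name)
    }
    loop(entries.toList, Set.empty)
  }

  def main(args: Array[String]): Unit = {
    val kept = reachable(Set("Main.main"))
    // The regex definitions are never referenced from Main.main, so they are left out.
    println(kept.toList.sorted.mkString(", "))
  }
}
```

As the conversation goes on to say, a real linker cannot know what will actually be called at run time, so it has to keep anything that might be reached.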
A
You have your Scala code, you're using the front end Scala compiler to get some intermediate representation and then doing some transformations and then passing that through to llvm. Is that the big picture?
B
That's pretty much the gist of it.
A
And then because there could be a lot in your classpath, you're making sure that you only include things in that binary that are actually called within the program.
B
We don't know if they're going to be called, but we try to analyze it and make our best guess at what's going to be called.
A
Yeah, okay, makes sense. So one of your frustrations with the JVM was that its garbage collector doesn't fit every use case. Many listeners may know that there are different types of garbage collection strategies. I was wondering if you could describe a couple of strategies for performing garbage collection.
B
So on the JVM you actually have a number of built-in garbage collectors. As far as I understand, in Java 9 the default one is called G1, and G1 is the latest collector from Oracle, which is optimized for latency-centric workloads. GCs are often put in either a latency-sensitive or a throughput-sensitive bucket. What latency sensitive means is that the GC is optimized for the shortest pause it can take to collect garbage. These pauses can be extremely frequent, but every single pause is small. Throughput-centric collectors care not about the length of a single pause, but rather about the total time spent in GC. So, for example, a throughput-centric collector can take fewer pauses but make them much longer. And basically on the JVM right now the official ones are G1 for latency-sensitive and parallel GC for throughput-sensitive workloads, as far as I understand. And CMS, which was previously the default, is deprecated as of Java 9, which is a bit sad because on some of our workloads, like the Scala compiler, I think CMS is still the best one. But otherwise those are basically the three main collectors we have right now, with CMS being deprecated. The general theme for all of those collectors is that they're typically generational, and they're typically at least parallel, often concurrent. What concurrent means is that the collector runs alongside your application and tries to stop your application as little as possible. So it does garbage collection not just in parallel, as in multiple threads of garbage collection, but also concurrently with your application. So compared to all this, where does Scala Native stand? Right now we have a rather simple garbage collector called Immix, which is inspired by a paper. You can see more information on our website if you're interested. But the general idea is that it's a single-generation collector which is right now optimized for predictability. 
It's not concurrent today, it's stop-the-world, and we currently optimize mostly for throughput. Latency sensitivity is our next big milestone, which we haven't reached yet.
A
Okay, that makes sense. You also have like, as I understand it, you have more than one GC available in Scala native. Yeah, maybe describe what they are.
B
Right now the default one is actually not Immix. It's called Boehm. Boehm GC is this super easy to use, plug-and-play garbage collector which was designed originally for C and C++. And the reason why it's even possible at all to make it work in this environment is that its garbage collection is conservative. What does that mean? It means that the garbage collector doesn't require your app to declare the layout of all objects ahead of time. It will conservatively guess what objects are based on their size and layout. For example, if the value at some specific offset looks like a pointer, it can be considered a pointer even if it's not, as long as it satisfies a bunch of properties that the GC wants to see from pointers. This is more expensive than precise garbage collection. Precise garbage collection knows exactly at which offsets you have pointers and at which offsets it's just data, so it needs to do less work. The main reason why our current new collector, called Immix, is faster is that it's precise. We do use information about the whole object layout, and it's way easier to collect the garbage. It's still conservative in one small aspect, the stacks are conservative, but typically that's not a problem. Another cool thing about Immix compared to Boehm is that it uses a very smart data structure for allocation and collection, which lets it bump allocate most of the time. That is really important because bump allocation is the fastest way to allocate. Boehm is still using free lists from time to time, and free lists are typically quite expensive in our experience. And apart from these two, we have another option, no GC, which is a setting called nativeGC := "none". That one lets you completely disable the garbage collector. The idea behind that one is to have a rough understanding of how much time is spent in garbage collection. What's the baseline performance? 
What's basically the perfect garbage collector? Because allocating and never freeing is actually extremely close to perfect garbage collection. It's not perfect, because it will still place objects far apart if they were not allocated at the same time, so it can still cause problems with memory locality. But most of the time it basically spends zero time in garbage collection, which means it's as low overhead as it gets for most applications, and we use it as a baseline to benchmark our GCs. So its main purpose is benchmarking. Apart from that, there are some use cases like extremely short-lived applications which really don't need to manage memory, because they run for less than a second and they don't allocate gigabytes of memory, but maybe hundreds. For those kinds of apps it's actually beneficial to be able to disable the garbage collector, because it means they will run at the best performance possible.
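Editor's note: as a rough illustration of the setting mentioned above, this is how choosing a collector might look in an sbt build. The key name follows what is said in the conversation; the string values for the other two collectors are assumptions, and newer Scala Native releases may configure the collector differently.

```scala
// build.sbt (fragment). Assumes the Scala Native sbt plugin is already enabled
// for this project. Pick one of the three options discussed above.
nativeGC := "none"      // allocate and never free; the benchmarking baseline
// nativeGC := "boehm"  // conservative Boehm collector (value assumed)
// nativeGC := "immix"  // precise Immix collector (value assumed)
```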
A
Makes sense. So "none" exists for performance testing, but in actual fact it can be used for something like a command-line app. So you can test what you're calling, you know, a perfect GC against the two that you have. When you do this type of testing, how do they perform compared to a perfect standard?
B
So, compared to our reference, Immix is typically somewhere around 20% overhead.
A
So.
B
So this means if you add Immix, your app will run 20% slower. In comparison, Boehm is somewhere around 100%. So enabling that GC slows down your application by a factor of two, which is pretty bad. And it mostly has to do with the conservative nature of the collector. So Immix is at 20%. That's actually still higher than we want it to be. I think we can get to 10% or maybe even less without changing the design of the collector too much.
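Editor's note: to make the overhead figures concrete, here is a small worked example in plain Scala. The ten-second baseline is a made-up number; the percentages are the approximate ones quoted above, measured relative to a run with the collector disabled.

```scala
object GcOverhead {
  // Overhead is expressed relative to a baseline run with no garbage collector.
  def runtimeWithGc(baselineSeconds: Double, overheadPercent: Double): Double =
    baselineSeconds * (1.0 + overheadPercent / 100.0)

  def main(args: Array[String]): Unit = {
    val baseline = 10.0 // hypothetical baseline, in seconds
    println(runtimeWithGc(baseline, 20.0))  // roughly 20% overhead, as quoted for Immix
    println(runtimeWithGc(baseline, 100.0)) // roughly 100% overhead, as quoted for Boehm
  }
}
```

Under these assumptions a ten-second workload takes about twelve seconds with Immix and about twenty with Boehm, which is where the "factor of two" comes from.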
A
Interesting. Do you happen to know like 20% away from absolutely perfect? Doesn't sound too bad.
B
Yeah.
A
Do you know where the JVM's generational garbage collector would fit on such a measure?
B
It's a bit hard to compare with something like CMS or G1, because they run concurrently. So it's typically under 5%, and I would say probably even less than that, because with a concurrent garbage collector you never perform the garbage collection on the actual application thread. You have a separate thread which only pauses the application to do simple things, like scan the stack or wait for some condition to hold. So it's typically short pauses of 5 milliseconds or less. They can be frequent, but as far as I understand it's typically under 5%. So our performance goal is to be on par with the JVM. Right now we don't guarantee parity with the JVM in terms of performance, so there's still quite a bit of work to be done there.
A
Makes sense. So some people's complaint with the JVM is sort of the stop the world garbage collection. But you shouldn't go to Scala native to get away from that because that's all you have at this point.
B
Yeah, so at the moment we don't solve the stop-the-world problem. We're researching ways to refine our GC further. But right now, as of the released version, only the no-GC option has no stop-the-world problems, because it doesn't collect at all.
A
Yeah, makes sense. So now I think I understand how the GC works. I'd like to look a little bit at Scala native usage. So is Scala native the same language? Is it Scala or is it something like a superset?
B
So Scala Native at its core is one-to-one Scala. There are very few differences in terms of how we treat normal Scala language features. They are mostly around edge cases, like what happens when you call a method on a null, or what happens when you do a cast which doesn't make sense. On the JVM those cases are defined to throw exceptions. Some of those are just undefined behavior on native, which means anything can happen if you do this. Typically it means it just crashes with a segfault, which is a bit worse than on the JVM, but it's still easily debuggable through native tools that will show you a stack trace and will effectively show you as much as a null pointer exception. We don't currently guarantee one-to-one parity in the edge cases, and it's likely we will never have this, because it's typically been a non-issue for us. It's a bit more annoying to debug some of this, but it simplifies our implementation quite a bit. And apart from the core language, which is almost exactly the same, like 99% the same, we have a bunch of extensions for interop. The interop extensions are very different from Scala on the JVM, which doesn't have anything similar. We have raw unmanaged pointers and things that go with them, like memory layout types such as structs. So you can have a pointer to a struct, and it has a meaningful data layout which is the same as in C. We also have function pointers and a bunch of other things to make it easy to call C code. Generally you don't have to use these extensions at all, they're there only for interop. Pointers are also extremely useful for having a lower-level, GC-free subset of the language that you can use for extremely performance-sensitive applications. But again, you don't have to use any of this. So core Scala is really as close as we can make it to being the same as on the JVM.
A
Makes sense. And I guess with the pointers you can kind of approach that perfect GC we were talking about. So if you've added concepts like structs, structured types, and function pointers, doesn't that make the language a superset? Are these new keywords in the language, new syntax?
B
We don't add any new syntax whatsoever. Our rule is that it should type check without any problems in the normal compiler. It might not make sense there, but all of our extensions are tied to magical intrinsic methods or magical annotations which modify how we compile things, while they still type check one to one in the Scala compiler without changes. So from a types point of view it's the same language. From a runtime semantics point of view it's quite different, but the types are still the same.
A
That makes sense. Yeah, I think that's a nice way to do it. So I mean, because you're using annotations, does that mean that you can actually cross compile? So the same source can be a native binary and a, you know, a jar?
B
Absolutely. We do support cross compilation. Cross compilation is done through the sbt-crossproject plugin. It's an sbt plugin that lets you cross compile against three major targets, which are JavaScript, JVM, and native. These targets are treated as separate sub-projects of one mega project, which is called a cross project. From sbt's point of view they're like separate projects with separate JARs. But we try to streamline the end-user experience so that it really feels more like one single project which you manage through this cross-project API. Overall, the idea for cross compilation is that you create a cross project with one or more platforms, and then when you compile and publish, you publish one JAR for every platform you want to support.
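Editor's note: a build definition along these lines might look like the fragment below. It is a sketch assuming the `crossProject` API of the sbt-crossproject plugin, with the plugin itself and the Scala.js and Scala Native sbt plugins already added to the build. The project name `core` is hypothetical.

```scala
// build.sbt (fragment). Assumes the sbt-crossproject, Scala.js, and Scala Native
// plugins have already been added in project/plugins.sbt.
lazy val core = crossProject(JSPlatform, JVMPlatform, NativePlatform)
  .crossType(CrossType.Pure) // one shared source tree for all platforms
  .settings(
    name := "core" // hypothetical project name
  )

// Each platform is exposed to sbt as its own sub-project.
lazy val coreJS     = core.js
lazy val coreJVM    = core.jvm
lazy val coreNative = core.native
```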
A
How about libraries like the Scala standard library I think is kind of very important and kind of gives the language a lot of its feel. So do you have the standard libraries available natively?
B
So the standard library story is a bit involved, but generally the idea is that the Scala standard library is there and you can use it unchanged. Things like collections and standard types just work. The way it works is that the Scala library is very often implemented in terms of Java APIs. Instead of trying to rewrite the whole library and have a compatible but different library, we do a slightly more involved thing which gives us a better compatibility story: we implement the subsets of the JDK APIs which are used by the Scala standard library and popular third-party projects, to be able to have the same code on both JVM and native completely unchanged. For example, projects like uTest and FastParse needed zero changes in the source to cross compile to native. They only had to change the build to support cross projects. That's it.
A
So what about the JDK? I assume JDK calls are underpinning a lot of the Scala standard library.
B
Yeah, so basically those are the Java libraries we care about. Typically what it means is that we have our own pure Scala implementation of java.lang, java.util, java.io, java.nio and a bunch of other things which are essentially core APIs that people rely on in open source projects and in the Scala library. We try to implement those as faithfully as possible to their reference implementation on the reference JVM, but we don't look at the source of the reference JVM, because we try to stay away from GPL code as much as we can. Scala is BSD 3-clause licensed and our implementation is BSD 3-clause licensed. One of the only inspirations for some parts of the APIs we implemented was the Apache Harmony project, which is a reimplementation of the Oracle APIs under the Apache license rather than the GPL. So we sometimes use it for cases where it's hard to reverse engineer the underlying behavior of the JVM and we need some help there.
A
Interesting, I hadn't heard of that project. So if you're recreating the. I'm just thinking there could be the case where an implementation detail of some aspect of the JDK actually becomes something that becomes dependent on and then when you have a new native implementation and somehow that varies and things break. Have you come across any cases like this?
B
We already experienced some of those. Technically, every time we see something like this it's a bug in native and we fix it as soon as we can. There are differences that we know of, some of them seemingly minor, but they can still cause accidental breakage. For example, our float toString, the toString on the java.lang.Float boxed type, has a slightly different output format, which still outputs the same number but sometimes has more trailing zeros than the one on the JVM. And it has caused some open source projects' tests, which rely on the toString output being exactly the same as on the JVM, to fail. We try to fix those as fast as possible. For some of them it's a bit hard, but our philosophy is that if you can observe a difference from the reference implementation, it's a bug.
A
Well, that's a hard standard to hold yourselves to. I mean, to me it almost seems like their tests shouldn't be dependent on the number of zeros that a toString implementation outputs. I'm interested to hear if any of the large Scala frameworks can run on native. I'm thinking of Spark or Akka, or even the Play framework. Has any large project been ported over?
B
So as far as I know nothing major has happened yet. Probably the biggest code base that has been cross compiled is scalac, which has been done as part of our recent experiments. Technically it's not hard to compile the source to NIR, the first step. What's hard is to satisfy all of the Java dependencies, all of the Java library assumptions which are expected by these projects. For example, to run Akka you need good I/O support. To run Akka HTTP you need complete socket support. Some of these parts are still a work in progress for us. For example, initial support for sockets was just merged in the previous release, and we're still working there. So it's a bit early for major frameworks like Spark to just work out of the box. But we are constantly looking at what's blocking people in terms of Java library coverage and in terms of the APIs we support. In fact, we're often implementing things just based on reports from people trying to port libraries. Typically right now it's smaller-scale open source projects like uTest and FastParse, but even for those, to run, cross compile, and test them, all of these small differences in library semantics are often important.
A
So you mentioned that scalac, the compiler, has been ported over. Could you describe why, and how that went?
B
So we had this still mostly private experiment to port the Scala compiler to Scala Native. The idea is that right now on the JVM, because of the startup issues, you kind of have to have sbt always running in the background, because otherwise the compiler is just unusably slow. It's only usable after it's warmed up, after a few compilations. But on native we don't have to have this problem, because the very first run is already optimized. So you can run optimized code immediately. And what we observed in our very early experiments is that we offer significantly faster performance on cold compilations, and on small, simple projects it can be a few times faster. So basically a cold build with scalac on the JVM can be two to three times slower than a cold build on native.
A
Wow, that's amazing. I mean, one of my frustrations with Scala is, yeah, the cold compilation time can be longer than any other language that I can think of. So what were some of the challenges of moving it to native?
B
So probably the major challenge was to have enough I/O. So we had a long story of doing file I/O and different types of file I/O, because scalac uses almost every single type of file I/O the JVM has. Don't ask me why, I don't know. But it basically uses NIO.
A
It's.
B
It uses java.io and a bunch of other things. So also things like the jar and zip APIs. Most of those have been contributed by Martin Duhem from Scala Center. And it's been extremely helpful to make this even possible because essentially without the libraries a project depends on, it's hard to run it on native. So basically those are probably the hardest parts. We also have a work-in-progress port of Scala ASM. Scala ASM is a fork of the ASM library, which is a Java bytecode generation toolkit, which basically lets you programmatically emit Java bytecode. That's what scalac does all the time. So we have a limited subset of that library ported to native to have enough APIs to make scalac compile and emit class files. But otherwise those basically were the only challenging parts. So we only kind of had the library problems. We haven't really discovered any major bugs in Scala Native this way. So as soon as we had enough Java libraries, it ran, basically. That's basically the typical story of porting stuff to native.
A
Once those libraries are in place, then it works great. So if I have a, if I'm in Scala native and I have access to C as well as to, you know, Scala and JDK libraries. Like what is a string? When I create a string, is that a native string? Is that a Java string? Is it immutable?
B
So a Scala string is an instance of type java.lang.String, which is an immutable string backed by a Scala array, which is also garbage collected, which is quite different from what C has for arrays, right? So C has just basically a sequence of bytes in memory which ends with a trailing zero. This memory can really be anywhere because it's C and it's untyped and so on and so forth. So when you call an API which expects a C string, you need to convert Scala strings to C strings. In some cases where you know you have the same data representation on both the Scala and the C side you can share data structures, but often you have to copy data over if they're in completely different formats. Like for example for file I/O, when you read or write bytes, we can just share memory with Scala Native arrays without copying. So it's not often the case that you have to copy data over.
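The difference between the two string representations described above can be sketched in plain Scala. This is only an illustration that runs on any Scala platform, not the Scala Native interop API: the object `CStringSketch` and its two helper functions are hypothetical names, and UTF-8 is an assumed encoding. It shows why a copy is needed: the C form is a separate run of bytes ending in a zero byte.

```scala
import java.nio.charset.StandardCharsets

object CStringSketch {
  // Encode a Scala string (a java.lang.String) as a C-style string:
  // its UTF-8 bytes followed by a single trailing zero byte.
  def toCStringBytes(s: String): Array[Byte] = {
    val bytes = s.getBytes(StandardCharsets.UTF_8)
    val out = new Array[Byte](bytes.length + 1) // new arrays are zero-filled
    System.arraycopy(bytes, 0, out, 0, bytes.length)
    out
  }

  // Decode a C-style string by reading up to, but not including,
  // the first zero byte in the buffer.
  def fromCStringBytes(buf: Array[Byte]): String = {
    val end = buf.indexOf(0.toByte)
    val len = if (end < 0) buf.length else end
    new String(buf, 0, len, StandardCharsets.UTF_8)
  }

  def main(args: Array[String]): Unit = {
    val c = toCStringBytes("hello")
    println(c.length)            // five characters plus the terminator
    println(fromCStringBytes(c)) // round-trips back to the Scala string
  }
}
```

Both directions copy the data, which matches the point made in the answer: sharing without copying only works when the two sides already agree on the representation, as with raw byte buffers.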
A
So you can use either and you get to choose. And there's some helpers for going back and forth.
B
Absolutely, yeah.
A
I can see why that would be very useful. What hardware architectures, what platforms can Scala Native run on?
B
So technically we have very few requirements, but right now we only test on 64-bit architectures. Our CI is Mac and Linux on 64-bit Intel. People have reported that it seems to work on 64-bit ARM unchanged, although we don't officially support ARM at the moment, as in we don't have CI for it. But generally just about any 64-bit architecture should just work out of the box. We only had reports about ARM and Intel, but maybe more obscure things like PowerPC would work too. We don't know for sure because we don't have this kind of hardware. So basically anything with 64-bit pointers should just work.
A
I think now I kind of understand a lot of the usage around Scala native. What interesting projects have you seen making use of this project?
B
So there have been a number of experiments going around. One of the more interesting ones is this experimental framework in development called Dinosaur. It's actually at a very, very early stage, but it tries to be like a native-first web framework, which currently is built on simple stuff like CGI. The author is experimenting with FastCGI now, and it seems like it's an interesting place to be because up until the point we have stable web frameworks, basically the first framework to market will probably be the main framework for Scala Native. So it seems like Dinosaur has the biggest lead to market so far, and there is already quite a bit of code working and quite a bit of experiments. And you can check one of Richard's blog posts. I think we had some of them retweeted from the Scala Native Twitter. But basically the idea is to do a native-first web framework, which is pretty cool. I've also seen people do different types of command line tools, and this is basically the area where we excel and this is the area where the JVM is often borderline unusable performance-wise just because of.
A
The warm up time. Yeah, if you write a command line tool it just is slow.
B
Yeah.
A
So because of that quick startup time, I'm interested if anybody has thought of or if you think it'd be a good idea to use Scala native for things like Amazon Lambda, like serverless computing.
B
It's probably an interesting idea. I've never seen anyone try it on. It would be interesting to see how it works out.
A
I think I saw some talk on your website about compiling down to iOS to make an iPhone app. Is that a. Is that a real thing or.
B
Some people tried to compile to iOS and it seems to work in principle. The main challenge with iOS is interop with Objective-C. Right now we don't support Objective-C, so basically you're in a bit of an uncomfortable place right now. As far as I know, nobody's actively trying that. So it's possible in principle, but it's not directly on our short list in terms of the things we want to do now.
A
I think that you mentioned earlier that you were inspired by Swift with the LLVM intermediate language, is that right?
B
Yeah.
A
So how did it inspire the implementation of Scala native?
B
So Swift was an influence, but the major inspiration for Scala Native was Scala.js, because before Scala.js it was basically considered an eternal truth that it's too hard to implement Scala outside the JVM. So essentially the major inspiration for Scala Native is Scala.js and not Swift. The way Swift influenced Scala Native is mostly in terms of compiler technology, in terms of what we do under the hood. So Swift has this intermediate language called SIL, which is short for Swift Intermediate Language, and it's kind of like a higher level LLVM IR, and it's basically the area we're also aiming for with NIR, like a higher level LLVM IR-like thing. The main difference between SIL and NIR is that SIL is reference counted and NIR is garbage collected. And basically this is probably the major difference between the two. But otherwise they're trying to solve a similar problem. Both are representations for an optimizing compiler for a high level language. And they both try to optimize the parts which LLVM cannot do well, because LLVM is actually a very low level API and a very low level representation, and some things are just simply gone by the time you emit LLVM IR. One of our long standing issues is the performance of virtual dispatch. We already did a lot to make it pretty fast. But still, on LLVM, when you compile virtual dispatch you typically end up with calls through function pointers. Basically this is what you compile down to. And when you're at that low level of abstraction, it's really hard to optimize this away. So LLVM typically does very little, close to nothing, to optimize virtual dispatch. So this is what we do ourselves. SIL also solves a similar problem. Basically it's a format for pre-optimization before LLVM optimization happens. So you try to make LLVM's job as easy as possible and to emit high quality code.
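To make the virtual dispatch point concrete, here is a small plain-Scala sketch of the two kinds of call site being contrasted. It does not show Scala Native's optimizer or NIR; the names `DispatchSketch`, `Shape`, `Square` and `Circle` are hypothetical. The first function calls `area` through the trait, so the target depends on the runtime class; the second calls it on a final class, where a compiler that still sees the class hierarchy can resolve the target statically.

```scala
object DispatchSketch {
  trait Shape { def area: Double }

  final class Square(side: Double) extends Shape {
    def area: Double = side * side
  }

  final class Circle(radius: Double) extends Shape {
    def area: Double = math.Pi * radius * radius
  }

  // Virtual call: `area` is invoked through the Shape interface, so the
  // concrete method is only known at run time. At a low level this is a
  // call through a function pointer.
  def totalVirtual(shapes: Seq[Shape]): Double =
    shapes.map(_.area).sum

  // The static type is a final class, so there is exactly one possible
  // target for `area`. A high-level optimizer can turn this into a
  // direct call and possibly inline it.
  def totalSquares(squares: Seq[Square]): Double =
    squares.map(_.area).sum

  def main(args: Array[String]): Unit = {
    val squares = Seq(new Square(2.0), new Square(3.0))
    println(totalVirtual(squares))
    println(totalSquares(squares))
  }
}
```

Both functions compute the same result; the difference is only in how much the compiler knows about the call target, which is the information that is lost once the code has been lowered to function pointers.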
A
Were there any challenges with having a language that has two paradigms like Scala and kind of having this compile to LLVM?
B
Actually I don't think this dual-nature thing was a big problem. Probably the main reason is that essentially the Scala compiler already does the functional to object oriented part of compilation. Essentially all of the high level functional features are replaced by equivalent object oriented features. So typically what you end up with by the end of the Scala compiler is very object oriented code. And essentially most of our challenges to make functional code work well are the same as to make object oriented code work well, because at the end of the day, for example, closures are just objects with virtual methods, just the same as any other object oriented thing. So basically it all compiles down to the same representation, where it has the same format for both object oriented and functional features.
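The remark that closures are just objects with virtual methods can be sketched in plain Scala. This is a hand-written illustration, not the compiler's actual output: `ClosureSketch` and `Adder` are hypothetical names, and the class only approximates what a lowered closure looks like. The lambda and the explicit class behave the same way, with the captured value stored as a field and the body exposed as an `apply` method.

```scala
object ClosureSketch {
  // Functional style: a closure that captures the value `n`.
  def adderLambda(n: Int): Int => Int = x => x + n

  // Object oriented equivalent written by hand: the captured value
  // becomes a constructor field and the function body becomes the
  // virtual method `apply`.
  final class Adder(n: Int) extends Function1[Int, Int] {
    override def apply(x: Int): Int = x + n
  }

  def adderObject(n: Int): Int => Int = new Adder(n)

  def main(args: Array[String]): Unit = {
    val f = adderLambda(2)
    val g = adderObject(2)
    println(f(40)) // 42
    println(g(40)) // 42
  }
}
```

Because both forms are values of type `Int => Int`, any optimization that speeds up method calls on ordinary objects applies equally to closures.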
A
That makes sense, yeah. So that kind of part is taken care of for you. What features are up and coming in Scala native?
B
So right now we are pretty much complete in terms of language support. We don't know of any major semantic difference which would be a breaking change, as in, we would like to fix it as soon as possible. And most of the innovation right now is happening in libraries. So we are slowly working towards bigger and bigger coverage of our implementation of Java APIs. One of the major things which we're trying right now are multi-threading APIs, like for example things like locks, concurrent atomic primitives and so on and so forth. And apart from that also networking and things like that. So basically those are the typical APIs you would need for a backend microservice kind of app. This is kind of the area in which we see Scala Native being used more in the future. Apart from library innovation, we do lots and lots of work on the compiler code quality and runtime code quality. So basically those are small iterative changes based on the common patterns we see, basically to improve performance, to reduce overhead, to reduce footprint, to make it even more lightweight, and so on and so forth. I guess that's pretty much what it is. One of the areas where we'll probably see the biggest changes, which are not like iterative, incremental, slow convergence towards better performance, are changes to the garbage collector. It's probably the area where we could do things significantly better than what we do now.
A
So if people would like to Learn more about Scala Native. Where should they go?
B
The starting place is our website, scala-native.org, and our Twitter, twitter.com scalanative. Those are the two central places for announcements, latest releases and so on and so forth. You can also go to Gitter, which is like a nice cozy chat room. If you just try Scala Native and something doesn't work or you have a problem, it's basically the place where you go to ask questions. And of course for all of the active development we use GitHub and GitHub issues, so pull requests and discussion on what's going on is happening over there. So basically if you subscribe to Twitter and Gitter and GitHub, you will pretty much see everything that's going on.
A
And I understand since you first announced this project, you've had a lot of contributors. Is there a lot of contributions coming in?
B
There's actually quite a bit of contributions. Right now we have a bit more than 60 contributors overall. It's really nice because people often contribute, sometimes small things, sometimes bigger things, but it's really, really nice to see people interested in the project and trying to help as much as they can.
A
Yeah, that's great. It's great to have community involvement, so that it's not just, you know, a couple of people working away on it. Well, thank you so much for your time, Denys. It's been great to learn about Scala Native.
B
Thank you for having me.
Host: Adam Gordon Bell
Guest: Denys Shabalin, Research Assistant at EPFL, Creator of Scala Native
Date: January 1, 2018
This episode dives into Scala Native, a project spearheaded by Denys Shabalin, which brings the Scala programming language (traditionally dependent on the JVM) to native code through an optimizing ahead-of-time compiler and minimal runtime. Host Adam Gordon Bell interviews Denys about the motivations, design trade-offs, implementation, and direction of Scala Native, and discusses how it enables new scenarios for Scala, especially in environments where JVM overhead is problematic.
On Scala’s dual paradigms:
“It really tries to blend both together because there is both good and bad on both ends.” (B, 01:06)
On the JVM as a barrier:
"The JVM is a really heavy, heavy machinery... it’s really hard to support all this functionality without incurring some overhead." (B, 03:57)
On the ‘golden cage’:
“We let you use low level system level tools like raw access to memory, raw pointers and stuff like that... it’s actually necessary in some domains like systems programming.” (B, 07:45)
On the major advantage of native for CLIs:
“For command line tools, it’s extremely important to start up quick, do your job and then die. This is the area where JVM is really bad at right now...” (B, 03:57)
On cross-compilation:
“The idea for cross compilation is you can create a cross project with one or more platforms...publish one jar per every platform you want to support.” (B, 28:01)
On porting scalac:
“Cold build with scalac on JVM can be like two to three times slower than cold build on native.” (B, 34:18)
On project philosophy:
"If you can observe the difference from the reference foundation as a bug, well" (B, 32:26)
| Topic                                           | Timestamp     |
| ------------------------------------------------|:-------------:|
| Scala’s dual paradigm & intro to Native         | 01:06–03:47   |
| JVM limitations & ‘golden cage’                 | 03:57–11:42   |
| Compilation process to native code              | 12:03–15:36   |
| Garbage collection in JVM and Scala Native      | 16:18–24:34   |
| Language compatibility & Extensions             | 24:52–27:46   |
| Cross-compiling and standard library strategy   | 28:01–31:38   |
| Porting frameworks / scalac to Native           | 32:53–36:54   |
| String/Array interoperability                   | 37:16–38:15   |
| Platform support                                | 38:28–39:15   |
| Experimental/adoption stories                   | 39:23–41:46   |
| Technical inspirations, future work             | 41:50–47:18   |
The episode presents a thorough, candid exploration of the motivations for and challenges in bringing Scala to native environments. Denys Shabalin outlines both technical innovations (such as advanced compilation, flexible memory management, and broad API emulation) and the broader vision—to make Scala more flexible and useful beyond the JVM. Key domains that benefit include command-line tools, experimental web frameworks, and potentially serverless computing, all enabled by rapid startup and lower memory demands. The project remains open and growing, with library support and GC innovations as prime areas for future contributions.