Jonathan Worthington

About My Dissertation Project

So everyone who knows me vaguely well and has talked to me much this term will have heard that, being a third-year computer scientist, I've got to do some project and then, some months down the line, write a 10,000 word dissertation about it. Of course, the inevitable question I get asked is "what's your project?" and the answer tends to be along the lines of "bytecode translation between virtual machines". Huh?! Thing is, some computer scientists do projects that do all kinds of cool 3D rendering, so you can say "it does pretty things" and people understand. Or they do something with AI, which you mention and everyone goes away happy (and quietly hoping your terminator won't chase them down). But no, pretty or intelligent (or the conjunction) wasn't in my plans (well, not the ones for my dissertation anyway). So just what on earth am I doing?

What's a virtual machine?

Take a look at your computer. It's a machine. Inside it there's a CPU, which can carry out millions (or billions) of arithmetic and logical operations every second. There's also some memory, where stuff is stored, and things like hard disks for longer-term storage and a network interface so your computer can access the internet. The CPU is at the center of the show, though. Software (that is, the programs you use every day) is made up of a sequence of instructions that the CPU executes. These instructions are in machine code, which the computer understands. However, machine code is not ideal for humans to read and write, so many decades back high level programming languages were invented. These were more human-friendly. Tools called compilers turned programs written in these programming languages into machine code. And all was good.

Well, all was...OK. Thing is that there isn't just one type of computer. There are different types of CPU that take different sets of instructions, and there are different operating systems to account for too. Things just work differently on Mac OS X and Windows and Linux and VMS and so on. This means that when you wrote a program that worked on one type of CPU and operating system, it very likely wouldn't work on another platform. And this kinda sucks.

A solution can be found by creating a virtual machine. It has an instruction set and a standard way to do the stuff that an operating system would let a program do. Compilers then generate "machine code" for the virtual machine. We usually call it bytecode rather than machine code when discussing virtual machines. The virtual machine is implemented as a program itself, which efficiently maps the virtual instructions and features to the ones that exist on the real machine.
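To make "a virtual machine is just a program" a bit more concrete, here is a minimal sketch of one in Python. The instruction set (PUSH, ADD, MUL, PRINT, HALT) is made up purely for this illustration and has nothing to do with the real JVM, .NET or Parrot instruction sets; a real virtual machine also does far more than this.

```python
# Toy stack-based virtual machine. The opcodes are invented for this
# illustration; they are not real JVM, .NET, or Parrot instructions.
PUSH, ADD, MUL, PRINT, HALT = range(5)


def run(bytecode):
    """Execute a list of toy bytecode values and return what was printed."""
    stack = []   # operand stack
    output = []  # values the program "printed"
    pc = 0       # program counter: index of the next instruction
    while pc < len(bytecode):
        op = bytecode[pc]
        pc += 1
        if op == PUSH:
            stack.append(bytecode[pc])  # the operand follows the opcode
            pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == PRINT:
            output.append(stack.pop())
        elif op == HALT:
            break
        else:
            raise ValueError(f"unknown opcode {op}")
    return output


# Bytecode for (2 + 3) * 4
program = [PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, PRINT, HALT]
result = run(program)
```

The list called program plays the role of the bytecode: it doesn't care what CPU or operating system it ends up on, so long as somebody has written run for that platform.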

The gain here is that we can write a program once for a virtual machine and it will run on many other types of machine - in fact, any type of machine that the virtual machine is implemented for. So instead of everybody having to make their programs work on lots of different types of machine, the only thing that needs to be made to work on different platforms is the virtual machine.

The Multitude Of Virtual Machines

So we have a virtual machine. Is the world perfect now? No, 'fraid not. Thing is that we don't just have one virtual machine. There are lots of them out there. For example, there is the Java Virtual Machine (JVM) that runs programs written in a language called Java. There's the .NET virtual machine, which was specified and first implemented by Microsoft. There's the Parrot virtual machine, which came out of the Perl community and is aimed at supporting what are known as dynamic languages (this basically means the VM needs to provide loads of difficult and scary stuff that I won't talk about here).

No surprises, a program that runs on the .NET virtual machine won't run on the JVM or Parrot. For people just running programs that's not such a big deal, as you can quite easily install Parrot and .NET and the JVM on a range of machines and they will all happily co-exist. The problems kinda hit people making software though. Imagine you are writing a program in a language that compiles to run on Parrot but you have a module that consists of .NET bytecode that you'd really like to use. You're a bit stuck.

Enter Bytecode Translation

My project will look at taking a load of .NET bytecode and translating it into Parrot bytecode, so it's essentially just as if the module had been written in a language that compiled to the Parrot virtual machine, and is therefore usable from other languages that compile to Parrot bytecode. An alternative strategy that is very likely easier to implement is to "embed" a .NET virtual machine within Parrot, but the embedding boundary will get in the way of various things, which is why I want to probe the translation approach.

What makes the translation hard? Well, just like translating natural languages, just because you have a word for something in one language doesn't mean you have a word for it in another. One instruction in .NET may become a few instructions in Parrot bytecode. Or there could be a word that means something kinda similar, but it has slightly different implications. For example, if I translated the French "Jonathan est très heureux" to a version of English that was missing the word "happy" I might get "Jonathan is very gay". While if you look up "gay" in the dictionary it sure does have the same kinda meanings as happy, it also could be read in such a way that would lead people to think I'm homosexual. Similarly, the meaning of the add instruction in .NET and the add instruction in Parrot may not be the same.
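Here is a rough sketch of both problems in Python, using two toy instruction sets that I've invented for the purpose (they are not real .NET or Parrot opcodes). The source machine is stack-based and its add is assumed to wrap around at 32 bits; the target machine is register-based and its add is assumed never to overflow. Those semantics are assumptions chosen to make the point, not a statement of what the real VMs do. Under them, one source add has to become two target instructions: an add, plus a hypothetical wrap32 that restores the source meaning.

```python
def translate(stack_code):
    """Translate toy stack instructions into toy register instructions."""
    out = []
    depth = 0  # simulated stack depth; stack slot i lives in register i
    for instr in stack_code:
        if instr[0] == "push":
            out.append(("set", depth, instr[1]))
            depth += 1
        elif instr[0] == "add":
            if depth < 2:
                raise ValueError("stack underflow in source code")
            a, b = depth - 2, depth - 1
            out.append(("add", a, a, b))  # target add: never overflows
            out.append(("wrap32", a))     # restore 32-bit wrap-around
            depth -= 1
        else:
            raise ValueError(f"unknown instruction {instr[0]}")
    return out


def run_registers(code):
    """Tiny interpreter for the toy register instructions."""
    regs = {}
    for instr in code:
        if instr[0] == "set":
            regs[instr[1]] = instr[2]
        elif instr[0] == "add":
            _, dst, a, b = instr
            regs[dst] = regs[a] + regs[b]
        elif instr[0] == "wrap32":
            value = regs[instr[1]] & 0xFFFFFFFF
            # Reinterpret the low 32 bits as a signed integer.
            regs[instr[1]] = value - 0x100000000 if value >= 0x80000000 else value
        else:
            raise ValueError(f"unknown instruction {instr[0]}")
    return regs


source = [("push", 2147483647), ("push", 1), ("add",)]
translated = translate(source)
registers = run_registers(translated)
```

Leave out the wrap32 and the translated program still runs, it just quietly gives a different answer from the original when the sum gets too big. That is the "gay" versus "happy" problem in miniature.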

Besides the bytecode translation there are masses of side-issues that I will need to deal with. Discussing them would make this an extremely long read, so I won't. Well, not on this page. But come hunt me down or check my blog if you really want to know more about how it's going. :-)

All content Copyright (C) Jonathan Worthington 2003-2005 unless otherwise stated.