[personal profile] gusl
It seems to be "common sense" among many/most CS people that it is very hard to automatically translate code from one language to another. But we have Linj, which does Common Lisp -> Java.

I'd like to see statistical Machine Translation techniques applied to this problem... although maybe we'd benefit more from logical learning methods, such as ILP. Also, I'd like to see paraphrase detection in programming: detecting whether two code snippets do the same thing... and paraphrase generation: rewriting the code (in a non-trivial way) while preserving the input-output semantics, and hopefully gaining some kind of efficiency (by breaking the low-level semantics inside the black box). E.g. automatic memoization (easy!), or rewriting BubbleSort (or some specification of what it means to be sorted) into QuickSort (not so easy!). The official name for this idea seems to be algorithm synthesis / algorithm discovery.
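As a sketch of the "easy" case: automatic memoization amounts to wrapping a pure function in a cache, preserving its input-output behavior while changing its cost. A minimal illustration in Python (chosen purely for illustration; not from the post):

```python
import functools

def fib(n):
    # Naive recursion: exponential time, but a clear specification.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# The "paraphrase": identical input-output behavior, very different cost.
@functools.lru_cache(maxsize=None)
def fib_memo(n):
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

print(fib_memo(30))  # 832040, computed in linear rather than exponential time
```

The rewrite is only valid because fib is pure; detecting purity automatically is part of what makes the general problem hard.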

The difference between this and Natural Language MT, of course, is that we don't need humans to test whether a translation / paraphrase is correct: even if we can't prove that a translation/paraphrase is correct, it should be easy to refute incorrect ones with very high probability. For this and other reasons (e.g. natural language's lack of formal semantics!), translating between programming languages is an easier problem.
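The refutation step can be sketched as differential random testing: run the original and the candidate translation on random inputs and look for a disagreement. A Python sketch (the names bubble_sort, candidate, and refute are all hypothetical, for illustration only):

```python
import random

def bubble_sort(xs):
    # Reference implementation: the "original program".
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

def candidate(xs):
    # Stand-in for the output of a hypothetical translator/paraphraser.
    return sorted(xs)

def refute(f, g, trials=1000):
    # Return a counterexample input on which f and g disagree, else None.
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
        if f(xs) != g(xs):
            return xs
    return None

print(refute(bubble_sort, candidate))  # None: no disagreement found
```

Passing such a test is not a proof of equivalence, but an incorrect translation that differs on a non-negligible fraction of inputs will be caught with very high probability.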

Although this came to me while thinking about all the Matlab code that I'd like to port to NumPy or OCaml, this problem would be even more interesting if you wanted to automatically parallelize your code. Automatic serialization is easy, no? Why is automatic parallelization difficult? I'm not necessarily asking the translator to come up with clever tricks.

Also, writing compilers for a new architecture is essentially the same problem. What is the state-of-the-art in automatic compiler construction?

What makes these problems so difficult?

(no subject)

Date: 2009-02-26 02:26 am (UTC)
From: [identity profile] mapjunkie.livejournal.com
The usual pitfalls when writing compilers and translators include:
* it can't have any chance of changing the overall semantics
* standards are not necessarily written in terms of compatible concepts
* there's not any labeling ahead of time about the usage distribution of various paths through the code (and, even if you do try to capture it, odds are you'll be wrong, because you'll have way more customers than testers, in most cases)
* it shouldn't be any worse in performance than what they wrote to begin with
* there are a large number of optimizing transforms, and it's unclear which order to apply them in. One applied in a very early stage can preempt an improvement much later.
* there's a vast difference between the code in the standard and all of the code supported by various vendors. For example, it's my understanding that there aren't widely-known grammars for various popular C compilers.

Having said all that, I should expect that the average CS major from a good-enough school should be able to write most of a functioning translator to do a useful amount of a port between languages.

I think the most worthwhile direction in this area is to take a well-supported virtual machine (CLR, JVM, LLVM) and write a translator from that to the language of your choice on that platform.

(no subject)

Date: 2009-02-26 05:21 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
<< is to take a well-supported virtual machine (CLR, JVM, LLVM) and write a translator from that to the language of your choice on that platform >>

Do you mean the other way around? Translating from JVM to Lisp would be a weird thing to do.

(no subject)

Date: 2009-02-26 11:28 am (UTC)
From: [identity profile] mapjunkie.livejournal.com
Nope, I did mean it in the order I put it. Writing a language-to-language translator is not rare because it's impossible; it's rare because it does something weird. If you're going to the trouble of accurately parsing and representing a language, writing it out to another language seems strange, instead of just compiling the pieces and joining them during linking or through some other API. If anything, translating from JVM to Lisp is more practical: not only can your Java be translated to Lisp, but your Scala, Clojure, and so forth as well.

I will admit it's even more annoying to do, though.

(no subject)

Date: 2009-02-26 04:32 am (UTC)
From: [identity profile] dachte.livejournal.com
Lisp is so much simpler than other languages that it's essentially a toy problem.

(no subject)

Date: 2009-02-26 05:17 am (UTC)
From: [identity profile] gustavolacerda.livejournal.com
Although Lisp has a simple syntax (i.e. it's easy to parse), I can hardly imagine that syntax is a substantial part of what makes this problem hard.

(no subject)

Date: 2009-02-26 05:38 am (UTC)
From: [identity profile] dachte.livejournal.com
Lack of types combined with being in general a very small language (with little spec-ambiguity for a given dialect) makes lisp a lot easier than most languages for these things. It's reasonably easy to reason about lisp, and implementing it in other languages (the link you provide claims something a bit different, in that it involves translation that probably includes a bit of implementation when needed) is quite simple - writing a simple LISP interpreter is a good task for a young programmer. Writing a simple C compiler that implements a good part of any of the specs is much tougher. Given that reasoning about a language at a level sufficient to translate is generally harder than writing a nice compiler/interpreter, it's considered a hard problem for most languages. Even things like cfront (the old C++ to C translator that was written before there were any true C++ compilers) were a huge effort, and it was translating between closely related languages (meaning there's little work needed to try to line up different fundamental concepts using some kind of a runtime or specialised types).

If you really want to do a bulletproof job at this kind of task, you're talking about doing bounds-checking and automated proofs of the characteristics of a program akin to figuring out the entire possible state diagram of a program, and mapping that (hairy) structure onto a new language. It's at least considerably harder than writing a really good optimising compiler for one or both languages (and involves many of the same steps).

Of course, some languages are going to be easier than others - languages that are "close to the metal" tend to have more "gotcha"s, and more complex/ambiguous specs. Also, languages that are easier to optimise (like Fortran) also would probably be easier to translate.

(no subject)

Date: 2009-02-26 11:19 am (UTC)
From: [identity profile] mapjunkie.livejournal.com
It depends on the Lisp. Some toy Lisp sure, Scheme likely, Elisp maybe, Common Lisp unlikely.

(no subject)

Date: 2009-02-26 06:31 am (UTC)
From: [identity profile] bhudson.livejournal.com
Translating is pretty easy; fundamentally, that's what a compiler does (typically from the source language to an intermediate representation or three, then to assembler code, which gets assembled into machine code, possibly for a virtual machine, which then needs to be turned into native machine code at runtime).

Occasionally, someone feels the need to translate a big glob of code into another language, but that's pretty rare. The main example I can think of is f2c.

Writing compilers for a new machine is pretty easy these days, because all you do is take an existing back-end and modify it. Unless you're doing something totally revolutionary, this will work.

Automatic parallelization is hard because often the natural way to write an algorithm has no parallelism to exploit, and you need to change the *algorithm* to get anywhere (and not just rearrange a few instructions).
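A minimal illustration of that distinction (in Python, not from the comment): a loop-carried dependence blocks naive parallelization, while an independent map does not:

```python
def prefix_sums(xs):
    # Loop-carried dependence: each iteration reads the accumulator
    # written by the previous one, so this loop as written cannot be
    # split across processors by mere instruction rearrangement.
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)
    return out

def squares(xs):
    # No dependence between iterations: each element is computed
    # independently, so this map could be partitioned across processors.
    return [x * x for x in xs]
```

Prefix sums can in fact be computed in parallel, but only via a different algorithm (a parallel scan), which is exactly the point: the translator would have to change the algorithm, not just reorder instructions.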

(no subject)

Date: 2009-02-26 06:31 am (UTC)
From: [personal profile] ikeepaleopard
To automatically parallelize code you have to guarantee that the order of execution doesn't matter, which is, in general, undecidable (as hard as the halting problem). Furthermore, this is easy to approximate in basically no languages, since they tend to encourage effectful programming. People actually do stuff like this for Haskell, but naturally, you can't do anything effectful in those cases. There are people who claim, for reasons like this, that in order to program the massively parallel machines of the future we'll need to use pure languages. Also, whether you actually want to parallelize smaller bits of code depends on weird stuff like pipeline depth, cache sizes, and various OS implementation details.

(no subject)

Date: 2009-02-26 01:39 pm (UTC)
From: [identity profile] mapjunkie.livejournal.com
Not to mention that the cost of being wrong can mean more than the pipeline delay of a missed branch, but may entail synchronized intercommunication across processors.

(no subject)

Date: 2009-02-26 03:08 pm (UTC)
From: [personal profile] neelk
Actually, Linj is not Common Lisp. It's basically Java in Lisp syntax, and it's that way on purpose.

The reason translating between languages is a PITA is that you have to do tons of impedance matching. For example, Python, Perl and Ruby are all basically the same language, but it's a gigantic pain to translate code between them, because all of their basic data structures have slightly different semantics -- for example, hash tables in Python don't work like hash tables in Perl don't work like hash tables in Ruby, but only at the corner cases. So when translating, you have to put in a bunch of checking code to handle all the corner cases, and this makes the translated code very very s l o w.

As a result, people don't do it much.

(no subject)

Date: 2009-02-26 05:20 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
hang on... the corner cases are where they disagree, right?

can you give an example of this? I'm very curious.

<< So when translating, you have to put in a bunch of checking code to handle all the corner cases, and this makes the translated code very very s l o w. >>
and how do human programmers do any better?

(no subject)

Date: 2009-02-26 06:32 pm (UTC)
From: [identity profile] simrob.livejournal.com
Language Lang1 has vectors that don't offer a length operation, but where vector lookup will either return NONE or SOME i.
sub1 : intvector * int -> int option
Language Lang2 is designed to have very efficient exceptions, and throws the exception Subscript when an out-of-bounds lookup is caused.
sub2 : intvector * int -> int
Language Lang3 has a length function, and does no array bounds checking.
length3 : intvector -> int
sub3 : intvector * int -> int


I know how to write the function in all three languages that returns -1 for an out-of-bounds array access. I don't know how to automatically translate between the three: it's probably doable, but my brain can do it better. Note that I haven't even used three "languages" here - this is all ML (with unsafe arrays enabled)
(* Language 1 *)
fun mylookup1(vec,i) = 
   case sub1(vec,i) of 
     NONE => -1
   | SOME i => i

fun mylookup2(vec,i) =
   sub2(vec,i) handle Subscript => -1

fun mylookup3(vec,i) = 
   if i >= length3(vec) orelse i < 0 
   then -1 
   else sub3(vec,i)
The three different implementations of bounds checking are the "corner cases."

(no subject)

Date: 2009-02-26 07:47 pm (UTC)
From: [personal profile] neelk
Okay, here's an example. In Python, if you do a hash lookup on a key that's not in the hash, an exception is raised. In Ruby, a nil value is returned. So if you were automatically translating from Python to Ruby, you would need to wrap each hash table lookup with an if-then-else that looked for nils, and then raised an exception. This is, in general, very slow.
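The same mismatch, sketched in Python (going in the other direction, Ruby to Python): a translator would have to replace every plain dict lookup with a wrapper that reproduces Ruby's nil-returning behavior, since Python's h[k] raises KeyError on a missing key:

```python
def ruby_hash_lookup(h, k):
    # Emulates Ruby's Hash#[] semantics in Python: return None ("nil")
    # instead of raising KeyError when the key is absent.
    return h.get(k, None)

h = {"a": 1}
print(ruby_hash_lookup(h, "a"))  # 1
print(ruby_hash_lookup(h, "b"))  # None, where plain h["b"] would raise KeyError
```

A translator has to emit this wrapper at every lookup site, because it cannot know which keys will be present, and that per-site checking is the slowdown described above.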

When humans translate programs, we have two advantages. First, we know what the program is supposed to do, and can avoid putting in such checks if (for example) we know that a hash table is always supposed to contain a certain key. Machine translation can't do this, because they don't know anything about what programs are supposed to do at a high level. Second, we often know in what contexts a particular program might be used, which we can use to constrain what values will flow to a subprogram, and then do optimizations. A translator can't make any assumption about how a function will be used -- it's supposed to generate a translation that always works, rather than just works in all the cases that will happen in this particular program.

A lot of the bugs that arise in a human porting effort are when we screw up on those assumptions.
