
DWIM vs. Small Semantic Core

:: R, Programming Languages

So, I’d like to do some statistical analysis. I hear that R is really good at this. Let’s download it and take a look.

(Ten minutes later)

AAAHHH! MY EYES! THEY’RE BLEEDING!

What about Matlab? It’s the same story.[1] As a programming languages person, these languages make me … well, angry.

Why?

Well, after thinking about this for a while, it seems to me that what I hate most about these languages is their complete lack of a small semantic core.

Take a language like Racket, JavaScript, Java, or C. These languages don’t have a heck of a lot in common, but they share a small semantic core.

Is this all just library design? Most of the things I really hate can easily be constructed in any dynamic language through a suitable application of

Terrible Library Design (tm)

… except that when it applies to things like vector dereference, it feels like fairly ‘core’ syntax.

Example time! First, R does this crazy thing in distinguishing logical from numeric vectors.

> a
[1] "a" "b" "c" "d"
> a[c(2,4,3)]
[1] "b" "d" "c"
> a[c(FALSE,TRUE)]
[1] "b" "d"

In the first of these two array derefs, we’re using the numbers in the index vector to decide which elements of a to take. In the second case, though, the index expression is a ‘logical vector’, so it is tiled to the length of a (R calls this ‘recycling’) and used as a mask: each TRUE keeps the corresponding element, and each FALSE drops it.
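To make the second case less magical, here is a rough sketch that spells out by hand what the logical index seems to be doing. It assumes a is the four-element character vector printed in the transcript above; the names mask, idx, by_position and by_mask are mine, introduced only for illustration.

```r
# Assumption: `a` matches the vector printed in the transcript above.
a <- c("a", "b", "c", "d")

# Numeric index: each value names a position to take, in the order given.
by_position <- a[c(2, 4, 3)]

# Logical index, spelled out by hand:
# 1. tile the mask to the length of `a`
mask <- rep_len(c(FALSE, TRUE), length(a))  # FALSE TRUE FALSE TRUE
# 2. turn the TRUE entries into positions
idx <- which(mask)                          # 2 4

# The built-in logical index should agree with the hand-rolled version.
by_mask <- a[c(FALSE, TRUE)]

print(by_position)                 # "b" "d" "c"
print(by_mask)                     # "b" "d"
print(identical(by_mask, a[idx]))  # TRUE
```

So one pair of square brackets is standing in for two different operations, selection by position and selection by mask, and the type of the argument picks which one you get.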

If you imagine this as part of a language semantics, you’d see this horrible side-condition attached to these rules, where array deref’ing works in totally different ways depending on the kind of argument it gets.

To say nothing of the silent tiling, which seems like an open invitation to horrible bugs.
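Here is the kind of bug I have in mind, as a small sketch. The mask is one element too short (a made-up mistake), and R tiles it anyway without complaint. The strict_pick helper is hypothetical, not part of R; it is just one possible way to insist that the lengths match.

```r
a <- c("a", "b", "c", "d")

# A mask that is one element too short (hypothetical mistake).
short_mask <- c(TRUE, FALSE, FALSE)

# R tiles it to TRUE FALSE FALSE TRUE, so the fourth element sneaks in.
picked <- a[short_mask]
print(picked)  # "a" "d"

# Hypothetical helper: refuse any mask whose length differs from the vector's.
strict_pick <- function(x, mask) {
  stopifnot(is.logical(mask), length(mask) == length(x))
  x[mask]
}

# A full-length mask goes through; strict_pick(a, short_mask) would stop with an error.
print(strict_pick(a, c(TRUE, FALSE, FALSE, FALSE)))  # "a"
```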

But wait, we can compound this problem with some nasty coercion:

> a[c(4,c(FALSE,TRUE,TRUE))]
[1] "d" "a" "a"

What on earth is going on here? First of all, vectors get silently flattened, so that c(3,c(4,5)) is the same as c(3,4,5) — ugh — but then, the logical values are coerced into numeric ones, so the index vector that’s produced is actually c(4,0,1,1), which is then used to index the vector a. But why are there only three values? Oh, well, there’s no index 0, so let’s just skip that one, shall we?
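Taking that one step at a time, as a sketch (again assuming a is the four-element vector from earlier; the name index is mine):

```r
a <- c("a", "b", "c", "d")

# Step 1: nested c() calls flatten into a single vector.
print(identical(c(3, c(4, 5)), c(3, 4, 5)))  # TRUE

# Step 2: mixing numeric and logical values coerces the logicals to numbers.
index <- c(4, c(FALSE, TRUE, TRUE))
print(index)         # 4 0 1 1
print(class(index))  # "numeric"

# Step 3: a zero index selects nothing, so it silently disappears.
print(a[0])          # character(0)
print(a[index])      # "d" "a" "a"
```

By the time the brackets see the index, nothing logical is left in it, so the mask behaviour never even gets a chance to kick in.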

Honestly, I guess the real problem is in thinking of something like R as a programming language; it’s not. It’s a statistical analysis tool, with a rich text-based interface. After all, would I get upset if Photoshop used ‘b’ for blur and ‘s’ for sharpen and I couldn’t nest them the way that I wanted, using parentheses? Probably not.

And finally: apologies for everything I’ve written. I’ve used R for about fifteen minutes, and this piece is really just me blowing off a small amount of steam. Not well written, not well thought-out. Meh.

  1. Actually, maybe not; I spoke with a friend yesterday, and I get the impression that Matlab may not be as horrible as R, here.