Thursday, March 16, 2006

 

Regular expressionescence

I felt startled when I read Tim Sweeny's language proposal (pdf). Perhaps that's not exactly the word - I felt vindicated. The sort of things he says he wants from a language are exactly the sorts of things I've been desigining in my experimental programming language, Flan. Not too surprising, I suppose, since the two of us are in the same field.

One of the key features of Flan is its type system. You can have type-checking applied statically or dynamically, and you can make types more-or-less as restrictive or as loose as you want. In the end, it turns out that you can use this one system to do Prolog or Haskell-style data deconstruction, SQL-style filtering and pattern-matching, Eiffel-style compile-time contracts, and a really readable version of regular expressions.
Although such flexible type systems seem to be in vogue lately, I've actually been working out the details of this system for several years now.
Let's start with a demo of Flan's static type declarations:

int @a = 0; // declare an integer variable a, which is initially 0.

To quickly summarize: the basic syntax is C. Comments are marked with // and /**/, assignment with =, statements delimited with ;. (Because as far as I'm concerned, if it ain't broke, don't fix it.)

As I've mentioned before, @foo means "create a new variable called foo". If you're used to Perl or Ruby it takes a little time to adjust to declaring variables with @foo and then using them with just foo - but I'm finding that the more I use this syntax, the more I like it. It turns out that it even helps people learn the language. You know when you look at a piece of code in an unfamiliar language, and you're not sure which identifiers are being declared, and which have been declared elsewhere? That simply doesn't happen in Flan.

Some more examples:

char @b = 'B'; // b can be any character - initially, 'B'. (The audience gasps!)
value @c; // c can be any value - character, number, list, whatever.
4 @d; // d is always the number 4.
'E' @e; // e is always the character 'E'.
"ffFF" @f; // f is always the string "ffFF".

Notice that compile-time constants don't need special treatment in this system: they're just variables with a very constrained type.

Flan supports lists as first-class values, denoted Python-style with square brackets and commas. (Yeah, that's different from what I said last time; bear with me.)

[4,3,2,1] @g; // the constant list [4,3,2,1].
[int,int] @h = [2,7]; // any list of exactly two integers; initially [2,7].
[int,1,int] @i = [200,1,-200]; // any list of 3 integers with 1 in the middle.

If you're making a long list, this gets annoying - you don't want to have to write [int,int,int,int,int...], so you can instead use :, the "repeat" operator.

[int:7] @j = [0:7];

...is the same as...

[int,int,int,int,int,int,int] @j = [0,0,0,0,0,0,0];

This is a very flexible system. For example, you can mix and match commas and colons...

[1:4, 2, 3:2, 4] @k; // the list [1,1,1,1,2,3,3,4]
['aci', 'e':8, 'd'] @m; // the string "acieeeeeeeed".


Aside: I hope you found that last example a bit startling. We're using the comma
operator to build a string out of three sequences of characters - 'aci', 'e':8 and 'd'. The last one is just a single character; the other two are structures known as "list fragments". They're almost, but not quite lists.

Some more examples of list fragments and the operators that generate them...
4,5
'h','e','l','l','o'
1:6 //equivalent to 1,1,1,1,1,1.
'hello' //equivalent to 'h','e','l','l','o'.
1..5 //the "range" operator: equivalent to 1,2,3,4,5.
'1'..'5' //equivalent to '12345'.
/[4,5] //equivalent to 4,5 - the / operator is
the inverse of [].


Some of these look a bit like tuples in Python, but actually they're rather different: for a start, list fragments cannot contain other list fragments, because the only way to merge them is concatenation. The only thing you can make is one big list fragment, not a hierarchy.

List fragments can be stored in a variable...
@n = 1,2,3,4;

...concatenated with the comma operator...
44,n,99,n,44 // equivalent to 44,1,2,3,4,99,1,2,3,4,44;

...and if you put a list fragment in square brackets, you get a full-fledged grownup list that you can index and do useful things with.
[n] // equivalent to the list [1,2,3,4].

Yep, this is the same "every value is a single-element list" philosophy that I talked about in the last post. Unfortunately, since then I've realised that people will find it confusing that both [1,2,3] and 1,2,3 are legal (but subtly different) syntax for lists, and it'll be a pain having to convert between the two when something is supplied in the wrong form, and so on. I had to pick one as the "default" syntax: [1,2,3] won. Now a list fragment is a list that's been deliberately disabled: to index it you have to put square brackets around it.

Ok, where was I...?



Oh yeah.
For even more flexibility, the colon operator can be followed by an arbitrary type instead of a number:

[char:int] @o = []; // a list containing any number of characters.
[char*] @p = []; // the same thing. (but more convenient.)

That * operator is just one of the operators that let you construct complex types...

4¦5 @q = 4; // can be either 4 or 5 - initially 4.
>0 @r = 1; // any number greater than 0.
>=0 @s = 0; // any number greater than or equal to 0.
[int:>=1] @t = [0]; // a list of one or more integers.
[int+] @u = [0]; // same thing.
!null @v = 45; // any value except 'null'.

You can also intersect types to make stricter types - just write them in series one after another.

!0 int @w = 4; // a non-zero integer.
>-100 <100 !0 int @x = 4; // a non-zero integer between -100 and 100.

And finally, all of these types are first class values: you can store them in variables, return them from functions, and so on. Flan's equivalent of a C typedef expression is just a normal assignment:

@natural = int >=0; // any natural number.
@vowel = 'a'¦'e'¦'i'¦'o'¦'u'¦'A'¦'E'¦'I'¦'O'¦'U'; // any vowel.

natural @y = 4; // a variable containing any natural number.
vowel @z = 'o'; // a variable containing any vowel.

Wow, that was a lot of code snippets. If it's a lot to digest, don't worry about it too much; for now let's move on to a discussion.

Practicalities:
The first question that probably comes to mind is - "Hey, didn't you call these static type declarations? How can you check a type like '>0' at compile time?"Well, there are several ways.
1) Simple case: for a literal value or constant, e.g. "7", the value is known at compile time and can obviously be checked.
2) Marginally less simple: If another variable is declared as the same type (or a stricter one), its contents are always guaranteed to be acceptable, so you can assign one to the other. And the same goes for function calls with an appropriate return type.
3) Explicit type cast: If you're certain that a variable will satisfy a given type restriction, the ".assert" method allows you to force it to be treated as that type. Here's an example of this used in a function declaration...

natural @abs(int @i)
{
..if ( i >= 0 )
....return i.assert(natural);
..else
....return (-i).assert(natural);
}

Yep, this is the equivalent of a static typecast in C. As Bjarne Stroustrup has said, the syntax (type)value for casts in C was a rather unfortunate choice. Slightly dodgy operations like this should not be quite so hard to search for and hard to see. Unfortunately, I feel C++ overreacted with static-cast<type>(value) - it's a ridiculously cumbersome construct. So far, value.assert(type) seems to be a good middle ground.

As the name implies, the program will halt (in debug builds) if a type can't cope with the value it's given; another feature lacking from C's casts.

4) Explicit type check: If you don't know whether a value satisfies a type, test it at runtime. For example:

natural @abs(int @i)
{
..if ( i.is(natural) )
....return i.assert(natural);
..else
....return (-i).assert(natural);
}

Every value supports the ".is" method - it's essentially the same as the "is" operator in C#. It takes a type as a parameter, and returns true or false depending on whether that type will accept the value.
Unfortunately, as seen above, once you've determined that a value is of a given type, you usually want it to be treated as a value of that type... but in a statically-typed language, you'll have to explicitly cast it. This is a pain in the arse.To simplify cases like this, Flan lets you declare a new variable inside your pattern-match expression:

natural @abs(int @i)
{
..if ( i.is(natural @n) )
....return n;
..else
....return (-i).assert(natural);
}

Even cleaner, if you're doing this with a local variable, a "where" statement can be used to temporarily narrow its type:

natural @abs(int @i)
{
..where i.is(natural)
....return i;
..else
....return (-i).assert(natural);
}

In effect, within the body of the where statement, the type of the variable 'i' becomes (natural int).
Frankly I'm not completely happy with the syntax of the "where" statement, but I'm definitely happy with how useful it is. As I develop the language further, I'm sure more conversion expressions will continue to suggest themselves.


Regular expressionescence:
We've now reached the point where I can demonstrate Flan's regular expression-like structures, as I mentioned earlier.

@teststring = "hello";

The expressions "foo", ['foo'] and ['f','o','o'] are all equivalent. In other words, this defines that teststring is a list of characters: ['h','e','l','l','o'].

if ( teststring.is([char*,'o',char*]) ) ...

Here we're defining a new type - "lists that contain some characters, then an 'o', then some more characters" - and testing whether "hello" is accepted by that type. Yes, it contains an 'o', so it's accepted.

The test above is pretty cumbersome. Simple stuff should be simple, so instead of "is", you probably want to write "contains":

if ( teststring.contains('o') ) ...

What if you didn't care about 'o', but wanted to look for any vowel instead? Simple:

if ( teststring.contains(vowel) ) ...
(This is using the 'vowel' type I declared earlier.)

What if you want to search for a particular sequence of characters, instead of a single character?

if ( teststring.contains('l','l') ) ...

Yes, "hello" does contain two consecutive 'l's.
In case you were wondering, no, the contains function does not have a form that takes two arguments. I hope you were paying attention earlier: it's actually taking a list fragment! Notice that there aren't any square brackets here. We didn't write:

if ( teststring.contains(['l','l']) ) ...

That would treat teststring as a list of strings, and look to see whether any of them was the string "ll". (The test would fail, because teststring isn't a list of strings.)

Ok, so far so good. What if you wanted to generalise this to accept any double letter? Well, I already showed how to declare temporary variables within a type. You can reuse these variables within the same pattern, like so:

if ( teststring.contains(char @x, x) ) ...

What if you don't care whether the characters are consecutive? No problem...

if ( teststring.contains(char @x, char*, x) ) ...

I could keep going, but hopefully you get the idea. (This post has already taken an unhealthy amount of time to write, anyway...)
Although the Flan type system can't compete with the terseness of conventional regular expressions (they're pathologically terse), I think it's significantly more readable - mainly thanks to the way it lets you compose an expression out of simpler expressions.

Thanks for reading! I'll leave you with some more examples of Flan code...

// when used in a function declaration instead of the return type,
// the "is" keyword signifies that it's actually a type declaration.
// The resultant type can only accept values for which the function
// body returns true.
is @even( int @n )
{
..n.mod(2).is(0);
}

even @x = 2;
even @y = 3; // compiler complains because 3.mod(2) is 1, not 0.

// in Flan, what would be a template in some languages
// is simply a function that returns a type.
// oneof([1,2,3]) is equivalent to the type 1¦2¦3.

is @oneof( [value*] @v )( value @x )
{
..v.contains(x);
}

// some useful types for characters:
@digit = oneof(['0'..'9']);
@lowercase = oneof(['a'..'z']);
@uppercase = oneof(['A'..'Z']);
@letter = lowercase¦uppercase;
@alphanum = letter¦digit;
@whitespace = ' '¦'\t'¦'\n'¦'\r';

// a type that matches strings from "0" to "255"
is @string0to255( [digit:1..3] @s )
{
..s.toNumber().is( >=0 <=255 );
}

string0to255 @str1 = "177";
string0to255 @str2 = "256"; // compiler complains

// a type that matches IP addresses
@ipAddress = (/str0to255,'.'):3, /str0to255;

// one (inefficient) way to implement the "contains" method we saw earlier.
bool value. @contains( type @t )
{
..return this.is([value*,t,value*]);
}

Comments:
Nice! Thanks for the mention.

-- raganwald
 
This comment has been removed by a blog administrator.
 
This comment has been removed by a blog administrator.
 
This comment has been removed by a blog administrator.
 
Post a Comment

<< Home

This page is powered by Blogger. Isn't yours?