November 29, 2003

What the heck is: A type

Update: I forgot to finish the type inferencing section. D'oh! Fixed.
It's a good excuse for a fight, at least in some circles. And papers. Lots and lots of papers.

But, more usefully, a type is a notation on some piece of data that marks a promise as to how that data behaves. (If you're looking for sheep, try elsewhere) They may be an indication of the storage that a data needs, the class an object belongs to or inherits from, or the values that data may take.1

Types, when used well, allow you to avoid, or at least reduce the incidence of, a whole class of errors in your programs, so called "type errors". These occur, as you've probably figured, when you're using the wrong type--trying to call a method on an integer, or do addition with two filehandles or something like that.

With a properly designed type system it's possible to prove at compile-time that type errors are impossible in your program. How useful this is depends both on how rich your type system (if the only type you have is "int", it's of limited use) and how much use your compiler can put the type information.

Static and Dynamic Typing

These two are the poles (or perhaps lightning rods) of type checking. Simply put, static typing assigns and checks types at compile time, while dynamic typing assigns and perhaps checks types at runtime.

Before going any further, it's important to realize that Pure Static and Pure Dynamic typechecking are two ends of a continuum. With a very few, generally pedagogical, exceptions no language is on one end or the other. Pretty much all languages fall somewhere in between. Whether this is a good or bad thing is a matter of opinion, but we'll dodge that question for now.

Note that a language can be statically typed yet still have some level of runtime dynamism, such as function or method redefinition or instantiation, reference or pointer based function or method calling, and runtime compilation of code. The language just needs to ensure that any assertions made at compile time remain correct in the face of runtime changes--function signatures may not change with a redefined function, for example, and in the face of compile time uncertainty (such as dispatching through a function pointer of unknown origin) that the constraints of the type system are enforced at runtime. More on this later.

C is an adequate, or at least nearly ubiquitous, example of a statically typed language. The types of all variables must be declared, the return types of all functions must be declared, and the parameters to all functions must be declared. (That it's possible to lie in many profound ways to the C compiler is a separate matter we'll deal with in a bit) The compiler knows how everything should be treated, and can act accordingly. For better or worse.

Python (and Ruby, and Smalltalk), on the other hand, is a good example of a dynamically typed language. In python the types of variables, values, parameters, functions, and methods are not known to the compiler, and in fact may change as the program runs. (Indeed, with these languages there arguably aren't any "variables" as such--just names which are bound at runtime to different things as the program runs.2) In languages such as this, if the type of something is important generally the programmer has to explicitly check.

Before going off on the "But the behaviour of an object is more important than its type!" rant, assuming you conflate classes and types (which is OK by me), I should note here that with a sufficiently rich type system the behavior of a thing is reflected in its type. Which is to say that methods and interfaces that a class implements are as much a part of its type as the class' parents or the class itself. That's partial types, and we'll go there later. Very few languages support them as part of the static type information, unfortunately.

Strong and Weak Typing

Another quality of a type system is its strength--basically an indicator of how easily and to what extent your program can lie to the type checking and validation code and get away with it. A type system is considered strong if you can't lie, and weak if you can lie. (And broken (or at least laughed at by its peers) if it starts out strong and then gets officially hacked as time goes on to allow it to be weakened. A lot of languages fall into this category, as practicality asserts itself)

C is a good example of a weakly typed language. Yeah, sure, C has data types, but they're more suggestions, or indications of current usage, than anything else. As an easy example, look at this:

  int foo;
double bar = 12.5;
foo = (int)bar;

It's a simple enough little chunk of code. foo is an integer, and bar is a double-precision floating point value. A straight assignment from one to the other is arguably illegal, as they're different types. (C will allow it, but it generally yells) That (int) bit in there, though, tells C to treat what follows as an int, rather than whatever type it is.

This is generally called type casting and there are two ways to do it. In this case, typecasting performs a conversion--C knows how to get the integer value from a floating point number and does so. With a conversion bar never loses its inherent float-ness--the bits that the name refers to are still considered part of a floating point number, we just then translate it to an integer. Casts can also change the way that the bits themselves are interpreted, for example:

  int foo;
double bar = 12.5;
foo = *(int *)&bar;

will, rather than treating bar as a float and convert it to an integer, assume that bar is, in fact, really holding an integer and interpret the first four bytes of it as if it were an int. (More or less 4 bytes at least, depending on the size of an int) We have to play pointer deref and cast games here because C won't let you directly change the interpretation of the underlying data storage for a variable, which is probably for the best.

Perl, oddly enough, is a good example of a strongly typed language. (The rich--and arguably excessive--behaviour of its base types tends to confuse folks on that account) A scalar is a scalar is a scalar--no matter what you do you'll never convince perl to treat it as an array, and when you have a reference to an array you can't, unlike in C, treat it as a reference to a hash or scalar or filehandle. There's no casting possible, either directly or through references.

Most languages that do the C-style "interpret the bits differently" cast are low-level languages operating on either primitive types or explicit relatively primitive structures. Once you get abstract enough (and most OO languages are abstract enough) this low-level direct bit hitting stuff just isn't allowed as there's no way for it to be useful without an understanding of a number of very low-level implementation dependent things (like class and object layouts) that's outside the range of the normal. (Not to mention the fact that the folks writing those sorts of languages tend to be opposed, on general principles, to that sort of low-level access)

Higher-level type casting as conversion is more commonly supported, as its often very useful--if for no other reason than to be able to turn strings (like, say, from user input) into structured data (like real integers, floats, or widgets). Opinions seem mixed as to whether this is a good thing or not, but practicality usually wins and it is supported, though generally it needs to be explicit. Very few languages will implicitly autoconvert or call autoconversion routines for you since it's assumed that if you're assigning a value of one type to a variable of another that you've made a mistake and its up to you to tell the compiler otherwise.

Partial Types

As we alluded to earlier, a type system can be far richer than just a plain "what class does this object belong to" or "Which structure are we allocating for you" thing. Types can also express behavior--what methods you have or interfaces you export--and ranges of permissible values. Since this means a full type may have more than one "thing" that defines it, you have cases where you don't really care about the full type--just part of the type.

For example, let's consider GUIs for a moment. It's common when doing GUI programming in a language that supports objects to have a class for each thing that can appear in a window, classes for windows, classes for events, classes for sounds, classes for actions--classes all over the place. As we're in an OO mood, each class is a type.

Now, consider the case where we're doing at least some level of static typing in our OO GUI language. The common thing to do, in a static type system, is to require each parameter to be a class or child of a class, which is fine, but... wait until you find yourself building objects that need to inherit from several parent classes at once. This isn't too tough to do--just throw serialization into the mix, and you'll immediately feel the pain. (The drawing methods require the parameter to inherit from some base Drawable class, and the serialization methods require their things to inherit from the base Serializable class)

The common way to deal with this is to ignore the heck out of all but one type constraint (Generally Serializable things take generic objects and probe methods at runtime) but, well.. ewww. The second-most common way to deal with this is multiple inheritance, but folks tend to get their rant on when dealing with MI. Partial types solve the problem. (And MI is a particular case of partial typing but Shh! don't tell)

With a type system that has partial types you declare that your drawing method must have a type of Drawable, but that doesn't preclude the object from having a type of Storable, Readable (if you've got a text-to-speech system), and maybe PhenomenallyUgly. (Though, with GUI systems, the last type is often a freebie)

With a type inferencing system (we'll talk about that in a bit) often you'll get partial types whether you know it or not, as they're handy things for optimizers--any information, even incomplete information, is better than nothing.

Runtime type checking

With some languages, compile-time typechecking is insufficient--there just isn't enough information at compile time to be sure that the rules of the type system are being followed. (This is especially true of type systems that include the value as part of the type) Languages like this generally provide runtime type checking, either automatically or via calls you can make. Runtime type checks are exactly what they sound like--as your program runs it checks the types of variables and values that get passed around.

This does, as you might expect, have a cost associated with it, since checks like this aren't free. (They're often quite cheap but even cheap adds up when you do it a million times or more) Runtime type checking also isn't restricted only to languages with some measure of dynamic typing associated with them--even the terribly strictly typed languages can have it. Rather it tends to be associated with languages with some measure of strong type checking. It's difficult to write a broad range of Interesting Programs without having at least some interface to the real world, and the real world tends to be a messy, typeless place. Runtime checks are used at the borders to make sure the data coming in conforms to the type information it's declared.

The more dynamic languages do tend to have more runtime type checking, though, since the more dynamically typed a language is the harder it is to check at compile time that the type invariants are enforced. If you can't tell, and you want to be sure, you have to resort to runtime checks, throwing type errors or exceptions if the invariants vary.

You'll often see, in dynamically typed languages (as well as statically typed languages with insufficiently expressive type systems), hand-rolled type checking at the start of functions and methods to make sure that the invariants the programmer wants expressed are expressed. Personally I've always disliked those, as they're both messy and error-prone, but often there's no good way to do anything else in the language at hand.

Type Inferencing

Type inferencing is when the compiler figures out what type most variables are and what functions return without the programmer necessarily having to explicitly type variables or functions. It's one of those neat tricks that a compiler can do because there's so much redundant information in most typed source code.

What the type inferencing engine does is trace the source of your program and makes type annotations based on the code, and uses these annotations to find the places where you've made type errors, without you having to actually put type notations everywhere.

The first thought that might spring to mind is "what about that pesky halting problem thing?" and yes, it's generally impossible for a type inferencing engine to be 100% correct for 100% of the programs you'll feed it, but that's fine--it doesn't have to be perfect to be useful. For example, this:

sub foo(a, b) {
declare c;
c = a.some_method(b);
return c**4;
}

q = foo(1, 4)
print substring(q, 1, 4);

That code has two type errors in it. First, foo returns a number of some sort, but we're doing a substring on it--unless we've got a language with lots of autoconversion that's not allowed. Second, the first parameter to foo must be an object, because we're calling a method on it, but when we invoke foo we're passing in an integer, and unless we've a language that assumes everything's an object that's not allowed.

It's clear, by inspection, that those errors exist, but in a dynamic language they won't be caught until runtime. Or... will they? A type inferencing pass on the source will pick them up, and they can be reported at compile time. That's a good thing, since generally the earlier an error is caught the better.

Of course, you can throw some uncertainty into the mix:

sub foo {
if (rand(4) >2) {
return "a";
} else {
return 12;
}
}

Still, that's OK. With a simple type inferencer it can just decide that the return type of foo is unknown and let that chunk of uncertainty flow through, in which case you're no worse off than if you'd not had the inferencing engine. Or... Very Clever inferencing engines can see that it returns either an integer or a string and, while it can't be sure which, it can certainly know that if you use the return value as, say, a matrix that your code's broken.

Types and objects

Static typing gets somewhat interesting when objects are involved. Or, rather, they get somewhat interesting when subclassing is involved, if the type notation on a value might indicate just a parent type rather than the actual type.

If your type system allows for subtypes and your object system allows child methods to have different method signatures from the parent methods they obscure, it makes it very difficult to do full static type checking. In this case you have to resort to runtime type checking if you want to maintain those type invariants, which has a performance penalty, and tends to make it more difficult to make sure your programs are right to begin with.3 This is one of the reasons the more dynamic OO languages tend to not bother with type checking, as without a good type inferencing system you very quickly get to the point where there's so much uncertainty that types are more hindrance than help. (Even with a good type inferencing system in somoe cases)

What's the point?

Ultimately, what the heck's the point of types? (Besides the excuse for an argument, of course) There are three big ones.

First, from a theoretical standpoint, they may, in the right system, allow you to prove your code is correct4 at compile time.

Second, they help catch some typos, provide a bit of help catching API changes (the ones that involve changing the types of the parameters, though that's arguably a Horribly Bad Thing), and provide something of a memory crutch for using libraries. (I personally find that last one useful, since I find details like argument ordering occasionally dribble out of my brain for less-used functions)

Third, they help the optimizer, and this is the one I like the best. Having types available at optimization time can mean the difference between knowing that foo is a generic thing and foo is a platform-native integer, something that can mean an order of magnitude or more difference in speed of use and memory usage. Not, perhaps, a huge difference for one variable, but it's something that can add up quickly.

In object-oriented languages typing also helps tremendously with method lookup. Code like this:
b = a.foo
can have the foo method pre-looked-up with a typed system. Without sufficient information at compile time, methods lookups all have to be be deferred to runtime. That means that every time the code snippet above is executed the runtime has to take the string "foo" (or perhaps the magic method token that corresponds to "foo", if your language can do that) and search the class hierarchy looking for that method. A good method cache can let you skip all but the first heirarchy search for a method, but you can't skip them all, and a lookup, even with a precomputed hash of the method name, is somewhat expensive.

On the other hand if, at compile time, you know the structure of a class hierarchy and the types (even just the base type) of a variable, you can build a vtable for each class, with one slot per method, and hang it off each object (or the class, at the cost of an extra dereference) and the runtime cost of the method dispatch is just an indirect array access, which is definitely a lot cheaper than any sort of string lookup. All you need to do to make that work is make sure that methods of the same name are always in the same slot for each branch of the tree, which isn't tough. (Just a bit of compiletime bookkeeping, something computers are reasonably good at)

Good optimization, however, doesn't require the source to be statically typed, it just requires the parsed representation of the source to be typed. (Type information reduces the uncertainty in a program, and uncertainty is the enemy of optimization) It's possible to have a dynamically or minimally typed source language that still has some type information made available to the optimizer--this is one of the places where type inferencing is really, really handy.

Why the fights?

Well, bluntly, because some people like fighting, and this is a good excuse. That's not a helpful answer, though.5 The battle is generally between the tinkerers and the mathematicians. Tinkerers (and hackers in the classic sense fall into this group) just like to get going, fiddle around, do their thing, and have the compiler get out of their way

Which is the right answer? Well... it depends, of course. Depends on the people involved, as the differing styles of typing appeal to different people. It also depends on the problem, since some problems require more stricture than others. (I, for one, want the code running a plane I'm in to have all the checks run on it as possible) And it depends on the performance requirements you have, since the more statically typed a language is the faster the code can be. (Not necessarily will be, of course, but the less uncertainty the optimizer has the more it can optimize) There really isn't One Right Answer. That'd be too easy. :)

1 Yes, a type could be "even numbers between 1 and 6" or "words that don't have the letter Q in them", though value based type systems are reasonably uncommon.
2 That's picking nits, and those "things" are really variables. But this isn't the place to get into the differences between names, variables, and values
3 You have to have a really good set of unit tests to catch this stuff in development or testing and, while unit tests are handy they're sometimes difficult to get 100%, and can't always check things right in really complex systems. Not that I dislike unit tests--I think they're great. They just solve a different, albeit partially overlapping, problem domain than types
4 For a very restricted definition of correct. Arguably anything's better than nothing.
5 Well, OK, it actually is a helpful answer, but for a more general problem.

Posted by Dan at November 29, 2003 03:25 PM | TrackBack (2)
Comments

Hi, Dan!

Great blog! Thank you! Just a minor typo in a link.Probably it should have been http://www.graphic-design.com/news/index.html ?

Cheers,

maik.

Posted by: Maik Schmidt at November 30, 2003 03:26 AM

D'oh! Good catch, fixed.

Posted by: Dan at November 30, 2003 08:13 AM

"The static people talk about rigorously enforced interfaces, correctness proofs, contracts, etc. The dynamic people talk about rigorously enforced testing and say that types only catch a small portion of possible errors. The static people retort that they don't trust tests to cover everything or not have bugs and why write tests for stuff the compiler should test for you, so you shouldn't rely on only tests, and besides static types don't catch a small portion, but a large portion of errors. The dynamic people say no program or test is perfect and static typing is not worth the cost in language complexity and design difficulty for the gain in eliminating a few tests that would have been easy to write anyway, since static types catch a small portion of errors, not a large portion. The static people say static types don't add that much language complexity, and it's not design "difficulty" but an essential part of the process, and they catch a large portion, not a small portion. The dynamic people say they add enormous complexity, and they catch a small portion, and point out that the static people have bad breath. The static people assert that the dynamic people must be too stupid to cope with a real language and rigorous requirements, and are ugly besides.

This is when both sides start throwing rocks." - Quinn Dunkan

"All that rigid type safety and data hiding is like wearing army boots on the beach: nothing can bite your toes, but golly don't it feel good to just toss 'em and run barefoot." - David Pinn

Posted by: Simon Brunning at December 1, 2003 08:48 AM

... when you have a reference to an array you can't, unlike in C, treat it as a reference to a hash or scalar or filehandle. There's no casting possible, either directly or through references.

Which explains why this doesn't work:

$arrayref = [ "this", "is", "an", "array" ];
print join(",",keys(%$arrayref));

... but not why this does:
$arrayref = [ "this", "is", "an", "array" ];
%$hashref = @$arrayref;
print join(",",keys(%$hashref));

Surely there's a cast going on in the middle there?

Posted by: Yoz at December 1, 2003 10:00 AM

That's not a cast, it is more like hash = array.as_hash hidden under some syntactic sugar.

Posted by: malte at December 1, 2003 11:51 AM

There's nothing odd about what you're doing--the @$arrayref is flattening the array on the RHS, and assigning that list to the hash on the left. Which is fine. You can assign a list to a hash in perl, and when you do it takes the list as a set of key/value pairs and initializes the hash.

Posted by: Dan at December 1, 2003 01:06 PM

What about:

$my_string = '{ 1 => 2, 3 => 4 }';
%my_hash = eval $my_string;

That looks more like a cast to me. Albeit, it's only ever a "string to something else" cast, but still...

Posted by: michael at December 1, 2003 07:15 PM

Nah. It's an error, which you'd see if you had warnings on:

lir:~ dan$ perl -w
$my_string = '{ 1 => 2, 3 => 4 }';
%my_hash = eval $my_string;
Name "main::my_hash" used only once: possible typo at - line 2.
Reference found where even-sized list expected at - line 2.

Posted by: Dan at December 1, 2003 07:30 PM

Bah. Teaches me to post without testing.

Posted by: michael at December 1, 2003 09:54 PM

Any chance of a "What the heck is: a partial type"?

Failing that any refs on partial types would be much appreciated. It's not a term I've come across before and my own ignorance annoys me :-)

I've been googling a bit and not found anything yet that seems to define them in terms my poor coffee addled brain can accept - and I assume that what Microsoft are talking about for the next version of C# isn't what you mean?

Posted by: Adrian Howard at December 2, 2003 06:09 AM

I can talk about partial types at some point, sure. (It's distinctly possible I got the name wrong as well, which might be why you can't find anything on it. I'll go check)

FWIW, the Microsoft "partial type" thing is more a "I've defined this class in multiple files" which is something entirely different from what I'm talking about. I presume it's uncommon and considered a Nifty Idea--hard to say, since it's the sort of stuff I do with some regularity in Perl... :)

Posted by: Dan at December 2, 2003 12:18 PM

As far as Perl being strongly-typed (and particularly people getting confused by the rich behaviour of its base types) goes, I'd say that it's not so much the behaviour of the base types, as the nature of them. People don't tend to think of "scalar" as a type, but do think of "string" as a type. So "12"+12 giving 24 looks weakly typed (you can't add a string to an integer) rather than strongly typed (you can add any two scalars together).

Posted by: Paul Moore at December 2, 2003 03:42 PM

''sub foo(a, b) {
declare c;
c = a.some_method(b);
return c**4;
}

q = foo(1, 4)
print substring(q, 1, 4);

That code has two type errors in it.''

Can't agree with that. In say Ruby everything is an object so you can put methods for integers.
Also ** operation can be overloaded to return a string.

Posted by: Mike Rovner at December 3, 2003 06:06 PM

Me thinks taht both camps ( pro-type and pro-scripting ) miss one thing. Here goes ;-)

Types are like assertions. It takes time to put an assertion, but without the assertions the code is 'less reliable'. Put less assertions in your code == "bad code". Put more == lot's typing, more code. "Bad code" again. Can't win either way.

However. There is a solution.

The truth is that the source code has (at least) two important stages. At some point the code is immature. Then it becomes mature.

First you write immature code.

For that fuzzy stage it is better if the language is typeless. Why bother with int a = 0, when I can write it just like a = 0 ?

In this stage (which *is* important) I want my alghoritms terse and I don't want *any* assertions to distract me.

However, after the component had matured, it makes sense to start inserting assertions (types included).

In 'mature' stage you *have* to insert as much assertions as possible, because the component would be used by other coders, they would make mistakes e t.c.

Anybody who've debugged some module from CPAN knows what I'm talking about.

So - after the code matures - all of a sudden it becomes good for language to be typed (assertions).

So we need like ... two languages ???

Actually yes.

But now - look at XML. First you write no DTD (in XML world, the DTD determines a *type* of the XML document), you just write the document per se. Just being 'well-formed' is enough.

Then after the structure of teh document becomes mature - it is time to write a DTD and bless the (*same*) document with 'type'.

Some parts of this vision are reflected in one of Larry Wall's proposals. As far as I remember, one can go with my $var as int - 'as int' being optional.

Only I think this way of 'blessing the code with types' kinda sucks, comparing it to 'DTD alike' way. Blessing with DTD implies changing only one line of original XML document, not every declaration.

In that sense "CSS is better than XSL", if you get my drift.

This brings us to the idea of 'aspects' e t.c. (types being just one of aspects of source code, but there are more aspects - such as 'debugging' for example), the way aspects can be attached e t.c. e t.c. But I better stop.

Just wanted to say that all the languages out there try 'balance things' to have one language, when in fact they kinda need two languages (because of two stages).

That is why I think that modern programming languages suck. All of them ;-) Just kidding. That was yet another simplification.


Posted by: PaulT at December 5, 2003 10:01 PM

For that fuzzy stage it is better if the language is typeless. Why bother with int a = 0, when I can write it just like a = 0 ?

In the ML family, the types are inferred. If you declare a = 0, then a is an int. If it's declared a = "0", then it's a string. And so on for the other types. It completely eliminates the need for littering your code with int and float and so on. I must wonder why more typed languages didn't do this.

From what little I've worked with OCaml, it seems to work quite well, and any errors you get from the type system actually point to real bugs in your code.

Posted by: TMurray at December 8, 2003 12:47 PM

Thanks for a good introductory article on types, Dan: it's hard to find a brief, all-in-one explanation about types.

One thing I think which should be explained a bit more are modern type systems: you point out at the start of your article that there are "lots and lots of papers" written about types, but don't point out that there are languages such as ML and Haskell which have these advanced type systems that implement alot of what those hevay-hitting papers talk about.

Mark Jason-Dominus (who is well-known in the Perl community) gave an introductory talk on ML's type system, and gave one of the best examples of how such a type system is really useful in practice: in his example, ML's type system actually detected an infinite loop in a merge sort algorithm! With a sufficiently powerful type system, you can even go so far as to guarantee that all your AVL tree functions will never leave it unbalanced: and the great thing is, type inference relieves you from the burden of declaring your types since the compiler infers it all for you.

I think it's also worth pointing out that features such as ML's functors and Haskell's type classes allow you to express OO-style polymorphism while still retaining the advantages of a static type system. OO programming lends itself much better to dynamic typing: witness the conciseness of writing code in Python or Smalltalk vs the extravagant need for casts in statically-typed OO languages like Java and C++.

The one key thing I like about modern type systems employed by languages such as ML and Haskell, is that the type system is often so good that it's very rare that a program doesn't work as you intended if it manages to compile. That's a quality I miss dearly when I work with dynamically typed languages such as Perl and Python.

Posted by: Andre Pang at May 16, 2004 02:59 PM