more indirection: haskell

Showing posts with label haskell. Show all posts

Friday, January 21, 2011

Thoughts on node.js and Haskell

I've been hearing chatter about node.js in the last few months. A few weeks ago, a colleague gave me a quick demo and I decided to take a closer look. I was impressed with how much scale it could handle without batting an eye, and I was especially impressed with the claim (wish I could find where I saw it) that you could replace a cluster of five or six thread-based web servers with a single node.js process and see comparable performance.

Many web servers (e.g. Tomcat, Apache) dedicate a thread to every open connection. When a web request comes in, a thread is either created fresh or, if possible, dequeued from a thread pool. The thread performs some computation and I/O, and the thread scheduler puts it to sleep when it blocks for I/O or when it exceeds its quantum. This works well enough, but as more and more threads are created to handle simultaneous connections, the paradigm starts to break down: each new thread adds some overhead, and a large enough number of simultaneous connections will grind the server to a halt.

node.js is a JavaScript environment built from the ground up to use asynchronous I/O. Like other JavaScript implementations it has a single main thread. It achieves concurrency through an API that requires you to provide a callback (a closure) specifying what should happen when any I/O operation completes. Instead of using threads or blocking, node.js uses low-level event primitives to register for I/O notifications. The overhead of context switching and thread tracking is replaced by a simple queue of callbacks. As the events come in, node.js invokes the corresponding callback. The callback then carries on its work until it either begins another I/O request or completes its task, at which point node handles the next event or sleeps while waiting for a new one. Here's a simple web-based "Hello, world!" in node.js from the snap-benchmarks repository:

As you can see, all I/O operations accept a callback as their final argument. Instead of a series of calls that happen to block, we explicitly handle each case where an I/O request completes, giving us code consisting of nested callbacks.

In addition to reducing the strain placed on a system by a large number of threads, this approach also removes the need for thread synchronization - everything is happening in rapid sequence, rather than in parallel, so race conditions and deadlock are not a concern. The disadvantages are that any runaway CPU computation will hose your server (there's only one thread, after all) and that you must write your code in an extremely unnatural style to get it to work at all. The slogan of this blog used to be a famous Olin Shivers quote: "I object to doing things computers can do" - and explicitly handling I/O responses certainly falls into the category of Things a Computer Should Do.

Anyway, this got me thinking about how Haskell could take advantage of the asynchronous I/O paradigm. I was delighted to find that others were thinking the same thing. As it happens, the new I/O manager in GHC 7 uses async I/O behind the scenes - so your existing code with blocking calls automatically takes advantage of the better scalability of asynchronous I/O. The Snap framework has published benchmarks that show it handling about 150% more requests per second than node.js:

But what's even nicer is that there's no need to contort your code to get this performance. Snap and the underlying GHC I/O manager completely abstract away the asynchronous nature of what I'm doing. I don't need to set callbacks because the I/O manager is handling that under the covers. I pretend I'm using blocking I/O and the I/O manager takes care of everything else. I get some defense against long-running computations (because more than one thread is processing the incoming events) and, most importantly, I can write code in what I consider a saner style.

The catch is that most I/O libraries are not designed for asynchronous I/O. So, network and other I/O calls made at the Haskell level will work appropriately, but external libraries (say, MySQL) are often written with blocking I/O at the C level, which can negate some of the scalability gains from asynchronous I/O.

Thursday, December 9, 2010

parallel-io

A few days ago, parallel-io was uploaded to Hackage. Previously, when I wanted to perform a number of I/O actions in parallel, I used some homegrown code that used forkIO and MVar's to run a bunch of I/O actions in parallel and collect their result.

The new package makes all that unnecessary. Its two most important functions functions are:

    parallel_ :: [IO a] -> IO ()  -- ignores results
    parallel  :: [IO a] -> IO [a] -- collects results into a list

I have some code that fetches a number of exchange rates in relation to the US dollar. I want to do this work in parallel, so previously I used my own combinators. Now it's as simple as:

    getExchangeRateToUSD :: String -> IO Double
    getExchangeRateToUSD = ...

    parallel $ map getExchangeRateToUSD ["CAD", "EUR", "JPY", "GBP"]

The map call returns a list of IO actions. parallel-io dutifully forks off a thread for each of these actions and returns a value of type IO [Double]. Nice and simple.

Sunday, September 12, 2010

Getting started with Leksah

I've been wanting to try Leksah for some time, but found that it doesn't give you many cues about how to get started when you first run it. Today, I sat down and got a simple example running.

Here's how to get a simple Haskell program running in Leksah:

Open Leksah. If this is the first time you're running it, just hit OK when asked for search paths and wait for a bit (possibly quite a bit) while it indexes your installed packages.
Click "Workspace | New Workspace" and specify a location for the Leksah workspace, which is a single file that will track references to the Leksah projects you'll create.
Click Module | New Module. This creates a new Cabal module under your workspace. The New Module dialog is looking for a folder which will hold the Cabal files. Create a new directory, navigate to it, and press Open.
Press the Save button at the bottom of the Package tab.
Write your code within the module you created. Note that Leksah will complete function names as you type. It will also repeatedly compile your package and show the output in the Log window to alert you to errors.
Use the package menu to run your code or to install it via cabal. Your output will appear in the Log pane.

Hope that's helpful!

Tuesday, August 24, 2010

Haskell Syntax Highlighting for Blogs

Can anyone recommend a good way to get Haskell syntax highlighting for blog posts? I've been using Gist, which works very well, except that Gists don't show up in RSS feeds.

Monday, August 23, 2010

Planet Haskell

I give permission to include this blog and its contents on the Planet Haskell aggregator.

Saturday, August 21, 2010

Word Wrapping in Haskell

Prompted by a question on haskell-cafe, here's a relatively concise word-wrapping implementation I wrote:

Update: Thanks to Yitz in the comments for pointing out that "any isSpace line" should be "any isSpace beforeMax"

Sunday, July 25, 2010

cabal unpack

If you want to browse the source code of a package on Hackage, "cabal unpack" is a useful command.

packages % ls
packages % cabal unpack web-routes-quasi
Unpacking to web-routes-quasi-0.5.0/
packages % ls
web-routes-quasi-0.5.0
packages %

Just pass it the name of a package and you'll get an unzipped tarball in your current directory. Very useful.

Thursday, May 27, 2010

readFile and lazy I/O

Recently, I came across a problem in a Haskell script I run frequently. Every so often, I drop a report file into a designated folder. Then I run my script, which peforms an operation like the following.

getReportFiles >>= map readFile

This worked fine for months - until this morning, when the program crashed with an error indicating that too many files had been opened.

The problem is that readFile uses lazy I/O. Generally, when we write code like getLine >>= putStrLn, we expect these calls to happen in order - indeed, that's one of the primary purposes of the IO monad. But readFile uses hGetContents internally, which is an exception to the strict I/O found elsewhere in Haskell. So readFile opens the file and then returns a thunk instead of actually reading the file. Only when the thunk is evaluated is the I/O performed and the file read into memory. And only when the thunk has been fully evaluated will the open file be closed.

So in my snippet, I was reading in hundreds of files as thunks. and until the full contents of the thunks were evaluated, the files all remained open. This was no problem until the number of reports I had to process reached a certain point and exposed the bug.

The solution in my case was to use Data.ByteString:

import qualified Data.ByteString.Char8 as BS eagerReadFile :: FilePath -> IO String eagerReadFile file = BS.unpack <$> BS.readFile file

ByteString's readFile method is eager, so you'll get back the complete file contents.

Update: In the comments, Chris points out System.IO.Strict, which has a strict readFile function that simply replaces Prelude.readFile.

Lazy I/O can be very useful: instead of reading in the complete contents of a large file, you can read it lazily using a function in the hGetContents family and then process it without having to read the entire contents into memory at once. But lazy I/O can surprise you if you're not expecting it.

(thanks to #haskell for pointing out the eager Data.ByteString.Char8.readFile)

Friday, April 23, 2010

The Brilliance of Maybe and the Value of Static Type Checking

I am sometimes surprised by the strokes of genius hidden inside Haskell's type system. For example: the Maybe type.

Maybe a is a parameterized type that can hold either the value Nothing, or a value Just a . Seems simple enough - it's an optional value.

But it goes deeper than that. Look at an object of type String in Java. This object's type is actually larger than String - it can either represent a string of characters or the special value null. Languages like Java, Ruby, and C# all make the mistake of allowing any type at all to hold null. This greatly diminishes the value of static typing, since any call of the form string.length() will sail happily past the compiler and then throw the dread NullPointerException at runtime if passed a null. The compiler will never see it coming, and your program could work for some time before you discover the problem.

But in Haskell, we must be explicit. By default, an object of type Integer can only hold an integral value - there is no null, and no possibility of a runtime null pointer exception. The Haskell compiler will not even produce a program if you try to pass Nothing into a non-Maybe type. A datatype cannot be instantiated with null values unless you explicitly allow it to hold them.

This eliminates an enormous class of errors. Anyone who has maintained a large Java application instinctively shudders at the prospect of an NPE - the value could have become null at any point in the program and resolving it requires extensive tracing.

Further, in order to avoid the possibility of an NPE, you must litter your code with distracting and unnecessary conditionals like:

if( foo == null ) throw new IllegalArgumentException( ..

if( foo.bar() == null ) throw new IllegalArgumentException( ... );

before doing any real work. But what's worse is that you're turning a programming error into a runtime exception. The fact that your program compiles does not guarantee that you will have no NPE's and running it does not guarantee that you will hit every branch that might generate one.

In Haskell, it is simply impossible to write a program that is susceptible to NullPointerExceptions at runtime unless you explicitly make the type in question optional by wrapping with Maybe.