The HTTP and Browser Modules

Summary: A significantly RFC compliant HTTP/1.1 implementation.
Version: 0.3
Author: Warrick Gray.

Download

You can download them zipped or unzipped:

And if you are going to use the Browser, you will need the following files:

These last three files are an implementation of the MD5 algorithm by Ian Lynagh, and can be found at http://web.comlab.ox.ac.uk/oucl/work/ian.lynagh/.

Module Descriptions

The modules are arranged into a low level HTTP module which provides connection oriented functions and the basic Request, Response, and Header data types. Sitting on top of this is the Browser module, which further insulates the user from the rigours of the HTTP/1.1 specification.

HTTP

This module provides the data structures:

Connection: Opaque connection object, used to maintain persistent connections through IORefs.
Request: The request object, the structure of which you will probably need to know. Has accessor functions of the form rqXXX, such as rqMethod, rqURI, rqHeaders and rqBody.
Response: The response object, with accessors rspCode, rspReason, rspHeaders and rspBody. I want to change this so that rspBody is some sort of reference; used with concurrency primitives, that could give a really nice asynchronous interface. Alas, GHC on Windows blocks on I/O. A promising way around this is to use the "would block" feature of the underlying socket interface... we'll see what happens.
Header: A simple data structure that pairs items of type HeaderName and String, i.e. the key-value pairs found in MIME headers.
HeaderName: A big list of HTTP/1.1 headers, all of the form HdrXXX, where XXX is the standard capitalisation of the header name, minus hyphens. For example, there are constructors named HdrWWWAuthenticate, HdrHost, and HdrContentLength. For non-standard header names use HdrCustom String.
RequestMethod: Has constructors HEAD, GET, POST, PUT, DELETE, OPTIONS, TRACE.
ConnError: The error type returned by the connection-oriented functions, such as sendHTTP.
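
As a rough illustration of how these types fit together, here is a simplified sketch. These are not the module's actual definitions: the field types (String for the URI and body), the shortened constructor list, and the helper lookupHeader are all assumptions made for illustration.

```haskell
-- Simplified stand-ins for the HTTP module's types (assumptions, not the
-- real definitions).
data HeaderName
    = HdrHost
    | HdrContentLength
    | HdrWWWAuthenticate
    | HdrCustom String          -- non-standard header names
    deriving (Eq, Show)

-- A header pairs a HeaderName with its String value.
data Header = Header HeaderName String
    deriving (Eq, Show)

data RequestMethod = HEAD | GET | POST | PUT | DELETE | OPTIONS | TRACE
    deriving (Eq, Show)

-- Accessor names follow the rqXXX convention described above.
data Request = Request
    { rqURI     :: String       -- assumption: a plain String instead of a URI type
    , rqMethod  :: RequestMethod
    , rqHeaders :: [Header]
    , rqBody    :: String
    } deriving (Eq, Show)

-- Hypothetical helper: the first value stored under a header name, if any.
lookupHeader :: HeaderName -> [Header] -> Maybe String
lookupHeader name hdrs =
    case [v | Header n v <- hdrs, n == name] of
        (v:_) -> Just v
        []    -> Nothing

exampleRequest :: Request
exampleRequest = Request
    { rqURI     = "http://example.org/"
    , rqMethod  = GET
    , rqHeaders = [Header HdrHost "example.org", Header (HdrCustom "X-Trace") "1"]
    , rqBody    = ""
    }
```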

The module principally exports header manipulation functions and connection-oriented functions. The header manipulators are all fairly self-explanatory. The connection-oriented functions provide two ways of managing an HTTP connection. The first is to ignore the underlying connection by using simpleHTTP:

main = simpleHTTP myrequest >>= \rsp -> putStrLn (show rsp)

The second method is to take responsibility for your own connections. Use the sequence:

  1. openTCP :: String -> IO Connection
  2. sendHTTP :: Stream s => s -> Request -> IO (Either ConnError Response)
  3. close :: Stream s => s -> IO ()

Of course any number of sends can occur on an opened connection, but any attempt to send after the connection is closed is doomed to failure, and a Connection can close at any time. It is safe to close a Connection more than once. This second method of using the HTTP library allows both connection through a proxy server and use of persistent connections. If you are about to complain about a lack of request pipelining then don't: I'm on it.
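
The lifecycle rules above (many sends per connection, repeated closes are harmless, a send after close fails) can be pictured with a toy model. This sketch does no networking and does not use the HTTP module; ToyConn, ToyError, openToy, sendToy and closeToy are hypothetical stand-ins that only mimic the shape of openTCP, sendHTTP and close, with an IORef holding the open/closed flag.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Toy stand-in for the opaque Connection: an IORef tracking whether it is open.
newtype ToyConn = ToyConn (IORef Bool)

data ToyError = ErrorClosed
    deriving (Eq, Show)

-- Mirrors step 1 (openTCP): create a connection in the open state.
openToy :: String -> IO ToyConn
openToy _host = fmap ToyConn (newIORef True)

-- Mirrors step 2 (sendHTTP): returns Left once the connection is closed.
sendToy :: ToyConn -> String -> IO (Either ToyError String)
sendToy (ToyConn ref) rq = do
    isOpen <- readIORef ref
    return (if isOpen then Right ("response to " ++ rq) else Left ErrorClosed)

-- Mirrors step 3 (close): closing an already closed connection is harmless.
closeToy :: ToyConn -> IO ()
closeToy (ToyConn ref) = writeIORef ref False

-- Two sends on one connection, two closes, then a doomed send.
demo :: IO [Either ToyError String]
demo = do
    c  <- openToy "example.org"
    r1 <- sendToy c "GET /a"
    r2 <- sendToy c "GET /b"
    closeToy c
    closeToy c
    r3 <- sendToy c "GET /c"
    return [r1, r2, r3]
```

Returning the failure as a Left value, rather than throwing, matches the Either ConnError Response result type given for sendHTTP.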

Finally you may find the functions urlEncode, urlDecode and urlEncodeVars useful. These provide URL escaping.
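
For readers unfamiliar with URL escaping, here is a minimal sketch of the idea. It is not the module's implementation: the names percentEncode, percentDecode and encodeVars are hypothetical, the set of unescaped characters is taken from the "unreserved" set of RFC 2396, spaces are written as %20, and every input character is assumed to fit in one byte.

```haskell
import Data.Char (chr, digitToInt, intToDigit, isAlphaNum, isAscii,
                  isHexDigit, ord, toUpper)
import Data.List (intercalate)

-- Characters left unescaped in this sketch (RFC 2396 "unreserved").
isUnreserved :: Char -> Bool
isUnreserved c = (isAscii c && isAlphaNum c) || c `elem` "-_.!~*'()"

-- Replace every other character with %XX, where XX is its code in hex.
-- Assumption: all characters have code points below 256.
percentEncode :: String -> String
percentEncode = concatMap escape
  where
    escape c
        | isUnreserved c = [c]
        | otherwise      = '%' : hexByte (ord c)
    hexByte n = map toUpper [intToDigit (n `div` 16), intToDigit (n `mod` 16)]

-- Turn %XX escapes back into characters; malformed escapes pass through.
percentDecode :: String -> String
percentDecode ('%':a:b:rest)
    | isHexDigit a && isHexDigit b =
        chr (16 * digitToInt a + digitToInt b) : percentDecode rest
percentDecode (c:rest) = c : percentDecode rest
percentDecode []       = []

-- Build a query string from key-value pairs, escaping both sides.
encodeVars :: [(String, String)] -> String
encodeVars vars =
    intercalate "&" [percentEncode k ++ "=" ++ percentEncode v | (k, v) <- vars]
```

With these definitions, encodeVars [("q", "a b"), ("x", "1/2")] yields "q=a%20b&x=1%2F2".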

Browser

After some small deliberation I have decided that a browser monad isn't such a bad idea. The implementation is a simple DIY state monad, which I bet will make some of you cringe. The monad name is BrowserAction, and you will find the following functions very useful for mixing the IO monad with BrowserAction.

browse :: BrowserAction x -> IO x
ioAction :: IO x -> BrowserAction x
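
To show what a DIY state monad over IO can look like, here is a small self-contained sketch. It is not the Browser module's code: ToyState, ToyAction, runFresh, liftToy and countRequest are hypothetical names, with runFresh and liftToy playing the roles of browse and ioAction. The Functor and Applicative instances are there because current GHC versions require them for any Monad.

```haskell
-- The state threaded through every action (hypothetical, just a counter).
newtype ToyState = ToyState { requestCount :: Int }

-- A hand-rolled state monad layered over IO.
newtype ToyAction a = ToyAction { runToy :: ToyState -> IO (a, ToyState) }

instance Functor ToyAction where
    fmap f (ToyAction g) = ToyAction $ \s -> do
        (a, s') <- g s
        return (f a, s')

instance Applicative ToyAction where
    pure a = ToyAction $ \s -> return (a, s)
    ToyAction f <*> ToyAction g = ToyAction $ \s -> do
        (h, s')  <- f s
        (a, s'') <- g s'
        return (h a, s'')

instance Monad ToyAction where
    ToyAction g >>= k = ToyAction $ \s -> do
        (a, s') <- g s
        runToy (k a) s'

-- Counterpart of browse: run from a fresh state and drop the final state.
runFresh :: ToyAction a -> IO a
runFresh act = fmap fst (runToy act (ToyState 0))

-- Counterpart of ioAction: lift an IO computation, leaving the state untouched.
liftToy :: IO a -> ToyAction a
liftToy io = ToyAction $ \s -> do
    a <- io
    return (a, s)

-- A state-changing primitive: bump the counter and return the new value.
countRequest :: ToyAction Int
countRequest = ToyAction $ \(ToyState n) -> return (n + 1, ToyState (n + 1))

-- Mixes state updates with lifted IO; the result is the final counter value.
demo :: ToyAction Int
demo = do
    _ <- countRequest
    liftToy (putStrLn "between requests")
    countRequest
```

Because the state only lives inside one runFresh call, anything kept in it is lost when the call returns.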

By using the Browser module you will be gaining a whole bunch of useful features. You will get:

  1. Persistent connections (across the BrowserAction monad only), limited to about 5.
  2. Cookies: the sending, receiving, and persistence thereof.
  3. Handling of 401 (authenticate) and 407 (proxy authenticate) responses using both Basic and Digest authentication.
  4. Handling of 3xx (redirects), with special cases for 303 (redirect using GET) and 305 (use proxy) responses, up to 3 redirects per request.
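
The redirect limit in item 4 can be pictured with a pure sketch. This is only an illustration of bounded redirect following, not the Browser module's logic: Reply, Server, followRedirects and toyServer are hypothetical, and the "server" is just a function from a URL to either a redirect or a body.

```haskell
-- A toy server reply: either a redirect to another URL or a final body.
data Reply = Redirect String | Body String
    deriving (Eq, Show)

type Server = String -> Reply

-- Follow redirects until a body arrives, giving up once the allowance
-- of redirects has been used.
followRedirects :: Int -> Server -> String -> Either String String
followRedirects limit server url =
    case server url of
        Body b -> Right b
        Redirect next
            | limit <= 0 -> Left ("too many redirects at " ++ url)
            | otherwise  -> followRedirects (limit - 1) server next

-- "/a" reaches a body after two redirects; "/loop" redirects forever.
toyServer :: Server
toyServer "/a"    = Redirect "/b"
toyServer "/b"    = Redirect "/c"
toyServer "/loop" = Redirect "/loop"
toyServer _       = Body "hello"
```

With an allowance of 3, followRedirects 3 toyServer "/a" succeeds, while "/loop" is abandoned when a fourth redirect is encountered.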

There is no need to manipulate the browser state directly, since the Browser module provides specific functions for doing just this: get/setAllowRedirects, setCookieFilter, get/setCookies, addCookie, setErrHandler, setOutHandler, setProxy. The functions out and err help make logging consistent across a BrowserAction; they are used within the most interesting function, request :: Request -> BrowserAction Response.

The digest authentication scheme requires the MD5 algorithm. I've used this implementation, and the files for this implementation are included in the zip above.

Base64

Just a simple Base64 encoding/decoding module, used by Browser for Basic authentication.
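
For reference, here is a minimal sketch of Base64 encoding and of how Basic authentication uses it (RFC 2617 sends the Base64 of "user:password" after the word "Basic"). It is not the Base64 module's code: encode64 and basicAuthValue are hypothetical names, and the input is assumed to consist of single-byte characters.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (ord)

-- The standard Base64 alphabet, indexed by 6-bit value.
alphabet :: String
alphabet = ['A'..'Z'] ++ ['a'..'z'] ++ ['0'..'9'] ++ "+/"

-- Encode a string of single-byte characters (code points below 256).
-- Each group of three bytes becomes four characters; a short final
-- group is padded with '='.
encode64 :: String -> String
encode64 = go . map ord
  where
    sextet n i = alphabet !! ((n `shiftR` i) .&. 63)
    go (a:b:c:rest) =
        let n = (a `shiftL` 16) .|. (b `shiftL` 8) .|. c
        in sextet n 18 : sextet n 12 : sextet n 6 : sextet n 0 : go rest
    go [a, b] =
        let n = (a `shiftL` 16) .|. (b `shiftL` 8)
        in [sextet n 18, sextet n 12, sextet n 6, '=']
    go [a] =
        let n = a `shiftL` 16
        in [sextet n 18, sextet n 12, '=', '=']
    go [] = []

-- Value of an Authorization header for Basic authentication.
basicAuthValue :: String -> String -> String
basicAuthValue user pass = "Basic " ++ encode64 (user ++ ":" ++ pass)
```

The example credentials from RFC 2617, user "Aladdin" with password "open sesame", give "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==".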

RFC Consistency

Here is the list of RFCs I've used:
RFC 1521 - MIME stuff.
RFC 1867 - Form based file upload.
RFC 2045 - MIME stuff.
RFC 2068 - Old HTTP/1.1 spec.
RFC 2109 - Cookies.
RFC 2246 - TLS spec (Transport Layer Security, aka SSL).
RFC 2396 - URI format.
RFC 2616 - HTTP/1.1 spec.
RFC 2617 - HTTP/1.1 Authentication spec.
RFC 2817 - TLS upgrading in HTTP.
The most important of these is RFC 2616, which is the one I have attempted to follow. RFC 2617 is also important, especially in the Browser module. Where I have strayed from RFC 2616, it is to make the implementation more robust. RFC 2617, however, is only partially supported: the more interesting features of nonce-nce have not been implemented, and I have performed precisely no tests.

Weaknesses

In the interests of honesty I should mention:

Important: Using withSocketsDo

In the interests of portability you should wrap any IO action containing a call to browse with the function withSocketsDo from the Socket module. Strictly speaking this is only necessary on Windows, where winsock initialisation is mandatory, but I think you should do it anyway.

It is quite safe to call withSocketsDo multiple times, but that technique has earnt a place on the Winsock Programmers FAQ Lame List, since winsock initialisation has performance overhead. I doubt you can safely nest these calls, so don't do it.

Oh, and you should strive to catch errors within withSocketsDo and then pass them out in a synchronous fashion, since throwing exceptions from within this function will otherwise prevent the safe cleanup of resources (and on my computer eventually kills all network connectivity).
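
One way to follow this advice is to turn exceptions into ordinary values before they can leave the wrapper. The sketch below assumes a modern GHC with Control.Exception, which differs from the Prelude catch used elsewhere on this page. The names insideWrapper and toyWrapper are hypothetical; toyWrapper stands in for withSocketsDo so that the sketch runs without the network package.

```haskell
import Control.Exception (ErrorCall (..), SomeException, throwIO, try)

-- Run an action inside a wrapper, catching any exception before it can
-- escape the wrapper and handing it back as a Left value instead.
insideWrapper :: (IO (Either SomeException a) -> IO (Either SomeException a))
              -> IO a
              -> IO (Either SomeException a)
insideWrapper wrapper action = wrapper (try action)

-- Stand-in for withSocketsDo (assumption: same shape, no initialisation).
toyWrapper :: IO b -> IO b
toyWrapper body = body

-- An action that fails: the exception comes back as a Left value.
demoFailure :: IO (Either SomeException Int)
demoFailure = insideWrapper toyWrapper (throwIO (ErrorCall "connection refused"))

-- An action that succeeds: the result comes back as a Right value.
demoSuccess :: IO (Either SomeException Int)
demoSuccess = insideWrapper toyWrapper (return 42)
```

In a real program one would pass withSocketsDo in place of toyWrapper and inspect the Either result after the wrapper has returned.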

Compilation

The modules were originally developed under ghc-5.02.2. With the old package hierarchy I used:

ghc -o main -package net -package util --make Main.hs

With the new package structure the following should do it:

ghc -o main -package network -package data --make Main.hs

Bugs, Comments, Requests, and Suggestions

Bugs, comments, requests, and suggestions are all welcome. Send to warrick dot gray at hotmail dot com.

Example: Getting a web page.

import Network.URI
import System.Environment (getArgs)
import Data.Maybe (fromMaybe)
import System.IO
import HTTP
import Browser
import Network (withSocketsDo)


main :: IO ()
main = withSocketsDo $ catch
       (do { h <- openFile "mylog.log" WriteMode   -- log file for browser output
           ; args <- getArgs
           ; if null args 
             then hClose h >> putStrLn "Web page argument required!"
             else catch (browse (fn h (head args)) >> hClose h)
                        (\e -> hClose h)            -- close the log on failure too
           })
       (\e -> putStrLn ("Exception!: " ++ show e))
    where
        myRequest u = 
            let uri = fromMaybe (error "Nothing from url parse") (parseURI u)
            in defaultGETRequest uri
        
        fn h arg = 
            do setOutHandler (hPutStrLn h)
               setCookieFilter (\_ _ -> return True)
               let rq = myRequest arg               
               rsp <- request rq
               ioAction $ hFlush h
               out (rspBody $ snd rsp)

Licence

These modules are published under the BSD licence, with the single alteration:
Neither the name of the ORGANIZATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
Is replaced with:
The names of contributors may not be used to endorse or promote products derived from this software without specific prior written permission.
Simply because there is no relevant organisation name to substitute.

You can find the licence here.