Monday, March 12, 2007

Detecting Web Application Security Vulnerabilities

Web Application Vulnerability Detection with Code Review

Web application source code, independent of languages and platforms, is a major source for vulnerabilities. One of the CSI surveys on vulnerability distribution suggests that 64% of the time, a vulnerability crops up due to programming errors and 36% of the time, due to configuration issues. According to IBM labs, there is a possibility of at least one security issue contained in every 1,500 lines of code. One of the challenges a security professional faces when assessing and auditing web applications is to identify vulnerabilities while simultaneously performing a source code review.

Problem Domain

Several languages are popular for web applications, including Active Server Pages (ASP), PHP, and Java Server Pages (JSP). Every programmer has his own way of implementing and writing objects. Each of these languages has exposed several APIs and directives to make a programmer's life easy. Unfortunately, a programming language cannot offer any guarantee on security. It is the programmer's responsibility to ensure that his own code is secure against various attack vectors, some of which may be malicious in nature.

On the other side, it is imperative to get the developed code assessed from a security standpoint, externally or in-house, prior to deploying the code on production systems. It's impossible to use only one tool to determine vulnerabilities residing in the source code, given the customized nature of applications and the many ways in which programmers can code. Source code review requires a combination of tools and intellectual analysis to determine exposure. The source code may be voluminous, running into thousands or millions of lines in some cases. It is not possible to go through each line of code manually in a short time span. This is where tools come into play. A tool can only help in determining information; it is the intellect--with a security mindset--that must link this information together. This dual approach is the one normally advocated for a source code review.


To demonstrate automated review, I present a sample web application written in ASP.NET. I've produced a sample Python script as a tool for source code analysis. This approach can work to analyze any web application written in any language. It is also possible to write your own tool using any programming language.

Method and Approach

I've divided my method for approaching a code review exercise into several logical steps with specific objectives:

  • Dependency determination
  • Entry point identification
  • Threat mapping and vulnerability detection
  • Mitigation and countermeasures

Dependency determination

Prior to commencing a code review exercise, you must understand the entire architecture and dependencies of the code. This understanding provides a better overview and focus. One of the key objectives of this phase is to determine clear dependencies and to link them to the next phase. Figure 1 shows the overall architecture of a web shop in the case study under review.

Figure 1. Architecture for web application

The application has several dependencies:

  • A database. The web application has MS-SQL Server running as the backend database. This interface must be examined when performing a code review.
  • The platform and web server. The application runs on the IIS web server with the .NET platform. This is helpful from two perspectives: 1) in securing deployment, and 2) in determining the source code type and language.
  • Web resources and languages. In this example, ASPX and ASMX are web resources. They are typical web applications and web services pages, written in the C# language. These resources help to determine patterns during a code review.
  • Authentication. The application authenticates users through an LDAP server. The authentication code is a critical component and needs analysis.
  • Firewall. An application-layer firewall is in place, and content filtering must be enabled.
  • Third-party components. Any third-party components being consumed by the application along with the integration code need analysis.
  • Information access from the internet. Other aspects that require consideration are RSS feeds and emails, information that an application may consume from the internet.

With this information in place, you are in a better position to understand the code. To reiterate, the entire application is coded in C# and is hosted on a web server running IIS. This is the target. The next step is to identify entry points to the application.

Entry point identification

The objective of this phase is to identify entry points to the web application. A web application can be accessed from various sources (Figure 2). It is important to evaluate every source; each has an associated risk.

Figure 2. Web application entry points

These entry points provide information to an application. These values hit the database, LDAP servers, processing engines, and other components in the application. If these values are not guarded, they can open up potential vulnerabilities in the application. The relevant entry points are:

  • HTTP variables. The browser or end-client sends information to the application. This set of requests contains several entry points such as form and query string data, cookies, and server variables (HTTP_REFERER, etc). The ASPX application consumes this data through the Request object. During a code review exercise, look for this object's usage.
  • SOAP messages. The application is accessible by web services over SOAP messages. SOAP messages are potential entry points to the web application.
  • RSS and Atom feeds. Many new applications consume third-party XML-based feeds and present the output in different formats to an end-user. RSS and Atom feeds have the potential to open up new vulnerabilities such as XSS or client-side script execution.
  • XML files from servers. The application may consume XML files from partners over the internet.
  • Mail system. The application may consume mails from mailing systems.

These are the important entry points to the application in the case study. Regular expressions can pull the key patterns for these entry points out of multiple source files so that you can trace and analyze them.

Scanning the code with Python

The scanner is a source code-scanning utility. It is a simple Python script that automates the review process. This Python scanner has three functions with specific objectives:

  • The scanfile function scans the entire file for specific security-related regex patterns:

    ".*.[Rr]equest.*[^\n]\n" # Look for request object calls
    ".*.select .*?[^\n]\n|.*.SqlCommand.*?[^\n]\n" # Look for SQL execution points
    ".*.FileStream .*?[^\n]\n|.*.StreamReader.*?[^\n]\n" # Look for file system access
    ".*.HttpCookie.*?[^\n]\n|.*.session.*?[^\n]\n" # Look for cookie and session information
    "" # Look for dependencies in the application
    ".*.[Rr]esponse.*[^\n]\n" # Look for response object calls
    ".*.write.*[^\n]\n" # Look for information going back to browser
    ".*catch.*[^\n]\n" # Look for exception handling
  • The scan4request function scans the file for entry points to the application using the ASP.NET Request object. Essentially, it runs the pattern ".*.[Rr]equest.*[^\n]\n".
  • The scan4trace function helps analyze the traversal of a variable in the file. Pass the name of a variable to this function and get the list of lines where it is used. This function is the key to detecting application-level vulnerabilities.
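The global scan described above can be sketched as a table of labeled patterns applied to each line. This is only an illustration, not the article's scanfile function: the regexes are copied from the list above, while the labels, the PATTERNS table, and the name scan_lines are assumptions made for this sketch.

```python
import re

# Label -> pattern. The regexes come from the list above; the labels and
# the function name scan_lines are hypothetical, chosen for this sketch.
PATTERNS = {
    "Request Object Entry": re.compile(r".*.[Rr]equest.*[^\n]\n"),
    "SQL Object Entry": re.compile(r".*.select .*?[^\n]\n|.*.SqlCommand.*?[^\n]\n"),
    "Exception handling": re.compile(r".*catch.*[^\n]\n"),
}


def scan_lines(lines):
    """Return (label, line number, stripped line) for every pattern hit."""
    hits = []
    for linenum, line in enumerate(lines, 1):
        for label, pattern in PATTERNS.items():
            # match() anchors at the start of the line; the leading ".*"
            # in each pattern lets the keyword appear anywhere in it.
            if pattern.match(line):
                hits.append((label, linenum, line.strip()))
    return hits
```

Passing the result of readlines() on a source file to scan_lines gives one tuple per finding, which a caller can print grouped by label. Note that each pattern requires a trailing newline, so a final line without one is not reported.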

Using the program is easy; it takes several switches to activate the previously described functions.

Cannot parse the option string correctly
scancode -
flag -sG : Global match
flag -sR : Entry points
flag -t : Variable tracing
Variable is only needed for -t option

Examples: -sG details.aspx -sR details.aspx -t details.aspx pro_id


The scanner script first imports Python's regex module:

import re

Importing this module makes it possible to run regular expressions against the target file:

p = re.compile(".*.[Rr]equest.*[^\n]\n")

This line defines a regular expression--in this case, a search for the Request object. The match() method then tests the pattern against each line of the file in turn; every line that matches is a possible entry point:

m = p.match(line)

Looking for entry points

Now use the script to scan the details.aspx file for possible entry points in the target code. The -sR switch identifies entry points; running it on the details.aspx page produces the following results:

D:\PYTHON\scancode> -sR details.aspx
Request Object Entry:
22 : NameValueCollection nvc=Request.QueryString;

This is the entry point to the application, the place where the code stores QueryString information in a NameValueCollection.

Here is the function that grabs this information from the code:

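def scan4request(file):
    # Compile once: any line that mentions the Request object
    p = re.compile(".*.[Rr]equest.*[^\n]\n")
    print('Request Object Entry:')
    with open(file, "r") as infile:
        for linenum, line in enumerate(infile, 1):
            m = p.match(line)
            if m:
                # Report the line number and the matching source line
                print("%d : %s" % (linenum, line.strip()))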

The code snippet shows the file being opened and the request object grabbed using a specific regex pattern. This same approach can capture all other entry points. For example, here's a snippet to identify cookie- and session-related entry points:

# Look for cookie and session management
p = re.compile(".*.HttpCookie.*?[^\n]\n|.*.session.*?[^\n]\n")
m = p.match(line)
if m:
    print('Session Object Entry:')
    print("%d : %s" % (linenum, line.strip()))

Threat mapping and vulnerability detection

Discovering entry points narrows the focus for threat mapping and vulnerability detection. An entry point is essential to a trace. It is important to unearth where this variable goes (execution flow) and its impact on the application.

After locating these entry points to the application, you need to trace them and search for vulnerabilities.

The previous scan found a Request object entry in the application:

22 :    NameValueCollection nvc=Request.QueryString;

Running the script with the -t option will help to trace the variables. (For full coverage, trace each variable right through to the end, using all possible iterations.)

D:\PYTHON\scancode> -t details.aspx nvc
Tracing variable:nvc
NameValueCollection nvc=Request.QueryString;
String[] arr1=nvc.AllKeys;
String[] sta2=nvc.GetValues(arr1[0]);

This assigned a value from nvc to sta2, so that also needs a trace:

D:\PYTHON\scancode> -t details.aspx sta2
Tracing variable:sta2
String[] sta2=nvc.GetValues(arr1[0]);

Here's another iteration; tracing pro_id:

D:\PYTHON\scancode> -t details.aspx pro_id
Tracing variable:pro_id
String pro_id="";
String qry="select * from items where product_id=" + pro_id;

Finally, this is the end of the trace. This example has shown multiple traces of a single page, but it is possible to traverse multiple pages across the application. Figure 3 shows the complete output.

Figure 3. Vulnerability detection with tracing
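The source of scan4trace is not shown here, so the following is only a minimal sketch of how such a trace could work. The function name trace_variable and the use of word boundaries are assumptions made for this illustration, not the article's implementation.

```python
import re


def trace_variable(lines, name):
    """Return (line number, stripped line) for each line that uses `name`.

    Hypothetical stand-in for scan4trace. Word boundaries keep a short
    name such as `nvc` from matching inside a longer identifier.
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    return [
        (linenum, line.strip())
        for linenum, line in enumerate(lines, 1)
        if pattern.search(line)
    ]
```

Each new variable that appears on the left side of an assignment in the result (sta2 in the trace of nvc, for example) becomes the input to the next call, which is the manual iteration shown in the traces above.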

As the source code and figure show, there is no validation of input in the source. There is a SQL injection vulnerability:

String qry="select * from items where product_id=" + pro_id;

The application accepts pro_id and passes it as is to the SELECT statement. It is possible to manipulate this statement and inject a SQL payload.

Similarly, another line exposes a cross-site scripting (XSS) vulnerability:

response.write(pro_id);

Throwing back the (unvalidated) pro_id to the browser provides a position for an attacker to inject JavaScript to be executed in the victim's browser.

The script's -sG option executes the global search routine. This routine looks for file objects, cookies, exceptions, etc. Each has potential vulnerabilities, and this scan can help you to identify them and map them to the respective threats:

D:\shreeraj_docs\perlCR> -sG details.aspx
13 :

Request Object Entry:
22 : NameValueCollection nvc=Request.QueryString;

SQL Object Entry:
49 : String qry="select * from items where product_id=" + pro_id;

SQL Object Entry:
50 : SqlCommand mycmd=new SqlCommand(qry,conn);

Response Object Entry:
116 : response.write(pro_id);

XSS Check:
116 : response.write(pro_id);

Exception handling:
122 : catch(Exception ex)

This approach keeps the effort of a code review low by automating entry point detection, vulnerability detection, and variable tracing.

Mitigation and Countermeasure

After you have identified a vulnerability, the next step is to mitigate the threat. There are various ways to do this, depending on your deployment. For example, it's possible to mitigate SQL injection by adding a rule to the web application firewall to block a certain set of characters such as single and double quotes. The best way to mitigate this issue is by applying secure coding practices--providing proper input validation before consuming the variable at the code level. At the SQL level, it is important to use either prepared statements or stored procedures to avoid SQL SELECT statement injection. For mitigation of XSS vulnerabilities, it is imperative to filter out or encode characters such as greater than (>) and less than (<) prior to serving any content to the end-client. These steps provide threat mitigation to the overall web application.
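To make the two code-level countermeasures concrete, here is a small sketch in Python, the language of the scanner, rather than in the C# of the sample application. It assumes an in-memory SQLite database as a stand-in for SQL Server; the items schema, the name column, and the function names find_item and render_product_id are illustrative only. In ADO.NET the equivalent idea is a SqlCommand with parameters instead of a concatenated query string.

```python
import html
import sqlite3


def find_item(conn, pro_id):
    """Look up a product with a bound parameter instead of string concatenation."""
    cur = conn.execute("SELECT name FROM items WHERE product_id = ?", (pro_id,))
    return cur.fetchall()


def render_product_id(pro_id):
    """Encode HTML metacharacters before echoing the value to the browser."""
    return html.escape(pro_id)


# Demo against an in-memory database (schema is illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (product_id TEXT, name TEXT)")
conn.execute("INSERT INTO items VALUES ('1', 'widget')")

print(find_item(conn, "1"))         # the legitimate lookup
print(find_item(conn, "1 OR 1=1"))  # payload is treated as data, not SQL
print(render_product_id("<script>alert(1)</script>"))
```

Because the driver binds pro_id as a value, the injected text never becomes part of the SQL statement, and the encoded output is rendered by the browser as text rather than executed as script.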


Code review is a very powerful tool for detecting vulnerabilities and getting to their actual source. This is the "whitebox" approach. Dependency determination, entry point identification, and threat mapping help detect vulnerability. All of these steps need architecture and code reviews. The nature of code is complex, so no single tool can meet all of your needs. As a professional, you need to write tools on the fly when doing code review and put them into action when the code base is very large. It is not feasible to go through each line of code.

In this scenario, one of the methods is to start with entry points, as discussed earlier in this article. You can build complex scripts or programs in any language to grab various patterns in voluminous source code and link them together. Tracing the variable or function is the key that can reveal the entire traversal and greatly help in determining vulnerabilities.

Open Tools for MySQL Administrators

MySQL provides some tools to monitor and troubleshoot a MySQL server, but they don't always suit a MySQL developer or administrator's common needs, or may not work in some scenarios, such as remote or over-the-web monitoring. Fortunately, the MySQL community has created a variety of free tools to fill the gaps. Unfortunately, many of these are hard to find via web searches. In fact, web searches can be frustrating because they uncover abandoned or special-purpose, not ready-to-use projects. You could spend hours trying to find tools for monitoring and troubleshooting your MySQL servers. What's a tool-seeker to do?

Relax! I've already done the work, so you won't have to. I'll point you to the tools I've actually found useful. At the end of this article I'll also list those I didn't find helpful.

This article is about tools to discover and monitor the state of your server, so I won't discuss programs for writing queries, designing tables, and the like. I'm also going to focus exclusively on free and open source software.

Tools to Monitor Queries and Transactions

The classic tool for monitoring queries is Jeremy Zawodny's mytop. It is a Perl program that runs in a terminal and displays information about all connections in a tabular layout, similar to the Unix top program's process display. Columns include the connection ID, the connection's status, and the text of the current query. From this display you can select a query to EXPLAIN, kill a query, and a few other tasks. A header at the top of the display gives information about the server, such as version, uptime, and some statistics like the number of queries per second. The program also has some other functions, but I never found myself using them much.

There are mytop packages for various GNU/Linux distributions, such as Gentoo and Fedora Core, or you can install one from Jeremy's website. It is very small and has minimal dependencies. On the downside, it hasn't been maintained actively for a while and doesn't work correctly with MySQL 5.x.

A similar tool is mtop. It has a tabular process display much like mytop, and although it lacks some features and adds others, the two programs are very similar. It is also a Perl script and there are installation packages for some operating systems, or you can download it from SourceForge. Unfortunately, it is not actively maintained and does not work correctly on newer versions of MySQL.

Some programmers have also created scripts to output MySQL's process list for easy consumption by other scripts. An example is this SHOW FULL PROCESSLIST script, available from the always-useful MySQL Forge.

My own contribution is innotop, a MySQL and InnoDB monitor. As MySQL has become increasingly popular, InnoDB has become the most widely used transactional MySQL storage engine. InnoDB has many differences from other MySQL storage engines, so it requires different monitoring methods. It exposes internal status by dumping a potentially huge amount of semi-formatted text in response to the SHOW INNODB STATUS command. There's a lot of raw data in this text, but it's unusable for real-time monitoring, so I wrote innotop to format and display it conveniently. It is the main monitoring tool at my current employer.

Innotop is much more capable than the other tools I've mentioned, and can replace them completely. It has a list of processes and status information, and offers the standard functions to kill and explain queries. It also offers many features that are not in any other tool, including being able to list current transactions, lock waits, deadlock information, foreign key errors, I/O and log statistics, InnoDB row operation and semaphore statistics, and information on the InnoDB buffer pool, memory usage, insert buffer, and adaptive hash index. It also displays more standard MySQL information than mytop and its clones, such as compact, tabular displays of current and past status information snapshots. It is very configurable and has interactive help.

Installation is simple, because innotop is a ready-to-run Perl script, but there are no installation packages yet, so you must download it from my website.

There are also some web-based tools. There are two web-based mytop clones, phpMyTop and ajaxMyTop. These are useful when you don't have shell access and can't connect remotely to your database server, but can connect from a web server. ajaxMyTop is more recent and seems to be more actively developed. It also feels more like a traditional GUI program, because thanks to Ajax, the entire page does not constantly refresh itself.

Another web-based tool is the popular phpMyAdmin package. phpMyAdmin is a Swiss Army Knife, with features to design tables, run queries, manage users and more. Its focus isn't on monitoring queries and processes, but it has some of the features I've mentioned earlier, such as showing a process list.

Finally, if you need to monitor what's happening inside a MySQL server and don't care to--or can't--use a third-party tool, MySQL's own mysqladmin command-line program works. For example, to watch incremental changes to the query cache, run the command:

$ mysqladmin extended -r -i 10 | grep Qcache

Of course, innotop can do that for you too, only better. Take a look at its "V" mode. Still, this can be handy when you don't have any way to run innotop.
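With -i, the -r (relative) option makes mysqladmin print the difference between the current and previous values of each status counter. The idea can be sketched in a few lines; the function name relative_change and the sample numbers below are made up for illustration and are not real server output.

```python
def relative_change(previous, current):
    """Return current - previous for each counter found in both snapshots."""
    return {
        name: current[name] - previous[name]
        for name in current
        if name in previous
    }


# Two made-up snapshots of query cache counters, taken 10 seconds apart.
before = {"Qcache_hits": 1200, "Qcache_inserts": 300, "Qcache_not_cached": 45}
after = {"Qcache_hits": 1260, "Qcache_inserts": 310, "Qcache_not_cached": 45}

for name, delta in sorted(relative_change(before, after).items()):
    print("%-20s %d" % (name, delta))
```

Watching deltas rather than absolute totals is what makes the output useful: a counter that has been accumulating since server start says little about what is happening right now.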

Tools to Monitor a MySQL Server

Sometimes, rather than monitoring the queries running in a MySQL server, you need to analyze other aspects of the system's performance. You could use standard command-line utilities to monitor the resources used by the MySQL process on GNU/Linux, or you could run Giuseppe Maxia's helpful script to measure MySQL resource consumption. This tool recursively examines the processes associated with the MySQL server's process ID, and prints a report on what it finds. For more information, read Giuseppe's own article on the O'Reilly Databases blog.

The MySQL Forge website is an excellent place to discover tips, tricks, scripts, and code snippets for daily MySQL administration and programming tasks. For example, there's an entry to help you measure replication speed, a "poor man's query profiler" to capture queries as they fly by on the network interface, and much more.

Another excellent resource is mysqlreport, a well-designed program that turns MySQL status information into knowledge. It prints out a report of relevant variables, sensibly arranged for an experienced MySQL user. I find this tool indispensable when I have to troubleshoot a server without knowing anything about it in advance. For example, if someone asks me to help reduce load on a MySQL server that's running at 100 percent CPU, the first thing I do is to run mysqlreport. I can get more information by glancing at its output than I could in 10 minutes of talking to the customer. It immediately tells me where to focus my efforts. If I see a high key read ratio and a high percentage of index scans, I can immediately look for large indexes and a key buffer that's too small. That intuition could take many minutes to develop just by examining SHOW STATUS.

The mysqlreport website has full information on how to install and use the program, but better yet, there are excellent tutorials on how to interpret its output, with real examples. Some of these go into detail on MySQL internals, and I recommend them to any MySQL developer.

Another common task is setting up automated systems to monitor your server and let you know if it's alive. You could write your own monitor, or you could just plug in a ready-made one. According to a MySQL poll, Nagios is the most popular tool for doing this. There's also a Watchdog mysql monitor plugin for mon, the Linux scheduling and alert management tool. We currently use a home-grown system at my employer, but we're looking at using Nagios soon.

Tools I Didn't Find Useful

The Quicomm MySQL Monitor is a web-based administration tool similar to phpMyAdmin, not a monitor in the same sense as mytop or innotop. It offers relatively few features compared to phpMyAdmin.

Another web-based tool is MySysop, which is billed as a "MySQL system optimizer", though it certainly doesn't do anything on its own to optimize a MySQL system. It offers recommendations I would not trust without doing enough investigation to arrive at the same conclusions. By the time I could install and run this system, I'd have long since run mysqlreport.

Finally, I've never understood how to even use the Google mMaim (MySQL Monitoring And Investigation Module). It is part of Google's open source code contributions, and Google probably uses it internally to monitor its servers. However, it's not obvious to the rest of the world how to do this, as evidenced by the mailing list. The mailing list also reveals that Google released the code simply for the sake of releasing it. While I appreciate the gesture, I can't find any use for the code.


If you're trying to find tools for your own work, I recommend innotop and mysqlreport, and a healthy dose of command-line competence. I used to rely on mytop for my routine monitoring, but now I use innotop, because it shows much more information, including all-important details about transactions. When I need to analyze a server to discover what's wrong with it, it's impossible to match mysqlreport's instant snapshot of server health and activity. When I need to know about MySQL's resource consumption and performance, I augment standard command-line utilities with scripts, such as Giuseppe Maxia's.

There are certainly other tools, but the ones mentioned here are free and open source, have nearly every feature you can find in other tools, and do a lot you can't find elsewhere at all.

VOIP on the Nokia 770 Internet Tablet

I ended my previous article (Linux on the Nokia 770 Internet Tablet) by saying that the release of the OS 2006 prepared the way for some serious VOIP work. The 770 can now make SIP-based VOIP phone calls and is more like what you'd expect from Nokia--a phone!

What does it take to upgrade the machine, and how difficult is it? As it happens, not much and not very, but when you're at risk of bricking the machine, there's always a certain level of anxiety.

The first step in the upgrade is to visit the Nokia 770 support site for a Windows download or Maemo's 770 download page for Linux and Mac OS X. Download the new OS. You need to provide the machine number of your 770; the download pages provide instructions on how to find it.

The next step is to do it! On Linux and Mac OS X, connect the 770 to the host machine with the USB cable and run a script while holding down the home button (and possibly your breath, as well). I flubbed my first attempt by letting go of the button too soon. The good news was that the only result was a failure notice on the host machine console:

SW version in image: SU-18_2006SE_1.2006.26-8_PR_MR0
Image '2nd', size 8704 bytes
Image 'secondary', size 87040 bytes
Image 'xloader', size 13824 bytes
Image 'initfs', size 1890304 bytes
Image 'kernel', size 1266560 bytes
Image 'rootfs', size 60030976 bytes
Suitable USB device not found, waiting
USB device found at bus 002, device address 002-0421-0105-02-00
Sending request 0x01 failed: Unknown error: 0
NOLO_REQ_GET_STATUS: Invalid argument
Device status query failed

Holding down the button for the whole operation was the way forward. Here is my success:

SW version in image: SU-18_2006SE_1.2006.26-8_PR_MR0
Image '2nd', size 8704 bytes
Image 'secondary', size 87040 bytes
Image 'xloader', size 13824 bytes
Image 'initfs', size 1890304 bytes
Image 'kernel', size 1266560 bytes
Image 'rootfs', size 60030976 bytes
Suitable USB device not found, waiting
USB device found at bus 002, device address 002-0421-0105-02-00
Found board Nokia 770 (F5)
NOLO version 0.9.0
Sending xloader image (13 kB)...
100% (13 of 13 kB, avg. 385 kB/s)
Sending secondary image (85 kB)...
100% (85 of 85 kB, avg. 765 kB/s)
Flashing bootloader... done.
Sending kernel image (1236 kB)...
100% (1236 of 1236 kB, avg. 796 kB/s)
Flashing kernel... done.
Sending initfs image (1846 kB)...
100% (1846 of 1846 kB, avg. 795 kB/s)
Flashing initfs... done.
Sending and flashing rootfs image (58624 kB)...
100% (58624 of 58624 kB, avg. 598 kB/s)
Finishing flashing... done

Looks, etc.

What you get is an updated interface with more operations available from the desktop.

If you want to add to the basic Linux install by importing more apps such as the terminal, the process is the same one I demonstrated in my previous article. I'm using it and Joe to write this report (Emacs keystrokes just didn't work out for me on this machine, and I didn't get the hang of the double escapes with Vi either). There is a version of Vim that works quite well, though.

The catalog of apps is fairly similar, except there are some that haven't made it across yet, and some new ones as well.

I should put in a warning here about a theme called LCARS. It's a Star Trek thing, which looks pretty cool. The minus side starts with hard-to-see fonts in daylight. From there, it grew significantly worse on my configuration, with corrupted data files and various apps refusing to start. The problem, I think, is that this theme is very weighty for this machine, and the OS doesn't so far degrade very nicely when it runs out of memory. This only affected runtime files, so an uninstall followed by a couple of reboots seemed to fix everything.

At least, that was true for me on release 1 of OS 2006. The recently released update cured all those problems on my machine. LCARS now runs like a charm and looks pretty good as well.

Another tangent is email. The bundled client is quite OK for dealing with a few emails but it gets old very quickly if you get lots. For example, you can't tag emails so deleting quite a few is a major pain. It won't handle groups at all and GMail isn't all that great, either.

Pine to the rescue! I used to prefer Mutt but it isn't available for this platform, and I'm on the road and don't have a suitable machine to do it myself. Anyway, running Pine on the 770 is way cool. The easiest way to get it is to add mistral user to your repositories list, update available packages, and get Pine.

If you're new to Pine, the best way to edit the config file .pinerc is through the internal setup within the program. Be sure to enable the mouse in xterm, as this allows you to tap options on the screen rather than having to drop down the menu item in the improved Xterm that will send a Ctrl signal. Another note: as initially configured, the emails you send will come from User. This is easy to fix. See Jimc's Nokia 770 page for details.


Without importing any apps, the limit of your VOIP calling is to fellow Gmailers. You're not a Gmailer, you say? Well, as a 770 owner, you already have an account. It's just a pity that the Opera browser shipped with OS 2006 can't fully cope with Gmail. Opera tells me that the next version is better.

Another alternative is to download a client from the Gizmo Project. Once you open the app, you receive 25 cents of free calls if you register. At 1 cent per minute to quite a few places, the rates are quite competitive. Calls to fellow Gizmo users are free. You can also register a normal phone number for your device at Gizmo for $12 for three months. Calling is very straightforward. You put in the number, put the 770 up to your ear, and talk away. Top up your minutes by clicking on "add credit" in the "home" section.

There's also Tapioca, which is "a GoogleTalk client with VoIP and instant messaging capabilities, with a simple user interface. It can be installed on the device without any conflict with the product's built-in Gtalk client."

Another project called Minisip comes from the postgrad students at the Royal Institute of Technology in Stockholm, Sweden. It's quite advanced, but there are no downloads at the moment due to code rewrites.


Finally there's a port of the well-known Asterisk that will do VOIP as well as PABX duties. Getting this on a 770 isn't, at the moment, for the faint of heart though...but if you're a long-term Asterisk user, you won't be faint of heart.

This is what you want, I'm sure. Here's what I did to get a working (as in "non-crashing") version of Asterisk 1.2.1 (the latest release from Digium) on the Nokia 770.

If you're in a hurry or you don't want to mess with compiling and Scratchbox (or you simply don't know what those are), just skip to the binaries.

  • Start Scratchbox.
  • From within Scratchbox, run wget to download the latest Asterisk sources.
  • Unarchive the sources with tar xvfz asterisk-1.2.1.tar.gz. This will give you an asterisk-1.2.1 folder. Change to that folder (cd asterisk-1.2.1).
  • Patch the main Makefile and the one for the GSM codec in order to make them compile for the 770. Download both diffs with wget and wget
  • Patch the main Makefile with patch Makefile Makefile.diff.

There are eight steps to go; read more at Installing Asterisk on the Nokia 770.

Note: A point of interest here is that the linked Asterisk Nokia 770 binary includes a SIP client for OS 2005, which might be useful if you don't want to upgrade for other reasons.

The Scratchbox reference means that you first need to install the Maemo SDK. Otherwise, you can pick up toward the end of the instructions and get a ready-made binary, which needs some work to install...

  • You're ready to move the binaries to your Nokia 770. Go to /tmp/ast121/ and type tar cvfz asterisk-1.2.1-nokia770-arm-binary.tar.gz *. You can also download the Nokia 770 Asterisk binary directly. Drop the files on your memory card or scp them from your machine--your choice.

    Another note: As I write this, the binary for OS2006 does not work due to missing libraries. I imagine the fix is on its way, though.

  • On the 770, start an XTerm and become root.
  • Go to the folder where you dropped the asterisk-1.2.1-nokia770-arm-binary.tar.gz file and (as root) type tar -zvx -C / -f asterisk-1.2.1-nokia770-arm-binary.tar.gz.

    Note: The easiest way to become root is to get Becomeroot from the's application list. With that on board, sudo su gives you a passwordless root.

  • That's all. To run Asterisk, edit the configuration files at /etc/asterisk, then type asterisk -vvvvvc to start the program and get a console prompt.

Other Things

There is some interesting stuff coming up with handwriting recognition. At a recent Symbian Smartphone show, I saw both Symbian and 770 demos of vastly improved systems. The one from MyScript recognized whole lines of cursive linked writing rather than just one letter at a time. XT9 also showed an improved version of the current system.

Some people call the 770 "the new Zaurus," but really the only points of comparison are Linux and the degree of enthusiasm around it. Nokia seems fully aware of what it has, which is more than Sharp ever demonstrated, at least in markets other than Japan. Nokia also has the advantage of having much wider distribution channels.

Very special thanks to Gala's fourth-year computer science students at Simferopol University for showing me where to get a WLAN connection for this article. Special thanks as well to Ciaron Linstead in Berlin for extensive use of his network, which allowed me to get Pine working, among other things.

A New Visualization for Web Server Logs

There are well over a hundred web server log analyzers (Google Directory for Log Analysis) or web statistics tools ranging from commercial offerings such as WebTrends to open source ones such as AWStats. These take web server logfiles and display numbers such as page views, visits, and visitors, as well as graphs over various time ranges. This article presents the same data in those logfiles in a very different way: as a 3D plot. By the end of this article, I hope you will agree with me that the visualization described herein is a novel and useful way to view the content of logfiles.

The logfiles of web servers record information on each HTTP request they receive, such as the time, the sender's IP address, the request URL, and the status code. The items in each request are fairly orthogonal to one another. The IP address of a client has no relation to the URL that it requests, nor does the status code of the request to the time of the request. If that is the case, what could be a better way to display these n columns from the logfiles than an n-dimensional plot?

When an administrator observes anomalous behavior on a web server, she reaches out for web statistics reports, as they are usually all there is as a record of past activity. These often prove fruitless, mainly because web statistics is primarily a marketing-oriented view of web server activity. The next step is to take the raw logfiles apart with ad hoc scripts. The sheer mass of data makes it difficult to reduce it to a few numbers that reveal the cause of the problem. Another complication is that you may not quite know what you are looking for other than that it is abnormal behavior. The path this article takes is to provide a visualization of raw data such that cause or causes make themselves visible. This comes from the real-life experience of a client, where crippling performance problems appeared out of nowhere.

The Plot

The scatter plot in Figure 1 shows more than half a million HTTP page requests (each request is a dot) in 3D space. The axes are:

  • X, the time axis--a full day from midnight to midnight of November 16.
  • Y, the requester's IP address, with the conventional dotted decimal format sorted and given an ordinal number between 1 and 120,000, representing the number of clients that accessed the web server.
  • Z, the URL (or content) sorted by popularity. Of the approximately 60,000 distinct pages on the site, the most popular URLs are near the zero point of the Z-axis and the least popular ones at the top.
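The Y and Z coordinates described above are ranks rather than raw values. As a minimal sketch of how such ranks could be computed (the function names `rank_ips` and `rank_urls` are illustrative and do not come from the article's own script), assuming IPv4 addresses in dotted decimal form:

```python
from collections import Counter
import ipaddress


def rank_ips(ips):
    """Map each distinct IPv4 address to a 1-based ordinal in numeric order."""
    distinct = sorted(set(ips), key=lambda ip: int(ipaddress.IPv4Address(ip)))
    return {ip: rank for rank, ip in enumerate(distinct, start=1)}


def rank_urls(urls):
    """Map each distinct URL to a 1-based rank, most requested first."""
    counts = Counter(urls)
    # Break ties alphabetically so the ranking is deterministic.
    ordered = sorted(counts, key=lambda url: (-counts[url], url))
    return {url: rank for rank, url in enumerate(ordered, start=1)}
```

Sorting on the integer value of the address rather than on the string keeps neighboring networks next to each other on the Y-axis, so "10.0.0.2" sorts before "10.0.0.10".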

3D scatter plot of a good day
Figure 1. Scatter plot showing HTTP requests

If the plotted parameters were truly orthogonal, you could expect a random distribution: a flat featureless plot. The parameters, however, are not completely independent of one another. For example, the IP ranges for Italy may prefer the Italian pages on the website. Therefore instead of a random plot, there are clusters in the 3D space. If you think about it, that does not seem unreasonable: the home page is probably the most visited page on a website. Studies (especially Jakob Nielsen on website popularity and Jakob Nielsen on traffic log patterns) argue convincingly that popularity closely follows Zipf's law: a log curve with a long tail. Hence the dense horizontal layer at the bottom in Figure 1. The vertical rectangular planes are search crawlers. They request pages over the whole content space from a small number of IP addresses and do that over the whole day. Therefore, clustering along each of the three dimensions is common.
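To get a feel for how strongly a Zipf-like distribution concentrates requests at the bottom of the Z-axis, here is a small sketch. It assumes an idealized Zipf law with exponent 1, which is an assumption for illustration only; real sites deviate from it.

```python
def zipf_shares(n_pages, exponent=1.0):
    """Return the idealized share of requests for each popularity rank 1..n_pages."""
    weights = [1.0 / rank ** exponent for rank in range(1, n_pages + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Roughly the number of distinct pages mentioned for the site in Figure 1.
shares = zipf_shares(60000)
top_100 = sum(shares[:100])  # fraction of all requests going to the 100 most popular pages
```

Under this idealized assumption, the 100 most popular pages out of 60,000 would draw a bit under half of all requests, which is consistent with a dense layer near the zero point of the Z-axis.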

The Case Study

The website of a client grew inexplicably sluggish one day. Since the web server, CMS, and auxiliary servers had run well for the preceding months, the only rational explanation pointed to an unusual request pattern. The web log-analysis reports showed nothing out of the ordinary. Command-line scripts (with awk, sort, grep, and friends) running over the logfiles also revealed no anomalies. I used Gnuplot to graph the requests in 3D space. (See also an excellent Gnuplot introduction.) Some time later, the 3D plot made the culprit evident.

3D scatter plot of a bad day
Figure 2. Scatter plot of a bad day.

The thick pillar in the plot stands out like a sore thumb. This is a dense set of requests in a short time (about 100 minutes on the X-axis, which represents 24 hours) from a single IP address (Y-axis) and going over the whole content space (Z-axis). Why should it cause trouble? Large-scale CMS servers generate content on-the-fly from a database. Caches usually handle most requests, so only the small number of requests that are not currently in the cache should require database activity. On this particular CMS, the caches keep content for 15 minutes. When the client requested all of the pages in a short time, the high number of cache misses placed a heavy load on the database. This resulted in deteriorated performance. Search crawlers such as Yahoo Slurp and Googlebot do pretty much the same thing, but they spread the load over a much longer period.
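Once the plot has shown what the anomaly looks like, the same pattern can be searched for programmatically. The following is only a sketch of one possible check, not the method used in the case study: it flags any IP address that fetches many distinct URLs within a sliding time window. The function name and both thresholds are hypothetical and would need tuning for a real site.

```python
from collections import defaultdict


def find_pillars(requests, window_seconds, min_distinct_urls):
    """Return {ip: peak distinct URLs in any window} for IPs reaching the threshold.

    requests is an iterable of (epoch_seconds, ip, url) tuples in any order.
    """
    by_ip = defaultdict(list)
    for ts, ip, url in requests:
        by_ip[ip].append((ts, url))

    suspects = {}
    for ip, hits in by_ip.items():
        hits.sort()
        in_window = defaultdict(int)  # url -> occurrences inside the current window
        start = 0
        best = 0
        for ts, url in hits:
            in_window[url] += 1
            # Drop requests that have fallen out of the window.
            while ts - hits[start][0] > window_seconds:
                old_url = hits[start][1]
                in_window[old_url] -= 1
                if in_window[old_url] == 0:
                    del in_window[old_url]
                start += 1
            best = max(best, len(in_window))
        if best >= min_distinct_urls:
            suspects[ip] = best
    return suspects
```

A window close to the cache lifetime (15 minutes on the CMS described above) is a natural starting point, since distinct pages requested within that span are the ones most likely to miss the cache.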

The Process

Now that you have seen the output, here's how to generate it. The input is, of course, an access logfile that has lines of data, one per HTTP request. A typical line from an Apache server conforms to the NCSA combined access logfile standard. (See the Combined Log Format description.) Note that I've wrapped the long line:

- - [15/Jan/2006:21:12:29 +0100] "GET
/index.php?level=2 HTTP/1.1" 200 5854 ""
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3)
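A line in this format can be split into named fields with a regular expression. The sketch below is an illustration in Python rather than the article's Perl; the pattern covers only the leading fields that the plot needs, and the client address used in the usage note (192.0.2.1, a documentation address) is a made-up placeholder.

```python
import re
from datetime import datetime

# Leading fields of a combined-format line: client, identity, user, [time], "request", status, size.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    return fields
```

For example, `parse_line('192.0.2.1 - - [15/Jan/2006:21:12:29 +0100] "GET /index.php?level=2 HTTP/1.1" 200 5854 "-" "Mozilla/5.0"')` yields the URL `/index.php?level=2` and the status code 200. Returning None for unparseable lines lets a caller skip damaged entries instead of aborting the whole run.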

The Perl script at the end of the article takes sequences of these lines and condenses them to just what Gnuplot needs. Run it with an access logfile and redirect it to an output file, such as gnuplot.input, from the command line:

$ perl access_log > gnuplot.input

The output will be a series of lines matching those of the access logfile. For the previous line from the access log, the corresponding output is:

15/Jan/2006:21:12:29 906 41 200

The fields in gnuplot.input, the output file of the Perl script, are date/time, IP rank (906), URL rank (41), and status code (200).

To display the sequence of lines in Gnuplot, give it the commands:

$ gnuplot
set style data dots
set xdata time
set timefmt "%d/%b/%Y:%H:%M:%S"
set zlabel "Content"
set ylabel "IP address"
splot "gnuplot.input" using 1:2:3


If the plot is too dense--as was the case for me--thin it down by
telling Gnuplot to only use every nth data point. For example, I
thinned Figure 1 by plotting every tenth point with the Gnuplot splot command:

splot "gnuplot.input" using 1:2:3 every 10

Figure 3 shows the corresponding scatter plot.

Thinned 3D scatter plot of a good day

Figure 3. Thinned scatter plot

Gnuplot makes it easy to focus on a part of the plot by setting the
axes ranges. Figure 4 shows a small part of the Y- and Z-axes. The
almost continuous lines that run parallel to the time axis are
monitoring probes that regularly request the same page. Four of them
should be clearly visible. In addition, I changed the eye position.

Monitoring probes visible after reducing the Y and Z ranges.

Figure 4. Reduced Y and Z ranges showing monitoring probes

Because real people need sleep, it should be possible to make out
the diurnal rhythms that rule our lives. This is evident in Figure 4.
The requests are denser from 08:00 to about 17:00 and quite sparse in
the early hours of the morning.
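The daily rhythm can also be checked numerically by counting requests per hour. This is a minimal sketch that assumes the timestamps have the format of the first column of gnuplot.input; the function name is illustrative.

```python
from collections import Counter
from datetime import datetime


def requests_per_hour(timestamps):
    """Count requests in each hour of the day (0-23).

    timestamps are strings such as '15/Jan/2006:21:12:29'.
    """
    counts = Counter(
        datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S").hour for stamp in timestamps
    )
    return [counts.get(hour, 0) for hour in range(24)]
```

A list of 24 counts is easy to compare between a normal day and a suspicious one, although it loses the per-client detail that makes the 3D plot useful.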

Changing the viewing angle can give you a new point of view. Gnuplot lets you do it in one of two ways: with the set view command or interactively with a click and drag of the mouse.

The Pièce de Résistance

Because a display of 3D plots is difficult to see in three
dimensions without stereoscopic glasses, I used a few more
manipulations to "jitter" the image such that the depth in the picture
is visible. The plot in Figure 5 is an example of this. It was easy to
generate with more Gnuplot commands followed by GIF animation with ImageMagick.

An animated scatter plot

Figure 5. An animated GIF of the scatter plot that hints at the 3D structure

Further Work

With Gnuplot 4.2, which is still in beta, it is now possible to draw
scatter plots in glorious color. Initial tests show that using color
for the status code dimension makes the plots even more informative.
Stay tuned.


Though the 3D plots present no hard numbers or trend lines, the
scatter plot as described and illustrated above may give a more
intuitive view of web server requests. Especially when diagnosing
problems, this alternative way of presenting logfile data can be more
useful than the charts and reports of a standard log analyzer tool.

Code Listings

The Perl script:

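# convert access log files to gnuplot input
# Raju Varghese. 2007-02-03

use strict;

my $tempFilename = "/tmp/temp.dat";
my $ipListFilename = "/tmp/iplist.dat";
my $urlListFilename = "/tmp/urllist.dat";

# request counts per IP address and per URL; replaced by ranks further down
my (%ipList, %urlList);

# convert a dotted-decimal IPv4 address to an integer for numeric sorting
sub ip2int {
    my ($ip) = @_;
    my @ipOctet = split (/\./, $ip);
    my $n = 0;
    foreach (@ipOctet) {
        $n = $n*256 + $_;
    }
    return $n;
}

# prepare temp file to store log lines temporarily
open (TEMP, ">$tempFilename") or die "cannot write $tempFilename: $!";

# reads log lines from stdin or files specified on command line

while (<>) {
    # whitespace-separated fields: IP, two identity fields, [time, zone],
    # "method, URL, protocol", status code
    my ($ip, undef, undef, $time, undef, undef, $url, undef, $sc) = split;
    $time =~ s/\[//;
    # skip images, scripts, and stylesheets
    next if ($url =~ /(gif|jpg|png|js|css)$/);
    $ipList{$ip}++;
    $urlList{$url}++;
    print TEMP "$time $ip $url $sc\n";
}

# process IP addresses: sort numerically, then replace each count with its rank

my @sortedIpList = sort {ip2int($a) <=> ip2int($b)} keys %ipList;
my $n = 0;
open (IPLIST, ">$ipListFilename") or die "cannot write $ipListFilename: $!";
foreach (@sortedIpList) {
    $n++;
    print IPLIST "$n $ipList{$_} $_\n";
    $ipList{$_} = $n;
}
close (IPLIST);

# process URLs: sort by descending request count, then replace each count with its rank

my @sortedUrlList = sort {$urlList {$b} <=> $urlList {$a}} keys %urlList;
$n = 0;
open (URLLIST, ">$urlListFilename") or die "cannot write $urlListFilename: $!";
foreach (@sortedUrlList) {
    $n++;
    print URLLIST "$n $urlList{$_} $_\n";
    $urlList{$_} = $n;
}
close (URLLIST);

# second pass: reread the temp file and print time, IP rank, URL rank, status code

close (TEMP);
open (TEMP, $tempFilename) or die "cannot read $tempFilename: $!";
while (<TEMP>) {
    my ($time, $ip, $url, $sc) = split;
    print "$time $ipList{$ip} $urlList{$url} $sc\n";
}
close (TEMP);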

How Linux and open-source development could change the way we get things done

An army of disheveled computer programmers has built an operating system called Linux based on a business model that seems to have been written with everything but business in mind. Instead of charging customers as much as the market can bear, Linux is given away for free; instead of hiding information from competitors, Linux programmers share their work with the world; instead of working for money, Linux developers are motivated primarily by adrenaline, altruism, and the respect of their peers.

Despite this unusual foundation, Linux is booming and even beginning to challenge Microsoft's control of the operating system industry. Linux may eventually pull the rug out from under the richest company in the world. It may not. But no matter what happens, it has already shown that money doesn't have to make the world, even the business world, go round. In fact, as technology improves and computers connect and create even more of our society, the principles of cooperation and collaboration that drive Linux may well spread to other fields: from computers, to medicine, to the law.

The Source

The Linux movement kick-started in 1991 when Linus Torvalds, a puckish graduate student at the University of Helsinki, got frustrated with his rickety computer. Refusing to buy another one, he wrote a new operating system--the core programs by which applications (like Microsoft Word) talk to hardware (like microprocessors). When finished, instead of running down to the patent office, he posted his code on the Internet and urged other programmers to download it and work with him to improve it. A few emailed back suggestions, some of which Torvalds took. A few more wrote the next day and a couple more the day after that. Torvalds worked constantly with these new colleagues, publicly posting each improvement and delegating responsibility to more and more programmers as the system grew. By 1994, Linux (a combination of "Linus" and "Unix," another operating system) had 100,000 users. Today, it has between 10 and 20 million users and is the fastest growing operating system in the world.

But Linux (rhymes with 'cynics') is different from almost every other operating system available. For one thing, it's downloadable for free straight off the Web. It's also open source, meaning that the source code, the program's all-important DNA, is open for anyone to look at, test, and modify. Most software is developed so that only the original authors can examine and change the code; with open-source models, however, anyone can do it if they have a computer and the right intuition.

To see the power of this model, consider what happens when you're running Microsoft Windows or Macintosh OS and your computer crashes: You stamp your feet and poke a twisted paper clip into a tiny reset button. You probably don't know what happened and it's probably going to happen again. Since you've never seen the source code, it probably doesn't even occur to you that you could fix the problem at its root. With Linux, everything's transparent and, even if you aren't an expert, you can simply post your question on a Linux-help Web page and other users can usually find solutions within hours, if not minutes. (The amorphous Linux community recently won InfoWorld's Product of the Year award for Best Technical Support.) It's also entirely possible that someone--perhaps you--will write some new code that fixes the problem permanently and that Linux developers, led by Torvalds, will incorporate into the next release. Presto, that problem's fixed and no one will need paper clips to fix it again.

To make another analogy, fixing an error caused by a normal software product is like trying to fix a car with the hood welded shut. With Linux, not only can you easily pop the hood open, there is extensive documentation telling you how everything works and how it all was developed; there's also a community of thousands of mechanics who will help you put in a new fuel pump if asked. In fact, the whole car was built by mechanics piecing it together in their spare time while emailing back and forth across the Web.

The obvious threat to this type of open development is appropriation. What if someone lifts all the clever code off the Web, copyrights it, and sells it? What if someone takes the code that you wrote to fix your crashed computer (or busted fuel pump), copyrights it, and markets it for $19.95? Well, they can't. When Torvalds created Linux, he protected it under the GNU General Public License, an intriguing form of copyright commonly known as copyleft. Under copyleft, anyone who redistributes the program, with or without changes, must pass along the freedom to further copy, change, and distribute it. Theoretically one can download Linux off the Web, add a string of useful features, and try to sell it for $50. But anyone who buys this new version can just copy it, give it away, or sell it for a dollar, thus destroying the incentive for piracy. An equivalent would be if I were to write the following at the end of this article: "Verbatim copying and redistribution of this entire article is permitted in any medium provided this notice remains at the end."

Use the Source

From the Oxford English Dictionary to hip-hop music, open-source development has always been with us to some degree. Since the mid-19th century, contributors to the OED have defined words and sent them to a centralized location to be gathered, criticized, sorted, and published. With music, as long as you stay within certain copyright laws, you can take chunks of other people's compositions and work them into your own. Most academic research is also built on open-source cooperation. Even the compact fluorescent light bulb above my head comes from data shared by researchers over centuries about electricity, properties of glass, and centralized power.

But still, most business isn't done with open source. Coca-Cola keeps its formula secret, Microsoft won't tell you how it builds its programs, and when a researcher for Ford suddenly stumbles upon the means to a more efficient fuel pump, she doesn't reflexively email her friend at Honda with a precise description of her discovery. A great deal of scientific and medical research is also done through closed source as individual laboratories race each other to determine who'll be the first to find the answer and earn the patent.

But two extraordinary changes are making open source substantially more plausible as a development and research model for 2000 than it was for 1990--and they'll make it even more so for 2010. First, the Internet. Today, I can open my Web browser and communicate instantly with Burmese refugees or writers working on projects similar to mine. Secondly, computer power has been increasing exponentially for generations and will probably continue to do so--in large part because every time you build a faster computer it allows you to build a faster one still. It's difficult to overestimate the change. The standard laptop to which I'm now dictating this article (with technology unavailable just two years ago) has more power than almost any computer owned by the government a decade ago. In four years, it could well be obsolete. As author Ray Kurzweil and others have pointed out, if cars had improved as much over the past 50 years as computers, they'd cost less than a nickel and go faster than the speed of light.

Intellectual and Physical Properties

This rate of progress is critical because the advantages of open-source development depend on the powers of technology and the balance between what can be done through thinking and what has to be done by building. Every product has certain intellectual components and certain physical components built into it. With a car, for example, the intellectual component is the thought about how to build it, how to set up the assembly lines, and how to process data you have on different kinds of tires. The physical components include the actual rubber in the tires, the machines that ran the tests, the operation and maintenance of factories.

Faster computers and increased connectivity are drastically changing the relationship between these components in two ways. First, some things that used to require physical components no longer do. You may not have to buy rubber to test the tires if you can just run a simulator online. (The 777 project at Boeing was designed so that nothing physical was produced before the plans hit the factory floor.) Second, connectivity makes the flow of information much faster and smoother and greatly facilitates the sharing of ideas and data. There is a saying known as "Linus' law" that "given enough eyes, all bugs are shallow." In other words, given enough people working on them, all problems are solvable. And the Internet has not only helped coordinate a lot more eyes: in some ways, it's given everyone glasses.

Open-source development benefits from this transition because its advantages are almost all in the realm of intellectual property. Open source improves communication and facilitates sharing ideas. But it doesn't mean that you can buy a ton of concrete for free or drive a nail into a wall without a hammer. This is why open source has come first and most prominently to computer programming: a profession where almost all of the development is intellectual, not physical, and where people have been connected over the Internet for more than 20 years. Programming also employs highly specific, common tools--computers--that a fairly large number of people have access to and that some people, such as university students, can even access for free. If a solution or improvement is found, nothing additional needs to be built; code just needs to be entered and then downloaded.

But there is still one great problem standing firmly in the way of even the most modern open-source software development project. As a 21-year-old Bill Gates asked in an angry letter to open-source programmers in 1976: "One thing you do is prevent good software from being written. Who can afford to do professional work for nothing?"

Microsoft's empire, and a great deal of the rest of our society, is built upon the assumption that there isn't an answer to this rhetorical question. To survive, organizations need to patent their information, protect their secrets, and earn as much money as they can. But the success of Linux shows that there's a way around that model. Something extraordinary has been built with a completely different set of rules. I asked David Soergel, a researcher in the department of genetics at Stanford, whether people could use the open-source model to develop medicines. "My first reaction: they'd rather get paid, and if they were any good at it, then they would be paid by somebody. My second reaction: wait a minute, that argument obviously fails for software."

Money for Nothing

Naive as it may be to think that people aren't motivated by money, it is just as naive to think that people are only motivated by money. People are motivated by a variety of factors: money, recognition, enjoyment, a belief that one is doing something good for the world, and so on. We each weigh these factors and make decisions based on our perceptions of their relative importance. At different points in our lives, we give different factors different weights. When we're poor, we tend to value high-paying work more than we do when we're well-off; it usually takes a high-paying job to get someone to do something really boring, and it generally takes a very fulfilling job to get someone to work for less than what he should normally be able to earn.

Since people working on open-source projects generally earn nothing, or very little, there need to be other incentives. In Linux, there seem to be principally three. First, enjoyment. Computer programming can be addictive, exciting, and extraordinarily intense. Linux can be particularly enjoyable because almost every problem solved is a new one. If your Windows machine crashes, fixing the problem generally entails tediously working through scores of repair procedures which you may have used a thousand times. If A fails, try B. If B fails, try C. If a Linux computer crashes, anyone who repairs it is not only working on that one specific machine, he's finding a solution for the whole Linux community.

Eric Roberts, a computer science professor at Stanford, once explained to The New York Times that people in the profession must be "well trained to do work that is mind-numbingly boring." This is particularly true of work on most closed-source systems where programmers must continually reinvent the wheel. But with open-source projects, the things that need to be done haven't been done before. According to one only slightly hyperbolic programmer, Ali Abdin, writing to a Linux news group about how he felt after creating his first open-source project: "The feeling I got inside when I knew that I had some code out there that I can share with people is indescribable... I felt on top of the world, that I can program anything...I felt as mother would feel giving birth to a child, giving it life, for the first time."

Secondly, and similarly, Linux programmers are motivated by a feeling that they are changing the world and developing an operating system that really works. Torvalds laid out this philosophy well in a speech this summer: "I don't resent Microsoft for making lots of money. I resent them for making bad software."

Thirdly, but most significantly, Linux programmers seem motivated by prestige and, in particular, respect from their peers. Having "hacked the kernel" (contributed to the core of the operating system) gives programmers a certain stature--much as completing a four-minute mile does among runners--and, since the program is open source, everyone knows exactly who contributed what. I was once introduced to a programmer described as "the guy who wrote all the Ethernet device drivers!" as though I was meeting Jonas Salk "who came up with the cure for polio!" And, in fact, Linux programmers often discuss their work as falling in the tradition of eminent scientists. As three well-known programmers put it in the introduction to their book Open Sources: "It would be shortsighted of those in the computer industry to believe that monetary reward is the primary concern of open source's best programmers... These people are involved in a reputation game and history has shown that scientific success outlives financial success... When the history of this time is written a hundred years from now, people will perhaps remember the name of Bill Gates, but few other computer industrialists. They are much more likely to remember names like... Linus Torvalds."

Importantly, this philosophy may well be helping Linux develop creatively. There is a great deal of psychological research that shows that people actually do more creative work when they aren't motivated primarily by money. Tell a child that you'll pay her for reading a book and she'll read it with little imagination. Have one group of college poets think about getting rich and famous through their writing, according to research done by Harvard Professor Teresa Amabile, and they tend to turn out less creative work than a second group that's just asked to write poems. Is it possible that Linux programmers created such an extraordinary operating system in part because they were driven by other factors and weren't doing it for the money? I asked Professor Amabile if the implications of her research cross over to open-source programming and whether it could explain some of the remarkable innovations that have come from people working without pay. "Yes," she responded, "this [would be] entirely consistent."

Making free software affordable

Still, Linux programmers are not completely locking themselves out of the economy and there's a second response to Gates' rhetorical question: If your core open-source project is successful enough, it's possible to eventually make money off of it indirectly. No, it's not possible to make as much money as a proprietary company can--open source and copyleft will ensure this--and there's always going to be an astounding amount of work that has to be done without financial reward. But open-source programmers don't have to starve.

The trick to making money off Linux, or any open-source development, is to profit off of derivatives. No one actually makes money off the Linux code, but they can make money by offering technical support or helping customers install the program. Companies that do this follow a well-trodden path: give something away in order to sell something else. This is what cellular phone companies do when they give you the actual telephone handset for free if you agree to pay for using it a certain number of minutes a month. We do the same thing with the Monthly's Web page. We post some articles (giving them to readers for free) in the hope that visitors' interest will be piqued and that they'll subscribe.

Red Hat, the best-known Linux company, sells the operating system and other open-source programs in a box for $80, though you can download their product for free. Their revenue comes from the technical support they offer and from consumers' need for trust. It's much less unsettling to buy a program like Linux if you get it shrink-wrapped with a manual than if you have to download both. VA Linux, another well-known company, sells Linux hardware: You choose the memory and the motherboard; it builds you a computer.

Has the money these companies brought into the open-source movement corrupted it and made it more like the traditional model that Microsoft uses? Surely, there are a lot of people doing promotional and administrative work at Red Hat for the money and there are probably even some people working on Linux for Red Hat because they get paid a lot to do it (the company hires programmers to write code that anyone, including Red Hat's competitors, can use). But programmers mostly see the money as an added and surprising plus, and Linux is still driven by adrenaline, altruism, and peer recognition.

While it is possible that this could change, it hasn't so far. I asked Richard Stallman--the creator of copyleft, as well as many of the programs that run with Linux through a system called GNU, and the man often considered to be the father of the open-source movement--whether he thought that money would change the attitudes of people who used to work on GNU/Linux without being paid. "In general, I don't think that wealth will make a hacker into a worse person. It would be more likely to enable the hacker to spend more time volunteering for free software instead of on work for pay."

This point is particularly germane because most open-source programmers have always had to work other jobs, and many have only been able to contribute to the project during the evenings or when their employers weren't looking. Linus Torvalds, for example, helps design microprocessors for a company called Transmeta in the daytime and does all his Linux coding after work. When I asked John Hall, vice president of VA Linux, what motivates programmers, he responded: "For some, it's altruism. For some, it's fame. For some, it's religion. For a very few, it's money."

So what's next?

To determine where open source is likely to move next, one has to imagine a scenario where these obstacles can be overcome. A project would need to be fun, or at least rewarding, to get going and it would have to primarily pose intellectual, not physical, challenges. Once it began to move, derivative financial incentives could help push it along. There also has to be a certain amount of luck, particularly with regard to organization. It's hard enough to get six people to agree to a restaurant for dinner; it's much harder to coordinate thousands of people known to each other only by email addresses. Linux has gotten around this latter problem in no small part because of Torvalds himself. He is a benevolent dictator who has earned the trust and respect of virtually everyone on the project. He's a relaxed, friendly, funny guy whose strength of character keeps his distended organization free from the internal battles that sink so many others. He also has learned how to delegate and has developed a core of equally well-respected close associates who coordinate different parts of the project.

One intriguing possibility for future open-source development comes from medicine, an area where people can become passionate and where intellectual components can far exceed physical components. Consider a smart doctor who loses a friend to rare disease X and decides to devote her life to finding a cure. Ten years ago, trying to develop anything more than a local network to collaboratively design a drug to cure the disease would have been extremely difficult. Communication would have had to be done over the phone or with photocopies slowly and expensively mailed. It made much more sense to have small groups of developers working together in laboratories, foundations, or universities.

Today the possibilities for open collaboration have improved. An ambitious doctor can network online with other researchers interested in disease X and, at the least, can quickly exchange data about the newest research techniques. In fact, there are already medical networks that pass around information about acute medical cases, using email and computers that can automatically send out patient files over a network and put X-rays into the overnight mail.

Now think another decade ahead when everyone will have high-speed Internet lines at least 500 times as fast as standard connections today (this will probably happen in three years), where it is fairly likely that we all will be able to simulate the movement of disease X online, and where it will surely be possible for medical students to run tests that approximate the human immune system on high-powered laboratory computers. Now the same doctor can farm out parts of the project to interested collaborators, many of whom have also lost friends to X and are passionate about finding a cure. If the coordinator is a good organizer and can hold people together the way that Torvalds has, the organization could grow, attracting even more people to join the collaborative effort with each success. Every breakthrough or improvement in the model could be posted online so that other participants could begin work on the next challenge. If a sample test is performed, data could be transferred to the Web simultaneously.

Eventually a prototype could be developed and adopted by an established drug company (or perhaps even a non-profit company, funded by foundations, that specializes in distributing open-source drugs and selling them at minimal costs) that licenses the product with the FDA, runs it through the necessary tests, and then manufactures, distributes and sells it--keeping prices relatively low both because no company would have exclusive copyrights and because research costs (drug companies' largest expense) would be drastically reduced.


A real-life example of another possible opportunity for open source comes from Harvard, where law Professors Larry Lessig and Charles Nesson have started the Open Law Project, an attempt to try cases using the open-source model. Interested people sign into the Website, read what other contributors have written, and help to develop arguments and briefs. According to the site description, "what we lose in secrecy, we expect to regain in depth of sources and breadth of argument." The program is run under the same sort of benevolent dictatorship model as Linux, with Lessig serving as chief. People brainstorm and debate, and then Lessig synthesizes and writes the briefs. Currently, the group is constructing arguments challenging, appropriately enough, the United States Copyright Term Extension Act.

There are great advantages to this model for law: The problems faced by lawyers are mostly intellectual, not physical; there is an abundance of people (especially law students) who are potentially willing to work on the projects for free; and there is the powerful draw of doing something you believe is a public service. If you don't agree with current copyright laws, join the group and figure out how to change them.

Of course, open-law will never be able to handle certain kinds of cases. As Nesson said to me, "open-law is not conducive to ambush." If you need to rely on secret arguments or evidence that the other side doesn't know about, you're not going to post everything on the Net. But if you just need to develop the best argument and rely on increased information flows, the model could work quite well.


It is very difficult to determine exactly where open-source projects will take off next. So much depends on the personalities of the coordinators and the excitement they are able to generate; a great deal also depends on the different ways that technology develops and the way that different markets and research fields change. For some organizations, open-source development will make more sense in a couple of years than it does now. For others, it will make less. But the overall trends of technology are likely to push open source closer and closer to the mainstream.

Imagine a scale with all the advantages of a proprietary model on the left and all the advantages of an open-source model on the right. Pretend everybody who wants to solve a problem or build a project has a scale like this. If it tips to the left, the proprietary model is chosen; if it tips to the right, the open model is chosen. Now, as connectivity increases with the Internet, and computer power increases exponentially, more and more weight accumulates on the right. Every time computer power increases, another household gets wired, or a new simulator is built online, a little more weight is added to the right. Having the example of Linux to learn from adds some more weight to the right; the next successful open-source project will add even more.

Not enough is added to the right side to tip the scale for everybody and everything, but open source is presently growing and it should only continue that way. Netscape has made its Web browser open source. Sendmail, the program that routes most of our email, is open source. Most Web sites use an open-source program called Apache at their core. Even some microchip developers are starting to use open source.

Perhaps the next boom in open source will come from the law; perhaps from drug X; perhaps it will be something entirely different. Although it's difficult to tell, it is quite likely that the scale is going to tip for some projects and that there will be serious efforts at open-source development in the next decade. Moreover, it's quite likely some of these projects will work. Open source has created the fastest growing operating system in the world and it's done so by capitalizing on changes in technology that will almost certainly seem commonplace in a decade or two. Linux will continue to grow; but 10 years from now, it will probably no longer be the largest open-source project in the world.

Thursday, February 22, 2007

A Watershed for Open Source

Open-source software isn't a new phenomenon. It has been winding its way through the tech world for decades, starting with Richard Stallman's Free Software movement in the 1980s. But only in recent years have businesses warmed to the promise of low-cost, openly available software. In fact, open-source programs have become so popular, they now pose a legitimate threat to the established software giants.

Looking back, 2005 will likely be viewed as a turning point. It was a year when CIOs signed off on open-source projects, a big change from previous years when that happened only after low-level engineers started such projects on their own initiative. It was a year when venture capitalists woke up to the new business opportunities of open source. It was a year when open source was the word on the lips of not just early adopters but of an early majority. According to a new study by consulting firm Optaros, 87% of organizations are now using open-source software, somewhere.

BusinessWeek Online paused in the final days of 2005 to poll a dozen experts, investors, early adopters, and entrepreneurs to get their take on the five biggest open-source events of 2005 -- as well as what to expect for 2006. The following are based on their responses.

1. Red Hat finally proves to everyone it can make money from free software. It took Red Hat (RHAT), which sells and supports a version of the Linux operating system for businesses, nearly 10 years to find its footing, but boy has it. On Dec. 22 it announced stellar third-quarter earnings, with revenues up 43.6%, to $73.1 million, and profits up 114%, to 12 cents per share.

Finally, the Linux movement has a pure open-source success story to point to, and as practically the only vendor that's publicly traded, Red Hat has become a hot commodity. The stock is trading north of $28 as of Dec. 27, up from $13.06 at the beginning of 2005 -- a boost of more than 110%.

And Wall Street is bullish about next year. "Red Hat is one of the best-positioned stocks in software and should be able to further capitalize on the growing demand for open source," wrote Credit Suisse First Boston analyst Jason Maynard in a post-earnings research note.

2. Sun Microsystems open sources everything -- except Java. One reason Linux is becoming mainstream is the broad endorsement from just about everyone who matters in techdom, whether it's Dell (DELL) or Hewlett-Packard (HPQ), whose servers run Linux, or IBM (IBM), which is making a name in open-source support and integration.

Enter Sun Microsystems (SUNW), which made a bold move in late November to open-source almost all of its software except Java. The move transformed Sun into one of the largest open-source software players overnight. Yet critics have complained that what open-source developers really want is Java.

Several experts expect that Sun might finally capitulate in 2006. "It took them a long time to realize if you don't open-source and you're not a market leader, you're dead," says Peter Yared, CEO of open-source startup Active Grid and a former Sun executive.

Novell (NOVL) is another company trying to revive its business through open source. The results have been mixed since it bought Red Hat competitor Suse Linux two years ago. Look for 2006 to be the year it gets its act together -- or gets a new management team (see BW, 10/31/05, "Cold Realities for Novell").

3. Motorola bets big on mobile Linux. Linux is commonplace on servers and is working its way onto many desktops around the world. But desktop- and server-makers don't have to worry about details like battery life. Wireless-phone manufacturers do, and that's Linux' next great frontier. Open Source Development Labs, a nonprofit group that governs and advocates for Linux, formed a Mobile Linux Initiative in October to address these problems.

Even more exciting for penguin lovers, Motorola (MOT), the second-biggest handset maker in the world, announced that Linux would be its standard operating system for the bulk of its future phones. If the OSDL makes progress on the code, other handset makers could follow suit in 2006 (see BW Online, 11/8/05, "Linux Answers Phone Makers' Call").

4. Firefox goes mainstream. The bulk of open-source strides have been made in the business world, as most Linux phones are only sold in China, and Microsoft (MSFT) still dominates the desktop. Firefox is an important exception. The popular browser marked its 100 millionth download in October just before its first birthday, proving how well a mass market can accept open-source software when done right.

"There was a question as to whether we [open-source developers] could do user interfaces, and that's much less of a question now," says Bruce Perens, head of developer relations for open-source startup SourceLabs. Perens and some others think Linux desktop programs could gain steam among consumers in 2006, particularly in emerging countries in Asia and South America where Microsoft's Windows hasn't gained dominance.

5. Venture capitalists wake up to open source. Industry estimates show some $400 million was invested in open-source startups in 2005. Two types of companies dominated the landscape. The first is so-called application companies, such as SugarCRM, which makes customer relationship management software for companies and aims to compete with Siebel (SEBL) and Salesforce.com (CRM).

The other category is services companies, which play the middleman between open-source projects and the info-tech departments at large corporations. Companies such as SpikeSource and SourceLabs test and maintain applications like SugarCRM for companies (see BW Online, 10/3/05, "Open Source: Now It's an Ecosystem").

There's a lot of skepticism about these newer entrants. A few are hits, such as MySQL, which makes open-source database software and is said to be closing in on $40 million in revenues this year. But not too many others are showing much traction.

In 2006, they'll have to put up real revenues or shut down. "Half the companies that raised venture money in 2005 won't be able to raise money in 2006," says Matt Asay, who organizes the annual Open Source Business Conference and is vice-president for business development at Alfresco, an open-source document-management startup.

All in all, it has been a pretty great year for open source. And 2006 may be even bigger and better.

Open-Source Software Development


Open source is software developed by uncoordinated but loosely collaborating programmers, using freely distributed source code and the communications infrastructure of the Internet. Open source has a long history rooted in the Hacker Ethic. The term open source was adopted in large part because of the ambiguous nature of free software. Various categories of free and non-free software are commonly referenced, some with interchangeable meanings. Several licensing agreements have therefore been developed to formalize distribution terms. The Cathedral and the Bazaar is the most frequently cited description of the open-source development methodology; however, although the paper identifies many mechanisms of successful open-source development, it does not expose the dynamics. There are literally hundreds, if not thousands, of open-source projects currently in existence.

1. Introduction

Open source has generated a considerable amount of interest over the past year. The concept itself is based on the philosophy of free software, which advocates freely available source code as a fundamental right. However, open source extends this ideology slightly to present a more commercial approach that includes both a business model and development methodology.

Open Source Software, or OSS, refers to software for which the source code is distributed without charge or limitations on modifications. Open source sells this approach as a business model, emphasizing faster development and lower overhead, as well as a closer customer relationship and exposure to a broader market.

Open source also encompasses a software development methodology. Open source development is described as a rapid evolutionary process, which leverages large-scale peer review. The basic premise is that allowing source code to be freely modified and redistributed encourages collaborative development. The software can be incrementally improved and more easily tested, resulting in a highly reliable product.

1.1 Background

Much of the Internet infrastructure is open-source software. For example, Sendmail is the dominant mail transfer system on the Internet. BIND is the most widely used implementation of the Internet Domain Name System, and InterNetNews is the most popular Usenet news server (O’Reilly, 1998a). It is therefore no surprise that the momentum associated with open source has coincided with the rapid growth of the Internet. The Web has made collaboration between programmers easier and possible on a larger scale than before, and projects such as Linux and Apache have become immensely successful. The projected size of various open-source communities is shown in Table 1 (O'Reilly, 1998b).

Table 1. Projected size of open-source communities.

Project | Estimated size of user community
Linux | 7,000,000
Perl | 1,000,000
BSD | 960,000
Apache | 400,000
Sendmail | 350,000
Python | 325,000
Tcl/Tk | 300,000
Samba | 160,000

The response from the software industry has been varied, but open source has made some notable inroads in a relatively short time. IBM has adopted the Apache Web server as the cornerstone of its WebSphere Internet-commerce application server (IBM, 1998). IBM has also released the source code for an XML parser and Jikes, a Java compiler (Gonsalves, 1998). Netscape has released the source code for the next generation of its popular Communicator product, restructuring ongoing development as an open-source project (Charles, 1998). Apple has taken a similar approach, releasing portions of its next generation operating system, MacOS X, as open source (Apple, 1999a). Microsoft acknowledged open source as a potential business threat in an internal memo that was subsequently leaked to the press (Valloppillil, 1998), and has recently indicated that it may consider releasing some code.

These developments demonstrate a sustained interest in open source, and it is quickly becoming a viable alternative to conventional methods of software development, as companies attempt to leverage the Internet in reducing time to market.

2. History

2.1 The Hacker Ethic

Open source is firmly rooted in the Hacker Ethic. In the late 1950’s, MIT’s computer culture originated the term hacker, defined today as "a person who enjoys exploring the details of programmable systems …" (Raymond, 1996). Various members of the Tech Model Railroad Club, or TMRC, formed the nucleus of MIT’s Artificial Intelligence Laboratory. These individuals were obsessed with the way systems worked. The word hack had long been used to describe elaborate college pranks devised by MIT students; however, TMRC members used the word to describe a task "imbued with innovation, style, and technical virtuosity" (Levy, 1984). A project undertaken not solely to fulfill some constructive goal, but with some intense creative interest, was called a hack.

Projects encompassed everything electronic, including constant improvements to the elaborate switching system controlling the TMRC’s model railroad. Increasingly though, attentions were directed toward writing computer programs, initially for an IBM 704 and later on the TX-0, one of the first transistor-run computers in the world. Early hackers would spend days working on programs intended to explore the limits of these machines.

In 1961, MIT acquired a PDP-1, the first minicomputer, designed not for huge number-crunching tasks but for scientific inquiry, mathematical formulations, and of course hacking. Manufactured by Digital Equipment Corporation, the PDP series of computers pioneered commercial interactive computing and time-sharing operating systems. MIT hackers developed software that was freely distributed by DEC to other PDP owners. Programming at MIT became a rigorous application of the Hacker Ethic, a belief that "access to computers – and anything which might teach you something about the way the world works – should be unlimited and total" (Levy, 1984).

2.2 ARPAnet

MIT was soon joined by Stanford University’s Artificial Intelligence Laboratory and later Carnegie-Mellon University. All were thriving centres of software development able to communicate with each other through the ARPAnet, the first transcontinental, high-speed data network. Built by the Defense Department in the late 1960’s, it was originally designed as an experiment in digital communication. However, the ARPAnet quickly grew to link hundreds of universities, defense contractors, and research laboratories. This allowed for the free exchange of information, particularly software, with unprecedented speed and flexibility.

Programmers began to actively contribute to various shared projects. These early collaborative efforts led to informal principles and guidelines for distributed software development stemming from the Hacker Ethic. The most widely known of these projects was Unix, which contributed to the ongoing growth of what would eventually become the Internet.

2.3 Unix and BSD

Unix was originally developed at AT&T Bell Labs, and was not strictly speaking a freely available product. However, it was licensed to universities for a nominal sum, which resulted in an explosion of creativity as programmers built on each other’s work.

Traditionally, operating systems had been written in assembler to maximize hardware efficiency, but by the early 1970’s hardware and compiler technology had become good enough that an entire operating system could be written in a higher level language. Unix was written in C, and this provided unheard-of portability between hardware platforms, allowing programmers to write software that could be more easily shared and dispersed.

The most significant source of Unix development outside of Bell Labs was the University of California at Berkeley. UC Berkeley’s Computer Systems Research Group folded its own changes and other contributions into a succession of releases. Berkeley Unix came to be known as BSD, or Berkeley Software Distribution, and included a rewritten file system, networking capabilities, virtual memory support, and a variety of utilities (Ritchie, 1979).

A few of the BSD contributors founded Sun Microsystems, marketing Unix on 68000-based hardware. Rivalry ensued between supporters of Berkeley Unix and AT&T versions. This intensified in 1984, when AT&T divested and Unix was sold as a commercial product for the first time through Unix System Laboratories.

2.4 The GNU Project

The commercialization of Unix not only fractured the developer community, but it resulted in a confusing mass of competing standards that made it increasingly difficult to develop portable software. Other companies had entered the marketplace, selling various proprietary versions of Unix. Development largely stagnated, and Unix System Laboratories was sold to Novell after efforts to create a canonical commercial version failed. The GNU project was conceived in 1983 to rekindle the cooperative spirit that had previously dominated software development.

GNU, which stands for GNU’s Not Unix, was initiated under the direction of Richard S. Stallman, who had been a later participant in MIT’s Artificial Intelligence Lab and believed strongly in the Hacker Ethic. The GNU project had the ambitious goal of developing a freely available Unix-like operating system that would include command processors, assemblers, compilers, interpreters, debuggers, text editors, mailers, and much more (FSF, 1998a).

Stallman created the Free Software Foundation, an organization that promotes the development and use of free software, particularly the GNU operating system (FSF, 1998c). Hundreds of programmers created new, freely available versions of all major Unix utility programs. Many of these utilities were so powerful that they became the de facto standard on all Unix systems. However, a project to create a replacement for the Unix kernel faltered.

By the early 1990’s, the proliferation of low-cost, high-performance personal computers along with the rapid growth of the World Wide Web had reduced entry barriers to participation in collaborative projects. Free software development extended to reach a much larger community of potential contributors, and projects such as Linux and Apache became immensely successful, prompting a further formalization of hacker best practices.

2.5 The Cathedral and the Bazaar

The Cathedral and the Bazaar (Raymond, 1998a), a position paper advocating the Linux development model, was first presented at Linux Kongress 97 and made widely available on the Web shortly thereafter. The paper presents two contrasting approaches to software development. The Cathedral represents conventional commercial practices, where developers work using a relatively closed, centralized methodology. In contrast, the Bazaar embodies the Hacker Ethic, in which software development is an openly cooperative effort.

The paper essentially ignored contemporary techniques in software engineering, using the Cathedral as a stand-in for the waterfall lifecycle of the 1970s (Royce, 1970); however, it served to attract widespread attention. A grassroots movement quickly developed, culminating in a January 1998 announcement that Netscape Communications would release the source code for its Web browser. This was the first time that a Fortune 500 company had transformed an enormously popular commercial product into free software.

The term Open Source was coined shortly afterward out of a growing realization that free software development could be marketed as a viable alternative to commercial companies.

3. Definition

The term open source was adopted in large part because of the ambiguous nature of the expression free software. The notion of free software does not mean free in the financial sense, but instead refers to the users' freedom to run, copy, distribute, study, change and improve software. Confusion over the meaning can be traced to the problem that, in English, free can mean no cost as well as freedom. In most other languages, free and freedom do not share the same root; gratuit and libre, for instance. "To understand the concept, you should think of free speech, not free beer," writes Richard Stallman (FSF, 1999a).

3.1 Categories of Free and Non-Free Software

Due to the inherent ambiguity of the terminology, various wordings are used interchangeably. This is misleading, as software may be interpreted as something it is not. Even closely related terms such as free software and open source have developed subtle distinctions (FSF, 1998b).

3.1.1 Public Domain

Free software is often confused with public domain software. If software is in the public domain, then it is not subject to ownership and there are no restrictions on its use or distribution. More specifically, public domain software is not copyrighted. If a developer places software in the public domain, then he or she has relinquished control over it. Someone else can take the software, modify it, and restrict the source code.

3.1.2 Freeware

Freeware is commonly used to describe software that can be redistributed but not modified. The source code is not available, and consequently freeware should not be used to refer to free software.

3.1.3 Shareware

Shareware is distributed freely, like freeware. Users can redistribute shareware; however, anyone who continues to use a copy is required to pay a modest license fee. Shareware is seldom accompanied by the source code, and is not free software.

3.1.4 Open Source

Open source is used to mean more or less the same thing as free software. Free software is "software that comes with permission for anyone to use, copy, and distribute, either verbatim or with modifications, either gratis or for a fee." (FSF, 1999a) In particular, this means that source code must be available.

Free software is often used in a political context, whereas open source is a more commercially oriented term. The Free Software Foundation advocates free software as a right, emphasizing the ethical obligations associated with software distribution (Stallman, 1999). Open source is commonly used to describe the business case for free software, focusing more on the development process than on any underlying moral requirements.

3.2 Licensing

Various free software licenses have been developed. The licenses each disclaim all warranties. The intent is to protect the author from any liability associated with the software. Since the software is provided free of charge, this would seem to be a reasonable request.

Table 2 provides a comparison of several common licensing practices (Perens, 1999).

Table 2. Comparison of licensing practices.

License | Can be mixed with non-free software | Modifications can be taken private and not returned to you | Can be re-licensed by anyone | Contains special privileges for the original copyright holder over your modifications
Public Domain | X | X | X |

3.2.1 Copyleft and the GNU General Public License

Copyleft is a concept originated by Richard Stallman to address problems associated with placing software in the public domain. As mentioned previously, public domain software is not copyrighted. Someone can make changes to the software, many or few, and distribute the result as a proprietary product. People who receive the modified product may not have the same freedoms that the original author provided. Copyleft says that "anyone who redistributes the software, with or without changes, must pass along the freedom to further copy and change it." (FSF, 1999b)

To copyleft a program, it is first copyrighted and then specific distribution terms are added. These terms are a legal instrument that provides rights to "use, modify, and redistribute the program's code or any program derived from it but only if the distribution terms are unchanged." (FSF, 1999b)

In the GNU project, copyleft distribution terms are contained in the GNU General Public License, or GPL. The GPL does not allow private modifications. Any changes must also be distributed under the GPL. This not only protects the original author, but it also encourages collaboration, as any improvements are made freely available.

Additionally, the GPL does not allow the incorporation of licensed programs into proprietary software. Any software that does not grant as many rights as the GPL is defined as proprietary. However, the GPL contains certain loopholes that allow it to be used with software that is not entirely free. Software libraries that are normally distributed with the compiler or operating system may be linked with programs licensed under the GPL. The result is a partially-free program. The copyright holder has the right to violate the license, but this right does not extend to any third parties who redistribute the program. Subsequent distributions must follow all of the terms of the license, even those that the copyright holder violates.

An alternate form of the GPL, the GNU Library General Public License or LGPL, allows the linking of free software libraries into proprietary executables under certain conditions. In this way, commercial development can also benefit from free software. A program covered by the LGPL can be converted to the GPL at any time, but that program, or anything derived from it, cannot be converted back to the LGPL.

The GPL is a political manifesto as well as a software license, and much of the text is concerned with explaining the rationale behind the license. Unfortunately this political dialogue has alienated some developers. For example, Larry Wall, creator of Perl and the Artistic license, says "the FSF [Free Software Foundation] has religious aspects that I don’t care for" (Lash, 1998). As a result, some free software advocates have created more liberal licensing terms, avoiding the political rhetoric associated with the GPL.

3.2.2 The X, BSD, and Apache Licenses

The X license and the related BSD and Apache licenses are very different from the GPL and LGPL. The software originally covered by the X and BSD licenses was funded by monetary grants from the US government. In this sense, the public owned the software, and the X and BSD licenses therefore grant relatively broad permissions.

The most important difference is that X-licensed modifications can be made private. An X-licensed program can be modified and redistributed without including the source or applying the X license to the modifications. Other projects have adopted the X license and its variants, including BSD and the Apache web server.

3.2.3 The Artistic License

The Artistic license was originally developed for Perl; however, it has since been used for other software. The terms are more loosely defined in comparison with other licensing agreements, and the license is more commercially oriented. For instance, under certain conditions modifications can be made private. Furthermore, although sale of the software is prohibited, the software can be bundled with other programs, which may or may not be commercial, and sold.

3.2.4 The Netscape Public License and the Mozilla Public License

The Netscape Public License, or NPL, was originally developed by Netscape. The NPL contains special privileges that apply only to Netscape. Specifically, it allows Netscape to re-license code covered by the NPL to third parties under different terms. This provision was necessary to satisfy proprietary contracts between Netscape and other companies. The NPL also allows Netscape to use code covered by the NPL in other Netscape products without those products falling under the NPL.

Not surprisingly, the free software community was somewhat critical of the NPL. Netscape subsequently released the MPL, or Mozilla Public License. The MPL is similar to the NPL, but it does not contain exemptions. Both the NPL and the MPL allow private modifications.

3.2.5 The Open Source Definition

The Open Source Definition is not a software license. Instead it is a specification of what is permissible in a software license for that software to be considered open source. The Open Source Definition is based on the Debian free software guidelines or social contract, which provides a framework for evaluating other free software licenses.

The Open Source Definition includes several criteria, which can be paraphrased as follows (OSI, 1999):

  1. Free Redistribution – Copies of the software can be made at no cost.
  2. Source Code – The source code must be distributed with the original work, as well as all derived works.
  3. Derived Works – Modifications are allowed; however, it is not required that the derived work be subject to the same license terms as the original work.
  4. Integrity of the Author’s Source Code – Modifications to the original work may be restricted only if the distribution of patches is allowed. Derived works may be required to carry a different name or version number from the original software.
  5. No Discrimination Against Persons or Groups – Discrimination against any person or group of persons is not allowed.
  6. No Discrimination Against Fields of Endeavor – Restrictions preventing use of the software by a certain business or area of research are not allowed.
  7. Distribution of License – Any terms should apply automatically without written authorization.
  8. License Must Not Be Specific to a Product – Rights attached to a program must not depend on that program being part of a specific software distribution.
  9. License Must Not Contaminate Other Software – Restrictions on other software distributed with the licensed software are not allowed.

The GNU GPL, BSD, X Consortium, MPL, and Artistic licenses are all examples of licenses that conform to the Open Source Definition.

The evaluation of a proposed license elicits considerable debate in the free software community. With the growing popularity of open source, many companies are developing licenses intended to capitalize on this interest. Some of these licenses conform to the Open Source Definition; others do not. For example, the Sun Community Source License approximates some open source concepts, but it does not conform to the Open Source Definition. The Apple Public Source License, or APSL (Apple, 1999b), has been alternately endorsed and rejected by members of the open-source community.

4. Methodology

The Cathedral and the Bazaar is the most frequently cited description of the open-source development methodology. Eric Raymond’s discussion of the Linux development model as applied to a small project is a useful commentary. However, it should be noted that although the paper identifies many mechanisms of successful open-source development, it does not expose the dynamics. In this sense, the description is inherently weak.

4.1 Plausible Promise

Raymond remarks that it would be difficult to originate a project in bazaar mode. To build a community, a program must first demonstrate plausible promise. The implementation can be crude or incomplete, but it must convince others of its potential. This is given as a necessary precondition of the bazaar, or open-source, style.

Interestingly, many commercial software companies use this approach to ship software products. Microsoft, for example, consistently ships early versions of products that are notoriously bug-ridden. However, as long as a product can demonstrate plausible promise, either by setting a standard or uniquely satisfying a potential need, it is not necessary for early versions to be particularly strong.

Critics suggest that the effective use of bazaar principles by closed-source developers implies ambiguity: specifically, that the Cathedral and the Bazaar does not sufficiently describe certain aspects of the open-source development process (Eunice, 1998).

4.2 Release Early, Release Often

Early and frequent releases are critical to open-source development. Improvements in functionality are incremental, allowing for rapid evolution, and developers are "rewarded by the sight of constant improvement in their work." (Raymond, 1998a)

Product evolution and incremental development are not new. Mills initially proposed that any software system should be grown by incremental development (Mills, 1971). Brooks would later elaborate on this concept, suggesting that developers should grow rather than build software, adding more functions to systems as they are run, used, and tested (Brooks, 1986). Basili suggested the concept of iterative enhancement in large-scale software development (Basili and Turner, 1975), and Boehm proposed the spiral model, an evolutionary prototyping approach incorporating risk management (Boehm, 1986).

Open source relies on the Internet to noticeably shorten the iterative cycle. Raymond notes that "it wasn’t unknown for [Linus] to release a new kernel more than once a day." (Raymond, 1998a) Mechanisms for efficient distribution and rapid feedback make this practice effective.

However, successful application of an evolutionary approach is highly dependent on a modular architecture. Weak modularity widens the impact of each change and reduces the effectiveness of individual contributors. In this respect, projects that do not encourage a modular architecture may not be suitable for open-source development. This contradicts Raymond’s underlying assertion that open source is a universally better approach.

4.3 Debugging is Parallelizable

Raymond emphasizes large-scale peer review as the fundamental difference underlying the cathedral and bazaar styles. The bazaar style assumes that "given a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix obvious to someone." Debugging requires less coordination relative to development, and thus is not subject "to the same quadratic complexity and management costs that make adding developers problematic." (Raymond, 1998a)
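
The "quadratic complexity" in the quotation is usually read as a reference to Brooks's observation that the number of pairwise communication paths in a team of n people is n(n - 1)/2. The following is a minimal sketch of that arithmetic only; the function name and the sample team sizes are illustrative assumptions and do not come from Raymond's paper.

```python
def communication_paths(n: int) -> int:
    """Return the number of distinct pairs among n people: n * (n - 1) / 2.

    Hypothetical helper used only to illustrate quadratic growth.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


# Illustrative team sizes (assumed values, not data from the source).
for team_size in (5, 10, 20, 40):
    print(team_size, communication_paths(team_size))
```

Each doubling of the team roughly quadruples the number of paths, which is the coordination cost that Raymond argues debugging largely avoids.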

The basic premise is that more debuggers will contribute to a shorter test cycle without significant additional cost. In other words, "more users find more bugs because adding more users adds more ways of stressing the program." (Raymond, 1998a) However, open source is not a prerequisite for peer review. For instance, various forms of peer review are commonly employed in software engineering. The question might then become one of scale, but Microsoft practices beta-testing on a scale matched only by larger open-source projects.

Raymond continues, suggesting that debugging is even more efficient when users are co-developers, as is most often the case in open-source projects. This is also subject to debate. Raymond notes that each tester "approaches the task of bug characterization with a slightly different perceptual set and analytical toolkit, a different angle on the problem." (Raymond, 1998a) This difference in perspective is most evident in the fact that developers and end-users evaluate products in very different ways. It therefore seems likely that peer review under the bazaar model would be constrained by a disproportionate number of co-developers.

5. Project Profiles

There are literally hundreds, if not thousands, of open-source projects currently in existence. These projects include operating systems, programming languages, utilities, Internet applications and many more. The following projects are notable for their influence, size, and success.

5.1 Linux

Linux is a Unix-like operating system that runs on several platforms, including Intel processors, Motorola MC68K, and DEC Alphas (SSC, 1998). It is a superset of the POSIX specification, with SYS V and BSD extensions. Linux began as a hobby project of Linus Torvalds, a graduate student at the University of Helsinki. The project was inspired by his interest in Minix, a small Unix system developed primarily as an educational tool by Andy Tanenbaum. Linus set out to create, in his own words, "a better Minix than Minix." In October 1991, Linus announced the first official release of Linux, version 0.02. Since then, hundreds of programmers have contributed to the ongoing improvement of the system.

Linux kernel development is largely coordinated through the linux-kernel mailing list. The list is high volume, and currently includes over 200 active developers as well as many other debuggers and testers. With the growth of the project, Linus has relinquished control over certain areas of the kernel, such as file systems and networking, to other "trusted lieutenants." However, Linus remains the final authority on decisions related to kernel development. The kernel is under the GPL, and official versions are made available via ftp.

Arguably the best-known open-source project, Linux has quietly gained popularity in academia as well as among scientific researchers and Internet service providers. Recently, it has made commercial advances, and is currently marketed as the only viable alternative to Microsoft Windows NT. A study by International Data Corporation reported that Linux accounted for 17.2% of server operating system shipments in 1998, an increase of 212% over the previous year (Shankland, 1998). The Linux kernel is typically packaged with the various other programs that comprise a Unix operating system. Several commercial companies currently sell these packages as Linux distributions.

5.2 Apache

Apache originated in early 1995 as a series of enhancements to the then-popular public domain HTTP daemon developed by Rob McCool at the National Center for Supercomputing Applications, or NCSA. Rob McCool had left NCSA in mid 1994, and many Webmasters had become frustrated with a lack of further development. Some proceeded to develop their own fixes and improvements. A small group coordinated these changes in the form of patches and made the first official release of the Apache server in April 1995, hence the name A PAtCHy server. (Laurie, 1999)

The Apache Group is currently a core group of about 20 project contributors, who now focus more on business issues and security problems. The larger user community manages mainstream development. Apache operates as a meritocracy, in a format similar to most open-source projects. Responsibility is based on contribution, or "the more work you have done, the more work you are allowed to do." (The Apache Group, 1999) Development is coordinated through the new-httpd mailing list, and a voting process exists for conflict resolution.

Apache has consistently ranked as the most popular Web server on the Internet (Netcraft, 1999). Currently, Apache dominates the market and is more widely used than all other Web servers combined. Industry leaders such as DEC, UUNet, and Yahoo use Apache. Several companies, including C2Net, distribute commercial versions of Apache, earning money for support services and added utilities.

5.3 Mozilla

Mozilla is an open-source deployment of Netscape’s popular Web browsing suite, Netscape Communicator. Netscape’s decision was strongly influenced by a whitepaper written by employee Frank Hecker (Hecker, 1998), which referenced the Cathedral and the Bazaar. In January 1998, Netscape announced that the source code for the next generation of Communicator would be made freely available. The first developer release of the source code was made in late March 1998.

Mozilla.org exists as a group within Netscape responsible for coordinating development. Mozilla has established an extensive web site, which includes problem reporting and version management tools. Discussion forums are available through various newsgroups and mailing lists. The project is highly modular and consists of about 60 groups, each responsible for a particular subsystem. All code issued in March was released under the NPL. New code can be released under the MPL or any compatible license. Changes to the original code are considered modifications and are covered by the NPL.

Although it has benefited from widespread media exposure, Mozilla has yet to result in a production release. It is therefore difficult to evaluate the commercial success of the project. The recent merger of AOL and Netscape has introduced additional uncertainty, but many continue to feel confident that the project will produce a next generation browser.

5.4 Perl and Python

Perl and Python are mature scripting languages that have achieved considerable market success. Originally developed in 1986 by Larry Wall, Perl has become the language of choice for system and network administration, as well as CGI programming. Large commercial Web sites such as Yahoo and Amazon make extensive use of Perl to provide interactive services.

Perl, which stands for Practical Extraction and Report Language, is maintained by a core group of programmers via the perl5-porters mailing list. Larry Wall retains artistic control of the language, but a well-defined extension mechanism allows for the development of add-on modules by independent programmers. (Wall et al., 1996)

Python was developed by Guido van Rossum at Centrum voor Wiskunde en Informatica, or CWI, in Amsterdam. It is an interactive, object-oriented language and includes interfaces to various system calls and libraries, as well as to several popular windowing systems. The Python implementation is portable and runs on most common platforms. (Lutz, 1996)

5.5 KDE and GNOME

KDE and GNOME are X11-based desktop environments. KDE also includes an application development framework and desktop office suite. The application framework is based on KOM/OpenParts technology, and leverages open industry standards such as the object request broker CORBA 2.0. The office suite, KOffice, consists of a spreadsheet, a presentation tool, an organizer, and an email and news client.

GNOME, or the GNU Network Object Model Environment, is similar in many ways to KDE. However, GNOME uses the gtk+ toolkit, which is also open source, whereas KDE uses Qt, a foundation library from Troll Tech that was commercially licensed until recently.

KDE and GNOME are interesting because they represent the varying commitments in the open source community to commercial markets and the free software philosophy. The KDE group and Troll Tech initially tried to incorporate Qt, a proprietary product, into the Linux infrastructure. This was met with mixed reactions. The prospect of a graphical desktop for Linux was so attractive that some were willing to overlook the contradictory nature of the project. However, others rejected KDE and instead supported GNOME, which was initiated as a fully open source competitor. Eventually, Troll Tech realized Qt would not be successful in the Linux market without a change in license, and a new agreement was released, defusing the conflict. GNOME continues, aiming to best KDE in terms of functionality rather than philosophy (Perens, 1999).

5.6 Other Projects

Other lesser-known, but equally interesting, projects include GIMP, FreeBuilder, Samba, and Kaffe. Each of these projects follows the open source methodology, originating under the direction of an individual or small group and rapidly extending to a larger development community.

GIMP, or the GNU Image Manipulation Program, can be used for tasks such as photo retouching, image composition and image authoring. GIMP was written by Peter Mattis and Spencer Kimball, and released under the GPL. FreeBuilder is a visual programming environment based on Java. It includes an integrated text editor, debugger, and compiler. Samba allows Unix systems to act as file and print servers on Microsoft Windows networks. Development is headed by Andrew Tridgell. Kaffe is a cleanroom implementation of the Java virtual machine and class libraries.

6. Summary and Conclusions

Open source is software developed by uncoordinated but loosely collaborating programmers, using freely distributed source code and the communications infrastructure of the Internet. Open source is based on the philosophy of free software. However, open source extends this ideology slightly to present a more commercial approach that includes both a business model and development methodology. Various categories of free and non-free software are commonly, and incorrectly, referenced, including public domain, freeware, and shareware. Licensing agreements such as the GPL have been developed to formalize distribution terms. The Open Source Definition provides a framework for evaluating these licenses.

The Cathedral and the Bazaar is the most frequently cited description of the open-source development methodology. However, although the paper identifies many mechanisms of successful open-source development, it does not expose the dynamics. Critics note that certain aspects remain ambiguous, which suggests that the paper does not sufficiently describe the open-source development process.

There are hundreds, if not thousands, of open-source projects currently in existence. These projects face growing challenges in terms of scalability and inherently weak tool support. However, open source is a pragmatic example of software development over the Internet. It provides interesting lessons with regard to large-scale collaboration and distributed coordination. If the open-source approach is adequately studied and evaluated, ideally the methodology might be applied in a broader context.