Saturday, 30 June 2012

Taming Program Complexity

Fast and Furious vs Built To Last

So I've always wanted to write about programming, and suppose this blog is as good an avenue as any.

Programming has always interested me in its incredible depth and complexity. I've recently started thinking about, as my project grows rather large, effective ways of managing complexity. It seems that the way people think of programming depends on thresholds of program size and scope. Even worse, in extreme cases, ordinary techniques tend to fall apart.

Consider two sides of the spectrum:
  1. A hackish throw-away script written to bang out a result, where rigorous programming techniques would probably slow you down
  2. A project written and evolved over the course of several years
The programming styles acceptable for these two projects seem to be at odds. In the case of a quick script, using a dynamic language can allow you to quickly type out code that you immediately visually confirm with some print statements, and build up to your result. Or you just type it all out as you feel it ought to work and let the program tell you otherwise. Either way, this is programming in its full glory. Quick, dirty, and way more effective than any manual approach to the problem. You can promptly toss out your code, and you'd still have saved time.

It takes no argument to deduce that this approach is not how you would handle a program maintained over several years. An important distinction is that in the case of a one-time script, you would need only to handle an instance of a problem. In a program that ought to stand the test of time, numerous corner cases present themselves, and must be dealt with methodically. The maintenance of the code base thus presents a challenging problem. If you write the code in a quick and dirty fashion through-out, you'll notice something disturbing. You will start with a simple elegant solution for a large subset of the problem, and then as corner cases and error handling is added for robustness, the code will swell with noisy code. Your essential solution to the problem at hand will be lost in details.

Code Always Changes
Chances are, if your one time script was truly successful and useful, it will be granted an extension of life. This extension of life can easily bring it to the boundaries of what it was intended to do. Vital to handling program complexity is to gracefully transition code from a prototype-like phase that solves most of the problems, to the point where it handles gracefully any number of conditions that can occur.

Consider a piece of code that began its life in a simple script:

function  sanitize_file(filename):
    file = open(filename)
    file_lines = []
    for line  in file:
         do something to line
         do another thing to line
         do yet another thing to line 
         if function1(line) 
and function2(line): 
             append(file_lines, line)
     for line  in file_lines:
         do something to line
         do another thing to line
         do yet another thing to line
         write_to_file(file, line)
    return success


This solves the original problem just fine, and elegantly enough. This is primarily because, there was only one instance of that problem, and it could be thus reasoned that the code was sufficient. Now, after writing this code, you find it very useful and are going to now adapt it to a larger project.

Lets say the code is simply dropped in, and then as the code evolved, additional features for robustness were added, as well as a number of explanatory messages of failures. Some time later, you look back at the code you wrote, and see it has become:

function  sanitize_file(filename):
    file = open(filename)
    handle file didnt exist:
         print("Invalid file!")
         return failure
     for line  in file:
         do something to line
         do another thing to line
         do yet another thing to line 
         if function1(line) 
and function2(line): 
             append(file_lines, line)
    handle error reading lines:
         print("Error reading file, possibly too large for memory!")
         return failure
    file_backup = create_fileconcat(filename  ".backup"))
    handle file already exists:
         print("Could not create backup file, already exists!")
         return failure 
    copy_file(file, file_backup)  
    handle write error:

         print("Could not write to backup file!")
         return failure 
     for line  in file_lines:
         do something to line
         do another thing to line
         do yet another thing to line
         write_to_file(file, line)
    handle write error:
         print("Error writing file, reverting to backup!")
         copy_file(file_backup, file)
         return failure
    delete_file(file_backup)
    return success 

The Death of Innocence

Code that is maintained over a long time inherits a substantial amount of code that is added to handle corner cases. The main issue with this is that while the code originally had a simple premise: taking a file and overwriting its contents with a sanitized version, it is now buried in details, and it is unclear which of those details relate to the actual problem. 


Consider for the sake of argument that to end users version 2 is much better, and it saves them from losing a lot of data. We are in the unfortunate position of no longer being able to believe that the original version sufficed, and now must accept the added complexity into our lives.

Simple-minded Code

How do we approach making code fit for a large project? More importantly, as not everyone writes large projects, how do we approach making this code fit for human consumption?

Often times when something is hard to understand or reason about, it attempts to do too many things. Even worse than doing too many things, is doing too many unrelated things. As hard as it is to reason about a path-finding algorithm, for example, it is much more complicated to reason about a path-finding algorithm that also draws and takes user input.


It is thus important to separate a piece of code's intent from its implementation. Many programming techniques such as classes, interface types, procedures and modules aim to handle this problem.

The simplest, and most ubiquitous, is the procedure. We can stand to gain by isolating the implementation of the code, from the intent of the code, with separate procedures:



function  sanitized_line(line): 
    do something to line
    do another thing to line
    do yet another thing to line
    return  line

function  prepared_for_output(line): 
    do something to line
    do another thing to line
    do yet another thing to line
    return  line
function  line_isnt_comment(line):  
          return  function1(line) and function2(line)

function  sanitize_file(filename):
    file = open(filename)
    file_lines = []
    for line  in file:
         line = sanitized_line(line)
         if line_isnt_comment(line):
             append(file_lines, line)
     for line  in file_lines:
         write_to_file(file, prepared_for_output(line))
    return success

This version of the original logic, while longer, is both clearer and more re-usable. Units of logic are abstracted into appropriate blocks of code. Especially useful for large projects is the ability to change the implementation of these components without changing their intent, the logical extension of this is the once-and-only-once rule:

Any functionality should only be encapsulated only once in the code. Any suspiciously similar blocks of code should be considered candidates for extraction into procedures. Any suspiciously similar types should be considered candidates for sharing components.
I would say equally worth noting are the explanatory function names, made as long as they need to be. I believe this to be a very crucial practice for code maintainability and can justify it simply: 
Comments are hard to constantly write and maintain, and any function name that is not clear in its intent necessitates many comments to keep the code understandable.
In the end, it is far less typing to have long, explanatory, function names. The code is in a sense, self-documenting.

(Of course, these function names should be even more descriptive of their intent, if we exactly decided what 'sanitizing' a file meant here)


Higher level thinking

For our last code example, we conveniently forgot about all the error handling that we had just come to terms with as necessary. To properly tackle the next problem we must first ask ourselves, what errors should this procedure really worry about? What errors do not belong in this procedure?

For one, the procedure makes no sense unless a file really exists. Secondly, the procedure tries to hide the fact that a backup is being created, but this is not quite achieved. So we shall remove the burden of creating a backup file from the procedure itself, and make the interface as follows:



function  sanitize_file(input_file, output_file)

We can maintain our previous interface by adding a wrapper function:

function 
 sanitize_file_by_name(filename): 
    file = open(filename)

    handle file didnt exist:
         print("Invalid file!")
         return failure
    file_backup = create_fileconcat(filename  ".backup"))
    handle file already exists:
         print("Could not create backup file, already exists!")
         return failure

    copy_file(file, file_backup)
  
    handle write error:

         print("Could not write to backup file!")
         return failure 
    was_successful = sanitize_file(file, file)
    if not was_successful:
         print("Error in sanitize_file, reverting to backup!")  
         copy_file(file_backup, file)

    delete_file(file_backup)

    return  was_successful


The advantages of this organization are clear, the error handling for opening files occurs in an isolated function that actually concerns itself with opening these files. As well, sanitize_file's implementation is free from details such as the name of the backup file, and the detail of deleting the backup file on successful completion.
Let us now combine our results:

function  sanitized_line(line): 
    do something to line
    do another thing to line
    do yet another thing to line
    return  line

function  prepared_for_output(line): 
    do something to line
    do another thing to line
    do yet another thing to line
    return  line
function  line_isnt_comment(line):  
          return  function1(line) and function2(line)

function  sanitize_file(input_file, output_file)
    file_lines = []
    for line  in  input_file :
         line = sanitized_line(line)
         if line_isnt_comment(line):
             append(file_lines, line) 
    handle error reading lines:

         print("Error reading file, possibly too large for memory!")
         return failure 

    for line  in file_lines:
         write_to_file( output_file prepared_for_output(line)) 
    handle write error:

         print("Error writing to file!")
         return failure

    return success

function 
 sanitize_file_by_name(filename): 
    file = open(filename)

    handle file didnt exist:
         print("Invalid file!")
         return failure
    file_backup = create_fileconcat(filename  ".backup"))
    handle file already exists:
         print("Could not create backup file, already exists!")
         return failure

    copy_file(file, file_backup)
  
    handle write error:

         print("Could not write to backup file!")
         return failure 
    was_successful = sanitize_file(file, file)
    if not was_successful:
         print("Error in sanitize_file, reverting to backup!")  
         copy_file(file_backup, file)

    delete_file(file_backup)

    return  was_successful

A lot to bear - a single function was made into 4 functions. However, very importantly, the functions where the actual important details reside are quite simple. Indeed, if we want to change the backup behaviour, we change sanitize_file_by_name, if we want to change the details of the pre-processing step, we change sanitized_line, and so on.

While it may seem daunting that 4 functions were created, one must note that any changes that have to be made will very likely only have to be made in one of these functions. Thus we have greatly reduced the amount of code we must actually look at to solve any maintenance problem. As well, the individual functions can be reused for purposes that were not originally envisioned, thus providing more benefit to a larger project.

The greatest complexity occurs from interaction of program components, not from the components themselves. Keeping the interaction between program components as straight-forward as possible makes it far easier to manage change in code.

No comments:

Post a comment