So I've always wanted to write about programming, and suppose this blog is as good an avenue as any.
Programming has always interested me in its incredible depth and complexity. I've recently started thinking about, as my project grows rather large, effective ways of managing complexity. It seems that the way people think of programming depends on thresholds of program size and scope. Even worse, in extreme cases, ordinary techniques tend to fall apart.
Consider two sides of the spectrum:
- A hackish throw-away script written to bang out a result, where rigorous programming techniques would probably slow you down
- A project written and evolved over the course of several years
It takes no argument to deduce that this approach is not how you would handle a program maintained over several years. An important distinction is that in the case of a one-time script, you would need only to handle an instance of a problem. In a program that ought to stand the test of time, numerous corner cases present themselves, and must be dealt with methodically. The maintenance of the code base thus presents a challenging problem. If you write the code in a quick and dirty fashion through-out, you'll notice something disturbing. You will start with a simple elegant solution for a large subset of the problem, and then as corner cases and error handling is added for robustness, the code will swell with noisy code. Your essential solution to the problem at hand will be lost in details.
Code Always Changes
Chances are, if your one time script was truly successful and useful, it will be granted an extension of life. This extension of life can easily bring it to the boundaries of what it was intended to do. Vital to handling program complexity is to gracefully transition code from a prototype-like phase that solves most of the problems, to the point where it handles gracefully any number of conditions that can occur.
Consider a piece of code that began its life in a simple script:
function sanitize_file(filename):
file = open(filename)
file_lines = []
do yet another thing to line
if function1(line) and function2(line):
append(file_lines, line)
write_to_file(file, line)
return success
This solves the original problem just fine, and elegantly enough. This is primarily because, there was only one instance of that problem, and it could be thus reasoned that the code was sufficient. Now, after writing this code, you find it very useful and are going to now adapt it to a larger project.
Lets say the code is simply dropped in, and then as the code evolved, additional features for robustness were added, as well as a number of explanatory messages of failures. Some time later, you look back at the code you wrote, and see it has become:
function sanitize_file(filename):
file = open(filename)
if function1(line) and function2(line):
append(file_lines, line)
write_to_file(file, line)
for line in file:
do something to line
do another thing to linedo yet another thing to line
if function1(line) and function2(line):
append(file_lines, line)
for line in file_lines:
do something to line
do another thing to line
do yet another thing to linereturn success
This solves the original problem just fine, and elegantly enough. This is primarily because, there was only one instance of that problem, and it could be thus reasoned that the code was sufficient. Now, after writing this code, you find it very useful and are going to now adapt it to a larger project.
Lets say the code is simply dropped in, and then as the code evolved, additional features for robustness were added, as well as a number of explanatory messages of failures. Some time later, you look back at the code you wrote, and see it has become:
function sanitize_file(filename):
file = open(filename)
handle file didnt exist:
print("Invalid file!")
return failure
for line in file:
do something to line
do another thing to line
do yet another thing to line if function1(line) and function2(line):
append(file_lines, line)
handle error reading lines:
print("Error reading file, possibly too large for memory!")
return failure
file_backup = create_file( concat(filename ".backup"))
handle file already exists:
print("Could not create backup file, already exists!")
return failure
copy_file(file, file_backup)
handle write error:
handle write error:
print("Could not write to backup file!")
return failure
for line in file_lines:
do something to line
do another thing to line
do yet another thing to line
handle write error:
The Death of Innocence
Code that is maintained over a long time inherits a substantial amount of code that is added to handle corner cases. The main issue with this is that while the code originally had a simple premise: taking a file and overwriting its contents with a sanitized version, it is now buried in details, and it is unclear which of those details relate to the actual problem.
Consider for the sake of argument that to end users version 2 is much better, and it saves them from losing a lot of data. We are in the unfortunate position of no longer being able to believe that the original version sufficed, and now must accept the added complexity into our lives.
Simple-minded Code
How do we approach making code fit for a large project? More importantly, as not everyone writes large projects, how do we approach making this code fit for human consumption?
Often times when something is hard to understand or reason about, it attempts to do too many things. Even worse than doing too many things, is doing too many unrelated things. As hard as it is to reason about a path-finding algorithm, for example, it is much more complicated to reason about a path-finding algorithm that also draws and takes user input.
It is thus important to separate a piece of code's intent from its implementation. Many programming techniques such as classes, interface types, procedures and modules aim to handle this problem.
The simplest, and most ubiquitous, is the procedure. We can stand to gain by isolating the implementation of the code, from the intent of the code, with separate procedures:
function sanitized_line(line):
do yet another thing to line
return line
do yet another thing to line
return line
function line_isnt_comment(line):
(Of course, these function names should be even more descriptive of their intent, if we exactly decided what 'sanitizing' a file meant here)
Higher level thinking
delete_file(file_backup)
do yet another thing to line
return line
do yet another thing to line
return line
function line_isnt_comment(line):
print("Error writing file, reverting to backup!")
copy_file(file_backup, file)
return failurecopy_file(file_backup, file)
delete_file(file_backup)
return success
return success
The Death of Innocence
Code that is maintained over a long time inherits a substantial amount of code that is added to handle corner cases. The main issue with this is that while the code originally had a simple premise: taking a file and overwriting its contents with a sanitized version, it is now buried in details, and it is unclear which of those details relate to the actual problem.
Consider for the sake of argument that to end users version 2 is much better, and it saves them from losing a lot of data. We are in the unfortunate position of no longer being able to believe that the original version sufficed, and now must accept the added complexity into our lives.
Simple-minded Code
How do we approach making code fit for a large project? More importantly, as not everyone writes large projects, how do we approach making this code fit for human consumption?
Often times when something is hard to understand or reason about, it attempts to do too many things. Even worse than doing too many things, is doing too many unrelated things. As hard as it is to reason about a path-finding algorithm, for example, it is much more complicated to reason about a path-finding algorithm that also draws and takes user input.
It is thus important to separate a piece of code's intent from its implementation. Many programming techniques such as classes, interface types, procedures and modules aim to handle this problem.
The simplest, and most ubiquitous, is the procedure. We can stand to gain by isolating the implementation of the code, from the intent of the code, with separate procedures:
function sanitized_line(line):
do something to line
do another thing to linedo yet another thing to line
return line
function prepared_for_output(line):
do something to line
do another thing to linedo yet another thing to line
return line
return function1(line) and function2(line)
function sanitize_file(filename):
file = open(filename)
This version of the original logic, while longer, is both clearer and more re-usable. Units of logic are abstracted into appropriate blocks of code. Especially useful for large projects is the ability to change the implementation of these components without changing their intent, the logical extension of this is the once-and-only-once rule:
function sanitize_file(filename):
file = open(filename)
file_lines = []
return success
for line in file:
line = sanitized_line(line)
if line_isnt_comment(line):
append(file_lines, line)
for line in file_lines:
write_to_file(file, prepared_for_output(line))return success
This version of the original logic, while longer, is both clearer and more re-usable. Units of logic are abstracted into appropriate blocks of code. Especially useful for large projects is the ability to change the implementation of these components without changing their intent, the logical extension of this is the once-and-only-once rule:
Any functionality should only be encapsulated only once in the code. Any suspiciously similar blocks of code should be considered candidates for extraction into procedures. Any suspiciously similar types should be considered candidates for sharing components.I would say equally worth noting are the explanatory function names, made as long as they need to be. I believe this to be a very crucial practice for code maintainability and can justify it simply:
Comments are hard to constantly write and maintain, and any function name that is not clear in its intent necessitates many comments to keep the code understandable.In the end, it is far less typing to have long, explanatory, function names. The code is in a sense, self-documenting.
(Of course, these function names should be even more descriptive of their intent, if we exactly decided what 'sanitizing' a file meant here)
Higher level thinking
For our last code example, we conveniently forgot about all the error handling that we had just come to terms with as necessary. To properly tackle the next problem we must first ask ourselves, what errors should this procedure really worry about? What errors do not belong in this procedure?
For one, the procedure makes no sense unless a file really exists. Secondly, the procedure tries to hide the fact that a backup is being created, but this is not quite achieved. So we shall remove the burden of creating a backup file from the procedure itself, and make the interface as follows:
function sanitize_file(input_file, output_file)
We can maintain our previous interface by adding a wrapper function:
function sanitize_file_by_name(filename):
file = open(filename)
copy_file(file, file_backup)
handle write error:
For one, the procedure makes no sense unless a file really exists. Secondly, the procedure tries to hide the fact that a backup is being created, but this is not quite achieved. So we shall remove the burden of creating a backup file from the procedure itself, and make the interface as follows:
function sanitize_file(input_file, output_file)
We can maintain our previous interface by adding a wrapper function:
function sanitize_file_by_name(filename):
file = open(filename)
handle file didnt exist:
print("Invalid file!")
return failure
file_backup = create_file( concat(filename ".backup"))
handle file already exists:
print("Could not create backup file, already exists!")
return failurecopy_file(file, file_backup)
handle write error:
print("Could not write to backup file!")
return failure
was_successful = sanitize_file(file, file)
if not was_successful:
print("Error in sanitize_file, reverting to backup!")
copy_file(file_backup, file)
delete_file(file_backup)
return was_successful
The advantages of this organization are clear, the error handling for opening files occurs in an isolated function that actually concerns itself with opening these files. As well, sanitize_file's implementation is free from details such as the name of the backup file, and the detail of deleting the backup file on successful completion.
Let us now combine our results:The advantages of this organization are clear, the error handling for opening files occurs in an isolated function that actually concerns itself with opening these files. As well, sanitize_file's implementation is free from details such as the name of the backup file, and the detail of deleting the backup file on successful completion.
function sanitized_line(line):
do something to line
do another thing to linedo yet another thing to line
return line
function prepared_for_output(line):
do something to line
do another thing to linedo yet another thing to line
return line
return function1(line) and function2(line)
function sanitize_file(input_file, output_file)
function sanitize_file(input_file, output_file)
file_lines = []
handle error reading lines:
handle write error:
return success
function sanitize_file_by_name(filename):
file = open(filename)
copy_file(file, file_backup)
handle write error:
delete_file(file_backup)
A lot to bear - a single function was made into 4 functions. However, very importantly, the functions where the actual important details reside are quite simple. Indeed, if we want to change the backup behaviour, we change sanitize_file_by_name, if we want to change the details of the pre-processing step, we change sanitized_line, and so on.
While it may seem daunting that 4 functions were created, one must note that any changes that have to be made will very likely only have to be made in one of these functions. Thus we have greatly reduced the amount of code we must actually look at to solve any maintenance problem. As well, the individual functions can be reused for purposes that were not originally envisioned, thus providing more benefit to a larger project.
The greatest complexity occurs from interaction of program components, not from the components themselves. Keeping the interaction between program components as straight-forward as possible makes it far easier to manage change in code.
for line in input_file :
line = sanitized_line(line)
if line_isnt_comment(line):
append(file_lines, line) handle error reading lines:
print("Error reading file, possibly too large for memory!")
return failure
for line in file_lines:
write_to_file(
output_file , prepared_for_output(line)) handle write error:
print("Error writing to file!")
return failurereturn success
function sanitize_file_by_name(filename):
file = open(filename)
handle file didnt exist:
print("Invalid file!")
return failure
file_backup = create_file( concat(filename ".backup"))
handle file already exists:
print("Could not create backup file, already exists!")
return failurecopy_file(file, file_backup)
handle write error:
print("Could not write to backup file!")
return failure
was_successful = sanitize_file(file, file)
if not was_successful:
print("Error in sanitize_file, reverting to backup!")
copy_file(file_backup, file)
delete_file(file_backup)
return was_successful
A lot to bear - a single function was made into 4 functions. However, very importantly, the functions where the actual important details reside are quite simple. Indeed, if we want to change the backup behaviour, we change sanitize_file_by_name, if we want to change the details of the pre-processing step, we change sanitized_line, and so on.
While it may seem daunting that 4 functions were created, one must note that any changes that have to be made will very likely only have to be made in one of these functions. Thus we have greatly reduced the amount of code we must actually look at to solve any maintenance problem. As well, the individual functions can be reused for purposes that were not originally envisioned, thus providing more benefit to a larger project.
The greatest complexity occurs from interaction of program components, not from the components themselves. Keeping the interaction between program components as straight-forward as possible makes it far easier to manage change in code.