As Jim Collins pointed out in “Built to Last”, companies that want to be successful over a long period of time need to engineer their organization so that it consistently produces great results without depending on specific individuals. When I observe software developers operating in an organization, I often have the impression that they have a hard time contributing to this engineering process. This puzzles me, because engineering an organization is like engineering a software program. Here I would like to present some similarities:
High-Level Analogies
The Flow of Data
An organization receives inputs and transforms them into outputs, just as a software system does. So in order to engineer your organization, you look at the kinds of data that flow into and out of it, just as you would when building a software system. The data in an organization consists of reports, documents, e-mails and conversations, while the data in a software system might be messages, files, database records and HTTP requests.
A software system is structured into several components with different responsibilities, just as an organization is structured into several functional units and roles. Between the software components there are interfaces that transport the data. The same holds true for functional units, whose interfaces are reports, meetings and interim results.
The Interface
In object-oriented designs you usually have interfaces to separate concerns and to hide implementation details. The same applies when you define roles in a team. A role comes with specific responsibilities and activities, just as an interface comes with specific responsibilities (the interface contract) and activities (its methods). It does not matter who occupies a role, just as it does not matter which class implements an interface. A person can hold many roles, just as a class can implement many interfaces.
So when you design the team roles, think of all the responsibilities and activities that your team is confronted with and design the roles just like you would design the interfaces in your software system. And just like those interfaces, make sure that the team roles are small.
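To make the analogy concrete, here is a minimal sketch of two small roles expressed as interfaces. The role names (Reviewer, ReleaseManager) and the people (Alice, Bob) are made up for illustration and do not come from any particular team.

```python
from abc import ABC, abstractmethod


class Reviewer(ABC):
    """A small role: one responsibility, one activity."""

    @abstractmethod
    def review(self, change: str) -> str:
        ...


class ReleaseManager(ABC):
    """Another small role, independent of Reviewer."""

    @abstractmethod
    def release(self, version: str) -> str:
        ...


class Alice(Reviewer, ReleaseManager):
    """One person holding two roles, like a class implementing two interfaces."""

    def review(self, change: str) -> str:
        return f"Alice reviewed {change}"

    def release(self, version: str) -> str:
        return f"Alice released {version}"


class Bob(Reviewer):
    """Another person who fills only the Reviewer role."""

    def review(self, change: str) -> str:
        return f"Bob reviewed {change}"


def request_review(reviewer: Reviewer, change: str) -> str:
    # The caller depends on the role only, not on who fills it.
    return reviewer.review(change)
```

Because request_review only knows about the Reviewer role, Alice and Bob are interchangeable from its point of view, which is exactly the property you want from a well-defined team role.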
Reliability Patterns
Since in an organization you are dealing with people, there is a high need for fault tolerance, because people are generally prone to making mistakes. In the following sections I describe some patterns from software development that can be applied to organizations as well.
The Retry Mechanism
When you have several distributed software components that need to communicate with each other, you have to expect connection losses. This might be because the remote component is overloaded or down, or because the network connection is not working. Such a problem is especially severe when there is a requirement for transaction safety. To overcome this problem you build a retry mechanism: your component periodically tries to set up the connection or execute the transaction until it finally works. But you cannot be sure that your component will be running all the time. Maybe your component crashes before the remote component comes up again. So you will need a persistence mechanism (e.g. a database) in your component, so that your retry attempts survive a downtime on your side.
Although technical systems are usually much more reliable than people systems, it seems that people do not foresee such failure conditions when they work with other people. Interestingly, the other people, who are “the remote system”, are expected to work perfectly, without failures or downtime and with high responsiveness. And when a failure or downtime inevitably occurs, people start complaining that the “other person” is unprofessional or unreliable. More than once I’ve seen someone give up on the “remote person” without a single retry.
Although you should of course look for people with high reliability and commitment, a retry mechanism will always be necessary when you work with other people. So when you give a task to somebody else or agree on a deadline with somebody, always make sure to note the due date in your calendar (which is the persistence mechanism) and check for completion. If the task is not completed, remind the other person of the task (which is the retry mechanism). Even with motivated, conscientious and reliable people I’ve noticed a failure rate of up to 20%. That’s as good as it gets.
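A retry mechanism with persistence could be sketched as follows. This is only an illustration under some assumptions: the class name PersistentRetryQueue, the table layout and the use of SQLite as the database are my choices here, and a failed delivery is assumed to show up as a ConnectionError.

```python
import sqlite3
from typing import Callable


class PersistentRetryQueue:
    """Keeps undelivered items in a database so retries survive a crash."""

    def __init__(self, db_path: str) -> None:
        # With a file path the pending items outlive the process.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            "id INTEGER PRIMARY KEY, "
            "payload TEXT NOT NULL, "
            "attempts INTEGER NOT NULL DEFAULT 0)"
        )
        self.conn.commit()

    def enqueue(self, payload: str) -> None:
        # Persist first, deliver later: nothing is lost if we crash now.
        self.conn.execute("INSERT INTO pending (payload) VALUES (?)", (payload,))
        self.conn.commit()

    def pending_count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM pending").fetchone()[0]

    def run_once(self, send: Callable[[str], object]) -> int:
        """Try every pending item once and return how many were delivered."""
        delivered = 0
        rows = self.conn.execute(
            "SELECT id, payload FROM pending ORDER BY id"
        ).fetchall()
        for row_id, payload in rows:
            try:
                send(payload)
            except ConnectionError:
                # Remote side unavailable: keep the item for the next round.
                self.conn.execute(
                    "UPDATE pending SET attempts = attempts + 1 WHERE id = ?",
                    (row_id,),
                )
            else:
                self.conn.execute("DELETE FROM pending WHERE id = ?", (row_id,))
                delivered += 1
        self.conn.commit()
        return delivered
```

The component would call run_once periodically, for example from a timer. An item is deleted only after send has returned without an error, so an item that could not be delivered is simply tried again in the next round, even after a restart.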
The Cleanup Job
Suppose you have a database that stores all kinds of user transactions in order to produce real-time reports about user behavior. If the number of transactions per time unit is very high, the database will fill up rapidly. This might lead to performance problems during operations or when you actually look at the desired reports. Since we are talking about real-time reports, you will not need to store the data for long. So you implement a clean-up job that periodically (e.g. every week or so) deletes old records from the database.
Now suppose you have a workflow in your software development process where you generate some sort of documents. You want to make sure that the created documents are filed and labelled correctly so that they can be found later. You have probably included a rule for appropriate filing and labelling in your process description, but with the failure rate stated above, you find that a lot of the time the filing and labelling is not done correctly.
What you can do is implement a periodic clean-up job. Somebody on the team is responsible for periodically checking that the filing and labelling is done correctly. If it was not done correctly, that person either corrects it right away or urges a correction.
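Such a clean-up job might look like the following sketch. The table name transactions, its columns and the seven-day retention period are assumptions made for this example, and timestamps are assumed to be stored as Unix time in seconds.

```python
import sqlite3

SECONDS_PER_DAY = 24 * 60 * 60


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table layout for the user transactions.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions ("
        "id INTEGER PRIMARY KEY, "
        "user_id TEXT NOT NULL, "
        "created_at REAL NOT NULL)"
    )
    conn.commit()


def record_transaction(conn: sqlite3.Connection, user_id: str, created_at: float) -> None:
    conn.execute(
        "INSERT INTO transactions (user_id, created_at) VALUES (?, ?)",
        (user_id, created_at),
    )
    conn.commit()


def delete_old_transactions(
    conn: sqlite3.Connection, now: float, max_age_days: float = 7.0
) -> int:
    """Delete records older than max_age_days and return how many were removed."""
    cutoff = now - max_age_days * SECONDS_PER_DAY
    cursor = conn.execute("DELETE FROM transactions WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cursor.rowcount
```

A scheduler (a cron entry, for instance) would call delete_old_transactions once per period with the current time, e.g. the value of time.time(). Passing the time in as a parameter keeps the job easy to test.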
The Watch Dog
In high reliability environments there is usually a software component that is called a “watch dog”. This component periodically checks if all other software components are up and running. If one component is not running, the watch dog will try to restart it.
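The core of a watch dog can be sketched in a few lines. The Component class below is a stand-in invented for this example; in a real system, is_alive and restart would talk to actual processes or services.

```python
from typing import Iterable, List


class Component:
    """Stand-in for a software component that can be checked and restarted."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running = True

    def is_alive(self) -> bool:
        return self.running

    def restart(self) -> None:
        self.running = True


class Watchdog:
    """Checks all registered components and restarts the ones that are down."""

    def __init__(self, components: Iterable[Component]) -> None:
        self.components = list(components)

    def check_once(self) -> List[str]:
        """Run one check cycle and return the names of restarted components."""
        restarted = []
        for component in self.components:
            if not component.is_alive():
                component.restart()
                restarted.append(component.name)
        return restarted
```

In operation, check_once would be called at a fixed interval. The returned list of names is handy for logging which components had to be restarted.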
Now imagine that you are coordinating a project with multiple people. Each person has a set of responsibilities and tasks associated with them. People are often in multiple projects at the same time, so simply distributing all the tasks and responsibilities and expecting everything to go well and everybody to stay on track usually does not work. People will lose focus and work on other projects instead of yours. This is similar to a software component being down.
What you should do instead is schedule periodic status meetings (planned in advance) where you make sure everybody is on the right course and working on the project. This is similar to the watch dog checking if all other software components are running. During such a meeting you have the possibility to reinitiate the work on the project, if it hasn’t been performed yet. This is similar to the watch dog restarting a software component.
The Hot/Cold Standby
This analogy does not need much explanation. It is just about adding redundancy to your team roles, so that you can “fail over” to another person if one person gets sick or leaves the team. That failover can be either hot or cold. Hot means the other person can take over the work right away, because he or she is well informed about the project at hand. Cold means the other person has the necessary skills and availability, but will still need some time to become acquainted with the project if the need to take over arises.
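The failover decision itself can be written down as a small sketch. The Person class and its briefed flag are hypothetical names for this example: briefed marks a hot standby, everybody else who is available counts as a cold standby.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    name: str
    available: bool = True
    briefed: bool = False  # True: hot standby, already knows the project


def pick_successor(standbys: List[Person]) -> Optional[Person]:
    """Prefer a hot standby; fall back to a cold one; None if nobody is free."""
    candidates = [p for p in standbys if p.available]
    # False sorts before True, so briefed (hot) people come first.
    # The sort is stable, so the original order is kept within each group.
    candidates.sort(key=lambda p: not p.briefed)
    return candidates[0] if candidates else None
```

If pick_successor returns None, the role has no redundancy at all, which is worth finding out before somebody actually gets sick.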
Final Remarks
There are still lots of other analogies that can be drawn between software engineering and organizational engineering. You just have to look for them. And just as in software development, if you don’t do the engineering well, the organization will not work well.