Monday, March 1, 2010

Getting Started with Version Control

I've had to help more than a few teams get their version control systems sorted out over the past few years, so I thought it would be easier if I just wrote down the philosophies I use for initializing a repository and getting the whole system set up. If you're looking for advice on how to set up and use a particular version control system, the Pragmatic Starter Kit series for CVS, Subversion, or Git is a great place to start.

What should go into a source code repository?


The short answer: the repository should contain everything necessary to perform a clean build of your system. In most cases, this includes the code, third-party binaries necessary for building, tests, and documentation.

It’s ok to assume that everyone has their build environment "properly configured" for building. To make sure, make a list of everything that must be set up in the environment to build the software and put it on the team wiki. These things don’t need to be stored in the repository, but you should at least write down what the standard build environment is supposed to look like. Depending on what the required software is, it might also be a good idea to keep a copy of it, just in case something happens to the vendor in the future. The last thing you want is for a vendor to stop supporting the version of something you need, forcing you to upgrade because your hard drive crashed and you had to set up a new environment.
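Once that list is written down, it can also be checked mechanically. Here is a minimal sketch in Python, assuming the wiki list has been copied into a dictionary. The tool names and version numbers are made up for illustration, and how you discover what is installed on a machine is left out entirely.

```python
# Hypothetical standard build environment, as it might appear on the team wiki.
# Tool names and version numbers are illustrative only.
STANDARD_ENVIRONMENT = {
    "jdk": "1.6",
    "ant": "1.8",
    "eclipse": "3.5",
}


def environment_problems(installed, standard=STANDARD_ENVIRONMENT):
    """Return readable differences between a machine and the standard list."""
    problems = []
    for tool, wanted in sorted(standard.items()):
        found = installed.get(tool)
        if found is None:
            problems.append(f"{tool}: missing (need {wanted})")
        elif found != wanted:
            problems.append(f"{tool}: found {found}, need {wanted}")
    return problems


if __name__ == "__main__":
    # Each developer would fill this in from their own machine.
    my_machine = {"jdk": "1.6", "ant": "1.7"}
    for line in environment_problems(my_machine):
        print(line)
```

An empty result means the machine matches the list. Anything else tells a new team member exactly what to install or change.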

Include at least the following in your standard build environment list:
  • Compiler versions
  • Team-sanctioned IDEs
  • Required frameworks, toolkits, and build tools
  • IDE extensions that the team has decided are so critical/awesome to the project they have to be used. Critical/awesome IDE extensions might enable a required toolkit (such as GWT in Eclipse) or configure the IDE in specific ways (such as coding styles or static analysis settings)

Putting code and tests in the repository is fairly obvious, but third-party binaries (e.g., libraries) might not be. Put these in version control so that it’s easy to check out a project from source control and build without monkeying around with anything. I’ve found it best to create an "ext_lib" folder for storing all the external libraries. This way there is no confusion over what versions to use, and all the build paths can be set so that anyone can build just by checking out the code.
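To show what "build paths that work straight from a checkout" can look like, here is a rough sketch in Python that builds a Java-style classpath from whatever is in ext_lib. The function name and the assumption that the libraries are .jar files are mine. The point is only that every path is derived from the checkout location, never from one developer's machine.

```python
import os


def build_classpath(repo_root, lib_dir="ext_lib"):
    """Join every .jar under <repo_root>/<lib_dir> into one classpath string."""
    lib_path = os.path.join(repo_root, lib_dir)
    jars = sorted(
        os.path.join(lib_path, name)
        for name in os.listdir(lib_path)
        if name.endswith(".jar")
    )
    # os.pathsep is ":" on Unix-like systems and ";" on Windows.
    return os.pathsep.join(jars)


if __name__ == "__main__":
    # Assumes the script is run from the root of a checkout containing ext_lib.
    print(build_classpath(os.getcwd()))
```

Because the only input is the checkout directory, the same script gives a working classpath on any machine that has checked out the project.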

Here’s a real-life example. Let’s say you’re writing a web application using the Google Web Toolkit and you rely on a caching library. The caching library should go into your ext_lib folder, and you should tuck a zip of the GWT version you use in a safe place just in case you need it later. Say your team is also using JUnit. Put the version you use in the ext_lib folder. This way everyone can build and use whatever GUI they want to run tests, be it the JUnit GUI or an Eclipse plug-in.

Another real-life example. Let’s say you use the excellent Sharp AutoUpdater component. Should you version the binary or the source? That was a trick question: it depends. The best answer is to keep only the binary of the library, but this isn’t always possible. One of the awesome things about open source software is that you have access to the source if you need it. So, let’s say you find a bug in the AutoUpdater and for some reason the maintainers aren’t responding quickly enough for your immediate needs. You can’t live with this bug, so you have no choice but to fix it yourself. Congratulations, you just took ownership of your own fork of the AutoUpdater component. You are now responsible for maintaining the code, either in your own version control repository or in a public fork, and merging with the original code base may be more difficult in the future.

What doesn’t go into the source code repository?


Remember the DRY Principle for writing code (Don’t Repeat Yourself)? Well, that applies to your version control system too. Anything that can be derived shouldn’t be held under version control. Since your source code is already in the repository, storing the built binary is a violation of the DRY Principle. The penalty? Confusion, mistakes, and avoidable headaches. Third-party libraries in the ext_lib folder don't violate DRY since you can't build them; you don't own the source.

Also, do your fellow developers a favor and keep your personal stuff out of the repository. If you’re testing, nobody else wants to see your test reports. When you run the application, keep your logging messages to yourself. Also keep anything related to how you set up your personal environment in your personal environment. The last thing I want is to open up my IDE and see the last tabs you had open because you committed your personal user settings.

The easiest way to keep these sorts of undesirables out of the repository is by setting up an ignore list. Share it among the team.
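To make the idea of a shared ignore list concrete, here is a small sketch in Python that filters a list of changed files against a set of patterns. The patterns are examples I made up, and the matching uses Python's fnmatch, which is simpler than the ignore rules of any real version control system. In practice you would use your tool's own ignore mechanism.

```python
from fnmatch import fnmatch

# Hypothetical shared ignore list. The patterns are examples only.
IGNORE_PATTERNS = [
    "build/*",         # derived binaries
    "*.class",         # compiled output
    "*.log",           # personal log output
    "test-reports/*",  # local test reports
    ".settings/*",     # personal IDE settings
]


def is_ignored(path, patterns=IGNORE_PATTERNS):
    """Return True if the repository-relative path matches any pattern."""
    return any(fnmatch(path, pattern) for pattern in patterns)


def files_to_commit(paths, patterns=IGNORE_PATTERNS):
    """Keep only the paths that belong in the repository."""
    return [path for path in paths if not is_ignored(path, patterns)]


if __name__ == "__main__":
    changed = [
        "src/Main.java",
        "build/app.jar",
        "run.log",
        "ext_lib/junit.jar",
        ".settings/prefs.xml",
    ]
    print(files_to_commit(changed))
```

Note that ext_lib/junit.jar survives the filter: it is a binary, but not one you can derive from your own source.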

How often do I commit?


Generally you should commit your changes any time you think you’ve finished something useful that doesn’t introduce problems into the system. On average, you should be committing changes at least once a day.

There are two parts to this commit rule. "Finished something useful" might mean many things. This is by design. When you’ve finished a logical chunk of code that does something, feel free to commit it. "Doesn’t introduce problems" is a common courtesy to your fellow developers. Make sure, at a minimum, the system builds and passes any automated tests you have. And always update before you commit. Depending on your team size and how important the code is, you might establish a checklist for committing. Google has theirs automated. For every change, the system has to build, pass tests, and pass a peer review before the change can be committed.
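A commit checklist doesn't have to be fancy to be automated. Here is a rough sketch in Python that runs update, build, and test in order and stops at the first failure. The commands are placeholders: "svn update" stands in for whatever your version control system uses, and the "build" and "test" Ant targets are hypothetical. This is not a description of how Google or anyone else does it.

```python
import subprocess

# Hypothetical checklist; substitute the commands your team actually uses.
CHECKLIST = [
    ("update", ["svn", "update"]),
    ("build", ["ant", "build"]),
    ("test", ["ant", "test"]),
]


def run_command(command):
    """Run one command; a missing program counts as a failure."""
    try:
        return subprocess.call(command) == 0
    except OSError:
        return False


def ready_to_commit(checklist=CHECKLIST, run=run_command):
    """Run each step in order and stop at the first one that fails.

    Returns (True, None) if every step passed, else (False, step_name).
    """
    for name, command in checklist:
        if not run(command):
            return False, name
    return True, None


if __name__ == "__main__":
    ok, failed_step = ready_to_commit()
    print("Safe to commit." if ok else f"Fix the '{failed_step}' step first.")
```

Putting update first matters: the build and tests should run against your changes merged with everyone else's, not against a stale checkout.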

Remember this mantra: Commit early, commit often.

But if everyone is committing all the time, isn’t that going to cause problems?


When you’re working with people and coordinating effort, problems will inevitably arise. Just remember, if you’re going to fail, fail early. It’s better to cause a conflict today through miscommunication, while there’s plenty of time to fix it, than the day before it’s time to deploy. Why? The conflicts will be smaller since you’re incrementally growing your code base. Also, since you made the changes recently, they are fresh in your head and easier to work with. Code more than a week old might as well have been written by someone else.

Taking a risk management approach makes this easier to mitigate. The risk: "Developers use a shared repository and commit changes frequently; this might cause code conflicts that break the build." The source of this risk is communication; therefore anything that helps communication can reduce the likelihood of this risk becoming a problem. Daily stand-up meetings are perfect for getting the word out about what everyone is working on. Automatically generated email updates from the version control system keep folks informed throughout the day as changes are made. Continuous integration acts as a smoke test for uncovering integration problems while they’re small. Good merge tools reduce the impact when a conflict does happen.

Once everyone gets used to the update-then-commit cycle, most of these problems go away. In my experience, big problems with code in the repository are usually a symptom of larger problems such as poor communication or failing processes.

What are some of your version control philosophies? What helps you keep things organized so you can get things done?