Some of my past projects with a little commentary about each. Most of this work is from the last 10 years. Where possible I’ve linked to the original site.
My work at Hubble §
So what did I do? I was the software architect at Hubble; my job entailed ensuring we wrote software that was readable, maintainable, reliable, and sufficiently performant. When I arrived, Hubble already had an established product which consisted of an Angular 1 frontend (a migration to React had just started) and a fleet of supporting services, some Django and some Node.
My initial goal was to immerse myself in the development process and identify pain points. What I found was poor test coverage, confusion about when to use Django and when to use Node, longstanding hard-to-diagnose bugs, poor application observability, and a development team that had struggled to maintain credibility in the wider business. My plan was to address the software issues by changing development practices, simplifying the tech stack, introducing a bug process that had accountability and honesty, and finding a consensus in the team on how to improve test coverage. In parallel, I would engage with the rest of the company so they felt they were being listened to, and so that the dev team was seen to respond with an appropriate amount of urgency when there were platform problems.
Starting from the last point, I set up an incident response process. As Hubble already used Slack, we added the '@firefighter' tag so that when an issue was found by one of the sales or support team they could flag it to us easily. When the tag was used, a member of the dev team could respond quickly, indicating that we were aware and investigating. We would also give periodic updates on progress. Once the issue was resolved we would conduct a root cause analysis to see why things broke and how we could prevent it happening again. We then circulated the findings to the whole company.
This might seem pretty basic, but it is the cornerstone of not just making the platform more reliable, but also building trust and showing that we cared about our colleagues' pain. Since we shared the analysis and actions of each incident, we were forced to write in a way that was clear, jargon-free and approachable. I also made myself available to anyone who had questions about our software or our incident reports.
Naturally this dovetails quite well with improving test coverage, since for many of the early fires the obvious solution was to write tests covering the faults uncovered. It also feeds very well into the bug process.
What to do about bugs? Here I decided to use an approach I was first introduced to at Songkick, called a zero bugs policy. In a typical bug process you assess the importance of each bug, putting it into a bucket something like: fix now, important, medium, fix later, rainy day. Then the only bugs that ever get fixed are the 'fix now' ones, and the others languish in the bug tracker forever. With zero bugs, anything that isn't fix now is closed as won't fix. If the person who raised the bug thinks this is wrong, they can explain why, and maybe the bug does become fix now. If later on someone re-raises the bug, it re-enters prioritization, and maybe it is a priority now.
What I like about this process is that it is honest: the raiser of the bug knows that if the bug is open, there is an expectation it will be fixed within a known time frame. Picking that time frame depends on the circumstances, but I've found two weeks seems to work, with bug prioritization happening weekly and all open bugs, including those prioritized last week, being re-prioritized. This way a bug may be closed if its priority has changed since last week.
In parallel with the zero bugs policy we also needed a way of dealing with defects in new code, so we had the concept of a warranty period. Bugs raised against features still in their warranty period go directly to the team that shipped them, and it is up to that team to fix them. The team has a lot of latitude on what that means, but generally they are keen to get them fixed. This also reinforces the team's autonomy.
For the frontends we eventually got to a fairly standard couple of React/Next.js applications. Now the 'how do I make a new X?' question is simpler: if it is a new service, you need to justify why it shouldn't be a Django application, and if it is a frontend, it's React/Next. That isn't to say we were dogmatic about it; it's more that if you want to add something new, there needs to be a good reason.
Let's be a bit more specific
Below are a few of the larger things I was involved in architecting at Hubble. It is by no means exhaustive, but these are the items I think are most interesting, and hopefully they illustrate my way of thinking and working.
Areas & parent areas
Hubble's primary search was geographic: typically a user would be looking for an office, meeting room or co-working space in a particular part of a town or city. To allow this we had defined areas, each represented as a shape in GeoJSON. The shapes have names like London, Soho, Texas and so on. Importantly, the shapes are not hierarchical, so a building can be in multiple shapes at the same time and a search in any of those shapes can return the same building. This matters because how people think of neighbourhoods in cities is highly variable, and names for places may overlap in multiple ways.
Since the areas are not hierarchical we had a slight problem, as we also needed a way to say that area 'A' was part of city 'B', that the city was in country 'D', and that as such buildings in 'A' are contained in 'D' and therefore have the currency of that country and its public holidays too.
To make this work we added optional metadata to the areas: things like type, time zone, currency, etc. When trying to find a relevant piece of metadata about a building, for instance its country, we could walk up the parent hierarchy until we found an area of type 'country'. To find a building's time zone we again walk the hierarchy until we find a parent that has a time zone set; in this way all buildings inherit their time zone from the closest ancestor that has one specified. This rather neatly solves the multiple-time-zones-in-one-country problem, and places like Arizona where some parts follow daylight saving time and some do not. It also allows us to hang public holidays off the same mechanism, letting us deal with regional as well as national holidays. It is incredibly flexible.
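The inheritance walk described above can be sketched in a few lines of Python. This is an illustration rather than Hubble's actual code: the Area class, its fields, and the example place names are invented for the sake of the example.

```python
# A sketch of metadata inheritance over non-hierarchical areas that have an
# optional parent chain: a lookup walks up the parents until some ancestor
# defines the requested property.
class Area:
    def __init__(self, name, parent=None, **metadata):
        self.name = name
        self.parent = parent
        self.metadata = metadata  # e.g. type, timezone, currency

def lookup(area, key):
    """Return the value of `key` from the closest ancestor that defines it."""
    while area is not None:
        if key in area.metadata:
            return area.metadata[key]
        area = area.parent
    return None

usa = Area("USA", type="country", currency="USD")
arizona = Area("Arizona", parent=usa, timezone="America/Phoenix")
```

Here `lookup(arizona, "timezone")` finds the time zone set on Arizona itself, while `lookup(arizona, "currency")` falls through to the country, which is exactly the Arizona daylight-saving case mentioned above.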
The goal was to make setting up and running any of Hubble's frontend applications or services as painless as possible. My initial intention was that the developer would only need to install make. Later, as part of the migration to Google Cloud Run, the Google Cloud CLI was added to the list.
With the prerequisites installed, it should be possible to check out any Hubble project and run it using the same commands, regardless of language or framework. For this, make is a very convenient and easy way to run shell scripts behind a consistent interface. I also aimed to give understandable and useful error messages; as an example, if the Google Cloud CLI is not installed, the script exits with a message saying it is needed, along with the URL of its download page.
With all this in place, setting up a project for the first time was a matter of git clone <repo> && cd <repo> && make setup run. This was pretty satisfying and removed what had previously been potentially hours of faffing getting all of an application's dependencies installed. Running the tests also had a make target, so again it didn't matter what testing framework or language was being used: make test would run all the tests.
The approach of having everything in Docker did necessitate some additional work to enable things like automated linting in git hooks. For the editor, VS Code (which the whole team was already using) has remote containers, which made most of its configuration, for linting and the like, very easy.
Services & development environments
Typically in development you only really want to be running the application you are working on. So, for example, if you are working on your Next.js-based frontend, you are happy to be using the staging services. This is convenient: your staging environment has production-like data, so you can work on the application with high expectations that it will behave as it does in production. Occasionally, however, you may be adding a new feature where it is helpful to run the service locally too. Again, having everything running in Docker makes life easy. We have a configuration that points to staging, but when we want to use a local service we can do something like:
PASS_URL=http://pass:8080/ make run and now the locally running app will connect to the locally running service instead of the default staging one. If you are feeling masochistic you could attempt to run our entire fleet of services and frontends on your development computer but I doubt it would be fun.
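The pattern behind that command is just an environment variable with a staging default. A minimal Python sketch of how an application might resolve a service URL (the helper name and the staging URL are invented for illustration):

```python
import os

def service_url(name, default):
    """Resolve a service's base URL: an environment variable named after the
    service (e.g. PASS_URL) overrides the staging default."""
    return os.environ.get(f"{name}_URL", default)

# With no override set, the app talks to staging; with PASS_URL exported,
# it talks to the locally running container instead.
PASS_URL = service_url("PASS", "https://pass-staging.example.com/")
```

Because the override is per-service, you can run just the one service you care about locally and leave everything else pointed at staging.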
Google cloud migration
After a major security incident at our hosting platform, Heroku, we decided it was time to migrate somewhere new. After considering multiple options we settled on Google Cloud, specifically using Google Cloud Run to more or less replicate what we had on Heroku.
Google Cloud Run was a good fit for us because Dockerfiles can be used to represent the server configuration (which we mostly already had from the earlier work on the development environments); it integrated with our CI platform; and it had a low ops requirement. It also provides good logging and log aggregation, a mechanism for running one-off jobs (Cloud Run jobs), and a secrets manager with a secrets API. The secrets API was particularly useful for giving us a way to share secrets for the development environments.
We had a rather ambitious three-month migration target, which very quickly led us to the conclusion that we would need to manage our Cloud Run setup using Terraform. This allowed us to create standard machine types, like web frontend or Django service, and provision them very easily. We were also able to move DNS and cron task management into Terraform.
For our Django services we were using the django-rq plugin, which requires an environment in which to run its workers. In Heroku this was pretty simple: add a new dyno for the worker. In Cloud Run, not so much: as part of the container contract an environment must respond to health checks on port 8080, and rq workers don't do HTTP.
We used a bit of a hack to work around the Cloud Run limitations: using the v2 environment we could run multiple processes in the container. So, using the Python standard library HTTP module, we spun up an HTTP server that responded to requests with a '200 OK' response. This allowed the containers to start up and stay up. We then set the minimum instance count to 1, so there would be at least one worker running, and, boom.
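A minimal sketch of that hack, using only the standard library: a trivial HTTP server satisfies the health checks on port 8080 in a background thread, while the main thread is left free to run the rq worker. The worker start-up itself is indicated only by a placeholder comment; this is the shape of the workaround, not Hubble's exact code.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Answer every GET with 200 OK so the container passes health checks."""
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the container logs free of health-check noise

def start_health_server(port=8080):
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server

if __name__ == "__main__":
    start_health_server()
    # ...start the rq worker in the main thread here...
```

With a minimum instance count of 1, the container stays alive answering health checks while the worker process consumes jobs.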
There were some subtleties we had to work out; for instance, we needed the container's processes to respond to Unix signals. For this we used Google's recommended approach: 'tini' as init, so it could manage the other processes in the container and respond to Unix signals. We had some problems around Redis reconnection early on, but ironed them out eventually, and the system is now pretty reliable. Is it optimal? No. Does Google Cloud offer a perfectly good pub/sub service for this sort of thing? Yes. But we had a tight deadline, and porting all our services to pub/sub in the time we had was not possible. So, quite a good hack that let us get done in time, with minimal changes to our applications.
Along with all the Docker work that was part of our migration to Google Cloud, I took the opportunity of needing to rewrite all our projects' README files to centralize the common information and then use a template to generate per-project READMEs. This made keeping shared information consistent across READMEs easier, and I used the templating script to generate a table of contents for each README automatically. It's a small thing, but this sort of automation can really reduce the burden of keeping documentation up to date.
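The README generation idea can be sketched with the standard library's string templating. The template text, variable names, and shared content here are all invented; the point is only that the common prose lives in one place and each project substitutes its own details.

```python
from string import Template

# Shared prose maintained once, reused in every generated README.
SHARED_SETUP = "Install make; every project then works the same way."

README_TEMPLATE = Template("""\
# $project

$shared_setup

## Running

    make setup run

## Testing

    make test
""")

def render_readme(project):
    """Generate a project's README from the shared template."""
    return README_TEMPLATE.substitute(project=project, shared_setup=SHARED_SETUP)
```

Updating `SHARED_SETUP` and re-running the script would then refresh every project's README in one go, which is what keeps the shared information from drifting.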
I joined Hubble when its software platform was far more mature than either Songkick's or The Guardian's, and so I first had to learn what was there and understand how it got there. It is always important to remember that while a past decision may look like a bad one now, it was made with the best of intentions using the best information available at the time. Another observation is that it doesn't really matter whether the development team is working well or badly: if it isn't talking to the other teams, listening and responding to them, the perception will be that it is doing nothing.
The biggest difference this time around was realizing that my main role as architect was to engender ways of working and thinking, alongside guiding technical decisions. Since it is impossible to be involved in every decision, you need to educate every developer to make good architectural decisions. The architect's job then becomes supporting the teams' decisions and providing a framework to shape them.
One of the things I'm most proud of is the development environment. At most previous jobs, getting set up for the first time and keeping my workstation working has been a pain. Dependencies won't build (Nokogiri 'Failed to build gem native extension', anyone?), then the application fails to start because it has an undocumented dependency … the list goes on! How do we share secrets? Where do we get production-like data from? Ugh. Now, Docker is a relatively recent phenomenon—it wasn't an option way back in 2009 at Songkick (we had our own server room with dev environments running in virtual machines!). Scripting all the little setup tasks, making the steps as uniform as possible across all languages and frameworks, and giving helpful error messages: that was a massive win.
Another lesson was a bias towards automation. Fairly early after I'd joined, I was pushing for continuous deployment (as I'd experienced at Songkick). At Hubble, due to the aforementioned poor test coverage, it took us a while to get there, but we did, and again it was a win. Later, when working on the development environments, it meant writing scripts for all the niggly little setup steps that really are just donkey work. Setting Dependabot to automatically merge (and as a result deploy) security patches. Using Terraform to describe our infrastructure. It all adds up, and it is a very tractable way of making the whole team more productive.
With all that said, I feel Hubble's tech and tech team are in a really good place now. The move to GCP has given them scalability and a lot more scope for growth than Heroku ever offered, and I feel the product and development teams are collaborating better than ever.
My work at Songkick §
From May 2009 until July 2018, for all intents and purposes nine years, I worked at Songkick. I was part of a growing team, went through a merger and then finally an acquisition; it was a long and eventful journey.
So what did I do? Well, I was part of an amazing team that built Songkick, Tourbox, Songkick Tickets, and Detour. The last two have now been retired, but they were significant products that benefited from the Songkick platform we built, on which Songkick.com and Tourbox still rest. I cannot overemphasize that everything mentioned below was done as part of a team; I certainly could not have achieved what I did, and learnt so much, without their help.
I joined Songkick to work on the then secret Songkick 2. It was a much more featureful version of the existing site, with an emphasis on users building their concert-going history. It was ambitious, with user activity feeds, comments, reviews with star ratings, setlists, interactive visualizations of users' concert histories, private user accounts, user-to-user messaging, tracking of users as well as artists, venues and cities, and quite a few more minor features, like most popular venue in a city.
Shortly after launch we started adding features; we spent 2010 iterating on product ideas to increase user engagement and sign-ups. With these additions the product became increasingly difficult to work with: our tests were taking longer and longer, and our bug count was increasing at an alarming rate.
This came to a head in 2011, when we realized we would need to re-architect the Songkick web application. Our single Rails application was doing everything: ingesting data, sending emails, serving web pages, and serving the admin function. The monolith had become too big; it was hard to change, hard to deploy and hard to test—something had to give.
Frontends & services
I was asked to explore what we should do to improve the situation. After much research and discussion with the team, I proposed we move gradually to a service-oriented architecture. The goal was to keep releasing the application as we moved it to the new approach. To make this possible we decided to take vertical slices through the app on a page-by-page basis, migrating the artist page, then the venue page, and so on. The idea behind this was twofold: first, by exposing the new code in production as soon as possible we would identify bugs or weaknesses sooner; and second, we would not have to worry about the new application diverging from the production application. Doing it this way was very much inspired by my experience at the Guardian where, while we were writing a new application rather than restructuring an existing one, we still released it incrementally, allowing us to solve problems as they arose rather than suddenly on release day.
When unpicking a monolithic application and moving it to a services-based approach you have a chicken-and-egg problem: you have an application that talks to a database, you have no services, and it isn't initially clear which bits of your domain model belong in which hypothetical service. To overcome this, as we sliced our way through the application we added a set of service classes. Imagine a line in the code: above that line nothing talks to the database, and below it nothing talks directly to the rest of the application. Below the line sat the Songkick domain model, which spoke to the database and exposed a JSON API. Talking to this API was a client service class that turned the JSON back into Ruby objects representing the domain objects of Songkick's model. This allowed us to behave as if we already had services while also figuring out which services we would need and which parts of the model belonged in each.
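Songkick's code was Ruby, but the shape of those client service classes translates to any language. Here is a Python sketch of the pattern, with the domain class, field names, and transport entirely invented: the caller gets domain objects back and never needs to know whether the JSON came from an in-process call into the monolith or an HTTP request to a real service.

```python
import json
from dataclasses import dataclass

@dataclass
class Artist:
    """A domain object rebuilt from the service's JSON."""
    id: int
    name: str

class ArtistService:
    def __init__(self, fetch):
        # `fetch` stands in for the transport: initially a call into the
        # monolith's own models, later an HTTP call to the extracted service.
        # Swapping one for the other requires no change to callers.
        self._fetch = fetch

    def find(self, artist_id):
        payload = json.loads(self._fetch(f"/artists/{artist_id}"))
        return Artist(id=payload["id"], name=payload["name"])
```

This is why the extraction could proceed in parallel: the "line in the code" is the `fetch` boundary, and either side of it can be replaced independently.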
Once we were far enough along, another part of the team came in and started turning those virtual services into real ones. We were able to continue the migration in parallel without even knowing which bits were now talking to actual services and which were still in the application. I am glossing over some of the harder bits, like how, while in transition, you share models between the in-migration monolith and the new shiny services. Keeping track of the dependencies becomes very complex and for a while caused us a lot of pain (see a diagram from an early part of the project).
So now we have a system where data is presented to the user via a series of specialist frontend Rails applications: an admin app, a sign-in/sign-up application, a public API and so on. Backing these up is a fleet of services, with only the frontends publicly available on the internet. In the next section, let's discuss those frontends.
Pages & components
This approach proved to be very maintainable; in particular, our CSS became much easier to modify, and the number of visual bugs declined significantly compared to the old codebase. It is also very easy to test: since each component is designed to be self-contained, mocking its dependencies is very straightforward. It made adding, and crucially removing, features much easier, while also providing an obvious level at which to put many A/B tests.
While we moved the views to the above approach, we also moved the models, which are defined by Songkick's domain, into services. Now we had frontend applications primarily concerned with presenting information to our users, and a fleet of services containing Songkick's domain model. This separation allowed us to ship new and changed features very quickly.
Our services were extracted from the monolith, and we followed a pattern with them: they were Sinatra applications; each service had its own database; they were independently deployable; most requests to them were synchronous; where we needed asynchronous behavior we used RabbitMQ; they accepted HTML form-encoded data and JSON, and they emitted JSON; and each service was self-contained.
Quite early on we decided not to version our APIs. Instead, as we owned the whole stack, if we decided to change an API we would do the forward-compatible/backward-compatible dance. First we changed the service to expose the new API and shipped it. Then each client application was updated to use the new API and shipped. Finally, the old API was removed from the service, which was once again deployed. This may seem like a faff, but it allows the service APIs to evolve without building up cruft; and, as with user-facing products, features that are not pulling their weight need to be removed.
The above all sounds very smooth, but making the migration to the point where all services were self-contained and had their own databases took a long time. Crucially, though, we could do it incrementally, and we were able to do it while delivering value to the business every day. After the initial six-month push to simplify and rewrite Songkick.com to the new architecture, it took another year for a smaller portion of the team to finish creating all the services. And while the migration continues to this day, it achieved what we needed, unblocking the team: we could once again make changes to our product and ship them quickly.
Once we had finished (it's software, nothing is ever really finished) the service-oriented architecture migration, we were back on track and shipping new features faster than ever. For the first time at Songkick, the development team wasn't the bottleneck.
Life in an SOA world
Naturally, completely changing the way you make software will create new problems and present new opportunities. Since any data lookup now involved a network request, the client applications had to be aware of this cost and not accidentally make unnecessary requests. Some things required a new endpoint so we could do a batch lookup instead of making hundreds of individual requests. Knowing which user action triggered which service calls was an unexpected challenge. With new problems come new ways of doing things: we added request tracing to our frontends and services so we could log which frontend request resulted in which SQL query in a service. We added a display (visible only to developers) of which services were called when a page was visited and how long each call took, with any duplicate calls highlighted in red. And when something changes, more than one service may need to know about it, so we had to tackle that too.
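The request-tracing idea is simple to sketch. The header name and the plumbing below are invented for illustration, but the mechanism is the standard one: the frontend assigns each incoming request an ID, attaches it to every downstream service call, and includes it in every log line, so one user action can be followed across the whole fleet.

```python
import uuid

TRACE_HEADER = "X-Request-Id"  # hypothetical header name

def ensure_trace_id(incoming_headers):
    """Reuse the caller's trace ID if present, otherwise mint a new one."""
    return incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex

def outgoing_headers(trace_id):
    """Headers to attach to every downstream service call."""
    return {TRACE_HEADER: trace_id}

def log(trace_id, message):
    # Every log line carries the trace ID, so grepping the aggregated logs
    # for one ID reconstructs the full fan-out of a single page view.
    print(f"[{trace_id}] {message}")
```

With this in place, correlating a slow page with the SQL it triggered is a matter of searching the logs for one ID rather than guessing from timestamps.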
Using services, like all engineering decisions, is a compromise: they add new constraints and new ways of failing. They require a different way of thinking, and if you make the wrong service, or decide a part of your domain model belongs in a different service, moving it can be hard. Now you also have to worry about networking and all the new strengths and weaknesses that implies. But given the speed of development and the ease of adding new products they enabled, the benefits for us far outweighed the disadvantages.
With the re-architecture out of the way, we could go back to the important work of improving Songkick's products and making new ones. At the start of 2013 Songkick decided to enter the ticketing market. I was the technical lead on the project, and this was the first new project using our service-oriented architecture and the page-and-component model for the frontend. We had a very ambitious target of selling our first ticket in February, so we didn't have time to hang around. Naturally, the first thing we did was get the team together to create a common understanding of what selling tickets would entail. There were several sessions where we started to thrash out our domain model. These early sessions, done before we started writing any code, were very productive and set us up for later product additions. One of their main strengths was establishing a shared understanding of how things would work and what they would be called. I am convinced that we were able to develop the product faster, and with better code, because of the time we spent in those modelling sessions.
We wanted to ship the first version of the app as soon as possible, so we pared it down to the most minimum of minimal viable products: general admission and e-tickets only, with a very simple user interface. By keeping everything as simple as possible we were able to go live five weeks later.
With the first version shipped, we could start user testing and adding features as we needed them. First came paper ticketing and then reserved seating, both major additions to the product requiring a revisit of the model we had devised at the beginning. We continued the approach of discussing the modelling before starting to code, refactoring existing concepts to fit more comfortably with the new features.
Using our service and frontend approach allowed us to easily include ticketing in the Songkick ecosystem, adding user accounts and self-service for some ticketing functions without needing to change the core ticketing application or any of the other Songkick applications. This flexibility is one of the things that makes service-oriented architectures so attractive: they allow you to add features to your application in a highly modular fashion, which is not only good for adding new things but makes taking them away later much easier.
Post merger ticketing
The fan registration app was interesting because it had to be architected to be finished in a specific time frame with the programmers available. It also had to cope with potentially very high demand. To this end we chose to build it with a Go service and a React frontend. For managing state on the frontend we used Redux; this was 2015, so fairly early on to be using it. The registration application was necessarily very minimal given the constraints. That said, it was very successful, easily handling the traffic and delivered in the very short time required.
During this time I also started to step away from day-to-day coding to concentrate more on ways we could improve our application architecture. My main goal was to identify how we could change our codebase to make changes easier and less risky. To do this I brought together many of the development team and got them to tell me how they thought our application worked, which parts talked to each other, and what would happen if one part failed. By doing this I learned that not only did no one person know what everything did, but many people had contradictory ideas of what did what. So I set out to document the current state of our architecture and then worked with the team on how we could simplify it.
Another thing I started doing was giving talks for non-technical people, explaining concepts like reliability, robustness, why we use libraries, and the tradeoffs of using them. The aim of these talks was to demystify the concepts we used every day and give insight into the what and why of what we were doing. I believe this sort of explanation is key to building trust: if what the development team does is not understood by the wider company, it will be hard for the company to trust that the team really does need the time and resources it is asking for. Additionally, by explaining our jargon we make talking to our colleagues easier and less prone to misunderstandings.
Things I learned about shipping features quickly
Along the way, having worked on many projects at Songkick, and before that at the Guardian, I think I’ve learnt a few things about shipping software quickly. Below are a few things, not in any particular order, that I think of as key.
Have feature flippers
This way, if a new thing is broken or performing poorly, you can turn it off without needing to redeploy.
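A toy sketch of a feature flipper (the flag store and flag name are invented; in practice the flags would live somewhere the running application can re-read, such as a database or shared cache, which is what makes the no-redeploy switch-off possible):

```python
# In reality this dict would be a live store (database, Redis, etc.) that
# can be changed while the application is running.
FLAGS = {"new_checkout": True}

def feature_enabled(name):
    """Unknown flags default to off, so new code ships dark."""
    return FLAGS.get(name, False)

def render_checkout():
    if feature_enabled("new_checkout"):
        return "new checkout page"
    return "old checkout page"
```

Flipping `new_checkout` to `False` in the store instantly routes every user back to the old page, with no deploy in the critical path.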
If possible, roll forward not back (very much facilitated by feature flippers)
If you have broken production (it happens to the best of us), it is preferable and faster to ship a fix rather than go through the revert, ship, fix, ship cycle.
Every feature should have success criteria
If, after some predetermined time, a feature hasn't met its success criteria, it must be removed.
Where possible A/B test new features
For the A/B test, implement the feature in the cheapest possible way. Most new features fail, so there is no point making it perfect when you are very likely to be deleting it in a week's time.
If a feature does succeed throw away the experiment code and now do it properly.
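One cheap, stateless way to assign A/B variants is to hash the user ID, so a user always sees the same variant without storing any assignment. This sketch is illustrative (the experiment names are invented), not a description of Songkick's actual A/B system:

```python
import hashlib

def ab_variant(user_id, experiment):
    """Deterministically assign a user to variant 'A' or 'B'.

    Including the experiment name in the hash means the same user can land
    in different buckets for different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return "A" if digest[0] % 2 == 0 else "B"
```

Because the assignment is a pure function of user and experiment, throwing away the experiment code later leaves nothing behind to clean up, which suits the delete-most-experiments approach described above.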
Test in proportion to importance
Easy to say but hard to get right. You should define the key features of your application — these should be exhaustively tested, with unit tests, integration tests, and acceptance tests. Then, as features decline in importance, the level of testing can be lower.
Have cross functional teams
This means having teams that represent all the functions involved in making software: design, development and product. When the three work well together the result is magic: features are well designed and technically feasible, and because of the shared understanding of both how and why something is being made, the whole design and development cycle is faster.
Constantly review your process
The team should be striving to improve how it works all the time, so having a way to periodically review what is going well and what isn't is key.
Always dedicate some amount of effort to reducing tech debt
This is a hard one, but tech debt slows you down and can eventually cost you the agility and speed you need to react to change. Continually clearing some of it up is the best way to avoid being hamstrung by it.
As always, these are guidelines, not gospel, but I have found them to be a solid way of working. The removal of unsuccessful features is particularly important: every line of code is something extra to maintain and understand, and a potential source of bugs. If you want to be able to make changes with confidence, not only do you need a well-structured system; the less of it there is, the easier it will be.
Nine years at Songkick involved making a lot of software, some of which is still around and a lot of which has been retired. I learned a huge amount about how to make software sustainably. Obviously every team will have its own way of doing this, but I genuinely believe there are general principles too.
In particular, cross-functional teams are key: they help avoid the 'us and them' mentality which can develop between programmers and designers or product managers. Having all the aspects of product development on the team also improves communication, and so reduces bugs due to missed requirements and misunderstandings. It encourages conversation, builds trust and allows for better decision making at all levels.
Finally trust the teams to do the right thing. If you have good communication and constantly address weaknesses in how you work, then you will be able to trust the team to make the best decisions. This confidence and trust will allow them to move fast and make good software.
Qcon London 2013 §
I gave a talk at QCon London as a speaker on the Architectures of the Small & Beautiful track. The talk was about our re-architecture of Songkick's Rails application; I covered some of the reasons why we needed to clean it up and what that gave us after six months of work. It covers some of the same ground as my post on Songkick's devblog, but with more about the why than actual discussion of the code, although there was still some code in it. You can download a PDF of the slides if you like.
My work at the Guardian §
From May 2006 till May 2009, I worked at the Guardian on guardian.co.uk. It was without a doubt the largest and most ambitious project I had worked on. We redesigned and redeveloped (r2) the vast majority of the Guardian’s web site.
I started at the Guardian as the first client-side developer and as such was in the privileged position of being able to decide on coding standards and having free rein over template structure. Needless to say I made a lot of mistakes, though I also think I got most of it right. By the end of my stay we had 280 templates, over 300 components and 2MB of css, and it all mostly worked everywhere. I’m going to try and describe what and how, with some thoughts on what not to do thrown in along the way.
Templates & components
The templating language chosen for the r2 project was velocity, which has the virtue of being very simple. It did force some design decisions, such as data sources and the very extensive nesting of components. In most cases velocity is easy and readable. I’m not going to discuss the language in any detail since the structure of our templates is generally applicable and velocity’s strengths and weaknesses are of limited interest.
At a template level our approach was very conventional: each page has a template and these templates have the bare bones of markup. We didn’t modularize this as much as we might have. A defining characteristic of our templates is template metadata, a map of properties and values that the cms uses to apply appropriate templates to content. It also sets the template’s name in the cms gui and sets a unique template id. The template id allowed us to move and rename the template files without affecting the operation of the cms. The use of a map meant the values were labeled, so in many ways this becomes self documenting. We also hooked our css merging tool into the template metadata so we knew what to call the merged css file for that template.
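As a sketch, the metadata could be expressed in velocity something like this (the property names and values here are invented for illustration, not the actual Guardian metadata):

```velocity
## template metadata: a labeled map the cms reads before rendering
## (property names and values are hypothetical)
#set( $metadata = {
  "templateId"   : "t-0042",
  "templateName" : "Travel article",
  "cssBundle"    : "travel-article"
} )
```

Because each value carries a label, a developer reading the template can see at a glance what the cms will do with it, which is what makes the map self documenting.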
Our approach to components is quite a bit more interesting. A component is the basic way of getting code reuse and as such is the building block of the frontend. By the end of the r2 project some of our core components were being used in hundreds of places, which meant that adding some new functionality to every page on the site was often easier than adding it to a limited number of pages.
Possibly our best decision was that if a component has no data to display it will output nothing at all: no empty <div></div> tags, no empty lists, no comments, nothing! While this presents a problem when debugging (how we solved that I’ll come to later) it keeps page weight down, and empty components cannot affect the page layout. We also kept our comments in velocity so they never turned up in the html, allowing us to duck the duplicate content bug in ie 6; in truth comments in the html served to the web add nothing. That said we do have a few and we probably could get rid of them, but I don’t think we have enough for it to matter deeply.
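In velocity the empty-output rule comes down to a guard at the top of each component; a minimal hypothetical sketch (component and variable names are made up):

```velocity
## trailblock.vm -- if there is nothing to show, emit nothing at all
#if( $trails && $trails.size() > 0 )
<ul class="trailblock">
#foreach( $trail in $trails )
  <li>#parse( "components/trail.vm" )</li>
#end
</ul>
#end
## no #else branch: an empty component leaves no trace in the html
```

The absence of any #else branch is the point: the fallback is nothing, not an empty wrapper.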
Since everything on the Guardian site is workflowed and we have our own development environments, the application already runs in multiple modes. In preview mode or on a developer workstation our components are wrapped in comments, with the component name and a few other useful items. This is invaluable when debugging, since finding the component causing the problem is otherwise hit or miss, and some of our components are nested eight levels deep; just try tracking down the troublemaker without some kind of marker. I would recommend doing this, or something similar, in any templating system. It will save hours of time and help keep your client-side developers sane.
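Something along these lines is all it takes ($renderMode and $componentName are invented names for this sketch, not the real ones):

```velocity
## wrap-component.vm -- markers only appear in preview/developer mode,
## never in the html served to the public
#if( $renderMode == "preview" )
<!-- start: $componentName -->
#end
#parse( $componentTemplate )
#if( $renderMode == "preview" )
<!-- end: $componentName -->
#end
```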
The components themselves were developed to do one thing, which resulted in several layers to each piece of functionality. As mentioned, we nest components deeply; often a component would be purely decision making, choosing which component is most appropriate to handle the shape of the data. It also means that a component will commonly be just 10 to 20 lines, with some choices, a couple of tags and lots of calls to other components. The nesting of components is the primary way we get code reuse. While velocity does have macros, they simply didn’t have the flexibility afforded by our components, particularly since macros could not have optional parameters.
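A decision-making component of that kind might look like this (a hypothetical sketch with invented component names, not actual Guardian code):

```velocity
## trail.vm -- pure decision making: pick the component that fits the data
#if( $trail.picture )
  #parse( "components/picture-trail.vm" )
#elseif( $trail.standfirst )
  #parse( "components/standfirst-trail.vm" )
#else
  #parse( "components/linktext-trail.vm" )
#end
```

Each branch hands off to a component that itself only does one thing, which is how a piece of functionality ends up as several thin layers rather than one long template.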
This emphasis on lots of small components may seem to be storing up trouble for the future as the number of places where changes need to be made constantly increases. In fact, because the individual components are generally very simple, reading and understanding them is easier. On the occasions when there is a lot of complexity it is centralized. We also have a strong emphasis on deleting unused code and components.
Template vs layout
As I mentioned before we had approximately 280 templates. Many of these did in fact correspond to one layout, but not all: some of our templates, particularly ones for the front of the site, for sections of the site, or used to give special emphasis to a subject, had multiple layouts.
The cms did not use wysiwyg editors (and this is a good thing) it did however allow the editors to place hints — when a new column should start, whether to use a special rendering and so on. These hints could then be interpreted by the template and produce an alternative layout. This is an incredibly powerful device and we used it extensively. It was also this hinting system that drove much of the component choosing behavior mentioned above. The other driver of component selection was image size, the component used would change depending on the image width (we almost never used height). This works so well because the Guardian’s grid meant image sizes were entirely predictable.
The distinction between a template and a layout has some interesting repercussions for training. While a one to one relationship between a template and a layout is easily understood, the way our templates reshape themselves depending on the content placed and the hints applied can seem quite mysterious.
When dealing with a site as complex and ambitious as the Guardian’s the way css is structured becomes critical. Writing maintainable css is not a topic that has received much discussion and certainly was not a hot topic in 2006.
The lessons we learnt over the course of r2 are relatively straightforward and easy to follow.
Create a style guide and stick to it
Consistency is important; keep the coding style guide short because otherwise no-one will read it. Make a point of enforcing it.
Never use css shorthands
This is probably the most controversial point, but I stand by it. Shorthands affect the cascade, making it harder to interpret. Since their properties can often be ordered arbitrarily they are impossible to search for. Some shorthands simply aren’t readable.
Don’t use hacks
For the real troublemakers (internet explorer) use conditional comments. For the other browsers the differences are now so small you should not need hacks.
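Conditional comments let you serve extra rules to old versions of internet explorer while every other browser sees only an ordinary comment; for example (file names invented):

```html
<!-- base styles, hack free, for every browser -->
<link rel="stylesheet" type="text/css" href="/css/article.css">
<!--[if lte IE 6]>
<link rel="stylesheet" type="text/css" href="/css/article-ie6.css">
<![endif]-->
```

The ie-only fixes live in their own file, so the main stylesheet stays clean.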
Put css for a particular styling in its own css file
This makes the css modular and means when the styling needs to change it only needs to happen in one place.
Don’t comment out code — delete it
Since having versioning software means all changes are undoable, if you think some code is no longer needed delete it. Even if the removal is temporary, make a note in your commit comment about the code needing to come back. Commented out code lingers around, causes confusion, and if you are merging the css can lead to odd bugs.
Trust the browser
If the styling is different in Safari or Firefox (especially version 3) it is probably because you don’t understand what is going on.
Put one selector per line
This plays nicely with merge tools and so makes the css work better in source code management tools. It also makes the code more readable when there are many selectors.
Put one property and value per line and put a space after the colon
Again this works well with merge tools and makes searching the css easier.
Merge your css
This makes using lots of small files work: you don’t serve thousands of individual files to the client, but you can still work this way.
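Put together, the rules produce css that looks something like this (an invented example, not actual Guardian code):

```css
/* trailblock.css -- all trailblock styling lives in this one file */
.trailblock h2,
.trailblock h3 {
  margin-top: 0;            /* longhand, never the margin shorthand */
  margin-bottom: 0.5em;
}
.trailblock li {
  border-bottom-width: 1px; /* longhand keeps properties searchable */
  border-bottom-style: solid;
  border-bottom-color: #ccc;
}
```

One selector per line, one property and value per line with a space after the colon, and no shorthands: dull to write, but easy to merge, search and read.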
The above list is not dogma; it is what we learnt over the course of three years. It is also what makes wrangling 2MB of css possible. That is not to say it will lead you to perfectly maintainable and readable code, but it does help.
It is also worth noting that the recurrent bugs (the ones that took longest to solve, were hardest to diagnose, and whose fixes had the most unexpected consequences) were invariably where we didn’t follow these principles.
Markup the good the bad and the unintelligible
Choosing the correct tag for the job is a fundamental part of client-side development. It is very easy to invent a new markup language using just divs, spans, classes and ids, but it is counterproductive. Html has slightly fewer than 100 tags: some are pointless (big anyone?), some are clearly the result of html being designed by computer programmers (var, code, samp, kbd and on …) and some are there to trip up the clever. The most notable of these is the definition list (dl, dd and dt). People just seem to want to use it whenever they can, and a favorite use is captioning images; why, I don’t know.
On the Guardian’s website we tried to follow a very simple system: label things what they are. This works quite well, but occasionally you do find yourself wondering what something really is. This can lead to navel-gazing. We applied the naming principle to class names and ids as well. To decide what something was named we looked to the editors and the subs, and followed their naming conventions whenever possible. This was invaluable since it meant we all had the same vocabulary: when a bug was raised or a question asked, both parties would be talking about the same thing using the same name. It also meant that if the function of some markup changed (and it does with remarkable regularity) then the tags, their class names and ids had to change also. This may sound like a large overhead but it isn’t; the time saved not having to translate badly chosen labels all the time more than compensates for the additional effort.
The use of the business language to define the labels used in the code is, I believe, one of the best decisions we made in r2, and I would recommend anyone do the same wherever possible. Often the business names have evolved through long usage and as such provide clarity and consistency, not an easy thing to achieve.
The emphasis on the correct use of tags resulted in some interesting observations. For instance, a while ago Google published an analysis of the use of tags in web pages. Looking at their numbers we end up as outliers on the right of the graph, with on average 1500 tags per page and 33–35 different types of tags per page (the most common number being 18). It also gave us a very concrete thing to consider when interviewing candidates.
Invariably we would ask, ‘How would you mark this up?’. The replies would range from the terse to the absurdly detailed. In interviews and in coding tests I started to notice some strategies around markup, choices made about which tags to use and which property values to use, that became something of an alarm bell. The use of definition lists to put captions on images was one of the clearest signs of someone being too clever by half. The use of numeric values for the font-weight property suggested a lack of pragmatism and a failure to understand how browsers work. I know that Safari 4 and Firefox 3 have started supporting this, but since most people have only two weights of their fonts installed, and font-weight: 500 doesn’t exactly scream bold, this kind of thing is best avoided. An over-enthusiasm for absolute positioning was also a bad sign, as absolutely positioned layouts tend to be brittle, especially when the font size is increased. That is not to say it should never be used, just that, since absolutely positioned blocks are taken out of the flow, if the flow changes the consequences are seldom pretty.
Naming is never an easy task: is this a list with only one item, or is it actually just a paragraph? Do I sound like I’m about to disappear up my own arse? Well yes, but sometimes getting a name right is important.
So how do you avoid navel-gazing? Start off with the principle that you can always change it later. Then check whether the users already have a name for this thing; if they do, use it. If they don’t, does it matter that much? Finally, consider some other person coming to your code: will they understand what it is meant to do? This last is a good reason to avoid abbreviations; what does vss mean? Who knows.
All this adds up to being careful about the names chosen and being willing to change them. And maybe a little navel-gazing is not such a bad thing in the long run, just, don’t get carried away.
Accessibility is a complex and subtle subject, although you might not believe it if you listen to some of the accessibility zealots. I passionately believe web sites should be accessible by all; the part I’m having increasing difficulty with is how this is achieved.
Many of the accessibility guidelines, such as the w3c’s wcag, present no evidence. We are expected to take it as a given that what is in the guidelines is actually true. I’m not convinced they all are, and neither are all the disability action groups. I remain unconvinced that setting longdesc on an image, when no user agent exploits it, actually aids accessibility. The blind, sometimes frighteningly passionate, demands for these things, I feel, cause more harm than good.
Then there is the complexity. Much of the accessibility debate revolves around the mythical screen reader user. This person is the poster child and, in some cases, the only subject considered. The emphasis on one disability is not helpful: users with poor eyesight outnumber blind users by an order of magnitude (should we not cater for them?), and what about motor difficulties? All these questions, and precious little debate about what to do to really help and improve a site’s accessibility. This becomes even more difficult with a site as large and complex as the Guardian’s.
As to our commenting system — we’re working on it ok?
So that I’m not accused of spouting unsubstantiated rubbish about accessibility advocates, a few links …
‘Accessibility is a harsh mistress’, or ‘Why the Alt Attribute May Be Omitted’ (try reading some of the comments). I could go on; Joe Clark has some good rants about all this too. At the moment many web accessibility advocates are doing themselves and the people they claim to represent a lot of harm by being so extreme.
Solving the ‘but I want this thing there’ problem
This is not an easy one. Early on we realized that customizing what appears on a page is a very desirable feature. The thing is, these customizations often only make sense when applied across many pages; per page editing is of limited value for certain items, like a weather widget. In some cases this problem is solved by creating a template with the customization required, but this is of only limited use since you end up with 300 odd templates and maintaining that many becomes arduous.
For the first section we tackled in r2 (travel) we didn’t consider this problem. Having completed travel it was clear that some kind of mechanism would be required to customize which components appear on a template without requiring developer intervention. The need was particularly acute for our promotional content. Since the need was strongest there and it was generally confined to one area of our templates (the far right column) we devised a system which allowed the editors to make rules as to when a component should appear. This deceptively simple idea has far reaching consequences. It allows customization against many criteria and has been used for the last two years with some success.
As I left, we were extending the rules idea to the rest of the template; it is not complete and some of the consequences are hard to predict, but it will give an unprecedented level of editorial control over what appears on templates and in what order. It is in fact dangerously close to a fully customizable templating system. I wish them luck because I think this could be a revolutionary idea in how to build templates for content management systems, and it might just solve ‘but I want this thing there’, which would be good.
In the end, I suppose, like all engineering problems, web sites are an exercise in compromise. The complexity of the templates and the css has a cost in maintenance and a benefit in flexibility. While shiny new features attract developers and journalists alike, keeping things maintainable and durable argues for caution and circumspection. If a site needs to last five years should we be making components that depend on flickr and twitter? Maybe the cost of being wrong most of the time is outweighed by being right just occasionally when it comes to new technology …
I have spent three years working on what is now a first class web site and cms; clearly I did not do it on my own. There is a team of incredibly talented people working on the site: just on the client-side at the height of r2 there were nine people, and the whole team at the time ran to over 100. I’ve spoken about what the team I was leading did, but none of it is possible without everyone involved.
I’ll be presenting a talk at @media about our experiences at the Guardian relating to css: making it work reliably and keeping it maintainable.
Working at the Guardian
Since January 2007 I’ve been working full time at Guardian unlimited. I also freelance occasionally at Songkick.com where I do client-side development and encourage good practice.
I have continued the work I started as a freelancer working on the travel site. We have since launched our new network front, Business, Money, Environment, UK news, World news, Politics, Science and Technology sections in the new design and cms.
Much of my work until march last year was focused on writing the templates and making the pages as accessible and standards compliant as possible. After that, as the team grew from two to seven and soon to nine, most of what I now do is strategic: deciding which way to solve problems, when to try new strategies and what we need to revisit.
One of the things I first noticed about the structure of the pages as they were being designed was that if they were not articles (comment, reporting and commercial) they were lists of suggested articles. So one of the defining aspects of the markup is extensive use of lists. Using lists to chunk page content rather than using hundreds of div tags is not something I’ve seen in many other pages. I may not be looking in the correct place, but I think this might be a small innovation.
Naturally the separation of presentation from content is not perfect. Bits of markup were added to give hooks for the css, often in the form of nested lists: to achieve a two column layout I would use a list with two li tags, each containing a sub list of article trails. Those trails each have their own hierarchy of headings and paragraphs. Added to this, we tried whenever possible to express meaning or intent when choosing id names, although it turns out naming is hard.
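The nested list structure for a two column layout looked roughly like this (the class names are invented for illustration):

```html
<ul class="columns">
  <li class="column">
    <ul class="trailblock">
      <li>
        <h3><a href="/travel/example">A trail heading</a></h3>
        <p>A short description of the article…</p>
      </li>
      <!-- more trails … -->
    </ul>
  </li>
  <li class="column">
    <ul class="trailblock">
      <!-- trails for the second column … -->
    </ul>
  </li>
</ul>
```

The outer list provides the layout hook; each inner list carries the actual content with its own headings and paragraphs.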
Using lots of lists is not without problems. Even after you remove the browsers’ default styling there are still browser bugs to cope with. Notably Internet Explorer seems to have particular problems with lists, less so in version seven but still not great (combine that with ie’s quirky way of doing floats and things do get interesting). Firefox also has some problems with lists, but they are minor by comparison. Another problem with extensive nesting, while trying not to go mad with class names, is ie 6’s lack of support for some of the more useful css selectors. This weakness means that when you look at the code you will see lots of extra class names and long descendant selectors.
Nesting is a fundamental concept in html, and as such it being a source of pain seems odd. If I create a list of lists I’d expect it to work, and at the most basic level it does. But when you choose to style these lists inside lists it all goes a bit to pot. You see, ie (alright, ie 6 and its older siblings) only supports the descendant selector and not the child selector. As a result if I say ul.trail li I’m stuck, because that styling affects all the nested li tags too. There are ways of dealing with this; the two most obvious are to use more class names or to style the descendants’ descendant back again. Neither of these options is particularly appealing; all the same, we did both.
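To make that concrete (selector and class names invented):

```css
/* what you want, but ie 6 ignores the child selector entirely */
ul.trail > li { float: left; }

/* workaround one: add a class name to the markup and match on it */
ul.trail li.trail-item { float: left; }

/* workaround two: style the descendants, then undo it one level down */
ul.trail li { float: left; }
ul.trail li li { float: none; }
```

Workaround one bloats the markup; workaround two bloats the css and relies on the undo rule winning the cascade. Hence neither is appealing, and a large site ends up with both.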
So now what? We have a very solid (mostly semantic) html base with css for layout that scales well. Our css is getting to the point where it delivers the new look and feel reliably, and our vocabulary of class names allows us to create variations on the new look rapidly and, interestingly, robustly, working cross browser with only a minimum of fuss.
The use of css means, contrary to what many commenters seem to have assumed, our site works quite well on mobile devices. Not that I’ve checked all of them, but I have checked on Blackberry, Sony Ericsson and Nokia phones; only a small subset, but in those and in Opera mini the page does work well.
Since I’ve not freelanced since 2006, none of the sites below, except the Guardian, still contain my work, having been redesigned and redeveloped. I’m going to leave the links though as they keep a record of my freelance work.
I was working at dlkw (now part of dlkw Low) on the halifax website.
I worked at Poke between August and November 2005 on some interesting projects. I did a fair chunk of the orange entertainment site, although the bulk of the work on the css was already done. I also worked on a small site for SAB Miller which I can’t really discuss, which is a shame as I am very pleased with the work I did on it.
On Rio Tinto my brief was to move their table based layout to a css layout and to improve the accessibility of the site. I was contracted to view. The final result is much simplified xhtml with semantic mark-up.
Just recently I finished wendysmith.co.uk. The site needed to present the artist’s work in a sympathetic manner and be easy to maintain. I designed the site and wrote the php, html, and css. The current version of the site is no longer my work.
Working as a freelance contractor for wheel:. The work was developing html and css for three clients: BT, Ernest Jones and H Samuel. The most important work I did there was on the new Unilever design.
The BT work was coding css for the new look on the BT small & medium business pages.
The Ernest Jones work was updating and adding products to their standard templates, as was the H Samuel work. This mostly required changing or creating graphics and adapting their css to the new look. Unfortunately I can’t point to individual pages as they are a moving target. The last thing I worked on was the Bridget Jones promotion.
On Unilever I wrote most of the css and many of the html mock-ups. Obviously on a project this size many people are involved so I was just a small cog in the system. That said, I am very proud of the work I did on the Unilever css. The site has been redesigned and is no longer my work.
Training for Life
At Training for Life I worked on web design and development. The sites I worked on were usually for charities and small to medium businesses. My role was lead css and html developer as well as doing php development.
The largest piece of work I did was developing an intranet application for the NHS Confederation. It was implemented in php with mySQL as the database back end. If you are interested in looking at it please contact me so I can give you a url and a user name and password to view a demonstration installation.
For Opportunities Child (no longer with us) I wrote the asp that connected to the access database for the site’s content. I also wrote the css. Again the layout is pure css.
For Urban Inclusion we developed the design from their existing branding. I implemented the design, wrote most of the css and did the accessibility work. The site no longer uses my work.
16 Hoxton Square (gone) & Hoxton Apprentice were developed for Training for Life as they are one of the partners in the scheme. On this I did quality control and cross browser testing.
At dct I did two things: I taught long term unemployed people computer applications, html and css, and I maintained dct’s intranet.
From the teaching I learned just how counter intuitive most computer applications are and saw the impact of poor design on people. It also gave me a ready pool of volunteers to use for testing designs (it can be quite humbling).