The massive numbers behind DNA

We all talk about DNA a lot (especially at uBiome, of course, where we speak of little else) so I figured it might be interesting to compile a short primer on how humanity’s knowledge of DNA came about.

You may well be fully up to speed on this, but it seemed to me that all but the most expert microbiologist can sometimes use a refresher.

DNA, of course, stands for deoxyribonucleic acid, which is often called the instruction manual for an organism.

And like some other instruction manuals – those for the International Space Station and a VCR spring to mind – DNA is fiendishly complicated.

First, though, some history.

Perhaps you imagine that DNA was discovered relatively recently, but as a matter of fact it was first isolated in 1869 by the Swiss biologist Friedrich Miescher.

Miescher had planned on becoming a doctor, but typhoid fever left him partially deaf.

Feeling that his hearing impairment might make his chosen profession difficult, he turned instead to physiological chemistry.

His employer encouraged him to study white blood cells, which he obtained in quantity from used bandages supplied by the local hospital.

Now my apologies if you’re eating lunch at the moment, but Miescher was able to isolate white blood cells by washing the pus out of bandages, which sounds deeply unpleasant.

And you thought your job was bad.

Anyway, a series of chemical processes left Miescher with a curious precipitate he called nuclein in the bottom of his beaker.

Which is unfortunately where it stayed for the next 70 years.

What he called nuclein we now know as DNA, but since people thought proteins were the key to life, nobody got very excited about Miescher’s discovery.

This all changed in 1944 when three scientists at Rockefeller University Hospital in New York figured out that they could change one strain of bacteria into another using purified DNA.

It was the first time that DNA had been shown capable of transforming the properties of cells.

Work on DNA accelerated thereafter, with James Watson and Francis Crick proposing their double helix model of DNA in 1953.

They suggested that DNA is made up of two strands of molecules coiled around each other, and took home a Nobel Prize for their efforts.

Interestingly, Watson recognized the vital – and at the time unsung – contribution made to their work by the British chemist Rosalind Franklin, who produced X-ray diffraction images of DNA and died at the age of just 37.

While amazingly complex, a DNA molecule is actually made up of combinations of just four basic building blocks, or bases, known as adenine, guanine, cytosine and thymine, labelled A, G, C, and T.

You could imagine these as four colors of Lego bricks – say red, white, blue, and green – which you use to build a wall of scrambled colors.

If someone knew the overall dimensions of your wall, you could instruct them how to match it by simply handing them a string of colors.

For example: white, blue, blue, red, white, green etc.

Or you might abbreviate that to WBBRWG.

Rather than Lego, in the case of DNA you’d use the letters A, G, C, and T, ending up with a pattern that might look like ‘AGTCCGCGAATA’.
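To make the string idea concrete, here is a minimal Python sketch that treats the example pattern as plain text and tallies each letter. The names `BASES` and `count_bases` are made up for illustration, and this is only a toy, not a picture of how real sequencing software works.

```python
from collections import Counter

BASES = "AGCT"  # adenine, guanine, cytosine, thymine


def count_bases(sequence: str) -> dict[str, int]:
    """Return how many times each of the four letters appears."""
    unknown = set(sequence) - set(BASES)
    if unknown:
        raise ValueError(f"unexpected letters: {sorted(unknown)}")
    tally = Counter(sequence)
    return {base: tally[base] for base in BASES}


sequence = "AGTCCGCGAATA"  # the example pattern from the text
print(len(sequence), "letters:", count_bases(sequence))
```

Twelve letters is trivial to count by hand, which is rather the point: the idea stays this simple, and only the length changes.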

Now this seems a relatively simple concept until you start looking at the enormous number of “base pairs” in a genome.
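Why "pairs"? Because the two strands of the double helix mirror each other: A always sits opposite T, and G opposite C. Here is a small sketch of that pairing rule. `PAIRS` and `partner_strand` are illustrative names of my own, and the sketch ignores the direction in which each strand is read, purely to keep things simple.

```python
# Each base and the base that sits opposite it on the other strand.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def partner_strand(strand: str) -> str:
    """Return the letter found opposite each base (strand direction ignored)."""
    return "".join(PAIRS[base] for base in strand)


strand = "AGTCCGCGAATA"
partner = partner_strand(strand)

print(strand)
print(partner)
print(len(strand), "base pairs,", 2 * len(strand), "letters across both strands")
```

Since one strand fully determines the other, writing down a single strand is enough to describe the whole molecule.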

A genome is the complete set of DNA in a cell, genes included, all of it written in those same four basic building blocks.

Genome sequencing is basically just working out the order of those letters.

So how many letters are there in a full genome?

Well, even something relatively simple, like a C. difficile bacterium, has a genome of around 4.4 million base pairs, or 8.8 million letters, which is roughly 15 times as many letters as there are words in “War and Peace”.

However, this is as nothing compared to the size of the full human genome, which runs to roughly 3 billion base pairs, or about ten sets of the entire multi-volume Encyclopedia Britannica.
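For a rough sense of what those counts mean as data, here is a back-of-the-envelope sketch. It assumes one letter per base pair (a single strand) and two bits per letter, since four options fit in two bits. Real sequencing files carry far more than the bare letters, so treat the results as a floor rather than a measurement. The function name `packed_megabytes` is hypothetical.

```python
import math

# Sizes quoted in the text, read here as base pairs (one letter each).
GENOME_BASE_PAIRS = {
    "C. difficile": 4_400_000,
    "human": 3_000_000_000,
}

# Four possible letters need log2(4) = 2 bits apiece.
BITS_PER_LETTER = math.log2(4)


def packed_megabytes(base_pairs: int) -> float:
    """Minimum storage in megabytes (10**6 bytes) for the bare letters."""
    return base_pairs * BITS_PER_LETTER / 8 / 1_000_000


for name, base_pairs in GENOME_BASE_PAIRS.items():
    size = packed_megabytes(base_pairs)
    print(f"{name}: {base_pairs:,} base pairs, about {size:,.1f} MB")

ratio = GENOME_BASE_PAIRS["human"] / GENOME_BASE_PAIRS["C. difficile"]
print(f"The human genome is roughly {ratio:,.0f} times longer.")
```

On these assumptions the bare letters of a human genome pack into about 750 MB, while C. difficile needs just over 1 MB.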

Or to put it another way, if you were to build a model of the human genome by making a Lego construction using a single classic four-knobbed brick to represent each letter, you’d end up with five and a half life-sized Empire State Buildings in order to represent the whole genome.

That’s big.

With numbers like these, it’s easy to see why the data science behind DNA sequencing is, well, colossal.

And to think it all began almost 150 years ago with some rather gross wound dressings.


 
