Does the title of the movie Gattaca refer to a DNA sequence?
Dear Straight Dope:
The movie title Gatacca, though I'm not 100% sure on the spelling, is made up of three DNA nucleic acids … the G, T, and A ones, obviously. I've heard this particular string actually represents something in some species or other, but this seems a little convenient to me for movie making. What's the scoop?
Son of Dex replies:
If you heard this from the same person who suggested that "Gattaca" uses only three letters, then it's your own damned fault for believing it. But never mind. Does the DNA sequence GATTACA actually represent anything in biology?
First a word about the movie. Released in 1997, Gattaca (the one movie poster I've seen gives the title in all caps, for what that's worth) is set in a vaguely dystopian future, in which DNA information determines everyone's fate and genetic engineering is used to breed the elite of society. "Gattaca" is the name of an aeronautics company that launches space missions; the company itself has nothing to do with genetics. Given the setting, however, it's easy to believe the name was chosen because of its relationship to DNA.
OK now. The genetic code is written using four "letters," G, A, T, and C, each of which represents a molecule known as a nucleotide. The letters stand for the names of the four bases that distinguish one nucleotide from another: guanine, adenine, thymine, and cytosine. (Yes, U replaces T in RNA, but we're just going to focus on DNA for this discussion.)
Does this sequence actually occur in any real species? Yes, frequently. Think about it. There are seven letters in GATTACA. With four possibilities for each letter, and assuming each is equally likely, the odds of a seven-letter sequence being GATTACA are 1 in 16,384 (4^7). The human genome contains about 3 billion nucleotides, which means that the sequence GATTACA probably occurs in the human genome about 180,000 times.
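For anyone who wants to check the arithmetic, here is a minimal Python sketch of the estimate. It assumes every letter is equally likely and independent of its neighbors, which real genomes are not, so treat the result as a ballpark figure. The function names are made up for illustration.

```python
# Back-of-the-envelope estimate. Assumes all four letters are equally
# likely and independent of each other (a simplification).
MOTIF = "GATTACA"
ALPHABET_SIZE = 4               # G, A, T, C
GENOME_LENGTH = 3_000_000_000   # approximate figure used in the text


def possible_sequences(length, alphabet_size=ALPHABET_SIZE):
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length


def expected_occurrences(motif, genome_length=GENOME_LENGTH):
    """Expected number of times the motif turns up by chance alone."""
    return genome_length / possible_sequences(len(motif))


print(possible_sequences(len(MOTIF)))      # 16384
print(round(expected_occurrences(MOTIF)))  # 183105
```

That second number is where "about 180,000" comes from.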
A friend of mine at a rival pharmaceutical company ran the sequence GATTACA through a search program that peruses gene sequence databases. She limited the search to the first 30 genes containing the sequence. The machine not only delivered these 30, which included 23 human genes, 3 fruit fly genes, and 1 E. coli gene, it also mentioned there were approximately 92,000 appearances of the sequence it didn't report because she only asked for 30.
My father, SDStaff Dex, would no doubt sit back on his haunches, content here. But, in my never-ending efforts to emerge from the old man's shadow, I feel it my duty to push further: what does this sequence actually mean?
There are basically three functions for a DNA sequence. It can encode a protein, which means that it's part of a gene. It can be meaningless space filler, in which case it's called "junk DNA." Or it can serve as a regulatory sequence, which means that it tells the molecular machinery where to find and when to follow the instructions contained in a gene.
If GATTACA is found in a gene, then its meaning should be easy to figure out. It takes three nucleotides to code for one amino acid, the fundamental building block of a protein. (Following our metaphor, three "letters" encode a "word," meaning an amino acid, and multiple "words" are required to create a "sentence," that is, a protein.) There are seven letters in GATTACA. If we try to "read" it by dividing it into groups of three, we end up with an extra letter. This leads me to believe the name wasn't chosen because of the protein that it encodes. And anyway, at most it codes for two amino acids, which isn't particularly meaningful. Most proteins have more than 100 amino acids.
For the record, you can read GATTACA in three ways, depending on where you start. It can be read GAT TAC A, in which case it codes for the amino acids aspartic acid and tyrosine, with an "A" left over. It can also be read G ATT ACA, in which case it starts with an excess G, followed by an isoleucine and a threonine. Or, for the truly daring, it can be read GA TTA CA, which codes for a leucine in the middle, with an extra GA at the front and an extra CA at the end. XGA, where X is any nucleotide, can stand for arginine, glycine, or a "stop" message, which tells the molecular machinery to stop reading. CAX can stand for glutamine or histidine. Remember, you asked.
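If you'd rather not take my word for it, the three readings can be reproduced with a short Python sketch. It builds the standard genetic code table (written with DNA letters, so T rather than U) and chops the sequence into triplets from each possible starting point. The helper name read_frames is hypothetical, and leftover letters at either end are simply ignored.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
# One-letter amino acid codes; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def read_frames(seq):
    """Return the complete codons (and their amino acids) for each of
    the three starting offsets; incomplete leftovers are dropped."""
    frames = {}
    for offset in range(3):
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames[offset] = [(c, CODON_TABLE[c]) for c in codons]
    return frames


for offset, codons in read_frames("GATTACA").items():
    print(offset, codons)

# What the dangling ends could become, given one more letter:
print({x + "GA": CODON_TABLE[x + "GA"] for x in BASES})
print({"CA" + x: CODON_TABLE["CA" + x] for x in BASES})
```

In one-letter codes, D is aspartic acid, Y is tyrosine, I is isoleucine, T is threonine, L is leucine, R is arginine, G is glycine, Q is glutamine, and H is histidine.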
That chucks out our first possibility. The second is that GATTACA appears in junk DNA (which statistics alone suggest it surely must). If that's the case, then it's meaningless by definition. (There are a number of scientists who think that so-called "junk DNA" may actually contain extremely important, highly detailed structural information that is vital to the proper functioning of the body. However, that just gets complicated, so we'll skip it and assume that junk is junk.)
So what about the last possibility — could the sequence be regulatory? That's harder to answer. There are lots of regulatory sequences, and they are probably not all even known at this point. I couldn't find it in any of my textbooks, but then, they don't generally list DNA sequences in the index. I asked a Ph.D. molecular biologist with whom I work. He didn't recognize GATTACA as a regulatory sequence, so if that's what it is, it's not well known.
Having been failed by both books and co-worker, I naturally turned to the Internet, searching for the sequence GATTACA on the National Center for Biotechnology Information Homepage (www.ncbi.nlm.nih.gov/). I pulled up a reference for the sequence of cytochrome oxidase I of the species Lasioglossum gattaca — that is, a mitochondrial protein in some species of bee. I gotta say that if this is what the film makers were referring to, I sure don't get the relevance.
That was as far as I could take it. The sequence exists in nature, but if it has any scientific significance it has thus far eluded me.
I got some additional insight from another of my co-workers, who earned her master's degree decoding DNA sequences. Apparently, she and her entire lab group went to see the film when it opened, and spent some time discussing the meaning of the title. After a great deal of scientific investigation of the same type for which the Straight Dope Science Advisory Board is renowned (which is to say it involved a fair amount of beer), they concluded there really is no other catchy way to put those particular letters together.
They opted not to publish this astonishing result. Meaning that if I publish this first, I'll get the credit for their work. Ain't science great?
SDStaff Doug adds:
I happen to know the people who named the sweat bee Lasioglossum gattaca, and that, at least, was named for its sequence. I also saw the movie, and it's impossible to tell whether the scriptwriters intended it that way, but given the importance of gene sequencing to the plot, I'll bet they did it on purpose.