Source: http://commons.wikimedia.org/wiki/File:DNA_Overview.png

Listening to the Genome: Music or Noise?

One of the great triumphs of twentieth-century biology was the discovery of how genes make proteins. Genes are encoded in DNA. To turn the sequence of a gene into a protein, a number of molecules gather around it. Reading its sequence, they produce a single-stranded version of it made of RNA, called a transcript. The transcript gets shipped to a cluster of other molecules, the ribosome, which picks out building blocks to construct a protein that corresponds to the gene. The protein floats off to do its job, whether that job is to catch light, digest food, or help generate a thought.

We have about 20,000 protein-coding genes. If you tally up the amount of DNA they constitute, you get less than 3 percent of the human genome. Which naturally raises the question of what’s in the other 97 percent.

This question is hardly new, and the answers have earned scientists scads of Nobel prizes over the years.

Some of them found pieces of non-protein-coding DNA that are essential to our survival. Over fifty years ago, Francois Jacob and his colleagues realized that the non-protein-coding DNA contains stretches, called regulatory elements, that act like switches for genes. When proteins and RNA molecules grab onto those switches, the genes become active.

Scientists have also known for decades that sometimes when a cell makes an RNA transcript, it can use that transcript for an important job without bothering to translate it into a protein. The ribosome, for example, is an assembly of proteins and RNA molecules. George Palade published this discovery in 1955.

Since then, scientists have plunged ever deeper into the workings of regulatory elements and functional RNAs. Many of our genes are controlled not just by a single switch, but by a veritable combination lock of different regulatory elements. RNAs can carry out many more jobs than just working in the ribosome. They can silence other genes, for example, by locking onto their transcripts.

Understanding the other 97 percent of the genome is a challenge at once profound and medically practical. Earlier this year, for example, scientists identified a gene for a long piece of RNA called PTCSC3 that suppresses cancer in the thyroid.

But for just as long, scientists have known that some of the genome does not carry out such vital functions. Barbara McClintock discovered in the 1940s that parts of the genome can make copies of themselves which can then insert themselves elsewhere in our DNA. It turns out our genomes are a veritable zoo of these so-called “mobile elements,” including ancient viruses. In some cases, evolution harnesses these mobile elements for useful purposes. But a lot of them have mutated to the point that they do nothing at all. About eight percent of our genome is made up of the littered remains of dead viruses, for example.

While the basics of the human genome have been clear for decades, the particulars have remained murky. Scientists today are using better tools to explore the genome. They can now gain some clues about any particular chunk of DNA by looking at its sequence. It’s possible to recognize a protein-coding gene, for example–and it’s also possible to see if it has mutations that have rendered it functionally dead (a pseudogene).

But there’s no getting around the hard work of old-fashioned biology–of peering into cells to see what’s going on. And when scientists look in there, things get contentious.

In 2008, I wrote in the New York Times about a then-new project called ENCODE, in which a small army of scientists would create an encyclopedia of information about the entire genome, not just the protein-coding bits. Last year, the ENCODE team unveiled their analysis of this encyclopedia. It traveled through the high-profile-paper-becomes-a-press-release-and-inspires-breathless-articles-with-misleading-headlines sausage machine and ended up giving the impression that until now, scientists thought everything in the genome besides protein-coding genes was “junk,” and that the ENCODE project proved–without a doubt–that about eighty percent of the genome has a function.

What the scientists actually demonstrated was that cells produce RNA transcripts from a huge portion of the genome, not just for the protein-coding parts. They also observed that proteins were able to grab onto those regions–a suggestion that they were switching on genes for RNA. They concluded that this kind of evidence demonstrated that eighty percent of the genome has “biochemical function.” (John Timmer wrote a good analysis of the ENCODE saga at Ars Technica.)

The ENCODE team incurred a remarkably high tide of criticism from other researchers. One long-running complaint is that the mere existence of an RNA transcript does not mean it serves any function at all. Cells can be sloppy, shooting off RNA transcripts from useless DNA. Those accidentally transcribed pieces of RNA promptly get destroyed.

To get a feel for the intensity of that criticism, check out this piece that Dan Graur and colleagues published this February in the journal Molecular Biology and Evolution:

Here, we detail the many logical and methodological transgressions involved in assigning functionality to almost every nucleotide in the human genome. The ENCODE results were predicted by one of its authors to necessitate the rewriting of textbooks. We agree, many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten.

I’ve been very curious about how scientists would move on from here, and how the debate over the genome would evolve. Now a team of scientists at the University of California at San Francisco has published an interesting paper on the issue in the journal PLOS Genetics.

The UCSF researchers come to a conclusion much like that of ENCODE. They analyzed newly compiled databases of the RNA produced in cells from different tissues. They then pinpointed which segments of DNA encoded that RNA. They found about 85% of the genome produced at least one copy of RNA in one of the databases. The UCSF researchers argued that these results support ENCODE’s work.

The scientists then probed those transcripts to see whether they were just sloppy mistakes or served a function. They focused on one class of transcripts, known as long intergenic noncoding RNAs, or lincRNAs for short. A number of scientists have been cataloging lincRNAs for a few years now, but they’ve only identified a few thousand that appear to have a function. The UCSF researchers searched their new databases for more lincRNAs. They first identified long transcripts, and then they winnowed down their list to get rid of false positives. They filtered out sequences that might be fragments of protein-coding genes that managed to slip into the database, for example. They also combined segments of DNA that overlapped in a way that suggested they both came from a single lincRNA gene.
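For readers who like to see the winnowing spelled out, here is a minimal sketch of that kind of filter in Python. It is not the UCSF team's actual pipeline. The function names, the toy coordinates, and the 200-nucleotide cutoff for a "long" transcript are all assumptions made for illustration.

```python
def merge_overlapping(segments):
    """Merge (start, end) segments that overlap into single segments."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # This segment overlaps the previous one, so extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlaps_any(segment, genes):
    """True if the segment overlaps any (start, end) gene interval."""
    start, end = segment
    return any(start <= g_end and g_start <= end for g_start, g_end in genes)


def candidate_lincrnas(segments, coding_genes, min_length=200):
    """Merge overlapping segments, then keep only the long ones that do
    not overlap a protein-coding gene. min_length is an assumed cutoff."""
    return [
        seg
        for seg in merge_overlapping(segments)
        if seg[1] - seg[0] >= min_length and not overlaps_any(seg, coding_genes)
    ]


# Hypothetical transcribed segments and one hypothetical protein-coding gene.
transcribed = [(100, 400), (350, 900), (2000, 2100), (5000, 5600)]
coding = [(5500, 7000)]
candidates = candidate_lincrnas(transcribed, coding)
print(candidates)
```

In this toy run the first two segments merge into one candidate, the third is dropped for being too short, and the fourth is dropped because it runs into the protein-coding gene.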

Counting previously discovered lincRNAs, the researchers ended up with a total of 55,000 potential non-protein-coding genes. They then examined each candidate gene for clues to whether it serves a function. One clue was that the transcripts tend to show up in just one kind of tissue. That’s the rule for many proteins–hemoglobin is very useful in your blood but not very helpful in your eye.

The scientists also found that these stretches bear some hallmarks of being switched on and off. DNA is wound around spools called histones, and the candidate lincRNA genes had proteins latched onto them that can unwind DNA so that it can be transcribed.

Another clue came from comparing the candidate lincRNA genes in humans to other species. If a piece of DNA serves no function, it will be prone to picking up mutations. Since the DNA encodes nothing of importance, mutations to it can do no harm.

Mutations that strike functional pieces of DNA, on the other hand, can be devastating. In these cases, natural selection should eradicate them over millions of years. The lincRNA gene candidates that the UCSF scientists found are fairly similar to versions in other mammals. That suggests that evolution is conserving them–and that they serve a function.

If these 55,000 candidates do turn out to be true genes for lincRNAs, then they will outnumber traditional protein-coding genes by more than two to one (55,000 against about 20,000). The scientists don’t claim that they’ve definitively proved that these are genes, however; they look at their catalog as a collection of candidates that deserve to be tested with experiments. “The time is ripe for this dark matter of the human genome to step further into the spotlight,” they write.

I asked a few of ENCODE’s outspoken critics about the new paper, to see whether it changes their view on the genome’s other 97 percent.

Sean Eddy, a biologist at Janelia Farm Research Campus, is very skeptical of all such large-scale catalogs. When he’s looked closely at such catalogs, he’s found plenty of false positives. Rather than just compile a list of possible genes, he thinks scientists should do some careful quality control. They should be like inspectors at a factory, and pick out a random set of candidates to test. Only if careful experiments show that those candidates really do behave the way a functional gene behaves can scientists have confidence in their catalog.

While he was filling up his coffee this morning, Eddy thought up an analogy for this kind of research–one, he wrote to me, “that might be clarifying rather than dumb.”

If you took a big chunk of English text and screened it for novel “dark matter” (the birth of new words!) by eliminating all words that appear in the dictionary, you would indeed find a lot of “novel” words in your “high throughput screen”, and maybe get excited. But the moment you actually looked at a sample of what you’d found, you’d see it was almost all stuff that was obvious in retrospect. You’d say, “Oh yeah, numbers. Oh yeah, abbreviations. Oh yeah, foreign words. Oh yeah, proper names. Oh yeah, misspellings.” And you’d have five new null hypotheses, alternative explanations for your “novel words”; then you’d go back and revise your screen to eliminate those. To my mind, a lot of the lincRNA papers fail to do the part where they look carefully (manually) at what their screen produced, so they fail to develop their intuition for the various failure modes of the high-throughput computational screen.
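Eddy's analogy is easy to try on a toy scale. The sketch below is my own illustration, not anything Eddy wrote: the dictionary, the sentence, and the classification rules are all invented, and the rules are deliberately crude. It screens a sentence for "novel" words and then checks each hit against a few of the null hypotheses he lists.

```python
import difflib
import re

# Hypothetical toy dictionary and text, invented for illustration.
DICTIONARY = {"the", "lab", "sequenced", "samples", "from", "in",
              "and", "found", "a", "new", "gene"}
TEXT = "The lab sequenced 42 samples from Toronto in 2013 and fuond a new lincRNA gene"


def screen(text, dictionary):
    """The 'high-throughput screen': keep tokens not in the dictionary."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t for t in tokens if t.lower() not in dictionary]


def classify(token, dictionary):
    """Check a hit against a few null hypotheses before calling it novel."""
    if token.isdigit():
        return "number"
    if any(c.isupper() for c in token[1:]):
        return "abbreviation"
    if token[0].isupper() and token[1:].islower():
        return "proper name"
    # A close match to a dictionary word suggests a misspelling.
    if difflib.get_close_matches(token.lower(), dictionary, n=1, cutoff=0.7):
        return "misspelling"
    return "unexplained"


hits = screen(TEXT, DICTIONARY)
labels = {token: classify(token, DICTIONARY) for token in hits}
for token, label in labels.items():
    print(f"{token}: {label}")
```

Every one of the five hits in this sentence falls to one of the mundane explanations, which is the point of the analogy: the screen's raw output says little until someone looks at what it actually caught.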

Larry Moran, a biochemist at the University of Toronto and fierce critic of ENCODE, had a similar response. “Let’s assume that these 55,000 RNAs have a function of some sort,” he wrote to me. “If true, that would require rewriting the textbooks because none of the thousands of labs studying gene expression over the past five decades has seen any hint of such a massive amount of control and regulation by RNAs.”

Moran also pointed out that in many cases, the supposed genes produced just one lincRNA per cell. It strains his imagination to picture a way for a single lincRNA to have any important role in a cell’s existence. Far more likely, it’s just a segment of DNA that the cell accidentally transcribed. “If there have to be more than 10 transcripts per cell then the number of transcripts is reduced to 4,000,” Moran wrote. “If you need more than 30 transcripts per cell then that leaves only 950 putative functional RNAs.”

Moran and Eddy both point out that even if the UCSF researchers are right and all 55,000 DNA segments are real genes for important lincRNAs, that discovery would not, in fact, clear up all that much about the genome as a whole. Here’s how Eddy put it:

Even supposing that all 55,000 were truly functional and important RNAs; 55,000 * 2000nt average lincRNA transcript length = 110MB, less than 4% of the human genome. So I think the questions about these transcripts have to be separated from the concept of junk DNA – if someone did show that an additional 4% of the genome was functional, that would be super cool, but it wouldn’t bear on the questions around junk DNA, which have to do with the majority of the genome.
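Eddy's back-of-the-envelope figure is simple to check. In the sketch below, the candidate count and the average transcript length come from his note. The genome size of about 3.1 billion nucleotides is my own round-number assumption, not a figure from his message.

```python
LINCRNA_CANDIDATES = 55_000           # from Eddy's note
AVERAGE_TRANSCRIPT_LENGTH_NT = 2_000  # from Eddy's note
HUMAN_GENOME_NT = 3_100_000_000       # assumption: approximate genome size

total_nt = LINCRNA_CANDIDATES * AVERAGE_TRANSCRIPT_LENGTH_NT
fraction_of_genome = total_nt / HUMAN_GENOME_NT

print(f"{total_nt:,} nucleotides, about {fraction_of_genome:.1%} of the genome")
```

That comes to 110 million nucleotides, or roughly 3.5 percent of the genome under the assumed size, in line with Eddy's "less than 4%."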

I contacted two of the UCSF co-authors to respond to these critiques but haven’t yet heard back from them. As soon as I do, I will add their response and post a notice on the blog that I’ve updated this piece. I’d also love to hear from both sides of this debate in the comments below.

Update 6/22: Here’s what Michael McManus, a co-author of the new paper on lincRNAs, said in reply to my queries. The emphases and links are mine…

CZ: Even if all 55,000 transcripts you identified were functional, they would come to a few percent of the human genome. That wouldn’t address the larger question of whether transcriptionally active “junk DNA” isn’t junk.

MM: We agree. There are far more transcripts expressed than those which we have catalogued. Upon observing that nearly the entire genome is transcribed, we chose as a first step to focus only on lincRNAs, but there are many other transcripts we did not focus on in this paper. In fact a study that we reference in our paper looked extremely deeply at very narrow intergenic regions thought to be transcriptionally inactive, but found a dizzying array of complex, regulated transcripts at very low levels in these regions. This clearly shows that what we’ve found is the tip of the iceberg. However, we cannot distinguish between functional and nonfunctional transcripts without performing functional experiments and this is the obvious next frontier for determining how much of this transcriptionally active “junk DNA” is or is not junk.

CZ: Could there be alternative explanations for these sequences? Could these supposed lincRNA genes actually be pieces of ordinary protein-coding genes, or a false positive from how the experiment is designed?

MM: These are both potential sources of artifactual expression signal and we did work to mitigate both scenarios. We noticed that many currently annotated genes can actually extend outside of their annotated regions when using RNA-seq data to analyze expression levels, so we removed any putative lincRNAs that overlap any of these empirically extended gene structures. After this filtering, we found that the majority of lincRNAs in our catalog are relatively distant (>30 kilobases) to the nearest protein coding gene. We assert that our catalog is by no means perfect but does represent a more refined dataset for investigators to further evaluate. (Emphasis CZ.)

Regarding the second source of artifact, we did take multiple steps to minimize the potential for genomic DNA contamination, and this is described in the Methods section of the paper. Again, it is fair to say that in some rare cases, some of the putative lincRNAs we discovered may be artifacts and additional data is needed (longer reads, deeper sequencing) to achieve even higher confidence in all lincRNAs. For this reason we define the lincRNAs as “putative” in the paper, because they must be manually experimentally validated with great care. This mantra is true even for the large number of existing protein coding genes that have been reported but haven’t been validated.

CZ: Could you test out twenty candidates for lincRNAs to see if they aren’t just noise?

MM: In concept this is true. Manual verification of lincRNAs is an important future direction, and an important pre-requisite for follow-up functional studies. That said, RNA-seq data is becoming a widely accepted approach for studying lowly expressed transcripts as evidenced by large numbers of publications using the technology.

CZ: The commenters thought it unlikely that lincRNAs that are present at just a few copies per cell would be able to have a function. If you use a cutoff of 30 copies per cell, only 950 lincRNAs remain.

MM: We make no broad-sweeping assertions about the functionality of the lincRNAs described in the study. In fact, it is entirely possible that almost none of the lincRNAs we reported are functional. (Emphasis CZ.) Moreover, we are hesitant to make strong claims that relate low expression level to functionality, given the reports that low level lincRNA transcripts are functional (examples are HOTTIP, CCND1 ncRNA and others). Therefore an important first step toward identifying which functional lincRNAs is to generate a catalog of all putative lincRNAs for follow-up function based studies.