blogorrhea: April 2014

Tuesday, April 29, 2014

Genes, Codons, and Purines

A universal feature of protein-coding genes is that they tend to use a lot of codons that begin with a purine (A or G). In fact, it's typical for a given gene's codons to use a purine in position one 60% or more of the time. But it's fair to ask: How universal is this trend, exactly? Does the rule apply for organisms with extremely high (or low) genomic G+C content? Does it apply for endosymbionts with greatly reduced genomes? Is it just a "sometimes" rule? Are there important exceptions?

I decided to collect codon statistics for 109 different bacterial species, representing members of all major taxonomic groups, with a wide range of genome sizes and GC percentages. For each organism, I determined the average percent A+G content in codon base one (AG1) across all CDS genes. Then I plotted AG1 against the genomic A+T content for each organism. (A+T is of course just one minus the G+C content.) Here's the graph of AG1 content for all the organisms:

Codon base-one purine content (average for all CDS genes) versus genomic A+T content for N=109 bacterial species. Dot size corresponds to genome size.

The fun thing about this graph is that each data point is sized according to the genome size of the organism in question (in other words, the area of the dot is proportional to genome size). As you can see, bacteria at the high end of the A+T scale (low G+C) tend to have smaller genomes. But the more important thing to notice is that AG1 is 58% or more for all 109 genomes. This means that the phenomenon of high average purine content in codon base one appears to be universal, at least for the sample group. (Organism names are listed in a table below.)
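For readers who want to reproduce the AG1 statistic, here is a minimal Python sketch. It is not the script used to make the graph above; it assumes each CDS is available as an in-frame DNA string, and the function name `ag1_fraction` and the toy sequence are made up for illustration.

```python
def ag1_fraction(cds):
    """Fraction of codons whose first base is a purine (A or G).

    Assumes `cds` is an in-frame coding sequence written in DNA letters.
    Any trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    # Every third base, starting at index 0, is the first base of a codon.
    first_bases = cds[0:usable:3]
    if not first_bases:
        raise ValueError("sequence is shorter than one codon")
    purines = sum(1 for base in first_bases if base in "AG")
    return purines / len(first_bases)


# Toy example (hypothetical sequence): codons ATG GCT AAA CTG TAA
toy_gene = "ATGGCTAAACTGTAA"
print(ag1_fraction(toy_gene))  # three of five codons start with A or G
```

A per-genome value like the ones plotted would then be the mean of `ag1_fraction` over all CDS genes of that genome.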

Of course, within a given genome, genes vary somewhat in terms of the per-gene average AG1, but it's still quite rare to find a protein gene that has average AG1 under 50%. For example, below is a histogram plot of AG1 content for all protein-coding genes of Sorangium cellulosum, a bacterium with genomic GC content of 72% (A+T = 28%).

Per-gene AG1 usage (codon base-one purine content) for all CDS genes of Sorangium cellulosum.

As you can see, very few genes lie to the left of x = 0.5. (Of Sorangium's 10,400 protein genes, only 321 have an average AG1 under 50%. Those could easily be mis-annotated genes or gene fragments.) Most organisms show much the same distribution of average AG1 values across CDS genes.

Gene annotation programs could probably benefit from using a check of AG1 to verify that a putative gene is in the correct reading frame. GC3 content is often used in this way, but AG1 is actually a much more discriminating test, especially with low-GC genomes (where the "wobble base" GC percentage is not particularly helpful).
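The frame check suggested above might look something like the following Python sketch. It simply compares purine content at the three possible codon offsets and reports the offset with the highest value. This illustrates the idea only; it is not a feature of any existing annotation program, and the function names are hypothetical.

```python
def purine_fraction_by_offset(seq):
    """Purine (A/G) fraction at offsets 0, 1 and 2 (every third base)."""
    seq = seq.upper()
    fractions = []
    for offset in range(3):
        bases = seq[offset::3]
        purines = sum(1 for b in bases if b in "AG")
        fractions.append(purines / len(bases) if bases else 0.0)
    return fractions


def best_frame_by_ag1(seq):
    """Offset (0, 1 or 2) whose every-third-base purine content is highest."""
    fractions = purine_fraction_by_offset(seq)
    return max(range(3), key=lambda i: fractions[i])


# Toy sequence (hypothetical): GCT repeated, so offset 0 is all G.
toy = "GCT" * 10
print(purine_fraction_by_offset(toy))
print(best_frame_by_ag1(toy))
```

If the annotated frame is correct, offset 0 should come out on top; a putative gene whose best offset is 1 or 2 would be worth a second look.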

Listed below are the 109 organisms (and their taxonomic categorizations) used in this investigation.

Organism	Taxon
Acidaminococcus fermentans strain DSM 20731	Firmicutes:Clostridia
Acidovorax avenae subsp. citrulli strain AAC00-1	Proteobacteria:Betaproteobacteria
Aerococcus urinae strain ACS-120-V-Col10a	Firmicutes:Lactobacillales
Aeromonas hydrophila strain ML09-119	Proteobacteria:Gammaproteobacteria
Aggregatibacter actinomycetemcomitans D11S-1	Proteobacteria:Gammaproteobacteria
Agrobacterium radiobacter strain K84	Proteobacteria:Alphaproteobacteria
Anaerobaculum mobile strain DSM 13181	Synergistetes:Synergistia
Anaerocellum thermophilum strain DSM 6725	Firmicutes:Clostridia
Anaerolinea thermophila strain UNI-1	Chloroflexi:Anaerolineae
Anaplasma marginale strain Florida	Proteobacteria:Alphaproteobacteria
Arcobacter butzleri ED-1	Proteobacteria:Epsilonproteobacteria
Atopobium vaginae strain DSM 15829	Actinobacteria:Coriobacteridae
Azospirillum brasilense strain Sp245	Proteobacteria:Alphaproteobacteria
Bacillus amyloliquefaciens strain Y2	Firmicutes:Bacilli
Bacillus anthracis strain CDC 684	Firmicutes:Bacillales
Bacillus subtilis BEST7613 strain PCC 6803	Firmicutes:Bacilli
Bacteroides dorei strain 5_1_36/D4	Bacteroidetes:Bacteroidia
Bartonella quintana strain RM-11	Proteobacteria:Alphaproteobacteria
Blastococcus saxobsidens strain DD2	Actinobacteria:Actinobacteridae
Borrelia miyamotoi strain LB-2001	Spirochaetes:Spirochaetales
Brachybacterium faecium strain DSM 4810	Actinobacteria:Actinobacteridae
Brucella ovis strain ATCC 25840	Proteobacteria:Alphaproteobacteria
Buchnera aphidicola (Acyrthosiphon pisum) strain 5A	Proteobacteria:Gammaproteobacteria
Burkholderia pseudomallei strain 1710b	Proteobacteria:Betaproteobacteria
Caldicellulosiruptor lactoaceticus strain 6A	Firmicutes:Clostridia
Calditerrivibrio nitroreducens strain DSM 19672	Deferribacteres:Deferribacterales
Campylobacter concisus strain 13826	Proteobacteria:Epsilonproteobacteria
Candidatus Cloacamonas acidaminovorans	candidate division WWE1:Candidatus Cloacamonas
Candidatus Methylomirabilis oxyfera	candidate division NC10:Candidatus Methylomirabilis
Candidatus Pelagibacter ubique strain HTCC1062	Proteobacteria:Alphaproteobacteria
Carboxydothermus hydrogenoformans strain Z-2901	Firmicutes:Clostridia
Chlamydia trachomatis strain L2/434/Bu	Chlamydiae:Chlamydiales
Clostridium botulinum A strain Hall	Firmicutes:Clostridia
Coprobacillus sp. strain 8_2_54BFAA	Firmicutes:Erysipelotrichia
Coprococcus catus strain GD/7	Firmicutes:Clostridia
Cycloclasticus zancles strain 7-ME	Proteobacteria:Gammaproteobacteria
Deinococcus radiodurans strain R1	Deinococcus-Thermus:Deinococci
Desulfococcus oleovorans strain Hxd3	Proteobacteria:Deltaproteobacteria
Ehrlichia canis strain Jake	Proteobacteria:Alphaproteobacteria
Enterobacter cloacae strain SCF1	Proteobacteria:Gammaproteobacteria
Erwinia amylovora strain ATCC 49946	Proteobacteria:Gammaproteobacteria
Escherichia coli B strain REL606	Proteobacteria:Gammaproteobacteria
Geobacillus kaustophilus strain HTA426	Firmicutes:Bacillales
Geobacillus thermoleovorans strain CCB_US3_UF5	Firmicutes:Bacillales
Geobacter metallireducens strain GS-15	Proteobacteria:Deltaproteobacteria
Geobacter sulfurreducens strain KN400	Proteobacteria:Deltaproteobacteria
Geobacter sulfurreducens strain PCA	Proteobacteria:Deltaproteobacteria
Geobacter uraniireducens strain Rf4	Proteobacteria:Deltaproteobacteria
Geodermatophilus obscurus strain DSM 43160	Actinobacteria:Actinobacteridae
Gordonia bronchialis strain DSM 43247	Actinobacteria:Actinobacteridae
Haemophilus ducreyi strain 35000HP	Proteobacteria:Gammaproteobacteria
Halogeometricum borinquense DSM 11551	Euryarchaeota:Halobacteria
Helicobacter pylori (Helicobacter pylori SAfr7) strain SouthAfrica7	Proteobacteria:Epsilonproteobacteria
Klebsiella oxytoca strain 10-5243	Proteobacteria:Gammaproteobacteria
Kribbella flavida strain DSM 17836	Actinobacteria:Actinobacteridae
Ktedonobacter racemifer DSM 44963	Chloroflexi:Ktedonobacteria
Lactobacillus acidophilus strain 30SC	Firmicutes:Lactobacillales
Lactobacillus reuteri strain MM4-1A	Firmicutes:Lactobacillales
Lactococcus lactis subsp. cremoris strain A76	Firmicutes:Bacilli
Leptolyngbya sp. PCC 7376	Cyanobacteria:Oscillatoriophycideae
Leptonema illini strain DSM 21528	Spirochaetes:Spirochaetales
Leptospira biflexa serovar Patoc strain Ames; Patoc 1	Spirochaetes:Spirochaetales
Leuconostoc gasicomitatum LMG 18811 strain type LMG 18811	Firmicutes:Lactobacillales
Mesorhizobium australicum strain WSM2073	Proteobacteria:Alphaproteobacteria
Mesorhizobium ciceri biovar biserrulae strain WSM1271	Proteobacteria:Alphaproteobacteria
Methylobacillus flagellatus strain KT	Proteobacteria:Betaproteobacteria
Methylophaga sp. strain JAM7	Proteobacteria:Gammaproteobacteria
Mycobacterium tuberculosis = ATCC 35801 strain ATCC35801; Erdman	Actinobacteria:Actinobacteridae
Mycoplasma gallisepticum strain F	Tenericutes:Mollicutes
Neisseria gonorrhoeae strain NCCP11945	Proteobacteria:Betaproteobacteria
Nocardia brasiliensis ATCC 700358 strain HUJEG-1	Actinobacteria:Actinobacteridae
Nocardia cyriacigeorgica strain GUH-2	Actinobacteria:Actinobacteridae
Nostoc sp. PCC 7120 (Anabaena sp. PCC 7120) strain PCC7120	Cyanobacteria:Nostocales
Novosphingobium aromaticivorans strain DSM 12444	Proteobacteria:Alphaproteobacteria
Oceanobacillus kimchii strain X50	Firmicutes:Bacilli
Orientia tsutsugamushi strain Ikeda	Proteobacteria:Alphaproteobacteria
Paenibacillus polymyxa strain M1	Firmicutes:Bacilli
Polynucleobacter necessarius strain STIR1	Proteobacteria:Betaproteobacteria
Propionibacterium acnes TypeIA2 strain P.acn33	Actinobacteria:Actinobacteridae
Proteus mirabilis strain HI4320	Proteobacteria:Gammaproteobacteria
Pseudomonas fluorescens strain Pf0-1	Proteobacteria:Gammaproteobacteria
Pseudonocardia dioxanivorans strain CB1190	Actinobacteria:Actinobacteridae
Ralstonia eutropha strain H16	Proteobacteria:Betaproteobacteria
Rhizobium tropici strain CIAT 899	Proteobacteria:Alphaproteobacteria
Rhodobacter sphaeroides ATCC 17029	Proteobacteria:Alphaproteobacteria
Shigella boydii strain Sb227	Proteobacteria:Gammaproteobacteria
Slackia heliotrinireducens strain DSM 20476	Actinobacteria:Coriobacteridae
Sorangium cellulosum strain So0157-2	Proteobacteria:Deltaproteobacteria
Staphylococcus aureus strain 04-02981	Firmicutes:Bacillales
Streptococcus agalactiae strain 2603V/R	Firmicutes:Lactobacillales
Streptomyces cf. griseus strain XylebKG-1	Actinobacteria:Actinobacteridae
Streptosporangium roseum strain DSM 43021	Actinobacteria:Actinobacteridae
Sulfurimonas denitrificans DSM 1251 strain ATCC 33889	Proteobacteria:Epsilonproteobacteria
Thioalkalivibrio nitratireducens strain DSM 14787	Proteobacteria:Gammaproteobacteria
Thiobacillus denitrificans strain ATCC 25259	Proteobacteria:Betaproteobacteria
Treponema azotonutricium strain ZAS-9	Spirochaetes:Spirochaetales
Treponema pedis strain T A4	Spirochaetes:Spirochaetales
Turneriella parva strain DSM 21527	Spirochaetes:Spirochaetales
Vibrio cholerae strain BX 330286	Proteobacteria:Gammaproteobacteria
Wolbachia endosymbiont strain TRS of Brugia malayi	Proteobacteria:Alphaproteobacteria
Yersinia pestis D106004	Proteobacteria:Gammaproteobacteria
Bacillus thuringiensis serovar andalousiensis strain BGSC 4AW1	Firmicutes:Bacillales
Ureaplasma urealyticum serovar 5 strain ATCC 27817	Tenericutes:Mollicutes
Bordetella pertussis strain 18323	Proteobacteria:Betaproteobacteria
Comamonas testosteroni strain KF-1	Proteobacteria:Betaproteobacteria
Eikenella corrodens strain ATCC 23834	Proteobacteria:Betaproteobacteria
Janthinobacterium sp. strain Marseille	Proteobacteria:Betaproteobacteria
Rhodopirellula baltica SH strain 1	Planctomycetes:Planctomycetacia
Blastopirellula marina strain DSM 3645	Planctomycetes:Planctomycetacia

Monday, April 28, 2014

Are Some Leprosy Pseudogenes Turned On?

Most genomes, whether human or bacterial, contain significant numbers of pseudogenes (that is, genes that are presumed to be inactive due to internal stop codons, frameshift errors, or other serious defects). The usual presumption is that such genes are dormant or dead and thus are not expressed as proteins, since the proteins would be severely truncated or contain nonsense regions, etc.

However, we know that in Mycobacterium leprae (the leprosy bacterium), some 43% of the organism's 1,116 pseudogenes are transcribed. While most of the transcripts are no doubt used in some kind of regulatory capacity, it would be surprising if not a single transcript got translated into protein.

One sign that a protein gene is highly expressed is the presence, upstream of the start codon, of a strong Shine Dalgarno sequence. This is a special sequence of bases that serves as a binding area for 16S ribosomal RNA. The Shine Dalgarno sequence serves to increase the translational efficiency of the genes that have such signatures. (Not all do.) Generally, they are found ahead of high-priority/highly-expressed genes (such as genes for ribosomal proteins). Absence of a SD sequence doesn't mean the gene doesn't get translated. Many genes carry no SD signal.

I created scripts that looked at all 1,116 pseudogenes in M. leprae, to detect the occurrence of Shine Dalgarno sequences in the 20-base-pair region upstream of what would normally be the start codons of said genes. Intriguingly, 31% of pseudogenes carry a length-4 SD sequence (about four times the number of such sequences expected to occur by chance). By comparison, 50.6% of normal genes in M. leprae carry a length-4 SD sequence (nearly seven times the chance expectation).

When I looked for length-5 SD signals, I found that 8.6% of pseudogenes carry such a signal, compared to 26.4% for regular genes. Length-6 signals were found for 33 pseudogenes (3% of the total of 1,116 pseudogenes) versus 176 normal genes (representing 10.9% of 1,604 normal genes). For pseudogenes, the length-6 count is about eight times higher than expected to occur by chance; for normal genes, the excess is larger still.

These numbers are summarized in the table below, where I also show similar findings for genes and pseudogenes of Bordetella pertussis.

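Organism	Data Set	Motif Length	Expected by Chance	Found	% of Genes
M. leprae	genes (n = 1604)	SD6	5	176	10.9%
M. leprae	genes (n = 1604)	SD5	25	423	26.4%
M. leprae	genes (n = 1604)	SD4	119	812	50.6%
M. leprae	pseudogenes (n = 1116)	SD6	4	33	3.0%
M. leprae	pseudogenes (n = 1116)	SD5	18	96	8.6%
M. leprae	pseudogenes (n = 1116)	SD4	83	346	31.0%
B. pertussis	genes (n = 3377)	SD6	9	493	15.0%
B. pertussis	genes (n = 3377)	SD5	49	1032	30.6%
B. pertussis	genes (n = 3377)	SD4	259	1939	57.4%
B. pertussis	pseudogenes (n = 370)	SD6	0	38	10.2%
B. pertussis	pseudogenes (n = 370)	SD5	1	89	24.1%
B. pertussis	pseudogenes (n = 370)	SD4	11	194	52.4%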
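To make the comparison with chance explicit, the found/expected ratios can be computed directly from the table. The short Python sketch below copies the Expected and Found columns as given; the helper name `fold_enrichment` is made up here, and a ratio is left undefined where the rounded expectation is zero.

```python
# (organism, data set, motif, expected by chance, found), copied from the table
rows = [
    ("M. leprae", "genes", "SD6", 5, 176),
    ("M. leprae", "genes", "SD5", 25, 423),
    ("M. leprae", "genes", "SD4", 119, 812),
    ("M. leprae", "pseudogenes", "SD6", 4, 33),
    ("M. leprae", "pseudogenes", "SD5", 18, 96),
    ("M. leprae", "pseudogenes", "SD4", 83, 346),
    ("B. pertussis", "genes", "SD6", 9, 493),
    ("B. pertussis", "genes", "SD5", 49, 1032),
    ("B. pertussis", "genes", "SD4", 259, 1939),
    ("B. pertussis", "pseudogenes", "SD6", 0, 38),
    ("B. pertussis", "pseudogenes", "SD5", 1, 89),
    ("B. pertussis", "pseudogenes", "SD4", 11, 194),
]


def fold_enrichment(expected, found):
    """Found/expected ratio, or None when the expectation rounds to zero."""
    if expected == 0:
        return None
    return found / expected


for organism, data_set, motif, expected, found in rows:
    ratio = fold_enrichment(expected, found)
    shown = "n/a" if ratio is None else f"{ratio:.1f}x"
    print(f"{organism:13s} {data_set:12s} {motif}  {shown}")
```

Because the expectations in the table are rounded to whole numbers, ratios based on very small expectations (such as 1 or 4) are rough at best.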

B. pertussis preserves a higher proportion of SD signals in pseudogenes than does M. leprae. This is expected, since most B. pertussis pseudogenes are still "in frame," whereas most (but not all) M. leprae pseudogenes harbor frameshifts.

Length-5-or-longer SD signals occur at about one third the rate in pseudogenes of M. leprae that they do in normal genes of M. leprae, but still much higher than expected by chance. If 43% of pseudogenes are transcribed (as we know they are), and a third of those transcripts have strong enough SD sequences to facilitate translation, it means about 159 pseudogenes in M. leprae could be expected to have expressed protein products. Those products would, of course, be truncated and/or contain nonsense regions. Presumably, many would be marked for proteolysis (either by the tmRNA system or through other mechanisms).
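The back-of-envelope estimate above is easy to check. The Python lines below use only numbers quoted in this post; the one-third figure is the rough assumption made in the text, not a measured value.

```python
pseudogenes = 1116            # M. leprae pseudogene count
transcribed_fraction = 0.43   # share of pseudogenes known to be transcribed
sd_fraction = 1 / 3           # assumed share of transcripts with a usable SD signal

# Length-5 SD frequency in pseudogenes (8.6%) relative to normal genes (26.4%)
relative_rate = 8.6 / 26.4

estimate = pseudogenes * transcribed_fraction * sd_fraction
print(f"pseudogene/gene SD5 rate: {relative_rate:.2f}")
print(f"pseudogenes possibly translated: {estimate:.0f}")
```

The product comes out at roughly 160, in line with the figure of about 159 given above.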

Sunday, April 27, 2014

A Stroll on the Beach in 1,000,000,000 B.C.

A popular conversation-starter is to ask someone the following question: If you could have access to a time machine, where would you go in time? Would you go forward, or backward? To what date? And why?

In The Time Machine, H. G. Wells has his hero go forward to the year 802,701 A.D., where he finds that the above-ground humans still look like humans, but have lost their humanity, in that they no longer value knowledge or culture. Meanwhile, the below-ground humans have also lost their humanity. They feed on the above-ground humans. Wells leaves it to the reader to decide which group suffered the greater loss.

In my case, I often imagine that if I were to obtain access to a time machine (and could use it for only one round-trip journey), I would go back to 1 billion years B.C.E. My mission: to collect environmental samples (soil, sea water, mud) for biological analysis.

Of course, for this journey I will need to wear a space suit. The earth's atmosphere, a billion years ago, would contain far too little oxygen to breathe, just one to two percent oxygen, less than exists today at 50,000 feet above sea level.

I would hope to step out of the time machine on a beach, or within walking distance of the ocean. If I find myself standing on solid land, it will likely be the continent of Rodinia. This is the supercontinent that will break apart, then come together again (around 300,000,000 B.C.E.) to form Pangea.

The ground is rocky, gritty; there is no soil in the normal sense. And certainly no plants. And no insects. In fact, as I look around, I find what appears to be a completely sterile landscape. Without an ozone layer, the atmosphere allows the full strength of the sun's ultraviolet rays to reach the ground, where they sterilize everything. Were I to step out of my space suit (and breathe through a scuba tank) for even ten minutes, I would be fatally irradiated and die a horrific death 48 hours or so later.

The landscape is rocky, Venusian, harsh. I need to find the beach. That's where all the life on earth is at this point in time―under water.

The beach doesn't look like a normal beach. The sand is dark and gritty. It will be many millions of years before sea creatures have mastered the art of depositing carbonates in their exoskeletons. The first white beach is eons in the future.

I obtain a sample of ocean water for later analysis. Back in the lab, I'll do a naughty thing and taste a teaspoon of it. Oddly, it doesn't taste like normal sea water. It's noticeably less salty, with a disagreeable brackish odor. I spit it out.

The ocean is where life is, at 1 billion years B.C.E. But there is no life to be seen with the naked eye. The first true multicellular creatures are 300 or 400 million years in the future. If I'm lucky, my sea water sample will contain some clue of eukaryotic life. More than likely, it will contain mostly cyanobacteria and the phages (viruses) associated with them, and maybe some other types of bacteria (the forms we now call alphaproteobacteria), but it might contain microscopic green algae (and their viruses). I can't wait to examine the samples to see if eukaryotic forms are present, and if they are, I can't wait to see what their endosymbionts look like (maybe they've already become mitochondria and chloroplasts). DNA sequencing will tell me how many endosymbiont genes have migrated to the nucleus, and which ones are still active. Much work needs to be done.

At 1 billion years B.C.E., earth is a lonely place: no mosquitoes (no insects at all), no birds in the sky, no grass, no moss, no real sign of life―except in the sea (and even there, the life forms are microscopic). The waves that break on the shore are barely foamy at all.

Before returning to the present day in my time machine, I'll wait for night (the wait won't be long; the days are only 20 hours long) and hope to watch the moon rise. The moon, when I see it, will be shockingly huge, maybe twenty percent bigger in diameter than I'm used to, because at 1 billion years B.C.E., the moon is something like 40,000 miles closer to the earth.

As I stare at the shockingly large moon, craters clearly visible, I'll reflect on all that lies ahead for earth: the snowball-earth freeze-over that brings glaciers to the equator around 700 million years B.C.E. The proliferation of life forms after 582 million years B.C.E. (the so-called Cambrian Explosion). The five major extinction events that will wipe them out. The emergence of dinosaurs, birds, animals with fur. Animals that walk on two legs.

Nighttime is cold, but I can't make a fire on the beach; there isn't enough oxygen in the air. Come to think of it, my space suit's oxygen supply is running low. I need to get back to the time machine, and quickly. I'll die if I stay in this place.

Saturday, April 26, 2014

The GGAGG Reflex

A common myth in biology is that genes coding for proteins need to have a Shine Dalgarno sequence upstream of the start codon. Students sometimes spout this as an inarguable fact; a kind of molecular biology catechism. I call it the GGAGG reflex.

In fact, no SD sequence is required. None at all. It's important to be clear on this.

In case you're not a biogeek: In the 1970s, Australian scientists John Shine and Lynn Dalgarno were the first to notice that the tail end of the 16S bacterial ribosomal RNA contains a short nucleotide sequence whose reverse complement is often found immediately upstream of a protein gene's start codon. The exact sequence varies from organism to organism, but the rRNA trailer sequence is usually pyrimidine-rich. In E. coli, the sequence is CACCTCCTTA. (Here, I am of course talking about the DNA sequence. In RNA it's CACCUCCUUA.) If you reverse the sequence, the Watson-Crick complement is TAAGGAGGTG. Some portion of the latter is often found a few nucleotides upstream of a start codon; not 100% of the time, but too often to be by chance.
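The reverse-complement step is simple to verify by machine. Here is a small Python sketch (the function name is my own choice) applied to the E. coli DNA sequence quoted above.

```python
# Translation table mapping each DNA base to its Watson-Crick partner
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


tail_16s = "CACCTCCTTA"  # E. coli 16S rRNA tail, written as DNA
print(reverse_complement(tail_16s))  # TAAGGAGGTG
```

Applying the function twice returns the original sequence, which makes a handy sanity check.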

The key intuition here is that Watson-Crick binding of the tail end of the 16S rRNA to the corresponding antisequence ahead of the start codon helps stabilize the ribosome so that it is more likely to translate the gene. The degree of binding depends, of course, on the fidelity of the SD sequence ahead of the gene. Usually, the purine-rich SD area is not an exact match for the 16S rRNA trailer, and in fact the SD region quite often has no detectable SD signature whatsoever.

How often is "quite often"? In 2002, Ma et al. undertook a survey of 30 organisms representing bacteria from all major taxonomic groups. Somewhat surprisingly, they found that in 17 out of 30 organisms, a Shine Dalgarno sequence was present at fewer than half of all CDS (protein-encoding) genes. Among the bacteria most likely to use SD sequences were Bacillus subtilis and Thermotoga thermophilus, in which 90% of known protein genes have an upstream SD signal. Among those least likely to use SD sequences were low-GC/small-genome organisms (intracellular parasites, Mycoplasmas, and pathogens), with many groups, like the Actinobacteria (47%), falling somewhere in the middle.

Before taking these findings to heart, though, it's worth noting some serious weaknesses in the Ma et al. study. In obtaining the above numbers, Ma et al. used a rather permissive definition of "SD sequence," based on a minimum binding-energy cutoff (∆G) of -4.4 kcal/mol, which means they counted GAGG as a SD sequence (and also GGAG and AGGA). If one were to count only GGAGG (length 5) and longer motifs, the percentages given by Ma et al. would be much lower. (I present some data of my own on this further below.) The reason this is a very serious issue is that the probability of random occurrence of short (length-4) sequences like GAGG is substantial. Ma et al. failed to report the expectation odds for the various "signals" they looked for. Hence, for short motifs, we have no way of knowing, for the various organisms, what the expected rate of occurrence of short signals was. If an organism with genomic GC content of 66% has a putative SD motif of GAGG, GGAG, or AGGA in the 20-bp target region for 20% of its genes, how does that compare with the random occurrence rate for those sequences, given the organism's DNA base composition? We're not told.
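One rough way to put a number on the chance expectation is sketched below in Python. It is not the method of Ma et al., and it rests on loudly simplified assumptions: bases are independent, G and C are equally frequent (as are A and T), and overlapping windows are treated as independent. All function names are hypothetical.

```python
def base_probs(gc_content):
    """Per-base probabilities, assuming G = C and A = T (a simplification)."""
    return {
        "G": gc_content / 2,
        "C": gc_content / 2,
        "A": (1 - gc_content) / 2,
        "T": (1 - gc_content) / 2,
    }


def motif_prob(motif, probs):
    """Chance that one specific window spells out `motif`, bases independent."""
    p = 1.0
    for base in motif:
        p *= probs[base]
    return p


def chance_of_hit(motifs, region_len, probs):
    """Approximate chance that a random region contains at least one motif.

    Motifs must share one length. Overlapping windows are treated as
    independent, which is only an approximation.
    """
    lengths = {len(m) for m in motifs}
    if len(lengths) != 1:
        raise ValueError("all motifs must have the same length")
    k = lengths.pop()
    # Distinct same-length motifs cannot occupy the same window, so add them.
    p_window = sum(motif_prob(m, probs) for m in set(motifs))
    windows = region_len - k + 1
    return 1 - (1 - p_window) ** windows


probs = base_probs(0.66)  # the 66% GC organism from the example above
p = chance_of_hit(["GAGG", "GGAG", "AGGA"], region_len=20, probs=probs)
print(f"expected share of genes with a chance hit: {p:.1%}")
```

Multiplying that probability by the number of genes gives an expected count that can be set against the observed count. A more careful estimate would use the organism's actual base (or dinucleotide) frequencies in upstream regions rather than a single GC figure.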

Bearing in mind the weaknesses of the study, a number of interesting findings nonetheless came out of the Ma et al. survey, including:

A SD sequence is rarely long or canonical; many times it's just GAGG or GGAG or AGGA (putatively) or a corruption of the expected form (e.g., GGTGG instead of GGAGG)
SD sequences occur more often with highly expressed genes (such as genes for ribosomal proteins and core energy metabolism genes) than with low-expression genes
In some (not all) organisms, the SD sequence is more likely to occur in conjunction with an ATG start codon and less likely to occur with GTG or TTG
Vanishingly few SD signals are located further than 14 bases or closer than 4 bases away from a start codon

Anybody with modest JavaScript skills can write scripts that verify some of these findings against public genomes. I took a quick look at the genome for Rothia mucilaginosa DY-18 (a member of the phylum Actinobacteria and a common inhabitant of the human mouth). First, I determined the most likely SD sequence for Rothia based on the 16S rRNA trailer of CCTCCTTTCT (implying a SD sequence of AGAAAGGAGG), then I had my script scan the genome in both directions, looking for any of the six possible length-5 motifs within the full-length sequence (so, AGAAA, GAAAG, etc.), in the 20 base pairs upstream of every annotated open reading frame start codon. In total, I found 686 putative SD sequences within 4 to 14 bases of an annotated start codon. Since Rothia mucilaginosa has 1905 CDS genes, this means 36.0% of protein genes carry a putative length-5 Shine Dalgarno signal. When I re-ran the check using all possible length-4 SD signal variants (using the relaxed criteria of Ma et al.), I found 1160 positives. Thus, 60.9% of CDS genes in R. mucilaginosa have a length-4 SD signal per Ma et al.
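The scan just described can be sketched in a few lines. The version below is in Python rather than JavaScript and is not the script I actually ran. It assumes the upstream region is passed as a string whose last base sits right next to the start codon, and it interprets "within 4 to 14 bases" as the number of bases between the end of the motif and the start codon. The function names and the toy region are hypothetical.

```python
def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def find_sd(upstream, sd_full="AGAAAGGAGG", k=5, min_gap=4, max_gap=14):
    """Return (motif, gap) pairs for SD-like motifs in an upstream region.

    `upstream` is read 5' to 3', ending just before the start codon.
    `gap` counts the bases between the motif's last base and the start codon
    (an assumed reading of the 4-to-14-base spacing rule).
    """
    upstream = upstream.upper()
    motifs = set(kmers(sd_full, k))
    hits = []
    for i in range(len(upstream) - k + 1):
        word = upstream[i:i + k]
        gap = len(upstream) - (i + k)
        if word in motifs and min_gap <= gap <= max_gap:
            hits.append((word, gap))
    return hits


print(kmers("AGAAAGGAGG", 5))  # the six length-5 motifs

# Toy 20-bp upstream region (hypothetical), with GGAGG seven bases from the start
toy_upstream = "CCCCCCCC" + "GGAGG" + "CTCTCTC"
print(find_sd(toy_upstream))
```

For genes on the reverse strand, the upstream region would need to be reverse-complemented before being passed in, so that it always reads 5' to 3' toward the start codon.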

On a probability of abundance basis (given Rothia's actual base composition stats), we would expect to see 203 length-5 SD motifs by pure chance in the genome's 1905 20-bp regions. The actual number (686) is obviously quite a bit higher than expected, tending to validate the notion that these are, indeed, SD motifs we're looking at. For length-6 motifs, the trend is even sharper: The expectation is 40 occurrences by chance; the actual number is 340. So at a length of 6, a motif has high odds (~90% chance) of being real.

By contrast, the statistical expectation for length-4 motifs calculates out at 957, which is only slightly less than the number found (1160). Therefore, in dealing with Shine Dalgarno sequences, at least in Rothia, it's meaningful to deal with length-5 and longer motifs, but probably not meaningful to deal with length-4. When you spot a length-4 motif, odds are very high you're looking at a randomly occurring pattern.
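One simple way to read these numbers is to ask what share of the observed motifs exceeds the chance expectation. The Python sketch below uses the expected and found counts quoted above for Rothia; treating this excess share as the odds that a given motif is "real" is my own rough interpretation, not a formal statistical test.

```python
# motif length: (expected by chance, actually found), from the Rothia counts above
counts = {
    6: (40, 340),
    5: (203, 686),
    4: (957, 1160),
}


def excess_fraction(expected, found):
    """Share of observed motifs not accounted for by chance."""
    return (found - expected) / found


for length, (expected, found) in sorted(counts.items(), reverse=True):
    share = excess_fraction(expected, found)
    print(f"length {length}: {share:.0%} of hits exceed chance")
```

For length 6 the excess share is close to 90%, for length 5 it is about 70%, and for length 4 it falls below 20%, which is why the short motifs are hard to trust.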

If you enjoyed this post, please share the URL with a friend. Thank you!

Thursday, April 24, 2014

Are Overlapping Genes Real?

Bacteria belonging to the genus Pseudomonas are a perennial favorite among bacteriology instructors (and students) because of the curious ability of some members of the genus to produce pigments that fluoresce under ultraviolet light. If you're unlucky enough to get an infected cut on the arm while working in the garden, it's possible your cut will fluoresce under a black light. That's enough of a diagnosis to name the infectious agent.

Fluorescent colonies of Pseudomonas.

Silby and Levy, investigating the adaptation of the bacterium Pseudomonas fluorescens to soil, uncovered the existence of at least ten antisense genes in P. fluorescens. They went on to demonstrate experimentally that one of the genes, cosA, produces not just antisense RNA but an associated protein. Tellingly, Silby and Levy commented:

These findings suggest that current genome annotations provide an incomplete view of the genetic potential of a given organism.

The implication is that additional antitranscriptome genes remain to be found, not only in Pseudomonas but in other organisms.

There's a good reason they haven't been found yet. Overlapping genes are automatically rejected by many of the annotation programs that are commonly used to find, identify, and label genes in genome sequences. (The oft-used freeware Glimmer 2 program allows you to set the overlap-rejection threshold.) Many yet-to-be-discovered antisense genes have been deliberately and systematically obscured in published genomes.

Still, once in a while such genes do surface. For example, in Pseudomonas stutzeri A1501, we find a pair of overlapping genes at an offset of 3035137 on the chromosome (see illustration below).

Overlapping genes in Pseudomonas stutzeri.

The top gene is annotated merely as a "hypothetical protein," while the underlying gene on the opposite strand is an aspartyl-tRNA synthetase. One's normal inclination is to dismiss a hypothetical protein as being unimportant, but this may not be wise. Twenty percent or more of bacterial genes are annotated as hypothetical proteins; common sense says they can't all be unimportant. In fact, in "Transcriptome Analysis of Pseudomonas syringae Identifies New Genes, Noncoding RNAs, and Antisense Activity" by Filiatrault et al. (2010), researchers found that 818 out of 1,646 protein genes in P. syringae annotated as "hypothetical proteins" were expressed under iron-limited conditions. Many (probably most) genes annotated as "hypothetical protein" are quite real and should probably be re-annotated as PUF: "protein of unknown function."

In this case, the "hypothetical protein" shown in yellow (above) turns up medium-strength protein-BLAST hits with other "hypothetical proteins" from other organisms, including a hit with an E-value of 3.0×10^-49 in Parasutterella excrementihominis YIT 11859 and a comparable hit on a predicted phosphatase/phosphohexomutase in Rothia mucilaginosa DY-18.

In this particular case, the hypothetical-protein gene lacks a strong upstream Shine Dalgarno sequence (a sequence preceding many genes that helps bind a ribosome to the mRNA). But so too does the gene on the opposite strand. (This is not unusual. The SD sequence is not required for translation and in fact, in about half of bacterial species, a Shine Dalgarno sequence is associated with fewer than 50% of genes.) Hence, the jury's out on whether the antigene is expressed. It could be that no protein is made from the top strand but the gene provides RNA-mediated control of the gene on the bottom strand. We won't know for sure until someone investigates.

In Pseudomonas aeruginosa strain PADK2_CF510, we find another instance of a bidirectional overlapping gene pair (see graphic below). In this case, the gene on the top strand (CF510_06030) encodes the large subunit of an isopropylmalate isomerase. The gene on the bottom strand (CF510_06025, shown in yellow) is annotated as "Flp pilus assembly protein TadG." It could very well be a misannotated non-gene. However, five genes away is FimV (CF510_06060), another pilus-assembly (motility) protein. Moreover, the gene marked TadG has a strong upstream SD sequence containing the canonical GGAGG motif. The gene above it has a weaker GGAAA motif.

P. aeruginosa has an overlap of an isopropylmalate isomerase gene and a gene for a motility protein. The latter is shown in yellow.

In previous posts, I've mentioned (and shown data for) the fact that in the overwhelming majority of protein-encoding genes (across every kind of genome), the first base of a codon tends to be purine-rich. One check of whether a bidi-overlap gene is "real" or not ought to be that the first codon base should be purine rich in both reading directions. This is, in fact, the case for the examples shown above. The aspartyl-tRNA synthetase gene for P. stutzeri has AG1 (1st base, purine) content averaging 59.8%, whereas its bidirectional partner gene ("hypothetical protein") has AG1 = 58.5%. The isopropylmalate isomerase of P. aeruginosa has AG1 = 65.9%, while its antisymmetric partner (TadG) has AG1 = 56.2%.
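The two-direction check can be sketched as follows in Python. The sketch assumes you have the overlapping region as a forward-strand DNA string plus 0-based, half-open coordinates for each gene; the function names and the toy sequence are hypothetical, and this is not the script behind the percentages above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def ag1_fraction(cds):
    """Fraction of codons whose first base is a purine (A or G)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    first_bases = cds[0:usable:3]
    purines = sum(1 for base in first_bases if base in "AG")
    return purines / len(first_bases)


def ag1_for_gene(region, start, end, strand):
    """AG1 for a gene at region[start:end]; strand is '+' or '-'."""
    cds = region[start:end]
    if strand == "-":
        # Minus-strand genes are read off the reverse complement.
        cds = reverse_complement(cds)
    return ag1_fraction(cds)


# Toy region (hypothetical), read in both directions over the same coordinates
region = "ATGGCTAAACTGTAA"
print(ag1_for_gene(region, 0, 15, "+"))
print(ag1_for_gene(region, 0, 15, "-"))
```

In the toy case only the forward reading is purine-rich at codon position one. For a genuine bidirectional pair, both values would be expected to sit well above 50%.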

If you enjoyed this post, please give the URL to your biogeek friends. Thanks!

Tuesday, April 22, 2014

How Antisense Genes Are Discovered

In the past ten years or so, a great deal of research has focused on antisense transcription of genes. Normally, RNA gets transcribed from one strand of DNA only. But it turns out, in many cases RNA also gets transcribed off the opposite strand of DNA (an antisense copy), either at the original gene (so-called cis transcription) or at a copy of the gene some distance away (trans transcription). The latter can be a pseudogene, or a normal copy of the gene.

Antisense transcripts occur very widely not only in human DNA but in bacteria, yeast, and (in fact) every place where scientists have looked, and no doubt in places where they haven't looked. Some of the most interesting discoveries have happened when researchers weren't specifically looking for antisense transcripts but found them by accident. How does that happen? It happens in experiments involving IVET (in vivo expression technology), an important experimental technique for uncovering new genes.

IVET is a powerful gene manipulation strategy for discovering which genes in an organism (a pathogen, usually) are up-regulated or turned on during host infection. Let's say you're studying a new pathogen and you want to get an idea of which genes, in the pathogen, are turned on during the infection process.

First, you need a strain of the organism that's disabled by virtue of lacking a working copy of a particular metabolic enzyme, say an enzyme needed for purine metabolism, e.g. purA. Secondly, you need a vector for inserting a promoterless copy of the working gene into the bacterium. What this usually means is, you need a plasmid (a small extra chromosome; many bacteria have them, and they can often be manipulated in the lab) on which to place a functional purA gene. The gene won't be expressed, however, if it lacks a suitable promoter region on the DNA upstream of the gene. That's good; that's what you want. You want to put a promoterless copy of the good gene on the plasmid, along with (this is crucial) a random chunk of DNA from the pathogen, inserted ahead of purA on the plasmid. In practice, it's easy to create a bunch of plasmids with this arrangement: a working copy of purA, and ahead of it, a random chunk of pathogen DNA.

The idea is that you now attempt to infect a lab animal with the bacterium containing the plasmid. If the bacterium establishes infection in the animal, presumably it's because a random chunk of DNA happened to contain a promoter region (and associated downstream genes) that gets turned on during infection. If you now isolate the bacterium from the sick animal, you can look to see what kind(s) of genes got transduced into the bacterium.

IVET is a promoter trap technology for selecting bacterial genes that are specifically induced when bacteria infect a host organism. A plasmid vector contains a random fragment of the chromosome of the pathogen (red) and a promoterless gene (selective marker, burgundy) that encodes an enzyme required for survival. Pooled plasmid-containing clones are inoculated into the mouse (B). Only those bacteria that contain the selective marker fused to a random gene that is transcriptionally active in the host are able to survive. After a suitable infection period, bacteria that express the marker are isolated from the spleen or other organs. The inclusion of a lacZY mutant gene (blue) allows post-selection screening for promoters that are active only in vivo. What you want are bacteria that are lac-positive only in the host environment, not "constitutive" (always-on).

Exactly this sort of technique was used by Silby, Rainey, and Levy to determine which genes were activated in Pseudomonas during colonization of soil. (The IVET technique can be adapted to any scenario in which an organism differentially expresses genes in its adaptation to a "host" environment, even if the environment is, in fact, a plant, or soil in this case, rather than a mouse.) They were looking to see which genes in Pseudomonas play an essential role in that organism's ability to thrive in soil, and they successfully identified more than 50 promoters (and associated fusions) that come alive during soil colonization. When they looked at 22 "soil genes" that got turned on, they found ten previously undescribed genes that were transcribed in the antisense direction from regions overlapping known genes. They called these ten genes "cryptic fusions" because of their un-annotated existence on the supposedly silent, antisense side of known genes.

Cryptic fusions discovered by Silby et al. are shown in grey, in their antisense orientation to known genes (darker grey).

It's not unusual to find that antisense transcripts are playing a regulatory role. When a gene gets transcribed in both directions, the resulting sense and antisense RNAs can combine (by Watson-Crick pairing) to form a double-stranded RNA product, preventing translation of the RNA into protein. But incredibly, sometimes an antisense RNA transcript encodes a legitimate protein (a protein that gets made off the antisense copy). Silby and Levy documented this for the previously unknown cosA gene in Pseudomonas. It seems likely additional antisense proteins await discovery. (Most studies stop at the level of identifying RNA products.)

The finding of antisense transcripts in IVET experiments is common. One of the authors of the Pseudomonas study (Rainey) had previously published a study of rhizosphere-induced genes in Pseudomonas but had not published the fact that 20% of genes found this way were in an antisense orientation to normal genes. Likewise, a 1996 study of Pseudomonas aeruginosa infection in the mouse (Pseudomonas is an opportunistic pathogen) found antisense activity. In fact, the first-ever paper on IVET (by Mahan et al., 1993) described finding antisense transcripts.

IVET has uncovered a previously unknown "antitranscriptome" world hidden inside living cells. Until we explore this world fully, we won't know how much undiscovered biology we've left on the table.

Sunday, April 20, 2014

Bidirectionally Overlapping Genes

The occurrence of bidirectionally overlapping genes in bacteria is rare, and most such examples are dismissed as chimeric or representative of simple genome mis-annotation. After all, how can a gene make sense in one direction, but also make sense on the reverse-reading complementary strand of DNA? Such a situation is more than a mere palindrome. It's akin to the phrase:

Warsaw won, eh?
He now was raw.

The phrase has a sensical message in each direction, yet is not a mere bidi-symmetry of the "A man, a plan, a canal, Panama" kind. It defies credulity to believe a stretch of DNA spanning several hundred bases (several hundred "letters") could evolve to give a useful message in both directions. And yet, what is life itself, if not credulity-defying? Somehow, life began from primordial chemistry and evolved toward DNA genes coding for proteins. Is it so hard to believe that early replicant molecules (probably RNA) were transcribed and translated in both directions, and that some of the happy accidents survived? Is it so hard to believe that some proteins began life as reverse transcripts ("nonsense" proteins) that then evolved toward specialized functionality?

A bona fide example of a bidirectionally transcribed and translated gene was verified experimentally in 2008 by Silby and Levy, who were investigating the soil bacterium Pseudomonas fluorescens PF0-1. They found that the hitherto unknown cosA gene, which overlaps (on the opposite DNA strand) a gene for a fusaric acid resistance protein, is not only expressed as a protein but is required for soil colonization.

A section of P. fluorescens PF0-1 genome showing the existence of overlapping genes (note the yellow-colored segment, representing the cosA gene; the larger green gene above it, on the opposite strand, encodes a fusaric acid resistance protein). The overlapping genes have been shown experimentally to be expressed as protein.

Ironically, a month after Silby and Levy published their results, BMC Genetics published a study by Pallejà et al. looking at large gene overlaps in bacterial genomes. The Pallejà study concluded:

Among the 968 overlaps larger than 60 bps which we analysed, we did not find a single real one among the co-directional and divergent orientations and concluded that there had been an excessive number of misannotations. Only convergent orientation seems to permit some long overlaps, although convergent overlaps are also hampered by misannotations. We propose a simple rule to flag these erroneous gene length predictions to facilitate automatic annotation.

Silby and Levy argue that, to the contrary, current genome annotations are obscuring potentially important discoveries:

[Our] findings suggest that current genome annotations provide an incomplete view of the genetic potential of a given organism . . . In eukaryotes, the concept that genomes include numerous sense/antisense gene pairs is becoming increasingly obvious with genome-wide transcriptional studies in yeast [8] and Arabidopsis [10]. Antisense transcripts have been implicated in eye development [20] and control of entry into meiosis in yeast [21]. However, discussion of antisense transcription is limited to possible regulatory roles for antisense RNA [e.g. 8], without consideration of the possibility that they may specify proteins. Genome annotations do not routinely predict the existence of two protein-coding genes on opposite DNA strands, and in fact normally deliberately eliminate predicted overlaps. Moreover, small protein-coding genes can be missed by predictive algorithms. For example, the blr gene in E. coli specifies a 41 residue protein, and was discovered in a sequence believed to be intergenic [22]. The fact that antisense genes have been implicated in important biological functions indicates that more attention should be given to this emerging class of genes.

I happen to agree with Silby and Levy. It would be a shame if bidirectional overlaps in genomes are not investigated. The notion (furthered by Pallejà) that annotation software should suppress such findings automatically is repulsive. It's the kind of intolerant, rigid, dogmatic thinking science, quite frankly, doesn't need more of.

Saturday, April 19, 2014

Codons and Reverse Complement Codons

A very unusual and surprising property of protein-coding genes is that if a codon A appears with a certain frequency in genes, the reverse-complement codon of A will also have a similar frequency of occurrence. For example: If CTT (leucine) appears at a frequency of 1%, the reverse complement codon AAG (lysine) will also appear at roughly 1%. If CGT (arginine) appears at 0.2%, ACG (threonine) will appear at around 0.2%. (These are whole-genome frequencies.)
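For readers who want to check the pairings themselves, here is a minimal Python sketch of how a codon's reverse complement is formed. The function name `reverse_complement` is my own, chosen for illustration, and the code assumes plain A/C/G/T DNA letters.

```python
# Map each DNA base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    # Complement every base, then read the result backwards.
    return seq.upper().translate(COMPLEMENT)[::-1]


# The two examples from the text: CTT pairs with AAG, CGT pairs with ACG.
for codon in ("CTT", "CGT"):
    print(codon, "->", reverse_complement(codon))
```

Applying the function twice returns the original codon, which is why the codons fall into mirrored pairs.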

This correlation is strongest (r=0.75) for organisms with a high genomic G+C content, such as Streptomyces griseus, and lowest (r=0.28) in low-GC organisms like Clostridium botulinum.

This is a very peculiar property, when you think about it. We don't usually imagine an organism being constrained in its choice of codons for a particular protein. If a particular protein calls for a huge number of leucines (CTTCTTCTT), we don't imagine that there's a requirement for an equivalent quantity of AAG to be used somewhere else. And yet, the correlation between frequency-of-occurrence of a codon and its antisymmetric twin is, as I say, surprisingly high in many organisms.

This sort of thing is very hard to explain without invoking a theory of proteogenesis that involves antisense proteins. Imagine a poly-lysine gene of AAA repeated 100 times. The gene gets duplicated on the opposite strand. Now the original strand has 100 AAAs and a run of 100 TTTs. If a reading frame opens up on the TTT stretch (and the protein is beneficial to the organism; it survives), there is now codon/anticodon parity of the kind I'm describing, between codons in the poly-lysine gene and the poly-phenylalanine (TTT) gene.
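The poly-lysine thought experiment can be sketched in a few lines of Python. This is only an illustration: it assumes the standard genetic code (built here from the conventional TCAG-ordered amino acid string), and the helper names `reverse_complement` and `translate` are hypothetical, not taken from any library.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate an in-frame DNA string codon by codon ('*' marks a stop)."""
    usable = len(seq) - len(seq) % 3  # ignore any trailing partial codon
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


sense = "AAA" * 100                    # the imagined poly-lysine gene
antisense = reverse_complement(sense)  # what the opposite strand reads as

print(translate(sense)[:5], translate(antisense)[:5])
```

The sense strand translates to a run of lysines (K) and the opposite strand to a run of phenylalanines (F), so every AAA on one side is matched by a TTT on the other.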

Why does this relationship hold for high-GC organisms but not as much for low-GC organisms? Probably because antisense genes in high-AT organisms contain a lot of stop codons (TAA, TGA, TAG, which by the way occur at about the same frequencies as TTA, TCA, and CTA, respectively). The presence of few stop codons in high-GC antisense genes gives those genes a chance to be expressed and evolve further. Of course, if you buy this theory, it tends to argue for a "GC World" scenario in which the early proteome evolved from GC-rich double-stranded genomes.
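A rough back-of-the-envelope calculation shows why stop codons should be scarcer in GC-rich sequence. The sketch below is a deliberate simplification, not a model of real genes: it assumes every base is drawn independently, with A = T and G = C, and the 30% and 70% G+C values are illustrative rather than taken from any particular genome.

```python
def stop_codon_probability(gc: float) -> float:
    """Chance that a random triplet is TAA, TAG or TGA at G+C fraction `gc`.

    Assumes independent bases with A = T and G = C (a simplification).
    """
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a fraction between 0 and 1")
    g = gc / 2            # frequency of G (C is the same)
    a = t = (1 - gc) / 2  # frequency of A and of T
    # P(TAA) + P(TAG) + P(TGA)
    return t * a * a + t * a * g + t * g * a


for gc in (0.30, 0.70):
    p = stop_codon_probability(gc)
    print(f"GC={gc:.2f}: P(stop)={p:.4f}, mean run without a stop ~{1 / p:.0f} codons")
```

Under these assumptions a random triplet is a stop about 8% of the time at 30% G+C but only about 2% of the time at 70% G+C, so an unselected reading frame runs roughly four times longer before hitting a stop in the GC-rich case.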

To illustrate the unusual correlation I'm talking about, I took the codon frequencies of Pseudomonas fluorescens PF0-1 (genome-wide) and made a graph that plots the frequency of occurrence of each codon on the x-axis, versus the frequency of occurrence of the corresponding reverse-complement codon on the y-axis. (So if CTA occurs at 0.3% and TAG occurs at 0.2%, I plot a point at [0.3, 0.2].) The SVG graph (below) is interactive: You should be able to hover over a point and see a tooltip that shows the identity of the corresponding codon, and its reverse twin, and their respective frequencies.
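If you'd like to build the same kind of plot for another genome, the point list can be generated along the following lines. This is a minimal sketch, not the script behind the graph: it assumes you already have the CDS sequences as in-frame DNA strings, and the function names and the toy sequence are made up for illustration.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def codon_frequencies(cds_list):
    """Percent frequency of each of the 64 codons across in-frame CDS strings."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
        for i in range(0, usable, 3):
            codon = cds[i:i + 3]
            if set(codon) <= set("ACGT"):  # skip codons with N or other codes
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no codons found")
    return {c: 100.0 * counts[c] / total for c in ALL_CODONS}


def codon_pair_points(freqs):
    """Map each codon to (its frequency, its reverse complement's frequency)."""
    return {c: (freqs[c], freqs[reverse_complement(c)]) for c in ALL_CODONS}


# Toy input (hypothetical): one tiny "gene" with codons ATG, CTT, AAG, TAA.
toy_cds = ["ATGCTTAAGTAA"]
points = codon_pair_points(codon_frequencies(toy_cds))
print(points["CTT"], points["ATG"])
```

Each of the 64 codons yields one (x, y) point, and the point for a codon's reverse twin is the same pair with the coordinates swapped.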


The symmetry pattern is expected: For every codon/anticodon there's a corresponding anticodon/codon pair with frequencies swapped. What's more important than the symmetry pattern is the fact that frequency values in Y tend to increase with X and vice versa, with a correlation coefficient in this case of r=0.63 (F-statistic 41, p < .001). This means that codons tend to occur at about the same frequencies as their reverse complement codons. There are outliers, to be sure, but the overall trend is statistically solid.
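For anyone who wants to reproduce the statistics, the sketch below computes a Pearson correlation and converts it to an F-statistic with the usual simple-regression formula F = r²(n − 2)/(1 − r²). Treat the details as assumptions on my part: it takes n = 64 (one point per codon) and treats the points as independent, which the mirrored pairs strictly are not.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length, at least 3 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def f_statistic(r: float, n: int) -> float:
    """F-statistic for a simple linear regression with correlation r, n points."""
    return r * r * (n - 2) / (1 - r * r)


# Assumed inputs: r = 0.63 from the text, n = 64 codons.
print(round(f_statistic(0.63, 64), 1))  # 40.8
```

With those inputs the formula gives about 40.8, which rounds to the reported F of 41.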

Leave a comment if you have any thoughts on what's going on here.