appeal
> Reviewer #1 (Remarks for the Author):
>
> The authors describe a method to detect putative ncRNAs (more precisely
> regions with evidence for evolutionary conserved RNA secondary structure
> motifs) It is based on models for structured RNAs that are composed of
> two sub-models: a pfold-like ncRNA model and a null model for the
> unstructured parts of the sequence window.
>
> A major point of this contribution is that it uses gsimulator to model
> genomic background instead to the shuffling method used in previous
> approaches to RNA gene finding.

Thank you for pointing this out; we did not intend for the paper to be primarily about the simulation method. We have accordingly rewritten much of the paper to make clear that its emphasis is 1) the evaluation of different gene prediction grammars, and 2) preliminary biological characterizations of predictions made with one of the best-performing of the tested grammars. We have added many analyses to strengthen this emphasis:

1)
- We have removed text focusing on our simulation method, since as the reviewer pointed out, it is being published elsewhere.
- We have added a comparison to the EvoFold grammar, which was the first phylo-grammar-based de novo ncRNA prediction tool.

2)
- We have screened all of our predictions against the RFAM database to search for homology to known families of ncRNAs. We found promising results, including a tandem array of 17 predictions homologous to the family snoR28 in a region on the X chromosome spanned by a cDNA in FlyBase, suggesting that they lie within an intron of an unannotated protein-coding gene.
- We have searched for possible correlations between our predictions and coding regions by comparing the distribution of distances from our predictions to a uniform background, finding a significant depletion of predictions towards the 3' end of coding regions. We furthermore find a significant depletion within the first intron of protein-coding genes.
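As a rough illustration of the comparison against a uniform background described in the last item above, the following is a minimal Python sketch and not the code used for the paper. It assumes predictions and gene ends are reduced to single integer coordinates on one chromosome, and the names `nearest_distance`, `depletion_pvalue` and the `window` parameter are hypothetical. It counts predictions within `window` bases of any gene end and compares that count to counts obtained when the same number of positions are placed uniformly at random.

```python
import bisect
import random


def nearest_distance(pos, anchors_sorted):
    """Distance from pos to the closest coordinate in a sorted, non-empty list."""
    i = bisect.bisect_left(anchors_sorted, pos)
    candidates = []
    if i < len(anchors_sorted):
        candidates.append(anchors_sorted[i] - pos)
    if i > 0:
        candidates.append(pos - anchors_sorted[i - 1])
    return min(candidates)


def depletion_pvalue(predictions, anchors, genome_length, window,
                     n_trials=1000, seed=0):
    """Monte Carlo test for depletion of predictions near anchors.

    Returns (observed_count, p_value), where p_value is the fraction of
    uniform random placements with as few or fewer positions within
    `window` of an anchor (with the usual +1 correction).
    All arguments are illustrative assumptions, not the paper's settings.
    """
    rng = random.Random(seed)
    anchors_sorted = sorted(anchors)

    def count_near(positions):
        return sum(nearest_distance(p, anchors_sorted) <= window
                   for p in positions)

    observed = count_near(predictions)
    hits = 0
    for _ in range(n_trials):
        # Uniform background: same number of positions, placed at random.
        sample = [rng.randrange(genome_length) for _ in predictions]
        if count_near(sample) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_trials + 1)
```

A small p-value from such a test would indicate fewer predictions near gene ends than uniform placement would produce; a real analysis would additionally need to treat strand, 5' versus 3' ends, and multiple chromosomes separately.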
> Reading the manuscript more carefully,
> however, it turns out that gsimulator is to be published elsewhere.

We have tried to remove excessive hyperbole about the simulation method. Given that the primary focus of the paper is now clearly a comparison of gene prediction models (and not the simulator that was used in this comparison), we don't think that separate publication of the simulator should be counted as a negative (indeed, separating the two should be considered "best practice").

> Thus the "meat" of this Plos.Comp.Biol. contribution is the pipeline for
> ncRNA gene finding, substantial parts of which, namely the xrate system
> were also previously published, and its application to the 12-Fly
> data.

We would like to question the view that prior publication of the XRate compiler means that substantial parts of this work have already been published. It is true that we wrote a paper on the XRate compiler in 2006, but a compiler is a very general tool -- prior publication of which is not (usually) taken to detract from publication of subsequent programs written in that compiler's programming language. XRate is a prototyping tool for an extremely broad class of models. To give an idea of the breadth of this range, XRate could be used to build programs such as:

-- PAML or HyPhy (phylogeny)
-- Exoniphy (protein genefinding)
-- DLESS (identification of accelerated elements)
-- PhastCons (identification of conserved elements)
-- EvoFold (ncRNA gene prediction)
-- PFOLD (RNA structure prediction)
-- Thorne, Goldman and Jones' 1996 model for protein secondary structure prediction
-- Profile HMMs (e.g. HMMer)
-- Profile SCFGs (e.g. INFERNAL)
-- Many substitution models for codons, proteins and DNA
-- GRAIL or GeneMark (HMM-based gene prediction)
-- SignalP and TMHMM (prediction of transmembrane helices)
-- MEME (motif-finding by EM)

...and many others.
XRate has been evaluated as a generic, multi-purpose substitution rate estimation tool, but has never been used for genefinding. This is an important and complex application, quite distinct from the elementary proof-of-principle uses to which XRate was put in our 2006 paper (i.e. measuring evolutionary rates and predicting protein secondary structure). A useful analogue here may be the Viterbi/CYK algorithm for parsing HMMs/SCFGs. This algorithm has been known since 1967, but this surely does not diminish the value of such gene-finding programs as GenScan, SLAM, DoubleScan, GeneMark, GRAIL, QRNA, EvoGene, Exoniphy or EvoFold -- all of which use the same basic parsing algorithm, but vary substantially in the fine structure of their models.

> In addition, I have several concerns.
>
> 1) The author *claim* in 2.1 that generation of the background is superior
> to shuffling. The manuscript does not, however, give any rational argument
> for this claim (except that a stochastic generating model is intellectually
> more appealing). Obviously I can produce as many shuffled alignments for a
> resonably long sequence as I want or need. There are of course undeniable
> problems with shuffling procedures, but it can be expected that this is
> also the case for simple HMM-style-models of genomic background. In fact,
> the unrealistically low estimates for the false discovery rates in Tab 3 to
> my mind argue against the present approach rather than in its favor.

Regarding the "unrealistically low estimates for false discovery rates": based on results from the SIMGENOME preprint referenced in the text, we have changed our simulation procedure to model genomic features which were not included in the previous approach, including coding regions and small islands of conservation. This gives a much more stringent benchmark and dramatically increased our false-positive estimates reported in Table 5.
This approach is further described in Section 2.1, "Simulations of neutral evolution" (highlighted in the text). We have removed statements suggesting that our approach to simulation is inherently better, stating instead that we simply used a different approach than shuffling. For completeness we have also performed a comparison between false-positive estimates based on our simulations and estimates based on a shuffling approach. These results can be seen in Figure 1, and suggest that shuffling-based approaches can give very different false-positive estimates depending on the number of shuffles (highlighted in the text). We do not intend to claim that our simulation procedure is inherently superior to shuffling, but rather that it provides an alternate, model-based approach.

> A model-based approach for structured RNAs has also been explored by Tanja
> Gesell in her SISSI project, which was recently also applied to RNA gene
> detection: a corresponding preprint by Gesell and Washietl has been
> available on the web for a while already
> (http://www.tbi.univie.ac.at/papers/PREPRINTS/08-01-001.pdf)

Although this web preprint was not published in a journal (or indexed in Pubmed) at the time of our first submission, we thank the reviewer for bringing it to our attention. We have appropriately cited this work in the separate paper describing the simulation tool.

> 2) It is doubtful at best that the ROC curves are informative about the true
> quality of the method since the accuracies are computed from an extremely
> biased set of test data - containing e.g. dozens of virtually identical
> copies of tRNA alignments. In general, ROC make little sense if one cannot
> argue that the classification tasks are at least approximately independent
> or "uniformly distributed" in a suitable sense. I suspect that the lack of a
> generally accepted large unbiased set of *independent* true ncRNA
> alignments (akin e.g.
> to the unrelated protein subset of the PDB) is the
> reason why previous studies of ncRNA gene finders have largely refrained
> from focussing on ROC curves or extensive discussions of sensitivity.

It is an excellent point that we should be careful to make sure that the results of our model evaluation are consistent across different ncRNA families. We accordingly partitioned the set of "true positives" into four sets, 1) tRNAs, 2) miRNAs, 3) snRNAs, snoRNAs and "other", and 4) all but rRNAs. We removed rRNAs from our test set because they are unaligned in the input dataset. After performing a separate evaluation on each set, reported in Figure 2, we found that the relative performance of grammars was consistent across the different test sets. We believe that this consistency of results supports our assertion that our ROC curve evaluation procedure is a good method for comparing the performance of different tools for de novo ncRNA gene annotation.

> 3) The sensitivity of the approach is not at all striking. While performing
> very well on tRNAs (which are nearly 100% conserved in the 12fly dataset),
> the sensitivity on most other RNA classes, in particular microRNAs and
> snoRNAs, is only about 2/3 to 3/4 of Rose et el. (64 vs. 96, and 56 vs 75,
> resp.) The high number of highly conserved predictions, tRNAs and CDS (see also
> below) is reminiscent of the differences of evofold and RNAz in the ENCODE
> satellite paper by Washietl et al. It would therefore be interesting to see
> how the overlap with previous screen depends on sequence conservation.

In order to investigate this possibility, we performed a detailed breakdown of our overlap with the EvoFold and RNAz results based on sequence conservation and found that after correcting for the background distribution of conservation levels across the genome, there was no significant correlation between conservation level and degree of overlap.
These negative results are mentioned in Section 2.7, "Comparison to previous screen" (highlighted in text).

> 4) A closer look at the unfiltered results in the supplement is not
> encouraging. Unless rigorously filtered, the approach returns 172 raw
> sequences of length 1 (!) and 1994 candidates with length 10 or less as
> putative ncRNAs. To this reviewer, such classification results suggest that
> the RNA (sub)models need to be revised.

We provide the unfiltered results with caveats, in order to make available intermediate steps of our analysis. The filtered results are our "final product". However, we can certainly explain the existence of such apparently anomalous results. This is an artifact of alignment-based approaches to gene detection. If a true ncRNA in D. yakuba is aligned to only a single nucleotide in D. melanogaster, and our screen detects this ncRNA in D. yakuba, then it will appear as if there is a ncRNA in D. melanogaster which has length 1, precisely the situation described by the reviewer. It is for this reason that we apply melanogaster-specific filtering criteria: we are ultimately interested in validating our predictions with experiments in melanogaster, so we want to ignore such predictions, even if they correspond to real ncRNAs in the actual paper.

That said, this is closely tied to a real drawback of any phylo-grammar approach, which is that it is alignment-dependent. We try to emphasize this problem throughout the text, and suggest that alignment-independent methods such as Torarinsson and Gorodkin's (2008) are promising.

> 5) It is furthermore highly doubtful - and not properly discussed in the
> manuscript - that the overwhelming majority of the predictions are located
> in coding sequences (CDS). Since coding sequence is expressed almost by
> definition, the high overlap of the current set of predictions with the
> transfrags from Manak is trivial. Is there a significant enrichment in the
> non CDS portion of the dataset?
> Hence unless the author want to argue that
> MOST structures RNA motifs are located within coding regions, their screen
> has an FDR of at least 75%.

Changing the simulation method (to include simulated CDS, among other things) has upwardly revised our false positive estimates. We do not find significant enrichment for (the Manak et al.) transfrags in intergenic regions. We do not find it surprising that many of our predictions are located in CDS regions: (1) they are more conserved and so generate more false positives (now addressed by our improved simulator, though we hypothesize that further increases could be obtained by including protein-level rate heterogeneity); (2) CDS regions align more reliably than intergenic regions; (3) there are many documented examples of RNA regulatory elements in protein-coding regions, particularly in viruses, so the biological hypothesis cannot be ruled out.

The revised manuscript includes a more detailed discussion of possible correlations between our predictions and coding regions. We compared the distribution of distances from our predictions to a uniform background, finding a significant depletion of predictions towards the 3' end of coding regions. We furthermore find a significant depletion within the first intron of protein-coding genes.

> 6) As a methodologcial point, I don't see the benefit in training the
> models from tRNAs and rRNAs, which are the most conserved and most
> easy-to-find ncRNAs. There is no need for a comparative approach to find
> them, one can just as well blast the rRNAs and use tRNAscan-SE. Because
> rRNAs and tRNAs differ in this respect from almost all other ncRNAs, it
> seems counterproductive to gauge the model on them - maybe this explains
> the methods' bias towards well-conserved sequence more than the increase of
> alignment quality with sequence conservation.
While we agree with the reviewer that one should be careful to consider different ncRNA families when evaluating the performance of different gene prediction models (cf. our stratification of ROC plots into four categories based on the target families considered), we respectfully disagree that training on ribosomal RNA is insufficient. Almost all RNA structure prediction tools are trained on tRNAs and rRNAs, not because they are the only target families of interest, but rather because their structures are well-known and accurate structural alignments are available. Effectively training the most parameter-rich part of our model, the substitution matrix of base-pairs, requires a large database of accurate structural alignments, which essentially confines us to tRNAs and rRNAs.

In fact, our manuscript includes results of experiments with a model whose base-pair substitution matrix was trained on a broader subset of RFAM with literature-derived structure annotations, rather than just rRNA alignments (Figure 4). We empirically found that this model performed worse than models which were trained only on rRNA alignments. We hypothesize that this is because rRNA alignments are relatively higher-quality and include fewer non-canonical base-pairs (see Figures 4 and 5).

There are actually many good reasons to train on tRNA and rRNA:

- They are well-studied, and therefore well-annotated.
- There is a lot of sequence available for the same reason.
- They reasonably represent the structural features one expects to see in other ncRNAs. (We are not modeling specific aspects of tRNA or rRNA structure, as is the case with tools like tRNAscan-SE, and the grammar is general enough that RNAs of other structures should be detectable.)
- It is inaccurate that these genes are highly conserved at the sequence level. There is quite some sequence divergence, and plenty of alignable pairs with (e.g.) under 50% sequence identity.
Of course there is structural conservation, but that's exactly what is needed in the training set. In any case, XRate estimates instantaneous substitution *rates* (not probabilities) so that the actual distance between pairs in the training set is not that relevant, as long as they are not exactly identical (one can estimate instantaneous rates equally well from many close pairs as from a few distant pairs).

> 7) The title of the manuscript seems to promise insights on ncRNAs in
> Drosophila. It does not say much in the respect, with the sole exception
> of the brief discussion of the potential role of structured UTRs in mRNA
> localization.

As mentioned earlier, we have added much more analysis of our results, including homology searches against RFAM, investigation of correlations with known protein-coding genes and structural clustering. These include several biologically-oriented results, including the snoRNA story that is now mentioned in the Abstract. We have also expanded the discussion of elements located in UTRs.

> Reviewer #2 (Remarks for the Author):
>
> In this manuscript the authors present an analysis of several related
> probabilistic RNA grammars that include explicit evolutionary models
> to describe multiple sequence alignments of structural RNA genes.
> Using one of the analyzed grammars, they construct a ncRNA genefinder
> program. They also present a set of on-coding RNA
> candidates in Drosophila melanogaster.
>
> ``Are these real or are these artifacts?' According to the abstract,
> that seem to be the main question the authors set themselves to
> answer. I do not see any way in which this paper clarifies the issue.

We have dropped this sentence from the paper and attempted to clarify that our goals for the paper are: 1) the evaluation of different gene prediction grammars, and 2) preliminary biological characterizations of predictions made with one of the best-performing of the tested grammars.
We have performed many additional analyses to strengthen these goals (described later).

> Points:
>
> (1) Despite its length, this paper does not provide the necessary
> information that would make the results presented in the paper
> reproducible.
>
> All the basic questions are unanswered, starting with the most
> trivial: which D. melanogaster genome was used (with release date),
> which other genomes were used in the comparison (we are only told that
> they used 12 other drosophila genomes). Rational for the selection of
> genomes. Overall degree of conservation between the reference genome
> and the other genomes. Program used to create the alignments with
> version number and parameters used. Statistics about the alignments
> such as: number, distribution and coverage respect to the
> D. melanogaster genome, depth, average length, gap statistics, base
> composition statistics, etc...
>
> I believe the only information we are given is in the Introduction:
> ``Using a combination of the better refinements, we scan a multiple
> alignment of twelve Drosophila genomes'. A statement hardly leading
> to a reproducible result, or to an understanding of the
> characteristics of the alignment.

In fact, all of this information *was* just one step away from our paper in the citation graph (via the references to the Drosophila comparative genomics papers), but we agree that it is important for reproducibility to make this information as obvious as possible. Accordingly, we have amended the manuscript to include this. We have added Section 4.1, "Sequence and alignment data" (highlighted in the text), which gives much of this information, including lists of genomes used, release information, location of alignments (which were produced by the Drosophila 12 Genomes Consortium rather than us), etc.

A full characterization of the sequence data for the twelve Drosophila genomes and the alignments can be found in the papers Clark et al. and Stark et al.
2008 announcing the sequencing and initial analysis of these genomes. Tables 1 and 2 of the Clark et al. paper give detailed statistics on coverage, assembly quality, etc. We can include more of the statistics given in these references if the reviewer feels this would help clarify matters.

> (2) The null model is not properly described, thus their study of
> specificity is by not means convincing. Generating synthetic
> sequences can be a great tool, however here it requires a leap of
> faith on the part of the reader to asses the relevance of their method
> for estimating false positives. Other than saying that 'it was trained
> on drosophila', we have no information about the type of alignments
> that their generative model produces. How many alignments were
> generated, how did they compare to real alignments in length,
> gappiness, number of sequences, base composition
> heterogeneities... The paper repeatedly states the ``realism' of
> their null model and the ``sophistication' of their sequence
> simulator, however, the reader has no means to ratify that type of
> assertment.

Thank you for pointing out this oversight. We have added Tables 1 and 2, which compare the genome-wide statistics (% ID, % gap, % coding, % intronic) of our simulated data with those of the PECAN alignments, as well as the single and di-nucleotide frequencies. We also mention in Table 1 that while our simulated data models heterogeneity in base composition across different genomic features such as coding and intergenic sequence, it does not model local fluctuations in base composition.

> The shuffling method is not used because ``excessive shuffling can
> destroy local correlations', thus it should be at least good as a
> lower bound on the number of false positives predicted. If another
> method is to be used, it should be at least compared to a low-bound
> method as shuffling.
> The authors should convince the reader with
> statistical information that those synthetic alignments have anything
> to do with the null hypothesis they are trying to test. Even more so
> when they are predicting zero false positives! (Table 3). One has to
> wonder, what about the almost fourteen thousand predictions in coding
> regions?

We have made two improvements to address the reviewer's suggestion (the below text is copied from the response to reviewer 1):

Based on results from the SIMGENOME preprint referenced in the text, we have changed our simulation procedure to model genomic features which were not included in the previous approach, including coding regions and small islands of conservation. This gives a much more stringent benchmark and dramatically increased our false-positive estimates reported in Table 5.

We have removed statements suggesting that our approach to simulation is inherently better, stating instead that we simply used a different approach than shuffling. For completeness we have also performed a comparison between false-positive estimates based on our simulations and estimates based on a shuffling approach. These results can be seen in Figure 1, and suggest that shuffling-based approaches can give very different false-positive estimates depending on the number of shuffles. We do not intend to claim that our simulation procedure is inherently superior to shuffling, but rather that it provides an alternate, model-based approach.

> This affects also their study of the different grammars tested which
> is also based on uncharacterized simulated data compared against
> uncharacterized alignment(s) that contains (presumably) mostly tRNAs.

As noted above, the results are now broken down by constituent ncRNA family. We believe that the consistency of the results across families supports our approach.
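To make the per-family breakdown concrete, here is a minimal Python sketch of how ROC curves can be computed separately for each family of true positives against one shared set of null scores. It is an illustration only, not the evaluation code used for the paper: the function names, the use of a single score per candidate, and the trapezoidal area summary are all assumptions made for the example.

```python
def roc_points(pos_scores, neg_scores):
    """Return (FPR, TPR) pairs, one per distinct score threshold.

    A candidate is called positive when its score is >= the threshold.
    Points are ordered from the strictest threshold to the most lenient,
    so both coordinates are non-decreasing.
    """
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Trapezoidal area under ROC points ordered by increasing FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def auc_by_family(pos_by_family, neg_scores):
    """Area under the ROC curve for each family of true positives.

    pos_by_family maps a family label (e.g. "tRNA") to the scores of its
    annotated members; neg_scores holds scores on null (simulated) data.
    """
    return {family: auc(roc_points(scores, neg_scores))
            for family, scores in pos_by_family.items()}
```

Running `auc_by_family` once per grammar and checking that the grammars rank in the same order within every family is one simple way to express the consistency argument made above.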
> In fact, for any given null model, selecting a program's score that
> meets one's acceptable specificity is independent of testing on any
> number of known trues (real ncRNAs). It does though depend on the
> testing sample size. One should create a dataset of decoys of similar
> characteristics and size than the real test. One can then analyze the
> predicted RNAs from the dataset of decoys as a function of the score,
> and select the appropriate score that would produce whatever is
> considered an acceptable number of false positives. I do not believe
> that any similar analysis has been performed in this work.

We respectfully disagree with the reviewer's suggested technique for evaluating our models. We believe that the most effective way to evaluate the performance of a gene-prediction tool is to test it on curated true positives, i.e. annotated biological data. While simulation of the true positive set can certainly provide interesting information such as possible upper-bounds on the performance of a methodology, we believe that this approach can lead to biased results which are not as closely tied to our target application of predicting ncRNAs in Drosophila.

Specifically, the nature of the bias with synthetic "true positive" datasets is that they must be generated using a particular simulation model (or, equivalently, a choice must be made as to which statistical characteristics of the known ncRNAs one wishes to reproduce in the decoy set). A fair comparison between the probabilistic models used for *annotation* (i.e. inference) becomes more difficult, because the inference model that is closest to the simulation model has an inherent advantage.
Furthermore, there is insufficient evolutionary distance in Drosophila to accurately estimate the slowest rate parameters in a general 16*16 covariant base-pair substitution model (for example), so in order to generate synthetic decoys, one would have to make somewhat arbitrary decisions about bringing in further examples from another source, which could bias the results away from Drosophila. These may not be insurmountable difficulties, but they do introduce a whole host of additional questions and problems.

In practice, we think that the reviewer's concerns are empirically addressed by our detailed breakdown of ROC curves by ncRNA gene family, and by our now-deeper examination of the statistical characteristics of the null simulated data. Together, these revisions address the reviewer's concern about "uncharacterized simulated data compared against uncharacterized alignment(s) that contain ... mostly tRNAs".

> (3) Answering the question of whether a list of candidates is real or
> not requires much more work than what the authors present. Of course,
> if they truly believe that they have zero false positives, their job
> is complete as it is. But a paper should not be a matter of believe,
> but of convincing oneself and the reader with evidence. Even before
> getting into experimental testing, there is a large number of
> computational tests the authors should do in order to further
> characterize the candidates. How many of those could be misspredicted
> or missannotated CDSs (very important considering the number of
> candidates they produce in coding regions). How many are homologous to
> known ncRNA genes that are not properly annotated in the drosophila
> genome? How about the genomic context of the different candidates in
> other organisms? How far from drosophila can a candidate be found? Do
> any of the candidates come in families? etc...

These are excellent points and we have tried to undertake all of the reviewer's suggestions.
We have investigated possible correlations with coding regions by looking at the distance from our intergenic predictions to the nearest protein-coding genes; we found a significant depletion towards the 3' end of coding regions. We additionally found a statistically-significant reduction in the number of predictions in the first intron of protein-coding genes compared to what is expected by chance.

We compared our predictions in intergenic regions to RFAM covariance models in order to search for homology to known ncRNA families. We found one particularly-promising result, a tandem array of 17 predictions homologous to the family snoR28 in a region on the X chromosome spanned by a cDNA in FlyBase, suggestive of a tandem array of snoRNAs within an intron of an unannotated protein-coding gene.

We searched for possible families among our top 500 predictions using a structure-based clustering approach, and found many shared short motifs but no large-scale homology.

> In summary, this paper despite providing a large number of names for
> programs and scripts lacks any type of relevant information to make
> the results reproducible. The screen is poorly described, and the
> analysis of specificity very unconvincing. The characterization of
> candidates is very superficial.

We have tried to address all of these points, and believe that the paper is substantially improved as a result. We thank both reviewers for their helpful comments.