PGGP - Strategy 5

This project is a collaboration between The Forsyth Institute (TFI) and The Institute for Genomic Research (TIGR), and is funded by the National Institute of Dental and Craniofacial Research (NIDCR).

Predicted coding region identification and annotation

The TIGR informatics group will provide support during the initial stages of predicted coding region identification, similarity searches, and organization of the data into functional groups. This work will be augmented by analyses performed by the co-investigators at The Forsyth Institute, who will focus specifically on issues of relevance to P. gingivalis.

The annotation of the H. influenzae and M. genitalium genomes has given us the opportunity to evaluate a variety of annotation methods and tools. For H. influenzae we used the GeneMark program (Borodovsky and McIninch, 1993) to identify potential coding regions. GeneMark makes use of codon frequency matrices and was trained on a set of 122 H. influenzae coding sequences from GenBank. Our analysis of the M. genitalium genome relied on more traditional open reading frame (ORF) analysis based on the ORF prediction software incorporated in GeneWorks (IntelliGenetics, Inc.). We will continue to evaluate new approaches to coding region identification as they become available, using the H. influenzae and M. genitalium genomes as test sets.

Currently, predicted protein coding regions are first defined by searching for ORFs longer than 100 codons (the brute-force method). This initial set of ORFs is used for similarity searches as described below. Coding potential analysis of the entire genome will also be done with GeneMark software (Borodovsky and McIninch, 1995) that has been trained on a set of P. gingivalis ORFs with reliable database matches, producing a fourth-order Markov model. The sets generated by both methods will be cross-checked and combined. ORFs with low GeneMark coding potential and no database match will be eliminated from further consideration.

Searches of the predicted coding regions are performed with BLAZE (Brutlag et al., 1993). The protein-protein matches are aligned with Praze, a modified Smith-Waterman (Waterman, 1988) algorithm that maximally extends regions of similarity across frameshifts.
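As an illustration only, the brute-force ORF search and the elimination rule described above might be sketched in Python as follows. This is not the TIGR pipeline. The function names, the use of ATG as the only start codon, the measurement of ORF length from start codon to stop codon, and the 0.5 coding-potential cutoff are all assumptions made for the example.

```python
from typing import Iterator, NamedTuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG"}  # assumption: alternative bacterial starts are ignored
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Orf(NamedTuple):
    strand: str   # "+" for the given sequence, "-" for its reverse complement
    frame: int    # reading frame on that strand: 0, 1, or 2
    start: int    # 0-based offset of the start codon on the scanned strand
    codons: int   # length in codons, stop codon excluded


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 100) -> Iterator[Orf]:
    """Yield ORFs longer than min_codons in all six reading frames."""
    seq = seq.upper()
    for strand, dna in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None  # offset of the first start codon since the last stop
            for pos in range(frame, len(dna) - 2, 3):
                codon = dna[pos:pos + 3]
                if orf_start is None and codon in START_CODONS:
                    orf_start = pos
                elif orf_start is not None and codon in STOP_CODONS:
                    n_codons = (pos - orf_start) // 3
                    if n_codons > min_codons:
                        yield Orf(strand, frame, orf_start, n_codons)
                    orf_start = None


def keep_orf(coding_potential: float, has_database_match: bool,
             min_potential: float = 0.5) -> bool:
    """Drop an ORF only if it has low coding potential AND no database match.

    min_potential is a hypothetical cutoff; the text does not state one.
    """
    return has_database_match or coding_potential >= min_potential
```

For example, `list(find_orfs(genome_sequence))` would give the initial ORF set, and `keep_orf` could then be applied once a coding-potential score and a database-match flag are available for each ORF.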

Gene identification is facilitated by searching against an in-house protein database called NRAA. NRAA is composed of 1) protein sequences from the Swiss-Prot, GenPept and PIR public archive databases that have been curated at TIGR for sequence redundancy within species, gene common name, source taxon, biological role and enzymatic function, and 2) additional sequences from Swiss-Prot, GenPept and PIR that have not yet been filtered through the TIGR curation process.

Assignment of putative identities to ORFs is a two-step process. First, the search results from all ORFs against NRAA are evaluated. Second, those ORFs assigned a putative identification are re-evaluated in the context of biological pathways. These two steps assign a gene common name and biological role to each ORF, and save a link to the match sequence, the percent identity and similarity to the match sequence, and the alignment of the ORF with the match sequence. The ‘intergenic regions’ are searched against a data set of all available peptide sequences from Swiss-Prot, the Protein Information Resource (PIR), GenBank, and the Prosite database. Each putatively identified gene is assigned to one of 102 role categories adapted from Riley (1993).
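The percent identity and similarity saved for each match could be derived from the stored alignment along the following lines. This is a minimal sketch under stated assumptions, not the method used at TIGR. The function name, the residue groups that count as "similar", and the choice of alignment columns as the denominator are all hypothetical.

```python
from typing import Tuple

# Hypothetical groups of residues treated as "similar"; a production pipeline
# would more likely score similarity with a substitution matrix.
SIMILAR_GROUPS = ("ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG")


def percent_identity_similarity(aligned_orf: str,
                                aligned_match: str) -> Tuple[float, float]:
    """Return (percent identity, percent similarity) for two aligned sequences.

    Both strings must have equal length and use '-' for gaps. Percentages are
    taken over all alignment columns, gapped columns included (an assumption).
    """
    if len(aligned_orf) != len(aligned_match):
        raise ValueError("aligned sequences must have the same length")

    columns = identical = similar = 0
    for a, b in zip(aligned_orf.upper(), aligned_match.upper()):
        if a == "-" and b == "-":
            continue  # column with no residues carries no information
        columns += 1
        if a == "-" or b == "-":
            continue  # a gap counts toward length but is never a match
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1

    if columns == 0:
        raise ValueError("alignment contains no residues")
    return 100.0 * identical / columns, 100.0 * similar / columns
```

Other conventions are common, for instance dividing by the length of the shorter sequence, so any figures produced this way should be reported together with the convention used.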

This page is created and maintained by Drs. Margaret Duncan, Floyd Dewhirst, and Tsute Chen, Department of Molecular Genetics, The Forsyth Institute.

Last modified on 02/20/2002