The American Academy of Microbiology convened a colloquium on July 19-20, 2004, in Washington, DC, to address the critical challenge of prokaryotic genome annotation and to seek ways to accelerate progress in the field. Recent advances in DNA sequencing have produced a spectacular amount of new data; hundreds of thousands of sequenced prokaryotic genes now await annotation. These genes can be enumerated, compared, and grouped by sequence similarity into families, yet an understanding of their biochemical functions is lacking. Genomics provides that rare opportunity in science where the boundaries of current knowledge can be clearly defined. The annotation initiative proposed in this document will extend those boundaries and will likely lead to new applications and new progress in healthcare, biodefense, energy, the environment, and agriculture. This research could also impact many commercial enterprises, such as the chemical, food, and dairy industries.
Colloquium participants included microbiologists, biochemists, and bioinformaticians. Observers from the National Institutes of Health, the National Science Foundation, the U.S. Department of Energy, the National Aeronautics and Space Administration, the Office of Science and Technology Policy, and the U.S. Department of Agriculture were also in attendance. Participants discussed the currently available sources of genome annotation information and the strengths and limitations of those sources. Four areas of concern in genomic annotation were identified:
- As many as 40% of all predicted genes in completed prokaryotic genomes have no functional annotation.
- Many genes have a predicted function, but that prediction has not been experimentally validated.
- As many as 5-10% of predicted gene functions may be incorrect.
- Many known enzymes have no corresponding genes identified in the sequence databases.
Much of the currently available annotation information is provided by computer programs that predict the functions of newly sequenced genes on the basis of their similarity to genes of known (or predicted) function. This technique is inherently limited in both breadth and accuracy by the small size of the core foundational set of genes with experimentally established functions. By expanding that foundational set through a systematic program of biochemical study of genes of unknown function, we can dramatically increase the quality of prokaryotic genome annotations, and enhance our understanding of current and future genome sequences.
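The dependence of similarity-based annotation on the foundational set can be illustrated with a minimal Python sketch. It is not the method of any particular annotation program: the gene names, similarity scores, function labels, and the `min_score` cutoff are all hypothetical, and real pipelines use alignment statistics rather than a single score. The sketch only shows that a query gene receives an annotation when a sufficiently similar gene of known function exists, so adding one experimentally characterized gene can annotate genes that were previously out of reach.

```python
from typing import Optional


def transfer_annotation(
    query: str,
    similarity: dict[tuple[str, str], float],
    known_functions: dict[str, str],
    min_score: float = 0.4,  # hypothetical cutoff, chosen only for illustration
) -> Optional[str]:
    """Return the function of the most similar characterized gene, or None."""
    best_gene, best_score = None, min_score
    for gene in known_functions:
        # Pairs that were never compared are treated as having no similarity.
        score = similarity.get((query, gene), 0.0)
        if score >= best_score:
            best_gene, best_score = gene, score
    return known_functions[best_gene] if best_gene is not None else None


# Hypothetical similarity scores between unannotated and characterized genes.
similarity = {
    ("geneX", "geneA"): 0.72,
    ("geneY", "geneB"): 0.55,
}

# Foundational set: genes with experimentally established functions.
foundational = {"geneA": "alanine racemase"}
before = {q: transfer_annotation(q, similarity, foundational) for q in ("geneX", "geneY")}

# One new experimental characterization expands the foundational set.
foundational["geneB"] = "acetate kinase"
after = {q: transfer_annotation(q, similarity, foundational) for q in ("geneX", "geneY")}
```

In this toy data set, `geneY` cannot be annotated until `geneB` has been characterized experimentally. The same mechanism propagates errors: if the function recorded for `geneA` were wrong, `geneX` would inherit the wrong annotation.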
The experimental elucidation of function for a hypothetical gene can be a significant challenge for the biochemist. However, in the past five years new bioinformatics techniques, mostly based on comparative genomics, have been developed that can provide clues about the function of a gene. Functional genomics methods, such as gene expression chips, can also provide hints about gene function. Such clues can greatly accelerate experimental studies by suggesting plausible hypotheses to be tested in the laboratory.
Colloquium participants agreed that accurate and complete annotation is vital to making full use of genomic data. However, there are great deficiencies in currently available annotation sources. Moreover, there are few sources of dedicated funding for experimental approaches to annotation. In light of these facts, it was recommended that a new initiative be undertaken that would combine computational methodologies for functional prediction with a systematic experimental approach to test those predictions. It would also broaden the foundational set of experimentally determined gene functions by finding missing genes for known enzymatic functions. Such a program would both increase experimental knowledge and improve the accuracy of bioinformatics prediction, leading to repeated cycles of prediction and validation.
As part of the proposed initiative, a new resource focused on annotation should be developed. The central component is a database containing:
- Predictions regarding the functions of genes of unknown function, deposited by bioinformaticians, based on computationally inferred clues, which will serve as a starting point for experimental investigations.
- The results, positive or negative, of those experimental investigations, which in many cases will establish new gene annotations backed by rigorous experimental work.
- A prioritized list of sequenced genes for which no functional information is currently available.
- A list of biochemically-characterized functions for which no gene has yet been assigned (referred to as orphan functions).
- Data on previously characterized proteins currently in the public databases.
The basic design of the database was discussed, and recommendations for hosting, administration, and management of the database were put forth.
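As a rough illustration of how the five kinds of content listed above might fit together, the following Python sketch models them as simple record types. The colloquium left the actual design to the researchers who would build the resource, so every class, field, and method name here is an assumption made for illustration, not a proposed schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    """Result of an experimental test; negative results are stored too."""
    CONFIRMED = "confirmed"
    REFUTED = "refuted"


@dataclass
class Prediction:
    gene_id: str
    predicted_function: str
    evidence: str        # e.g. the computational clue behind the prediction
    submitted_by: str


@dataclass
class ExperimentalResult:
    gene_id: str
    tested_function: str
    outcome: Outcome


@dataclass
class AnnotationResource:
    predictions: list[Prediction] = field(default_factory=list)
    results: list[ExperimentalResult] = field(default_factory=list)
    priority_genes: list[str] = field(default_factory=list)    # highest priority first
    orphan_functions: list[str] = field(default_factory=list)  # functions with no gene
    characterized: dict[str, str] = field(default_factory=dict)  # gene -> known function

    def untested_predictions(self) -> list[Prediction]:
        """Predictions for which no experimental result has been deposited."""
        tested = {(r.gene_id, r.tested_function) for r in self.results}
        return [
            p for p in self.predictions
            if (p.gene_id, p.predicted_function) not in tested
        ]


# Hypothetical entries, for illustration only.
resource = AnnotationResource()
resource.predictions.append(
    Prediction("gene001", "acetate kinase", "conserved gene neighborhood", "lab_a")
)
resource.predictions.append(
    Prediction("gene002", "alanine racemase", "phylogenetic profile", "lab_b")
)
resource.results.append(
    ExperimentalResult("gene001", "acetate kinase", Outcome.REFUTED)
)
open_targets = [p.gene_id for p in resource.untested_predictions()]
```

Keeping refuted predictions as first-class records reflects the point made above that both positive and negative results belong in the database; here `open_targets` contains only `gene002`, the prediction that still awaits an experiment.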
Achieving an accurate and detailed annotation of a newly sequenced genome is a critical, but often difficult, step in the process of analyzing the sequence data. This is especially difficult for organisms where genetic tools have yet to be developed. Unfortunately, the pace of experimental elucidation of gene function is very slow compared to the pace of sequencing and computational prediction of function. Thus, rather than attempt to experimentally explore the functions of every unknown gene in every sequenced genome, it is preferable to focus experimental investigations on the most informative targets. For instance, the scope and accuracy of existing bioinformatics techniques would be greatly enhanced by obtaining one or a few experimental functions for members of gene families that are found in many organisms. This experimental annotation initiative will encourage and enable experimental biochemists to participate in the annotation of prokaryotic genomes.
The initial focus of this particular initiative would be prokaryotes (bacteria and archaea) because (a) they possess relatively small genomes comprising genes that are usually easily defined, (b) a great deal of prokaryotic genomic sequence data is available in the public domain, and (c) they are experimentally tractable. Schemes need to be developed to determine which among the prokaryotic gene products and orphan proteins should receive attention first. In one possible plan, priority would be given to families of similar genes that are found in many different genomes, because determining a biochemical function for one member will likely implicate all family members as possessing the same or a similar function.
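The "one possible plan" described above can be sketched in a few lines of Python. This is only one reading of that plan, under stated assumptions: family membership is taken as given, families that already contain an experimentally characterized member are skipped, and ties are broken alphabetically. The family, genome, and gene identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def rank_families(
    members: Iterable[tuple[str, str, str]],
    characterized: set[str],
) -> list[str]:
    """Rank uncharacterized gene families by how many genomes they occur in.

    members: (family, genome, gene) triples.
    characterized: genes whose function has been established experimentally.
    """
    genomes = defaultdict(set)
    has_known_member = set()
    for family, genome, gene in members:
        genomes[family].add(genome)
        if gene in characterized:
            has_known_member.add(family)
    candidates = [f for f in genomes if f not in has_known_member]
    # Most widely distributed families first; family name breaks ties.
    return sorted(candidates, key=lambda f: (-len(genomes[f]), f))


# Hypothetical family assignments across four genomes.
members = [
    ("famA", "genome1", "g1"), ("famA", "genome2", "g2"), ("famA", "genome3", "g3"),
    ("famB", "genome1", "g4"), ("famB", "genome4", "g5"),
    ("famC", "genome1", "g6"), ("famC", "genome2", "g7"), ("famC", "genome3", "g8"),
]
ranking = rank_families(members, characterized={"g7"})
```

Here `famC` is excluded because one of its members already has an experimental function, and `famA` ranks ahead of `famB` because it occurs in more genomes. A real scheme would likely weigh additional criteria.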
The details of the bioinformatics part of the initiative, such as database design and operation, should be left to the discretion of those researchers who apply for funding to construct it, but certain broad recommendations for the content and administrative aspects of the resource were formulated by the colloquium participants. For example, the database should include not only protein gene products but also functional RNA products. It was stressed that the input of both bioinformaticians and experimentalists would be vital to the success of the initiative, and their collaboration should be encouraged. The creation of an external database advisory board was also recommended. Funding would be required to support the bioinformaticians who will make and evaluate the bioinformatics predictions and generate and maintain the computational resource. However, the largest requirement for funding would be to support the experimental biochemical work testing bioinformatics predictions. It was proposed that one or more pilot projects be undertaken to assess the feasibility of the approach before embarking on a large-scale initiative.
The potential impact of the proposed initiative is difficult to overstate since it would affect all aspects of biology. The participants felt that this project is essential to enable the next step in moving genomic science forward from accumulating a large repository of sequences towards achieving a true understanding of the basic elements of prokaryotic biology. Without a forward-looking initiative like the one proposed here, the functional data needed to propel systems biology forward will not be available, and those trying to understand the complex interactions of genes and their products in living cells will continue to work with many components of unknown function. In addition, elucidating the enzymatic functions essential for prokaryotic life will impact our understanding of eukaryotic organisms, which possess many of these same genes. This initiative will also foster closer collaborations between experimental and computational scientists and help to reinstate the importance of biochemical research. Finally, although much of the project will focus on traditional biochemistry, the initiative can be expected to stimulate new advances in functional screening, new functional genomic technologies such as phenotype arrays, and significant industrial and commercial opportunities in the form of new targets for both medical and industrial applications of prokaryotic biology.