November 9, 2000 - Microbial Genome Sequencing Analysis and Annotation

Representatives from "the Microbe Project," a federal Interagency Working Group under the aegis of the National Science and Technology Council, requested advice from the American Society for Microbiology (ASM) on several highly focused, microbial genomic sequence annotation-related issues. In response to that request, the ASM Public and Scientific Affairs Board (PSAB) organized an ad hoc committee, several members of which met on November 9, 2000, to review those issues and to develop provisional recommendations as part of the ASM's response. Before the ad hoc committee met, ASM sought advice on these annotation-related issues from a broad and representative sample of microbiologists who are actively conducting research on microbial genomics. Many of these researchers provided detailed responses for the ASM ad hoc group to review during the November meeting, and those comments were carefully considered in developing this response to the Interagency Working Group's request. ASM is pleased to provide the following provisional recommendations and would be happy to provide further help in response to future requests from the working group.

At least 30 separate microbial sequencing projects are completed and another 100 or more are under way. Microbial genomic database development now is being supported by several federal agencies, including the National Institute of Allergy and Infectious Diseases (NIAID) within the National Institutes of Health (NIH), the Department of Energy (DOE), and the National Science Foundation (NSF). Microbial genomic databases that are being generated through these federally sponsored efforts are represented on several different Web sites that are managed by various entities in the United States and Europe. For example, the National Center for Biotechnology Information (NCBI) within NIH maintains completed genomic databases, including those derived from microorganisms, on its Web site. Additional Web sites managed in the private sector present voluminous microbial data sets that are annotated to varying extents.

Some industry efforts to generate microbial genomic data sets are not presented on freely accessible Web sites but, instead, are held in confidence or distributed on a limited basis for commercial purposes.

Defining and Implementing Annotation

The ASM ad hoc committee recommends that attention be given to carefully defining "annotation" as a first step toward developing a plan to coordinate microbial genome databases, particularly because there appears to be little consistency in how this term is being used among different researchers working on microbial genomic-related projects. Moreover, the information being annotated is complex and extremely diverse, making it that much more important to develop a commonly understood definition of this term.

As a starting point for defining this elusive term in the context of microbial genomics, annotation can be said to entail assembling information of several distinctive types, starting with refined DNA sequence data but extending beyond that level to varying degrees of complexity. For example, completed DNA sequence information may be segmented into distinct intervals, each demarcated by the specific type of "product" it encodes, such as proteins, transfer RNAs (tRNAs), and phage sequences. A particular gene of the first type may be annotated in such terms as its protein coding region, its transcript, its promoter region, and so forth. At a higher level of annotation, a protein that is encoded by a particular gene may be annotated in terms of its physical attributes, such as molecular weight, membrane-spanning regions, structural domains, or three-dimensional structure. Moreover, annotation at the level of comparative biology may include information linking a particular protein from a specific microorganism to similar proteins from other organisms or to members of similar protein families. Genes may also be annotated at a functional level, in terms of their respective roles in cellular metabolism, a particular Enzyme Commission (EC) number designation, protein-protein interactions, and expression profiles. In addition, the organisms from which these genomes derive have associated data types, including scientific name, strain, geographic incidence, optimal growth conditions, Gram stain, and so forth.
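
The levels of annotation described above can be pictured as a layered record. The following is only a minimal sketch in Python, not a proposed standard: every class name, field name, and value is a hypothetical chosen for illustration, and a real schema would need many more fields and controlled vocabularies.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feature:
    """One demarcated interval of a finished sequence (hypothetical schema)."""
    start: int                 # 1-based, inclusive coordinate
    end: int                   # 1-based, inclusive coordinate
    strand: str                # "+" or "-"
    product_type: str          # e.g. "protein", "tRNA", "phage"
    # Physical attribute of an encoded protein, if known.
    molecular_weight_da: Optional[float] = None
    # Comparative level: identifiers of similar proteins or protein families.
    similar_to: List[str] = field(default_factory=list)
    # Functional level.
    ec_number: Optional[str] = None
    metabolic_role: Optional[str] = None

    def length(self) -> int:
        """Length of the interval in base pairs."""
        return self.end - self.start + 1


@dataclass
class Organism:
    """Organism-level data types (hypothetical schema)."""
    scientific_name: str
    strain: str
    gram_stain: Optional[str] = None
    optimal_growth_temp_c: Optional[float] = None


@dataclass
class GenomeAnnotation:
    """An organism together with its annotated sequence features."""
    organism: Organism
    features: List[Feature] = field(default_factory=list)

    def features_of_type(self, product_type: str) -> List[Feature]:
        """Return all features that encode the given type of product."""
        return [f for f in self.features if f.product_type == product_type]


# Placeholder values only; none of this describes a real organism or gene.
genome = GenomeAnnotation(
    organism=Organism("Examplebacter hypotheticus", "X1", gram_stain="negative"),
    features=[
        Feature(101, 400, "+", "protein", ec_number="1.1.1.1"),
        Feature(500, 575, "-", "tRNA"),
    ],
)
print(len(genome.features_of_type("protein")))
```

Keeping the higher-level fields optional reflects the point made above that annotation extends beyond the sequence "to varying degrees of complexity": a record remains valid when only the lower levels have been filled in.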

Annotation goes beyond assigning these multi-level classifications of data. For instance, investigators with other expertise can be expected periodically to interrogate annotated data sets and to reinterpret or augment their contents in a process that will enrich their complexity and presumably their usefulness. Thus, care needs to be given to how such additional interpretations of the resident annotations within a data set are handled and credited. In general, currently available archival systems are not designed to reflect how such processes occur.
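
One way to handle and credit successive reinterpretations is to store each annotated field as an append-only history rather than a single overwritable value. The sketch below is an assumption-laden illustration, not a description of any existing archival system; the class names, field names, contributors, and dates are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Assertion:
    """A single interpretation of a field, with the party credited for it."""
    value: str        # the interpretation being asserted
    contributor: str  # who is credited with this interpretation
    evidence: str     # free-text basis for the interpretation
    date: str         # ISO 8601 date string


@dataclass
class AnnotatedField:
    """A named annotation field whose revisions are kept, never overwritten."""
    name: str
    history: List[Assertion] = field(default_factory=list)

    def revise(self, assertion: Assertion) -> None:
        """Record a new interpretation while preserving earlier ones."""
        self.history.append(assertion)

    def current(self) -> Optional[Assertion]:
        """Return the most recent interpretation, or None if there is none."""
        return self.history[-1] if self.history else None

    def contributors(self) -> List[str]:
        """Return credited contributors in order of first contribution."""
        seen: List[str] = []
        for a in self.history:
            if a.contributor not in seen:
                seen.append(a.contributor)
        return seen


# Placeholder contributors, evidence, and dates for illustration only.
function = AnnotatedField("function")
function.revise(Assertion("hypothetical protein", "lab_a", "automated prediction", "2000-11-09"))
function.revise(Assertion("putative kinase", "lab_b", "sequence similarity", "2001-01-15"))
print(function.current().value, function.contributors())
```

Because earlier assertions are retained, a later reader can see both the current interpretation and who contributed each step toward it.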

Technology development per se is not a requirement for effective bacterial annotation. While any methodology will certainly employ some software and database development that has yet to be completed, the issue is much more a problem of appropriately applying the software and databases that exist today. The analogy to other aspects of biology is appropriate: most molecular biology laboratories share a nearly universal set of tools to accomplish their work. Enzymes can be ordered from catalogs, and most labs have centrifuges, growth chambers, and electrophoretic units that allow them to do much the same work. It is the application of these and other commonly available technologies that underlies the success of each laboratory, not the development of a new technology.

Assuming that the representatives of different research constituencies can develop a more consistent if not altogether consensus definition of datasets for annotation, additional care needs to be given to developing standardized means for encoding, evaluating, and even rejecting, when appropriate, new and often complex information that is being considered for inclusion in dynamic data sets. Identifying, adopting, and then uniformly applying a series of standard operating procedures (SOPs) should be considered a key component of this critical, evaluative process. In addition, a process will be needed whereby specific analytic tools are "converted" for use as SOPs. Many imaginative, highly useful analytic algorithms already are available, suggesting that emphasis need not be put so much into developing new algorithms as into refining and more extensively applying those that are available.

Centralized or Unified Facility for Handling Annotated Databases

The ASM ad hoc committee did not reach a ready consensus on the overall practicality of establishing a central facility for handling annotated databases. Nonetheless, its members expressed wide agreement over the value of moving toward a centralized, or at least unified, facility or clearinghouse under a single administrative structure through which thoroughly annotated microbial genomic-related data can be made widely available to the research community. No one is comfortable with the idea of one single user facility that would impose bacterial annotations on the community. The alternative (fully decentralized facilities, such as a series of completely independently maintained Web sites) is seen as tending to encourage the development of "fiefdoms." Reliance on separate operations through such fiefdoms would tend to undermine vital efforts to institute uniform, high-quality standards for the annotative process and might also interfere with efforts to ensure that those annotated data sets are made widely available to the research community.

A centralized or unified structure might be used to bring disparate individuals and groups together to help in achieving some of the goals associated with instituting a successful, uniform, annotative process, including fostering the development and enforcing the use of SOPs. The ad hoc group urges members of the federal Microbe Project to review several historical examples of relevant accrediting groups within the biological sciences, including those charged with developing and enforcing rules for naming microorganisms and for systematically assigning EC numbers to enzymes. Although these bodies and the rules that they formulate do not operate perfectly, they do provide models for how research community-generated, standard-setting operations can help in bringing essential order into rapidly expanding, sometimes chaotic scientific specialty areas. They also provide examples of useful enforcement procedures: for instance, journal editors reject manuscripts when investigators fail to conform to standardized microbial species nomenclature.

Another aspect of such a centralized facility that would need to be fully explored is which of the tools and services at its disposal should be made available to the scientific community. Services might include database hosting and standardized annotation. A mechanism for community feedback, particularly on facility-generated annotation, would be important.

A specific challenge that needs to be addressed soon is how to link the many draft microbial sequences that are now emerging. Many of these essentially complete genome sequence databases are going to be inaccessible, or their updates will lag behind during the gap-filling and assembly process. A comprehensive database that reports on these draft sequences would be of immediate use. The goal of this reporting function would be to prevent needless duplication of effort by stakeholders in a particular genome, and to nucleate the annotation communities for "orphan" genome sequences by putting interested parties in touch with one another.

The ASM ad hoc group agrees that an effort to build a brand new, freestanding facility to handle microbial genome annotations is not warranted. Although such an operation could be housed in a single building as part of an established entity, it may well be developed among geographically scattered sites as a virtual but unified (in terms of its operations) facility. In any case, adequate resources will be needed to guarantee its sustained operations for at least a decade. Indeed, this need for sustained operation is an important argument against a fully decentralized facility, some of whose "nodes" can readily be lost as investigators who maintain them turn their attentions elsewhere, lose support for those or other activities, or abandon maintenance efforts for other reasons.

The ASM ad hoc group also urges that substantial resources be committed for these microbial genome annotation efforts. At one level, the entity responsible for this task will need to support specialized personnel, including postdoctoral-level researchers, whose services may be needed to develop and implement SOPs or to provide additional services on behalf of users of the annotated data sets. At another level, this effort will fail unless there is a serious and substantial commitment of resources to sustain these efforts for a multiyear period of at least five to ten years.

Although the ASM ad hoc group briefly considered several alternative institutions in which this unified facility might be housed, they decided against recommending any one of them to the federal interagency group. Instead, the ad hoc group members urge the interagency group to consider several different types of organization and to choose among them on a peer-review basis, such as by publishing a request for applications or through a competitive contract review. Among the types of organization that might compete for this multiyear task are federal institutions such as NCBI; not-for-profit biological resource centers such as the American Type Culture Collection; various private-sector enterprises, including the Institute for Genomic Research, Double Twist, Integrated Genetics, and a variety of other companies that conduct genomic analysis and develop databases and similar analytic tools for use by biologists; and universities with centers that specialize in genomic research.

Accessibility Issues

Although the ASM ad hoc group recommends that the federal Microbe Project consider applicants from the private sector as suitable candidates for annotating and maintaining microbial genome data sets, it does so with the understanding that those data sets are to be made fully available to the broad community of researchers whose interests the facility will serve. In other words, control over the data sets is to remain in the public domain, even if the project itself is managed by an entity from the private sector.

Other entities in the private sector that are conducting microbial genomic research should be encouraged to furnish data sets and to participate in the annotation of data sets within the overall collection. It may be useful to develop prototype agreements, similar in concept to material transfer agreements, to facilitate the transmission of such information for wide use through this system. Still other entities in the private sector should be encouraged to make their analytic or other relevant expertise available and also to develop appropriate service functions to facilitate the use by the broad research community of information held within the annotated microbial genomic data sets.

Importantly, a suitable advisory body will need to be established to oversee the entity charged with this annotative responsibility and to advise it on instituting other measures, such as quality control procedures and the development of SOPs, that are to be applied to the acquisition of new data and to the safeguarding of data sets that are part of the system. The members of the advisory body should be broadly representative of the community it serves, including members drawn from federal agencies, scientific organizations, and the broader research community. This advisory body should also be free to appoint members with specific expertise to help in dealing with specialized issues that come before it.

The ASM ad hoc group recognizes that the analysis and annotation of microbial genomes involve an international effort. Therefore, the group encourages officials directing the federal Microbe Project to identify and work cooperatively with pertinent research groups outside the United States who are conducting such efforts. Although programs to share costs or more ambitious undertakings to establish the equivalent of an organization such as the Human Genome Organisation (HUGO) for the sake of coordinating these microbial genomic efforts might be ideal, the ad hoc group considers them impractical over the near term.