The NCBI Taxonomy Project

The purpose of this note is threefold: (1) to outline the taxonomy project that we have been working on at the NCBI for the past year, (2) to solicit volunteers from the taxonomic and phylogenetic communities (to help curate the taxonomy) and from the users of the sequence databases (to help identify problems with the taxonomy), and (3) to establish contact with culture collections, stock centers, herbaria and museums, and any other groups that maintain general or specialty taxonomies or phylogenies.

The problems with the taxonomies used by the sequence databases are well known: each database comes with its own taxonomy; each is different from the others; none of them is in full agreement with the current taxonomic consensus (even if we could imagine that such a thing existed); and all of them contain a wide variety of errors and inconsistencies. At an even more basic level, it is not always possible (even within the same database) to determine whether two entries come from the same species.

We have developed a taxonomy database management tool (the TaxMan), which is based on a tree-structured database application developer's tool (the TreeTool). This tool includes a rich set of functions for merging and crossmapping trees. We have used the TaxMan to build representations of each of the sequence database taxonomies, as well as a few taxonomies obtained from other sources (the ICTV international standard taxonomy for the viruses, the USDA taxonomy for the plants, and the FlyBase taxonomy for the Drosophilidae). We have used the TaxMan to merge all of these taxonomies into a single tree, which we can associate with the database we maintain that merges all of the sequence databases into a single structure.
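To make the idea of merging several taxonomies into a single tree concrete, here is a minimal sketch in Python. It is not the TaxMan's algorithm, which this note does not describe. It assumes that each taxonomy can be reduced to a list of root-to-leaf lineages, and that nodes with the same name at the same position are the same taxon. The class and function names (TaxonNode, add_lineage, merge_taxonomies), the source labels and the toy lineages are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonNode:
    """One node in the merged tree; `sources` records which taxonomies contain it."""
    name: str
    sources: set = field(default_factory=set)
    children: dict = field(default_factory=dict)


def add_lineage(root, lineage, source):
    """Insert one root-to-leaf lineage, reusing nodes that already exist."""
    node = root
    for name in lineage:
        if name not in node.children:
            node.children[name] = TaxonNode(name)
        node = node.children[name]
        node.sources.add(source)
    return node


def merge_taxonomies(taxonomies):
    """Merge {source: [lineage, ...]} into one tree under a synthetic root."""
    root = TaxonNode("root")
    for source, lineages in taxonomies.items():
        for lineage in lineages:
            add_lineage(root, lineage, source)
    return root


# Toy input, invented for illustration; "db_a" and "db_b" are hypothetical sources.
toy = {
    "db_a": [["Eucaryotae", "Metazoa", "Drosophila melanogaster"]],
    "db_b": [["Eucaryotae", "Metazoa", "Homo sapiens"],
             ["Eucaryotae", "Metazoa", "Drosophila melanogaster"]],
}
merged = merge_taxonomies(toy)
metazoa = merged.children["Eucaryotae"].children["Metazoa"]
for name, node in sorted(metazoa.children.items()):
    print(name, sorted(node.sources))
```

A real merge also has to reconcile taxonomies that place the same organism under different parents or spell its name differently; this sketch simply keeps both placements and leaves that reconciliation to a curator.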
After we had merged the sequence database taxonomies, Mitch Sogin, of the Marine Biological Laboratory at Woods Hole, organized a workshop to review and revise the taxonomy and to discuss mechanisms by which the taxonomic community could maintain it (as new species enter the databases and as the taxonomic consensus develops). The workshop brought together a dozen representatives, each specializing in a different branch of the taxonomic tree, and included both classical and molecular systematists. The revised 'backbone' tree will be much more of a phylogenetic taxonomy than a classical taxonomy; we feel that this will be of more general use to the users of the molecular sequence databases. The nucleic acid sequence database collaborators (EMBL and DDBJ) have agreed, in principle, to adopt the revised taxonomy as a database standard.

We realize that for any given taxonomy there will be at most one person in the world who is completely happy with it. Although we need a single 'backbone' tree to associate with the sequence databases (and we will try to make this tree as good as possible), we do not want to claim that our tree is the canonical international standard taxonomy. We plan to develop the TaxMan to make it easy for concerned users to modify the 'backbone' taxonomy as they see fit, crossmap their personal tree back onto the 'backbone' taxonomy, and index the sequence databases through their own tree.

For example, we have promoted the Archaea to kingdom level, alongside the Eubacteria and the Eucaryotae. Others may wish to use the traditional classification (with the Archaea and the Eubacteria buried in the Procaryotae) or other modern reclassifications (e.g. the Eocytes). As another example, we plan to move the birds (Aves) beneath the Archosauria, as a sister group to the Crocodylia.
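The idea of indexing the sequence databases through a personal tree can be illustrated with a small Python sketch. It is only a sketch under stated assumptions, not a description of how the TaxMan will work: each tree is reduced to a child-to-parent mapping, the two trees are crossmapped simply by sharing organism names, and the entry identifiers (SEQ0001, SEQ0002) and function names are hypothetical.

```python
def lineage(parent_of, taxon):
    """Return the path from `taxon` up to the root of one tree."""
    path = [taxon]
    while taxon in parent_of:
        taxon = parent_of[taxon]
        path.append(taxon)
    return path


def entries_under(parent_of, clade, entry_organism):
    """List the entry IDs whose organism falls under `clade` in the given tree."""
    return sorted(entry for entry, organism in entry_organism.items()
                  if clade in lineage(parent_of, organism))


# 'Backbone' view: Archaea and Eubacteria are top-level kingdoms.
backbone = {
    "Methanococcus vannielii": "Archaea",
    "Escherichia coli": "Eubacteria",
}

# A user's personal view: the traditional grouping under Procaryotae.
personal = dict(backbone)
personal.update({"Archaea": "Procaryotae", "Eubacteria": "Procaryotae"})

# Hypothetical entry IDs mapped to organism names (shared by both trees).
entries = {"SEQ0001": "Escherichia coli", "SEQ0002": "Methanococcus vannielii"}

print(entries_under(backbone, "Archaea", entries))
print(entries_under(personal, "Procaryotae", entries))
```

Because both trees share the same organism names at the leaves, the same set of entries can be retrieved through either classification; only the internal grouping differs.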
Because we are moving towards a phylogenetic taxonomy instead of a classical taxonomy, the classical concept of taxonomic rank names (e.g. family, order, etc.) disappears. In the revision of the protozoan taxonomy that we have received, the familiar rank-level suffixes (-idae, -ida, -iformes, etc.) have been replaced with the generic suffix (-ids). We will, however, retain the other names (such as Kinetoplastida, Trypanosomatidae, etc.) as synonyms in the tree, so that users may continue to retrieve the same set of organisms with these names.

There are several consequences for database users and submitters. First, we plan to formalize the use of organism names in the database: to collect all of the variant spellings, synonyms and misspellings, and to select a preferred scientific name for each organism. Second, we plan to phase in (and retrofit the databases with) new taxonomic classification lines from the revised tree as the subtrees are returned by the workshop participants. Finally, each of these fields may change in new releases of existing entries in the databases, as new synonyms and misspellings are identified in organism names and as the taxonomy is revised to reflect new work in the field. Submitters who wish to associate different names and taxonomic classification lines with their entries will be allowed to enter this information in a /note attached to the source feature in the flatfile format.

We have added a directory to our anonymous ftp site (, a.k.a. ) to make available files associated with this project.

We were very happy to receive over fifty positive responses to the posting about the NCBI Taxonomy Project. We would, however, like to clarify the sources of data for the project and, in particular, to acknowledge the premier contribution of the PIR-International taxonomy maintained by Andrzej Elzanowski, MIPS am Max-Planck-Institut fuer Biochemie, Martinsried, Germany.
The work performed at the PIR-International is, in our opinion, of the highest biological content and quality. The PIR-International is the only one of the sequence databases that has employed a taxonomist to maintain its taxonomy. We are also indebted to the work done by the other sequence database groups (SwissProt, GenBank, EMBL, and DDBJ) and to the taxonomic databases from other sources (the ICTV virus taxonomy, the USDA plant taxonomy, and the Drosophilidae taxonomy from the FlyBase project). We want to ensure that their contributions are publicly acknowledged here, and we have detailed our use of these sources consistently in the text documents posted at our FTP site (repository/taxonomies/taxman at ).

We have used the PIR-International taxonomy as the starting point for our merged view of the sequence database taxonomies. The other sequence database taxonomies (GenBank, EMBL, DDBJ and SwissProt) were used for the next rounds of the merging process. Where we have found international standard taxonomic databases (for example, the ICTV taxonomy) appropriate to our particular needs as maintainers of the GenBank sequence database, we have substituted these taxonomies at the appropriate branches of the tree in our merged view of the sequence database taxonomy.

We gratefully acknowledge the contributions from all of these sources, and look forward to a continuing collaboration with these (and other) parties in our effort to produce consistent views of the taxonomy that will span the various sequence databases.

Please send comments & criticisms to: Scott


