
Table of Contents

Foreword and Preface Introduction History of the Human Genome Project DOE-NIH Coordination Scientific Five-Year Goals of the U.S. Human Genome Project Highlights of Research Progress Mapping Informatics Sequencing Activities Addressing Ethical, Legal, and Social Issues Related to Human Genome Project Data Technology Transfer and Industrial Collaboration Human Genome Center Research Narratives Lawrence Berkeley Laboratory Lawrence Livermore National Laboratory Los Alamos National Laboratory Program Management Infrastructure DOE OHER Mission Program Management Task Group Field Coordination Human Genome Coordinating Committee Human Genome Management Information System Human Genome Distinguished Postdoctoral Fellowships Resource Allocation Interagency Coordination Joint DOE-NIH Activities Joint Mapping Working Group Joint Informatics Task Force Joint Sequencing Working Group Joint Working Group on Ethical, Legal, and Social Issues Joint Working Group on the Mouse Other U.S. Genome Research U.S. Department of Agriculture National Science Foundation Howard Hughes Medical Institute International Coordination HUGO: Worldwide Genome Research Coordination UNESCO: Promoting the Interests of Developing Countries Appendices A. Primer on Molecular Genetics B. Conferences, Meetings, and Workshops Sponsored by DOE C. Members of the DOE Health and Environmental Research Advisory Committee D. Members of the DOE-NIH Joint Working Groups E. Glossary Index to Principal and Coinvestigators Listed in Abstracts Acronym List

Foreword

Acquiring complete knowledge of the organization, structure, and function of the human genome, the master blueprint of each of us, is the broad aim of the Human Genome Project. It is a new kind of program in biology, both in its size and focus on a limited set of goals and in its dependence on the development and use of technology. The coordinated U.S.
Human Genome Project was officially initiated by the Department of Energy (DOE) Office of Health and Environmental Research and the National Institutes of Health (NIH) National Center for Human Genome Research (NCHGR) in FY 1991 with the publication in April 1990 of Understanding Our Genetic Inheritance; The U.S. Human Genome Project: The First Five Years 1991-1995. The DOE effort, which began very modestly almost 4 years before, is now over 5 years old. Taking stock of what has been done and what remains to be done is particularly appropriate at this time. That the ambitious scientific goal of the Human Genome Project can now be imagined is the result of the revolution occurring in biology during the last 20 years. Modern biological science has achieved a profound but still quite incomplete level of understanding of how the diversity of all living things is determined. This insight, along with scientific and technical advances in other fields, has brought unprecedented power both in being able to analyze and manipulate genetic structures and to use and store large quantities of genetic information. DOE is uniquely positioned to bring together expertise in physics, chemistry, engineering, and computer science to help solve fundamental biological problems and to exploit exciting opportunities presented by the Human Genome Project. Genome research will also contribute to the department's role in providing the scientific foundation for understanding the health effects of radiation and of chemical insults to the genome. The DOE program stresses mapping, the development of sequencing technologies and instrumentation, and informatics. Informatics refers to computational approaches in acquiring, storing, distributing, analyzing, and manipulating vast amounts of mapping and sequence data that will result from the project. 
Another important program component studies the ethical, legal, and social issues arising from use of the generated data, particularly in the privacy and confidentiality of genetic information. Cutting across all DOE biological and environmental research programs are several science education activities. The Human Genome Project is a closely cooperative activity between NIH and DOE. NCHGR is an important and essential participant. Internationally, the formation of the Human Genome Organization and the establishment of national genome projects by an increasing number of countries indicate the fascination and promise of this effort on the collective imaginations of many nations. In addition to the inherent excitement about increased knowledge of human life, the project offers the promise of many new opportunities for benefiting humanity through the development of new diagnostics, pharmaceuticals, and therapies for a multitude of human diseases; a wide range of improvements will flow from other biotechnology advances. Further expected benefits include improved risk assessment for individuals and populations exposed to agents that impact genetic material, as well as possible applications of the data to environmental and remediation issues. To be successful, the program must continue to focus on clear objectives for mapping and sequencing and to incorporate the flow of technological developments into the efforts of all working laboratories. Strategies must be planned carefully and in a comprehensive fashion as the next phase begins, in which mapping and sequencing results proliferate and technologies mature. Planning must be project-wide and include interagency planning at ever-earlier stages. This report describes the status of the DOE Human Genome Program and its accomplishments to date. 
Research highlights are noted from the program as a whole and from the three principal DOE human genome centers at Lawrence Berkeley Laboratory, Lawrence Livermore National Laboratory, and Los Alamos National Laboratory. These national laboratory facilities of DOE have been especially successful because they are organized to focus efforts, foster interdisciplinary projects, and use advanced technologies, some developed for other purposes, toward program goals. Essential work is also reported from 41 different research universities. Remarkable progress has been made in advanced instrumentation and informatics. A further indication of the increasing development of the DOE program is the simple statistic that the 1989-90 report had 157 pages and included 57 abstracts of work involving 211 scientists. The current program report contains over 240 pages and includes more than 150 abstracts of work involving over 400 investigators, essentially a doubling of DOE program size. The Human Genome Project ultimately will create scientific resources for the next wave of advances in biology and medicine. As the project is completed, accomplishments will dwarf those that have occurred in the biological sciences since the advent of recombinant DNA technologies. By the same token, the ethical and social consequences of the uses of this new knowledge must be considered as the knowledge is acquired; if this knowledge is responsibly obtained and applied, the next decade of biological research will be history's most fruitful and rewarding by any measure.

David J. Galas, Associate Director
Office of Health and Environmental Research
Office of Energy Research
U.S. Department of Energy

Preface

This is the third report summarizing the Department of Energy (DOE) Human Genome Program, its content, progress, and accomplishments.
Since the program's conception in 1986 and initiation in 1987 by the DOE Office of Health and Environmental Research (OHER), its broad objectives have rapidly gained both national and international support. The program has made important strides in the development and application of technologies and tools that are required for the cost-effective characterization of the molecular nature of the human genome. This country's Human Genome Project is jointly administered by OHER and the National Center for Human Genome Research of the National Institutes of Health. A successful effort to characterize the molecular nature of human inheritance will require continuing international cooperation involving scientists from many countries. A number of other nations have begun substantial efforts to map and sequence the human genome and those of key model organisms. Although intellectual property issues threaten some aspects of international cooperation, increasing exchange of information has led to more involvement of the international community in discovery, acceleration of the pace of the research, and increased cost-effectiveness. International communication is facilitated by regular meetings to update the maps of individual chromosomes and by contributions to databases such as the Genome Data Base and nucleic acid sequence databanks. Through such databases a worldwide data aggregation and distribution system is being developed to exchange information regarding the genome. Aided by funding from the Human Genome Project, serious study is under way on ethical, legal, and social issues that are becoming more urgent because of the rapid growth in knowledge of human genetics. It is important to develop and disseminate deeper and more widespread understanding of these dynamic issues and of the choices available for families, the law, and society. An educated public is required to make intelligent choices in this area. 
The national genome project is now the largest provider of funds for study of such issues. A key to the long-term success of the program is the initial phase of intensive resource and technology development that requires input and involvement from many scientific and engineering disciplines. Exciting contributions have already been made to biomedical knowledge and biotechnology, and such advances are certain to continue at an ever-increasing rate. Announcements of discovery of important disease genes have become commonplace. Within 10 years nearly all the perhaps 100,000 genes that make up the human genome are likely to be found. Within 15 years the program is expected to culminate in a reference DNA sequence of the entire genome. Never has such a mass of data flowed into biology and medicine. An understanding of how genetic variations account for much of the richness and adventure of human diversity will be greatly increased. More practically, there can be little doubt of tremendous payoffs in terms of diagnoses and, ultimately, specific therapies for many human diseases. Moreover, new technologies and rapidly developing analytical tools to characterize the human genome will have widespread impact beyond human health. They will find application in revealing the genetic inheritance of many organisms of potential scientific and commercial interest and will provide an important stimulus to broaden and deepen the impact of modern biology in areas such as energy, environmental protection and waste treatment, agriculture, and the materials sciences. Of particular importance is the facile access to proteins that rapidly follows discovery of their genes. As a result of genome projects, we will soon be in a position to begin the systematic large-scale characterization of proteins and their structure. 
The interplay of molecular biology, structural studies, high-performance computing, and advanced molecular graphics will certainly lead to an understanding of macromolecular structure-function relationships. The scientific and economic implications of such a predictive understanding cannot be overestimated. It is the key to full realization of the potential of modern biology. Intense X-ray light and neutrons produced by unique, large, and expensive machines (synchrotrons and reactors) at DOE laboratories are important national resources for the determination of biological structure and, hence, for the national effort in biotechnology. A central goal of OHER is to provide access to these machines by making facilities and technical support available to structural-biology users, a need that has been projected to increase tenfold in the next several years. Finally, as Robert Sinsheimer elegantly pointed out in The FASEB Journal (November 1991), the Human Genome Project is an epic venture of discovery that will in time clarify many endlessly and fruitlessly debated mysteries of human nature. With this project we are launched upon a new stage of the age-old quest to illuminate the record of the human past: the prehistory of our species as recorded in the genetic script or blueprint for our being. When complete, the project will have provided us with an unprecedented resource: the complete text of our genetic endowment. It will be seen as a turning point in human history.

David A. Smith, Director
Health Effects and Life Sciences Research Division
Office of Health and Environmental Research
Office of Energy Research
U.S. Department of Energy

Acknowledgements

The DOE Office of Health and Environmental Research gratefully acknowledges the contributions made by genome research grantees and contractors in submitting abstracts, photographs, captions, and narratives.
The Human Genome Management Information System at Oak Ridge National Laboratory (managed by Martin Marietta Energy Systems, Inc., for the U.S. Department of Energy under contract DE-AC05-84OR21400) collected and organized the information, prepared the manuscript, and implemented the design and production of this publication.

Introduction

The U.S. Human Genome Project is the national coordinated 15-year effort to characterize all the human genetic material (the genome) by improving existing human genetic maps, constructing physical maps of entire chromosomes, and ultimately determining the complete sequence of the deoxyribonucleic acid (DNA) subunits in the human genome. Parallel studies are being carried out on selected model organisms to facilitate the interpretation of human gene function. The ultimate goal of the U.S. project is to discover all of the more than 100,000 human genes and render them accessible for further biological study. Current technology could probably be used to attain the objectives of the Human Genome Project, but the cost and time required would be unacceptable. For this reason, a major feature of the first 10 years of the project is to optimize existing methods and develop new technology to increase efficiency in DNA mapping and sequencing by 1 or 2 orders of magnitude. The genome will eventually be sequenced using continually evolving technologies and revolutionary methods not in existence today. Information obtained as part of the Human Genome Project will dramatically change almost all biological and medical research and dwarf the catalog of current genetic knowledge. In addition, both the methods and the data developed as part of the project are likely to benefit investigations of many other genomes, including a large number of commercially important plants and animals. For more information on the science of genomics, see Appendix A, "Primer on Molecular Genetics," p. 191. Terms are defined in the Glossary, p. 229.
An acronym list is on the inside back cover.

History of the DOE Human Genome Program

A brief history of the U.S. Department of Energy (DOE) Human Genome Program will be useful in a discussion of the objectives of the DOE program as well as those of the collaborative U.S. Human Genome Project. The Office of Health and Environmental Research (OHER) of DOE and its predecessor agencies (the Atomic Energy Commission and the Energy Research and Development Administration) have long sponsored research into genetics, both in microbial systems and in mammals, including basic studies on genome structure, replication, damage, and repair and the consequences of genetic mutations. In 1984, OHER and the International Commission on Protection Against Environmental Mutagens and Carcinogens cosponsored a conference in Alta, Utah, which highlighted the growing roles of recombinant DNA technologies. Substantial portions of the meeting's proceedings were incorporated into the Congressional Office of Technology Assessment report, Technologies for Detecting Heritable Mutations in Humans, in which the value of a reference sequence of the human genome was recognized. Acquisition of such a reference sequence was, however, far beyond the capabilities of biomedical research resources and infrastructure existing at that time. Although the small genomes of several microbes had been mapped or partially sequenced, the detailed mapping and eventual sequencing of 24 distinct human chromosomes (22 autosomes and the sex chromosomes X and Y) that together comprise an estimated 3 billion subunits was a task thousands of times larger. DOE OHER was already engaged in several multidisciplinary projects contributing to the nation's biomedical capabilities, including the GenBank® DNA sequence repository, which was initiated and sustained by DOE computer and data-management expertise. Several major user facilities supporting microstructure research were developed and are maintained by DOE (see box, p. 55).
Unique chromosome-processing resources and capabilities were in place at Los Alamos National Laboratory and Lawrence Livermore National Laboratory. Among these were the fluorescence-activated cell sorter (FACS) systems to purify human chromosomes within the National Laboratory Gene Library Project for the production of libraries of DNA clones. The availability of these monochromosomal libraries opened an important path: a practical means of subdividing the huge total genome into 24 much more manageable components. With these capabilities, OHER began in 1986 to consider the feasibility of a dedicated human genome program. Leading scientists were invited to the March 1986 international conference at Santa Fe, New Mexico, to assess the desirability and feasibility of implementing such a project. With virtual unanimity, participants agreed that ordering and eventually sequencing DNA clones representing the human genome were desirable and feasible goals. With the receipt of this enthusiastic response, OHER initiated several pilot projects. Program guidance was further sought from the DOE Health and Environmental Research Advisory Committee (HERAC; see Appendix C for a list of current members).

The HERAC Recommendation. The April 1987 HERAC report recommended that DOE and the nation commit to a large, multidisciplinary, scientific, and technological undertaking to map and sequence the human genome. DOE was particularly well suited to focus on resource and technology development, the report noted; HERAC further recommended a leadership role for DOE because of its demonstrated expertise in managing complex and long-term multidisciplinary projects involving both the development of new technologies and the coordination of efforts in industries, universities, and its own laboratories.
Evolution of the nation's Human Genome Project further benefited from a 1988 study by the National Research Council (NRC) entitled Mapping and Sequencing the Human Genome, which recommended that the United States support this research effort and presented an outline for a multiphase plan.

DOE-NIH Coordination

The National Institutes of Health (NIH) was a necessary participant in the large-scale effort to map and sequence the human genome because of its long history of support for biomedical research and its vast community of scientists. This was confirmed by the NRC report, which recommended a major role for NIH. In 1987, under the leadership of Director James Wyngaarden, NIH established the Office of Genome Research in the Director's Office. In 1989 this office became the National Center for Human Genome Research (NCHGR), directed by James D. Watson. After Watson's resignation in April 1992, Michael Gottesman was appointed NCHGR Acting Director. In addition to extramural support for research projects in physical mapping and the development of index linkage markers and technology, NIH also provides support for genetic mapping based on family studies and, following NRC recommendations, for studies on several relevant model organisms. DOE-supported genome research is focused almost exclusively on the human genome through support of large-scale physical mapping, resource and instrumentation technology development, and improvements in computational and database capabilities and research infrastructure. A significant portion of the DOE Human Genome Program is allocated to the DOE national laboratories. In several important areas, DOE and NIH cooperate to support critical resources such as the Genome Data Base (GDB) at Johns Hopkins University. Cofunded since 1991 as the central international repository of human chromosome mapping data, GDB is expected to receive supporting funds from other nations.
DOE and NIH also cooperate to support joint workshops; a number of ethical, legal, and social issues projects; and the Human Genome News newsletter. Joint task groups under the DOE-NIH Joint Subcommittee on the Human Genome meet periodically to define program needs and develop recommendations for their parent DOE and NIH committees. OHER and NCHGR cosponsor workshops and meetings of the task groups on mapping; sequencing; informatics; the use of the mouse as a mammalian model; and, in a departure from most scientific programs, ethical, legal, and social issues related to data produced in the project. Many other highlights of the DOE OHER program follow in the succeeding sections of this report, including reports from the human genome centers; further details of program infrastructure, management, and coordination; resource allocation; and abstracts of individual research projects.

Scientific Five-Year Goals of the U.S. Human Genome Project
from the NIH-DOE Five Year Plan* [Implemented October 1, 1990 (FY 1991)]

1. Mapping and Sequencing the Human Genome

Genetic Mapping: Complete a fully connected human genetic map with markers spaced an average of 2 to 5 cM apart. Identify each marker by a sequence tagged site (STS).

Physical Mapping: Assemble STS maps of all human chromosomes with the goal of having markers spaced at approximately 100,000-bp intervals. Generate overlapping sets of cloned DNA or closely spaced unambiguously ordered markers with continuity over lengths of 2 Mb for large parts of the human genome.

DNA Sequencing: Improve current and develop new methods for DNA sequencing that will allow large-scale sequencing of DNA at a cost of $0.50 per base pair. Determine the sequence of an aggregate of 10 Mb of human DNA in large continuous stretches in the course of technology development and validation.

2. Model Organisms

Prepare a mouse genome genetic map based on DNA markers. Start physical mapping on one or two chromosomes. Sequence an aggregate of about 20 Mb of DNA from a variety of model organisms, focusing on stretches that are 1 Mb long, in the course of developing and validating new and improved DNA sequencing technology.

3. Informatics: Data Collection and Analysis

Develop effective software and database designs to support large-scale mapping and sequencing projects. Create database tools that provide easy access to up-to-date physical mapping, genetic mapping, chromosome mapping, and sequencing information and allow ready comparison of the data in these several data sets. Develop algorithms and analytical tools that can be used in the interpretation of genomic information.

4. Ethical, Legal, and Social Considerations

Develop programs directed toward understanding the ethical, legal, and social implications of Human Genome Project data. Identify and define the major issues and develop initial policy options to address them.

5. Research Training

Support research training of pre- and postdoctoral fellows starting in FY 1990. Increase the number of trainees supported until a steady state of about 600 per year is reached by the fifth year. Examine the need for other types of research training in the next year (FY 1991).

6. Technology Development

Support automated instrumentation and innovative and high-risk technological developments as well as improvements in current technology to meet the needs of the genome project as a whole.

7. Technology Transfer

Enhance the already close working relationships with industry. Encourage and facilitate the transfer of technologies and of medically important information to the medical community.

*Understanding Our Genetic Inheritance; The U.S. Human Genome Project: The First Five Years FY 1991-1995, DOE/ER-0452P, U.S. Department of Health and Human Services and U.S. Department of Energy, April 1990.

Highlights of Research Progress

Mapping

A major goal for DOE and NIH, as stated in the Five Year Plan (p.
5) for the Human Genome Project officially implemented in FY 1991, is to develop refined physical maps of chromosomes. Increasingly detailed maps will provide biomedical scientists with rapid access to important areas on chromosomes through their specific markers and ordered sets of DNA clones. Page numbers for research abstracts of investigators noted in parentheses can be located in the "Index to Principal and Coinvestigators Listed in Abstracts," p. 243.

Physical Map Construction

DOE sponsors both extensive physical mapping studies and supportive resource and technology development. Physical mapping of chromosomes 5, 11, 16, 17, 19, 21, 22, and X has been or is being supported directly. Increasingly detailed maps facilitate access to important chromosomal loci through their constituent markers and ordered DNA clones. The earliest concerted mapping efforts began on chromosome 16 at the Los Alamos National Laboratory (LANL) Center for Human Genome Studies and on chromosome 19 at the Lawrence Livermore National Laboratory (LLNL) Human Genome Center. These efforts have achieved excellent progress (see detailed narratives, pp. 46 and 36, respectively) through the development of effective multidisciplinary teams and efficient methods for generating clone "fingerprints." The fingerprints provide data for recognizing clone pairs that overlap, facilitating the construction of increasingly larger sets of overlapping clones, called contigs. Approximately 90% of chromosomes 16 and 19 is now represented by fingerprinted clones, and multiclone contigs span at least 80% of their length. Initial contig assembly methodologies are complemented by strategies designed to finish the physical maps and align them with genetic maps.
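The step from clone fingerprints to contigs can be illustrated with a minimal sketch. It is not the method used at LANL or LLNL; it assumes, purely for illustration, that a fingerprint is a set of fragment lengths, that two clones overlap when they share at least a chosen fraction of fragments, and that a contig is any group of clones linked by such overlaps. The clone names, fragment sizes, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def shared_fraction(fp_a, fp_b):
    """Fraction of the smaller fingerprint's fragments also present in the other."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))


def build_contigs(fingerprints, threshold=0.5):
    """Group clones into contigs: connected components of the overlap graph."""
    parent = {name: name for name in fingerprints}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Declare an overlap whenever two fingerprints share enough fragments.
    for a, b in combinations(fingerprints, 2):
        if shared_fraction(fingerprints[a], fingerprints[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in fingerprints:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical fingerprints: sets of fragment lengths in base pairs.
fingerprints = {
    "c1": {1200, 3400, 5100, 800},
    "c2": {3400, 5100, 800, 2600},
    "c3": {2600, 4700, 950, 800},
    "c4": {7000, 7300, 150},
}
contigs = build_contigs(fingerprints)
print(contigs)
```

In this toy data c1 and c3 share too few fragments to be paired directly, yet both overlap c2, so all three fall into one contig while c4 remains a singleton. Exact matching of fragment lengths is a simplification; measured fragment sizes carry error, so a practical comparison would need a tolerance and a statistical test for chance matches.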
This progress, together with the many contributions from other research groups (presented in the Abstracts section of this report), shows that resources and technologies required to achieve the mapping goals stated in the Five Year Plan are rapidly being realized.

National Laboratory Gene Library Project (NLGLP)

Among the resources most crucial to mapping progress are the libraries of clones representing each of the human chromosomes. Their availability reduces the total genome mapping effort to 24 smaller, more-manageable mapping projects. This chromosome-specific clone library production from physically purified chromosomes depends on the unique LANL and LLNL chromosome-sorting facilities maintained through the DOE NLGLP. These library resources are distributed either from the laboratories or through the American Type Culture Collection. As of December 1991 over 620 chromosome-specific libraries were distributed as resources for entire chromosome mapping efforts and for more-selective gene hunts. Current library production is focused on the needs of the major chromosome mapping projects (L. Deaven, LANL; P. de Jong, LLNL).

Recombinant Clone Types

Other biological resources are also being developed to further chromosome mapping progress. These resources include several useful genetic elements or recombinant DNAs and their cellular hosts. The largest elements are the intact, single human chromosomes maintained in somatic cell hybrids, such as single human chromosome/hamster-host cell hybrids. They are valuable for sorting out the human chromosomes for construction of single-chromosome libraries. Insert sizes of recombinants range from millions to a few hundred bases. Recombinant cosmid clones with 40- to 50-kb human DNA inserts predominated in the early contig-building efforts and continue to be a basic resource (refer to Abstracts: Resource Development, p. 82).
Monochromosomal Yeast Artificial Chromosomes (YACs)

YACs with inserts of 200 kb and larger, whose initial development was pioneered with NIH support, are now widely used in physical mapping projects. The recently developed capability to produce YACs from flow-sorted chromosomes is making available monochromosomal YAC libraries to speed mapping projects (M. McCormick, L. Deaven, and R. Moyzis, LANL). These libraries are made up of YACs containing human DNA inserts. This contrasts with libraries made from somatic cell hybrids, which are made up of YACs that contain mostly nonhuman DNA inserts.

Clone Library Array and Analysis

When user laboratories maintain clone libraries in the same arrayed-format addressing system, the information obtained from these libraries is maximized because the accumulated data from different laboratories can be readily combined. The tedious task of arraying thousands of DNA clones has been greatly alleviated through the development and implementation of automated or robotic processing systems (T. Beugelsdijk and P. Medvick, LANL; J. Jaklevic, Lawrence Berkeley Laboratory (LBL); and A. Olsen, LLNL). These systems are being increasingly utilized in clone analyses and in comparisons needed for overlap detection.

Multiplexed Clone Overlap Detection

Overlap detection of sequence homologies by DNA hybridization is speeded by multiplexing strategies in which the processing of pools of clones or their derivative probes replaces the more tedious analysis of individual clones. Multiplexing was first implemented by the chromosome 11 mapping group (G. Evans, Salk Institute for Biological Studies). Several second-generation multiplexing schemes are now being implemented to speed overlap detection both within libraries and between members of different types of libraries (J. F. Cheng, LBL; P. de Jong, LLNL).
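Why pooling saves work can be shown with a small sketch. The report does not describe the pooling designs actually used, so the row-and-column scheme below is an assumption chosen for simplicity: clones sit in a square array, each row and each column is combined into one pool, and a probe is tested against the pools instead of every clone. The grid size and clone names are hypothetical.

```python
def make_pools(grid):
    """Return row pools and column pools (sets of clone names) for a 2-D array."""
    row_pools = [set(row) for row in grid]
    col_pools = [set(col) for col in zip(*grid)]
    return row_pools, col_pools


def screen(pools, positives):
    """Simulate hybridization: a pool is positive if it holds any positive clone."""
    return [bool(pool & positives) for pool in pools]


def decode(grid, row_hits, col_hits):
    """Candidate clones lie at intersections of positive rows and positive columns."""
    return sorted(
        grid[r][c]
        for r, hit_r in enumerate(row_hits) if hit_r
        for c, hit_c in enumerate(col_hits) if hit_c
    )


# Hypothetical 4 x 4 clone array; one clone truly hybridizes to the probe.
grid = [[f"clone_{r}{c}" for c in range(4)] for r in range(4)]
row_pools, col_pools = make_pools(grid)

true_positives = {"clone_12"}
row_hits = screen(row_pools, true_positives)
col_hits = screen(col_pools, true_positives)

candidates = decode(grid, row_hits, col_hits)
tests_run = len(row_pools) + len(col_pools)
print(candidates, tests_run)
```

Eight pooled tests locate the positive clone among sixteen. The saving grows with array size, but when several clones are positive the row and column intersections include false candidates that need a confirming test, which is one reason more elaborate pooling schemes exist.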
Messenger RNA/cDNAs Used To Generate Sequence Tagged Sites (STSs)

STS marking of DNA clones provides a common language for uniting the results obtained with different types of recombinant DNAs and varied approaches to map generation. An STS is a short, unique DNA sequence (generally 100 to 300 bp) that distinguishes a chromosomal locus. The STS segment can be selectively amplified within the entire genome by the polymerase chain reaction to provide an identifying tag for any DNA clone containing the site. DOE is emphasizing the use of STSs for expressed genes, as represented by their derivative cDNAs. Mapping these STSs onto contigs and to their chromosomal loci is thus rapidly placing genes on the developing chromosome maps (refer to Abstracts: Resource Development, p. 82).

Microdissection Libraries

Chromosome microdissection can facilitate region-specific mapping efforts, such as the localized ordering of clones on the much longer chromosomes, by identifying sets of clones derived from the specific region. Region-specific probes can also serve in the identification of locally expressed genes by selectively displaying their counterparts within complex cDNA libraries (F.-T. Kao, Eleanor Roosevelt Institute).

Libraries of Hybrid Somatic Cells with Partial Human Chromosomes

Aberrant chromosomes arising from rearrangement processes can be moved into host rodent cells, providing for the maintenance of a human subchromosomal segment. A large hybrid set has been assembled for chromosome 16 (G. Sutherland, Adelaide Children's Hospital, South Australia). These partial chromosomes together define over 100 chromosomal segment "bins" to which clones, contigs, and other DNA markers can be assigned by DNA hybridization tests. This resource system is greatly speeding the completion of the chromosome 16 map.
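The bin idea can be made concrete with a short sketch under simplifying assumptions that are not taken from the report: each hybrid cell line is modeled as retaining one contiguous stretch of the chromosome, coordinates are arbitrary units, and hybridization results are error free. The breakpoints of the hybrids cut the chromosome into bins, and a marker is assigned to the bin whose set of covering hybrids matches the hybrids that tested positive. The hybrid names and coordinates are hypothetical.

```python
def define_bins(hybrids, chrom_length):
    """Split the chromosome at every hybrid breakpoint; return (start, end) bins."""
    points = {0, chrom_length}
    for start, end in hybrids.values():
        points.update((start, end))
    cuts = sorted(points)
    return list(zip(cuts, cuts[1:]))


def bin_signature(bin_range, hybrids):
    """Set of hybrids whose retained segment covers the whole bin."""
    lo, hi = bin_range
    return frozenset(h for h, (s, e) in hybrids.items() if s <= lo and hi <= e)


def assign_bin(positive_hybrids, hybrids, chrom_length):
    """Return the bins whose signature equals the observed hybridization pattern."""
    observed = frozenset(positive_hybrids)
    return [
        b for b in define_bins(hybrids, chrom_length)
        if bin_signature(b, hybrids) == observed
    ]


# Hypothetical hybrids: the chromosome segment each one retains.
hybrids = {"H1": (0, 40), "H2": (0, 70), "H3": (55, 100)}

bins = define_bins(hybrids, 100)
hit = assign_bin({"H2", "H3"}, hybrids, 100)
print(bins, hit)
```

Three hybrids are enough to define four bins here, and a marker positive on H2 and H3 but negative on H1 can only lie in the third bin. A pattern that matches no bin returns an empty list, which in practice would flag a hybridization error or a more complex rearrangement than this model allows.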
Fluorescence In Situ Hybridization (FISH) The previous mapping of DNA clones by FISH onto metaphase chromosomes has now been extended to the much less condensed interphase and pronuclear DNAs. Mapping onto less-condensed chromosomes increases spatial resolution and the capacity to order closely spaced markers. As a component of evolving mapping strategies, FISH is serving to locate and orient cosmid contigs on intact chromosomes and measure distances between the cosmids as well as to mapped cDNAs. (J. Gray, University of California; J. Korenberg, Cedars-Sinai Medical Center; B. Trask, LLNL). Fragile X Locus Cloned The fragile X locus has been cloned and its mode of action is being characterized (C. T. Caskey and D. L. Nelson, Baylor College of Medicine; and collaborators). Fragile X syndrome may be the most common form of inherited mental retardation. About 1 in 1500 males and 1 in 2500 females are affected by the syndrome, which is caused by a high mutation frequency at the fragile X locus. Myotonic Dystrophy Locus Cloned The gene responsible for myotonic dystrophy, an autosomal dominant disease, has been identified and cloned. The structural defect is characterized by a tandemly repeated segment of DNA within or close to the coding region on 19q13.3. The extent of the amplified region appears to be associated with the severity of the disease (C. T. Caskey, Baylor College of Medicine; P. de Jong and A. Carrano, LLNL; and collaborators). Informatics Multiple informatics capabilities will be crucial to the successful application of data derived from the genome project. Informatics expertise, software, and hardware are being developed in the following areas: chromosome map assembly, databases, DNA sequence analysis, and laboratory automation. Map Assembly Algorithms for automatically assembling physical maps from cloned fingerprint data have been further improved (E. Branscomb, LLNL; M. Cinkosky, V. Faber, J. Fickett, and D. Torney, LANL). 
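As a rough illustration of what assembling a map from fingerprint data involves, the sketch below groups clones into contigs when their restriction-fragment fingerprints share enough fragment sizes. It is a deliberately simplified threshold rule, not the algorithms cited above; the function name `assemble_contigs`, the `min_shared` cutoff, and the fragment sizes are assumptions for illustration, and measurement error in fragment sizes is ignored.

```python
from itertools import combinations


def assemble_contigs(fingerprints, min_shared=3):
    """Group clones into contigs by shared restriction fragments.

    fingerprints: dict mapping clone name -> set of fragment sizes (bp).
    Two clones are treated as overlapping when they share at least
    min_shared fragment sizes (an assumed, simplistic criterion).
    """
    parent = {name: name for name in fingerprints}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(fingerprints, 2):
        if len(fingerprints[a] & fingerprints[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for name in fingerprints:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    clones = {
        "c1": {800, 1200, 3400, 5100},
        "c2": {800, 2200, 3400, 5100},
        "c3": {950, 2200, 4100, 6000},
        "c4": {950, 4100, 6000, 7300},
    }
    print(assemble_contigs(clones))
```

In this toy data set c2 and c3 share only one fragment, so two separate contigs result; closing such a gap would require additional clones or another overlap-detection method.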
Software permitting fast parallel computations on multiple computers was developed to speed computation-intensive mapping analyses (E. Branscomb, LLNL). A computer communication and interrogation system is being assembled to minimize redundancy during the production of STS chromosomal markers from cDNAs. Participating laboratories will rapidly query distant databases to determine the novelty of a candidate mRNA/cDNA before further pursuing the STS-generation process. Databases Graphical interfaces for mapping databases were constructed to display several different types of aligned chromosomal data and provide expandable views [R. Douthart, Pacific Northwest Laboratory (PNL); J. Fickett, LANL; S. Lewis, LBL; R. Overbeek, Argonne National Laboratory (ANL)]. The electronic Laboratory Notebook database and similar databases are being continuously expanded to include new data types as mapping strategies evolve (J. Fickett, LANL). The internationally available Genome Data Base (GDB), housed at Johns Hopkins University and cofunded since September 1991 by DOE and NIH, is the primary reference database for human chromosome mapping data produced in the United States and abroad. The organizational structure of GDB is shown on the opposite page (P. Pearson, GDB). In a collaboration between LLNL and GDB, computer system interfaces have been devised for automatically transferring large amounts of data from mapping centers to GDB for integration into and updating of chromosome maps. Enhancements of the GenBank® DNA sequence database located at LANL continue. Primarily supported by NIH with contributions from DOE, GenBank exchanges data daily with European and Japanese databases. GenBank has expanded its electronic data-publishing facilities and has reached agreements with a number of journals to facilitate electronic publication of large volumes of DNA sequence data (J. Cassatt, NIH). 
Sequence Analysis gm, developed at New Mexico State University, is the first DNA sequence analysis algorithm capable of recognizing and ordering the set of protein-coding regions (exons) from among the noncoding regions (introns) comprising a gene, rather than predicting isolated protein-coding sequences. gm has been distributed to laboratories worldwide (C. Fields, now at NIH, and C. Soderlund, now at LANL). Gene Recognition and Analysis Internet Link (GRAIL), a novel neural network-based algorithm for identifying exons within DNA sequences, is online at Oak Ridge National Laboratory (ORNL) to serve the biological community by automatically analyzing sequences. From a number of examples, this artificial intelligence system learns several distinct sequence characteristics through which exons can be recognized. GRAIL automatically accepts input sequences sent to ORNL over Internet and returns the output analysis to the sender (R. Mural and E. Uberbacher, ORNL). Laboratory Automation Advances continue in the linking of laboratory instruments directly to data-acquisition computers and analysis software at the LANL, LLNL, and LBL human genome centers. Sequencing The DOE Human Genome Program has supported both evolutionary (incremental, gel-based) improvements to classical sequencing methods and several revolutionary (completely novel, gel-less) technologies. Steady advances have occurred in the evolutionary area with the implementation of automated sample preparation, multiplex sequencing, and strategies that minimize the need for prior subcloning. Gel Sequencing Approaches Multiplex sequencing systems have matured enough for transfer to the commercial sector (G. Church, Harvard Medical School; R. Gesteland, University of Utah). 
The readout of multiplexed gels and blots using stable isotopes as nucleic acid labels has the potential to increase sequencing speeds by at least a factor of 10 because resonance ionization mass spectroscopy is capable of differentiating many isotopes (H. Arlinghaus, Atom Sciences, Inc.; K. B. Jacobson, ORNL). Chemiluminescent label systems are now substituting for the less-desirable radioactive labels in many applications (I. Bronstein, Tropix, Inc.). Systems have been developed to retain chromosome continuity information by bypassing the customary subcloning step in the sequencing of recombinant DNAs (D. Berg, Washington University; C. Berg and L. Strausbaugh, University of Connecticut; J. Dunn and F. Studier, Brookhaven National Laboratory; R. Gesteland and R. Weiss, University of Utah). Fractionation speeds on capillary and very thin slab gels are 10-fold faster than on traditional thick gels (N. Dovichi, University of Alberta, Canada; B. Karger, Northeastern University; L. Smith, University of Wisconsin). The fluorescence/luminescence detection of fractionated nucleic acids has been significantly improved to allow detection of the smaller amounts of DNA loaded on capillary and thin slab gels (N. Dovichi, University of Alberta; R. Mathies, University of California; E. Yeung, Ames Laboratory). Over 300 kb have been sequenced from human and mouse T-cell receptors, providing fundamental new insights into the molecular biology of the immune response (L. Hood and T. Hunkapiller, California Institute of Technology). Gel-less Sequencing Technologies The technology for interrogating or sequencing clones by hybridization with short oligomers has passed a second proof-of-concept test. Three unknown DNA fragments were fully and accurately sequenced (R. Crkvenjakov and R. Drmanac, ANL). 
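The principle behind sequencing by hybridization can be sketched in a few lines: if hybridization reveals which short oligomers occur in a fragment, the fragment can be rebuilt by chaining oligomers that overlap by all but one base. The code below is a minimal illustration under strong assumptions (an error-free spectrum, and no repeated (k-1)-mers in the target); it is not the procedure used in the work cited above, and the function names are hypothetical.

```python
def spectrum(seq, k):
    """Return the set of all k-mers in seq (an idealized hybridization result)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reconstruct(kmers):
    """Rebuild a sequence from its k-mer spectrum by chaining overlaps.

    Assumes k >= 2 and that every (k-1)-mer occurs once in the target;
    raises ValueError when the spectrum is ambiguous under that assumption.
    """
    k = len(next(iter(kmers)))
    by_prefix = {}
    for kmer in kmers:
        by_prefix.setdefault(kmer[:-1], []).append(kmer)
    suffixes = {kmer[1:] for kmer in kmers}

    # The first k-mer is the one whose prefix is not the suffix of any other.
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1 or any(len(v) > 1 for v in by_prefix.values()):
        raise ValueError("ambiguous spectrum: repeated (k-1)-mers")

    seq = starts[0]
    used = 1
    # Extend one base at a time; the counter guards against cycles.
    while seq[-(k - 1):] in by_prefix and used < len(kmers):
        seq += by_prefix[seq[-(k - 1):]][0][-1]
        used += 1
    if used != len(kmers):
        raise ValueError("spectrum does not form a single chain")
    return seq


if __name__ == "__main__":
    target = "ATGCGTACCTGA"
    print(reconstruct(spectrum(target, 4)) == target)
```

Repeats longer than the oligomer overlap break the unique chain, which is why longer probes or additional information are needed for complex fragments.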
In research and development for single-molecule sequencing by processive nucleotide release, the capacity to detect single nucleotides by laser-induced fluorescence has been demonstrated (R. Keller and J. Jett, LANL). Progress is being made in developing methods to sequence DNA using lasers coupled to a mass spectrometer. The great advantage of these approaches is that the mass spectrum can be acquired in milliseconds (C. Chen, ORNL; J. Jaklevic, W. Benner, and J. Katz, LBL; L. Smith and B. Chait, University of Wisconsin; R. Smith, PNL; P. Williams and N. Woodbury, Arizona State University). Activities Addressing Ethical, Legal, and Social Issues Related to Human Genome Project Data In FY 1991, DOE activities on ethical, legal, and social issues (ELSI) included two conferences, three education projects, and three research projects. The first conference, Justice and the Human Genome, held in November 1991 at the University of Illinois College of Medicine, considered discrimination that could result from the use of genetic information about ethnic and other groups. The second conference, held in March 1992 at the Texas Medical Center Institute of Religion, focused on Genetics, Religion, and Ethics. The three education projects on the science and the societal implications of data produced in the Human Genome Project, listed with their preparers, include (1) a module to be developed and distributed to all U.S. high school biology teachers (Biological Sciences Curriculum Study); (2) an educational television series, "Medicine at the Crossroads," which will address the role of genetics in understanding and treating disease (WNET, New York, cofunded with NIH and the National Science Foundation); and (3) a program of hands-on workshops for public officials and other nonscientists (Cold Spring Harbor Laboratory). 
The three ongoing research projects, listed with the institutions developing them, are (1) a study of ethical issues arising from the rapid proliferation of genetic tests that can predict future disease in otherwise healthy individuals [National Academy of Sciences (NAS) Institute of Medicine, cofunded with NIH]; (2) a legal study of confidentiality protection for genetic data (Shriver Center); and (3) a study to consider problems in funding young investigators in biological and biomedical sciences (NAS). In its first 2 years, the DOE Human Genome Program funded a variety of ELSI activities, noted above. To avoid being spread too thinly, the ELSI component of the DOE Program now focuses on confidentiality and privacy concerns raised by increased genetic data about individuals. This sensitive, personal information, which may predict disorders before symptoms occur or treatments are available, can affect a person's self-image, employability, status in the eyes of others, and ability to obtain health insurance. Since genetic knowledge can also lead to better understanding of disease causation and to more-accurate assessments of environmental affronts, a balance must be achieved between the health of the public and the privacy interests of the individual. The DOE Human Genome Program is funding six new projects covering ELSI activities in research and education. One of the three projects investigating genetic discrimination will compare two states (Florida and Georgia), contrasting their genetic testing, screening, and counseling programs and the impact on different ethnic and socioeconomic communities. Another will examine the impact of two genetic conditions (cystic fibrosis and sickle cell disease) on African-Americans and Caucasians. A third will identify particular social institutions that may engage in discrimination and will consider whether the discrimination, if present, is the result of ignorance or systematic policy. 
A fourth project will explore in detail (a) the effect of genetic knowledge on the right of privacy and (b) the uses of genetic information in public health planning. A fifth project will develop a program of educational workshops for secondary and high school science teachers, focused on both the science and the ethical, legal, and social issues arising from data generated by human genome research. A sixth project will involve a second educational television series, "The Secret of Life" (WGBH, Boston), which will address the current revolution in molecular biology and genetics. Other activities include conferences on Genes and Human Behavior: A New Era? (October 1991); Computers, Freedom, and Privacy (March 1992); and Science, Technology, and Ethical Responsibility (scheduled for June 1992). While very challenging issues are raised by genome research, solutions are not simple; defensible rights often exist on both sides of any issue. Further research is needed, as well as activities to promote public awareness and assist in policy development. Also, with the increasing use of computers to assemble, store, and organize data (including genetic data) into large databases, the issues of security and access control become more acute. To begin reorienting and better defining the scope of ELSI activities in the DOE program, the DOE-NIH Joint ELSI Working Group has established a collaborative effort on privacy to identify an ELSI research agenda and develop a more detailed approach to some of these concerns. Technology Transfer and Industrial Collaboration Technology transfer, considered one of the three most important facets of the DOE mission (along with meeting the nation's defense and energy needs), is enhancing U.S. investment in research and technological competitiveness. By creating new products, markets, and jobs, the rapid deployment of technology from the research laboratory to the marketplace can play an important role in vitalizing the U.S. economy. 
A vast potential exists for commercial development of genome resources and technology; applications to clinical medicine have already begun. All participants in the Human Genome Program are encouraged to engage in active collaborations with the private sector and transfer their resources and technologies for commercial development. Each national laboratory has a technology transfer office. The LLNL, LBL, and LANL human genome centers provide a variety of opportunities for collaborations on joint projects or for obtaining direct access to technology. They are also exploring additional ways to increase cooperation with the private sector; a number of interactive projects are now under way, and additional interactions are in the preliminary stages. In some instances, private industries are marketing technologies developed at DOE-sponsored research laboratories and are providing research funds or other resources to the centers; other collaborative programs involve joint development of technologies and their applications to achieve project goals. One mechanism being used by the DOE national laboratories is the Cooperative Research and Development Agreement (CRADA). The first CRADA in the genome project, established by DOE in the spring of 1991, was between Life Technologies, Inc. (LTI) and the LANL Center for Human Genome Studies for technologies developed in the single-molecule sequencing project. In this project an LTI-modified DNA polymerase will be used to label a single DNA strand with four different fluorescent, base-specific tags. After an exonuclease cuts the labeled nucleic acid base pairs from the DNA, the labeled bases will be induced to fluoresce as they pass sequentially through a focused laser beam. The bases can be identified and counted by a sensitive photodetector (see figure on p. 25 for more information). If successful, the technology will allow sequencing of 50,000-bp DNA fragments at 1000 bp/s. 
LTI will have the first opportunity to license products resulting from the joint effort and would pay royalties to LANL under such a license. Potential commercial advancements in the Human Genome Program have also been recognized outside the genome community. Research and Development magazine selected an achievement by Edward Yeung and other Ames Laboratory scientists as one of the 100 most significant developments of 1991. This R&D 100 Award was given for the development of a user-friendly instrument that detects fluorescent molecule concentrations with extremely high sensitivity (based on laser-excited fluorescence), an improvement that may lead to routine high-speed DNA sequencing by capillary gel electrophoresis. A U.S. patent for portions of this technology has been issued, and several commercial manufacturers are considering the possibilities of marketing the instrument. A technology pioneered by LLNL to identify chromosomal abnormalities (e.g., aneuploidy, translocations, and deletions) has been licensed to Imagenetics, Inc., a medical diagnostics firm that will manufacture the technology and provide funding for future research and development. This technology involves the use of specially developed fluorescent dyes called Whole Chromosome Paints™ to detect diseases such as cancers and leukemia. Whole Chromosome Paints are being marketed by LTI. Some other technology transfers from DOE-sponsored genome research, both at the national laboratories and extramurally, are highlighted below. In progress or awaiting finalization are many more developments and agreements, some of which cannot be disclosed at this time because of their proprietary nature. Resources. Collaborative agreements have aided in the further development of several new technologies used in genome research, as well as in their commercial applications. 
New methods are being evaluated for use in isolating mRNA, chromosomes, and restriction fragments; in amplifying hybridization signals; and in extending DNA molecules. In addition, bacterial host strains have been developed that give greater stability to cosmid constructs containing human DNAs. Improvements are being made in DNA detection methods by the development of new probes, stains, and fluorescent dyes. As a result of the recent cloning of the fragile X gene, several companies are negotiating for licenses to develop assays for diagnosing fragile X syndrome, probably the most frequently inherited form of mental retardation. Hardware. Automation and enhancement of data collection and analysis has been the goal of many collaborations with the commercial sector. Equipment is being designed to automate (1) the production of high-density arrays on agarose or filters and (2) clone fingerprinting by gel electrophoresis (as well as the data collection and analysis software). Advanced applications for robotic systems are also being developed. The resolution of DNA fragments is being enhanced by improvements in pulsed-field gel electrophoresis. Resonance ionization spectroscopy is being modified to enable rapid detection of stable isotope labels on DNA following gel electrophoresis. A commercial gel scanner is being developed for reading DNA gels. Software. To aid physical map construction, programs are being designed for efficient clone analysis. Several other image-analysis programs are being developed, including data-capture software for images from video screens in combination with a DNA molecule imaging system. Sequencing. Multiplex sequencing technologies are being used to sequence pathogenic microbes. 
Human Genome Center Research Narratives Lawrence Berkeley Laboratory Since its inception in 1987, the Lawrence Berkeley Laboratory (LBL) Human Genome Center has focused on developing the necessary research and analytical technology to speed genome mapping and decrease the cost of sequencing. Over the last year, LBL has strengthened its ties with the University of California, Berkeley, particularly in the biological sciences. This collaboration fosters interdisciplinary activities in biology, instrumentation, and informatics. Biology The biology component at LBL is concentrating on developing and improving mapping and sequencing strategies for human chromosome 21. To achieve these goals, investigators in each biology project draw on the expertise of the center's instrumentation and computing groups. Two major biology projects are under way, and a third is in development. Physical mapping at LBL is focused on a 10-Mb region of human chromosome 21, and over 90 unique chromosome 21-specific yeast artificial chromosomes (YACs) have been located by fluorescence in situ hybridization (FISH). A new method has been developed that permits rapid isolation of chromosome-specific YACs, using probes isolated from flow-sorted chromosome libraries from Lawrence Livermore National Laboratory. In addition, cDNAs specific to a given YAC are being isolated by an automatable procedure based on magnetic beads. The second major biology effort involves testing new approaches to physical mapping and genomic sequencing. These projects exploit current methods, such as FISH and appropriate pooling strategies, for efficient isolation of overlapping clones. In addition, new work has begun on subcloning and ordering libraries of clones for mapping and on the use of gamma delta transposons as the primer site for sequencing studies. Increased efficiency in constructing physical maps results from a clone-limited strategy for generating maps based on sequence tagged sites (STSs). 
This nonrandom selection strategy reduces the number of STS assays required and produces contigs that cover a larger fraction of the genome. The third biology project is aimed at developing automated methods for generating genetic maps. A simple filter assay will be used to detect heterozygosity at mapped loci in yeast, mice, and human DNA samples. Instrumentation The instrumentation program within the LBL Human Genome Center has two major areas of effort: (1) biology and instrumentation development and support and (2) new instrumentation development based on emerging technologies. Supporting activities include the design and fabrication of gel boxes, automation of protocols on existing robotic frameworks, and the installation and networking of a variety of image-acquisition systems. In addition, advanced robotic [high-speed colony picking, robotic-based polymerase chain reaction, and DNA synthesis] and laboratory systems integration is under development. Efforts to produce new, adaptable technologies for the genome program include optimizing large-molecule detection systems; designing versatile optical fluorescence systems for multiplex labeling; and developing microfabricated arrays for application to large-scale clone libraries, sequencing by hybridization, and other procedures. The use of computer-controlled robotic systems provides a mechanism for automatically capturing the vast amount of data generated by laboratory operations. This requires a close coordination between hardware and software development in laboratory system design that goes far beyond automation of a few discrete protocols. Informatics A major part of the computing and instrumentation effort is driven by biology projects. The center's computing group focuses on specific applications in four major areas: raw data acquisition and analysis, information tracking and management, data interpretation and comparison analysis, and development of software tools. 
Visual data for mapping (including in situ pictures, autoradiograms, ethidium gels, and chemiluminescent staining) are handled by BioPix, a set of programs that assemble and integrate data from image capture to analysis. A similar system is being developed for sequence data. The Chromosome Information System (CIS) allows biologists to search, edit, and compare various maps, markers, and related reference information and to interact with other programs to exchange data. The laboratory data analysis system uses existing software packages and provides system management and support throughout the center. New, in-house analysis packages are being devised for sequence alignment and assembly. Software development tools permit rapid design and modification of database management systems, thus facilitating increased productivity, vendor independence, and conceptual clarity. Achievements * Over 90 independent YACs averaging 100 kb were regionally assigned to human chromosome 21 by FISH. These YACs include genetic markers to help integrate maps. * Two hundred unique probes were isolated for chromosome 21 and are being used to identify YACs from genomic libraries. * A rapid cDNA clone-screening method uses immobilized YAC clones to screen cDNA libraries, which are then localized on specific chromosomes. An alternative screening method uses individual YACs or cosmids attached to magnetic beads to isolate specific cDNAs, a method that can be readily automated to speed identification of coding sequences for physical mapping. * Marker-selected libraries, highly enriched for clones containing (CA)n repeats, were constructed from primary genomic libraries. These enriched libraries increase the efficiency of screening almost 50-fold. * A probe-mapping procedure determines the distance between the probe and the chromosome or YAC end. This method, which uses X rays to break large DNA pieces randomly, can be used to map cDNAs and to estimate the length of entire genes. 
* A double-ended, clone-limited strategy for physical mapping of chromosomes was devised. This strategy maps chromosomes on the order of 100 Mb and should result in larger contigs with a minimum of assays. * CIS, developed by the genome center computing group, was used to produce consensus maps at workshops on human chromosomes 3 and 21 and is being expanded for use with a number of plant species in the Plant Genome Program of the U.S. Department of Agriculture. * High-level database design tools have been developed to permit molecular biologists to define data objects in a way that captures biological concepts. The software automatically generates low-level commands for a commercial database management system, facilitating the evolutionary development of modular system components. These tools are also being used by researchers to design the Superconducting Super Collider database and the Integrated Genome Database. * A variety of mechanical, electrical, and chemical means have been used to manipulate DNA molecules; these methods include stretching molecules physically by externally applied electrical fields and guiding the molecules through grooves in a glass surface; digesting and separating single molecules; and picking up, transporting, and releasing DNA with scanning tunneling microscope (STM) tips. * Investigation of the feasibility of using STM for visualizing the individual bases of single-stranded DNA has shown that while purines and pyrimidines can be distinguished from each other, two bases in the same class cannot be differentiated by this method. * A fast, filter-based assay was developed to identify single base-pair polymorphisms, eliminating the need for gel assays. * Higher throughput was achieved through the construction of a dedicated high-speed colony-picking workstation. The pick rate is 10 to 20 times faster than the initial picking system and both faster and more accurate than a highly qualified human. 
The new picker arrayed an entire library of over 10,000 clones in 1 day. * Robots have been modified for use with a number of chemistry protocols, including cosmid and YAC library replication, various pooling schemes, and high-density filter array production. Using the robot to replicate libraries has made copies available to researchers in the private sector and in other national laboratories. Future Plans * Construction of a 10-Mb contig of human chromosome 21 based on overlapping YACs. The sequence will be determined by the most efficient strategy available. * Sequencing of a P1 clone. Subclone assembly will use a nonrandom strategy, and primer sequences will originate in the transposon gamma delta. * Construction of chromosome genetic maps of human chromosomes 16 and 19 in collaboration with other DOE genome centers. A simple gel-based heterozygosity assay is being developed to support this research. * Development of a computational biology program within the computing group to design and implement new algorithms for sequence assembly. Preliminary data will come from collaborations with other genome centers. * Design and implementation of a software tool suite for managing information and for optimizing the unique strategy of particular research groups. As large-scale sequencing projects develop, new acquisition and analysis software will be integrated into CIS. * Implementation of QUEST, a database tool that will provide a single entry point to the conceptual data model. QUEST will then implement automatically any changes in the user interface, the database query procedures, and the database schema definition. * Optimization of improved detectors and the associated mass spectrometry system for large biological molecules. * Automation of handling and analysis of dot-blot hybridization experiments and the implementation of a high-speed colony-picking apparatus. 
For more information on the LBL Human Genome Center, contact Jasper Rine, Director, or Sylvia Spengler, Deputy Director, at 510/486-4943. Lawrence Livermore National Laboratory The Human Genome Center at Lawrence Livermore National Laboratory (LLNL) is a multidisciplinary team effort that brings together chemists, biologists, molecular biologists, physicists, mathematicians, computer scientists, and engineers in an interactive research environment. Many of these individuals have previously collaborated on research projects in molecular biology, cytogenetics, mutagenesis, and instrumentation, as well as in the National Laboratory Gene Library Project (NLGLP). These projects have contributed substantially to the identification and characterization of human DNA repair genes, specifically the three on chromosome 19 that are a focus of interest at LLNL. The short- and long-term goals of the LLNL effort are to (1) develop biological and physical resources useful for genome research, (2) model and evaluate DNA mapping and sequencing strategies, (3) couple these resources and strategies in an optimal way to construct ordered clone maps and DNA sequences of human chromosomes, and (4) use the map and sequence information to study genome organization and variation. To achieve these goals, the Human Genome Center is organized into three broad research and support areas, each consisting of multiple projects led by a principal investigator. Extensive interaction occurs within and among all projects that have as their common goal the construction of ordered clone maps of the human genome. The program structure of the center includes a core facility and projects that focus on physical mapping and enabling technologies. Research and Support Areas Coordination and collaboration take place with other research groups throughout the world that are involved in the genome initiative or other mutual scientific interests. 
The role of LLNL in the Human Genome Project is seen as encompassing several areas, including technology development, map construction, map interpretation, and integration with ongoing and new programs in structural biology and mutagenesis. The following three components are highly interactive; individual staff members often have responsibilities in more than one component. Core facilities. The administrative group is concerned with budget oversight, external and internal meeting coordination, preparation of center reports, training coordination, property and space management, safety oversight, and secretarial support. The scientific core provides general support to the physical mapping effort, including cell culture and DNA extraction; library, probe, and clone management; oligonucleotide synthesis; fluorescence-based restriction mapping; and DNA sequencing. The core also facilitates material distribution to collaborators in the external community. Mapping activities. Five projects represent the coordinated effort to obtain an overlapping set of clones for human chromosome 19 and to further characterize genomic organization: * Assembly, closure, and characterization of a chromosome 19 contig map. The goal of this project is to construct an overlapping set of cosmid clones using a variety of techniques. An automated fluorescence-based restriction-fragment fingerprinting strategy is used to establish a foundation map of cosmid contigs. The contig closure effort will focus on using yeast artificial chromosomes (YACs) and cosmids with two hybridization-based techniques; one is based on fragments generated from Alu sequence primers or sequence tagged sites (STSs) by the polymerase chain reaction (PCR) and the second on RNA transcripts generated from the ends of cloned inserts. * Interdigitation of the physical and genetic maps of human chromosome 19. 
The goals of this effort are to locate known genetic markers on the expanding contig map, to coordinate the isolation of chromosome 19-specific STSs, and to localize them on the cosmid map. * DNA sequence mapping by fluorescence in situ hybridization (FISH). This project exploits the power of FISH on metaphase chromosomes, interphase cells, and pronuclear DNA. FISH will be used to determine the location of genes of interest and the relative order and orientation of the cosmid contigs. * cDNA mapping. The goal of this project is to isolate, sequence, and map cDNAs (expressed in a variety of human tissues) that will become the STSs on which future studies of genetic organization and gene function will be based. * New mapping strategies. New methods useful for library construction, contig closure, and overlap detection will be developed and validated. Focus is on improving Alu-PCR-based technology and pooling schemes to achieve closure of the chromosome 19 map with cosmids and YACs. Enabling technologies. The following groups provide computational, resource, and instrumentation support for research activities: * Computational support for the Human Genome Center. This group is responsible for mathematical modeling of mapping and sequencing strategies and the development and application of data analysis algorithms and software. They are also responsible for the construction and maintenance of interactive relational databases that enable internal and external data access, including development of graphical visualization tools. * NLGLP. This project, a joint effort with Los Alamos National Laboratory, draws upon LLNL experience in flow instrumentation and chromosome sorting to construct human chromosome-specific libraries in lambda and cosmid vectors for use in physical mapping and other studies. * Instrumentation for cytogenetics and gene mapping. 
This group is responsible for developing instrumentation to facilitate flow systems analysis and chromosome sorting and to support FISH. Accomplishments The LLNL Human Genome Center has made excellent progress in the construction of an ordered set of cosmids for chromosome 19, the development and application of new biochemical and mathematical approaches for constructing ordered clone maps, the automation of fingerprinting chemistries, and high-resolution imaging of DNA. Major accomplishments are highlighted below. * Considerable progress has been made toward the closure of the chromosome 19 physical map. More than 10,000 cosmids have been analyzed by an automated fluorescence-based fingerprinting approach and assembled into over 870 contigs that span about 80% of the chromosome. FISH has been used to locate over 400 cosmids and 117 contigs on the cytological map, and more than 70 known genetic markers have been located on cosmid contigs. Closure of the gaps between contigs is under way using YACs and cosmids. * Cosmid contigs analyzed in the carcinoembryonic antigen (CEA) gene family region of chromosome 19 were found to be tightly linked over relatively short stretches of DNA. This gene family of about 22 members appears to span a contiguous region of about 1 Mb. With probes made from the ends of these contigs, hybridization techniques were applied to join contigs established by fingerprinting into larger contigs. In addition, almost 2 Mb surrounding the myotonic dystrophy locus were linked with cosmids and YACs. * More than 20 clones containing DNA sequences corresponding to a number of important genes and regions that map to chromosome 19 were isolated from two separate YAC libraries. Among these clones were the region encoding the LDL receptor and ApoE gene, two important components of the regulation of cholesterol and triglyceride metabolism in humans. 
Similarly, a region was isolated that encodes a family of serine proteases called kallikreins, whose role is the specific proteolytic activation of peptide hormones and growth factors. Clones of these regions are being used for the structural analysis and mapping of these genes. * A structural defect found in the cloned gene linked to the autosomal dominant disease myotonic dystrophy has been identified through an international collaboration. This chromosome 19 defect, which is characterized by a tandemly repeated segment of DNA within or close to the coding region on q13.3, is similar to that seen in the fragile X syndrome. The extent of the amplified region appears to be associated with the severity of the disease. * The gene for DNA ligase 1 was mapped to the long arm of chromosome 19. A defect in this gene may be associated with increased cancer risk. This is the fourth gene involved in DNA metabolism that has been mapped to this region of chromosome 19. * Significant progress was made in defining the organization of the cytochrome P450 genes mapping to chromosome 19. Multiple members of each of the three subfamilies were identified. The cosmids containing these genes will be useful resources for studies of the function and physiological importance of the genes. * Three levels of resolution of FISH have been developed and applied to localize and orient cosmids. Localizing cosmids to metaphase chromosomes provides a resolution of about 1 to 3 Mb. Localization to somatic interphase cells gives a resolution of 50 kb to 1 Mb, and hybridization to sperm pronuclei gives a resolution of 20 kb to 1 Mb. With FISH, a linear relationship was demonstrated between physical distance and genomic distance, from 20 kb up to at least 800 kb, in pronuclei derived from human spermatozoa. With a single probe, the presence of multiple copies of the closely related genes of the CEA family has been detected in human sperm pronuclei. 
Single and multicolor hybridizations are routinely performed. * A reproducible method of mapping YACs by FISH has been developed. This procedure involves isolating YACs with pulsed-field gels, digesting with the restriction enzyme Mbo I, ligating to oligonucleotide linker adapters, and amplifying with PCR. The products are then mapped onto human metaphase chromosomes by standard FISH methods. * The technique of Alu-PCR has been further exploited. To isolate region-specific DNA probes from human-rodent hybrid cell lines, previously developed PCR procedures were expanded. Human sequences are preferentially amplified using PCR primers specific for repeats of the human Alu repeat family. Several new primers have been developed that amplify human DNA sequences very efficiently, further facilitating probe isolation from human genome regions present in the available hybrids. Many different human sequences amplify from the hybrids; individual probe sequences are obtained by subsequent cloning in plasmid vectors in Escherichia coli. To expedite this, ligation-independent cloning has been developed to increase efficiency of cloning and eliminate the common background of clones that do not contain recombinant DNA molecules. In addition, an efficient procedure has been developed to clone the PCR products common to two cell lines. This method, coincidence cloning, permits a further enrichment for sequences derived from defined regions of the genome. * Clone-pooling schemes have been developed to facilitate screening of both cosmid and YAC libraries. Each clone is present in a number of different pools, reducing the number of DNA samples that must be deposited on a high-density filter for hybridization-based screening and the number of tubes needed for PCR-based screening. Since each clone is defined by a unique combination of pools, the screening of pools by probe hybridization permits identification of the recombinants shared by a number of pools. 
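The pooling principle described above can be illustrated with a minimal sketch. It assumes a simple three-axis layout in which each clone has a (plate, row, column) address and belongs to exactly one plate pool, one row pool, and one column pool; this layout and the function names `build_pools`, `screen`, and `decode` are illustrative assumptions and are not taken from the LLNL scheme itself.

```python
from itertools import product


def build_pools(n_plates, n_rows, n_cols):
    """Place every clone address (plate, row, col) into three pools:
    one per plate, one per row, and one per column (assumed layout)."""
    pools = {}
    for p, r, c in product(range(n_plates), range(n_rows), range(n_cols)):
        for key in (("plate", p), ("row", r), ("col", c)):
            pools.setdefault(key, set()).add((p, r, c))
    return pools


def screen(pools, positive_clones):
    """Simulate a hybridization or PCR screen: a pool scores positive
    if it contains at least one positive clone."""
    return {key for key, members in pools.items() if members & positive_clones}


def decode(pools, positive_pools):
    """Return the clones whose plate, row, and column pools all scored positive."""
    candidates = None
    for axis in ("plate", "row", "col"):
        hit = set()
        for key in positive_pools:
            if key[0] == axis:
                hit |= pools[key]
        candidates = hit if candidates is None else candidates & hit
    return candidates or set()


if __name__ == "__main__":
    pools = build_pools(4, 8, 12)           # 384 clones in 4 + 8 + 12 = 24 pools
    positives = screen(pools, {(2, 3, 5)})  # one probe-positive clone
    print(len(pools), decode(pools, positives))
```

With a single positive clone the combination of positive pools identifies it directly. When several clones are positive at once, this simple layout returns every address consistent with the positive pools, so the candidates would need a confirming screen; more elaborate pool designs reduce that ambiguity at the cost of more pools.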
This approach was used very successfully to screen a 10,000-clone cosmid library. The idea also was used to consolidate a 60,000-clone YAC library into about 1800 sample pools. Results demonstrated that hybridization-positive YAC pools can, indeed, be distinguished from hybridization-negative YAC pools, thus allowing the efficient identification of YAC clones. * Human YACs were isolated from a library constructed using a monochromosomal 19 hybrid cell line. The YACs vary in size between 120 and 350 kb. One of the analyzed YACs carries sequences from the telomere region of chromosome 19, and another maps to the centromere region of chromosome 19 by FISH. * A second-generation suite of robust, reliable computer programs was completed for signal preparation and analysis of chromosome 19 restriction fragment fingerprints. These programs implement methods for random noise suppression, background subtraction, and color decorrelation. A new program (TIMEWARP) was also completed to map peak locations in a gel to a common coordinate system by dynamic programming and shape-preserving spline interpolation. * The Sybase database has been enhanced to contain all the laboratory notebook and experimental data important to physical map construction. This includes clone repository information, restriction fragment fingerprinting, and data on probe hybridization and FISH. The database is coupled to the graphical browser so the end user can retrieve many of the experimental results in graphical form. * The graphical database browser was enhanced to run Human Genome Project data remotely over Internet. The browser's ability to link to multiple databases at external collaborator sites has been demonstrated. * In a collaborative effort, automatic transnetwork methods for transferring physical mapping results to the central Genome Data Base (GDB) at Johns Hopkins were built, tested, and implemented by GDB and LLNL. 
This work was in support of DOE concerns that all laboratories should effect mechanisms to ensure that data are made available to the appropriate public databases after a suitable time period. Prototype methods were implemented, tested, and publicly demonstrated for logically linking our database with the major sequence and mapping databases (GenBank and GDB). Direct transnetwork queries that logically integrate these data sets are now feasible. * As part of NLGLP, high-speed flow sorting was used to purify individual human chromosomes for cloning. Large-insert phage and cosmid libraries have been made for chromosomes 9, 12, 18, 19, 21, 22, and Y. Several libraries have been distributed to users and evaluation sites. In addition, the high-speed sorter has been rebuilt with new fluidics to optimize sterility and with new electronics to increase the purity of the sorted material. * Construction of a new high-speed chromosome sorter was completed. This instrument has new digital acquisition electronics, a new fluidic system, and a more stable sample stream. The instrument analyzes chromosomes at the rate of up to 20,000/s and can reliably produce 250 to 1000 ng of sorted chromosome DNA equivalents per day. * Using scanning tunneling microscopy (STM), individual images of the bases adenine and thymine were obtained at atomic resolution, indicating that a scanning-probe microscopy technique can discriminate between purines and pyrimidines. * Several technologies have been transferred to industry. They include software for analysis and graphical display of physical map data, sequence information for the commercialization of Alu-PCR primers, and vectors for the construction of cosmid libraries. 
In addition, collaborative research programs with industry have continued in the areas of fluorescence-based restriction fragment analysis, development of pulsed-field gel systems, development and testing of automated and high-throughput plasmid/cosmid DNA extraction, and development and testing of a robot for high-density colony replication on filters. Future Plans The LLNL genome center's first priority is to complete, to the extent possible, an ordered clone map of chromosome 19; this physical map will likely be a composite linear array of cosmid, lambda, and YAC clones. It will be correlated with the genetic map to assist the scientific community in localizing and isolating all genes from chromosome 19. State-of-the-art technology will be used to sequence selected high-interest regions of the chromosome. Once the technology has been validated for map construction of a large portion of chromosome 19, efforts will be directed to chromosome 2. When Human Genome Project emphasis shifts from mapping to sequencing, exploration will turn to rapid automated DNA sequencing methods that can use large fragments such as cosmids or YACs as templates. STM and X-ray imaging technologies under development at LLNL are expected to contribute to advancements in sequencing. Automation is an essential element of physical mapping. New processes and instruments will be explored to reduce the need for human intervention in highly repetitive tasks. A number of instruments for clone manipulation and biochemical processes will be considered for automation. An effort to map and sequence the cDNAs expressed in a variety of human tissues has recently been initiated. These cDNAs will be used to generate STSs and will serve as the foundation for future studies of gene organization and gene function. Assisting the scientific community in completing ordered clone maps is critical and will remain a high priority. 
LLNL intends to serve as a resource laboratory for clones and for map information on chromosomes of interest. Ultimately, map and sequence information will be used to study the global architecture of the chromosome and also to evaluate human somatic and genetic variation, both spontaneous and induced. For more information on the LLNL Human Genome Center, contact Anthony Carrano, Director, at 510/422-5698 or Leilani Corell, Administrator, at 510/423-3841. Los Alamos National Laboratory The Center for Human Genome Studies at Los Alamos National Laboratory (LANL) provides direction, coordination, and technical oversight for the LANL portion of the DOE Human Genome Program. The center draws scientific talent from six technical divisions at LANL. Molecular biologists, chemists, physicists, mathematicians, computer scientists, and engineers are contributing to progress in physical mapping, technology development, and informatics. Although a specific goal is the assembly of a complete physical map for human chromosome 16, much of the work is broadly supportive of the worldwide Human Genome Project. Collaborative research and development programs have also been initiated with private-sector and other institutions involved in human genome research. The major technical subdivisions of the center are physical mapping, technology development, and informatics. Activities are also under way at the center to explore ethical, legal, and social issues arising from genome research data and to transfer technology developed within the center's projects. Physical Mapping Physical mapping includes the development of conceptual advances in mapping strategy and the construction of a physical map of chromosome 16. The physical map will be composed of phage, cosmid, and YAC contigs ordered by repetitive sequence fingerprinting. These ordered contigs will be integrated with the genetic linkage map, the cytogenetic map, and known gene sequences on chromosome 16. 
The final map, along with its eventual translation into a sequence tagged site (STS) map, will provide the means for rapid access to any region of the chromosome for further analysis. In addition, the ordered clone sets will be available for eventual sequencing. Technology Development Technology development efforts include the application of robotics to the handling and storage of DNA fragments, the development and application of methods for the construction of DNA libraries from flow-sorted chromosomes, and the development of new methods for rapid, inexpensive, large-scale sequencing. All these projects are or will be supportive of the physical mapping of chromosome 16, and they also contribute to the larger genome program. For example, the construction and distribution of various kinds of libraries from sorted chromosomes is playing a significant role at many of the genome research centers. Informatics Informatics efforts involving the collection and analysis of genome-related data will play an increasingly important role in the genome project. LANL has a long history of expertise in this research area and will continue to lead in providing these essential resources. Ethical, Legal, and Social Issues (ELSI) Activities The center also sponsors active participation in ELSI studies related to data produced by human genome research and is compiling a comprehensive literature bibliography in collaboration with Georgetown University. LANL scientists participated in a series of discussions on ELSI issues sponsored by the University of California Humanities Research Institute. Technology Transfer LANL will continue to put a high priority on collaborations with private industry to use the skills and resources of the private sector and to ensure effective technology transfer to the U.S. commercial sector. The first Cooperative Research and Development Agreement (CRADA) involving human genome research activity was signed in 1991 by LANL and Life Technologies, Inc. (LTI). 
Recent Progress and Future Directions Construction of a physical map of chromosome 16. The chromosome-mapping strategy at LANL involves the rapid generation of cosmid contigs representing around 60% of the target chromosome, followed by directed gap closure with yeast artificial chromosomes (YACs). The first phase of this goal, the rapid generation of nucleation contigs on chromosome 16, has been completed [Stallings et al., Proc. Natl. Acad. Sci. USA 87: 6218-22 (1990)]. An approach for identifying overlapping cosmid clones by exploiting the high density of repetitive sequences in human DNA was used to generate 553 contigs following the fingerprinting of over 4500 individual cosmid clones. These contigs represent more than 80% of the euchromatic arms of chromosome 16 and were constructed with about one-fourth as many cosmid fingerprints as random strategies requiring 50% minimum overlap detection. Nucleating at specific regions allows (a) the rapid generation of large (>100 kb ) contigs in the early stages of contig mapping and (b) the production of a contig map with useful landmarks [i.e., (GT)n repeats] for rapid integration of the genetic and physical maps. All 4500 fingerprinted cosmids in contigs and singlets have been rearrayed on high-density filters. Such filters already provide investigators with access to more than 90% of chromosome 16, with a 60% probability that any region is already present in a contig. These high-density chromosome-specific cosmid filter arrays have also proved useful for YAC fingerprinting with repetitive sequence polymerase chain reaction (PCR) techniques. In collaboration with the laboratories of David Ward (Yale University) and David Callen (Adelaide Children's Hospital, Australia), 130 of these arrayed cosmids have been regionally localized via in situ hybridization or somatic cell hybrid panels. The average gap (containing only singlets), approximately 65 kb in length, can be easily closed with YACs. 
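The idea of grouping fingerprinted clones into contigs and singlets can be sketched in a few lines. This is only an illustration under stated assumptions, not the LANL repetitive-sequence method: each fingerprint is modeled as a list of fragment sizes, two clones are treated as overlapping when at least half of the smaller fingerprint is matched within a 2% size tolerance, and overlapping clones are merged with a union-find structure. The names `shared_fraction` and `assemble`, the tolerance, and the example data are hypothetical.

```python
from itertools import combinations


def shared_fraction(a, b, tol=0.02):
    """Fraction of fragments in the smaller fingerprint that have a
    partner of similar size (relative tolerance tol) in the other one."""
    small, large = sorted((a, b), key=len)
    unused = list(large)
    matched = 0
    for size in small:
        for i, other in enumerate(unused):
            if abs(size - other) <= tol * max(size, other):
                matched += 1
                del unused[i]  # each fragment may be matched only once
                break
    return matched / len(small) if small else 0.0


def assemble(fingerprints, min_overlap=0.5):
    """Group clones into contigs by single linkage on pairwise overlap.
    Returns (contigs, singlets); clone order within a contig is not inferred."""
    parent = {name: name for name in fingerprints}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(fingerprints, 2):
        if shared_fraction(fingerprints[a], fingerprints[b]) >= min_overlap:
            parent[find(a)] = find(b)

    groups = {}
    for name in fingerprints:
        groups.setdefault(find(name), []).append(name)
    contigs = [sorted(g) for g in groups.values() if len(g) > 1]
    singlets = sorted(g[0] for g in groups.values() if len(g) == 1)
    return contigs, singlets


if __name__ == "__main__":
    example = {  # hypothetical fragment sizes in base pairs
        "c1": [1200, 2500, 4100, 5300],
        "c2": [2500, 4100, 5300, 7000],
        "c3": [5300, 7000, 8800, 950],
        "c4": [600, 1500, 3300],
    }
    print(assemble(example))
```

A sketch like this also shows why repeated sequences cause trouble: any two clones that share enough fragments are merged whether or not they are truly adjacent, which is how false overlaps can inflate a contig.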
A single walk from each end of current contigs should, statistically, reduce the number of contigs to approximately 50, one of the 5-year goals of the Human Genome Project (i.e., 1- to 2-Mb contigs; >95% coverage). To facilitate closure, LANL investigators are constructing from monochromosomal hybrids and flow-sorted material both a total genomic YAC library (from cell line GM130, using the vectors pJS97 and pJS98; currently onefold representation) and chromosome 16 YAC clones. One hundred STS markers are being generated to key contigs. Extensive analyses of the DNA sequences obtained from contig ends are in progress using multiple approaches to identify potential coding regions. These approaches include nucleotide and translated amino acid sequence homology searches against GenBank, using BLAST and FASTA, and the new adaptive network program, GRAIL, developed and made available by the Oak Ridge National Laboratory. Current progress with YAC closure indicates that the complete physical map of chromosome 16 will be achieved in the next few years. Low-abundance repetitive DNA sequences identified on chromosome 16. Chromosome 16-specific, low-abundance repetitive DNA sequences (designated CH16LARs) have been identified during construction of the cosmid contig map of this chromosome. CH16LARs were initially identified by in situ hybridization of cosmid and YAC clones to normal human chromosomes (in collaboration with David Ward). The cosmid clones all came from contig 55. The hybridization signals were unusually intense and occurred on four regions of human chromosome 16: bands p13, p12, p11, and q22. Contig 55 contains more clones than any other contig (78 clones or 2% of all clones fingerprinted thus far). Ordering clones within contig 55 is not possible because the presence of these low-abundance repetitive DNA sequences has generated false overlaps. The regions containing CH16LARs may cover as much as 5% of the euchromatic arms of chromosome 16 (~5 Mb of DNA). 
One CH16LAR sequence (CH16LAR1) was cloned and sequenced, and a minisatellite type of repetitive sequence was identified. The region containing CH16LARs is of biological interest since the pericentric inversion breakpoints commonly found in myelomonocytic leukemia fall within these regions [Mitelman, Hereditas 104: 113 (1986)]. Alternative strategies for mapping and ordering clones from this region are being implemented. Construction and distribution of DNA libraries from flow-sorted chromosomes: National Laboratory Gene Library Project (NLGLP). NLGLP is a cooperative project between LANL and Lawrence Livermore National Laboratory. Investigators at LANL have cloned a set of complete digest libraries into the EcoR I insertion site of Charon 21A; they are available from the American Type Culture Collection, Rockville, Maryland. Sets of partial digest libraries in the cosmid vector sCos1 and in the phage vector Charon 40 are being constructed for human chromosomes 4, 5, 6, 8, 10, 11, 13, 14, 15, 16, 17, 20, and X. Individual human chromosomes are first sorted from rodent-human hybrid cell lines until about 1 µg of DNA has been accumulated. The sorted chromosomes are then examined for purity by in situ hybridization, and the DNA is extracted and partially digested with the restriction enzyme Sau 3AI, dephosphorylated, and cloned into vectors. Partial digest libraries have been constructed for chromosomes 4, 5, 6, 8, 11, 13, 16, 17, and X. Purity estimates from sorted chromosomes, flow-karyotype analysis, and plaque or colony hybridization indicate that most of these libraries are 90 to 95% pure. Additional cosmid library constructions and arrays of libraries having five- to tenfold genomic coverage into microtiter plates are in progress. Libraries have been constructed in M13 or Bluescript vectors to generate STS markers for selecting chromosome-specific inserts from a genomic YAC library. 
LANL has also cloned sorted DNA into YAC vectors and expects to construct a series of YAC libraries representing individual chromosomes (see below). A YAC library for human chromosome 21. YACs have been constructed using DNA isolated from aliquots of flow-sorted human chromosome 21. Chromosomes were prepared from the somatic cell hybrid WAV-17, which contains chromosome 21 as the only human chromosome. DNA isolated from sorted chromosomes was restricted with either Cla I or Eag I or both Not I and Nhe I, ligated to YAC vectors pJS97 and pJS98, and transformed into Saccharomyces cerevisiae strain YPH 250. The transformation efficiency of YACs ranged from 600 to 2500 cfu/µg of sorted DNA. About 1200 human YACs with an average size of 200 kb have been identified. The locations of 20 random YACs on chromosome 21 were confirmed by hybridization to somatic cell hybrid mapping panels. Three YACs that hybridize to D21S55 have been identified and are being used to initiate construction of a physical map of the Down's syndrome region of chromosome 21. Sixty YAC clones from the chromosome 21 library were localized on chromosome 21 by in situ hybridization. The results indicate that the library contains inserts that are well distributed along the length of the chromosome and that the frequency of chimeric inserts is low (below 3%). A collaboration between the genome centers at LANL and Lawrence Berkeley Laboratory (LBL) will use the library for comprehensive physical mapping of chromosome 21. The ability to construct chromosome-specific YAC libraries from sorted chromosomes will facilitate isolation of disease genes and construction of long-range physical maps of complex genomes. LBL is working on chromosome 21 in cooperation with LANL. Chromosome-specific STS libraries. Specific STSs have been systematically generated using flow-sorted chromosomes. 
DNA from about 200,000 chromosomes was digested with either one or two restriction enzymes (usually BamH I and Hind III) and cloned directly into bacteriophage M13mp18. One-pass sequencing was conducted, either manually or with a Dupont Genesis 2000 automated sequencer. DNA sequences were analyzed for the presence of sequence similarity to common human repetitive sequences, and appropriate PCR oligomers were synthesized. An acceptable STS-PCR assay yielded the appropriately sized product from both the hybrid cell line DNA containing only the human chromosome of interest and the pools of 384 anonymous YAC clones, spiked with 5 ng/ml total human DNA. To date, over 340 kb of anonymous DNA sequence from human chromosomes 5 and 7 have been analyzed. Two hundred STS markers for chromosome 7 have been generated in collaboration with Maynard Olson's laboratory at Washington University [Green et al., Genomics (in press)], and the first 100 STS markers for chromosome 5 are currently being generated in collaboration with John Wasmuth's laboratory at the University of California, Irvine; 50 STSs for chromosome 5 have been regionally localized. The overall efficiency of PCR reactions yielding appropriate products, with the anonymous genomic sequences from flow-sorted chromosomes, has been approximately 75%. GRAIL analyses indicate that approximately 15% of both the chromosome 16 STSs and the randomly selected STSs for chromosomes 5 and 7 contain putative coding regions. Informatics. The Laboratory Notebook database, designed to manage all information necessary for map assembly, has been expanded to include sequences, STS mapping information, and grid hybridization data, as well as clone fingerprints and completed maps. The forms-based interface is being expanded to provide easy access to the new tables. Graphical interfaces and innovative algorithms to aid map assembly have been prototyped and are being refined. Integrated, multilevel maps are increasing in importance. 
A strong emphasis for the coming year will be to implement the Software for Integrated Genome Map Assembly (SIGMA) system, which was designed to aid in display, assembly, evaluation, and editing of integrated maps. DNA sequencing based upon single-molecule detection in flow cytometry. This project addresses the problem of rapidly sequencing bases in large fragments of DNA. A DNA fragment of about 40 kb will be labeled with base-identifying tags and suspended in the flow stream of a flow cytometer capable of single-molecule detection. The tagged bases will be sequentially cleaved from the single fragment and identified as the liberated tag passes through the laser beam. A sequencing rate of 100 to 1000 bases/s on DNA strands of around 40 kb is projected [Genet. Anal. 8: 1 (1991)]. Accomplishments of this project are as follows: * Signed CRADA with LTI for joint research on DNA sequencing. LTI will offer expertise in nucleic acid chemistry and enzymology, and LANL will specialize in detection technology and DNA handling. LTI will commercialize the technique [for more information, refer to the figure on p. 25 and to Human Genome News, 3(1): 5 (May 1991)]. * Detected several different kinds of single fluorescing molecules with ~85% efficiency and low error rates [Chem. Phys. Lett. 174: 553 (1990)]. * Observed photon bursts simultaneously from rhodamine-6G and Texas Red, using both a doubled Nd/YAG and a synchronously pumped dye laser for excitation and dual-wavelength detection. * Synthesized DNA fragments up to 500 nucleotides long that contain one fluorescent nucleotide and three normal nucleotides. DNA synthesis was observed with rhodamine-dCTP, rhodamine-dATP, rhodamine-dUTP, fluorescein-dATP, and fluorescein-dUTP. This work was a collaboration with LTI. * Digested the fluoresceinated DNAs described above by six different exonucleases: native T4 polymerase, native T7 polymerase, Klenow fragment of Escherichia coli pol I, exo III, E. 
coli pol III holoenzyme, and snake venom phosphodiesterase. LTI also participated in these investigations. Robotic workcell for DNA filter array construction. A gantry robot-based workcell has been assembled to array small spots of DNA in an interleaved format. Grid densities on these membrane filters can be varied from 576 to 9216 spots per 22 cm2. The robot picks a microtiter plate from a dispenser, scans a barcode label, removes the plate cover, and inserts a 96-pin gridding tool into the plate wells. The tool is then positioned at the appropriate place on the membrane, and the solutions on the pins are transferred as spots. The gridding tool is washed and sterilized, the lid replaced on the microtiter plate, and the plate placed into a receiving stacker. The entire sequence is repeated with new plates until the desired array has been constructed. For more information on the LANL Center for Human Genome Studies, contact Robert K. Moyzis, Director, or Larry Deaven, Deputy Director, at 505/667-3912. Program Management Infrastructure DOE OHER Mission Genetics and radiation biology have been a long-term concern of the DOE Office of Health and Environmental Research (OHER) and DOE forerunners, the Atomic Energy Commission (AEC) and the Energy Research and Development Administration (ERDA). In the United States, the first federal support for genetics research was through AEC. In the early days of nuclear energy development, the focus was on radiation effects and later broadened under ERDA and DOE to include the health implications of all energy technologies and their by-products (see "Enabling Legislation" in box below). Today, an extensive program of OHER-sponsored research on genomic structure, maintenance, damage, and repair continues at the national laboratories and universities. 
Some major components of OHER genetics research are (1) molecular cloning and characterization of DNA repair genes, (2) improvement of methodologies and resources for quantitating and characterizing mutations, and (3) the focused resource and technology development needed to map and sequence the human genome (the Human Genome Program). Enabling Legislation The Atomic Energy Act of 1946 (P.L. 79-585) provided the initial charter for a comprehensive program of research and development related to the utilization of fissionable and radioactive materials for medical, biological, and health purposes. The Atomic Energy Act of 1954 (P.L. 83-703) further authorized AEC "to conduct research on the biologic effects of ionizing radiation." The Energy Reorganization Act of 1974 (P.L. 93-438) provided that responsibilities of ERDA shall include "engaging in and supporting environmental, biomedical, physical and safety research related to the development of energy resources and utilization technologies." The Federal Nonnuclear Energy Research and Development Act of 1974 (P.L. 93-577) authorized ERDA to conduct a comprehensive nonnuclear energy research, development, and demonstration program to include the environmental and social consequences of the various technologies. The DOE Organization Act of 1977 (P.L. 95-91) instructed the department "to assure incorporation of national environmental protection goals in the formulation and implementation of energy programs; and to advance the goal of restoring, protecting, and enhancing environmental quality, and assuring public health and safety," and to conduct "a comprehensive program of research and development on the environmental effects of energy technology and programs." Human exposure to environmental factors and the body's response to such factors are a major concern. 
Unavoidable genome-damaging agents in the environment include natural radiation sources, such as the components of sunlight, cosmic rays from space, and radon from the earth. Both inorganic and organic chemicals, some natural to the environment and others generated by human commerce and energy-related processes, put people at risk. Normal biological functions also contribute to the risk of genetic damage when the body's own cells produce potentially damaging molecules in the course of metabolic processes such as defensive actions against microbes, detoxification of harmful environmental substances, and cell proliferation. Even DNA is not completely stable chemically; its normal methylcytosine constituent has a low but measurable rate of spontaneous mutagenic change. Systems that reverse many types of DNA damage have evolved to include a wide range of repair mechanisms within cells of all species. Humans show great diversity in this capacity, with repair-gene deficiencies showing up as sensitivity to DNA damage from low-level radiation and in diseases such as cancer. Some human genes that contribute to DNA repair processes have been characterized, and others await detection and molecular cloning. A goal of the OHER program is to improve the capabilities for diagnosing individual susceptibility to genome damage. The genome program is providing fundamental information about the linear structure of chromosomes and genes, but understanding gene function requires other types of knowledge. Elucidating the three-dimensional (3-D) structure of proteins is crucial in explicating their functions. To advance these studies, several unique facilities for 3-D microstructure research, developed and maintained at DOE laboratories (see box on DOE facilities), are increasingly in demand by molecular biologists. 
To carry out its national research and development obligations, OHER conducts the following activities: * Sponsors research and development projects at universities, in the private sector, and at DOE national laboratories; * Uses the unique capabilities of multidisciplinary DOE national laboratories for the nation's benefit; * With advice from the scientific community and other sectors of government, considers novel, beneficial initiatives; and * Provides expertise on various governmental working groups. David J. Galas has directed OHER, an office of the DOE Office of Energy Research, since April 1990. He also serves under the White House Office of Science and Technology Policy as Cochair of the Committee on Life Sciences and Health and as Chairman of its Subcommittee on Biotechnology Research. John C. Wooley became OHER Deputy Associate Director in June 1992. The Human Genome Program, conceived as an Initiative within OHER, is administered primarily through the Health Effects and Life Science Research Division, directed by David A. Smith. The Medical Applications and Biophysical Research Division, directed by Robert W. Wood, monitors the instrumentation sector of the Human Genome Program and, more broadly, sponsors research and development of resources and instrumentation having biomedical and biotechnological applications. 
Major DOE Facilities and Resources Relevant to Molecular Biology Research

Center for X-Ray Optics (LBL)
GenBank(R) Data Sequence Repository (LANL)
High Flux Beam Reactor (BNL)
Los Alamos Neutron Scattering Center (LANL)
National Flow Cytometry Resource (LANL)
National Laboratory Gene Library Project (LANL, LLNL)
Protein Structure Data Bank (BNL)
National Synchrotron Light Source (BNL)
Scanning Transmission Electron Microscope Resource (BNL)
Stanford Synchrotron Radiation Laboratory (Stanford)
GRAIL, Online Sequence Interpretation Service (ORNL)

Program Management Task Group

The Human Genome Program Management Task Group (see box for list of members) reports to the OHER Director and works to coordinate the following within OHER:
* peer review of research proposals, using both prospective and retrospective evaluations and
* administration of awards, collaboration with all concerned agencies and organizations, organization of periodic workshops, and responses to the needs of the developing program.

DOE Human Genome Program Management Task Group in 1992

David A. Smith, Chair, Molecular biologist
Ann M. Barber, Computational biologist
Benjamin J. Barnhart, Geneticist
Daniel W. Drell, Biologist
Gerald Goldstein, Physical scientist
Murray Schulman, Radiation biologist
Jay Snoddy*, Molecular biologist
Marvin Stodolsky, Molecular biologist
John C. Wooley, Biophysicist
*On detail from Argonne National Laboratory.

Field Coordination

Human Genome Coordinating Committee (HGCC)

Another component of the OHER management structure, HGCC was formed in October 1988 to represent DOE genome program researchers along with observers from other government and private agencies (see box for list of HGCC members). Members of the Human Genome Program Management Task Group are ex-officio members of HGCC, and they participate in the regularly scheduled HGCC meetings.
HGCC responsibilities include the following:
* assisting OHER with overall coordination of DOE-funded genome research;
* facilitating the development and dissemination of novel genome technologies;
* ensuring proper management and sharing of data and samples;
* participating with other national and international efforts; and
* recommending establishment of ad hoc task groups to analyze specific areas, such as ethical, legal, and social issues; informatics requirements; mapping and sequencing technologies; use of the mouse as a model organism; cost of resource distribution; and use of chromosome flow-sorting facilities.

Human Genome Coordinating Committee Members in 1992

Elbert W. Branscomb, Computational Biologist, Human Genome Center, Lawrence Livermore National Laboratory
Charles R. Cantor, Principal Scientist, DOE Human Genome Program, Lawrence Berkeley Laboratory
Anthony V. Carrano, Director, Human Genome Center and Leader, Biomedical Sciences Division, Lawrence Livermore National Laboratory
C. Thomas Caskey, Director, Institute for Molecular Genetics, Baylor College of Medicine
David J. Galas, Office of Health and Environmental Research, DOE
Raymond F. Gesteland, Professor and Cochair, Department of Human Genetics, University of Utah; Investigator, Howard Hughes Medical Institute Laboratory for Genetic Studies at the Eccles Institute, University of Utah
Leroy E. Hood, Director, Center for Integrated Protein and Nucleic Acid Chemistry and Biological Computation; Director, Cancer Center, California Institute of Technology
Robert K. Moyzis, Director, Center for Human Genome Studies, Los Alamos National Laboratory
Jasper Rine, Director, Human Genome Center, Lawrence Berkeley Laboratory
Robert J. Robbins, Director, Welch Medical Library for Applied Research in Academic Information, Johns Hopkins University
David A. Smith, Office of Health and Environmental Research, DOE
Lloyd M. Smith, Assistant Professor, Analytical Division, Department of Chemistry, University of Wisconsin, Madison
John C. Wooley, Office of Health and Environmental Research, DOE

HGCC Executive Officer: Sylvia J. Spengler, Deputy Director, Human Genome Center, Lawrence Berkeley Laboratory

A Principal Scientist is a member of HGCC, reports to the Human Genome Program Task Group regarding the responsibility of keeping the program at the leading edge of genome research, and conveys recommendations on broad scientific policies to HGCC. Currently serving as a Principal Scientist is Charles R. Cantor, Lawrence Berkeley Laboratory.

Human Genome Management Information System (HGMIS)

As an aid to the DOE Human Genome Program Task Group, communication and information services are provided by HGMIS at Oak Ridge National Laboratory. In this role HGMIS facilitates international communication among management and research personnel and informs other interested persons about genome research. HGMIS publications, such as the bimonthly newsletter Human Genome News and technical and program reports, are available to anyone interested in the genome project. Human Genome News is jointly supported by OHER and the NIH National Center for Human Genome Research (NCHGR). Subscribers to the newsletter number over 13,000 and include genome and basic researchers at national laboratories, universities, and other research institutions; professors and teachers; industry representatives; legal personnel; ethicists; students; genetic counselors; physicians; the press; and other interested individuals. In the first quarter of 1992, over 5000 Genome Data Base users were added to the mailing list. Subscribers outside the United States include more than 3000 individuals and institutions in 48 countries.
Human Genome Distinguished Postdoctoral Fellowships In 1990 OHER established the Human Genome Distinguished Postdoctoral Research Program to support research on projects related to the DOE Human Genome Program. The postdoctoral program developed from a 1988 recommendation of the DOE Energy Research Advisory Board to "increase support through expansion of the targeted (science and engineering) graduate and postgraduate research fellowship programs with emphasis given to energy-related areas of greatest projected human resource shortages." Recipients of the first fellowships, awarded in FY 1991, are listed below. 1991 DOE Human Genome Distinguished Postdoctoral Fellows* Xiaohua Huang (Stanford University, Biophysical Chemistry) Host: University of California, Berkeley Ben Koop (Wayne State University, Molecular Biology and Genetics) Host: California Institute of Technology Carol Soderlund (New Mexico State University, Computer Science) Host: Los Alamos National Laboratory Harold Swerdlow (University of Utah, Bioengineering) Host: University of Utah *Contact: Linda Holmes: 615/576-3192, Fax: 615/576-0202. Fellowship appointments are tenable at DOE and university laboratories having substantial DOE-sponsored research projects supportive of the Human Genome Program. Fellows will participate in advanced genetics-related research, interact with outstanding professionals, and become familiar with major issues while making personal contributions to the program's goal of mapping and sequencing the human genome. This interaction, involving the exchange of ideas, skills, and technologies, will benefit the fellow, the host laboratory, and the DOE program. These fellowships complement the Alexander Hollaender Distinguished Postdoctoral Fellowships initiated by OHER. The Hollaender Fellowships, established in memory of the 1983 recipient of the prestigious DOE Enrico Fermi Award, provide support in all areas of OHER-sponsored research. 
Both postdoctoral programs are administered by Oak Ridge Associated Universities, which is a university consortium and DOE contractor.

Resource Allocation

Reports by the Health and Environmental Research Advisory Committee (HERAC) and the National Research Council (NRC) recommended that national funding for the Human Genome Project increase to a sustaining yearly level of $200 million. DOE program expenditures were $5.5 million in FY 1987, $10.7 million in FY 1988, $17.5 million in FY 1989, $25.9 million in FY 1990, $46 million in FY 1991, and $59 million in FY 1992. The proposed presidential budget for the DOE Human Genome Program in FY 1993 is $64.7 million (graph). DOE-sponsored research is conducted in a variety of institutions (upper table). The lower table categorizes research expenditures for FY 1992.

Types of Institutions Conducting DOE-Sponsored Genome Research

 8  National laboratories
 3  Other federal organizations
41  Academic institutions
10  Private-sector institutions
12  Nonacademic, commercial organizations

Human Genome Program Funds Distribution in FY 1992 (in $K)
(Commitments as of May 1, 1992)

---------------------------------------------------------------------------------------------
Organization Type         Mapping &    Instrumentation   Informatics   ELSI    Totals    Percent of
                          Sequencing   Development                                       56800^1
---------------------------------------------------------------------------------------------
DOE Labs                    23671          7559             5122        236    36588       64.4
Academic                     5462          3341             4528        736    14067       24.8
Institutions (nonprofit)     2173             0              602        847     3622        6.4
NIH Labs                      680             0                0          0      680        1.2
Companies and SBIR^2         1550             0              314        392     2256        3.9
All Organizations           33536         10900            10566       2211    57213
[Percent of 56800]         [59.0]        [19.2]           [18.6]      [3.9]  [100.7]^3
---------------------------------------------------------------------------------------------
1 Total allocation of $59 million less capital equipment funds of $2.2 million.
2 Small Business Innovation Research grants.
3 Excess occurs because funding for genome SBIR projects is received from the DOE-wide SBIR program, to which OHER contributes.

Interagency Coordination

Joint DOE-NIH Activities

The NIH Human Genome Program, led by NIH NCHGR, has emphasized the study of disease genes in the construction of complete genetic and physical maps of the genomes of humans and selected model organisms. NIH is also developing new technologies and information systems to manage mapping and sequencing data. In the fall of 1988 DOE and NIH began coordinating their human genome research programs under the Memorandum of Understanding, an outgrowth of the HERAC and NRC reports, "to foster interagency cooperation that will enhance the human genome research capabilities of both agencies." More information on NCHGR-sponsored projects and infrastructure may be obtained by contacting the NCHGR Office of Communications at 301/402-0911.

Joint DOE-NIH Subcommittee on the Human Genome in 1992

Cochairs:
Paul Berg (PACHG), Stanford University School of Medicine
Sheldon Wolff (HERAC), University of California, San Francisco

Members:
Charles R. Cantor, Lawrence Berkeley Laboratory (HGCC)
Anthony V. Carrano, Lawrence Livermore National Laboratory (HGCC)
Joseph L. Goldstein, University of Texas Southwestern Medical Center
Leroy E. Hood, California Institute of Technology
Leonard S. Lerman, Massachusetts Institute of Technology (HERAC)
Victor A. McKusick, Johns Hopkins Hospital
Robert K. Moyzis, Los Alamos National Laboratory (HGCC)
Maynard V. Olson, Washington University School of Medicine (PACHG)
MaryLou Pardue, Massachusetts Institute of Technology (HERAC)
Mark L. Pearson, E. I. du Pont de Nemours & Company (PACHG)
Diane C. Smith, Xerox Corporation (PACHG)
Robert T. Tjian, University of California, Berkeley
Nancy S. Wexler, Columbia University (PACHG)
John C. Wooley, Office of Health and Environmental Research, DOE

Ex Officio Members:
David J. Galas, Office of Health and Environmental Research, DOE
Mark S. Guyer, National Center for Human Genome Research, NIH
Elke Jordan, National Center for Human Genome Research, NIH
David A. Smith, Office of Health and Environmental Research, DOE
Michael Gottesman, National Center for Human Genome Research, NIH

A national plan, primarily authored by NIH and DOE, for a coordinated multiyear research project was presented to Congress in early 1990. Understanding Our Genetic Inheritance, The U.S. Human Genome Project: The First Five Years (1991-1995) detailed a comprehensive spending plan and optimal strategies for mapping and sequencing the human genome. Referred to as the Five Year Plan, it calls for open biannual meetings of the DOE-NIH Joint Subcommittee on the Human Genome. The joint subcommittee invites reports from experts, including those on national and international genome efforts; medical genetics; and related ethical, legal, and social issues as they pertain to data produced in the project. The subcommittee is made up of members from the NIH Program Advisory Committee on the Human Genome (PACHG) and from the DOE HERAC or the HGCC members appointed by HERAC. The subcommittee reports to its parent committees, PACHG and HERAC.

Many workshops and meetings have since been cosponsored by the two agencies (see Appendix B). In addition, the Joint Subcommittee on the Human Genome has established five joint working groups that meet regularly to address specific areas of genome research and make recommendations to the joint subcommittee. The objectives of these five joint working groups, listed below, include establishing research priorities; identifying research, training, and technical needs; and coordinating U.S. research activities with those of other countries. Members of the working groups represent various disciplines. (Membership lists of the working groups are included in Appendix D.)

Joint Mapping Working Group.
The mapping working group encourages development and use of methodologies to integrate genetic linkage and physical maps, meet project mapping goals, and identify informatics needs associated with map generation and completion. Joint Informatics Task Force (JITF). An ad hoc committee, JITF prepared a comprehensive report on genome information needs and data analysis tools. The report was presented to the DOE-NIH Joint Subcommittee on the Human Genome in January 1992. Joint Sequencing Working Group. The sequencing working group investigates and makes recommendations on research and technology development priorities to enable the sequencing of 3 billion nucleotides of human DNA within 15 years. Joint Working Group on Ethical, Legal, and Social Issues (ELSI). ELSI identifies and addresses the social concerns that may arise as genome technology is developed and genetic data becomes available; stimulates bioethics research; promotes education of professional and lay groups; and collaborates with international groups such as the Human Genome Organization (HUGO), United Nations Educational, Scientific, and Cultural Organization (UNESCO), and the European Community (see next section). Joint Working Group on the Mouse. The mouse working group was established to develop a strategy for efficiently using the mouse to accomplish mapping project goals as outlined in the Five Year Plan. This strategy will take advantage of the extensive genetic map data amassed on the mouse. Because of numerous similarities between mouse and human genomes, these studies are considered essential to understanding human biology and to interpreting more complex data obtained in studies of humans. Other U.S. Genome Research U.S. Department of Agriculture (USDA). USDA has implemented a Plant Genome Research Program to foster and coordinate research on single and multigenic traits related to agricultural, forestry, and environmental concerns. 
The goal of this 5-year program is to improve plant varieties by locating important genes and markers on chromosomes, determining gene structure, and transferring genes to improve the performance of economically important crops such as corn, wheat, soybeans, and pine. Use of these "molecular breeding" techniques will increase U.S. competitiveness in the world marketplace. National Science Foundation (NSF). NSF coordinates an interagency research effort to map and sequence the small genome of Arabidopsis thaliana, a simple weed that provides an ideal model for studying plant biochemistry, genetics, and physiology. Knowledge of the function of every Arabidopsis gene will be applicable to the understanding and manipulation of higher plants and to genome research in general. These studies are also supported by DOE, NIH, and USDA as part of their own genome initiatives, and the four agencies coordinate their Arabidopsis activities. NSF also has instrumentation, computational, and informatics programs that support genomics research, in addition to individual awards in genetics and molecular biology. Howard Hughes Medical Institute (HHMI). HHMI, a private medical research organization, contributes to the genome effort through its support of biomedical research primarily at university molecular biology and genetics laboratories. In addition, HHMI has cosponsored several genomics conferences and, between 1985 and September 1991, supported the collection and dissemination of genome mapping data through a network of databases. International Coordination Genomic research is being carried out in countries throughout the world. The two international organizations described on the next two pages are working to coordinate and facilitate national efforts. HUGO includes a number of DOE and NIH genome investigators and administrators. 
HUGO and UNESCO have been informed of dedicated genome programs in the following nations and international agencies: Commonwealth of Independent States (formerly U.S.S.R.), Denmark, European Community, France, Germany, Hungary, Italy, Japan, Netherlands, United Kingdom, and United States. HUGO: Worldwide Genome Research Coordination HUGO, formed by scientists to coordinate worldwide genome mapping and sequencing, now has regional offices in the United States (Bethesda, Maryland) and Europe (London) and a satellite office in Moscow. A Pacific office is under development in Osaka, Japan. HUGO offices were funded initially by several charitable organizations. In 1990 HHMI awarded HUGO a 4-year, $1 million grant to support the HUGO Americas office; in that same year The Wellcome Trust provided a 3-year grant, with the first year's funds amounting to over $400,000, to assist with activities in the European office. The Imperial Cancer Research Fund (U.K.) provides support for the HUGO president's office, and the Osaka office has received private support as well. To support future activities, HUGO directors intend to raise funds from various countries that have active genome research programs. HUGO members are elected; there are over 400 members from 32 countries. The international officers in 1992: Sir Walter Bodmer (United Kingdom), President; Charles R. Cantor (United States), Vice-President; Andrei Mirzabekov (Russia), Vice-President; Kenichi Matsubara (Japan), Vice-President; Bronwen Loder (United Kingdom), Secretary; and Robert Sparkes (United States), Treasurer. Each office operates with its own trustees. 
The objectives of HUGO include * fostering collaboration to avoid unnecessary competition or duplication of effort and to coordinate human genome research with model organism studies; * coordinating exchanges of relevant data and materials; * educating researchers and the public on the scientific, ethical, social, legal, and commercial implications of the research; and * acting as a clearinghouse for genome-related information, such as relevant conferences, worldwide genome programs and researchers, and database and material availability. A training program may be initiated to encourage the spread of new and promising technologies. HUGO has established expert international ad hoc advisory committees on mapping workshops and databases, informatics, ethics, mouse mapping, and intellectual property and ownership. Single-chromosome workshops are crucial to the success of the Human Genome Project. Working with the funding agencies, HUGO is playing a central role in the coordinated development of such meetings and has assisted in planning workshops for chromosomes 2, 3, 13, 16, 19, and X in 1992. HUGO expects to work with the scientific community to select workshop chairs and to assist in fundraising and organizing and running these and future meetings. Chromosome workshops and other meetings are listed in Appendix B. UNESCO: Promoting the Interests of Developing Countries A UNESCO Human Genome Program was approved for 1990-91 at the 25th session of the UNESCO General Conference. Attendees concluded that full knowledge of the human genome is vitally important and that UNESCO could be influential in stimulating governments and agencies to support coordinated programs. UNESCO expects to play a key role in promoting the interests of developing countries. 
The Scientific Coordinating Committee (SCC), composed of 13 scientists, plans and implements the program, which was budgeted at $350,000 for the first year; SCC members include representatives selected from geographic regions and from international genome organizations such as HUGO. Members of SCC and of the UNESCO Secretariat agreed that UNESCO will concentrate its activities on access to and use of data obtained from human genome mapping and sequencing research, as well as on related ethical and social issues. UNESCO emphasizes the use of training programs as one of the best means of obtaining cooperation and diminishing the gap between developed and developing countries. The Third World Academy of Sciences (TWAS) joined UNESCO in sponsoring a training program that provided 19 fellowships in 1991 to awardees from Algeria, Argentina, Cameroon, Chile, China, Costa Rica, Cyprus, Czechoslovakia, Egypt, Guinea, India, Indonesia, Myanmar, the Republic of Korea, Peru, Spain, Ukraine, Russia, and Yugoslavia. The 1- to 3-month fellowships enable scientists from developing countries to carry out research in well-established scientific centers and to learn new research techniques. UNESCO and TWAS are also jointly compiling a directory to identify third-world genome researchers and their needs. To avoid overlap with other genome projects, UNESCO focuses on communication among countries about major trends and regional efforts, one of which, the Latin American Human Genome Program, was established during a UNESCO-supported symposium in Chile in 1990. The first annual UNESCO South-North Human Genome Conference was held in 1992 in Caxambu, Brazil, to increase interaction between scientists from developed countries and those of the third world. The second conference is planned for Thailand in 1993, and the third will probably take place in China in 1994. 
Appendices

Appendix A: Primer on Molecular Genetics
Appendix B: Conferences, Meetings, and Workshops Sponsored by DOE
Appendix C: Members of the DOE Health and Environmental Research Advisory Committee
Appendix D: Members of DOE-NIH Joint Working Groups
Appendix E: Glossary

Appendix A: Primer on Molecular Genetics

Revised and expanded by Denise Casey (HGMIS) from the primer contributed by Charles Cantor and Sylvia Spengler (Lawrence Berkeley Laboratory) and published in the Human Genome 1989-90 Program Report.

Contents

Introduction
DNA
Genes
Chromosomes
Mapping and Sequencing the Human Genome
  Mapping Strategies
    Genetic Linkage Maps
    Physical Maps
      Low-Resolution Physical Mapping
        Chromosomal map
        cDNA map
      High-Resolution Physical Mapping
        Macrorestriction maps: Top-down mapping
        Contig maps: Bottom-up mapping
  Sequencing Technologies
    Current Sequencing Technologies
    Sequencing Technologies Under Development
    Partial Sequencing to Facilitate Mapping, Gene Identification
  End Games: Completing Maps and Sequences; Finding Specific Genes
Model Organism Research
Informatics: Data Collection and Interpretation
  Collecting and Storing Data
  Interpreting Data
  Mapping Databases
  Sequence Databases
    Nucleic Acids (DNA and RNA)
    Proteins
Impact of the Human Genome Project

Introduction

The complete set of instructions for making an organism is called its genome. It contains the master blueprint for all cellular structures and activities for the lifetime of the cell or organism. Found in every nucleus of a person's many trillions of cells, the human genome consists of tightly coiled threads of deoxyribonucleic acid (DNA) and associated protein molecules, organized into structures called chromosomes (Fig. 1). If unwound and tied together, the strands of DNA would stretch more than 5 feet but would be only 50 trillionths of an inch wide.
For each organism, the components of these slender threads encode all the information necessary for building and maintaining life, from simple bacteria to remarkably complex human beings. Understanding how DNA performs this function requires some knowledge of its structure and organization.

DNA

In humans, as in other higher organisms, a DNA molecule consists of two strands that wrap around each other to resemble a twisted ladder whose sides, made of sugar and phosphate molecules, are connected by "rungs" of nitrogen-containing chemicals called bases. Each strand is a linear arrangement of repeating similar units called nucleotides, which are each composed of one sugar, one phosphate, and a nitrogenous base (Fig. 2). Four different bases are present in DNA: adenine (A), thymine (T), cytosine (C), and guanine (G). The particular order of the bases arranged along the sugar-phosphate backbone is called the DNA sequence; the sequence specifies the exact genetic instructions required to create a particular organism with its own unique traits.

The two DNA strands are held together by weak bonds between the bases on each strand, forming base pairs (bp). Genome size is usually stated as the total number of base pairs; the human genome contains roughly 3 billion bp (Fig. 3).

Each time a cell divides into two daughter cells, its full genome is duplicated; for humans and other complex organisms, this duplication occurs in the nucleus. During cell division the DNA molecule unwinds and the weak bonds between the base pairs break, allowing the strands to separate. Each strand directs the synthesis of a complementary new strand, with free nucleotides matching up with their complementary bases on each of the separated strands. Strict base-pairing rules are adhered to: adenine will pair only with thymine (an A-T pair) and cytosine with guanine (a C-G pair). Each daughter cell receives one old and one new DNA strand (Figs. 1 and 4).
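The base-pairing rules just described can be illustrated with a short Python sketch. It is only a toy model under stated assumptions: strands are plain strings read position by position, and the names `PAIRING`, `complement_strand`, and `replicate` are hypothetical, chosen here for illustration rather than taken from the report.

```python
# Pairing rules as stated in the text: A pairs only with T, C pairs only with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(strand: str) -> str:
    """Return the strand that pairs base-by-base with `strand`."""
    return "".join(PAIRING[base] for base in strand)


def replicate(old_strand_1: str, old_strand_2: str):
    """Model one round of duplication.

    Each old strand directs the synthesis of a new complementary strand,
    so each daughter receives one old strand and one new strand.
    Returns two (old, new) tuples, one per daughter cell.
    """
    daughter_1 = (old_strand_1, complement_strand(old_strand_1))
    daughter_2 = (old_strand_2, complement_strand(old_strand_2))
    return daughter_1, daughter_2


if __name__ == "__main__":
    top = "ATGCCGTA"                 # made-up example sequence
    bottom = complement_strand(top)  # its partner strand
    print(bottom)
    print(replicate(top, bottom))
```

Because every base has exactly one allowed partner, applying `complement_strand` twice returns the starting strand, which is why both daughters in this model end up with the same pair of strands as the parent.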
The cell's adherence to these base-pairing rules ensures that the new strand is an exact copy of the old one. This minimizes the incidence of errors (mutations) that may greatly affect the resulting organism or its offspring.

Genes

Each DNA molecule contains many genes, the basic physical and functional units of heredity. A gene is a specific sequence of nucleotide bases, whose sequences carry the information required for constructing proteins, which provide the structural components of cells and tissues as well as enzymes for essential biochemical reactions. The human genome is estimated to comprise at least 100,000 genes.

Human genes vary widely in length, often extending over thousands of bases, but only about 10% of the genome is known to include the protein-coding sequences (exons) of genes. Interspersed within many genes are intron sequences, which have no coding function. The balance of the genome is thought to consist of other noncoding regions (such as control sequences and intergenic regions), whose functions are obscure.

All living organisms are composed largely of proteins; humans can synthesize at least 100,000 different kinds. Proteins are large, complex molecules made up of long chains of subunits called amino acids. Twenty different kinds of amino acids are usually found in proteins. Within the gene, each specific sequence of three DNA bases (a codon) directs the cell's protein-synthesizing machinery to add a specific amino acid. For example, the base sequence ATG codes for the amino acid methionine. Since 3 bases code for 1 amino acid, the protein coded by an average-sized gene (3000 bp) will contain 1000 amino acids. The genetic code is thus a series of codons that specify which amino acids are required to make up specific proteins.

The protein-coding instructions from the genes are transmitted indirectly through messenger ribonucleic acid (mRNA), a transient intermediary molecule similar to a single strand of DNA.
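The codon arithmetic above can be made concrete with a small Python sketch. It is a simplification under stated assumptions: codons are read directly from the DNA sequence (skipping the mRNA intermediary), the table holds only six of the 64 codons of the standard genetic code, and the stop codon is a detail of that code not covered in the text. The names `CODON_TABLE`, `split_codons`, `translate`, and `amino_acid_count` are hypothetical.

```python
# Partial DNA codon table from the standard genetic code (6 of 64 codons).
CODON_TABLE = {
    "ATG": "Met",   # methionine, the example given in the text
    "TTT": "Phe",
    "GGC": "Gly",
    "AAA": "Lys",
    "TGG": "Trp",
    "TAA": "STOP",  # a stop codon ends the protein (not discussed in the text)
}


def split_codons(dna: str) -> list[str]:
    """Cut a coding sequence into consecutive, non-overlapping 3-base codons."""
    usable = len(dna) - len(dna) % 3  # drop any trailing partial codon
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def translate(dna: str) -> list[str]:
    """Map each codon to its amino acid, stopping at a stop codon."""
    protein = []
    for codon in split_codons(dna):
        amino_acid = CODON_TABLE[codon]  # KeyError if codon is not in this partial table
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


def amino_acid_count(coding_bp: int) -> int:
    """Three bases per amino acid, so a coding length of n bp gives n // 3 residues."""
    return coding_bp // 3


if __name__ == "__main__":
    print(split_codons("ATGTTTGGCAAATGGTAA"))
    print(translate("ATGTTTGGCAAATGGTAA"))
    print(amino_acid_count(3000))  # the 3000-bp example from the text
```

A full translation would need the complete 64-entry table; the partial table is kept deliberately short so that every entry can be checked by eye.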
For the information within a gene to be expressed, a complementary RNA strand is produced (a process called transcription) from the DNA template in the nucleus. This mRNA is moved from the nucleus to the cellular cytoplasm, where it serves as the template for protein synthesis. The cell's protein-synthesizing machinery then translates the codons into a string of amino acids that will constitute the protein molecule for which it codes (Fig. 5). In the laboratory, the mRNA molecule can be isolated and used as a template to synthesize a complementary DNA (cDNA) strand, which can then be used to locate the corresponding genes on a chromosome map. The utility of this strategy is described in the section on physical mapping, p. 201.

Chromosomes

The 3 billion bp in the human genome are organized into 24 distinct, physically separate microscopic units called chromosomes. All genes are arranged linearly along the chromosomes. The nucleus of most human cells contains 2 sets of chromosomes, 1 set given by each parent. Each set has 23 single chromosomes: 22 autosomes and an X or Y sex chromosome. (A normal female will have a pair of X chromosomes; a male will have an X and Y pair.) Chromosomes contain roughly equal parts of protein and DNA; chromosomal DNA contains an average of 150 million bases. DNA molecules are among the largest molecules now known.

Chromosomes can be seen under a light microscope and, when stained with certain dyes, reveal a pattern of light and dark bands reflecting regional variations in the amounts of A and T vs G and C. Differences in size and banding pattern allow the 24 chromosomes to be distinguished from each other, an analysis called a karyotype.
A few types of major chromosomal abnormalities, including missing or extra copies of a chromosome or gross breaks and rejoinings (translocations), can be detected by microscopic examination; Down's syndrome, in which an individual's cells contain a third copy of chromosome 21, is diagnosed by karyotype analysis (Fig. 6). Most changes in DNA, however, are too subtle to be detected by this technique and require molecular analysis. These subtle DNA abnormalities (mutations) are responsible for many inherited diseases such as cystic fibrosis and sickle cell anemia or may predispose an individual to cancer, major psychiatric illnesses, and other complex diseases.

Mapping and Sequencing the Human Genome

A primary goal of the Human Genome Project is to make a series of descriptive diagrams (maps) of each human chromosome at increasingly finer resolutions. Mapping involves (1) dividing the chromosomes into smaller fragments that can be propagated and characterized and (2) ordering (mapping) them to correspond to their respective locations on the chromosomes. After mapping is completed, the next step is to determine the base sequence of each of the ordered DNA fragments. The ultimate goal of genome research is to find all the genes in the DNA sequence and to develop tools for using this information in the study of human biology and medicine. Improving the instrumentation and techniques required for mapping and sequencing, a major focus of the genome project, will increase efficiency and cost-effectiveness. Goals include automating methods and optimizing techniques to extract the maximum useful information from maps and sequences.

A genome map describes the order of genes or other markers and the spacing between them on each chromosome. Human genome maps are constructed on several different scales or levels of resolution.
At the coarsest resolution are genetic linkage maps, which depict the relative chromosomal locations of DNA markers (genes and other identifiable DNA sequences) by their patterns of inheritance. Physical maps describe the chemical characteristics of the DNA molecule itself. Geneticists have already charted the approximate positions of over 2300 genes, and a start has been made in establishing high-resolution maps of the genome (Fig. 7). More-precise maps are needed to organize systematic sequencing efforts and plan new research directions.

Mapping Strategies

Genetic Linkage Maps

A genetic linkage map shows the relative locations of specific DNA markers along the chromosome. Any inherited physical or molecular characteristic that differs among individuals and is easily detectable in the laboratory is a potential genetic marker. Markers can be expressed DNA regions (genes) or DNA segments that have no known coding function but whose inheritance pattern can be followed. DNA sequence differences are especially useful markers because they are plentiful and easy to characterize precisely.

Markers must be polymorphic to be useful in mapping; that is, alternative forms must exist among individuals so that they are detectable among different members in family studies. Polymorphisms are variations in DNA sequence that occur on average once every 300 to 500 bp. Variations within exon sequences can lead to observable changes, such as differences in eye color, blood type, and disease susceptibility. Most variations occur within introns and have little or no effect on an organism's appearance or function, yet they are detectable at the DNA level and can be used as markers. Examples of these types of markers include (1) restriction fragment length polymorphisms (RFLPs), which reflect sequence variations in DNA sites that can be cleaved by DNA restriction enzymes (see box, p.
203), and (2) variable number of tandem repeat sequences, which are short repeated sequences that vary in the number of repeated units and, therefore, in length (a characteristic easily measured).

The human genetic linkage map is constructed by observing how frequently two markers are inherited together. Two markers located near each other on the same chromosome will tend to be passed together from parent to child. During the normal production of sperm and egg cells, DNA strands occasionally break and rejoin in different places on the same chromosome or on the other copy of the same chromosome (i.e., the homologous chromosome). This process (called meiotic recombination) can result in the separation of two markers originally on the same chromosome (Fig. 8). The closer the markers are to each other (the more "tightly linked"), the less likely a recombination event will fall between and separate them. Recombination frequency thus provides an estimate of the distance between two markers.

On the genetic map, distances between markers are measured in terms of centimorgans (cM), named after the American geneticist Thomas Hunt Morgan. Two markers are said to be 1 cM apart if they are separated by recombination 1% of the time. A genetic distance of 1 cM is roughly equal to a physical distance of 1 million bp (1 Mb). The current resolution of most human genetic map regions is about 10 Mb.

The value of the genetic map is that an inherited disease can be located on the map by following the inheritance of a DNA marker present in affected individuals (but absent in unaffected individuals), even though the molecular basis of the disease may not yet be understood nor the responsible gene identified. Genetic maps have been used to find the exact chromosomal location of several important disease genes, including cystic fibrosis, sickle cell disease, Tay-Sachs disease, fragile X syndrome, and myotonic dystrophy.
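The centimorgan definition given above amounts to a simple calculation, sketched here under stated assumptions: the offspring counts are invented for illustration, the function names are hypothetical, and the direct conversion from recombination percentage to map distance is treated as reasonable only for closely linked markers. The 1 cM to 1 Mb conversion is the rough average quoted in the text, not an exact rule.

```python
BP_PER_CM = 1_000_000  # rough average from the text: 1 cM is about 1 Mb


def genetic_distance_cm(recombinant_offspring, total_offspring):
    """Map distance in cM: the percentage of offspring in which the two markers were separated."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return 100.0 * recombinant_offspring / total_offspring


def approx_physical_distance_bp(distance_cm):
    """Very rough physical distance, using the average 1 cM ~ 1 Mb conversion."""
    return int(round(distance_cm * BP_PER_CM))


if __name__ == "__main__":
    # Hypothetical family data: markers separated in 3 of 100 offspring.
    cm = genetic_distance_cm(3, 100)
    print(cm, "cM, roughly", approx_physical_distance_bp(cm), "bp")
```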
One short-term goal of the genome project is to develop a high-resolution genetic map (2 to 5 cM); recent consensus maps of some chromosomes have averaged 7 to 10 cM between genetic markers. Genetic mapping resolution has been increased through the application of recombinant DNA technology, including in vitro radiation-induced chromosome fragmentation and cell fusions (joining human cells with those of other species to form hybrid cells) to create panels of cells with specific and varied human chromosomal components. Assessing the frequency of marker sites remaining together after radiation-induced DNA fragmentation can establish the order and distance between the markers. Because only a single copy of a chromosome is required for analysis, even nonpolymorphic markers are useful in radiation hybrid mapping. [In meiotic mapping (described above), two copies of a chromosome must be distinguished from each other by polymorphic markers.]

Restriction Enzymes: Microscopic Scalpels

Isolated from various bacteria, restriction enzymes recognize short DNA sequences and cut the DNA molecules at those specific sites. (A natural biological function of these enzymes is to protect bacteria by attacking viral and other foreign DNA.) Some restriction enzymes (rare-cutters) cut the DNA very infrequently, generating a small number of very large fragments (several thousand to a million bp). Most enzymes cut DNA more frequently, thus generating a large number of small fragments (less than a hundred to more than a thousand bp).

On average, restriction enzymes with
* 4-base recognition sites will yield pieces 256 bases long,
* 6-base recognition sites will yield pieces 4000 bases long, and
* 8-base recognition sites will yield pieces 64,000 bases long.

Since hundreds of different restriction enzymes have been characterized, DNA can be cut into many different small fragments.

Physical Maps

Different types of physical maps vary in their degree of resolution.
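As a brief aside on the restriction enzyme box above, the average fragment sizes quoted there follow from a simple estimate: if the four bases are assumed to be equally frequent and randomly ordered, a given n-base site occurs about once every 4^n bases. The sketch below works through that estimate and a toy digest. The function names and the short test sequence are hypothetical; the GAATTC site, cut after the first G, is the well-known EcoRI pattern.

```python
def expected_fragment_length(site_length):
    """Average spacing of an n-base site in random DNA with equal base frequencies."""
    return 4 ** site_length


def digest(dna, site, cut_offset=0):
    """Cut dna at every occurrence of site.

    cut_offset is the position inside the recognition site where the cut falls
    (for example, 1 means the cut is made after the first base of the site).
    """
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])
    return [fragment for fragment in fragments if fragment]


if __name__ == "__main__":
    for n in (4, 6, 8):
        print(n, "base site:", expected_fragment_length(n), "bases on average")
    print(digest("AAGAATTCTTGAATTCGG", "GAATTC", cut_offset=1))
```

The computed values (256, 4096, and 65,536) match the rounded figures in the box; real genomes deviate from this estimate because base composition is neither uniform nor random.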
The lowest-resolution physical map is the chromosomal (sometimes called cytogenetic) map, which is based on the distinctive banding patterns observed by light microscopy of stained chromosomes. A cDNA map shows the locations of expressed DNA regions (exons) on the chromosomal map. The more detailed cosmid contig map depicts the order of overlapping DNA fragments spanning the genome. A macrorestriction map describes the order and distance between enzyme cutting (cleavage) sites. The highest-resolution physical map is the complete elucidation of the DNA base-pair sequence of each chromosome in the human genome. Physical maps are described in greater detail below.

Low-Resolution Physical Mapping

Chromosomal map. In a chromosomal map, genes or other identifiable DNA fragments are assigned to their respective chromosomes, with distances measured in base pairs. These markers can be physically associated with particular bands (identified by cytogenetic staining) primarily by in situ hybridization, a technique that involves tagging the DNA marker with an observable label (e.g., one that fluoresces or is radioactive). The location of the labeled probe can be detected after it binds to its complementary DNA strand in an intact chromosome.

As with genetic linkage mapping, chromosomal mapping can be used to locate genetic markers defined by traits observable only in whole organisms. Because chromosomal maps are based on estimates of physical distance, they are considered to be physical maps. The number of base pairs within a band can only be estimated.

Until recently, even the best chromosomal maps could be used to locate a DNA fragment only to a region of about 10 Mb, the size of a typical band seen on a chromosome. Improvements in fluorescence in situ hybridization (FISH) methods allow orientation of DNA sequences that lie as close as 2 to 5 Mb.
Modifications to in situ hybridization methods, using chromosomes at a stage in cell division (interphase) when they are less compact, increase map resolution to around 100,000 bp. Further banding refinement might allow chromosomal bands to be associated with specific amplified DNA fragments, an improvement that could be useful in analyzing observable physical traits associated with chromosomal abnormalities.

cDNA map. A cDNA map shows the positions of expressed DNA regions (exons) relative to particular chromosomal regions or bands. (Expressed DNA regions are those transcribed into mRNA.) cDNA is synthesized in the laboratory using the mRNA molecule as a template; base-pairing rules are followed (i.e., an A on the mRNA molecule will pair with a T on the new DNA strand). This cDNA can then be mapped to genomic regions.

Because they represent expressed genomic regions, cDNAs are thought to identify the parts of the genome with the most biological and medical significance. A cDNA map can provide the chromosomal location for genes whose functions are currently unknown. For disease-gene hunters, the map can also suggest a set of candidate genes to test when the approximate location of a disease gene has been mapped by genetic linkage techniques.

High-Resolution Physical Mapping

The two current approaches to high-resolution physical mapping are termed "top-down" (producing a macrorestriction map) and "bottom-up" (resulting in a contig map). With either strategy (described below) the maps represent ordered sets of DNA fragments that are generated by cutting genomic DNA with restriction enzymes (see previously discussed Restriction Enzymes). The fragments are then amplified by cloning or by polymerase chain reaction (PCR) methods (see DNA Amplification below). Electrophoretic techniques are used to separate the fragments according to size into different bands, which can be visualized by direct DNA staining or by hybridization with DNA probes of interest.
The use of purified chromosomes separated either by flow sorting from human cell lines or in hybrid cell lines allows a single chromosome to be mapped (see Separating Chromosomes below).

A number of strategies can be used to reconstruct the original order of the DNA fragments in the genome. Many approaches make use of the ability of single strands of DNA and/or RNA to hybridize, that is, to form double-stranded segments by hydrogen bonding between complementary bases. The extent of sequence homology between the two strands can be inferred from the length of the double-stranded segment. Fingerprinting uses restriction map data to determine which fragments have a specific sequence (fingerprint) in common and therefore overlap. Another approach uses linking clones as probes for hybridization to chromosomal DNA cut with the same restriction enzyme.

Macrorestriction maps: Top-down mapping. In top-down mapping, a single chromosome is cut (with rare-cutter restriction enzymes) into large pieces, which are ordered and subdivided; the smaller pieces are then mapped further. The resulting macrorestriction maps depict the order of and distance between sites at which rare-cutter enzymes cleave (Fig. 9a). This approach yields maps with more continuity and fewer gaps between fragments than contig maps (see below), but map resolution is lower and may not be useful in finding particular genes; in addition, this strategy generally does not produce long stretches of mapped sites. Currently, this approach allows DNA pieces to be located in regions measuring about 100,000 bp to 1 Mb.

The development of pulsed-field gel (PFG) electrophoretic methods has improved the mapping and cloning of large DNA molecules. While conventional gel electrophoretic methods separate pieces less than 40 kb (1 kb = 1000 bases) in size, PFG separates molecules up to 10 Mb, allowing the application of both conventional and new mapping methods to larger genomic regions.

Contig maps: Bottom-up mapping.
The bottom-up approach involves cutting the chromosome into small pieces, each of which is cloned and ordered. The ordered fragments form contiguous DNA blocks (contigs). Currently, the resulting "library" of clones varies in size from 10,000 bp to 1 Mb (Fig. 9b). An advantage of this approach is the accessibility of these stable clones to other researchers. Contig construction can be verified by FISH, which localizes cosmids to specific regions within chromosomal bands.

Contig maps thus consist of a linked library of small overlapping clones representing a complete chromosomal segment. While useful for finding genes localized to a small area (under 2 Mb), contig maps are difficult to extend over large stretches of a chromosome because not all regions are clonable. DNA probe techniques can be used to fill in the gaps, but they are time consuming. Figure 10 is a diagram relating the different types of maps.

Technological improvements now make possible the cloning of large DNA pieces, using artificially constructed chromosome vectors that carry human DNA fragments as large as 1 Mb. These vectors are maintained in yeast cells as artificial chromosomes (YACs). (For more explanation, see DNA Amplification below.) Before YACs were developed, the largest cloning vectors (cosmids) carried inserts of only 20 to 40 kb. YAC methodology drastically reduces the number of clones to be ordered; many YACs span entire human genes. A more detailed map of a large YAC insert can be produced by subcloning, a process in which fragments of the original insert are cloned into smaller-insert vectors. Because some YAC regions are unstable, large-capacity bacterial vectors (i.e., those that can accommodate large inserts) are also being developed.

Separating Chromosomes

Flow sorting

Pioneered at Los Alamos National Laboratory (LANL), flow sorting employs flow cytometry to separate, according to size, chromosomes isolated from cells during cell division when they are condensed and stable.
As the chromosomes flow singly past a laser beam, they are differentiated by analyzing the amount of DNA present, and individual chromosomes are directed to specific collection tubes.

Somatic cell hybridization

In somatic cell hybridization, human cells and rodent tumor cells are fused (hybridized); over time, after the chromosomes mix, human chromosomes are preferentially lost from the hybrid cell until only one or a few remain. Those individual hybrid cells are then propagated and maintained as cell lines containing specific human chromosomes. Improvements to this technique have generated a number of hybrid cell lines, each with a specific single human chromosome.

Sequencing Technologies

The ultimate physical map of the human genome is the complete DNA sequence, the determination of all base pairs on each chromosome. The completed map will provide biologists with a Rosetta stone for studying human biology and enable medical researchers to begin to unravel the mechanisms of inherited diseases. Much effort continues to be spent locating genes; if the full sequence were known, emphasis could shift to determining gene function. The Human Genome Project is creating research tools for 21st-century biology, when the goal will be to understand the sequence and functions of the genes residing therein.

Achieving the goals of the Human Genome Project will require substantial improvements in the rate, efficiency, and reliability of standard sequencing procedures. While technological advances are leading to the automation of standard DNA purification, separation, and detection steps, efforts are also focusing on the development of entirely new sequencing methods that may eliminate some of these steps. Sequencing procedures currently involve first subcloning DNA fragments from a cosmid or bacteriophage library into special sequencing vectors that carry shorter pieces of the original cosmid fragments (Fig. 11).
The next step is to make the subcloned fragments into sets of nested fragments differing in length by one nucleotide, so that the specific base at the end of each successive fragment is detectable after the fragments have been separated by gel electrophoresis. Current sequencing technologies are discussed later.

DNA Amplification: Cloning and Polymerase Chain Reaction

Cloning (in vivo DNA amplification)

Cloning involves the use of recombinant DNA technology to propagate DNA fragments inside a foreign host. The fragments are usually isolated from chromosomes using restriction enzymes and then united with a carrier (a vector). Following introduction into suitable host cells, the DNA fragments can then be reproduced along with the host cell DNA. Vectors are DNA molecules originating from viruses, bacteria, and yeast cells. They accommodate various sizes of foreign DNA fragments ranging from 12,000 bp for bacterial vectors (plasmids and cosmids) to 1 Mb for yeast vectors (yeast artificial chromosomes). Bacteria are most often the hosts for these inserts, but yeast and mammalian cells are also used.

Cloning procedures provide unlimited material for experimental study. A random (unordered) set of cloned DNA fragments is called a library. Genomic libraries are sets of overlapping fragments encompassing an entire genome. Also available are chromosome-specific libraries, which consist of fragments derived from source DNA enriched for a particular chromosome. (See Separating Chromosomes, above.)

PCR (in vitro DNA amplification)

Described as being to genes what Gutenberg's printing press was to the written word, PCR can amplify a desired DNA sequence of any origin (virus, bacteria, plant, or human) hundreds of millions of times in a matter of hours, a task that would have required several days with recombinant technology. PCR is especially valuable because the reaction is highly specific, easily automated, and capable of amplifying minute amounts of sample.
For these reasons, PCR has also had a major impact on clinical medicine, genetic disease diagnostics, forensic science, and evolutionary biology.

PCR is a process based on a specialized polymerase enzyme, which can synthesize a complementary strand to a given DNA strand in a mixture containing the 4 DNA bases and 2 DNA fragments (primers, each about 20 bases long) flanking the target sequence. The mixture is heated to separate the strands of double-stranded DNA containing the target sequence and then cooled to allow (1) the primers to find and bind to their complementary sequences on the separated strands and (2) the polymerase to extend the primers into new complementary strands. Repeated heating and cooling cycles multiply the target DNA exponentially, since each new double strand separates to become two templates for further synthesis. In about 1 hour, 20 PCR cycles can amplify the target by a millionfold.

Current Sequencing Technologies

The two basic sequencing approaches, Maxam-Gilbert and Sanger, differ primarily in the way the nested DNA fragments are produced. Both methods work because gel electrophoresis produces very high resolution separations of DNA molecules; even fragments that differ in size by only a single nucleotide can be resolved. Almost all steps in these sequencing methods are now automated. Maxam-Gilbert sequencing (also called the chemical degradation method) uses chemicals to cleave DNA at specific bases, resulting in fragments of different lengths. A refinement to the Maxam-Gilbert method known as multiplex sequencing enables investigators to analyze about 40 clones on a single DNA sequencing gel. Sanger sequencing (also called the chain termination or dideoxy method) involves using an enzymatic procedure to synthesize DNA chains of varying length in four different reactions, stopping the DNA replication at positions occupied by one of the four bases, and then determining the resulting fragment lengths (Fig. 12).
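The logic of reading a sequence from nested fragments, as in the Sanger method just described, can be sketched in a few lines. This is an idealized illustration only: it assumes every possible chain length is produced and resolved perfectly on the gel, and the function names (sanger_fragments, read_gel) are hypothetical.

```python
def sanger_fragments(new_strand):
    """Simulate the four termination reactions.

    For each base, record the lengths of the chains that stop at that base.
    """
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence by ordering all fragments from shortest to longest."""
    base_at_length = {}
    for base, lengths in lanes.items():
        for length in lengths:
            base_at_length[length] = base
    return "".join(base_at_length[n] for n in sorted(base_at_length))


if __name__ == "__main__":
    lanes = sanger_fragments("GATTACA")
    print(lanes)
    print(read_gel(lanes))
```

Because fragments differ in length by a single nucleotide, sorting them by size across the four reactions recovers the base order of the synthesized strand.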
These first-generation gel-based sequencing technologies are now being used to sequence small regions of interest in the human genome. Although investigators could use existing technology to sequence whole chromosomes, time and cost considerations make large-scale sequencing projects of this nature impractical. The smallest human chromosome (Y) contains 50 Mb; the largest (chromosome 1) has 250 Mb. The largest continuous DNA sequence obtained thus far, however, is approximately 350,000 bp, and the best available equipment can sequence only 50,000 to 100,000 bases per year at an approximate cost of $1 to $2 per base. At that rate, an unacceptable 30,000 work-years and at least $3 billion would be required for sequencing alone.

Sequencing Technologies Under Development

A major focus of the Human Genome Project is the development of automated sequencing technology that can accurately sequence 100,000 or more bases per day at a cost of less than $0.50 per base. Specific goals include the development of sequencing and detection schemes that are faster and more sensitive, accurate, and economical. Many novel sequencing technologies are now being explored, and the most promising ones will eventually be optimized for widespread use.

Second-generation (interim) sequencing technologies will enable speed and accuracy to increase by an order of magnitude (i.e., 10 times greater) while lowering the cost per base. Some important disease genes will be sequenced with such technologies as (1) high-voltage capillary and ultrathin electrophoresis to increase fragment separation rate and (2) use of resonance ionization spectroscopy to detect stable isotope labels.

Third-generation gel-less sequencing technologies, which aim to increase efficiency by several orders of magnitude, are expected to be used for sequencing most of the human genome.
These developing technologies include (1) enhanced fluorescence detection of individual labeled bases in flow cytometry, (2) direct reading of the base sequence on a DNA strand with the use of scanning tunneling or atomic force microscopies, (3) enhanced mass spectrometric analysis of DNA sequence, and (4) sequencing by hybridization to short panels of nucleotides of known sequence. Pilot large-scale sequencing projects will provide opportunities to improve current technologies and will reveal challenges investigators may encounter in larger-scale efforts.

Partial Sequencing To Facilitate Mapping, Gene Identification

Correlating mapping data from different laboratories has been a problem because of differences in generating, isolating, and mapping DNA fragments. A common reference system designed to meet these challenges uses partially sequenced unique regions (200 to 500 bp) to identify clones, contigs, and long stretches of sequence. Called sequence tagged sites (STSs), these short sequences have become standard markers for physical mapping.

Because coding sequences of genes represent most of the potentially useful information content of the genome (but are only a fraction of the total DNA), some investigators have begun partial sequencing of cDNAs instead of random genomic DNA. (cDNAs are derived from mRNA sequences, which are the transcription products of expressed genes.) In addition to providing unique markers, these partial sequences [termed expressed sequence tags (ESTs)] also identify expressed genes. This strategy can thus provide a means of rapidly identifying most human genes. Other applications of the EST approach include determining locations of genes along chromosomes and identifying coding regions in genomic sequences.

End Games: Completing Maps and Sequences; Finding Specific Genes

Starting maps and sequences is relatively simple; finishing them will require new strategies or a combination of existing methods.
After a sequence is determined using the methods described above, the task remains to fill in the many large gaps left by current mapping methods. One approach is single-chromosome microdissection, in which a piece is physically cut from a chromosomal region of particular interest, broken up into smaller pieces, and amplified by PCR or cloning (see DNA Amplification above). These fragments can then be mapped and sequenced by the methods previously described.

Chromosome walking, one strategy for filling in gaps, involves hybridizing a primer of known sequence to a clone from an unordered genomic library and synthesizing a short complementary strand (called "walking" along a chromosome). The complementary strand is then sequenced and its end used as the next primer for further walking; in this way the adjacent, previously unknown, region is identified and sequenced. The chromosome is thus systematically sequenced from one end to the other. Because primers must be synthesized chemically, a disadvantage of this technique is the large number of different primers needed to walk a long distance. Chromosome walking is also used to locate specific genes by sequencing the chromosomal segments between markers that flank the gene of interest (Fig. 13).

The current human genetic map has about 1000 markers, or 1 marker spaced every 3 million bp; an estimated 100 genes lie between each pair of markers. Higher-resolution genetic maps have been made in regions of particular interest. New genes can be located by combining genetic and physical map information for a region. The genetic map basically describes gene order. Rough information about gene location is sometimes available also, but these data must be used with caution because recombination is not equally likely at all places on the chromosome. Thus the genetic map, compared to the physical map, stretches in some places and compresses in others, as though it were drawn on a rubber band.
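The chromosome walking strategy described above can be pictured with a toy simulation. This is only a sketch under simplifying assumptions: the template is a plain string standing in for the unknown region, each read has a fixed length, complementary-strand details are ignored, and the function name chromosome_walk and its parameters are hypothetical.

```python
def chromosome_walk(template, first_primer, read_length=6, primer_length=4):
    """Walk along a template, using the end of each read as the next primer.

    Returns the assembled sequence and the list of primers that were needed.
    """
    site = template.find(first_primer)
    if site < 0:
        raise ValueError("primer does not hybridize to the template")
    primers = [first_primer]
    known = first_primer
    read_start = site + len(first_primer)
    while read_start < len(template):
        # Each step reveals a short stretch adjacent to the known sequence.
        read = template[read_start:read_start + read_length]
        known += read
        read_start += len(read)
        if read_start < len(template):
            # The end of the newly known sequence becomes the next primer.
            primers.append(known[-primer_length:])
    return known, primers


if __name__ == "__main__":
    sequence, primers = chromosome_walk("ACGTTGCAAGGCTTAACCGG", "ACGT")
    print(sequence)
    print(primers)
```

Each step requires a newly synthesized primer, so the number of primers grows with the distance walked, which is the disadvantage noted in the text.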
The degree of difficulty in finding a disease gene of interest depends largely on what information is already known about the gene and, especially, on what kind of DNA alterations cause the disease. Spotting the disease gene is very difficult when disease results from a single altered DNA base; sickle cell anemia is an example of such a case, as are probably most major human inherited diseases. When disease results from a large DNA rearrangement, this anomaly can usually be detected as alterations in the physical map of the region or even by direct microscopic examination of the chromosome. The location of these alterations pinpoints the site of the gene. Identifying the gene responsible for a specific disease without a map is analogous to finding a needle in a haystack. Actually, finding the gene is even more difficult, because even close up, the gene still looks like just another piece of hay. However, maps give clues on where to look; the finer the map's resolution, the fewer pieces of hay to be tested. Once the neighborhood of a gene of interest has been identified, several strategies can be used to find the gene itself. An ordered library of the gene neighborhood can be constructed if one is not already available. This library provides DNA fragments that can be screened for additional polymorphisms, improving the genetic map of the region and further restricting the possible gene location. In addition, DNA fragments from the region can be used as probes to search for DNA sequences that are expressed (transcribed to RNA) or conserved among individuals. Most genes will have such sequences. Then individual gene candidates must be examined. For example, a gene responsible for liver disease is likely to be expressed in the liver and less likely in other tissues or organs. This type of evidence can further limit the search. Finally, a suspected gene may need to be sequenced in both healthy and affected individuals. 
A consistent pattern of DNA variation when these two samples are compared will show that the gene of interest has very likely been found. The ultimate proof is to correct the suspected DNA alteration in a cell and show that the cell's behavior reverts to normal.

Model Organism Research

Most mapping and sequencing technologies were developed from studies of nonhuman genomes, notably those of the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the fruit fly Drosophila melanogaster, the roundworm Caenorhabditis elegans, and the laboratory mouse Mus musculus. These simpler systems provide excellent models for developing and testing the procedures needed for studying the much more complex human genome.

A large amount of genetic information has already been derived from these organisms, providing valuable data for the analysis of normal gene regulation, genetic diseases, and evolutionary processes. Physical maps have been completed for E. coli, and extensive overlapping clone sets are available for S. cerevisiae and C. elegans. In addition, sequencing projects have been initiated by the NIH genome program for E. coli, S. cerevisiae, and C. elegans.

Mouse genome research will provide much significant comparative information because of the many biological and genetic similarities between mouse and man. Comparisons of human and mouse DNA sequences will reveal areas that have been conserved during evolution and are therefore important. An extensive database of mouse DNA sequences will allow counterparts of particular human genes to be identified in the mouse and extensively studied. Conversely, information on genes first found to be important in the mouse will lead to associated human studies. The mouse genetic map, based on morphological markers, has already led to many insights into human biology. Mouse models are being developed to explore the effects of mutations causing human diseases, including diabetes, muscular dystrophy, and several cancers.
A genetic map based on DNA markers is presently being constructed, and a physical map is planned to allow direct comparison with the human physical map.

Informatics: Data Collection and Interpretation

Collecting and Storing Data

The reference map and sequence generated by genome research will be used as a primary information source for human biology and medicine far into the future. The vast amount of data produced will first need to be collected, stored, and distributed. If compiled in books, the data would fill an estimated 200 volumes the size of a Manhattan telephone book (at 1000 pages each), and reading it would require 26 years working around the clock (Fig. 14).

Because handling this amount of data will require extensive use of computers, database development will be a major focus of the Human Genome Project. The present challenge is to improve database design, software for database access and manipulation, and data-entry procedures to compensate for the varied computer procedures and systems used in different laboratories. Databases need to be designed that will accurately represent map information (linkage, STSs, physical location, disease loci) and sequences (genomic, cDNAs, proteins) and link them to each other and to bibliographic text databases of the scientific and medical literature.

Interpreting Data

New tools will also be needed for analyzing the data from genome maps and sequences. Recognizing where genes begin and end and identifying their exons, introns, and regulatory sequences may require extensive comparisons with sequences from related species such as the mouse to search for conserved similarities (homologies). Searching a database for a particular DNA sequence may uncover these homologous sequences in a known gene from a model organism, revealing insights into the function of the corresponding human gene.
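The idea of searching for similar sequences can be illustrated with a deliberately naive sketch: slide a query along a longer sequence and report the offset with the highest percentage of matching bases. This is not how any particular database performs its searches; it ignores insertions, deletions, and statistical significance, and the function names and example sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def best_ungapped_match(query, subject):
    """Return (offset, identity) of the best ungapped placement of query in subject."""
    if len(query) > len(subject):
        raise ValueError("query is longer than subject")
    best_offset, best_identity = 0, -1.0
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        identity = percent_identity(query, window)
        if identity > best_identity:  # keep the first placement with the top score
            best_offset, best_identity = offset, identity
    return best_offset, best_identity


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))
    print(best_ungapped_match("GATTACA", "CCGATTACACC"))
```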
Correlating sequence information with genetic linkage data and disease gene research will reveal the molecular basis for human variation. If a newly identified gene is found to code for a flawed protein, the altered protein must be compared with the normal version to identify the specific abnormality that causes disease. Once the error is pinpointed, researchers must try to determine how to correct it in the human body, a task that will require knowledge about how the protein functions and in which cells it is active. Correct protein function depends on the three-dimensional (3-D), or folded, structure that proteins assume in biological environments; thus, understanding protein structure will be essential in determining gene function. DNA sequences will be translated into amino acid sequences, and researchers will try to make inferences about functions either by comparing protein sequences with each other or by comparing their specific 3-D structures (Fig. 15). Because the 3-D structure patterns (motifs) that protein molecules assume are much more evolutionarily conserved than amino acid sequences, this type of homology search could prove more fruitful. Particular motifs may serve similar functions in several different proteins, information that would be valuable in genome analyses. Currently, however, only a few protein motifs can be recognized at the sequence level. Continued development of analytic capabilities to facilitate grouping protein sequences into motif families will make homology searches more successful. Mapping Databases The Genome Data Base (GDB), located at Johns Hopkins University (Baltimore, Maryland), provides location, ordering, and distance information for human genetic markers, probes, and contigs linked to known human genetic disease. GDB is presently working on incorporating physical mapping data. Also at Hopkins is the Online Mendelian Inheritance in Man database, a catalog of inherited human traits and diseases.
The Human and Mouse Probes and Libraries Database (located at the American Type Culture Collection in Rockville, Maryland) and the GBASE mouse database (located at Jackson Laboratory, Bar Harbor, Maine) include data on RFLPs, chromosomal assignments, and probes from the laboratory mouse. Sequence Databases Nucleic Acids (DNA and RNA) GenBank, the European Molecular Biology Laboratory (EMBL) sequence database, and the DNA Database of Japan (DDBJ) house over 70 Mb of sequence from more than 2500 different organisms. Compiled from both direct submissions and journal scans, GenBank is supported at IntelliGenetics (Mountain View, California) and LANL through a contract from the NIH National Institute of General Medical Sciences. Although responsibility for GenBank will move to the National Center for Biotechnology Information (NCBI) of the National Library of Medicine in September 1992, LANL will continue to handle direct data submissions from authors. International collaborations with EMBL and DDBJ will also continue. NCBI is also developing GenInfo, a data archive that will eventually offer integrated access to other databases. Proteins The major protein sequence databases are the Protein Identification Resource (National Biomedical Research Foundation), Swissprot, and GenPept (both distributed with GenBank). In addition to sequence information, they contain information on protein motifs and other features of protein structure. Impact of the Human Genome Project The atlas of the human genome will revolutionize medical practice and biological research into the 21st century and beyond. All human genes will eventually be found, and accurate diagnostics will be developed for most inherited diseases. In addition, animal models for human disease research will be more easily developed, facilitating the understanding of gene function in health and disease. 
Researchers have already identified single genes associated with a number of diseases, such as cystic fibrosis, Duchenne muscular dystrophy, myotonic dystrophy, neurofibromatosis, and retinoblastoma. As research progresses, investigators will also uncover the mechanisms for diseases caused by several genes or by a gene interacting with environmental factors. Genetic susceptibilities have been implicated in many major disabling and fatal diseases including heart disease, stroke, diabetes, and several kinds of cancer. The identification of these genes and their proteins will pave the way to more-effective therapies and preventive measures. Investigators determining the underlying biology of genome organization and gene regulation will also begin to understand how humans develop from single cells to adults, why this process sometimes goes awry, and what changes take place as people age. New technologies developed for genome research will also find myriad applications in industry, as well as in projects to map (and ultimately improve) the genomes of economically important farm animals and crops. While human genome research itself does not pose any new ethical dilemmas, the use of data arising from these studies presents challenges that need to be addressed before the data accumulate significantly. To assist in policy development, the ethics component of the Human Genome Project is funding conferences and research projects to identify and consider relevant issues, as well as activities to promote public awareness of these topics. 
Appendix B: Conferences, Meetings, and Workshops Sponsored by DOE

4/89 Second Cold Spring Harbor Meeting on Genome Mapping and Sequencing: Cold Spring Harbor, NY
6/89 Chromosome 16 Workshop: New Haven, CT
10/89 First Annual Genome Sequencing Conference: Wolf Trap, Vienna, VA
12/89 Large Insert Cloning Workshop: Houston, TX
12/89 Human X Chromosome Workshop: Houston, TX
2/90 Chromosome 3 Workshop: San Antonio, TX
3/90 First Conference on Genetics, Religion, and Ethics: Houston, TX
4/90 Application of Mass Spectrometry to DNA Sequencing Workshop: Seattle, WA
4/90 Chromosome 21 Workshop: Bethesda, MD
4/90 Workshop on Mapping Human Chromosome 22: Paris
8/90 DOE-NIH Annual Planning and Evaluation Retreat: Hunt Valley, MD
8/90 Chromosome 19 Workshop: Charleston, SC
9/90 Genome Sequencing Conference II: Hilton Head, SC
9/90 First International Workshop on Human Chromosome 5: London
11/90 Fourth International Workshop on Mouse Genome Mapping: Annapolis, MD
1/91 Second X Chromosome Workshop: Oxford, England
2/91 Second DOE Contractor-Grantee Workshop: Santa Fe, NM
3/91 Chromosome 17 Workshop: Salt Lake City, UT
4/91 Workshop on Computational Molecular Biology: Seattle, WA
4/91 Chromosome 3 Workshop: Denver, CO
4/91 Chromosome 21 Workshop: Denver, CO
5/91 Sequencing by Hybridization Workshop: Gaithersburg, MD
5/91 Chromosome 11 Workshop: Paris
6/91 Workshop on Open Problems of Computational Molecular Biology: Telluride, CO
6/91 Chromosome 4 Workshop: Philadelphia, PA
9/91 DOE-NIH Annual Planning and Evaluation Retreat: Lafayette, CA
9/91 ELSI Working Group Meeting on Privacy: Bethesda, MD
9/91 First Panel Meeting "Predicting Future Diseases" at the National Academy of Sciences Institute of Medicine: Washington, DC
9/91 Genome Sequencing III: Hilton Head, SC
9/91 Workshop on Informatics Needs of Large-Scale Sequencing Projects: Hilton Head, SC
10/91 Conference on Identification of Transcribed Sequences in the Human Genome: Bethesda, MD
10/91 Workshop on DNA Sequence Acquisition and Interpretation: Cold Spring Harbor, NY
11/91 Conference on Justice and the Human Genome: Chicago, IL
11/91 Sequencing by Hybridization Workshop: Moscow
12/91 Human Genetics and Genome Analysis: A Practical Workshop for the Nonscientist: Cold Spring Harbor, NY
1/92 Chromosome 19 Workshop: Nijmegen, Netherlands
2/92 Chromosome 16 Workshop: Adelaide, Australia
3/92 Second Conference on Genetics, Religion, and Ethics: Houston, TX
3/92 Chromosome 17 Workshop: Salt Lake City, UT
3/92 Chromosome 3 Workshop: Tokyo, Japan
3/92 Chromosome 9 Workshop: Cambridge, England
5/92 Chromosome 5 Workshop: Chicago, IL
6/92 Chromosome 4 Workshop: Leiden, Netherlands
6/92 Chromosome 6 Workshop: Ann Arbor, MI
6/92 Chromosome 15 Workshop: Tucson, AZ
6/92 Chromosome 18 Workshop: Chicago, IL
6/92 DOE-NIH Annual Planning and Evaluation Retreat: Bethesda, MD

Partial Listing of Future DOE-Sponsored Workshops

9/92 Chromosome 11 Workshop: San Diego, CA
9/92 Chromosome 12 Workshop: Oxford, England
9/92 Chromosome 13 Workshop: New York, NY
11/92 Chromosome 2 Workshop: Lake Tahoe, CA
2/93 Third DOE Contractor-Grantee Workshop: Santa Fe, NM

Appendix C: Members of the DOE Health and Environmental Research Advisory Committee

Sheldon Wolff (Chair) University of California, San Francisco E. Morton Bradbury Los Alamos National Laboratory Eville Gorham University of Minnesota Jonathan Greer Abbott Laboratories Barbara Ann Hamkalo University of California, Irvine Sam Hurst Atom Sciences, Inc. Kenneth K. Kidd Yale University Leonard S. Lerman Massachusetts Institute of Technology Gordon J. MacDonald University of California, San Diego J. Justin McCormick Michigan State University Mortimer L. Mendelsohn Lawrence Livermore National Laboratory Mary Lou Pardue Massachusetts Institute of Technology Theodore L. Phillips University of California, San Francisco Richard C. Reba University of Chicago Melvin I. Simon California Institute of Technology Warren M.
Washington National Center for Atmospheric Research Audrey Wegst Diagnostic Technology Consultants, Inc. Harel Weinstein Mt. Sinai School of Medicine Appendix D: Members of NIH-DOE Joint Working Groups Joint Working Group on Ethical, Legal, and Social Issues (First met September 1989; first workshop held February 5-6, 1990) Nancy Wexler (Chair) Columbia University Jonathan R. Beckwith Harvard Medical School Robert Cook-Deegan National Academy of Sciences Institute Patricia King Georgetown University Law Center Victor A. McKusick Johns Hopkins University Hospital Robert F. Murray Howard University Thomas H. Murray Case Western Reserve University Joint Mapping Working Group (First met December 1989) David Botstein Stanford University Anthony V. Carrano Lawrence Livermore National Laboratory C. Thomas Caskey Baylor College of Medicine David R. Cox University of California, San Francisco Robert K. Moyzis Los Alamos National Laboratory Maynard V. Olson Washington University Joint Informatics Task Force (ad hoc) (First met March 7-9, 1990; final meeting January 3, 1992) Dieter Soll (Chair) Yale University George I. Bell Los Alamos National Laboratory David Botstein Stanford University Elbert Branscomb Lawrence Livermore National Laboratory John Devereux Genetics Computer Group Nathan Goodman Whitehead Institute Gregory Hamm Rutgers University Waksman Institute Eric Lander Massachusetts Institute of Technology Frank Olken Lawrence Berkeley Laboratory Mark L. Pearson E. I. du Pont de Nemours & Company Sylvia J. Spengler Lawrence Berkeley Laboratory Michael Waterman University of Southern California Joint Sequencing Working Group (First met May 10, 1990) Ellson Chen Genentech, Inc. Ronald Davis Stanford University John Devereux Genetics Computer Group Walter Gilbert Harvard University Leroy E. Hood California Institute of Technology Mark L. Pearson E.I. du Pont de Nemours & Company Joseph Sambrook University of Texas Phillip A. 
Sharp Massachusetts Institute of Technology William Studier Brookhaven National Laboratory Joint Working Group on the Mouse (First met May 6, 1991) Verne Chapman (Chair) Roswell Park Memorial Institute Frank Constantini Columbia University Neal Copeland National Cancer Institute-Frederick Cancer Research and Development Center William Dove University of Wisconsin, Madison Joseph Nadeau Jackson Laboratory Roger Reeves Johns Hopkins University Janet Rossant Mt. Sinai Hospital Oliver Smithies University of North Carolina, Chapel Hill Richard Woychik Oak Ridge National Laboratory Appendix E: Glossary Portions of the glossary text were taken directly or modified from definitions in the U.S. Congress Office of Technology Assessment document: Mapping Our Genes - The Genome Projects: How Big, How Fast? OTA-BA-373, Washington, D.C.: U.S. Government Printing Office, April 1988. Adenine (A): A nitrogenous base, one member of the base pair A-T (adenine-thymine). Alleles: Alternative forms of a genetic locus; a single allele for each locus is inherited separately from each parent (e.g., at a locus for eye color the allele might result in blue or brown eyes). Amino acid: Any of a class of 20 molecules that are combined to form proteins in living things. The sequence of amino acids in a protein and hence protein function are determined by the genetic code. Amplification: An increase in the number of copies of a specific DNA fragment; can be in vivo or in vitro. See cloning, polymerase chain reaction. Arrayed library: Individual primary recombinant clones (hosted in phage, cosmid, YAC, or other vector) that are placed in two-dimensional arrays in microtiter dishes. Each primary clone can be identified by the identity of the plate and the clone location (row and column) on that plate. Arrayed libraries of clones can be used for many applications, including screening for a specific gene or genomic region of interest as well as for physical mapping.
Information gathered on individual clones from various genetic linkage and physical map analyses is entered into a relational database and used to construct physical and genetic linkage maps simultaneously; clone identifiers serve to interrelate the multilevel maps. Compare library, genomic library. Autoradiography: A technique that uses X-ray film to visualize radioactively labeled molecules or fragments of molecules; used in analyzing length and number of DNA fragments after they are separated by gel electrophoresis. Autosome: A chromosome not involved in sex determination. The diploid human genome consists of 46 chromosomes: 22 pairs of autosomes and 1 pair of sex chromosomes (the X and Y chromosomes). Bacteriophage: See phage. Base pair (bp): Two nitrogenous bases (adenine and thymine or guanine and cytosine) held together by weak bonds. Two strands of DNA are held together in the shape of a double helix by the bonds between base pairs. Base sequence: The order of nucleotide bases in a DNA molecule. Base sequence analysis: A method, sometimes automated, for determining the base sequence. Biotechnology: A set of biological techniques developed through basic research and now applied to research and product development. In particular, the use by industry of recombinant DNA, cell fusion, and new bioprocessing techniques. bp: See base pair. cDNA: See complementary DNA. Centimorgan (cM): A unit of measure of recombination frequency. One centimorgan is equal to a 1% chance that a marker at one genetic locus will be separated from a marker at a second locus due to crossing over in a single generation. In human beings, 1 centimorgan is equivalent, on average, to 1 million base pairs. Centromere: A specialized chromosome region to which spindle fibers attach during cell division. Chromosomes: The self-replicating genetic structures of cells containing the cellular DNA that bears in its nucleotide sequence the linear array of genes.
In prokaryotes, chromosomal DNA is circular, and the entire genome is carried on one chromosome. Eukaryotic genomes consist of a number of chromosomes whose DNA is associated with different kinds of proteins. Clone bank: See genomic library. Clones: A group of cells derived from a single ancestor. Cloning: The process of asexually producing a group of cells (clones), all genetically identical, from a single ancestor. In recombinant DNA technology, the use of DNA manipulation procedures to produce multiple copies of a single gene or segment of DNA is referred to as cloning DNA. Cloning vector: DNA molecule originating from a virus, a plasmid, or the cell of a higher organism into which another DNA fragment of appropriate size can be integrated without loss of the vector's capacity for self-replication; vectors introduce foreign DNA into host cells, where it can be reproduced in large quantities. Examples are plasmids, cosmids, and yeast artificial chromosomes; vectors are often recombinant molecules containing DNA sequences from several sources. cM: See centimorgan. Code: See genetic code. Codon: See genetic code. Complementary DNA (cDNA): DNA that is synthesized from a messenger RNA template; the single-stranded form is often used as a probe in physical mapping. Complementary sequences: Nucleic acid base sequences that can form a double-stranded structure by matching base pairs; the complementary sequence to G-T-A-C is C-A-T-G. Conserved sequence: A base sequence in a DNA molecule (or an amino acid sequence in a protein) that has remained essentially unchanged throughout evolution. Contig map: A map depicting the relative order of a linked library of small overlapping clones representing a complete chromosomal segment. Contigs: Groups of clones representing overlapping regions of a genome. Cosmid: Artificially constructed cloning vector containing the cos site of phage lambda. Cosmids can be packaged in lambda phage particles for infection into E.
coli; this permits cloning of larger DNA fragments (up to 45 kb) than can be introduced into bacterial hosts in plasmid vectors. Crossing over: The breaking during meiosis of one maternal and one paternal chromosome, the exchange of corresponding sections of DNA, and the rejoining of the chromosomes. This process can result in an exchange of alleles between chromosomes. Compare recombination. Cytosine (C): A nitrogenous base, one member of the base pair G-C (guanine and cytosine). Deoxyribonucleotide: See nucleotide. Diploid: A full set of genetic material, consisting of paired chromosomes, one chromosome from each parental set. Most animal cells except the gametes have a diploid set of chromosomes. The diploid human genome has 46 chromosomes. Compare haploid. DNA (deoxyribonucleic acid): The molecule that encodes genetic information. DNA is a double-stranded molecule held together by weak bonds between base pairs of nucleotides. The four nucleotides in DNA contain the bases: adenine (A), guanine (G), cytosine (C), and thymine (T). In nature, base pairs form only between A and T and between G and C; thus the base sequence of each single strand can be deduced from that of its partner. DNA probes: See probe. DNA replication: The use of existing DNA as a template for the synthesis of new DNA strands. In humans and other eukaryotes, replication occurs in the cell nucleus. DNA sequence: The relative order of base pairs, whether in a fragment of DNA, a gene, a chromosome, or an entire genome. See base sequence analysis. Domain: A discrete portion of a protein with its own function. The combination of domains in a single protein determines its overall function. Double helix: The shape that two linear strands of DNA assume when bonded together. E. coli: Common bacterium that has been studied intensively by geneticists because of its small genome size, normal lack of pathogenicity, and ease of growth in the laboratory.
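The pairing rule in the DNA entry (A with T, G with C) means one strand fully determines the other. A minimal sketch of that deduction follows; the function names are illustrative, and the input is assumed to contain only the letters A, C, G, and T.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the partner strand, base by base, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own 5'-to-3' direction."""
    return complement(strand)[::-1]

print(complement("GTAC"))           # CATG, as in the "Complementary sequences" entry
print(reverse_complement("GATTC"))  # GAATC
```

The two strands of a double helix run in opposite directions, so sequence software usually reports the reverse complement; the plain complement is shown first because it matches the glossary's own G-T-A-C example.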
Electrophoresis: A method of separating large molecules (such as DNA fragments or proteins) from a mixture of similar molecules. An electric current is passed through a medium containing the mixture, and each kind of molecule travels through the medium at a different rate, depending on its electrical charge and size. Separation is based on these differences. Agarose and acrylamide gels are the media commonly used for electrophoresis of proteins and nucleic acids. Endonuclease: An enzyme that cleaves its nucleic acid substrate at internal sites in the nucleotide sequence. Enzyme: A protein that acts as a catalyst, speeding the rate at which a biochemical reaction proceeds but not altering the direction or nature of the reaction. EST: Expressed sequence tag. See sequence tagged site. Eukaryote: Cell or organism with membrane-bound, structurally discrete nucleus and other well-developed subcellular compartments. Eukaryotes include all organisms except viruses, bacteria, and blue-green algae. Compare prokaryote. See chromosomes. Evolutionarily conserved: See conserved sequence. Exogenous DNA: DNA originating outside an organism. Exons: The protein-coding DNA sequences of a gene. Compare introns. Exonuclease: An enzyme that cleaves nucleotides sequentially from free ends of a linear nucleic acid substrate. Expressed gene: See gene expression. FISH (fluorescence in situ hybridization): A physical mapping approach that uses fluorescein tags to detect hybridization of probes with metaphase chromosomes and with the less-condensed somatic interphase chromatin. Flow cytometry: Analysis of biological material by detection of the light-absorbing or fluorescing properties of cells or subcellular fractions (i.e., chromosomes) passing in a narrow stream through a laser beam. An absorbance or fluorescence profile of the sample is produced. 
Automated sorting devices, used to fractionate samples, sort successive droplets of the analyzed stream into different fractions depending on the fluorescence emitted by each droplet. Flow karyotyping: Use of flow cytometry to analyze and/or separate chromosomes on the basis of their DNA content. Gamete: Mature male or female reproductive cell (sperm or ovum) with a haploid set of chromosomes (23 for humans). Gene: The fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule). See gene expression. Gene expression: The process by which a gene's coded information is converted into the structures present and operating in the cell. Expressed genes include those that are transcribed into mRNA and then translated into protein and those that are transcribed into RNA but not translated into protein (e.g., transfer and ribosomal RNAs). Gene families: Groups of closely related genes that make similar products. Gene library: See genomic library. Gene mapping: Determination of the relative positions of genes on a DNA molecule (chromosome or plasmid) and of the distance, in linkage units or physical units, between them. Gene product: The biochemical material, either RNA or protein, resulting from expression of a gene. The amount of gene product is used to measure how active a gene is; abnormal amounts can be correlated with disease-causing alleles. Genetic code: The sequence of nucleotides, coded in triplets (codons) along the mRNA, that determines the sequence of amino acids in protein synthesis. The DNA sequence of a gene can be used to predict the mRNA sequence, and the genetic code can in turn be used to predict the amino acid sequence. Genetic engineering technologies: See recombinant DNA technologies. Genetic map: See linkage map. Genetic material: See genome. 
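The "Genetic code" entry notes that a gene's DNA sequence can be used to predict an amino acid sequence. The sketch below illustrates that prediction under stated assumptions: it uses the standard genetic code, takes the coding strand of DNA as input (T in place of the U found in mRNA), starts reading at the first base, and ignores introns. The names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order at each
# position. One-letter amino acid codes; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(coding_dna: str) -> str:
    """Translate a coding-strand DNA sequence, three bases at a time."""
    seq = coding_dna.upper()
    usable = len(seq) - len(seq) % 3  # drop any incomplete trailing codon
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))

print(translate("ATGGCCTAA"))  # MA* (methionine, alanine, stop)
```

Real gene prediction is considerably harder than this, since the correct reading frame and the exon boundaries must be found first.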
Genetics: The study of the patterns of inheritance of specific traits. Genome: All the genetic material in the chromosomes of a particular organism; its size is generally given as its total number of base pairs. Genome projects: Research and technology development efforts aimed at mapping and sequencing some or all of the genome of human beings and other organisms. Genomic library: A collection of clones made from a set of randomly generated overlapping DNA fragments representing the entire genome of an organism. Compare library, arrayed library. Guanine (G): A nitrogenous base, one member of the base pair G-C (guanine and cytosine). Haploid: A single set of chromosomes (half the full set of genetic material), present in the egg and sperm cells of animals and in the egg and pollen cells of plants. Human beings have 23 chromosomes in their reproductive cells. Compare diploid. Heterozygosity: The presence of different alleles at one or more loci on homologous chromosomes. Homeobox: A short stretch of nucleotides whose base sequence is virtually identical in all the genes that contain it. It has been found in many organisms from fruit flies to human beings. In the fruit fly, a homeobox appears to determine when particular groups of genes are expressed during development. Homologies: Similarities in DNA or protein sequences between individuals of the same species or among different species. Homologous chromosomes: A pair of chromosomes containing the same linear gene sequences, each derived from one parent. Human gene therapy: Insertion of normal DNA directly into cells to correct a genetic defect. Human Genome Initiative: Collective name for several projects begun in 1986 by DOE to (1) create an ordered set of DNA segments from known chromosomal locations, (2) develop new computational methods for analyzing genetic map and DNA sequence data, and (3) develop new techniques and instruments for detecting and analyzing DNA. 
This DOE initiative is now known as the Human Genome Program. The national effort, led by DOE and NIH, is known as the Human Genome Project. Hybridization: The process of joining two complementary strands of DNA or one each of DNA and RNA to form a double-stranded molecule. Informatics: The study of the application of computer and statistical techniques to the management of information. In genome projects, informatics includes the development of methods to search databases quickly, to analyze DNA sequence information, and to predict protein sequence and structure from DNA sequence data. In situ hybridization: Use of a DNA or RNA probe to detect the presence of the complementary DNA sequence in cloned bacterial or cultured eukaryotic cells. Interphase: The period in the cell cycle when DNA is replicated in the nucleus; followed by mitosis. Introns: The DNA base sequences interrupting the protein-coding sequences of a gene; these sequences are transcribed into RNA but are cut out of the message before it is translated into protein. Compare exons. In vitro: Outside a living organism. Karyotype: A photomicrograph of an individual's chromosomes arranged in a standard format showing the number, size, and shape of each chromosome type; used in low-resolution physical mapping to correlate gross chromosomal abnormalities with the characteristics of specific diseases. kb: See kilobase. Kilobase (kb): Unit of length for DNA fragments equal to 1000 nucleotides. Library: An unordered collection of clones (i.e., cloned DNA from a particular organism), whose relationship to each other can be established by physical mapping. Compare genomic library, arrayed library. 
Linkage: The proximity of two or more markers (e.g., genes, RFLP markers) on a chromosome; the closer together the markers are, the lower the probability that they will be separated during DNA repair or replication processes (binary fission in prokaryotes, mitosis or meiosis in eukaryotes), and hence the greater the probability that they will be inherited together. Linkage map: A map of the relative positions of genetic loci on a chromosome, determined on the basis of how often the loci are inherited together. Distance is measured in centimorgans (cM). Localize: Determination of the original position (locus) of a gene or other marker on a chromosome. Locus (pl. loci): The position on a chromosome of a gene or other chromosome marker; also, the DNA at that position. The use of locus is sometimes restricted to mean regions of DNA that are expressed. See gene expression. Macrorestriction map: Map depicting the order of and distance between sites at which restriction enzymes cleave chromosomes. Mapping: See gene mapping, linkage map, physical map. Marker: An identifiable physical location on a chromosome (e.g., restriction enzyme cutting site, gene) whose inheritance can be monitored. Markers can be expressed regions of DNA (genes) or some segment of DNA with no known coding function but whose pattern of inheritance can be determined. See RFLP, restriction fragment length polymorphism. Mb: See megabase. Megabase (Mb): Unit of length for DNA fragments equal to 1 million nucleotides and roughly equal to 1 cM. Meiosis: The process of two consecutive cell divisions in the diploid progenitors of sex cells. Meiosis results in four rather than two daughter cells, each with a haploid set of chromosomes. Messenger RNA (mRNA): RNA that serves as a template for protein synthesis. See genetic code. Metaphase: A stage in mitosis or meiosis during which the chromosomes are aligned along the equatorial plane of the cell. 
Mitosis: The process of nuclear division in cells that produces daughter cells that are genetically identical to each other and to the parent cell. mRNA: See messenger RNA. Multifactorial or multigenic disorders: See polygenic disorders. Multiplexing: A sequencing approach that uses several pooled samples simultaneously, greatly increasing sequencing speed. Mutation: Any heritable change in DNA sequence. Compare polymorphism. Nitrogenous base: A nitrogen-containing molecule having the chemical properties of a base. Nucleic acid: A large molecule composed of nucleotide subunits. Nucleotide: A subunit of DNA or RNA consisting of a nitrogenous base (adenine, guanine, thymine, or cytosine in DNA; adenine, guanine, uracil, or cytosine in RNA), a phosphate molecule, and a sugar molecule (deoxyribose in DNA and ribose in RNA). Thousands of nucleotides are linked to form a DNA or RNA molecule. See DNA, base pair, RNA. Nucleus: The cellular organelle in eukaryotes that contains the genetic material. Oncogene: A gene, one or more forms of which is associated with cancer. Many oncogenes are involved, directly or indirectly, in controlling the rate of cell growth. Overlapping clones: See genomic library. PCR: See polymerase chain reaction. Phage: A virus for which the natural host is a bacterial cell. Physical map: A map of the locations of identifiable landmarks on DNA (e.g., restriction enzyme cutting sites, genes), regardless of inheritance. Distance is measured in base pairs. For the human genome, the lowest-resolution physical map is the banding patterns on the 24 different chromosomes; the highest-resolution map would be the complete nucleotide sequence of the chromosomes. Plasmid: Autonomously replicating, extrachromosomal circular DNA molecules, distinct from the normal bacterial genome and nonessential for cell survival under nonselective conditions. Some plasmids are capable of integrating into the host genome.
A number of artificially constructed plasmids are used as cloning vectors. Polygenic disorders: Genetic disorders resulting from the combined action of alleles of more than one gene (e.g., heart disease, diabetes, and some cancers). Although such disorders are inherited, they depend on the simultaneous presence of several alleles; thus the hereditary patterns are usually more complex than those of single-gene disorders. Compare single-gene disorders. Polymerase chain reaction (PCR): A method for amplifying a DNA base sequence using a heat-stable polymerase and two 20-base primers, one complementary to the (+)-strand at one end of the sequence to be amplified and the other complementary to the (-)-strand at the other end. Because the newly synthesized DNA strands can subsequently serve as additional templates for the same primer sequences, successive rounds of primer annealing, strand elongation, and dissociation produce rapid and highly specific amplification of the desired sequence. PCR also can be used to detect the existence of the defined sequence in a DNA sample. Polymerase, DNA or RNA: Enzymes that catalyze the synthesis of nucleic acids on preexisting nucleic acid templates, assembling RNA from ribonucleotides or DNA from deoxyribonucleotides. Polymorphism: Difference in DNA sequence among individuals. Genetic variations occurring in more than 1% of a population would be considered useful polymorphisms for genetic linkage analysis. Compare mutation. Primer: Short preexisting polynucleotide chain to which new deoxyribonucleotides can be added by DNA polymerase. Probe: Single-stranded DNA or RNA molecules of specific base sequence, labeled either radioactively or immunologically, that are used to detect the complementary base sequence by hybridization. Prokaryote: Cell or organism lacking a membrane-bound, structurally discrete nucleus and other subcellular compartments. Bacteria are prokaryotes. Compare eukaryote. See chromosomes. 
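The polymerase chain reaction entry above explains that newly made strands serve as templates in later rounds, which is why amplification is so rapid. A minimal sketch of that growth follows. It assumes an idealized model in which a fixed fraction of templates is copied each cycle; the `efficiency` parameter and the function name are illustrative, and real reactions level off in later cycles.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Estimate the copy number after a number of PCR cycles.

    efficiency = 1.0 is perfect doubling; lower values mean only that
    fraction of templates is copied in each cycle (idealized model).
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles

for n in (10, 20, 30):
    print(f"{n} cycles: {pcr_copies(1, n):,.0f} copies from one template")
```

With perfect doubling, a single template yields roughly a thousand copies after 10 cycles and roughly a billion after 30.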
Promoter: A site on DNA to which RNA polymerase will bind and initiate transcription. Protein: A large molecule composed of one or more chains of amino acids in a specific order; the order is determined by the base sequence of nucleotides in the gene coding for the protein. Proteins are required for the structure, function, and regulation of the body's cells, tissues, and organs, and each protein has unique functions. Examples are hormones, enzymes, and antibodies. Purine: A nitrogen-containing, double-ring, basic compound that occurs in nucleic acids. The purines in DNA and RNA are adenine and guanine. Pyrimidine: A nitrogen-containing, single-ring, basic compound that occurs in nucleic acids. The pyrimidines in DNA are cytosine and thymine; in RNA, cytosine and uracil. Rare-cutter enzyme: See restriction enzyme cutting site. Recombinant clones: Clones containing recombinant DNA molecules. See recombinant DNA technologies. Recombinant DNA molecules: A combination of DNA molecules of different origin that are joined using recombinant DNA technologies. Recombinant DNA technologies: Procedures used to join together DNA segments in a cell-free system (an environment outside a cell or organism). Under appropriate conditions, a recombinant DNA molecule can enter a cell and replicate there, either autonomously or after it has become integrated into a cellular chromosome. Recombination: The process by which progeny derive a combination of genes different from that of either parent. In higher organisms, this can occur by crossing over. Regulatory regions or sequences: A DNA base sequence that controls gene expression. Resolution: Degree of molecular detail on a physical map of DNA, ranging from low to high. Restriction enzyme, endonuclease: A protein that recognizes specific, short nucleotide sequences and cuts DNA at those sites. Bacteria contain over 400 such enzymes that recognize and cut over 100 different DNA sequences. See restriction enzyme cutting site.
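To make the restriction enzyme entry above concrete, the following sketch locates every occurrence of a recognition sequence in a linear DNA string and reports the fragment lengths a complete digest would leave. It is an illustration only: the function names are hypothetical, only one strand is scanned, and the example site GAATTC (the EcoRI recognition sequence, cut after the first base) is supplied here as an assumption and does not come from this glossary.

```python
def find_cut_sites(dna: str, recognition_site: str) -> list[int]:
    """Return the 0-based start position of every occurrence of the site."""
    if not recognition_site:
        raise ValueError("recognition_site must not be empty")
    dna = dna.upper()
    site = recognition_site.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        # Advance by one base so that overlapping occurrences are also found.
        start = dna.find(site, start + 1)
    return positions


def fragment_lengths(dna: str, recognition_site: str, cut_offset: int) -> list[int]:
    """Fragment lengths after cutting a linear molecule at every site.

    cut_offset is the number of bases into the site at which the cut falls.
    """
    cuts = [p + cut_offset for p in find_cut_sites(dna, recognition_site)]
    boundaries = [0] + cuts + [len(dna)]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


if __name__ == "__main__":
    sample = "AAGAATTCTTTTGAATTCGG"   # made-up sequence with two sites
    mutated = "AAGAATTCTTTTGAGTTCGG"  # single-base change destroys the second site
    print(fragment_lengths(sample, "GAATTC", 1))
    print(fragment_lengths(mutated, "GAATTC", 1))
```

Comparing the two digests shows how a single base change at a cutting site alters the set of fragment sizes.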
Restriction enzyme cutting site: A specific nucleotide sequence of DNA at which a particular restriction enzyme cuts the DNA. Some sites occur frequently in DNA (e.g., every several hundred base pairs), others much less frequently (rare-cutter; e.g., every 10,000 base pairs). Restriction fragment length polymorphism (RFLP): Variation between individuals in DNA fragment sizes cut by specific restriction enzymes; polymorphic sequences that result in RFLPs are used as markers on both physical maps and genetic linkage maps. RFLPs are usually caused by mutation at a cutting site. See marker. RFLP: See restriction fragment length polymorphism. Ribonucleic acid (RNA): A chemical found in the nucleus and cytoplasm of cells; it plays an important role in protein synthesis and other chemical activities of the cell. The structure of RNA is similar to that of DNA. There are several classes of RNA molecules, including messenger RNA, transfer RNA, ribosomal RNA, and other small RNAs, each serving a different purpose. Ribonucleotides: See nucleotide. Ribosomal RNA (rRNA): A class of RNA found in the ribosomes of cells. Ribosomes: Small cellular components composed of specialized ribosomal RNA and protein; site of protein synthesis. See ribonucleic acid (RNA). RNA: See ribonucleic acid. Sequence: See base sequence. Sequence tagged site (STS): Short (200 to 500 base pairs) DNA sequence that has a single occurrence in the human genome and whose location and base sequence are known. Detectable by polymerase chain reaction, STSs are useful for localizing and orienting the mapping and sequence data reported from many different laboratories and serve as landmarks on the developing physical map of the human genome. Expressed sequence tags (ESTs) are STSs derived from cDNAs. Sequencing: Determination of the order of nucleotides (base sequences) in a DNA or RNA molecule or the order of amino acids in a protein. 
Sex chromosomes: The X and Y chromosomes in human beings that determine the sex of an individual. Females have two X chromosomes in diploid cells; males have an X and a Y chromosome. The sex chromosomes comprise the 23rd chromosome pair in a karyotype. Compare autosome. Shotgun method: Cloning of DNA fragments randomly generated from a genome. See library, genomic library. Single-gene disorder: Hereditary disorder caused by a mutant allele of a single gene (e.g., Duchenne muscular dystrophy, retinoblastoma, sickle cell disease). Compare polygenic disorders. Somatic cells: Any cell in the body except gametes and their precursors. Southern blotting: Transfer by absorption of DNA fragments separated in electrophoretic gels to membrane filters for detection of specific base sequences by radiolabeled complementary probes. STS: See sequence tagged site. Tandem repeat sequences: Multiple copies of the same base sequence on a chromosome; used as a marker in physical mapping. Technology transfer: The process of converting scientific findings from research laboratories into useful products by the commercial sector. Telomere: The ends of chromosomes. These specialized structures are involved in the replication and stability of linear DNA molecules. See DNA replication. Thymine (T): A nitrogenous base, one member of the base pair A-T (adenine-thymine). Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression. Compare translation. Transfer RNA (tRNA): A class of RNA having structures with triplet nucleotide sequences that are complementary to the triplet nucleotide coding sequences of mRNA. The role of tRNAs in protein synthesis is to bond with amino acids and transfer them to the ribosomes, where proteins are assembled according to the genetic code carried by mRNA. Transformation: A process by which the genetic material carried by an individual cell is altered by incorporation of exogenous DNA into its genome. 
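The transcription entry above can be sketched in a few lines: each base on the DNA template strand specifies the complementary base in the RNA copy, with uracil taking the place of thymine. This is a simplified illustration, not a model of the enzymatic process. The function name transcribe is hypothetical, and the sketch assumes the template strand is written in the order in which it is read, so that the output is the RNA in the order it is synthesized.

```python
# Base-pairing rules applied during transcription (template DNA base -> RNA base).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_strand: str) -> str:
    """Return the RNA sequence complementary to a DNA template strand."""
    rna = []
    for base in template_strand.upper():
        if base not in DNA_TO_RNA:
            raise ValueError(f"unexpected base: {base!r}")
        rna.append(DNA_TO_RNA[base])
    return "".join(rna)


if __name__ == "__main__":
    print(transcribe("TACGGT"))
```

For the made-up template TACGGT, the function returns AUGCCA.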
Translation: The process in which the genetic code carried by mRNA directs the synthesis of proteins from amino acids. Compare transcription. tRNA: See transfer RNA. Uracil: A nitrogenous base normally found in RNA but not DNA; uracil is capable of forming a base pair with adenine. Vector: See cloning vector. Virus: A noncellular biological entity that can reproduce only within a host cell. Viruses consist of nucleic acid covered by protein; some animal viruses are also surrounded by membrane. Inside the infected cell, the virus uses the synthetic capability of the host to produce progeny virus. VLSI: Very large-scale integration allowing over 100,000 transistors on a chip. YAC: See yeast artificial chromosome. Yeast artificial chromosome (YAC): A vector used to clone DNA fragments (up to 400 kb); it is constructed from the telomeric, centromeric, and replication origin sequences needed for replication in yeast cells. Compare cloning vector, cosmid. Index to Principal and Coinvestigators Listed in Abstracts To retrieve these abstracts, use the following: 8. Search Abstracts of DOE-Funded Genome Research. You may search by Author Name, Address, or any word that appears in the abstract. You may narrow your search by using the boolean operators (and, or, not) or by phrase searches ("....."). For example, if you want to see all the mouse work funded by the DOE Genome project, simply search for: mouse. But if you want to see only the mouse projects that have proposed to use Fluorescence In Situ Hybridization (FISH), search for: mouse and fish. This will narrow the results dramatically. Adams, Mark 97 Bulger, Ruth 156 Adamson, Anne 164 Burks, Christian 141 Alexander, Peter 182 Callen, David 106, 108 Allen, Michael 177 Campbell, Evelyn 83, 89 Allison, David 125 Campbell, Mary 83, 89 Amemiya, Chris 100, 104 Cantor, Charles 111, 163 Anderson, N. Leigh 168 Carrano, Anthony 84, 88, 94, Anderson, Norman 168 100, 103, Anderson, W.
Holt 167 104, 109, 139 Andreason, Grai 174 Casey, Denise 164 Antonarakis, Stylianos 172 Caskey, C. Thomas 99, 157 Apostolou, Sinoula 108 Cassatt, James 140 Apsell, Paula 156 Chait, Brian 136 Arenstorf, Hartwig 171 Chedd, Graham 156 Arlinghaus, Heinrich 127, Chen, Chira 104 165, 168, 177 Chen, C. H. Winston 120 Ashworth, Linda 104 Chen, Ed 143 Aslandidis, Charalampos 100 Chen, Jiun 132 Athwal, Raghbir 82 Chen, Liang 108 Bacha, Hamid 185 Chen, Shizhong 174 Baker, Elizabeth 108 Cheng, Jan-Fang 83, 100 Baker, Mark 123 Cherkauer, Kevin 151 Balding, David 152 Church, George 121 Balhorn, Rodney 177, 181 Cinkosky, Michael 141, 141 Balooch, Mehdi 177 Clancy, Suzanne 101 Barker, David 172 Clark, Steven 101 Beckwith, Jonathon 186 Collins, Debra 157 Beeson, Diane 158 Combs, Jesse 104 Benner, W. Henry 128 Copeland, Alex 104 Berg, Claire 137, 178 Corona, Angela 184 Berg, Douglas 178 Crandall, Lee 161 Beugelsdijk, Tony 111, 116 Craven, Mark 151 Birdsall, David 181 Crkvenjakov, Radomir 121 Birren, Bruce 105 Davidson, Jack 122 Black, Lindsay 93 Davidson, K. Alicia 164 Blackwell, Tom 152 Deaven, Larry 83, 84, 89, 106 Bonaldo, Maria 96 de Jong, Pieter 84, 100, 103, Bouma, Hessel III 157 104 Boyartchuk, Victor 83 Denton, M. Bonner 123 Bradbury, E. 
Morton 82 Djbali, Malek 174 Brandriff, Brigitte 109 Doggett, Norman 106, 108 Branscomb, Elbert 103, 139 Dougherty, Randall 141 Brase, James 181 Douthart, Richard 140 Bremer, Meire 105 Dovichi, Norman 124 Brennan, Thomas 119 Drmanac, Radoje 121 Bridgers, Michael 141 Dubnick, Mark 97 Brody, Linnea 98 Dunn, John 138 Bronstein, Irena 167, 170 Durkin, Scott 92 Brown, Gilbert 119, 125, 127 Duster, Troy 158 Brown, Henry 141 Earle, Colin 123 Brown, Stephen 96 Edmonds, Charles 136 Brule, James 185 Edwards, Brooks 170 Efstratiadis, Agiris 96 Hofmann, Gunter 174 Einstein, Ralph 154 Hollen, Robert 111, 116 Entine, Gerald 173 Holmes, Linda 163 Epling, Gary 96 Holtzman, Neil 159 Eubanks, James 174 Honda, Sandra 144 Evans, Glen 101, 174 Hood, Leroy 126, 143 Faber, Vance 141 Hopkins, Janet 95 Fader, Betsy 186 Hozier, John 88 Fain, Pamela 172 Huang, Henry 178 Fairfield, Frederic 152 Huang, Xiaohua 132 Fawcett, John 89 Huber, Hans 180 Feitshans, Ilise 159 Huhn, Greg 174 Ferrell, Thomas 125 Hunkapiller, Tim 143, 147, Fickett, James 141, 152 148 Fields, Christopher 97, 142 Hurst, Gerald 166 Fischer, Peggy 186 Hutchinson, Marge 145, 155 Flatley, Jay 167 Imara, Mwalimu 161 Fockler, Carita 105 Jackson, Cynthia 86 Foote, Robert 119, 122, 125, Jacobson, K. Bruce 119, 120, 127 125, 127 Francomano, Clare 148 Fullarton, Jane 156 Jaklevic, Joseph 113, 115, Furuya, Frederic 166 117, 128, 128 Gabra, Nashua 175 Jelenc, Pierre 96 Gatewood, Joe 82 Jett, James 114, 129 Generoso, Estela 107 Johnson, Lori 104 Gesteland, Raymond 126 Juo, Rouh-Rong 170 Gibson, William 165 Jurka, Jerzy 144 Giddings, J. 
Calvin 112 Kandpal, Rajendra 171 Gingrich, Jeffrey 102 Kang, Hee-Chol 179 Giovannini, Marco 174 Kao, Fa-Ten 86 Glazer, Alexander 132 Karger, Barry 114 Goldberg, Mark 141 Katz, Joseph 115, 128 Gong, Kevin 145 Kaufman, Daniel 174 Grad, Frank 159 Keller, Richard 123, 129 Grady, Deborah 84 Kelley, Jenny 97 Gray, Joe 85, 181 Kerlavage, Anthony 97 Greener, Phillip 98 Khan, Akbar 95 Gusfield, Daniel 144, 148 Knoche, Kimberly 166 Hahn, Peter 88 Kolbe, William 113, 115, 128 Hainfeld, James 112, 166 Kopelman, Raoul 130 Hansen, Tony 113 Korenberg, Julie 87 Hart, Reece 174 Kozman, Helen 108 Hartman, John 167, 169 Kozubel, Mark 116 Haugland, Richard 179 Kuo, Wen-Lin 85 Hempfner, Philip 141 Lai, Tran 141 Henderson, Margaret 161 Lane, Michael 88 Hermanson, Gary 101, 174 Lane, Sharon 108 Hewitt, Peter 179 Langmore, John 130 Hieter, Philip 172 Lappé, Marc 186 Hildebrand, C. Edgar 106, 108 Larimer, Frank 119, 127 Himawan, Jeff 133 Lawler, Eugene 144, 148 Hoekstra, Merl 101 Lawrence, Charles 144 Hoffman, Lance 186 Lee, Bill 94 Lennon, Gregory 88 Murphy, Timothy 186 Leonard, Lisa 174 Myers, Eugene 144 Lerman, Leonard 115, 175 Nagle, James 97 Lewis, Suzanna 145, 145, 147 Nancarrow, Julie 108 Longmire, Jon 84 Natowicz, Marvin 160 Loo, Joseph 136 Nelson, David 139 Lowery, Robert 166 Nelson, David L. 91, 99 Lowry, Steven 102 Nelson, Debra 141 Lumley, Amanda 163 Nelson, J. Robert 157 Macken, Catherine 152 Nierman, William 92, 171 Maglott, Donna 92, 171 Nikolic, Julia 100 Makarov, Vladimir 130 Noordewier, Michiel 151 Mann, Reinhold 154 Oehler, Chuck 166 Mansfield, Betty 164 Okumura, K. 106 Mark, Hon Fong 86 Olken, Frank 145, 145, 148 Markowitz, Victor 145, 146, Olsen, Anne 104 147 Orpana, Arto 95 Marr, Thomas 148 Orr, Bradford 130 Martin, Christopher 170 Overbeek, Ross 182 Martin, Christopher H.
89, Page, George 160 103, 131 Palazzolo, Michael 89, 103, Martin, John 114, 129 131 Martin, Sheryl 164 Parimoo, Satish 171 Matheson, Nina 148 Patanjali, Sankhavaram 171 Mathies, Richard 132 Payne, Marvin 120 Maurer, Susanne 174 Pearson, Peter 148 Mayeda, Carol 89, 103, 131 Pecherer, Robert 141, 141 McAllister, Douglass 167 Pelkey, Joanne 140 McCarthy, John 145 Peters, Don 85 McCormick, MaryKay 84, 89, Pfeifer, Gerd 135 106 Phillips, Hilary 108 McElligott, David 101, 174 Phoenix, David 161 McInerney, Joseph 159 Pinkel, Dan 85 McKean, Ronald 175 Pirrung, Michael 133 McKusick, Victor 148 Polymeropoulos, Mihael 92 Mead, David 132 Powell, Richard 166 Medvick, Patricia 111, 116 Pratt, Lorien 151 Meincke, Linda 89 Quesada, Mark 132 Merrill, Carl 92 Radspinner, David 123 Meyne, Julie 91 Ramsey, Roswitha 119 Michael, Sharon 97 Rao, Venigalla 93 Micklos, David 161 Ratliff, Robert 91 Milosavljevic, Aleksandar 144 Reilly, Phillip 161 Mohrenweiser, Harvey 88, 103 Reiner, Andrew 148 Moir, Donald 90 Richards, Robert 108 Moore, Stefan 160 Richardson, Charles 133, 180 Moreno, Ruben 97 Riggs, Arthur 135 Mosley, Ray 161 Rinchik, Eugene 93, 94, 107, Moyzis, Robert 84, 89, 91, 153 106 Ringold, Gordon 167 Mucenski, Mike 94 Robbins, Robert 148 Mulley, John 108 Roberts, Randy 111, 116 Mundt, Mark 141 Roman, Maria 174 Mural, Richard 153, 154 Romo, Anthony 174 Roszak-MacDonell, Darlene 167 Sutherland, Grant 106, 108 Rush, John 180 Sutherland, Robert 141 Rye, Hays 132 Swaroop, Anand 171 Sachleben, Richard 119, 125, Szeto, Ernest 146 127 Tabor, Stanley 133, 180 Sainz, Jesus 105 Tan, Weihong 130 Saleh, Mary 174 Tang, Jane 100 Schenk, Karen 152 Thakhar, Vishakha 93 Schimke, R. 
Neil 157 Theil, Edward 113, 117 Schmitt, Eric 175 Thompson, Andrew 108 Schwartz, Stanley 183 Thonnard, Norbert 127, 165 Searls, David 150 Thundat, Thomas 125 Segebrecht, Linda 157 Thurman, David 140 Selleri, Licia 101, 174 Toliver, Greg 174 Sgro, Peichen 141 Torney, David 141, 152 Shavlik, Jude 151 Towell, Geoffrey 151 Shen, Yang 108 Trask, Barbara 84, 103, 109 Shera, E. Brooks 129 Trebes, James 181 Shi, Zhong You 130 Trimmer, David 116 Shizuya, Hiroaki 105 Trottier, Ralph 161 Shoshani, Arie 146 Troup, Charles 141 Siciliano, Michael 94 Tynan, Katherine 103, 109 Siekhaus, Wigbert 177 Uber, Donald 113, 117 Sikela, James 95 Uberbacher, Edward 153, 154 Simon, Melvin 105 van den Engh, Ger 84, 88, 109 Sindelar, Linda 113 Varghese, Alison 170 Slezak, Tom 139 Vaux, Kenneth 186 Smith, Cassandra 105 Venter, J. Craig 97 Smith, Lloyd 132, 135, 136 Vos, Jean-Michel 97 Smith, Michael 101 Voyta, John 170 Smith, Richard 136 Wagner, Caryn 174 Smith, Steven 130 Wahl, Geoffrey 98 Snider, Ken 174 Walichiewicz, Jolanta 144 Soares, Marcelo 96 Wang, Denan 105 Soderlund, Carol 142 Warburton, Dorothy 159 Solomon, David 169 Ward, David 106, 171 Sorenson, Doug 141, 141 Warmack, Robert 125 Speed, Terence 148 Wassom, John 164 Spengler, Sylvia 164 Waterman, Michael 143 Stallings, Raymond 106, 108 Weiss, Robert 126 Stevens, Tamara 95 Weissman, Sherman 171 Stiegman, Jeffrey 183, 184 Wendroff, Burton 152 Stinnett, Donna 164 West, John 185 Stormon, Charles 185 Whitaker, James 179 Storti, George 176 Whitmore, Scott 108 Stovall, Leonard 116 Whitsitt, Andrew 151 Strathmann, Michael 131 Whittaker, Clive 152 Strausbaugh, Linda 137 Wilcox, Andrea 95 Stricker, Jenny 159 Wilder, Mark 114 Stubbs, Lisa 107 Williams, Peter 138 Studier, F. 
William 138 Winternitz, Katherine 159 Sudar, Damir 85 Witkowski, Jan 161 Sun, Tian-Qiang 97 Wohlpart, Alfred 163 Sutherland, Betsy 96 Woodbury, Neal 138 Woychik, Richard 93, 94, 119, 127, 153 Wright, James 163 Wyrick, Judy 164 Xiao, Hong 92 Yang, Sherman 144 Yantis, Bonnie 141 Yesley, Michael 164 Yeung, Edward 117 Yokobata, Kathy 100 Yorkey, Thomas 181 Yoshida, Kaoru 105 Youderian, Philip 98 Yu, Jing-Wei 86 Yust, Laura 164 Zhao, Jun 174 Zorn, Manfred 145, 155

Acronym List

AEC      Atomic Energy Commission
ANL*     Argonne National Laboratory, Argonne, IL
ATCC     American Type Culture Collection, Rockville, MD
BNL*     Brookhaven National Laboratory, Upton, NY
CEPH     Centre d'Etude du Polymorphisme Humain
CRADA    Cooperative Research and Development Agreement
DKFZ     German Cancer Research Center
DOE      Department of Energy
ERDA     Energy Research and Development Administration
FCCSET   Federal Coordinating Council on Science, Engineering, and Technology
GDB*     Genome Data Base
HERAC*   Health and Environmental Research Advisory Committee
HGCC*    Human Genome Coordinating Committee
HGMIS*   Human Genome Management Information System (ORNL)
HUGO     Human Genome Organization (international)
JHU      Johns Hopkins University
JITF*    Joint Informatics Task Force
LANL*    Los Alamos National Laboratory, Los Alamos, NM
LBL*     Lawrence Berkeley Laboratory, Berkeley, CA
LLNL*    Lawrence Livermore National Laboratory, Livermore, CA
MRC      Medical Research Council (U.K.)
NAS      National Academy of Sciences (U.S.)
NCHGR    National Center for Human Genome Research
NIH      National Institutes of Health, Bethesda, MD
NLGLP*   National Laboratory Gene Library Project (LANL, LLNL)
NRC      National Research Council (NAS)
NSF      National Science Foundation
OHER*    Office of Health and Environmental Research
ORNL*    Oak Ridge National Laboratory, Oak Ridge, TN
OSTP     Office of Science and Technology Policy (White House)
OTA      Office of Technology Assessment (U.S. Congress)
PACHG    Program Advisory Committee on the Human Genome
PNL*     Pacific Northwest Laboratory, Richland, WA
SBIR     Small Business Innovation Research
SCC      Scientific Coordinating Committee
TWAS     Third World Academy of Sciences
UNESCO   United Nations Educational, Scientific, and Cultural Organization
USDA     U.S. Department of Agriculture

*Denotes U.S. Department of Energy organizations.

Figure and Photograph Captions

This drawing by Leonardo da Vinci symbolizes the quest for knowledge through exploration of the unknown. In his art, Leonardo concentrated on illustrating fundamental rules governing the physical world to reveal the unity underlying the diversity of nature. Just as the Renaissance brought broadened intellectual horizons and rapid advances in the natural sciences and technology, so will the 21st century, 500 years later, witness a revolution in many sciences as research unlocks the secrets of the molecular structure governing the human body, one of nature's masterpieces.

Fig. 1. The Human Genome at Four Levels of Detail. Apart from reproductive cells (gametes) and mature red blood cells, every cell in the human body contains 23 pairs of chromosomes, each a packet of compressed and entwined DNA (1, 2). Each strand of DNA consists of repeating nucleotide units composed of a phosphate group, a sugar (deoxyribose), and a base (guanine, cytosine, thymine, or adenine) (3). Ordinarily, DNA takes the form of a highly regular double-stranded helix, the strands of which are linked by hydrogen bonds between guanine and cytosine and between thymine and adenine. Each such linkage is a base pair (bp); some 3 billion bp constitute the human genome. The specificity of these base-pair linkages underlies the mechanism of DNA replication illustrated here. Each strand of the double helix serves as a template for the synthesis of a new strand; the nucleotide sequence (i.e., linear order of bases) of each strand is strictly determined.
Each new double helix is a twin, an exact replica, of its parent. (Figure and caption text provided by the LBL Human Genome Center.)

Fig. 2. DNA Structure. The four nitrogenous bases of DNA are arranged along the sugar-phosphate backbone in a particular order (the DNA sequence), encoding all genetic instructions for an organism. Adenine (A) pairs with thymine (T), while cytosine (C) pairs with guanine (G). The two DNA strands are held together by weak bonds between the bases. A gene is a segment of a DNA molecule (ranging from fewer than 1 thousand bases to several million), located in a particular position on a specific chromosome, whose base sequence contains the information necessary for protein synthesis.

Fig. 3. Comparison of Largest Known DNA Sequence with Approximate Chromosome and Genome Sizes of Model Organisms and Humans. A major focus of the Human Genome Project is the development of sequencing schemes that are faster and more economical.

Comparative Sequence Sizes                                    Bases
Largest known continuous DNA sequence (yeast chromosome 3)    350 Thousand
Escherichia coli (bacterium) genome                           4.6 Million
Largest yeast chromosome now mapped                           5.8 Million
Entire yeast genome                                           15 Million
Smallest human chromosome (Y)                                 50 Million
Largest human chromosome (1)                                  250 Million
Entire human genome                                           3 Billion

Fig. 4. DNA Replication. During replication the DNA molecule unwinds, with each single strand becoming a template for synthesis of a new, complementary strand. Each daughter molecule, consisting of one old and one new DNA strand, is an exact copy of the parent molecule. [Source: adapted from Mapping Our Genes - The Genome Projects: How Big, How Fast? U.S. Congress, Office of Technology Assessment, OTA-BA-373 (Washington, D.C.: U.S. Government Printing Office, 1988).]

Fig. 5. Gene Expression. When genes are expressed, the genetic information (base sequence) on DNA is first transcribed (copied) to a molecule of messenger RNA in a process similar to DNA replication.
The mRNA molecules then leave the cell nucleus and enter the cytoplasm, where triplets of bases (codons) forming the genetic code specify the particular amino acids that make up an individual protein. This process, called translation, is accomplished by ribosomes (cellular components composed of proteins and another class of RNA) that read the genetic code from the mRNA, and transfer RNAs (tRNAs) that transport amino acids to the ribosomes for attachment to the growing protein. (Source: see Fig. 4.)

Fig. 6. Karyotype. Microscopic examination of chromosome size and banding patterns allows medical laboratories to identify and arrange each of the 24 different chromosomes (22 pairs of autosomes and one pair of sex chromosomes) into a karyotype, which then serves as a tool in the diagnosis of genetic diseases. The extra copy of chromosome 21 in this karyotype identifies this individual as having Down's syndrome.

Fig. 7. Assignment of Genes to Specific Chromosomes. The number of genes assigned (mapped) to specific chromosomes has greatly increased since the first autosomal (i.e., not on the X or Y chromosome) marker was mapped in 1968. Most of these genes have been mapped to specific bands on chromosomes. The acceleration of chromosome assignments is due to (1) a combination of improved and new techniques in chromosome sorting and band analysis, (2) data from family studies, and (3) the introduction of recombinant DNA technology. [Source: adapted from Victor A. McKusick, "Current Trends in Mapping Human Genes," The FASEB Journal 5(1), 12 (1991).]

HUMAN GENOME PROJECT GOALS                     Resolution
* Complete a detailed human genetic map        2 Mb
* Complete a physical map                      0.1 Mb
* Acquire the genome as clones                 5 kb
* Determine the complete sequence              1 bp
* Find all the genes

With the data generated by the project, investigators will determine the functions of the genes and develop tools for biological and medical applications.

Fig. 8. Constructing a Genetic Linkage Map.
Genetic linkage maps of each chromosome are made by determining how frequently two markers are passed together from parent to child. Because genetic material is sometimes exchanged during the production of sperm and egg cells, groups of traits (or markers) originally together on one chromosome may not be inherited together. Closely linked markers are less likely to be separated by spontaneous chromosome rearrangements. In this diagram, the vertical lines represent chromosome 4 pairs for each individual in a family. The father has two traits that can be detected in any child who inherits them: a short known DNA sequence used as a genetic marker (M) and Huntington's disease (HD). The fact that one child received only a single trait (M) from that particular chromosome indicates that the father's genetic material recombined during the process of sperm production. The frequency of this event helps determine the distance between the two DNA sequences on a genetic map.

Fig. 9. Physical Mapping Strategies. Top-down physical mapping (a) produces maps with few gaps, but map resolution may not allow location of specific genes. Bottom-up strategies (b) generate extremely detailed maps of small areas but leave many gaps. A combination of both approaches is being used. [Source: Adapted from P. R. Billings et al., "New Techniques for Physical Mapping of the Human Genome," The FASEB Journal 5(1), 29 (1991).]

Fig. 10. Types of Genome Maps. At the coarsest resolution, the genetic map measures recombination frequency between linked markers (genes or polymorphisms). At the next resolution level, restriction fragments of 1 to 2 Mb can be separated and mapped. Ordered libraries of cosmids and YACs have insert sizes from 40 to 400 kb. The base sequence is the ultimate physical map. Chromosomal mapping (not shown) locates genetic sites in relation to bands on chromosomes (estimated resolution of 5 Mb); new in situ hybridization techniques can place loci 100 kb apart.
This direct strategy links the other four mapping approaches. [Source: see Fig. 9.] Fig. 11. Constructing Clones for Sequencing. Cloned DNA molecules must be made progressively smaller and the fragments subcloned into new vectors to obtain fragments small enough for use with current sequencing technology. Sequencing results are compiled to provide longer stretches of sequence across a chromosome. (Source: adapted from David A. Micklos and Greg A. Freyer, DNA Science, A First Course in Recombinant DNA Technology, Burlington, N.C.: Carolina Biological Supply Company, 1990.) DNA Amplification: Cloning (a) Cloning DNA in Plasmids. By fragmenting DNA of any origin (human, animal, or plant) and inserting it in the DNA of rapidly reproducing foreign cells, billions of copies of a single gene or DNA segment can be produced in a very short time. DNA to be cloned is inserted into a plasmid (a small, self-replicating circular molecule of DNA) that is separate from chromosomal DNA. When the recombinant plasmid is introduced into bacteria, the newly inserted segment will be replicated along with the rest of the plasmid. (b) Constructing an Overlapping Clone Library. A collection of clones of chromosomal DNA, called a library, has no obvious order indicating the original positions of the cloned pieces on the uncut chromosome. To establish that two particular clones are adjacent to each other in the genome, libraries of clones containing partly overlapping regions must be constructed. These clone libraries are ordered by dividing the inserts into smaller fragments and determining which clones share common DNA sequences. Fig. 12. DNA Sequencing. Dideoxy sequencing (also called chain-termination or Sanger method) uses an enzymatic procedure to synthesize DNA chains of varying lengths, stopping DNA replication at one of the four bases and then determining the resulting fragment lengths. 
Each sequencing reaction tube (T, C, G, and A) in the diagram contains
* a DNA template, a primer sequence, and a DNA polymerase to initiate synthesis of a new strand of DNA at the point where the primer is hybridized to the template;
* the four deoxynucleotide triphosphates (dATP, dTTP, dCTP, and dGTP) to extend the DNA strand;
* one labeled deoxynucleotide triphosphate (using a radioactive element or dye); and
* one dideoxynucleotide triphosphate, which terminates the growing chain wherever it is incorporated. Tube A has didATP, tube C has didCTP, etc.
For example, in the A reaction tube the ratio of the dATP to didATP is adjusted so that the tube will have a collection of DNA fragments with a didATP incorporated for each adenine position on the template DNA fragments. The fragments of varying length are then separated by electrophoresis (1) and the positions of the nucleotides analyzed to determine sequence. The fragments are separated on the basis of size, with the shorter fragments moving faster and appearing at the bottom of the gel. Sequence is read from bottom to top (2). (Source: see Fig. 11.)

Fig. 13. Cloning a Disease Gene by Chromosome Walking. After a marker is linked to within 1 cM of a disease gene, chromosome walking can be used to clone the disease gene itself. A probe is first constructed from a genomic fragment identified from a library as being the closest linked marker to the gene. A restriction fragment isolated from the end of the clone near the disease locus is used to reprobe the genomic library for an overlapping clone. This process is repeated several times to walk across the chromosome and reach the flanking marker on the other side of the disease-gene locus. (Source: see Fig. 11.)

HUMAN GENETIC DIVERSITY: The Ultimate Human Genetic Database
* Any two individuals differ in about 3 x 10^6 bases (0.1%).
* The population is now about 5 x 10^9.
* A catalog of all sequence differences would require 15 x 10^15 entries.
* This catalog may be needed to find the rarest or most complex disease genes. Fig. 14. Magnitude of Genome Data. If the DNA sequence of the human genome were compiled in books, the equivalent of 200 volumes the size of a Manhattan telephone book (at 1000 pages each) would be needed to hold it all. New data-analysis tools will be needed for understanding the information from genome maps and sequences. Fig. 15. Understanding Gene Function. Understanding how genes function will require analyses of the 3-D structures of the proteins for which the genes code.
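The figures quoted in the Human Genetic Diversity box and in Fig. 14 can be checked with simple arithmetic. The sketch below reproduces them from the stated inputs (a 3-billion-base genome, 0.1% difference between individuals, a population of 5 billion, and 200 volumes of 1000 pages each). The constant and function names are hypothetical, and the bases-per-page value is derived here from those inputs; it is not stated in the captions.

```python
GENOME_BASES = 3 * 10**9       # approximate size of the human genome
DIFFERING_BASES = 3 * 10**6    # bases differing between any two individuals
POPULATION = 5 * 10**9         # world population quoted in the box
VOLUMES = 200                  # telephone-book-sized volumes (Fig. 14)
PAGES_PER_VOLUME = 1000


def difference_fraction(differing: int, genome: int) -> float:
    """Fraction of the genome that differs between two individuals."""
    return differing / genome


def catalog_entries(differing: int, population: int) -> int:
    """Entries needed to list every difference for every individual."""
    return differing * population


def bases_per_page(genome: int, volumes: int, pages: int) -> float:
    """Bases that each page must hold to fit the genome in the given volumes."""
    return genome / (volumes * pages)


if __name__ == "__main__":
    print(difference_fraction(DIFFERING_BASES, GENOME_BASES))
    print(catalog_entries(DIFFERING_BASES, POPULATION))
    print(bases_per_page(GENOME_BASES, VOLUMES, PAGES_PER_VOLUME))
```

The product 3 x 10^6 times 5 x 10^9 equals 15 x 10^15, matching the catalog size given in the box, and the 200-volume estimate implies about 15,000 bases printed on each page.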

