molecular cell biology lab troubleshooting
CiteULike: Tag bioinformatics
  • What are decision trees?
    Nature Biotechnology, Vol. 26, No. 9., pp. 1011-1013.
  • Bayesian methods in bioinformatics and computational systems biology.
    Brief Bioinform (12 April 2007)

    Bayesian methods are valuable, inter alia, whenever there is a need to extract information from data that are uncertain or subject to any kind of error or noise (including measurement error and experimental error, as well as noise or random variation intrinsic to the process of interest). Bayesian methods offer a number of advantages over more conventional statistical techniques that make them particularly appropriate for complex data. It is therefore no surprise that Bayesian methods are becoming more widely used in the fields of genetics, genomics, bioinformatics and computational systems biology, where making sense of complex noisy data is the norm. This review provides an introduction to the growing literature in this area, with particular emphasis on recent developments in Bayesian bioinformatics relevant to computational systems biology.
  • Evolving research trends in bioinformatics.
    Brief Bioinform (31 October 2006)

    The cross-disciplinary nature of bioinformatics entails co-evolution with other biomedical disciplines, whereby some bioinformatics applications become popular in certain disciplines and, in turn, these disciplines influence the focus of future bioinformatics development efforts. We observe here that the growth of computational approaches within various biomedical disciplines is not merely a reflection of a general extended usage of computers and the Internet, but due to the production of useful bioinformatics databases and methods for the rest of the biomedical scientific community. We have used the abstracts stored both in the MEDLINE database of biomedical literature and in NIH-funded project grants, to quantify two effects. First, we examine the biomedical literature as a whole and find that the use of computational methods has become increasingly prevalent across biomedical disciplines over the past three decades, while use of databases and the Internet have been rapidly increasing over the past decade. Second, we study the recent trends in the use of bioinformatics topics. We observe that molecular sequence databases are a widely adopted contribution in biomedicine from the field of bioinformatics, and that microarray analysis is one of the major new topics engaged by the bioinformatics community. Via this analysis, we were able to identify areas of rapid growth in the use of informatics to aid in curriculum planning, development of computational infrastructure and strategies for workforce education and funding.
  • Gene selection in microarray data: the elephant, the blind men and our algorithms.
    Curr Opin Struct Biol, Vol. 13, No. 3. (June 2003), pp. 370-376.

    Gene expression array data provide shadows of intricate cellular processes. Learning how to make the most of the information present in expression arrays has become a discipline in itself. In recent years, there has been an explosion of methods that analyze gene expression arrays to produce long lists of genes that express differentially in distinct cellular states. These lists will have to be organized, and the algorithms that produced them combined, if we wish to piece together the rich cellular structures probed by this high-throughput technology. Researchers will have to understand the benefits and limitations of the many existing methods to produce the combination of algorithms that best suits their gene expression experiments.
  • TIGRFAMs: a protein family resource for the functional identification of proteins.
    Nucleic Acids Res, Vol. 29, No. 1. (1 January 2001), pp. 41-43.

    TIGRFAMs is a collection of protein families featuring curated multiple sequence alignments, hidden Markov models and associated information designed to support the automated functional identification of proteins by sequence homology. We introduce the term 'equivalog' to describe members of a set of homologous proteins that are conserved with respect to function since their last common ancestor. Related proteins are grouped into equivalog families where possible, and otherwise into protein families with other hierarchically defined homology types. TIGRFAMs currently contains over 800 protein families, available for searching or downloading at www.tigr.org/TIGRFAMs. Classification by equivalog family, where achievable, complements classification by orthology, superfamily, domain or motif. It provides the information best suited for automatic assignment of specific functions to proteins from large-scale genome sequencing projects.
  • Candidate gene approach for pharmacogenetic studies.
    Pharmacogenomics, Vol. 3, No. 1. (January 2002), pp. 47-56.

    Genetic diversity in the form of single nucleotide DNA polymorphisms (SNPs) contributes to variable disease susceptibility and drug response. The candidate gene approach has been widely used to identify the genetic basis for pharmacogenetic traits and becomes increasingly more powerful with the recent advances in genomic technologies. High-throughput sequencing and SNP genotyping technologies allow the study of thousands of candidate genes and the identification of those involved in drug efficacy and toxicity. Expression-based genomic technologies such as DNA microarrays and proteomics also facilitate the understanding of important biological and pharmacological pathways, thus identifying more candidate genes for SNP studies. Candidate gene-based pharmacogenetic studies will lead to improved drug development, improved clinical trial design and therapeutics tailored to individual genotypes.
  • Genome-wide operon prediction in Staphylococcus aureus.
    Nucleic Acids Res, Vol. 32, No. 12. (2004), pp. 3689-3702.

    Identification of operon structure is critical to understanding gene regulation and function, and pathogenesis, and for identifying targets towards the development of new antibiotics in bacteria. Recently, the complete genome sequences of a large number of important human bacterial pathogens have become available for computational analysis, including the major human Gram-positive pathogen Staphylococcus aureus. By annotating the predicted operon structure of the S.aureus genome, we hope to facilitate the exploration of the unique biology of this organism as well as the comparative genomics across a broad range of bacteria. We have integrated several operon prediction methods and developed a consensus approach to score the likelihood of each adjacent gene pair to be co-transcribed. Gene pairs were separated into distinct operons when scores were equal to or below an empirical threshold. Using this approach, we have generated a S.aureus genome map with scores annotated at the intersections of every adjacent gene pair. This approach predicted about 864 monocistronic transcripts and 533 polycistronic operons from the protein-encoding genes in the S.aureus strain Mu50 genome. When compared with a set of experimentally determined S.aureus operons from literature sources, this method successfully predicted at least 91% of gene pairs. At the transcription unit level, this approach correctly identified at least 92% of complete operons in this dataset. This consensus approach has enabled us to predict operons with high accuracy from a genome where limited experimental evidence for operon structure is available.
  • Tools and resources for identifying protein families, domains and motifs.
    Genome Biol, Vol. 3, No. 1. (2002)

    With the large influx of raw sequence data from genome sequencing projects, there is a need for reliable automatic methods for protein sequence analysis and classification. The most useful tools use various methods for identifying motifs or domains found in previously characterized protein families. This article reviews the tools and resources available on the web for identifying signatures within proteins and discusses how they may be used in the analysis of new or unknown protein sequences.
  • Navigating gene expression using microarrays--a technology review.
    Nat Cell Biol, Vol. 3, No. 8. (August 2001)

    Parallel quantification of large numbers of messenger RNA transcripts using microarray technology promises to provide detailed insight into cellular processes involved in the regulation of gene expression. This should allow new understanding of signalling networks that operate in the cell and of the molecular basis and classification of disease. But can the technology deliver such far-reaching promises?
  • Connecting the dots between genes, biochemistry, and disease susceptibility: systems biology modeling in human genetics
    Molecular Genetics and Metabolism, Vol. 84, No. 2. (February 2005), pp. 104-111.

    Understanding how DNA sequence variations impact human health through a hierarchy of biochemical and physiological systems is expected to improve the diagnosis, prevention, and treatment of common, complex human diseases. We have previously developed a hierarchical dynamic systems approach based on Petri nets for generating biochemical network models that are consistent with genetic models of disease susceptibility. This modeling approach uses an evolutionary computation approach called grammatical evolution as a search strategy for optimal Petri net models. We have previously demonstrated that this approach routinely identifies biochemical network models that are consistent with a variety of genetic models in which disease susceptibility is determined by nonlinear interactions between two or more DNA sequence variations. We review here this approach and then discuss how it can be used to model biochemical and metabolic data in the context of genetic studies of human disease susceptibility.
  • Automatic annotation of protein motif function with Gene Ontology terms.
    BMC Bioinformatics, Vol. 5, No. 1. (2 September 2004)

    BACKGROUND: Conserved protein sequence motifs are short stretches of amino acid sequence patterns that potentially encode the function of proteins. Several sequence pattern searching algorithms and programs exist for identifying candidate protein motifs at the whole genome level. However, a much needed and important task is to determine the functions of the newly identified protein motifs. The Gene Ontology (GO) project is an endeavor to annotate the function of genes or protein sequences with terms from a dynamic, controlled vocabulary and these annotations serve well as a knowledge base. RESULTS: This paper presents methods to mine the GO knowledge base and use the association between the GO terms assigned to a sequence and the motifs matched by the same sequence as evidence for predicting the functions of novel protein motifs automatically. The task of assigning GO terms to protein motifs is viewed as both a binary classification and information retrieval problem, where PROSITE motifs are used as samples for model training and functional prediction. The mutual information of a motif and a GO term association is found to be a very useful feature. We take advantage of the known motifs to train a logistic regression classifier, which allows us to combine mutual information with other frequency-based features and obtain a probability of correct association. The trained logistic regression model has intuitively meaningful and logically plausible parameter values, and performs very well empirically according to our evaluation criteria. CONCLUSIONS: In this research, different methods for automatic annotation of protein motifs have been investigated. Empirical results demonstrated that the methods have a great potential for detecting and augmenting information about the functions of newly discovered candidate protein motifs.
  • tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence.
    Nucleic Acids Res, Vol. 25, No. 5. (1 March 1997), pp. 955-964.

    We describe a program, tRNAscan-SE, which identifies 99-100% of transfer RNA genes in DNA sequence while giving less than one false positive per 15 gigabases. Two previously described tRNA detection programs are used as fast, first-pass prefilters to identify candidate tRNAs, which are then analyzed by a highly selective tRNA covariance model. This work represents a practical application of RNA covariance models, which are general, probabilistic secondary structure profiles based on stochastic context-free grammars. tRNAscan-SE searches at approximately 30 000 bp/s. Additional extensions to tRNAscan-SE detect unusual tRNA homologues such as selenocysteine tRNAs, tRNA-derived repetitive elements and tRNA pseudogenes.
  • The Pfam protein families database.
    Nucleic Acids Res, Vol. 30, No. 1. (1 January 2002), pp. 276-280.

    Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is available on the World Wide Web in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgb.ki.se/Pfam/, in France at http://pfam.jouy.inra.fr/ and in the US at http://pfam.wustl.edu/. The latest version (6.6) of Pfam contains 3071 families, which match 69% of proteins in SWISS-PROT 39 and TrEMBL 14. Structural data, where available, have been utilised to ensure that Pfam families correspond with structural domains, and to improve domain-based annotation. Predictions of non-domain regions are now also included. In addition to secondary structure, Pfam multiple sequence alignments now contain active site residue mark-up. New search tools, including taxonomy search and domain query, greatly add to the functionality and usability of the Pfam resource.
  • Diagnostic and prognostic prediction using gene expression profiles in high-dimensional microarray data.
    Br J Cancer, Vol. 89, No. 9. (3 November 2003), pp. 1599-1604.

    DNA microarrays are a potentially powerful technology for improving diagnostic classification, treatment selection and therapeutics development. There are, however, many potential pitfalls in the use of microarrays that result in false leads and erroneous conclusions. This paper provides a review of the key features to be observed in developing diagnostic and prognostic classification systems based on gene expression profiling and some of the pitfalls to be aware of in reading reports of microarray-based studies.
  • The gene ontology categorizer.
    Bioinformatics, Vol. 20 Suppl 1 (4 August 2004)

    SUMMARY: The Gene Ontology Categorizer, developed jointly by the Los Alamos National Laboratory and Procter & Gamble Corp., provides a capability for the categorization task in the Gene Ontology (GO): given a list of genes of interest, what are the best nodes of the GO to summarize or categorize that list? The motivating question is from a drug discovery process, where after some gene expression analysis experiment, we wish to understand the overall effect of some cell treatment or condition by identifying 'where' in the GO the differentially expressed genes fall: 'clustered' together in one place? in two places? uniformly spread throughout the GO? 'high', or 'low'? In order to address this need, we view bio-ontologies more as combinatorially structured databases than facilities for logical inference, and draw on the discrete mathematics of finite partially ordered sets (posets) to develop data representation and algorithms appropriate for the GO. In doing so, we have laid the foundations for a general set of methods to address not just the categorization task, but also other tasks (e.g. distances in ontologies and ontology merger and exchange) in both the GO and other bio-ontologies (such as the Enzyme Commission database or the MEdical Subject Headings) cast as hierarchically structured taxonomic knowledge systems.
  • Re-annotation of genome microbial coding-sequences: finding new genes and inaccurately annotated genes.
    BMC Bioinformatics, Vol. 3, No. 1. (2002)

    BACKGROUND: Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A major reason for this situation is that genes are tightly packed in prokaryotes, resulting in frequent overlap. Thus, detection of translation initiation sites and/or selection of the correct coding regions remain difficult unless appropriate biological knowledge (about the structure of a gene) is imbedded in the approach. RESULTS: We have developed a new program that automatically identifies biologically significant candidate genes in a bacterial genome. Twenty-six complete prokaryotic genomes were analyzed using this tool, and the accuracy of gene finding was assessed by comparison with existing annotations. This analysis revealed that, despite the enormous effort of genome program annotators, a small but not negligible number of genes annotated within the framework of sequencing projects are likely to be partially inaccurate or plainly wrong. Moreover, the analysis of several putative new genes shows that, as expected, many short genes have escaped annotation. In most cases, these new genes revealed frameshifts that could be either artifacts or genuine frameshifts. Some entirely unexpected new genes have also been identified. This allowed us to get a more complete picture of prokaryotic genomes. The results of this procedure are progressively integrated into the SWISS-PROT reference databank. CONCLUSIONS: The results described in the present study show that our procedure is very satisfactory in terms of gene finding accuracy. Except in few cases, discrepancies between our results and annotations provided by individual authors can be accounted for by the nature of each annotation process or by specific characteristics of some genomes. This stresses that close cooperation between scientists, regular update and curation of the findings in databases are clearly required to reduce the level of errors in genome annotation (and hence in reducing the unfortunate spreading of errors through centralized data libraries).
  • Functional genomics and proteomics--the role of nuclear medicine.
    Eur J Nucl Med Mol Imaging, Vol. 29, No. 1. (January 2002), pp. 115-132.

    Now that the sequencing of the human genome has been completed, the basic challenges are finding the genes, locating their coding regions and predicting their functions. This will result in a new understanding of human biology as well as in the design of new molecular structures as potential novel diagnostic or drug discovery targets. The assessment of gene function may be performed using the tools of the genome program. These tools represent high-throughput methods used to evaluate changes in the expression of many or all genes of an organism at the same time in order to investigate genetic pathways for normal development and disease. This will lead to a shift in the scientific paradigm: In the pre-proteomics era, functional assignments were derived from hypothesis-driven experiments designed to understand specific cellular processes. The new tools describe proteins on a proteome-wide scale, thereby creating a new way of doing cell research which results in the determination of three-dimensional protein structures and the description of protein networks. These descriptions may then be used for the design of new hypotheses and experiments in the traditional physiological, biochemical and pharmacological sense. The evaluation of genetically manipulated animals or newly designed biomolecules will require a thorough understanding of physiology, biochemistry and pharmacology and the experimental approaches will involve many new technologies, including in vivo imaging with single-photon emission tomography and positron emission tomography. Nuclear medicine procedures may be applied for the determination of gene function and regulation using established and new tracers or using in vivo reporter genes such as enzymes, receptors, antigens or transporters. Pharmacogenomics will identify new surrogate markers for therapy monitoring which may represent potential new tracers for imaging. Also, drug distribution studies for new therapeutic biomolecules are needed, at least during preclinical stages of drug development. Finally, new biomolecules will be developed by bioengineering methods which may be used for isotope-based diagnosis and treatment of disease.
  • Bioinformatics and genomic medicine.
    Genet Med, Vol. 4, No. 6 Suppl. (2002)

    Bioinformatics is a rapidly emerging field of biomedical research. A flood of large-scale genomic and postgenomic data means that many of the challenges in biomedical research are now challenges in computational science. Clinical informatics has long developed methodologies to improve biomedical research and clinical care by integrating experimental and clinical information systems. The informatics revolution in both bioinformatics and clinical informatics will eventually change the current practice of medicine, including diagnostics, therapeutics, and prognostics. Postgenome informatics, powered by high-throughput technologies and genomic-scale databases, is likely to transform our biomedical understanding forever, in much the same way that biochemistry did a generation ago. This paper describes how these technologies will impact biomedical research and clinical care, emphasizing recent advances in biochip-based functional genomics and proteomics. Basic data preprocessing with normalization and filtering, primary pattern analysis, and machine-learning algorithms are discussed. Use of integrative biochip informatics technologies, including multivariate data projection, gene-metabolic pathway mapping, automated biomolecular annotation, text mining of factual and literature databases, and the integrated management of biomolecular databases, are also discussed.
  • Overview of commonly used bioinformatics methods and their applications.
    Ann N Y Acad Sci, Vol. 1020 (May 2004), pp. 10-21.

    Bioinformatics, in its broad sense, involves application of computer processes to solve biological problems. A wide range of computational tools are needed to effectively and efficiently process large amounts of data being generated as a result of recent technological innovations in biology and medicine. A number of computational tools have been developed or adapted to deal with the experimental riches of complex and multivariate data and transition from data collection to information or knowledge. These include a wide variety of clustering and classification algorithms, including self-organized maps (SOM), artificial neural networks (ANN), support vector machines (SVM), fuzzy logic, and even hyphenated techniques such as neuro-fuzzy networks. These bioinformatics tools are being evaluated and applied in various medical areas including early detection, risk assessment, classification, and prognosis of cancer. The goal of these efforts is to develop and identify bioinformatics methods with optimal sensitivity, specificity, and predictive capabilities.
  • Architectures for Java-based bioinformatics applications
    Industrial Management & Data Systems, Vol. 104, No. 7. (1 July 2004), pp. 578-588.

    Bioinformatics projects are currently under way at numerous universities and in industry. These projects typically involve processing large amounts of biological data and comparison of biological signals or sequences. Much of the existing work in bioinformatics software is based on such languages and platforms as Perl and Unix. This paper proposes software architectures in Java to support biological applications, allowing access to biological data using server-side Java programs on the Internet. The architecture follows the standards of unified modeling language (UML). UML architecture diagrams are presented for the Java-based bioinformatics applications. In addition, an overview of the Bio-Soft project under way at The Biomedical Research Institute (BRI) of the University of Wisconsin-Parkside is provided, which includes research and instructional software for bioinformatics applications.
  • Microbial gene identification using interpolated Markov models.
    Nucleic Acids Res, Vol. 26, No. 2. (15 January 1998), pp. 544-548.

    This paper describes a new system, GLIMMER, for finding genes in microbial genomes. In a series of tests on Haemophilus influenzae, Helicobacter pylori and other complete microbial genomes, this system has proven to be very accurate at locating virtually all the genes in these sequences, outperforming previous methods. A conservative estimate based on experiments on H. pylori and H. influenzae is that the system finds >97% of all genes. GLIMMER uses interpolated Markov models (IMMs) as a framework for capturing dependencies between nearby nucleotides in a DNA sequence. An IMM-based method makes predictions based on a variable context; i.e., a variable-length oligomer in a DNA sequence. The context used by GLIMMER changes depending on the local composition of the sequence. As a result, GLIMMER is more flexible and more powerful than fixed-order Markov methods, which have previously been the primary content-based technique for finding genes in microbial DNA.
  • On ontologies for biologists: the Gene Ontology--untangling the web.
    Novartis Found Symp, Vol. 247 (2002)

    The mantra of the 'post-genomic' era is 'gene function'. Yet surprisingly little attention has been given to how functional and other information concerning genes is to be captured, made accessible to biologists or structured in a computable form. The aim of the Gene Ontology (GO) Consortium is to provide a framework for both the description and the organisation of such information. The GO Consortium is presently concerned with three structured controlled vocabularies which can be used to describe three discrete biological domains, building structured vocabularies which can be used to describe the molecular function, biological roles and cellular locations of gene products.
  • The use and analysis of microarray data.
    Nat Rev Drug Discov, Vol. 1, No. 12. (December 2002), pp. 951-960.

    Functional genomics is the study of gene function through the parallel expression measurements of genomes, most commonly using the technologies of microarrays and serial analysis of gene expression. Microarray usage in drug discovery is expanding, and its applications include basic research and target discovery, biomarker determination, pharmacology, toxicogenomics, target selectivity, development of prognostic tests and disease-subclass determination. This article reviews the different ways to analyse large sets of microarray data, including the questions that can be asked and the challenges in interpreting the measurements.
  • A probabilistic view of gene function.
    Nat Genet, Vol. 36, No. 6. (June 2004), pp. 559-564.

    Cells are controlled by the complex and dynamic actions of thousands of genes. With the sequencing of many genomes, the key problem has shifted from identifying genes to knowing what the genes do; we need a framework for expressing that knowledge. Even the most rigorous attempts to construct ontological frameworks describing gene function (e.g., the Gene Ontology project) ultimately rely on manual curation and are thus labor-intensive and subjective. But an alternative exists: the field of functional genomics is piecing together networks of gene interactions, and although these data are currently incomplete and error-prone, they provide a glimpse of a new, probabilistic view of gene function. We outline such a framework, which revolves around a statistical description of gene interactions derived from large, systematically compiled data sets. In this probabilistic view, pleiotropy is implicit, all data have errors and the definition of gene function is an iterative process that ultimately converges on the correct functions. The relationships between the genes are defined by the data, not by hand. Even this comprehensive view fails to capture key aspects of gene function, not least their dynamics in time and space, showing that there are limitations to the model that must ultimately be addressed.
  • University bioinformatics programs on the rise.
    Nat Biotechnol, Vol. 19, No. 3. (March 2001), pp. 285-286.
  • The Pfam protein families database.
    Nucleic Acids Res, Vol. 32 Database issue (1 January 2004)

    Pfam is a large collection of protein families and domains. Over the past 2 years the number of families in Pfam has doubled and now stands at 6190 (version 10.0). Methodology improvements for searching the Pfam collection locally as well as via the web are described. Other recent innovations include modelling of discontinuous domains allowing Pfam domain definitions to be closer to those found in structure databases. Pfam is available on the web in the UK (http://www.sanger.ac.uk/Software/Pfam/), the USA (http://pfam.wustl.edu/), France (http://pfam.jouy.inra.fr/) and Sweden (http://Pfam.cgb.ki.se/).
  • Improved microbial gene identification with GLIMMER.
    Nucleic Acids Res, Vol. 27, No. 23. (1 December 1999), pp. 4636-4641.

    The GLIMMER system for microbial gene identification finds approximately 97-98% of all genes in a genome when compared with published annotation. This paper reports on two new results: (i) significant technical improvements to GLIMMER that improve its accuracy still further, and (ii) a comprehensive evaluation that demonstrates that the accuracy of the system is likely to be higher than previously recognized. A significant proportion of the genes missed by the system appear to be hypothetical proteins whose existence is only supported by the predictions of other programs. When the analysis is restricted to genes that have significant homology to genes in other organisms, GLIMMER misses <1% of known genes.
  • Intrinsic errors in genome annotation
    Trends in Genetics, Vol. 17, No. 8. (01 August 2001), pp. 429-431.

    Genome sequencing is usually followed by routine annotation of protein function based on the assumption that similar sequences will have similar functions. Here, we introduce a simple calculation to estimate the magnitude of any possible annotation errors. We counted the number of discrepancies in the annotation of well-established sets of similar proteins and extrapolated these values to the pairs of similar sequences used for the annotation of different microbial genomes. We conclude that the number of potential errors in the prediction of detailed functions is higher than is usually believed.
  • PromH: Promoters identification using orthologous genomic sequences.
    Nucleic Acids Res, Vol. 31, No. 13. (1 July 2003), pp. 3540-3545.

    Accurate prediction of promoters is fundamental for understanding gene expression patterns, cell specificity and development. In the studies of conserved features of regulatory regions of orthologous genes, it was observed that major promoter functional components such as transcription start points, TATA-boxes and regulatory motifs, are significantly more conservative than the sequences around them (70-100% compared with 30-50%). To improve promoter identification accuracy, we employed these findings in a new program, PromH, created by extending the TSSW program feature set. PromH uses linear discriminant functions that take into account conservation features and nucleotide sequences of promoter regions in pairs of orthologous genes. The program was tested on two sets of pairs of orthologous, mostly human and rodent, sequences with known transcription start sites (TSS), annotated to have TATA (21 genes, 11 orthologous pairs) and TATA-less (38 genes, 19 pairs) promoters, respectively. The program correctly predicted TSS for all 21 genes of the first set with a median deviation of 2 bp from true site location. Only for two genes was there a significant (46 and 105 bp) discrepancy between predicted and annotated TSS positions. For 38 TATA-less promoters from the second set, TSS was predicted for 27 genes, in 14 cases within 10 bp distance from annotated TSS, and in 21 cases within 100 bp distance. Despite more discrepancies between predicted and annotated TSS for genes from the second set, these results are consistent with observations of much higher occurrence of multiple TSS in TATA-less promoters. In any case, our results show that PromH identifies TSS positions significantly more accurately than any other published promoter prediction method. The PromH program is available at http://www.softberry.com/berry.phtml?topic=promh.
  • Single nucleotide polymorphisms (SNPs) that map to gaps in the human SNP map.
    Nucleic Acids Res, Vol. 31, No. 16. (15 August 2003), pp. 4910-4916.

    An international effort is underway to generate a comprehensive haplotype map (HapMap) of the human genome represented by an estimated 300,000 to 1 million 'tag' single nucleotide polymorphisms (SNPs). Our analysis indicates that the current human SNP map is not sufficiently dense to support the HapMap project. For example, 24.6% of the genome currently lacks SNPs at the minimal density and spacing that would be required to construct even a conservative tag SNP map containing 300,000 SNPs. In an effort to improve the human SNP map, we identified 140,696 additional SNP candidates using a new bioinformatics pipeline. Over 51,000 of these SNPs mapped to the largest gaps in the human SNP map, leading to significant improvements in these regions. Our SNPs will be immediately useful for the HapMap project, and will allow for the inclusion of many additional genomic intervals in the final HapMap. Nevertheless, our results also indicate that additional SNP discovery projects will be required both to define the haplotype architecture of the human genome and to construct comprehensive tag SNP maps that will be useful for genetic linkage studies in humans.
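    The abstract above does not state the density and spacing criterion it used, so the following is only a minimal sketch of how gaps in a SNP map might be located. The function names, the `max_spacing` threshold and the toy coordinates are all assumptions made for illustration, not the authors' pipeline.

```python
def find_gaps(positions, chrom_length, max_spacing):
    """Return (start, end) intervals where consecutive SNPs are farther apart
    than max_spacing. The chromosome start (0) and end (chrom_length) are
    treated as boundaries. max_spacing is a hypothetical threshold."""
    gaps = []
    previous = 0
    for pos in sorted(positions) + [chrom_length]:
        if pos - previous > max_spacing:
            gaps.append((previous, pos))
        previous = pos
    return gaps


def gap_fraction(gaps, chrom_length):
    """Fraction of the chromosome that falls inside the reported gaps."""
    return sum(end - start for start, end in gaps) / chrom_length


# Toy example with made-up coordinates.
snps = [100, 200, 5000, 5100]
gaps = find_gaps(snps, chrom_length=10000, max_spacing=1000)
print(gaps, gap_fraction(gaps, 10000))
```

    Newly discovered SNP candidates could then be filtered to those whose positions fall inside the returned intervals.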
  • The past, present and future of genome-wide re-annotation.
    Genome Biol, Vol. 3, No. 2. (2002)

    Annotation, the process by which structural or functional information is inferred for genes or proteins, is crucial for obtaining value from genome sequences. We define the process of annotating a previously annotated genome sequence as 're-annotation', and examine the strengths and weaknesses of current manual and automatic genome-wide re-annotation approaches.
  • EzCatDB: the Enzyme Catalytic-mechanism Database.
    Nucleic Acids Res, Vol. 33 Database Issue (1 January 2005)

    The EzCatDB (Enzyme Catalytic-mechanism Database) specifically includes catalytic mechanisms of enzymes in terms of sequences and tertiary structures of enzymes, and proposed catalytic mechanisms, along with ligand structures. The EzCatDB groups enzyme data in the Protein Data Bank (PDB) and the SWISS-PROT database with identical domain compositions, Enzyme Commission (EC) numbers and catalytic mechanisms. The EzCatDB can be queried by the type of catalytic residue, name and type of ligand molecule that interacts with an enzyme as a cofactor, substrate or product. It can provide literature information, other database codes and EC numbers. The EzCatDB provides ligand annotation for enzymes in the PDB as well as literature information on structure and catalytic mechanisms. Furthermore, the EzCatDB also provides a hierarchic classification of catalytic mechanisms. This classification incorporates catalytic mechanisms and active-site structures of enzymes as well as basic reactions and reactive parts of ligand molecules. The EzCatDB is available at http://mbs.cbrc.jp/EzCatDB/.
  • A novel method for prokaryotic promoter prediction based on DNA stability.
    BMC Bioinformatics, Vol. 6, No. 1. (5 January 2005)

    BACKGROUND: In the post-genomic era, correct gene prediction has become one of the biggest challenges in genome annotation. Improved promoter prediction methods can be one step towards developing more reliable ab initio gene prediction methods. This work presents a novel prokaryotic promoter prediction method based on DNA stability. RESULTS: The promoter region is less stable and hence more prone to melting as compared to other genomic regions. Our analysis shows that a method of promoter prediction based on the differences in the stability of DNA sequences in the promoter and non-promoter region works much better compared to existing prokaryotic promoter prediction programs, which are based on sequence motif searches. At present the method works optimally for genomes such as that of Escherichia coli, which have near 50 % G+C composition and also performs satisfactorily in case of other prokaryotic promoters. CONCLUSIONS: Our analysis clearly shows that the change in stability of DNA seems to provide a much better clue than usual sequence motifs, such as Pribnow box and -35 sequence, for differentiating promoter region from non-promoter regions. To a certain extent, it is more general and is likely to be applicable across organisms. Hence incorporation of such features in addition to the signature motifs can greatly improve the presently available promoter prediction programs.
  • Experiments using microarray technology: limitations and standard operating procedures.
    J Endocrinol, Vol. 178, No. 2. (August 2003), pp. 195-204.

    Microarrays are a powerful method for the global analysis of gene or protein content and expression, opening up new horizons in molecular and physiological systems. This review focuses on the critical aspects of acquiring meaningful data for analysis following fluorescence-based target hybridisation to arrays. Although microarray technology is adaptable to the analysis of a range of biomolecules (DNA, RNA, protein, carbohydrates and lipids), the scheme presented here is applicable primarily to customised DNA arrays fabricated using long oligomer or cDNA probes. Rather than provide a comprehensive review of microarray technology and analysis techniques, both of which are large and complex areas, the aim of this paper is to provide a restricted overview, highlighting salient features to provide initial guidance in terms of pitfalls in planning and executing array projects. We outline standard operating procedures, which help streamline the analysis of microarray data resulting from a diversity of array formats and biological systems. We hope that this overview will provide practical initial guidance for those embarking on microarray studies.
  • The PROSITE database, its status in 2002.
    Nucleic Acids Res, Vol. 30, No. 1. (1 January 2002), pp. 235-238.

    PROSITE [Bairoch and Bucher (1994) Nucleic Acids Res., 22, 3583-3589; Hofmann et al. (1999) Nucleic Acids Res., 27, 215-219] is a method of identifying the functions of uncharacterized proteins translated from genomic or cDNA sequences. The PROSITE database (http://www.expasy.org/prosite/) consists of biologically significant patterns and profiles designed in such a way that with appropriate computational tools it can rapidly and reliably help to determine to which known family of proteins (if any) a new sequence belongs, or which known domain(s) it contains.
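    To illustrate what a PROSITE-style pattern looks like in practice, here is a minimal sketch that translates the common pattern elements (`x` wildcards, `[...]` allowed residues, `{...}` forbidden residues, `(n)` and `(n,m)` repeats, `<` and `>` anchors) into a Python regular expression. The function name `prosite_to_regex` is hypothetical, and the sketch deliberately ignores rarer syntax such as anchors inside brackets. It does not handle profiles at all.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex string.

    Simplified sketch: elements are separated by '-', a trailing '.' is
    ignored, and only the common element types are supported.
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        repeat = ""
        if lo is not None:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
print(regex)
print(bool(re.search(regex, "MKAGVLLLLKR")))
```

    A production tool would also need to handle the full pattern syntax and score profile matches, which a plain regular expression cannot express.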
  • ExPASy: The proteomics server for in-depth protein knowledge and analysis.
    Nucleic Acids Res, Vol. 31, No. 13. (1 July 2003), pp. 3784-3788.

    The ExPASy (the Expert Protein Analysis System) World Wide Web server (http://www.expasy.org), is provided as a service to the life science community by a multidisciplinary team at the Swiss Institute of Bioinformatics (SIB). It provides access to a variety of databases and analytical tools dedicated to proteins and proteomics. ExPASy databases include SWISS-PROT and TrEMBL, SWISS-2DPAGE, PROSITE, ENZYME and the SWISS-MODEL repository. Analysis tools are available for specific tasks relevant to proteomics, similarity searches, pattern and profile searches, post-translational modification prediction, topology prediction, primary, secondary and tertiary structure analysis and sequence alignment. These databases and tools are tightly interlinked: a special emphasis is placed on integration of database entries with related resources developed at the SIB and elsewhere, and the proteomics tools have been designed to read the annotations in SWISS-PROT in order to enhance their predictions. ExPASy started to operate in 1993, as the first WWW server in the field of life sciences. In addition to the main site in Switzerland, seven mirror sites in different continents currently serve the user community.
  • The ENZYME data bank.
    Nucleic Acids Res, Vol. 22, No. 17. (September 1994), pp. 3626-3627.

    The ENZYME data bank is a repository of information relative to the nomenclature of enzymes. It is primarily based on the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB) and it contains the following data for each type of characterized enzyme for which an EC (Enzyme Commission) number has been provided: EC number; recommended name; alternative names (if any); catalytic activity; cofactors (if any); pointers to the SWISS-PROT protein sequence entries that correspond to the enzyme (if any); and pointers to human diseases associated with a deficiency of the enzyme (if any).
  • Post-analysis follow-up and validation of microarray experiments.
    Nat Genet, Vol. 32 Suppl (December 2002), pp. 509-514.

    Measurement of gene-expression profiles using microarray technology is becoming increasingly popular among the biomedical research community. Although there has been great progress in this field, investigators are still confronted with a difficult question after completing their experiments: how to validate the large data sets that are generated? This review summarizes current approaches to verifying global expression results, discusses the caveats that must be considered, and describes some methods that are being developed to address outstanding problems.
  • Recent improvements to the PROSITE database.
    Nucleic Acids Res, Vol. 32, No. Database issue. (1 January 2004)

    The PROSITE database consists of a large collection of biologically meaningful signatures that are described as patterns or profiles. Each signature is linked to documentation that provides useful biological information on the protein family, domain or functional site identified by the signature. The PROSITE web page has been redesigned and several tools have been implemented to help the user discover new conserved regions in their own proteins and to visualize domain arrangements. We also introduced the facility to search PDB with a PROSITE entry or a user's pattern and visualize matched positions on 3D structures. The latest version of PROSITE (release 18.17 of November 30, 2003) contains 1676 entries. The database is accessible at http://www.expasy.org/prosite/.
  • Having a BLAST with bioinformatics (and avoiding BLASTphemy).
    Genome Biol, Vol. 2, No. 10. (2001)

    Searching for similarities between biological sequences is the principal means by which bioinformatics contributes to our understanding of biology. Of the various informatics tools developed to accomplish this task, the most widely used is BLAST, the basic local alignment search tool. This article discusses the principles, workings, applications and potential pitfalls of BLAST, focusing on the implementation developed at the National Center for Biotechnology Information.
  • Multiple structural alignment by secondary structures: algorithm and applications.
    Protein Sci, Vol. 12, No. 11. (November 2003), pp. 2492-2507.

    We present MASS (Multiple Alignment by Secondary Structures), a novel highly efficient method for structural alignment of multiple protein molecules and detection of common structural motifs. MASS is based on a two-level alignment, using both secondary structure and atomic representation. Utilizing secondary structure information aids in filtering out noisy solutions and achieves efficiency and robustness. Currently, only a few methods are available for addressing the multiple structural alignment task. In addition to using secondary structure information, the advantage of MASS as compared to these methods is that it is a combination of several important characteristics: (1) While most existing methods are based on series of pairwise comparisons, and thus might miss optimal global solutions, MASS is truly multiple, considering all the molecules simultaneously; (2) MASS is sequence order-independent and thus capable of detecting nontopological structural motifs; (3) MASS is able to detect not only structural motifs, shared by all input molecules, but also motifs shared only by subsets of the molecules. Here, we show the application of MASS to various protein ensembles. We demonstrate its ability to handle a large number (order of tens) of molecules, to detect nontopological motifs and to find biologically meaningful alignments within nonpredefined subsets of the input. In particular, we show how by using conserved structural motifs, one can guide protein-protein docking, which is a notoriously difficult problem. MASS is freely available at http://bioinfo3d.cs.tau.ac.il/MASS/.
  • Advantages and limitations of microarray technology in human cancer.
    Oncogene, Vol. 22, No. 42. (29 September 2003), pp. 6497-6507.

    Cancer is a highly variable disease with multiple heterogeneous genetic and epigenetic changes. Functional studies are essential to understanding the complexity and polymorphisms of cancer. The final deciphering of the complete human genome, together with the improvement of high throughput technologies, is causing a fundamental transformation in cancer research. Microarray is a new powerful tool for studying the molecular basis of interactions on a scale that is impossible using conventional analysis. This technique makes it possible to examine the expression of thousands of genes simultaneously. This technology promises to lead to improvements in developing rational approaches to therapy as well as to improvements in cancer diagnosis and prognosis, assuring its entry into clinical practice in specialist centers and hospitals within the next few years. Predicting who will develop cancer and how this disease will behave and respond to therapy after diagnosis will be one of the potential benefits of this technology within the next decade. In this review, we highlight some of the recent developments and results in microarray technology in cancer research, discuss potentially problematic areas associated with it, describe the eventual use of microarray technology for clinical applications and comment on future trends and issues.
  • A multiple alignment algorithm for metabolic pathway analysis using enzyme hierarchy.
    Proc Int Conf Intell Syst Mol Biol, Vol. 8 (2000), pp. 376-383.

    In many of the chemical reactions in living cells, enzymes act as catalysts in the conversion of certain compounds (substrates) into other compounds (products). Comparative analyses of the metabolic pathways formed by such reactions give important information on their evolution and on pharmacological targets (Dandekar et al. 1999). Each of the enzymes that constitute a pathway is classified according to the EC (Enzyme Commission) numbering system, which consists of four sets of numbers that categorize the type of the chemical reaction catalyzed. In this study, we consider that reaction similarities can be expressed by the similarities between EC numbers of the respective enzymes. Therefore, in order to find a common pattern among pathways, it is desirable to be able to use the functional hierarchy of EC numbers to express the reaction similarities. In this paper, we propose a multiple alignment algorithm utilizing information content that is extended to symbols having a hierarchical structure. The effectiveness of our method is demonstrated by applying the method to pathway analyses of sugar, DNA and amino acid metabolisms.
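    The idea that reaction similarity can be read from the EC hierarchy can be sketched very simply. The example below scores two EC numbers by how many leading levels they share. This is a cruder stand-in for the information-content score the abstract mentions, and the function names are hypothetical.

```python
def ec_shared_levels(ec_a: str, ec_b: str) -> int:
    """Count how many leading levels two EC numbers have in common.

    EC numbers have four dot-separated levels (for example 2.7.1.1).
    A '-' placeholder is treated as unknown and stops the comparison.
    """
    shared = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b or a == "-":
            break
        shared += 1
    return shared


def ec_similarity(ec_a: str, ec_b: str) -> float:
    """Normalise the shared depth to the range 0..1 (four levels in total)."""
    return ec_shared_levels(ec_a, ec_b) / 4


print(ec_similarity("2.7.1.1", "2.7.1.2"))  # same sub-subclass
print(ec_similarity("2.7.1.1", "1.1.1.1"))  # different top-level class
```

    A score of this kind could serve as the match score between enzymes when aligning pathways, with the caveat that equal depth is weighted equally regardless of how populated each branch of the hierarchy is.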
  • Prediction of prokaryotic promoters based on prediction of transcriptional units.
    Sheng Wu Hua Xue Yu Sheng Wu Wu Li Xue Bao (Shanghai), Vol. 35, No. 4. (April 2003), pp. 317-324.

    Identification of promoters is very important for understanding gene regulatory relationships in an organism, and computational identification of promoters has been a long-standing problem in computational biology. A new method is presented to predict promoter regions in prokaryotic organisms. The method first predicts transcription units (TUs), dividing them into singlets, which contain only a single gene, and operons, which contain more than one gene. Based on these predicted TUs, a promoter is predicted for each TU using a hidden Markov model with explicit state duration density. Both the predicted TUs and the predicted promoters were satisfactory.
  • Incorporating structure to predict microRNA targets.
    Proc Natl Acad Sci U S A, Vol. 102, No. 11. (15 March 2005), pp. 4006-4009.

    MicroRNAs (miRNAs) are a recently discovered set of regulatory genes that constitute up to an estimated 1% of the total number of genes in animal genomes, including Caenorhabditis elegans, Drosophila, mouse, and humans [Lagos-Quintana, M., Rauhut, R., Lendeckel, W. & Tuschl, T. (2001) Science 294, 853-858; Lai, E. C., Tomancak, P., Williams, R. W. & Rubin, G.M. (2003) Genome Biol. 4, R42; Lau, N. C., Lim, L. P., Weinstein, E. G. & Bartel, D. P. (2001) Science 294, 858-862; Lee, R. C. & Ambros, V. (2001) Science 294, 862-864; and Lee, R. C., Feinbaum, R. L. & Ambros, V. (1993) Cell 115, 787-798]. In animals, miRNAs regulate genes by attenuating protein translation through imperfect base pair binding to 3' UTR sequences of target genes. A major challenge in understanding the regulatory role of miRNAs is to accurately predict regulated targets. We have developed an algorithm for predicting targets that does not rely on evolutionary conservation. As one of the features of this algorithm, we incorporate the folded structure of mRNA. By using Drosophila miRNAs as a test case, we have validated our predictions in 10 of 15 genes tested. One of these validated genes is mad as a target for bantam. Furthermore, our computational and experimental data suggest that miRNAs have fewer targets than previously reported.
  • The impact of next-generation sequencing technology on genetics
    Trends in Genetics, Vol. In Press, Corrected Proof

    If one accepts that the fundamental pursuit of genetics is to determine the genotypes that explain phenotypes, the meteoric increase of DNA sequence information applied toward that pursuit has nowhere to go but up. The recent introduction of instruments capable of producing millions of DNA sequence reads in a single run is rapidly changing the landscape of genetics, providing the ability to answer questions with heretofore unimaginable speed. These technologies will provide an inexpensive, genome-wide sequence readout as an endpoint to applications ranging from chromatin immunoprecipitation, mutation mapping and polymorphism discovery to noncoding RNA discovery. Here I survey next-generation sequencing technologies and consider how they can provide a more complete picture of how the genome shapes the organism.
  • Protein-protein interaction networks and biology—what's the connection?
    Nature Biotechnology, Vol. 26, No. 1., pp. 69-72.
  • Computational biology and high-performance computing
    Commun. ACM, Vol. 47, No. 11. (November 2004), pp. 34-41.
  • Characterization and prediction of protein-protein interactions within and between complexes.
    Proc Natl Acad Sci U S A (26 September 2006)

    Databases of experimentally determined protein interactions provide information on binary interactions and on involvement in multiprotein complexes. These data are valuable for understanding the general properties of the interaction between proteins as well as for the development of prediction schemes for unknown interactions. Here we analyze experimentally determined protein interactions by measuring various sequence, genomic, transcriptomic, and proteomic attributes of each interacting pair in the yeast Saccharomyces cerevisiae. We find that dividing the data into two groups, one that includes binary interactions within protein complexes (stable) and another that includes binary interactions that are not within complexes (transient), enables better characterization of the interactions by the different attributes and improves the prediction of new interactions. This analysis revealed that most attributes were more indicative in the set of intracomplex interactions. Using this data set for training, we integrated the different attributes by logistic regression and developed a predictive scheme that distinguishes between interacting and noninteracting protein pairs. Analysis of the logistic-regression model showed that one of the strongest contributors to the discrimination between interacting and noninteracting pairs is the presence of distinct pairs of domain signatures that were suggested previously to characterize interacting proteins. The predictive algorithm succeeds in identifying both intracomplex and other interactions (possibly the more stable ones), and its correct identification rate is 2-fold higher than that of large-scale yeast two-hybrid experiments.
  • The minimum information required for reporting a molecular interaction experiment (MIMIx)
    Nature Biotechnology, Vol. 25, No. 8. (08 August 2007), pp. 894-898.
CiteULike: Tag microarray
  • A High-Resolution Root Spatiotemporal Map Reveals Dominant Expression Patterns
    Science, Vol. 318, No. 5851. (2 November 2007), pp. 801-806.

    Transcriptional programs that regulate development are exquisitely controlled in space and time. Elucidating these programs that underlie development is essential to understanding the acquisition of cell and tissue identity. We present microarray expression profiles of a high-resolution set of developmental time points within a single Arabidopsis root and a comprehensive map of nearly all root cell types. These cell type-specific transcriptional signatures often predict previously unknown cellular functions. A computational pipeline identified dominant expression patterns that demonstrate transcriptional similarity between disparate cell types. Dominant expression patterns along the root's longitudinal axis do not strictly correlate with previously defined developmental zones, and in many cases, we observed expression fluctuation along this axis. Both robust co-regulation of gene expression and potential phasing of gene expression were identified between individual roots. Methods that combine these profiles demonstrate transcriptionally rich and complex programs that define Arabidopsis root development in both space and time. doi:10.1126/science.1146265
  • Domain-enhanced analysis of microarray data using GO annotations
    Bioinformatics, Vol. 23, No. 10. (15 May 2007), pp. 1225-1234.

    Motivation: New biological systems technologies give scientists the ability to measure thousands of bio-molecules including genes, proteins, lipids and metabolites. We use domain knowledge, e.g. the Gene Ontology, to guide analysis of such data. By focusing on domain-aggregated results at, say the molecular function level, increased interpretability is available to biological scientists beyond what is possible if results are presented at the gene level. Results: We use a 'top-down' approach to perform domain aggregation by first combining gene expressions before testing for differentially expressed patterns. This is in contrast to the more standard 'bottom-up' approach, where genes are first tested individually then aggregated by domain knowledge. The benefits are greater sensitivity for detecting signals. Our method, domain-enhanced analysis (DEA) is assessed and compared to other methods using simulation studies and analysis of two publicly available leukemia data sets. Availability: Our DEA method uses functions available in R (http://www.r-project.org/) and SAS (http://www.sas.com/). The two experimental data sets used in our analysis are available in R as Bioconductor packages, 'ALL' and 'golubEsets' (http://www.bioconductor.org/). Contact: jliu6@stat.ncsu.edu Supplementary information: Supplementary data are available at Bioinformatics online. doi:10.1093/bioinformatics/btm092
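    The aggregate-then-test ordering described above can be sketched as follows. This is not the authors' DEA implementation (which uses R and SAS). It is a minimal illustration that averages expression across the genes in each domain and then computes a Welch t statistic between two sample groups. The function names, the choice of the mean as the aggregate, and the toy data are all assumptions.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def top_down_scores(expr, groups, domains):
    """Aggregate first, then test.

    expr:    gene -> list of expression values, one per sample
    groups:  list of 0/1 labels, one per sample
    domains: domain term -> list of member genes
    """
    scores = {}
    for term, genes in domains.items():
        present = [g for g in genes if g in expr]
        if not present:
            continue
        # Step 1: combine the member genes into one value per sample.
        per_sample = [mean(expr[g][i] for g in present) for i in range(len(groups))]
        # Step 2: test the aggregated values between the two groups.
        a = [v for v, label in zip(per_sample, groups) if label == 0]
        b = [v for v, label in zip(per_sample, groups) if label == 1]
        scores[term] = welch_t(a, b)
    return scores


# Toy data: g1 and g2 shift together between groups, g3 does not.
expr = {
    "g1": [1.0, 1.2, 0.9, 2.0, 2.1, 1.9],
    "g2": [1.1, 0.9, 1.0, 2.2, 1.8, 2.0],
    "g3": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],
}
groups = [0, 0, 0, 1, 1, 1]
domains = {"term_A": ["g1", "g2"], "term_B": ["g3"]}
print(top_down_scores(expr, groups, domains))
```

    A bottom-up variant would instead call `welch_t` once per gene and combine the per-gene results within each domain afterwards.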
  • Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae
    Nucleic Acids Research, Vol. 35, No. 1. (January 2007), pp. 279-287.
  • Assessing the Significance of Conserved Genomic Aberrations Using High Resolution Genomic Microarrays
    PLoS Genetics, Vol. 3, No. 8. (1 August 2007), e143.

    Genomic aberrations recurrent in a particular cancer type can be important prognostic markers for tumor progression. Typically in early tumorigenesis, cells incur a breakdown of the DNA replication machinery that results in an accumulation of genomic aberrations in the form of duplications, deletions, translocations, and other genomic alterations. Microarray methods allow for finer mapping of these aberrations than has previously been possible; however, data processing and analysis methods have not taken full advantage of this higher resolution. Attention has primarily been given to analysis on the single sample level, where multiple adjacent probes are necessarily used as replicates for the local region containing their target sequences. However, regions of concordant aberration can be short enough to be detected by only one, or very few, array elements. We describe a method called Multiple Sample Analysis for assessing the significance of concordant genomic aberrations across multiple experiments that does not require a-priori definition of aberration calls for each sample. If there are multiple samples, representing a class, then by exploiting the replication across samples our method can detect concordant aberrations at much higher resolution than can be derived from current single sample approaches. Additionally, this method provides a meaningful approach to addressing population-based questions such as determining important regions for a cancer subtype of interest or determining regions of copy number variation in a population. Multiple Sample Analysis also provides single sample aberration calls in the locations of significant concordance, producing high resolution calls per sample, in concordant regions. 
    The approach is demonstrated on a dataset representing a challenging but important resource: breast tumors that have been formalin-fixed, paraffin-embedded, archived, and subsequently UV-laser capture microdissected and hybridized to two-channel BAC arrays using an amplification protocol. We demonstrate the accurate detection on simulated data, and on real datasets involving known regions of aberration within subtypes of breast cancer at a resolution consistent with that of the array. Similarly, we apply our method to previously published datasets, including a 250K SNP array, and verify known results as well as detect novel regions of concordant aberration. The algorithm has been fully implemented and tested and is freely available as a Java application at http://www.cbil.upenn.edu/MSA.
  • Comparing whole genomes using DNA microarrays
    Nat Rev Genet, Vol. 9, No. 4. (April 2008), pp. 291-302.
  • Ubiquitination screen using protein microarrays for comprehensive identification of Rsp5 substrates in yeast
    Mol Syst Biol, Vol. 3 (5 June 2007)
  • The use and analysis of microarray data.
    Nat Rev Drug Discov, Vol. 1, No. 12. (December 2002), pp. 951-960.

    Functional genomics is the study of gene function through the parallel expression measurements of genomes, most commonly using the technologies of microarrays and serial analysis of gene expression. Microarray usage in drug discovery is expanding, and its applications include basic research and target discovery, biomarker determination, pharmacology, toxicogenomics, target selectivity, development of prognostic tests and disease-subclass determination. This article reviews the different ways to analyse large sets of microarray data, including the questions that can be asked and the challenges in interpreting the measurements.
  • Post-analysis follow-up and validation of microarray experiments.
    Nat Genet, Vol. 32 Suppl (December 2002), pp. 509-514.

    Measurement of gene-expression profiles using microarray technology is becoming increasingly popular among the biomedical research community. Although there has been great progress in this field, investigators are still confronted with a difficult question after completing their experiments: how to validate the large data sets that are generated? This review summarizes current approaches to verifying global expression results, discusses the caveats that must be considered, and describes some methods that are being developed to address outstanding problems.
  • Experiments using microarray technology: limitations and standard operating procedures.
    J Endocrinol, Vol. 178, No. 2. (August 2003), pp. 195-204.

    Microarrays are a powerful method for the global analysis of gene or protein content and expression, opening up new horizons in molecular and physiological systems. This review focuses on the critical aspects of acquiring meaningful data for analysis following fluorescence-based target hybridisation to arrays. Although microarray technology is adaptable to the analysis of a range of biomolecules (DNA, RNA, protein, carbohydrates and lipids), the scheme presented here is applicable primarily to customised DNA arrays fabricated using long oligomer or cDNA probes. Rather than provide a comprehensive review of microarray technology and analysis techniques, both of which are large and complex areas, the aim of this paper is to provide a restricted overview, highlighting salient features to provide initial guidance in terms of pitfalls in planning and executing array projects. We outline standard operating procedures, which help streamline the analysis of microarray data resulting from a diversity of array formats and biological systems. We hope that this overview will provide practical initial guidance for those embarking on microarray studies.
  • Diagnostic and prognostic prediction using gene expression profiles in high-dimensional microarray data.
    Br J Cancer, Vol. 89, No. 9. (3 November 2003), pp. 1599-1604.

    DNA microarrays are a potentially powerful technology for improving diagnostic classification, treatment selection and therapeutics development. There are, however, many potential pitfalls in the use of microarrays that result in false leads and erroneous conclusions. This paper provides a review of the key features to be observed in developing diagnostic and prognostic classification systems based on gene expression profiling and some of the pitfalls to be aware of in reading reports of microarray-based studies.
  • Gene selection in microarray data: the elephant, the blind men and our algorithms.
    Curr Opin Struct Biol, Vol. 13, No. 3. (June 2003), pp. 370-376.

    Gene expression array data provide shadows of intricate cellular processes. Learning how to make the most of the information present in expression arrays has become a discipline in itself. In recent years, there has been an explosion of methods that analyze gene expression arrays to produce long lists of genes that express differentially in distinct cellular states. These lists will have to be organized, and the algorithms that produced them combined, if we wish to piece together the rich cellular structures probed by this high-throughput technology. Researchers will have to understand the benefits and limitations of the many existing methods to produce the combination of algorithms that best suits their gene expression experiments.
  • Advantages and limitations of microarray technology in human cancer.
    Oncogene, Vol. 22, No. 42. (29 September 2003), pp. 6497-6507.

    Cancer is a highly variable disease with multiple heterogeneous genetic and epigenetic changes. Functional studies are essential to understanding the complexity and polymorphisms of cancer. The final deciphering of the complete human genome, together with the improvement of high throughput technologies, is causing a fundamental transformation in cancer research. Microarray is a new powerful tool for studying the molecular basis of interactions on a scale that is impossible using conventional analysis. This technique makes it possible to examine the expression of thousands of genes simultaneously. This technology promises to lead to improvements in developing rational approaches to therapy as well as to improvements in cancer diagnosis and prognosis, assuring its entry into clinical practice in specialist centers and hospitals within the next few years. Predicting who will develop cancer and how this disease will behave and respond to therapy after diagnosis will be one of the potential benefits of this technology within the next decade. In this review, we highlight some of the recent developments and results in microarray technology in cancer research, discuss potentially problematic areas associated with it, describe the eventual use of microarray technology for clinical applications and comment on future trends and issues.
  • Navigating gene expression using microarrays--a technology review.
    Nat Cell Biol, Vol. 3, No. 8. (August 2001)

    Parallel quantification of large numbers of messenger RNA transcripts using microarray technology promises to provide detailed insight into cellular processes involved in the regulation of gene expression. This should allow new understanding of signalling networks that operate in the cell and of the molecular basis and classification of disease. But can the technology deliver such far-reaching promises?
  • AMDA: an R package for the automated microarray data analysis
    BMC Bioinformatics, Vol. 7 (06 July 2006), 335.
  • GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor.
    Bioinformatics (12 May 2007)

    Microarray technology has become a standard molecular biology tool. Experimental data have been generated on a huge number of organisms, tissue types, treatment conditions, and disease states. The Gene Expression Omnibus (Barrett et al., 2005), developed by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health, is a repository of nearly 140,000 gene expression experiments. The BioConductor project (Gentleman et al., 2004) is an open-source and open-development software project built in the R statistical programming environment (R Development Core Team, 2005) for the analysis and comprehension of genomic data. The tools contained in the BioConductor project represent many state-of-the-art methods for the analysis of microarray and genomics data. We have developed a software tool that allows access to the wealth of information within GEO directly from BioConductor, eliminating many of the formatting and parsing problems that have made such analyses labor-intensive in the past. The software, called GEOquery, effectively establishes a bridge between GEO and BioConductor. Easy access to GEO data from BioConductor will likely lead to new analyses of GEO data using novel and rigorous statistical and bioinformatic tools. Facilitating analyses and meta-analyses of microarray data will increase the efficiency with which biologically important conclusions can be drawn from published genomic data. AVAILABILITY: GEOquery is available as part of the BioConductor project.
  • Mining microarray data at NCBI's Gene Expression Omnibus (GEO)*.
    Methods Mol Biol, Vol. 338 (2006), pp. 175-190.

    The Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) has emerged as the leading fully public repository for gene expression data. This chapter describes how to use Web-based interfaces, applications, and graphics to effectively explore, visualize, and interpret the hundreds of microarray studies and millions of gene expression patterns stored in GEO. Data can be examined from both experiment-centric and gene-centric perspectives using user-friendly tools that do not require specialized expertise in microarray analysis or time-consuming download of massive data sets. The GEO database is publicly accessible through the World Wide Web at http://www.ncbi.nlm.nih.gov/geo.
  • MARS: microarray analysis, retrieval, and storage system.
    BMC Bioinformatics, Vol. 6 (2005)

    BACKGROUND: Microarray analysis has become a widely used technique for the study of gene-expression patterns on a genomic scale. As more and more laboratories are adopting microarray technology, there is a need for powerful and easy to use microarray databases facilitating array fabrication, labeling, hybridization, and data analysis. The wealth of data generated by this high throughput approach renders adequate database and analysis tools crucial for the pursuit of insights into the transcriptomic behavior of cells. RESULTS: MARS (Microarray Analysis and Retrieval System) provides a comprehensive MIAME supportive suite for storing, retrieving, and analyzing multi color microarray data. The system comprises a laboratory information management system (LIMS), a quality control management, as well as a sophisticated user management system. MARS is fully integrated into an analytical pipeline of microarray image analysis, normalization, gene expression clustering, and mapping of gene expression data onto biological pathways. The incorporation of ontologies and the use of MAGE-ML enables an export of studies stored in MARS to public repositories and other databases accepting these documents. CONCLUSION: We have developed an integrated system tailored to serve the specific needs of microarray based research projects using a unique fusion of Web based and standalone applications connected to the latest J2EE application server technology. The presented system is freely available for academic and non-profit institutions. More information can be found at http://genome.tugraz.at.
  • Cross-species and cross-platform gene expression studies with the Bioconductor-compliant R package annotationTools
    BMC Bioinformatics, Vol. 9, No. 1. (2008)

    BACKGROUND: The variety of DNA microarray formats and datasets presently available offers an unprecedented opportunity to perform insightful comparisons of heterogeneous data. Cross-species studies, in particular, have the power of identifying conserved, functionally important molecular processes. Validation of discoveries can now often be performed in readily available public data which frequently requires cross-platform studies. Cross-platform and cross-species analyses require matching probes on different microarray formats. This can be achieved using the information in microarray annotations and additional molecular biology databases, such as orthology databases. Although annotations and other biological information are stored using modern database models (e.g. relational), they are very often distributed and shared as tables in text files, i.e. flat file databases. This common flat database format thus provides a simple and robust solution to flexibly integrate various sources of information and a basis for the combined analysis of heterogeneous gene expression profiles. RESULTS: We provide annotationTools, a Bioconductor-compliant R package to annotate microarray experiments and integrate heterogeneous gene expression profiles using annotation and other molecular biology information available as flat file databases. First, annotationTools contains a specialized set of functions for mining this widely used database format in a systematic manner. It thus offers a straightforward solution for annotating microarray experiments. Second, building on these basic functions and relying on the combination of information from several databases, it provides tools to easily perform cross-species analyses of gene expression data. Here, we present two example applications of annotationTools that are of direct relevance for the analysis of heterogeneous gene expression profiles, namely a cross-platform mapping of probes and a cross-species mapping of orthologous probes using different orthology databases. We also show how to perform an explorative comparison of disease-related transcriptional changes in human patients and in a genetic mouse model. CONCLUSION: The R package annotationTools provides a simple solution to handle microarray annotation and orthology tables, as well as other flat molecular biology databases. Thereby, it allows easy integration and analysis of heterogeneous microarray experiments across different technological platforms or species.
  • Microarray retriever: a web-based tool for searching and large scale retrieval of public microarray data.
    Nucleic acids research (7 May 2008)

    The major public microarray repositories Gene Expression Omnibus and ArrayExpress are growing rapidly. This enables meta-analysis studies, in which expression data from multiple individual studies are combined. To facilitate these types of studies, we developed Microarray Retriever for searching and retrieval of data from GEO and ArrayExpress. The tool allows access to the two repositories simultaneously, to search in the repositories using complex queries, to retrieve microarray data for published articles and to download data in one structured archive. The tool is available on the web at: http://www.lgtc.nl/MaRe/
  • MicroRNA expression profiles classify human cancers
    Nature, Vol. 435, No. 7043., pp. 834-838.
  • EzArray: a web-based highly automated Affymetrix expression array data management and analysis system
    BMC Bioinformatics, Vol. 9, No. 1. (2008)

    BACKGROUND: Though microarray experiments are very popular in life science research, managing and analyzing microarray data are still challenging tasks for many biologists. Most microarray programs require users to have sophisticated knowledge of mathematics, statistics and computer skills for usage. With accumulating microarray data deposited in public databases, easy-to-use programs to re-analyze previously published microarray data are in high demand. RESULTS: EzArray is a web-based Affymetrix expression array data management and analysis system for researchers who need to organize microarray data efficiently and get data analyzed instantly. EzArray organizes microarray data into projects that can be analyzed online with predefined or custom procedures. EzArray performs data preprocessing and detection of differentially expressed genes with statistical methods. All analysis procedures are optimized and highly automated so that even novice users with limited pre-knowledge of microarray data analysis can complete initial analysis quickly. Since all input files, analysis parameters, and executed scripts can be downloaded, EzArray provides maximum reproducibility for each analysis. In addition, EzArray integrates with Gene Expression Omnibus (GEO) and allows instantaneous re-analysis of published array data. CONCLUSIONS: EzArray is a novel Affymetrix expression array data analysis and sharing system. EzArray provides easy-to-use tools for re-analyzing published microarray data and will help both novice and experienced users perform initial analysis of their microarray data from the location of data storage. We believe EzArray will be a useful system for facilities with microarray services and laboratories with multiple members involved in microarray data analysis. EzArray is freely available from http://www.ezarray.com/.
  • Gene expression of topoisomerase II alpha (TOP2A) by microarray analysis is highly prognostic in estrogen receptor (ER) positive breast cancer.
    Breast cancer research and treatment (14 March 2008)

    Introduction: Overexpression of Topoisomerase II alpha (TOP2A) has been implicated with gene amplification of the 17q21 amplicon and consecutively with ErbB2 overexpression and amplification. However, gene amplification does not necessarily correlate with RNA and protein expression. There is growing evidence that TOP2A protein expression is a strong prognostic and TOP2A gene amplification might be a predictive marker (particularly for the use of anthracyclines). Methods: Large scale analysis was performed using Affymetrix microarray data from n = 1,681 breast cancer patients to evaluate TOP2A expression. Results: TOP2A expression showed a strong correlation with tumor size (chi-square test, P < 0.001), grading (P < 0.001), ErbB2 (P < 0.001) and Ki67 expression (P < 0.001) as well as nodal status (P = 0.042). Survival analysis revealed a significant prognostic value in ER positive (n = 994; log rank P < 0.001), but not in ER negative breast cancer patients (n = 369, P = 0.35). The prognostic impact of TOP2A expression was independent of Ki67 expression in ER positive tumors (P = 0.002 and P = 0.007 for high and low Ki67, respectively). Moreover a worse prognosis of high TOP2A expressing tumors was found in the subgroup of ErbB2 negative tumors (P < 0.001) and a trend among ErbB2 positive tumors (P = 0.11). The prognostic value of TOP2A was independent of whether the patients were untreated or had received adjuvant therapy. In multivariate Cox regression analysis including standard parameters TOP2A emerged to be the top prognostic marker (HR 2.40, 95% CI 1.68-3.43, P < 0.001). Conclusion: TOP2A expression is an independent prognostic factor in ER positive breast cancer and could be helpful for risk assessment in ER positive breast cancer patients.
  • Clinical validation of a customized multiple signature microarray for breast cancer.
    Clinical cancer research : an official journal of the American Association for Cancer Research, Vol. 14, No. 2. (15 January 2008), pp. 461-469.

    PURPOSE: Current histopathologic systems for classifying breast tumors require evaluation of multiple variables and are often associated with significant interobserver variability. Recent studies suggest that gene expression profiles may represent a promising alternative for clinical cancer classification. Here, we investigated the use of a customized microarray as a potential tool for clinical practice. EXPERIMENTAL DESIGN: We fabricated custom 188-gene microarrays containing expression signatures for three breast cancer molecular subtypes [luminal/estrogen receptor (ER) positive, human epidermal growth factor receptor 2 (HER2), and "basaloid"], the Nottingham prognostic index (NPI-ES), and low histologic grade (TuM1). The reliability of these multiple-signature arrays (MSA) was tested in a prospective cohort of 165 patients with primary breast cancer. RESULTS: The MSA-ER signature exhibited a high concordance of 90% with ER immunohistochemistry reported on diagnosis (P < 0.001). This remained unchanged at 89% (P < 0.001) when the immunohistochemistry was repeated using current laboratory standards. Expression of the HER2 signature showed a good correlation of 76% with HER2 fluorescence in situ hybridization (FISH; ratio ≥2.2; P < 0.001), which further improved to 89% when the ratio cutoff was raised to ≥5. A proportion of low-level FISH-amplified samples (ratio, 2.2-5) behaved comparably to FISH-negative samples by HER2 signature expression, HER2 quantitative reverse transcription-PCR, and HER2 immunohistochemistry. Luminal/ER+ tumors with high NPI-ES expression were associated with high NPI scores (P = 0.001), and luminal/ER+ TuM1-expressing tumors were significantly correlated with low histologic grade (P = 0.002) and improved survival outcome in an interim analysis (hazard ratio, 0.2; P = 0.019). CONCLUSION: The consistency of the MSA platform in an independent patient population suggests that custom microarrays could potentially function as an adjunct to standard immunohistochemistry and FISH in clinical practice.
  • Meta-analysis of human cancer microarrays reveals GATA3 is integral to the estrogen receptor alpha pathway
    Molecular Cancer, Vol. 7 (04 June 2008), 49.

    BACKGROUND: The transcription factor GATA3 has recently been shown to be necessary for mammary gland morphogenesis and luminal cell differentiation. There is also an increasing body of data linking GATA3 to the estrogen receptor alpha (ERalpha) pathway. Among these it was shown that GATA3 associates with the promoter of the ERalpha gene and ERalpha can reciprocally associate with the GATA3 gene. GATA3 has also been directly implicated in a differentiated phenotype in mouse models of mammary tumourigenesis. The purpose of our study was to compare coexpressed genes, by meta-analysis, of GATA3 and relate these to a similar analysis for ERalpha to determine the depth of overlap. RESULTS: We have used a newly described method of meta-analysis of multiple cancer studies within the Oncomine database, focusing here predominantly upon breast cancer studies. We demonstrate that ERalpha and GATA3 reciprocally have the highest overlap with one another. Furthermore, we show that when both coexpression meta-analysis lists for ERalpha and GATA3 are compared there is a significant overlap between both and, like ERalpha, GATA3 coexpresses with ERalpha pathway partners such as pS2 (TFF1), TFF3, FOXA1, BCL2, ERBB4, XBP1, NRIP1, IL6ST, keratin 18(KRT18) and cyclin D1 (CCND1). Moreover, as these data are derived from human tumour samples this adds credence to previous cell-culture or murine based studies. CONCLUSION: GATA3 is hypothesized to be integral to the ERalpha pathway given the following: (1) The large overlap of coexpressed genes as seen by meta-analysis, between GATA3 and ERalpha, (2) The highest coexpressing gene for GATA3 was ERalpha and vice-versa, (3) GATA3, like ERalpha, coexpresses with many well-known ERalpha pathway partners such as pS2.
  • The use of genomic tools for the molecular understanding of breast cancer and to guide personalized medicine.
    Drug discovery today, Vol. 13, No. 11-12. (June 2008), pp. 481-487.

    The use of gene-expression microarray analysis to assess the expression levels of all the genes in the genome has tremendous potential. Important information has been obtained about many disease processes, particularly in classifying tumors in different subtypes and risk groups. Combining gene-expression data with other genomic information and the use of sophisticated bioinformatic tools enables the discovery of potential new targets for treatment, and is helpful for high-throughput drug screening and for designing new classes of drugs for targeted therapy. Here, we provide a short overview of the recent, promising developments in the field with emphasis on breast cancer.
  • Merging microarray data from separate breast cancer studies provides a robust prognostic test
    BMC Bioinformatics, Vol. 9 (27 February 2008), 125.

  • Prediction of the outcome of preoperative chemotherapy in breast cancer by DNA probes that convey information on both complete and non complete responses
    BMC Bioinformatics, Vol. 9 (15 March 2008), 149.

    BACKGROUND: DNA microarray technology has emerged as a major tool for exploring cancer biology and solving clinical issues. Predicting a patient's response to chemotherapy is one such issue; successful prediction would make it possible to give patients the most appropriate chemotherapy regimen. Patient response can be classified as either a pathologic complete response (PCR) or residual disease (NoPCR), and these strongly correlate with patient outcome. Microarrays can be used as multigenic predictors of patient response, but probe selection remains problematic. In this study, each probe set was considered as an elementary predictor of the response and was ranked on its ability to predict a high number of PCR and NoPCR cases in a ratio similar to that seen in the learning set. We defined a valuation function that assigned high values to probe sets according to how different the expression of the genes was and how closely the relative proportions of PCR and NoPCR predictions matched the proportions observed in the learning set. Multigenic predictors were designed by selecting probe sets highly ranked in their predictions and tested using several validation sets. RESULTS: Our method defined three types of probe sets: 71% were mono-informative probe sets (59% predicted only NoPCR, and 12% predicted only PCR), 25% were bi-informative, and 4% were non-informative. Using a valuation function to rank the probe sets allowed us to select those that correctly predicted the response of a high number of patient cases in the training set and that predicted a PCR/NoPCR ratio for validation sets that was similar to that of the whole learning set. Based on DLDA and the nearest centroid method, bi-informative probes proved more successful predictors than probes selected using a t test. CONCLUSION: Prediction of the response to breast cancer preoperative chemotherapy was significantly improved by selecting DNA probe sets that were successful in predicting outcomes for the entire learning set, both in terms of accurately predicting a high number of cases and in correctly predicting the ratio of PCR to NoPCR cases.
  • Identification of cancer genes using a statistical framework for multiexperiment analysis of nondiscretized array CGH data.
    Nucleic acids research, Vol. 36, No. 2. (February 2008)

    Tumor formation is in part driven by DNA copy number alterations (CNAs), which can be measured using microarray-based Comparative Genomic Hybridization (aCGH). Multiexperiment analysis of aCGH data from tumors allows discovery of recurrent CNAs that are potentially causal to cancer development. Until now, multiexperiment aCGH data analysis has been dependent on discretization of measurement data to a gain, loss or no-change state. Valuable biological information is lost when a heterogeneous system such as a solid tumor is reduced to these states. We have developed a new approach which inputs nondiscretized aCGH data to identify regions that are significantly aberrant across an entire tumor set. Our method is based on kernel regression and accounts for the strength of a probe's signal, its local genomic environment and the signal distribution across multiple tumors. In an analysis of 89 human breast tumors, our method showed enrichment for known cancer genes in the detected regions and identified aberrations that are strongly associated with breast cancer subtypes and clinical parameters. Furthermore, we identified 18 recurrent aberrant regions in a new dataset of 19 p53-deficient mouse mammary tumors. These regions, combined with gene expression microarray data, point to known cancer genes and novel candidate cancer genes.
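
    The abstract above describes a method based on kernel regression applied to nondiscretized aCGH data. As a rough illustration of kernel regression only (not the authors' statistical framework, which also models signal strength and the distribution across tumors), a Nadaraya-Watson smoother with a Gaussian kernel might look like the sketch below. The function name, the Gaussian kernel, and the bandwidth parameter are assumptions made for this example.

```python
import numpy as np

def kernel_smooth(positions, log_ratios, bandwidth):
    """Nadaraya-Watson kernel regression with a Gaussian kernel.

    positions  : genomic coordinates of the probes (hypothetical input)
    log_ratios : continuous (nondiscretized) aCGH log ratios per probe
    bandwidth  : kernel width, in the same units as positions (assumed)
    """
    positions = np.asarray(positions, dtype=float)
    log_ratios = np.asarray(log_ratios, dtype=float)
    # Pairwise distances between every probe and every other probe.
    dist = positions[:, None] - positions[None, :]
    # Gaussian weights: nearby probes contribute more to the estimate.
    weights = np.exp(-0.5 * (dist / bandwidth) ** 2)
    # Weighted average of neighbouring log ratios at each probe position.
    return (weights @ log_ratios) / weights.sum(axis=1)
```

    Smoothing in this way keeps the magnitude of each probe's signal instead of reducing it to a gain, loss or no-change call, which is the loss of information the abstract argues against.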
  • Spectral biclustering of microarray data: coclustering genes and conditions.
    Genome Res, Vol. 13, No. 4. (April 2003), pp. 703-716.

    Global analyses of RNA expression levels are useful for classifying genes and overall phenotypes. Often these classification problems are linked, and one wants to find "marker genes" that are differentially expressed in particular sets of "conditions." We have developed a method that simultaneously clusters genes and conditions, finding distinctive "checkerboard" patterns in matrices of gene expression data, if they exist. In a cancer context, these checkerboards correspond to genes that are markedly up- or downregulated in patients with particular types of tumors. Our method, spectral biclustering, is based on the observation that checkerboard structures in matrices of expression data can be found in eigenvectors corresponding to characteristic expression patterns across genes or conditions. In addition, these eigenvectors can be readily identified by commonly used linear algebra approaches, in particular the singular value decomposition (SVD), coupled with closely integrated normalization steps. We present a number of variants of the approach, depending on whether the normalization over genes and conditions is done independently or in a coupled fashion. We then apply spectral biclustering to a selection of publicly available cancer expression data sets, and examine the degree to which the approach is able to identify checkerboard structures. Furthermore, we compare the performance of our biclustering methods against a number of reasonable benchmarks (e.g., direct application of SVD or normalized cuts to raw data).
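
    The core idea in the abstract above (normalize the expression matrix, then read the checkerboard structure from singular vectors) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published algorithm: it uses only the independent row/column rescaling variant, assumes a strictly positive matrix, and splits genes and conditions into two groups each by the sign of the second singular vectors. The function name is hypothetical.

```python
import numpy as np

def spectral_bicluster_sketch(expr):
    """Split rows (genes) and columns (conditions) into two groups each.

    expr is assumed to be a strictly positive genes x conditions matrix.
    """
    expr = np.asarray(expr, dtype=float)
    row_sums = expr.sum(axis=1)
    col_sums = expr.sum(axis=0)
    # Independent rescaling of rows and columns: R^(-1/2) A C^(-1/2).
    scaled = expr / np.sqrt(row_sums)[:, None] / np.sqrt(col_sums)[None, :]
    u, s, vt = np.linalg.svd(scaled, full_matrices=False)
    # The leading singular pair reflects the overall scale of the matrix;
    # the second pair carries the two-way partition, read off by sign.
    row_labels = (u[:, 1] > 0).astype(int)
    col_labels = (vt[1, :] > 0).astype(int)
    return row_labels, col_labels
```

    For example, a 6 x 8 matrix whose first three rows are high in the first four columns and whose last three rows are high in the last four columns should come back with the two row groups and the two column groups separated. Which group receives label 0 or 1 is arbitrary, because singular vectors are only defined up to sign.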
  • Final words: cell age and cell cycle are unlinked.
    Trends Biotechnol, Vol. 22, No. 6. (June 2004), pp. 277-278.

    Cooper has a simple belief: that the cell cycle is connected to age and size. Furthermore, as a result of this connection in his mind he believes that there are no possible manipulations that can operate on a batch culture to synchronize cells within the cell cycle, such that those cells can undergo a semblance of a normal cell cycle. His formulation of this argument is as a 'fundamental law', the law of conservation of cell-age order (LCCAO). The first part of this law - 'there is no batch treatment of the culture that can lead to an alteration of the cell-age order' - can probably be proved true, in the mathematical sense, and certainly makes intuitive sense. Unfortunately the corollaries of this law are rather suspect, drawing inferences from cell age to cell size to the cell cycle.
  • Reply: whole-culture synchronization - effective tools for cell cycle studies.
    Trends Biotechnol, Vol. 22, No. 6. (June 2004), pp. 270-273.

    Studies of gene expression during the eukaryotic cell cycle in whole-culture synchronized cultures have been published using many methodologies. These procedures alter the state of the cell cycle for a population of cells, rather than purifying a population of cells that are in the same state. Criticism of these methods (e.g. see Cooper, this issue, pp. 266-269) suggests that these studies are flawed, and posits that such methodologies cannot be used to study the cell cycle because they alter the size and age distributions of the cultures. We believe that whole-culture cell cycle studies work even though they alter the size and age distributions: these cells still progress through the cell cycle and although we do not suggest that the methods are perfect, we will explain how these microarray studies have successfully identified cell cycle regulated genes and why these results are biologically meaningful.
  • Is whole-culture synchronization biology's 'perpetual-motion machine'?
    Trends Biotechnol, Vol. 22, No. 6. (June 2004), pp. 266-269.

    Whole-culture or batch synchronization cannot, in theory, produce a synchronized culture because it violates a fundamental law that proposes that no batch treatment can alter the cell-age order of a culture. In analogy with the history of perpetual-motion machines, it is suggested that the study of these whole-culture 'synchronization' methods might lead to an understanding of general biological principles even though these methods cannot be used to study the normal cell cycle.
  • Rejoinder: whole-culture synchronization cannot, and does not, synchronize cells.
    Trends Biotechnol, Vol. 22, No. 6. (June 2004), pp. 274-276.

    There have been numerous proposals suggesting that whole-culture methods - in which all cells in a growing culture are treated identically - can synchronize cells. An explicit defense of these methods has been presented (Spellman and Sherlock, this issue, pp. 270-273). Here, this defense of whole-culture 'synchronization' is subjected to a critical evaluation leading to the conclusion that whole-culture synchronization cannot synchronize cells - at all. Whole-culture methods cannot produce a set of cells that reflects the size and genome composition of cells of any particular cell-cycle age during the normal cell cycle. Thus, in addition to the well-recognized problem of artifacts, it is proposed that experiments using whole-culture treatments (usually starvation or inhibition methods) are not suitable for cell-cycle analysis because these methods do not produce a synchronized culture.
  • Ontological analysis of gene expression data: current tools, limitations, and open problems.
    Bioinformatics, Vol. 21, No. 18. (15 September 2005), pp. 3587-3595.

    Independent of the platform and the analysis methods used, the result of a microarray experiment is, in most cases, a list of differentially expressed genes. An automatic ontological analysis approach has been recently proposed to help with the biological interpretation of such results. Currently, this approach is the de facto standard for the secondary analysis of high throughput experiments and a large number of tools have been developed for this purpose. We present a detailed comparison of 14 such tools using the following criteria: scope of the analysis, visualization capabilities, statistical model(s) used, correction for multiple comparisons, reference microarrays available, installation issues and sources of annotation data. This detailed analysis of the capabilities of these tools will help researchers choose the most appropriate tool for a given type of analysis. More importantly, in spite of the fact that this type of analysis has been generally adopted, this approach has several important intrinsic drawbacks. These drawbacks are associated with all tools discussed and represent conceptual limitations of the current state-of-the-art in ontological analysis. We propose these as challenges for the next generation of secondary data analysis tools.
  • Gene expression profiling and differentiation assessment in primary human hepatocyte cultures, established hepatoma cell lines, and human liver tissues.
    Toxicol Appl Pharmacol, Vol. 222, No. 1. (1 July 2007), pp. 42-56.

    Frequently, primary hepatocytes are used as an in vitro model for the liver in vivo. However, the culture conditions reported vary considerably, with associated variability in performance. In this study, we characterized the differentiation character of primary human hepatocytes cultured using a highly defined, serum-free two-dimensional sandwich system, one that configures hepatocytes with collagen I as the substratum together with a dilute extracellular matrix (Matrigel™) overlay combined with a defined serum-free medium containing nanomolar levels of dexamethasone. Gap junctional communication, indicated by immunochemical detection of connexin 32 protein, was markedly enhanced in hepatocytes cultured in the Matrigel sandwich configuration. Whole genome expression profiling enabled direct comparison of liver tissues to hepatocytes and to the hepatoma-derived cell lines, HepG2 and Huh7. PANTHER database analyses were used to identify biological processes that were comparatively over-represented among probe sets expressed in the in vitro systems. The robustness of the primary hepatocyte cultures was reflected by the extent of unchanged expression character when compared directly to liver, with more than 77% of the probe sets unchanged in each of the over-represented categories, representing such genes as C/EBPalpha, HNF4alpha, CYP2D6, and ABCB1. In contrast, HepG2 and Huh7 cells were unchanged from the liver tissues for fewer than 48% and 55% of these probe sets, respectively. Further, hierarchical clustering of the hepatocytes, but not the cell lines, shifted from donor-specific to treatment-specific when the probe sets were filtered to focus on phenobarbital-inducible genes, indicative of the highly differentiated nature of the hepatocytes when cultured in a highly defined two-dimensional sandwich system.
  • Protein microarrays for highly parallel detection and quantitation of specific proteins and antibodies in complex solutions
    Genome Biology, Vol. 2, No. 2. (2001), pp. research0004.1-research0004.13.

    BACKGROUND: We have developed and tested a method for printing protein microarrays and using these microarrays in a comparative fluorescence assay to measure the abundance of many specific proteins in complex solutions. A robotic device was used to print hundreds of specific antibody or antigen solutions in an array on the surface of derivatized microscope slides. Two complex protein samples, one serving as a standard for comparative quantitation, the other representing an experimental sample in which the protein quantities were to be measured, were labeled by covalent attachment of spectrally resolvable fluorescent dyes. RESULTS: Specific antibody-antigen interactions localized specific components of the complex mixtures to defined cognate spots in the array, where the relative intensity of the fluorescent signal representing the experimental sample and the reference standard provided a measure of each protein's abundance in the experimental sample. To test the specificity, sensitivity and accuracy of this assay, we analyzed the performance of 115 antibody/antigen pairs. 50% of the arrayed antigens and 20% of the arrayed antibodies provided specific and accurate measurements of their cognate ligands at or below concentrations of 0.34 µg/ml and 1.6 µg/ml, respectively. Some of the antibody/antigen pairs allowed detection of the cognate ligands at absolute concentrations below 1 ng/ml, and partial concentrations of 1 part in 10^6, sensitivities sufficient for measurement of many clinically important proteins in patient blood samples. CONCLUSIONS: These results suggest that protein microarrays can provide a practical means to characterize patterns of variation in hundreds of thousands of different proteins in clinical or research applications.
  • Fluorescent high-density labeling of DNA: error-free substitution for a normal nucleotide
    Journal of Biotechnology, Vol. 86, No. 3. (13 April 2001), pp. 237-253.

    The enzymatic incorporation of deoxyribonucleoside triphosphates by a thermostable, 3'→5' exonuclease-deficient mutant of the Tgo DNA polymerase was studied for PCR-based high-density labeling of 217-bp 'natural' DNA in which fluorescent-dUTP was substituted completely for the normal dTTP. The amplified DNA carried two different sorts of tethered dye molecules. The rhodamine-green was used for internal tagging of the DNA. Since high-density incorporation of rhodamine-green-X-dUTP led to a substantial reduction (quenching) of the rhodamine-green fluorescence, a second 'high' quantum yield label, Cy5, was inserted via a 5'-tagged primer in order to identify the two-color product. A theoretical concept of fluorescence auto- and cross-correlation spectroscopy developed here was applied to quantify the DNA sequence formed in terms of both the number of two-color fluorescent molecules and the number of covalently incorporated rhodamine-green-X-dUMP residues. The novel approach allowed optical separation of the specific DNA product. After complete exonucleolytic degradation of the two-color DNA, we determined 82-88 fluorescent U* labels incorporated covalently out of 92 maximum possible U* incorporations. The heavily green-labeled DNA was then isolated by preparative mobility-shift electrophoresis, re-amplified in a subsequent PCR with normal deoxyribonucleoside triphosphates, and re-sequenced. By means of this novel methodology for analyzing base-specific incorporations that was first developed here, we found that all fluorescent nucleotides and the normal nucleotides were incorporated at the correct positions. The determined labeling efficiency of 0.89-0.96 indicated that a fraction of the substrate analog was not bearing the fluorophore. The results were used to guide developments in single-molecule DNA sequencing. 
The labeling strategy (principal approach) for PCR-based high-density tagging of DNA, which included an appropriate thermostable DNA polymerase and a suitable fluorescent dye-dNTP, was developed here.
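
    The labeling efficiency quoted above follows directly from the label counts in the abstract (82-88 labels out of 92 possible sites). A minimal arithmetic check, where the function and constant names are illustrative and not taken from the paper:

```python
# Figures quoted in the abstract above.
MAX_SITES = 92            # maximum possible U* incorporations
LABELS_FOUND = (82, 88)   # range of fluorescent U* labels detected


def labeling_efficiency(labels: int, sites: int) -> float:
    """Fraction of possible sites that carry a fluorescent label."""
    if sites <= 0:
        raise ValueError("sites must be positive")
    return labels / sites


low, high = (labeling_efficiency(n, MAX_SITES) for n in LABELS_FOUND)
print(f"{low:.2f}-{high:.2f}")  # prints 0.89-0.96
```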
  • Role of human hepatocyte nuclear factor 4alpha in the expression of drug-metabolizing enzymes and transporters in human hepatocytes assessed by use of small interfering RNA.
    Drug Metab Pharmacokinet, Vol. 22, No. 4. (August 2007), pp. 287-298.

    Hepatocyte nuclear factor 4alpha (HNF4alpha) is an important transcription factor in hepatic gene expression. Here, we have investigated the role of HNF4alpha in the expression of drug-metabolizing enzymes and transporters in human hepatocytes using an adenovirus expressing human HNF4alpha-small interfering RNA (hHNF4alpha-siRNA). The hHNF4alpha-siRNA effectively reduced the mRNA and nuclear protein levels of hHNF4alpha in a concentration-dependent manner. The hHNF4alpha-siRNA also decreased the mRNA levels of CYP2A6, CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, CYP3A4, UGT1A1, UGT1A9, SULT2A1, ABCB1, ABCB11, ABCC2, OATP1B1 and OCT1, as well as those of PXR and CAR. To discern the role of these nuclear receptors, we co-infected hepatocytes with hHNF4alpha-siRNA and PXR- or CAR-expressing adenovirus. The hHNF4alpha-siRNA-induced reductions of the enzyme and transporter mRNA levels were not restored except CYP2B6 mRNA levels, which were returned to the control level by overexpressing CAR. Furthermore, although hHNF4alpha-siRNA did not significantly affect the fold-induction of CYP2B6, CYP2C8, CYP2C9, or CYP3A4 mRNA levels following treatment with CYP inducers, the levels in hHNF4alpha-suppressed cells fell significantly compared to the control. These results suggest that HNF4alpha plays a dominant role in the expression of drug-metabolizing enzymes and transporters in human hepatocytes, and that the HNF4alpha expression level is a possible determinant for inter-individual variations in the expression of these enzymes and transporters.
  • Pre-filtering improves reliability of Affymetrix GeneChips results when used to analyze gene expression in complex tissues.
    Mol Cell Probes (28 November 2007)

    Affymetrix GeneChip represents a very reliable and standardized technology for genome-wide gene expression screening. However, in experiments carried out on complex biological samples (e.g. brain tissues composed of several diverse cell types), significant noise can arise due to important transcripts being expressed in a relatively small number of cells. This noise results in many observations coming from unreliable hybridization reactions. Here we propose a method for pre-filtering Affymetrix data according to measures of hybridization reliability. We used our pre-filtering method on a microarray dataset obtained from the brains of rats chronically treated with a psychostimulant drug. Our pre-filter protocol facilitates selection of biologically relevant candidate genes, which could be validated by real-time PCR with a rate of 98%.
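
    The abstract does not say which reliability measure the authors used, so the following is only a sketch of the general idea of pre-filtering. It assumes, purely for illustration, that each probe set has an Affymetrix-style detection call per array ("P" present, "M" marginal, "A" absent) and keeps probe sets called present on a minimum fraction of arrays. The function name, threshold and example data are hypothetical.

```python
from typing import Dict, List


def prefilter(calls: Dict[str, List[str]],
              min_present_fraction: float = 0.5) -> List[str]:
    """Return probe sets called 'P' on at least the given fraction of arrays.

    Assumption: detection calls stand in for the (unspecified) measure of
    hybridization reliability used in the paper.
    """
    kept = []
    for probe_set, probe_calls in calls.items():
        if not probe_calls:
            continue  # no arrays for this probe set, nothing to judge
        present = sum(1 for call in probe_calls if call == "P")
        if present / len(probe_calls) >= min_present_fraction:
            kept.append(probe_set)
    return kept


# Hypothetical detection calls for three probe sets across four arrays.
example = {
    "probe_A": ["P", "P", "P", "A"],
    "probe_B": ["A", "A", "M", "P"],
    "probe_C": ["P", "M", "P", "P"],
}
print(prefilter(example, 0.75))  # prints ['probe_A', 'probe_C']
```

    Downstream statistics would then be run only on the retained probe sets.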
  • Cyanine dye dUTP analogs for enzymatic labeling of DNA probes.
    Nucleic Acids Res, Vol. 22, No. 15. (11 August 1994), pp. 3226-3232.

    Fluorescence in situ hybridization (FISH) has become an indispensable tool in a variety of areas of research and clinical diagnostics. Many applications demand an approach for simultaneous detection of multiple target sequences that is rapid and simple, yet sensitive. In this work, we describe the synthesis of two new cyanine dye-labeled dUTP analogs, Cy3-dUTP and Cy5-dUTP. They are efficient substrates for DNA polymerases and can be incorporated into DNA probes by standard nick translation, random priming and polymerase chain reactions. Optimal labeling conditions have been identified which yield probes with 20-40 dyes per kilobase. The directly labeled DNA probes obtained with these analogs offer a simple approach for multicolor multisequence analysis that requires no secondary detection reagents and steps.
  • Development of a DNA-Labeling System for Array-Based Comparative Genomic Hybridization
    J Biomol Tech, Vol. 16, No. 2. (1 June 2005), pp. 104-111.

    Chromosomal amplifications and deletions are critical components of tumorigenesis and DNA copy-number variations also correlate with changes in mRNA expression levels. Genome-wide microarray comparative genomic hybridization (CGH) has become an important method for detecting and mapping chromosomal changes in tumors. Thus, the ability to detect twofold differences in fluorescent intensity between samples on microarrays depends on the generation of high-quality labeled probes. To enhance array-based CGH analysis, a random prime genomic DNA labeling method optimized for improved sensitivity, signal-to-noise ratios, and reproducibility has been developed. The labeling system comprises formulated random primers, nucleotide mixtures, and notably a high concentration of the double mutant exo-large fragment of DNA polymerase I (exo-Klenow). Microarray analyses indicate that the genomic DNA-labeled templates yield hybridization signals with higher fluorescent intensities and greater signal-to-noise ratios and detect more positive features than the standard random prime and conventional nick translation methods. Also, templates generated by this system have detected twofold differences in gene copy number between male and female genomic DNA and identified amplification and deletions from the BT474 breast cancer cell line in microarray hybridizations. Moreover, alterations in gene copy number were routinely detected with 0.5 µg of genomic DNA starting sample. The method is flexible and performs efficiently with different fluorescently labeled nucleotides. Application of the optimized CGH labeling system may enhance the resolution and sensitivity of array-based CGH analysis in cancer and medical genetic studies.
  • Quality assessment of Affymetrix GeneChip data.
    OMICS, Vol. 10, No. 3. (2006), pp. 358-368.

    Affymetrix GeneChips are one of the best established microarray platforms. This powerful technique allows users to measure the expression of thousands of genes simultaneously. However, a microarray experiment is a sophisticated and time consuming endeavor with many potential sources of unwanted variation that could compromise the results if left uncontrolled. Increasing data volume and data complexity have triggered growing concern and awareness of the importance of assessing the quality of generated microarray data. In this review, we give an overview of current methods and software tools for quality assessment of Affymetrix GeneChip data. We focus on quality metrics, diagnostic plots, probe-level methods, pseudo-images, and classification methods to identify corrupted chips. We also describe RNA quality assessment methods which play an important role in challenging RNA sources like formalin embedded biopsies, laser-micro dissected samples, or single cells. No wet-lab methods are discussed in this paper.
  • Characterization of three growth hormone-responsive transcription factors preferentially expressed in adult female liver.
    Endocrinology, Vol. 148, No. 7. (July 2007), pp. 3327-3337.

    Plasma GH profiles regulate the sexually dimorphic expression of cytochromes P450 and many other genes in rat and mouse liver; however, the proximal transcriptional regulators of these genes are unknown. Presently, we characterize three liver transcription factors that are expressed in adult female rat and mouse liver at levels up to 16-fold [thymus high-mobility group box protein (Tox)], 73-fold [tripartite motif-containing 24 (Trim24)/transcription initiation factor-1alpha (TIF1alpha)], and 125-fold [cut-like 2 (Cutl2)/cut homeobox 2 (Cux2)] higher than in adult males, depending on the strain and species, with Tox expression only detected in mice. In rats, these sex differences first emerged at puberty, when the high prepubertal expression of Cutl2 and Trim24 was extinguished in males but was further increased in females. Rat hepatic expression of Cutl2 and Trim24 was abolished by hypophysectomy and, in the case of Cutl2, was restored to near-female levels by continuous GH replacement. Cutl2 and Trim24 were increased to female-like levels in livers of intact male rats and mice treated with GH continuously (female GH pattern), whereas Tox expression reached only about 40% of adult female levels. Expression of all three genes was also elevated to normal female levels or higher in male mice whose plasma GH profile was feminized secondary to somatostatin gene disruption. Cutl2 and Trim24 both responded to GH infusion in mice within 10-24 h and Tox within 4 d, as compared with at least 4-7 d required for the induced expression of several continuous GH-regulated cytochromes P450 and other female-specific hepatic genes. 
Cutl2, Trim24, and Tox were substantially up-regulated in livers of male mice deficient in either of two transcription factors implicated in GH regulation of liver sex specificity, namely, signal transducer and activator of transcription 5b (STAT5b) and hepatocyte nuclear factor 4alpha (HNF4alpha), with sex-specific expression being substantially reduced or lost in mice deficient in either nuclear factor. Cutl2 and Trim24 both display transcriptional repressor activity and could thus contribute to the loss of GH-regulated, male-specific liver gene expression seen in male mice deficient in STAT5b or HNF4alpha. Binding sites for Cutl1, whose DNA-binding specificity is close to that of Cutl2, were statistically overrepresented in STAT5b-dependent male-specific mouse genes, lending support to this hypothesis.
  • Options available--from start to finish--for obtaining expression data by microarray.
    Nat Genet, Vol. 21, No. 1 Suppl. (January 1999), pp. 25-32.

    The excitement surrounding microarray technology has been tempered by the limited ability of the general biomedical research community to gain access to it. Given that the hardware required for exploitation of the technology is becoming increasingly available, it is an appropriate moment to review options, be they commercially or publicly available. Here, we provide a snapshot of the rapidly changing field of microarray-based RNA expression analysis and consider the components and procedures for putting together a complete system.
  • Functional profiling of microarray experiments using text-mining derived bioentities
    Bioinformatics, Vol. 23, No. 22. (15 November 2007), pp. 3098-3099.

    Motivation: The increasing use of microarray technologies brought about a parallel demand in methods for the functional interpretation of the results. Beyond the conventional functional annotations for genes, such as gene ontology, pathways, etc. other sources of information are still to be exploited. Text-mining methods allow extracting informative terms (bioentities) with different functional, chemical, clinical, etc. meanings, that can be associated to genes. We show how to use these associations within an appropriate statistical framework and how to apply them through easy-to-use, web-based environments to the functional interpretation of microarray experiments. Functional enrichment and gene set enrichment tests using bioentities are presented. Availability: Marmite and MarmiteScan can be found in the Babelomics suite: http://www.babelomics.org Contact: jdopazo@cipf.es Supplementary information: Supplementary data are available at Bioinformatics online. 10.1093/bioinformatics/btm445
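
    The abstract mentions functional enrichment tests without naming the statistic. One common choice for this kind of test is a one-sided hypergeometric test, sketched below under that assumption; it is not confirmed to be what Marmite implements, and the function name and example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total: int, annotated: int,
                           selected: int, hits: int) -> float:
    """P(X >= hits) for a one-sided hypergeometric enrichment test.

    total     -- genes in the background set
    annotated -- background genes associated with the term (bioentity)
    selected  -- genes in the list of interest
    hits      -- genes in the list that are associated with the term
    """
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, k) * comb(total - annotated, selected - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(total, selected)


# Hypothetical example: 5 of 20 background genes carry the term,
# and 2 of the 4 selected genes do.
p_value = hypergeom_enrichment_p(total=20, annotated=5, selected=4, hits=2)
print(f"p = {p_value:.4f}")
```

    In practice the resulting p-values would be corrected for multiple testing across all terms examined.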
  • Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation
    BMC Bioinformatics, Vol. 8 (18 January 2007), 14.
  • Improved scoring of functional groups from gene expression data by decorrelating GO graph structure
    Bioinformatics, Vol. 22, No. 13. (1 July 2006), pp. 1600-1607.
  • MILANO - custom annotation of microarray results using automatic literature searches
    BMC Bioinformatics, Vol. 6, No. 1. (2005)

    BACKGROUND: High-throughput genomic research tools are becoming standard in the biologist's toolbox. After processing the genomic data with one of the many available statistical algorithms to identify statistically significant genes, these genes need to be further analyzed for biological significance in light of all the existing knowledge. Literature mining - the process of representing literature data in a fashion that is easy to relate to genomic data - is one solution to this problem. RESULTS: We present a web-based tool, MILANO (Microarray Literature-based Annotation), that allows annotation of lists of genes derived from microarray results by user defined terms. Our annotation strategy is based on counting the number of literature co-occurrences of each gene on the list with a user defined term. This strategy allows the customization of the annotation procedure and thus overcomes one of the major limitations of the functional annotations usually provided with microarray results. MILANO expands the gene names to include all their informative synonyms while filtering out gene symbols that are likely to be less informative as literature searching terms. MILANO supports searching two literature databases: GeneRIF and Medline (through PubMed), allowing retrieval of both quick and comprehensive results. We demonstrate MILANO's ability to improve microarray analysis by analyzing a list of 150 genes that were affected by p53 overproduction. This analysis reveals that MILANO enables immediate identification of known p53 target genes on this list and assists in sorting the list into genes known to be involved in p53 related pathways, apoptosis and cell cycle arrest. CONCLUSIONS: MILANO provides a useful tool for the automatic custom annotation of microarray results which is based on all the available literature. MILANO h