Healthcare Headlines
BMC Bioinformatics - Latest articles
  • Identifying differential correlation in gene/pathway combinations
    Background: An important emerging trend in the analysis of microarray data is to incorporate known pathway information a priori. Expression level "summaries" for pathways, obtained from the expression data for the genes constituting the pathway, permit the inclusion of pathway information, reduce the high dimensionality of microarray data, and have the power to elucidate gene-interaction dependencies which are not already accounted for through known pathway identification. Results: We present a novel method for the analysis of microarray data that identifies joint differential expression in gene-pathway pairs. This method takes advantage of known gene pathway memberships to compute a summary expression level for each pathway as a whole. Correlations between the pathway expression summary and the expression levels of genes not already known to be associated with the pathway provide clues to gene interaction dependencies that are not already accounted for through known pathway identification, and statistically significant differences between gene-pathway correlations in phenotypically different cells (e.g., where the expression level of a single gene and a given pathway summary correlate strongly in normal cells but weakly in tumor cells) may indicate biologically relevant gene-pathway interactions. Here, we detail the methodology and present the results of this method applied to two gene-expression datasets, identifying gene-pathway pairs which exhibit differential joint expression by phenotype. Conclusions: The method described herein provides a means by which interactions between large numbers of genes may be identified by incorporating known pathway information to reduce the dimensionality of gene interactions. The method is efficient and easily applied to data sets of ~10^2 arrays. Application of this method to two publicly-available cancer data sets yields suggestive and promising results. This method has the potential to complement gene-at-a-time analysis techniques for microarray analysis by indicating relationships between pathways and genes that have not previously been identified and which may play a role in disease.

  • Use of machine learning algorithms to classify binary protein sequences as highly-designable or poorly-designable
    Background: By using a standard Support Vector Machine (SVM) with a Sequential Minimal Optimization (SMO) method of training, Naive Bayes and other machine learning algorithms we are able to distinguish between two classes of protein sequences: those folding to highly-designable conformations, or those folding to poorly- or non-designable conformations. Results: First, we generate all possible compact lattice conformations for the specified shape (a hexagon or a triangle) on the 2D triangular lattice. Then we generate all possible binary hydrophobic/polar (H/P) sequences and by using a specified energy function, thread them through all of these compact conformations. If for a given sequence the lowest energy is obtained for a particular lattice conformation we assume that this sequence folds to that conformation. Highly-designable conformations have many H/P sequences folding to them, while poorly-designable conformations have few or no H/P sequences. We classify sequences as folding to either highly - or poorly-designable conformations. We have randomly selected subsets of the sequences belonging to highly-designable and poorly-designable conformations and used them to train several different standard machine learning algorithms. Conclusions: By using these machine learning algorithms with ten-fold cross-validation we are able to classify the two classes of sequences with high accuracy - in some cases exceeding 95%.

  • Expression profiles of switch-like genes accurately classify tissue and infectious disease phenotypes in model-based classification
    Background: Large-scale compilation of gene expression microarray datasets across diverse biological phenotypes has led to the identification and annotation of bimodal genes in the mouse and the human genome. These switch-like genes consist of 15% of known human genes, and are enriched with genes coding for extracellular and membrane proteins. The microarray data is typically comprised of thousands of measurements for a few samples and it is of interest to determine the prediction potential of bimodal genes for class discovery in large-scale datasets. Results: Use of a model-based clustering algorithm accurately classified more than 400 microarray samples into 19 different tissue types on the basis of bimodal gene expression. The algorithm demonstrated similar accuracy in the classification of microarray data corresponding to hepatitis C, influenza, HIV-1 and malaria infection. Classification accuracy was exceptional even with class-specific sample sizes between ten and twenty arrays. Supervised classification with feature selection restricted to switch-like genes also recognized tissue specific and infectious disease specific signatures in independent test datasets reserved for validation. Moreover, our computational simulations with a minimal model of microarray data indicated the validity of our observations in a large number of circumstances. A set of 300 genes out of the 1295 genes annotated in the human as switch-like coded for either extracellular matrix or cell membrane proteins. Conclusions: Switch-like bimodal gene sets capture genome-wide signatures from microarray data in health and infectious disease. A subset of bimodal genes coding for extracellular and membrane proteins identify tissue specificity, indicating a potential role for them as biomarkers provided that expression is altered in the onset of disease. Determination of "on" and "off" states of switch-like genes in various tissues and diseases allow for prediction of activated/deactivated pathways.

  • CrossHybDetector: detection of cross-hybridization events in DNA microarray experiments
    Background: DNA microarrays contain thousands of different probe sequences represented on their surface. These are designed in such a way that potential cross-hybridization reactions with non-target sequences are minimized. However, given the large number of probes, the occurrence of cross hybridization events cannot be excluded. This problem can dramatically affect the data quality and cause false positive/false negative results. Results: CrossHybDetector is a software package aimed at the identification of cross-hybridization events occurred during individual array hybridization, by using the probe sequences and the array intensity values. As output, the software provides the user with a list of array spots potentially 'corrupted' and their associated p-values calculated by Monte Carlo simulations. Graphical plots are also generated, which provide a visual and global overview of the quality of the microarray experiment with respect to cross-hybridization issues. Conclusions: CrossHybDetector is implemented as a package for the statistical computing environment R and is freely available under the LGPL license within the CRAN project.

  • Genome-scale study of the importance of binding site context for transcription factor binding and gene regulation
    Background: The rate of mRNA transcription is controlled by transcription factors that bind to specific DNA motifs in promoter regions upstream of protein coding genes. Recent results indicate that not only the presence of a motif but also motif context (for example the orientation of a motif or its location relative to the coding sequence) is important for gene regulation. Results: In this study we present ContextFinder, a tool that can identify cases where motif context is likely to affect gene regulation. We used ContextFinder to examine the role of motif context in S. cerevisiae both for DNA binding by transcription factors and effects on gene expression. For DNA binding we found significant patterns of motif location bias, whereas motif orientations did not seem to matter. Motif context appears to affect gene expression even more than it affects DNA binding, as biases in both motif location and orientation were more frequent in promoters of co-expressed genes. We validated our results against data on nucleosome positioning, and found a negative correlation between preferred motif locations and nucleosome occupancy. Conclusions: We conclude that the requirement for stable binding of transcription factors to DNA and their subsequent function in gene regulation both can impose constraints on motif context.

  • mtDNAmanager: a Web-based tool for the management and quality analysis of mitochondrial DNA control-region sequences
    Background: For the past few years, scientific controversy has surrounded the large number of errors in forensic and literature mitochondrial DNA (mtDNA) data. However, recent research has shown that using mtDNA phylogeny and referring to known mtDNA haplotypes can be useful for checking the quality of sequence data. Results: We developed a Web-based bioinformatics resource "mtDNAmanager" that offers a convenient interface supporting the management and quality analysis of mtDNA sequence data. The mtDNAmanager performs computations on mtDNA control-region sequences to estimate the most-probable mtDNA haplogroups and retrieves similar sequences from a selected database. By the phased designation of the most-probable haplogroups (both expected and estimated haplogroups), mtDNAmanager enables users to systematically detect errors whilst allowing for confirmation of the presence of clear key diagnostic mutations and accompanying mutations. The query tools of mtDNAmanager also facilitate database screening with two options of "match" and "include the queried nucleotide polymorphism". In addition, mtDNAmanager provides Web interfaces for users to manage and analyse their own data in batch mode. Conclusions: The mtDNAmanager will provide systematic routines for mtDNA sequence data management and analysis via easily accessible Web interfaces, and thus should be very useful for population, medical and forensic studies that employ mtDNA analysis. mtDNAmanager can be accessed at http://mtmanager.yonsei.ac.kr.

  • CellProfiler Analyst: data exploration and analysis software for complex image-based screens
    Background: Image-based screens can produce hundreds of measured features for each of hundreds of millions of individual cells in a single experiment. Results: Here, we describe CellProfiler Analyst, open-source software for the interactive exploration and analysis of multidimensional data, particularly data from high-throughput, image-based experiments. Conclusions: The system enables interactive data exploration for image-based screens and automated scoring of complex phenotypes that require combinations of multiple measured features per cell.

  • Gene set enrichment analysis for non-monotone association and multiple experimental categories
    Background: Recently, microarray data analyses using functional pathway information, e.g., gene set enrichment analysis (GSEA) and significance analysis of function and expression (SAFE), have gained recognition as a way to identify biological pathways/processes associated with a phenotypic endpoint. In these analyses, a local statistic is used to assess the association between the expression level of a gene and the value of a phenotypic endpoint. Then these gene-specific local statistics are combined to evaluate association for pre-selected sets of genes. Commonly used local statistics include t-statistics for binary phenotypes and correlation coefficients that assume a linear or monotone relationship between a continuous phenotype and gene expression level. Methods applicable to continuous non-monotone relationships are needed. Furthermore, for multiple experimental categories, methods that combine multiple GSEA/SAFE analyses are needed. Results: For continuous or ordinal phenotypic outcome, we propose to use as the local statistic the coefficient of multiple determination (i.e., the square of multiple correlation coefficient) $R^2$ from fitting natural cubic spline models to the phenotype-expression relationship. Next, we incorporate this association measure into the GSEA/SAFE framework to identify significant gene sets. Unsigned local statistics, signed global statistics and one-sided p-values are used to reflect our inferential interest. Furthermore, we describe a procedure for inference across multiple GSEA/SAFE analyses. We illustrate our approach using gene expression and liver injury data from liver and blood samples from rats treated with eight hepatotoxicants under multiple time and dose combinations. We set out to identify biological pathways/processes associated with liver injury as manifested by increased blood levels of alanine transaminase in common for most of the eight compounds. Potential statistical dependency resulting from the experimental design is addressed in permutation based hypothesis testing. Conclusion: The proposed framework captures both linear and non-linear association between gene expression level and a phenotypic endpoint and thus can be viewed as extending the current GSEA/SAFE methodology. The framework for combining results from multiple GSEA/SAFE analyses is flexible to address practical inference interests. Our methods can be applied to microarray data with continuous phenotypes with multi-level design or the meta-analysis of multiple microarray data sets.

  • A Web-based and Grid-enabled dChip version for the analysis of large sets of gene expression data
    Background: Microarray techniques are one of the main methods used to investigate thousands of gene expression profiles for enlightening complex biological processes responsible for serious diseases, with a great scientific impact and a wide application area. Several standalone applications had been developed in order to analyze microarray data. Two of the most known free analysis software packages are the R-based Bioconductor and dChip. The part of dChip software concerning the calculation and the analysis of gene expression has been modified to permit its execution on both cluster environments (supercomputers) and Grid infrastructures (distributed computing). This work is not aimed at replacing existing tools, but it provides researchers with a method to analyze large datasets without any hardware or software constraints. Results: An application able to perform the computation and the analysis of gene expression on large datasets has been developed using algorithms provided by dChip. Different tests have been carried out in order to validate the results and to compare the performances obtained on different infrastructures. Validation tests have been performed using a small dataset related to the comparison of HUVEC (Human Umbilical Vein Endothelial Cells) and Fibroblasts, derived from same donors, treated with IFN-alpha. Moreover performance tests have been executed just to compare performances on different environments using a large dataset including about 1000 samples related to Breast Cancer patients. Conclusions: A Grid-enabled software application for the analysis of large Microarray datasets has been proposed. DChip software has been ported on Linux platform and modified, using appropriate parallelization strategies, to permit its execution on both cluster environments and Grid infrastructures. The added value provided by the use of Grid technologies is the possibility to exploit both computational and data Grid infrastructures to analyze large datasets of distributed data. The software has been validated and performances on cluster and Grid environments have been compared obtaining quite good scalability results.

  • Extension of the COG and arCOG databases by amino acid and nucleotide sequences
    Background: The current versions of the COG and arCOG databases, both excellent frameworks for studies in comparative and functional genomics, do not contain the nucleotide sequences corresponding to their protein or protein domain entries. Results: Using sequence information obtained from GenBank flat files covering the completely sequenced genomes of the COG and arCOG databases, we constructed NUCOCOG (nucleotide sequences containing COG databases) as an extended version including all nucleotide sequences and in addition the amino acid sequences originally utilized to construct the current COG and arCOG databases. We make available three comprehensive single XML files containing the complete databases including all sequence information. In addition, we provide a web interface as a utility suitable to browse the NUCOCOG database for sequence retrieval. The database is accessible at http://www.uni-wh.de/nucocog. Conclusions: NUCOCOG offers the possibility to analyze any sequence related property in the context of the COG and arCOG framework simply by using script languages such as PERL applied to a large but single XML document.

  • RIPCAL: a tool for alignment-based analysis of repeat-induced point mutations in fungal genomic sequences
    Background: Repeat-induced point mutation (RIP) is a fungal-specific genome defence mechanism that alters the sequences of repetitive DNA, thereby inactivating coding genes. Repeated DNA sequences align between mating and meiosis and both sequences undergo C-G to T-A transitions. In most fungi these transitions preferentially affect CpA di-nucleotides thus altering the frequency of certain di-nucleotides in the affected sequences. The majority of previously published in silico analyses were limited to the comparison of ratios of pre- and post-RIP di-nucleotides in putatively RIP-affected sequences - so-called RIP indices. The analysis of RIP is significantly more informative when comparing sequence alignments of repeated sequences. There is, however, a dearth of bioinformatics tools available to the fungal research community for alignment-based RIP analysis of repeat families. Results: We present RIPCAL (www.sourceforge.net/projects/ripcal), a software tool for the automated analysis of RIP in fungal genomic DNA repeats, which performs both RIP index and alignment-based analyses. We demonstrate the ability of RIPCAL to detect RIP within known RIP-affected sequences of Neurospora crassa and other fungi. We also predict and delineate the presence of RIP in the genome of Stagonospora nodorum - a Dothideomycete pathogen of wheat. We show that RIP has affected different members of the S. nodorum rDNA tandem repeat to different extents depending on their genomic contexts. Conclusions: The RIPCAL alignment-based method has considerable advantages over RIP indices for the analysis of whole genomes. We demonstrate its application to the recently published genome assembly of S. nodorum.


Robyne Wilkerson
Our other Physiatry Related Sites by PM&R Resources R. Wilkerson