-
rCUR: an R package for CUR matrix decomposition
Background:
Many methods for dimensionality reduction of large data sets such as those generated in microarray studies boil down to the Singular Value Decomposition (SVD). Although singular vectors associated with the largest singular values have strong optimality properties and can often be quite useful as a tool to summarize the data, they are linear combinations of up to all of the data points, and thus it is typically quite hard to interpret those vectors in terms of the application domain from which the data are drawn. Recently, an alternative dimensionality reduction paradigm, CUR matrix decompositions, has been proposed to address this problem and has been applied to genetic and internet data. CUR decompositions are low-rank matrix decompositions that are explicitly expressed in terms of a small number of actual columns and/or actual rows of the data matrix. Since they are constructed from actual data elements, CUR decompositions are interpretable by practitioners of the field from which the data are drawn.
Results:
We present an implementation to perform CUR matrix decompositions, in the form of a freely available, open source R package called rCUR. This package will help users to perform CUR-based analysis on large-scale data, such as those obtained from different high-throughput technologies, in an interactive and exploratory manner. We show two examples that illustrate how CUR-based techniques make it possible to significantly reduce the number of probes, while at the same time maintaining major trends in the data and keeping the same classification accuracy.
Conclusions:
The package rCUR provides functions for users to perform CUR-based matrix decompositions in the R environment. In gene expression studies, it gives an additional way to analyze differential expression and select discriminant genes, based on the use of statistical leverage scores. These scores, which have been used historically in diagnostic regression analysis to identify outliers, can be used by rCUR to identify the most informative data points with respect to which to express the remaining data points.
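
As a rough illustration of the leverage scores mentioned above, the following Python sketch computes them for the rank-1 case, where the score of column j is the squared j-th entry of the top right singular vector. It is not the rCUR API: the function names `top_right_singular_vector` and `leverage_scores`, the power-iteration solver, and the toy matrix are all assumptions made for this example.

```python
import math


def top_right_singular_vector(A, iterations=200):
    """Approximate the top right singular vector of A by power iteration on A^T A."""
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # unit-norm starting vector
    for _ in range(iterations):
        # u = A v
        u = [sum(a * x for a, x in zip(row, v)) for row in A]
        # w = A^T u
        w = [sum(A[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero matrix: keep the starting vector
            break
        v = [x / norm for x in w]
    return v


def leverage_scores(A):
    """Rank-1 column leverage scores: squared entries of the top right singular vector."""
    return [x * x for x in top_right_singular_vector(A)]


# Hypothetical toy data: the first column carries most of the signal.
A = [[10.0, 1.0, 0.0],
     [10.0, 0.0, 1.0],
     [10.0, 1.0, 1.0]]
scores = leverage_scores(A)
best_column = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_column)
```

Because the singular vector has unit norm, the scores sum to one and can be read as an importance distribution over columns, from which a CUR-style procedure would pick actual columns to keep.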
-
Workflows for microarray data processing in the Kepler environment
Background:
Microarray data analysis has been the subject of extensive and ongoing pipeline development due to its complexity, the availability of several options at each analysis step, and the development of new analysis demands, including integration with new data sources. Bioinformatics pipelines are usually custom built for different applications, making them typically difficult to modify, extend and repurpose. Scientific workflow systems are intended to address these issues by providing general-purpose frameworks in which to develop and execute such pipelines. The Kepler workflow environment is a well-established system under continual development that is employed in several areas of scientific research. Kepler provides a flexible graphical interface, featuring clear display of parameter values, for design and modification of workflows. It has capabilities for developing novel computational components in the R, Python, and Java programming languages, all of which are widely used for bioinformatics algorithm development, along with capabilities for invoking external applications and using web services.
Results:
We developed a series of fully functional bioinformatics pipelines addressing common tasks in microarray processing in the Kepler workflow environment. These pipelines consist of a set of tools for GFF file processing of NimbleGen chromatin immunoprecipitation on microarray (ChIP-chip) datasets and more comprehensive workflows for Affymetrix gene expression microarray bioinformatics and basic primer design for PCR experiments, which are often used to validate microarray results. Although functional in themselves, these workflows can be easily customized, extended, or repurposed to match the needs of specific projects and are designed to be a toolkit and starting point for specific applications. These workflows illustrate a workflow programming paradigm focusing on local resources (programs and data) and therefore are close to traditional shell scripting or R/BioConductor scripting approaches to pipeline design. Finally, we suggest that microarray data processing task workflows may provide a basis for future example-based comparison of different workflow systems.
Conclusions:
We provide a set of tools and complete workflows for microarray data analysis in the Kepler environment, which has the advantages of offering graphical, clear display of conceptual steps and parameters and the ability to easily integrate other resources such as remote data and web services.
-
MELTING, a flexible platform to predict the melting temperatures of nucleic acids
Background:
Computing accurate nucleic acid melting temperatures has become a crucial step for the efficiency and the optimisation of numerous molecular biology techniques such as in situ hybridisation, PCR, antigene targeting, and microarrays. MELTING is a free open source software which computes the enthalpy, entropy and melting temperature of nucleic acids. MELTING 4.2 was able to handle several types of hybridisation such as DNA/DNA, RNA/RNA, DNA/RNA and provided corrections to melting temperatures due to the presence of sodium. The program can use either an approximative approach or a more accurate Nearest-Neighbour approach.
Results:
Two new versions of the MELTING software have been released. MELTING 4.3 is a direct update of version 4.2, integrating newly available thermodynamic parameters for inosine, a modified adenine base with a universal base capacity, and incorporating a correction for magnesium. MELTING 5 is a complete reimplementation which allows much greater flexibility and extensibility. It incorporates all the thermodynamic parameters and corrections provided in MELTING 4.x and introduces a large set of thermodynamic formulae and parameters, to facilitate the calculation of melting temperatures for perfectly matching sequences, mismatches, bulge loops, CNG repeats, dangling ends, inosines, locked nucleic acids, 2-hydroxyadenines and azobenzenes. It also includes temperature corrections for monovalent ions (sodium, potassium, tris), magnesium ions and commonly used denaturing agents such as formamide and DMSO.
Conclusions:
MELTING is a useful and very flexible tool for predicting melting temperatures using approximative formulae or Nearest-Neighbour approaches, where one can select different sets of Nearest-Neighbour parameters, corrections and formulae. Both versions are freely available at http://sourceforge.net/projects/melting/ and at http://www.ebi.ac.uk/compneur-srv/melting/ under the terms of the GPL license.
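
To give a feel for what an approximative melting temperature formula looks like, here is a minimal Python sketch of the textbook Wallace rule for short DNA oligonucleotides, Tm = 2 (A + T) + 4 (G + C) in degrees Celsius. This rule is used here as a stand-in and is not claimed to be one of the formulae shipped with MELTING; the function name `wallace_tm` and the example sequence are assumptions for illustration, and no salt or mismatch corrections are applied.

```python
def wallace_tm(sequence):
    """Estimate the melting temperature (deg C) of a short DNA oligo with the Wallace rule."""
    seq = sequence.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("expected a non-empty DNA sequence of A, C, G, T")
    at_count = seq.count("A") + seq.count("T")
    gc_count = seq.count("G") + seq.count("C")
    # 2 degrees per A/T base, 4 degrees per G/C base
    return 2 * at_count + 4 * gc_count


# Hypothetical 14-mer: 8 A/T bases and 6 G/C bases.
tm = wallace_tm("ATGCATGCATGCAT")
print(tm)
```

A Nearest-Neighbour calculation would instead sum enthalpy and entropy contributions over adjacent base pairs, which is why it needs parameter tables rather than simple base counts.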
-
Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data
Background:
As Next-Generation Sequencing data become available, existing hardware environments do not provide sufficient storage space and computational power to store and process the data because of their enormous size. This is and will remain a frequent problem, encountered every day by researchers working on genetic data. Some options are available for compressing and storing such data, such as general-purpose compression software and the PBAT/PLINK binary format. However, these currently available methods either do not offer sufficient compression rates, or require a great amount of CPU time for decompression and loading every time the data are accessed.
Results:
Here, we propose a novel and simple algorithm for storing such sequencing data. We show that the compression factor of the algorithm ranges from 16 to several hundreds, which potentially allows SNP data of hundreds of Gigabytes to be stored in hundreds of Megabytes. We provide a C++ implementation of the algorithm, which supports direct loading and parallel loading of the compressed format without requiring extra time for decompression. By applying the algorithm to simulated and real datasets, we show that the algorithm gives a greater compression rate than the commonly used compression methods, and that the data-loading process takes less time. The C++ library also provides direct data-retrieval functions, which allow the compressed information to be easily accessed by other C++ programs.
Conclusions:
The SpeedGene algorithm enables the storage and analysis of next-generation sequencing data in current hardware environments, making system upgrades unnecessary.
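
To make the idea of compact genotype storage with direct access concrete, here is a minimal Python sketch of generic 2-bit packing: each genotype code (0, 1 or 2 copies of an allele, with 3 reserved here for missing) takes two bits, so four genotypes share one byte. This layout is an assumption chosen for illustration and is not the SpeedGene format or its C++ library; the function names are hypothetical.

```python
def pack_genotypes(genotypes):
    """Pack genotype codes (0, 1, 2, or 3 for missing) into bytes, four codes per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, code in enumerate(genotypes):
        if code not in (0, 1, 2, 3):
            raise ValueError(f"invalid genotype code at position {i}: {code!r}")
        # Position i lives in byte i // 4, at bit offset 2 * (i % 4).
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_genotype(packed, i):
    """Read the genotype at position i directly, without unpacking the rest."""
    return (packed[i // 4] >> (2 * (i % 4))) & 0b11


# Hypothetical genotypes for five individuals at one SNP.
genotypes = [0, 1, 2, 3, 2]
packed = pack_genotypes(genotypes)
restored = [unpack_genotype(packed, i) for i in range(len(genotypes))]
print(len(packed), restored)
```

The point of the sketch is the access pattern: any single genotype can be read with one index computation and a bit mask, so loading does not require a separate decompression pass.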
-
MetaMapp: mapping and visualizing metabolomic data by integrating information from biochemical pathways and chemical and mass spectral similarity
Background:
Exposure to environmental tobacco smoke (ETS) leads to higher rates of pulmonary diseases and infections in children. To study the biochemical changes that may precede lung diseases, metabolomic effects on fetal and maternal lungs and plasma from rats exposed to ETS were compared to filtered air control animals. Genome-reconstructed metabolic pathways may be used to map and interpret dysregulation in metabolic networks. However, mass spectrometry-based non-targeted metabolomics datasets often comprise many metabolites for which links to enzymatic reactions have not yet been reported. Hence, network visualizations that rely on current biochemical databases are incomplete and also fail to visualize novel, structurally unidentified metabolites.
Results:
We present a novel approach to integrate biochemical pathway and chemical relationships to map all detected metabolites in network graphs (MetaMapp) using the KEGG reactant pair database, Tanimoto chemical similarity scores and NIST mass spectral similarity scores. In fetal and maternal lungs, and in maternal blood plasma from pregnant rats exposed to environmental tobacco smoke (ETS), 459 unique metabolites comprising 179 structurally identified compounds were detected by gas chromatography time of flight mass spectrometry (GC-TOF MS) and BinBase data processing. MetaMapp graphs in Cytoscape showed much clearer metabolic modularity and complete content visualization compared to conventional biochemical mapping approaches. Cytoscape visualization of differential statistics results using these graphs showed that overall, fetal lung metabolism was more impaired than lung and blood metabolism in dams. Fetuses from ETS-exposed dams expressed lower lipid and nucleotide levels and higher amounts of energy metabolism intermediates than control animals, indicating lower biosynthetic rates of metabolites for cell division, structural proteins and lipids that are critical for lung development.
Conclusion:
MetaMapp efficiently visualizes mass spectrometry-based metabolomics datasets as network graphs in Cytoscape, and highlights metabolic alterations that can be associated with higher rates of pulmonary diseases and infections in children prenatally exposed to ETS. The MetaMapp scripts can be accessed at http://metamapp.fiehnlab.ucdavis.edu.
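
The chemical-similarity part of such a network can be sketched in a few lines of Python. The example below computes the Tanimoto coefficient between fingerprints represented as sets of "on" bit positions and keeps pairs above a cutoff as graph edges. It is a minimal sketch, not the MetaMapp scripts: the function names, the compound labels, the fingerprints and the 0.5 cutoff are all assumptions made for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:  # two empty fingerprints: define similarity as 0
        return 0.0
    return len(a & b) / len(union)


def similarity_edges(fingerprints, threshold):
    """Return (name1, name2, score) for every pair scoring at or above the threshold."""
    names = sorted(fingerprints)
    edges = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = tanimoto(fingerprints[first], fingerprints[second])
            if score >= threshold:
                edges.append((first, second, score))
    return edges


# Hypothetical fingerprints for three compounds.
fingerprints = {
    "m1": {1, 2, 3, 4},
    "m2": {1, 2, 3, 5},
    "m3": {7, 8},
}
edges = similarity_edges(fingerprints, threshold=0.5)
print(edges)
```

An edge list of this shape can be combined with reaction-pair edges and loaded into a graph viewer, so that compounds without known enzymatic links still attach to structurally similar neighbours.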
-
The Partitioned LASSO-Patternsearch Algorithm with Application to Gene Expression Data
Background:
In systems biology, the task of reverse engineering gene pathways from data has been limited not just by the curse of dimensionality (the interaction space is huge) but also by systematic error in the data. The gene expression barcode reduces spurious association driven by batch effects and probe effects. The binary nature of the resulting expression calls lends itself well to modern regularization approaches that thrive on high dimensionality.
Results:
The Partitioned LASSO-Patternsearch algorithm is proposed to identify patterns of multiple dichotomous risk factors for outcomes of interest in genomic studies. A partitioning scheme is used to identify promising patterns by solving many LASSO-Patternsearch subproblems in parallel. All variables that survive this stage proceed to an aggregation stage where the most significant patterns are identified by solving a reduced LASSO-Patternsearch problem in just these variables. This approach was applied to genetic data sets with expression levels dichotomized by the gene expression barcode. Most of the genes and second-order interactions thus selected are known to be related to the outcomes.
Conclusions:
We demonstrate with simulations and data analyses that the proposed method not only selects variables and patterns more accurately, but also provides smaller models with better prediction accuracy, in comparison to several competing methodologies.
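
As a small illustration of what "patterns of dichotomous risk factors" can mean in practice, the Python sketch below expands a binary vector into main effects and second-order interaction indicators, each equal to 1 only when all variables in the pattern are 1. This covers only the feature construction, under the assumption that patterns are products of binary variables; the L1-penalized fitting, the partitioning and the aggregation stage are not shown, and the function name `pattern_features` is hypothetical.

```python
from itertools import combinations


def pattern_features(x, max_order=2):
    """Map a 0/1 vector to indicators for every variable subset up to max_order.

    Keys are tuples of variable indices; a value is 1 only if all of those
    variables equal 1 in x.
    """
    features = {}
    for order in range(1, max_order + 1):
        for indices in combinations(range(len(x)), order):
            features[indices] = int(all(x[j] == 1 for j in indices))
    return features


# Hypothetical dichotomized expression calls for three genes.
features = pattern_features([1, 0, 1])
print(features)
```

With p variables this produces p main effects plus p(p-1)/2 pairwise patterns, which shows why the candidate space grows quickly and why splitting the search into smaller subproblems is attractive.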
-
Exploration of multivariate analysis in microbial coding sequence modeling
Background:
Gene finding is a complicated procedure that encapsulates algorithms for coding sequence modeling, identification of promoter regions, issues concerning overlapping genes and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identification and prediction of the actual coding sequences from genomic DNA. In this respect, we promote a novel multivariate method known as Canonical Powered Partial Least Squares (CPPLS) as an alternative to the commonly used Interpolated Markov model (IMM). Comparisons between the methods were performed on DNA, codon and protein sequences with highly conserved genes taken from several species with different genomic properties.
Results:
The multivariate CPPLS approach classified coding sequence substantially better than the commonly used IMM on the same set of sequences. We also found that the use of CPPLS with codon representation gave significantly better classification results than both IMM with protein (p < 0.001) and with DNA (p < 0.001). Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than for IMM (p < 0.001).
Conclusions:
The performance of coding sequence modeling can be substantially improved by using an algorithm based on the multivariate CPPLS method applied to codon or DNA frequencies.
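
The codon frequencies referred to above are straightforward to compute. The following Python sketch counts non-overlapping codons in reading frame 0 and normalizes them to relative frequencies, giving the kind of feature vector a multivariate classifier could take as input. It illustrates the representation only, not the CPPLS method; the function name `codon_frequencies` and the example sequence are assumptions.

```python
from collections import Counter


def codon_frequencies(sequence):
    """Relative frequencies of non-overlapping codons in frame 0 of a DNA sequence."""
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        return {}
    counts = Counter(codons)
    total = len(codons)
    return {codon: count / total for codon, count in counts.items()}


# Hypothetical four-codon sequence: ATG GCC ATG TAA.
freqs = codon_frequencies("ATGGCCATGTAA")
print(freqs)
```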
-
SIS: a program to generate draft genome sequence scaffolds for prokaryotes
Background:
Decreasing costs of DNA sequencing have made prokaryotic draft genome sequences increasingly common. A contig scaffold is an ordering of contigs in the correct orientation. A scaffold can help genome comparisons and guide gap closure efforts. One popular technique for obtaining contig scaffolds is to map contigs onto a reference genome. However, rearrangements that may exist between the query and reference genomes may result in incorrect scaffolds, if these rearrangements are not taken into account. Large-scale inversions are common rearrangement events in prokaryotic genomes. Even in draft genomes it is possible to detect the presence of inversions given sufficient sequencing coverage and a sufficiently close reference genome.
Results:
We present a linear-time algorithm that can generate a set of contig scaffolds for a draft genome sequence represented in contigs given a reference genome. The algorithm is aimed at prokaryotic genomes and relies on the presence of matching sequence patterns between the query and reference genomes that can be interpreted as the result of large-scale inversions; we call these patterns inversion signatures. Our algorithm is capable of correctly generating a scaffold if at least one member of every inversion signature pair is present in contigs and no inversion signatures have been overwritten in evolution. The algorithm is also capable of generating scaffolds in the presence of any kind of inversion, even though in this general case there is no guarantee that all scaffolds in the scaffold set will be correct. We compare the performance of SIS, the program that implements the algorithm, to seven other scaffold-generating programs. The results of our tests show that SIS has overall better performance.
Conclusions:
SIS is a new easy-to-use tool to generate contig scaffolds, available both as a stand-alone program and as a web server. The good performance of SIS in our tests adds evidence that large-scale inversions are widespread in prokaryotic genomes.
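
For orientation, the plain reference-mapping baseline described in the Background can be sketched in Python as follows: each contig has one best hit on the reference, and the scaffold is the list of contigs sorted by hit position together with their strand. This is the simple approach that the abstract says can produce incorrect scaffolds when rearrangements are present; it is not the SIS algorithm and does not detect inversion signatures. The tuple layout, contig names and coordinates are hypothetical.

```python
def order_contigs(alignments):
    """Order contigs by reference start position, keeping their orientation.

    alignments: iterable of (contig_name, reference_start, strand),
    where strand is '+' or '-'.
    """
    for name, _, strand in alignments:
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand for {name}: {strand!r}")
    ordered = sorted(alignments, key=lambda hit: hit[1])
    return [(name, strand) for name, _, strand in ordered]


# Hypothetical best hits of three contigs on a reference genome.
alignments = [("c3", 9000, "+"), ("c1", 100, "+"), ("c2", 4000, "-")]
scaffold = order_contigs(alignments)
print(scaffold)
```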
-
3DMolNavi: A web-based retrieval and navigation tool for flexible molecular shape comparison
Background:
Many molecules of interest are flexible and undergo significant shape deformation as part of their function, but most existing methods of molecular shape comparison treat them as rigid shapes, which may lead to an incorrect measure of the shape similarity of flexible molecules. Currently, there is still only limited work on retrieval and navigation for flexible molecular shape comparison, which would improve data retrieval by helping users locate the desired molecule in a convenient way.
Results:
To address this issue, we develop a web-based retrieval and navigation tool, named 3DMolNavi, for flexible molecular shape comparison. This tool is based on the histogram of Inner Distance Shape Signature (IDSS) for fast retrieval of molecules that are similar to a query molecule, and uses dimensionality reduction to navigate the retrieved results in 2D and 3D spaces. We tested 3DMolNavi in the Database of Macromolecular Movements (MolMovDB) and CATH. Compared to other shape descriptors, it achieves good performance and retrieval results for different classes of flexible molecules.
Conclusions:
The advantages of 3DMolNavi over other existing software are that it integrates retrieval for flexible molecular shape comparison and enhances navigation for user interaction. 3DMolNavi can be accessed via https://engineering.purdue.edu/PRECISE/3dmolnavi/index.html.
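
The general idea of a histogram-based shape signature can be sketched in Python as below: pairwise distances between sample points are binned into a normalized histogram, and two shapes are compared by the L1 distance between their histograms. In this sketch plain Euclidean distances stand in for the inner distances used by IDSS, so it is a simplification and not the 3DMolNavi descriptor; the function names, bin count and point sets are assumptions.

```python
import math
from itertools import combinations


def distance_histogram(points, bins, max_distance):
    """Normalized histogram of pairwise Euclidean distances between points."""
    if bins < 1 or max_distance <= 0:
        raise ValueError("bins must be >= 1 and max_distance must be positive")
    distances = [math.dist(p, q) for p, q in combinations(points, 2)]
    if not distances:
        return [0.0] * bins
    width = max_distance / bins
    counts = [0] * bins
    for d in distances:
        # Distances at or beyond max_distance fall into the last bin.
        counts[min(int(d / width), bins - 1)] += 1
    return [c / len(distances) for c in counts]


def histogram_distance(h1, h2):
    """L1 distance between two histograms with the same number of bins."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


# Hypothetical point sets: a unit square, the same square translated, and a flat rectangle.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
moved_square = [(5, 5), (6, 5), (6, 6), (5, 6)]
rectangle = [(0, 0), (2, 0), (2, 0.5), (0, 0.5)]

h_square = distance_histogram(square, bins=4, max_distance=2.0)
h_moved = distance_histogram(moved_square, bins=4, max_distance=2.0)
h_rect = distance_histogram(rectangle, bins=4, max_distance=2.0)
print(histogram_distance(h_square, h_moved), histogram_distance(h_square, h_rect))
```

Because the signature depends only on distances between points, it is unchanged by translation and rotation, which is what makes histogram comparison usable for fast retrieval.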
-
An in silico platform for the design of heterologous pathways in nonnative metabolite production
Background:
Microorganisms are used as cell factories to produce valuable compounds in pharmaceuticals, biofuels, and other industrial processes. Incorporating heterologous metabolic pathways into well-characterized hosts is a major strategy for obtaining these target metabolites and improving productivity. However, selecting appropriate heterologous metabolic pathways for a host microorganism remains difficult owing to the complexity of metabolic networks. Hence, metabolic network design could benefit greatly from the availability of an in silico platform for heterologous pathway searching.
Results:
We developed an algorithm for finding feasible heterologous pathways by which nonnative target metabolites are produced by host microorganisms, using Escherichia coli, Corynebacterium glutamicum, and Saccharomyces cerevisiae as templates. Using this algorithm, we screened heterologous pathways for the production of all possible nonnative target metabolites contained within databases. We then assessed the feasibility of the target productions using flux balance analysis, by which we could identify target metabolites associated with maximum cellular growth rate.
Conclusions:
This in silico platform, designed for targeted searching of heterologous metabolic reactions, provides essential information for cell factory improvement.
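
A pathway search of this general kind can be sketched in Python as a forward expansion over a reaction set: starting from the metabolites native to the host, reactions whose substrates are all available are fired until the target appears, and the reactions actually needed are then traced back from the target. This is a generic sketch under simplifying assumptions (no stoichiometry, no cofactor handling, no flux balance analysis, which would need a linear programming solver); it is not the authors' algorithm, and the function name, reaction names and metabolites are hypothetical.

```python
def find_heterologous_pathway(native_metabolites, reactions, target):
    """Return reactions needed to reach target from native metabolites, or None.

    reactions: dict mapping a reaction name to (substrates, products),
    each an iterable of metabolite names.
    """
    available = set(native_metabolites)
    producer = {}  # metabolite -> reaction that first produced it
    fired = []     # reactions in the order they became feasible
    changed = True
    while changed and target not in available:
        changed = False
        for name, (substrates, products) in reactions.items():
            new_products = set(products) - available
            if new_products and set(substrates) <= available:
                fired.append(name)
                for metabolite in new_products:
                    producer[metabolite] = name
                available |= new_products
                changed = True
    if target not in available:
        return None
    # Trace back from the target to collect only the reactions that matter.
    needed = set()
    stack = [target]
    while stack:
        name = producer.get(stack.pop())
        if name is not None and name not in needed:
            needed.add(name)
            stack.extend(reactions[name][0])
    return [name for name in fired if name in needed]


# Hypothetical toy network: the host makes A; candidate reactions extend it.
reactions = {
    "R1": (["A"], ["B"]),
    "R2": (["B"], ["C"]),
    "R3": (["X"], ["D"]),
}
pathway = find_heterologous_pathway({"A"}, reactions, "C")
unreachable = find_heterologous_pathway({"A"}, reactions, "D")
print(pathway, unreachable)
```

Returning the reactions in firing order guarantees that each listed reaction only consumes metabolites that are native or produced earlier in the list.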