The TAO-Gen algorithm for identifying gene interaction networks with application to SOS repair in E. coli
One major unresolved issue in the analysis of gene expression data is the identification and quantification of gene regulatory networks. Several methods have been proposed for identifying gene regulatory networks, but these methods predominantly rely on multiple pairwise comparisons to identify the network structure. In this article, we describe a method for analyzing gene expression data to determine a regulatory structure consistent with an observed set of expression profiles. Unlike other methods, this method goes beyond pairwise evaluations by using likelihood-based statistical methods to obtain the network that is most consistent with the complete data set. The proposed algorithm performs accurately for moderate-sized networks, with most errors being minor additions of linkages. However, the analysis also indicates that sample sizes may need to be increased to uniquely identify even moderate-sized networks. The method is used to evaluate interactions between genes in the SOS signaling pathway in Escherichia coli using gene expression data where each gene in the network is over-expressed using plasmid inserts. Key words: gene networks, microarray, Bayesian model selection, SOS repair, toxicogenomics. Environ Health Perspect 112:1614-1621 (2004). doi:10.1289/txg.7105 available via http://dx.doi.org/ [Online 21 July 2004]
**********
Gene expression microarrays (gene chips) have revolutionized biology by generating vast amounts of data roughly quantifying the level of mRNA expression for thousands of genes in a single sample. The analysis of these data is extraordinarily complex, resulting in a shift in biology from predominantly qualitative evaluations to quantitative approaches. With microarray technologies, scientists are forming global views of the structural and dynamic changes in genome activity during different phases in a cell's development and following exposure to external stimulants such as environmental agents or growth factors. These views describe the molecular working of a complex information processing system: the living cell. Numerous methods have already been proposed for the analysis of gene expression data. The most commonly used methods rely on clustering (Eisen et al. 1995; Tamayo et al. 1999), significance testing (Kerr et al. 2000), and sequence motif identification (Pilpel et al. 2001). These methods do not readily reproduce gene expression networks but are more focused on the fundamental linkage between pairs of genes. Other investigators have proposed methods to identify gene regulatory networks using Boolean networks (Akutsu et al. 2000), where each gene has one of only two states (on and off), regression methods (Gardner et al. 2003), Bayesian network models (Friedman et al. 2000; Hartemink et al. 2002), and other methods (Johnson et al. 2004).
The use of genomics data in the evaluation of health hazards and risks has received considerable attention focusing on priority setting (Pesch et al. 2004), biomarker identification (Toraason et al. 2004), hazard identification (Surer et al. 2004), and dose-response analysis (Schonwalder and Olden 2003; Simmons and Portier 2002; Waters et al. 2003). If genomics is to play a direct role in dose-response assessment, there will be a need for methods that provide a direct, quantitative assessment of changes in gene expression as a function of dose and changes in toxicity as a function of changes in gene expression. Developing and modeling gene interaction networks can be quantitative and provide direct dose-response data for use in risk assessment. They also are an excellent means of identifying agents that provide identical changes in expression across a broad spectrum of genes and help link agents on the basis of similar mechanistic changes.
Bayesian networks are well suited for inferring genetic interactions because of their ability to model causal influence between genes linked as a network and because they are an effective method for modeling the joint density of all variables in a system. However, the approaches suggested to date have generally focused on conversion of gene expression data to discrete states and have avoided the use of formal statistical methods for quantifying the joint density of the resulting parameters.
In this article we describe a method for inferring an "optimal" gene interaction network from microarray-based gene expression data. Unlike other network identification methods, the analytical approach presented here uses the actual measured observations on gene expression (rather than discretized data) and incorporates prior distributions for all parameters in the gene interaction network model. The method encompasses model selection theory from Bayesian regression to find gene network structures suitable for given data sets. Computer simulations presented in this article demonstrate that the proposed method is capable of identifying networks, provided the sample size is sufficiently large. For small networks, the limited number of replicates used in most microarray studies available today is adequate; for larger networks, other options are discussed.
Materials and Methods
Figure 1 illustrates the general structure of a four-gene regulatory system in which the linkage between expression of gene $i$ and expression of its parents (indirect regulators to gene $i$) is described by the weighting function $w_i(n_i)$, where the subscript $i$ denotes that this weighting function pertains to the control of gene $i$ expression by all genes linked to it and $n_i$ denotes the vector of parameters defining the functional relationship. Let $N$ be a directed acyclic graph consisting of $p$ vertices (genes). Each edge is also assumed to include information about the linkage between genes (i.e., activation, as in the case of the linkage between expression of gene 1 and expression of gene 4, or suppression, as for expression of genes 3 and 4). In essence, $N$ is a discrete random variable that takes on any of the different acyclic network structures that are possible for a set of $p$ genes. Define $X_i$ to be the random variable corresponding to the measured relative level of gene expression (the expression level of a target gene in an "exposed" group relative to the expression level of the same gene in a "control" group) for gene $G_i$, $1 \le i \le p$. For a given network, $N = n$, and for each $X_i$, define the conditional density function $f_{X_i}(X_i \mid \mathrm{pa}_n(X_i), n_i)$, where $\mathrm{pa}_n(X_i)$ denotes the set of vertices corresponding to the parents of expression for gene $i$ in the network $n$ with parameters $n_i$. All networks in the support space for $N$ are assumed to satisfy the Markov property, where expression of gene $i$ is independent of all genes not included in $\mathrm{pa}_n(X_i)$. Application of the Markov property and imposition of the acyclic restriction allow decomposition of the joint density function into
$$f(X_1, X_2, \ldots, X_p \mid N = n, \mathbf{n}) = \prod_{i=1}^{p} f_{X_i}\bigl(X_i \mid \mathrm{pa}_n(X_i), n_i\bigr), \tag{1}$$
where $\mathbf{n} = (n_1, n_2, \ldots, n_p)$ is the set of all parameters in the network.
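The decomposition described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the conditional density is assumed here to be normal with a mean that is linear in the parents' expression levels, and the names `is_acyclic`, `joint_logdensity`, the parameter layout, and the four-gene example network are all hypothetical choices made for illustration.

```python
import math

def is_acyclic(parents):
    """Return True if the parent map (gene -> tuple of parent genes) has no directed cycle."""
    remaining = {gene: set(pa) for gene, pa in parents.items()}
    while remaining:
        # Genes with no unresolved parents can be placed next in the ordering.
        roots = [gene for gene, pa in remaining.items() if not pa]
        if not roots:
            return False  # every remaining gene still waits on another gene: a cycle
        for gene in roots:
            del remaining[gene]
        for pa in remaining.values():
            pa.difference_update(roots)
    return True

def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)

def joint_logdensity(x, parents, params):
    """Log of the factorized joint density: one conditional term per gene.

    ASSUMPTION: each conditional is normal with mean
    intercept + sum(weight_j * x_parent_j); the source does not fix this form here.
    """
    if not is_acyclic(parents):
        raise ValueError("parent map contains a cycle")
    total = 0.0
    for gene, pa in parents.items():
        intercept, weights, sd = params[gene]
        mean = intercept + sum(w * x[j] for w, j in zip(weights, pa))
        total += normal_logpdf(x[gene], mean, sd)
    return total

# Hypothetical four-gene network (not the network of Figure 1).
parents = {1: (), 2: (1,), 3: (2,), 4: (1, 3)}
# (intercept, one weight per parent, residual sd); a negative weight stands for suppression.
params = {
    1: (1.0, (), 1.0),
    2: (0.0, (1.0,), 1.0),
    3: (0.0, (1.0,), 1.0),
    4: (0.0, (1.0, -1.0), 1.0),
}
x = {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.0}
logf = joint_logdensity(x, parents, params)
```

Because the joint density is a product of per-gene terms, its logarithm is a sum that can be evaluated gene by gene. The acyclicity check guards the assumption that makes this factorization valid.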
Gene expression data, for the purposes of this analysis, can be expressed as a $p$ by $m$ matrix of the form $\bar{x} = [x_{ik}]$, $i = 1, 2, \ldots, p$, $k = 1, 2, \ldots, m$, where $m$ is the number of observations (samples analyzed for gene expression) taken for each gene and $\bar{x}_i = [x_{ik}]$, $k = 1, 2, \ldots, m$, is the vector of all observations of expression for gene $i$. The observed gene expression levels for the parent set for gene $i$ in vector notation is $\mathrm{pa}_n(x_i) = [x_{ij}]$, $j = 1, 2, \ldots, p_i$, $k = 1, 2, \ldots, m$, where $p_i$ is the number of parents for gene $i$. Similarly define the random vector $\bar{X}$. Then, conditional on the parameters and the model, the likelihood of the data $\bar{x}$ is given by
$$L(\bar{x} \mid N = n, \mathbf{n}) = \prod_{i=1}^{p} f_{X_i}\bigl(\bar{x}_i \mid \mathrm{pa}_n(x_i), n_i\bigr). \tag{2}$$
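As a hedged illustration of how such a likelihood could be evaluated on a $p$ by $m$ data matrix, the sketch below sums per-gene log-likelihood terms. Two assumptions are made that this section does not state: the $m$ observations are treated as independent, and each conditional density is normal with a mean linear in the parents' expression. The function names, the three-gene network, and the data values are hypothetical.

```python
import math

def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)

def gene_loglik(gene, data, parents, params):
    """Log-likelihood contribution of one gene across all m observations.

    ASSUMPTION: observations are independent, so the contribution is a sum over k.
    """
    intercept, weights, sd = params[gene]
    total = 0.0
    for k in range(len(data[gene])):
        # Mean of gene i in observation k depends only on its parents in observation k.
        mean = intercept + sum(w * data[j][k] for w, j in zip(weights, parents[gene]))
        total += normal_logpdf(data[gene][k], mean, sd)
    return total

def network_loglik(data, parents, params):
    """Log-likelihood of the whole data matrix: the sum of the per-gene terms."""
    return sum(gene_loglik(gene, data, parents, params) for gene in parents)

# Hypothetical data: p = 3 genes (rows), m = 4 observations (columns).
data = {
    1: [0.2, -0.1, 0.4, 0.0],
    2: [0.3, 0.0, 0.5, -0.2],
    3: [-0.1, 0.1, -0.3, 0.2],
}
parents = {1: (), 2: (1,), 3: (1, 2)}
params = {1: (0.0, (), 1.0), 2: (0.0, (1.0,), 0.5), 3: (0.0, (0.5, -1.0), 0.5)}
loglik = network_loglik(data, parents, params)
```

The per-gene structure means that changing the parent set of one gene only requires recomputing that gene's term, which is convenient when many candidate networks are compared.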
The goal of our analysis is the identification of the "best" network structure using gene expression data. Our criterion for the best network is defined as the network, $n^*$, from the set of all acyclic networks that maximizes the posterior probability of the network,
$$n^* = \underset{N}{\arg\max}\, \Pr(N = n \mid \bar{x}). \tag{3}$$
The posterior probability $\Pr(N = n \mid \bar{x})$ is given by
$$\Pr(N = n \mid \bar{x}) \propto \Pr(N = n) \int L(\bar{x} \mid N = n, \mathbf{n}) \prod_{i=1}^{p} f_{n_i}(n_i)\, d\mathbf{n}, \tag{4}$$
where $L(\bar{x} \mid N = n, \mathbf{n})$ is the likelihood of the data, $\Pr(N = n)$ and $f_{n_i}(n_i)$ are derived from the prior distributions of $N$ and $n_i$, respectively, and the $n_i$ are assumed independent.
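For a very small number of genes, the maximization over acyclic networks can be sketched by brute force: enumerate every acyclic structure, combine a prior with a marginal likelihood, normalize, and take the maximum. The Python sketch below is illustrative only and is not the search strategy of the article. The marginal likelihood (the integral over parameters) is replaced by a made-up stand-in, `toy_log_marginal`, the prior is assumed uniform over structures, and all function names are hypothetical.

```python
import itertools
import math

def is_acyclic(nodes, edges):
    """True if the directed edge set has no cycle (repeatedly peel off root genes)."""
    remaining = set(nodes)
    while remaining:
        # A root has no incoming edge from a gene that is still in the graph.
        roots = {v for v in remaining
                 if not any(a in remaining and b == v for a, b in edges)}
        if not roots:
            return False
        remaining -= roots
    return True

def all_dags(nodes):
    """Enumerate every acyclic edge set on the nodes (feasible only for small p)."""
    possible = [(a, b) for a in nodes for b in nodes if a != b]
    dags = []
    for size in range(len(possible) + 1):
        for edges in itertools.combinations(possible, size):
            if is_acyclic(nodes, edges):
                dags.append(frozenset(edges))
    return dags

def posterior_over_networks(dags, log_marginal, log_prior):
    """Normalize prior times marginal likelihood over the candidate networks."""
    log_post = [log_prior(g) + log_marginal(g) for g in dags]
    top = max(log_post)  # subtract the maximum for numerical stability
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_post))
    return {g: math.exp(v - log_norm) for g, v in zip(dags, log_post)}

genes = (1, 2, 3)
dags = all_dags(genes)

# Hypothetical stand-in for the log marginal likelihood: it simply penalizes
# edges that differ from an arbitrary reference network. It is NOT derived from data.
reference = frozenset({(1, 2), (2, 3)})

def toy_log_marginal(edges):
    return -2.0 * len(edges ^ reference)

def uniform_log_prior(edges):
    return -math.log(len(dags))

posterior = posterior_over_networks(dags, toy_log_marginal, uniform_log_prior)
best = max(posterior, key=posterior.get)
```

The number of acyclic structures grows very quickly with the number of genes, so exhaustive enumeration of this kind is practical only for a handful of genes. Larger networks would need a search procedure rather than a full listing.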