# Math Programming Formulations in Genomics and Proteomics

Computational biology problems generally involve the determination of discrete structures over biological configurations determined by genomic or proteomic data. Such problems present great opportunities for application of mathematical programming techniques. We give an overview of formulations employed for the solution of problems in genomics and proteomics. In particular, we discuss mathematical programming formulations for string comparison and selection problems, with high applicability in biological data processing.

# Introduction

Problems in genomics and proteomics are among the most difficult in computational biology. They usually arise from the task of determining combinatorial properties of biological material. Researchers in this area are interested in comparing, understanding the structure, finding similarities, and discovering patterns in genomic and proteomic sequences.

Genomic and proteomic data is composed of a sequence of elements, which can be thought of as being part of an alphabet. A sequence of such elements will be called a *string*. The strings of genetic material (e.g., DNA or RNA) encode “instructions” to produce the proteins that regulate the life of organisms. Proteins themselves can be mathematically modeled as sequences over an alphabet of 20 characters (representing the available amino-acids).

## Advances in Proteomics

In the last few years, advances in biology allowed the study of such sequences of data to move from pure biological science to other areas as well. This was made possible due to the opportunity of analyzing genomic and proteomic material using mathematical models.

The analysis of biological sequences gives rise to interesting and difficult combinatorial problems. As many of these problems are NP-hard, the study of improved techniques is necessary in order to solve problems exactly (whenever possible), or at least with some guarantee of solution quality.

## Problems Discussed

We discuss problems related to configurations of genomic and proteomic sequences. Our interest is in mathematical programming formulations, generally involving integer linear programming (ILP) models. The importance of this study steams from the fact that knowing the mathematical programming properties for a problem usually makes it easier to derive exact as well approximation solution techniques. Another reason is that ILP models can be solved automatically, using standard algorithms and commercial packages for integer programming, and therefore the knowledge of better mathematical models may improve the efficiency of such techniques for the specific problem in hand. More information about applications of optimization in biology can be viewed in the surveys [GHL04,P98].

## Paper Organization

This paper is organized as follows. In Section 2 we discuss some string comparison problems, including the closest and farthest string, as well as the closest and farthest substring problems. In Section3 the protein structure prediction problem is discussed. Sorting by reversals is an interesting problem related to the comparison of gene sequences in different species, presented in Section 4. Section 5 discusses the important task of identifying biological agents in a sample. Integer programming is also very useful in the design of probes for the study of DNA. An example is the minimum cost probe set problem, presented in Section 6. Finally, in Section 7 we present some concluding remarks.