# Interactive Optimization of Signal-to-Noise Ratios for Affymetrix Microarray Projects

##### Jinwook Seo, Marina Bakay, Yi-Wen Chen, Sara Hilmer, Ben Shneiderman, and Eric P Hoffman / 2004

A novel method to choose the most appropriate probe set signal algorithm for your Affy project

The most commonly utilized microarrays for mRNA profiling (Affymetrix) use probe sets, each consisting of a series of perfect match and mismatch probes (typically 22 oligonucleotides per probe set). There is an increasing number of reported probe set algorithms that differ in how they interpret a probe set to derive a single normalized "signal" representing the expression of each mRNA. These algorithms are known to differ in accuracy and sensitivity, and optimization has been done using a small set of standardized control microarray data.

We hypothesized that different mRNA profiling projects have varying sources and degrees of confounding noise, and that these should alter the **choice of a specific probe set algorithm**. We also hypothesized that **use of the Microarray Suite (MAS) 5.0 probe set detection p value as a weighting function** would improve the performance of all probe set algorithms.

**Permutation Study Framework using Unsupervised Clustering in HCE2W** (the improved version of the Hierarchical Clustering Explorer 2.0 with p-value weighting and F-measure). The inputs to the Hierarchical Clustering Explorer are two files: a signal data file and a p-value file. Each column of the two input files holds the values for one sample (or chip), and the known target biological group index is assigned to each column of the signal data file. Success is measured using the F-measure of a dendrogram against the known biological grouping.

Note: HCE 3.0 is a newer version that has all the functions of HCE2W.

## How to prepare input files

You have to prepare two files, **probe set signal file** and **probe set detection p-value file**, for each probe set signal algorithm (e.g., MAS5, dChip, or RMA). As you can see in the figure, you can use the probe set detection p-value file from MAS5 for all other signal files generated by probe set signal algorithms other than MAS5.

#### File names

The two files should be in the same folder. The detection p-value file name should carry the pvl extension, as in the following examples.

1. Using Excel files

   If the signal file name is mah-mas5.xls, the detection p-value file name should be mah-mas5.pvl.xls.

2. Using tab delimited text files

   If the signal file name is mah-mas5.exp, the detection p-value file name should be mah-mas5.pvl.
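The naming rule above can be sketched as a small helper. This is only an illustration of the convention described on this page, not part of HCE; the function name `pvalue_file_for` is hypothetical.

```python
from pathlib import Path


def pvalue_file_for(signal_file: str) -> Path:
    """Return the expected detection p-value file path for a signal file.

    Hypothetical helper: it only encodes the two naming examples above.
    """
    path = Path(signal_file)
    suffix = path.suffix.lower()
    if suffix == ".xls":
        # Excel: mah-mas5.xls -> mah-mas5.pvl.xls (same folder)
        return path.with_name(path.stem + ".pvl.xls")
    if suffix == ".exp":
        # Tab delimited text: mah-mas5.exp -> mah-mas5.pvl (same folder)
        return path.with_suffix(".pvl")
    raise ValueError(f"unsupported signal file extension: {path.suffix!r}")
```

For example, `pvalue_file_for("mah-mas5.exp")` yields a path named `mah-mas5.pvl` in the same folder as the signal file.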

#### Format of the input files

- Basically, each row is a gene (or a probe set), and each column is a chip (or a sample).
- The order of rows and columns should be the same in both signal file and p-value file.
- As shown in the following figure, each column can have a class ID that represents the known biological group of the sample.
- Please note that the second row of the spreadsheet, named "CID", holds the class IDs for the samples.
- The class ID should be numeric starting from 1 to N (the number of different biological groups). Of course, samples from the same biological group should have the same class ID. For example, the first 5 samples of the following example belong to the same biological group. (They were all taken from the same mouse at the same time point).
- CID should not be in the p-value file.
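The format rules above can be checked before loading the files into HCE. The following is a minimal sketch that works on values already read from the two files; the function names are hypothetical, and how you parse the files (for example, which row holds the sample names) is left to your own reader.

```python
def check_class_ids(class_ids):
    """Validate the CID row: integer class IDs that cover 1..N with no gaps.

    Returns N, the number of different biological groups.
    """
    ids = [int(c) for c in class_ids]
    if not ids:
        raise ValueError("the CID row is empty")
    n_groups = max(ids)
    missing = sorted(set(range(1, n_groups + 1)) - set(ids))
    if min(ids) < 1 or missing:
        raise ValueError(
            f"class IDs must run from 1 to N; missing {missing}, minimum {min(ids)}"
        )
    return n_groups


def check_same_layout(signal_rows, signal_cols, pvalue_rows, pvalue_cols):
    """Check that both files list probe sets and chips in the same order."""
    if list(signal_rows) != list(pvalue_rows):
        raise ValueError("probe set (row) order differs between the two files")
    if list(signal_cols) != list(pvalue_cols):
        raise ValueError("chip (column) order differs between the two files")
```

Here `signal_rows` and `pvalue_rows` are the probe set names in file order, and `signal_cols` and `pvalue_cols` are the sample names in file order.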

Example: Please take a close look at these small example input files (mah-mas5-small.exp and mas-mas5-small.pvl) in mah-mas5-small.zip. There are 4808 probe sets and 40 chips. The data were filtered from the PGA Murine Airway Hyperresponsiveness project using a very stringent present call filter.

#### Probe set signal file

#### Probe set detection p-value file (generated by MAS5)

Please note that the order of rows and columns is the same as in the signal file.

## To use continuous MAS 5 probe set detection p-value as a noise filter

Affymetrix noise calculations give us two outputs: one is the continuous detection p-value, and the other is a simple detection call (present/absent). Each signal intensity value has a confidence factor, the detection p-value, which contributes to determining the detection call for the corresponding probe set. When the probe set detection p-value reaches a certain level of significance, the probe set is assigned a "present" call, while all probe sets with less robust signal/noise ratios are assigned an "absent" call (more detail is available at Affymetrix.com; login required). This enables the use of a present call threshold noise filter. We reported that a 10% present call noise filter did improve the performance of probe set signal algorithms. While such present call-based filtering improves performance, it relies on an arbitrary threshold, so potentially important signals conveyed by the filtered-out probe sets may be lost.
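A present call threshold filter of the kind described above might look like the following sketch. It rests on two assumptions that are not spelled out on this page: that a "10% present call filter" keeps probe sets called present on at least 10% of the chips, and that a call is "present" when the detection p-value falls below a cutoff `alpha`. The default of 0.04 is only illustrative; use the cutoff configured in your own MAS 5.0 analysis.

```python
def present_call_filter(pvalues, alpha=0.04, min_present_fraction=0.10):
    """Return the row indices of probe sets kept by a present call filter.

    pvalues: one list per probe set, holding the detection p-value on each chip.
    alpha: assumed p-value cutoff below which a call counts as "present".
    min_present_fraction: assumed minimum fraction of chips with a present call.
    """
    kept = []
    for row_index, row in enumerate(pvalues):
        present = sum(1 for p in row if p < alpha)
        if present / len(row) >= min_present_fraction:
            kept.append(row_index)
    return kept
```

Note that this is a hard cutoff: a probe set just under the threshold is dropped entirely, which is the limitation that motivates using the continuous p-values as weights instead.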

There are many possible similarity measures for unsupervised clustering methods, and it is also possible to develop weighted versions of most similarity measures. For example, we can derive a weighted Pearson correlation coefficient from the Pearson correlation coefficient, which has been widely used in microarray analysis. Let $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ be the vectors representing two arrays to be compared (these values are prepared in the .exp or .xls files), and let $p(x) = (p(x_1), \ldots, p(x_n))$ and $p(y) = (p(y_1), \ldots, p(y_n))$ be the vectors representing the continuous probe set detection p-values for $x$ and $y$ respectively (these p-values are prepared in the .pvl or .pvl.xls files). The weighted Pearson correlation coefficient is then computed from these four vectors.

We use the complement of the detection p-value to calculate the weight for each term, since the smaller the p-value is, the more significant the signal is. Other similarity measures such as Euclidean distance, Manhattan distance, and the cosine coefficient can be extended to weighted versions in a similar way. In HCE, we can check the option checkbox (highlighted with a red oval in the following figure) to use the MAS 5.0 detection p-values as weights for distance/similarity measures.
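To make the idea concrete, here is one possible weighted Pearson correlation in Python. The exact weighting used in HCE is not reproduced on this page, so this sketch assumes that the weight of term *i* is the product of the two complements, (1 - p(x_i)) * (1 - p(y_i)), and that the means are weighted in the same way. Treat it as an illustration of the approach rather than HCE's implementation.

```python
import math


def weighted_pearson(x, y, px, py):
    """Weighted Pearson correlation of two arrays x and y.

    px, py: detection p-values for x and y, in the same probe set order.
    Assumption: weight_i = (1 - px_i) * (1 - py_i).
    """
    w = [(1.0 - a) * (1.0 - b) for a, b in zip(px, py)]
    total = sum(w)
    if total == 0:
        raise ValueError("all weights are zero")
    # Weighted means, so that low-confidence values barely shift the center.
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    if var_x == 0 or var_y == 0:
        raise ValueError("weighted variance is zero")
    return cov / math.sqrt(var_x * var_y)
```

When every detection p-value is 0, all weights equal 1 and the result reduces to the ordinary Pearson correlation coefficient. A value with a detection p-value of 1 gets zero weight and does not influence the result at all.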

## To use F-measure for evaluating unsupervised hierarchical clustering results

We applied F-measure to the entire hierarchical structure of clustering results and also to the set of clusters determined by the minimum similarity threshold in HCE2W. Let $C_1, \ldots, C_i, \ldots, C_n$ be the right clusters according to the target biological variable. Let $HC_1, \ldots, HC_j, \ldots, HC_m$ be the clusters from the hierarchical clustering results. In F-measure, each cluster is considered a query and each class (or each correct cluster) is considered the correct answer of the query. The F-measure of a correct cluster $C_i$ (or a class) and an actual cluster $HC_j$ is defined as follows:

$$F(C_i, HC_j) = \frac{2 \, P(i,j) \, R(i,j)}{P(i,j) + R(i,j)}$$

The precision values $P(i,j)$ and recall values $R(i,j)$ are defined by the information retrieval concepts:

$$P(i,j) = \frac{|C_i \cap HC_j|}{|HC_j|}, \qquad R(i,j) = \frac{|C_i \cap HC_j|}{|C_i|}$$

The F-measure of a class $C_i$ is given by

$$F(C_i) = \max_{j} F(C_i, HC_j)$$

Finally, the F-measure of the entire clustering result is given by

$$F = \sum_{i=1}^{n} \frac{|C_i|}{N} \, F(C_i)$$

where $N$ is the total number of arrays in the experiment.

The F-measure score is between 0 and 1. The higher the F-measure score is, the better the clustering result is. When we calculate the F-measure for the entire cluster hierarchy, for each external class we traverse the hierarchy recursively and consider each subtree as a cluster. Then the F-measure for an external class is the maximum of F-measures for all subtrees.
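The traversal described above can be sketched in a few lines of Python. The dendrogram representation is an assumption made for this example: a leaf is a sample index, and an internal node is a tuple of child nodes. Single leaves are also counted as subtrees here, which is likewise an assumption. The function names are hypothetical.

```python
def subtree_leaf_sets(tree):
    """Collect the set of sample indices under every subtree of the dendrogram."""
    found = []

    def walk(node):
        if isinstance(node, tuple):
            leaves = set()
            for child in node:
                leaves |= walk(child)
        else:
            leaves = {node}
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def f_measure(class_members, cluster):
    """Harmonic mean of precision and recall of one cluster for one class."""
    overlap = len(class_members & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(class_members)
    return 2 * precision * recall / (precision + recall)


def hierarchy_f_measure(tree, class_ids):
    """Overall F-measure: class-size-weighted sum of each class's best subtree score.

    class_ids[k] is the class ID of sample k.
    """
    clusters = subtree_leaf_sets(tree)
    classes = {}
    for sample, cid in enumerate(class_ids):
        classes.setdefault(cid, set()).add(sample)
    n = len(class_ids)
    return sum(
        len(members) / n * max(f_measure(members, c) for c in clusters)
        for members in classes.values()
    )
```

A dendrogram that groups each class under its own subtree, such as `((0, 1), (2, 3))` with class IDs `[1, 1, 2, 2]`, scores 1. Mixing the classes, as in `((0, 2), (1, 3))`, lowers the score.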

In the final clustering result visualization, each sample name is color-coded by its biological class as shown in the figure at the top. Overall F-measure is highlighted with a pink oval. The F-measure distribution is shown, as the distance from the left side, over the dendrogram display as indicated by an arrow mark.

## A Permutation Study Result

(Two large novel microarray data sets, with and without detection p-value weighting, five probe set signal algorithms)

We used HCE 3.0 (HCE2W) to test and define parameters in Affymetrix analyses that optimize the ratio of signal (desired biological variable) versus noise (confounding uncontrolled variables). Five probe set algorithms were studied with and without statistical weighting of probe sets using the Microarray Suite (MAS) 5.0 probe set detection p values. The signal/noise optimization method was tested on two large novel microarray data sets with different levels of confounding noise: a 105-sample U133A human muscle biopsy data set (11 groups; mutation-defined; extensive noise), and a 40-sample U74A inbred mouse lung data set (8 groups; little noise). Performance was measured by the ability of the specific probe set algorithm, with and without detection p value weighting, to cluster samples into the appropriate biological groups (unsupervised agglomerative clustering with F-measure values).

Probe set detection p-value weighting had the greatest positive effect on the performance of the dChip difference model, ProbeProfiler, and RMA algorithms. Importantly, probe set algorithms did indeed perform differently depending on the specific project, likely due to the degree of confounding noise. Our data indicate that significantly improved data analysis of mRNA profiling projects can be achieved by matching the choice of probe set algorithm to the noise levels intrinsic to a project.

- All probe set signal methods performed better with the less noisy data set (inbred mouse lung) than with the noisy data set (human muscle biopsy).

- A noise filter using the continuous probe set detection p-value improved the performance of the dChip difference model, ProbeProfiler, and RMA.

- dChip difference model with MAS 5.0 probe set detection p values as weights was the most consistent at maximizing the effect of the target biological variables on data interpretation of the two data sets.

The following graph shows the external evaluation results using the F-measure of unsupervised clustering for the human muscular dystrophy data and the mouse lung biopsy data. The "no-wt" bar represents the result without MAS 5.0 detection p-value weighting, and the "wt" bar represents the result with p-value weighting.

## Download

HCE is a standalone Windows application running on a general PC environment. It is freely downloadable for academic and/or research purposes. Commercial licenses can be negotiated with the UM Office of Technology Commercialization (James Poulos, jpoulos@umd.edu).

Register and Download HCE 3.0 version (released on March 29, 2004)

User manual

## Publications

- Jinwook Seo, Marina Bakay, Yi-Wen Chen, Sara Hilmer, Ben Shneiderman, and Eric P. Hoffman, Interactively optimizing signal-to-noise ratios in expression profiling: project-specific algorithm selection and detection p-value weighting in Affymetrix microarrays, Bioinformatics