HPC Lab

BMI - OSU





CPB: Correlated Patterns Biclustering

General information

CPB is a novel two-step, Pearson-correlation-based biclustering approach for mining genes that are co-regulated with a given reference gene, with the goal of discovering genes that function in a common biological process. In the first step, the algorithm identifies subsets of genes with high correlation, reducing false negatives with a nonparametric filtering scheme. In the second step, biclusters from multiple datasets are used to extract and rank gene correlation information.
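
To make the correlation criterion concrete, the following is a minimal sketch of selecting rows (genes) whose Pearson correlation with a reference row, computed over a chosen subset of columns, reaches a threshold. It is not the CPB algorithm itself: the function names, the use of the signed (rather than absolute) correlation, and the toy matrix are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        return 0.0  # a constant row has no defined correlation; treat as 0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def correlated_rows(data: List[List[float]], reference_row: int,
                    cols: Sequence[int], threshold: float = 0.9) -> List[int]:
    """Hypothetical helper: indices of rows whose correlation with the
    reference row, restricted to `cols`, is at least `threshold`."""
    ref = [data[reference_row][c] for c in cols]
    hits = []
    for i, row in enumerate(data):
        if i == reference_row:
            continue
        if pearson(ref, [row[c] for c in cols]) >= threshold:
            hits.append(i)
    return hits


# Toy expression matrix: 4 genes (rows) x 4 conditions (columns).
toy = [
    [1.0, 2.0, 3.0, 4.0],  # reference gene
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the reference
    [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated
    [1.0, 5.0, 2.0, 9.0],  # only moderately correlated
]
print(correlated_rows(toy, reference_row=0, cols=[0, 1, 2, 3]))
```

In this sketch the threshold plays a role similar to a minimum correlation setting; how CPB actually applies its own threshold and filtering is described by the tool, not by this example.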

Download

Latest release: cpb-11-4-2011

Synthetic dataset example: data_examples

40 real datasets with their query results for several probes: real_datasets

Dependencies

The CPB and correlation combination codes are written in C and C++ and have no other dependencies. We also provide Python wrappers that make them easier to use. These Python wrappers have the following dependencies:

Installation

For installation of CPB:

For the installation of the correlation combination:

Usage

A sample usage of CPB with the python wrapper:

[user cpb_source]$ python run_cpb.py DF=data.txt BF=found.txt NB=500 PCC=0.9 MO=0.25

This will create a file "found.txt", in which each bicluster is written in the format:
<row indices>
<column indices>
with biclusters separated by empty lines
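
The output format described above can be read back with a few lines of Python. The sketch below assumes that the indices on each line are whitespace-separated integers, which the description does not state explicitly; the function name parse_biclusters is hypothetical and not part of the CPB distribution.

```python
from typing import List, Tuple

Bicluster = Tuple[List[int], List[int]]


def parse_biclusters(text: str) -> List[Bicluster]:
    """Parse bicluster text: one line of row indices, one line of column
    indices, with biclusters separated by empty lines.
    Assumption: indices are whitespace-separated integers."""
    biclusters = []
    block = []
    # A trailing empty line is appended so the last block is always flushed.
    for line in text.splitlines() + [""]:
        if line.strip():
            block.append(line)
            continue
        if not block:
            continue  # tolerate several consecutive empty lines
        if len(block) != 2:
            raise ValueError(
                "expected 2 lines per bicluster, got %d" % len(block))
        rows = [int(tok) for tok in block[0].split()]
        cols = [int(tok) for tok in block[1].split()]
        biclusters.append((rows, cols))
        block = []
    return biclusters


# Made-up sample in the assumed format (two biclusters).
sample = "0 3 7\n1 2\n\n4 5\n0 1 2\n"
for rows, cols in parse_biclusters(sample):
    print(len(rows), "rows x", len(cols), "columns")
```

To read a real result, pass the contents of the output file, for example parse_biclusters(open("found.txt").read()).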

For a description of the parameters of CPB, run the command without arguments:
[user cpb_source]$ python run_cpb.py

Examples of 4 different bicluster models can be found here. Each example dataset has 1000 rows and 200 columns with 2 biclusters embedded, without any noise or overlap. The expected biclusters of each data matrix are listed in expected.txt files under the corresponding folders.

Note that this script creates a temporary folder under "/tmp", which is removed when the job is completed. The location of this folder can be changed by setting the environment variable:

[user cpb_source]$ export TMPDIR=<the_path_to_temporary_directory>
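
When the wrapper is launched from another Python script rather than from an interactive shell, the same variable can be set for that single invocation only. This is a minimal sketch: the helper names build_env and run_cpb_with_tmpdir are hypothetical, and the argument list simply mirrors the sample command shown earlier.

```python
import os
import subprocess
from typing import Dict, List


def build_env(tmpdir: str) -> Dict[str, str]:
    """Copy the current environment and point TMPDIR at `tmpdir`."""
    env = os.environ.copy()
    env["TMPDIR"] = tmpdir
    return env


def run_cpb_with_tmpdir(tmpdir: str, args: List[str]) -> subprocess.CompletedProcess:
    """Hypothetical helper: run the wrapper with TMPDIR set for this call only.
    `args` are KEY=VALUE strings such as "DF=data.txt" or "BF=found.txt".
    Must be called from the directory that contains run_cpb.py."""
    return subprocess.run(["python", "run_cpb.py"] + list(args),
                          env=build_env(tmpdir), check=True)
```

A call such as run_cpb_with_tmpdir("/scratch/tmp", ["DF=data.txt", "BF=found.txt", "NB=500", "PCC=0.9", "MO=0.25"]) would then leave the caller's own environment unchanged; the path "/scratch/tmp" is only a placeholder.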

For the usage of correlation calculation:

[user correlation]$ correlation
Usage: correlation <inputfile> <#rows> <#cols> <reference row> <#biclusters>

To combine the results of several datasets, use run_correlation.py.

[user cpb_source]$ python run_correlation.py
  reference_row: the integer index of the reference row, where the first row has index 0.
  input_folder: the folder from which the datasets and their results will be read. See the real dataset examples for the expected folder format.
  output_file: the path to which the output file should be written.

For a usage example of the correlation combining code,

These will save the combined results in the given txt files. Note that run_correlation.py expects the input folder to follow exactly this format: a subfolder named "GDS987" must contain a file named "GDS987.soft" and also the biclusters found for the reference row, which for reference row 10722 are saved in the file 10722_found.txt.
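
Because the folder layout is strict, it can help to check it before starting a long run. The sketch below is a hypothetical pre-flight check, not part of the CPB distribution: it assumes that every subfolder of the input folder is a dataset folder and that both required files sit directly inside it, as described above.

```python
from pathlib import Path
from typing import List


def check_input_folder(input_folder: str, reference_row: int) -> List[str]:
    """Return a list of problems found in the folder layout.
    For each subfolder <name>, the files <name>/<name>.soft and
    <name>/<reference_row>_found.txt are expected to exist."""
    problems = []
    for sub in sorted(Path(input_folder).iterdir()):
        if not sub.is_dir():
            continue  # ignore stray files at the top level
        required = [
            sub / (sub.name + ".soft"),
            sub / ("%d_found.txt" % reference_row),
        ]
        for path in required:
            if not path.is_file():
                problems.append("missing " + str(path))
    return problems
```

An empty list means every dataset subfolder contains both expected files for the chosen reference row; otherwise each entry names one missing file.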