# Calculate simple statistics

The tools in this section calculate very simple statistics about the data (often tabular data) in given input files.

To use a script, copy and paste the code from the light green or blue box into a terminal window, change the bold, red text as needed, and press Enter.

## Calculate statistics about a column

### Calculate sum of the nth column of tabular data (calc_col_sum)

(This tool should not be confused with calc_row_sum, which calculates the sum of all columns for each row.)

Parameters to edit: `\$col` (column to sum); input file(s).

 perl -e " \$col=1; while(<>) { s/\r?\n//; @F=split /\t/, \$_; \$sum += \$F[\$col]; } warn qq~\nSum of column \$col for \$. lines\n\n~; print qq~\$sum\n~ " file.tab

Example: Sum second column of `file.tab` by running the above script.

Input file (`file.tab`):
```
Fly    7
Human  14
Worm   28
Yeast  35
```

Screen output:
```
Sum of column 1 for 4 lines

84
```

### Calculate length of a given column on each line (calc_col_length)

For a given column of a tab-separated file, calculate the length of the text in that column. For each line, add a column at the end of the line with the length of the chosen column.

Parameters to edit: `\$col` (column to measure the length of); input file(s); output file.

 perl -e " \$col=2; while (<>) { s/\r?\n//; @F = split /\t/, \$_; \$len = length(\$F[\$col]); print qq~\$_\t\$len\n~ } warn qq~\nAdded column with length of column \$col for \$. lines.\n\n~; " seqs.tab > seqs_length.tab

Example: Take a FASTA file that we've converted to tabular format using change_fasta_to_tab (`seqs.tab`). Calculate the length of column 2 (the third column), which holds the sequence, by running the above script. This creates a new 4-column file, `seqs_length.tab`.

Input file (`seqs.tab`):
```
SEQ1  First seq   ACTGACTG
SEQ2  Second seq  ACTG
SEQ3  Third seq
SEQ4  Third seq   ACTGACTGACTG
```

Output file (`seqs_length.tab`):
```
SEQ1  First seq   ACTGACTG      8
SEQ2  Second seq  ACTG          4
SEQ3  Third seq                 0
SEQ4  Third seq   ACTGACTGACTG  12
```

Screen output:
```
Added column with length of column 2 for 4 lines.
```

## Calculate statistics about each row/line

### Insert line numbers (calc_line_numbers)

For each line in a file, print the line number followed by a separator (by default, a tab, represented by \t), and then the rest of the line.

Parameters to edit: `\$separator` (what to print between the line number and the rest of the line; `\t` means tab); input file(s); output file.

 perl -e " \$separator= qq~\t~; while (<>) { print qq~\$.\$separator\$_~ } warn qq~\nInserted line numbers for \$. lines, with separator \$separator.\n\n~ " gene_list.txt > numbered_gene_list.txt

Example: Add line numbers to a list of genes `gene_list.txt` to generate a numbered list `numbered_gene_list.txt` by running the above script.

Input file (`gene_list.txt`):
```
Hsp90  Heat shock protein
apo1   apoptosis-related protein
glu7   Glucose metabolism
```

Output file (`numbered_gene_list.txt`):
```
1  Hsp90  Heat shock protein
2  apo1   apoptosis-related protein
3  glu7   Glucose metabolism
```

Screen output:
```
Inserted line numbers for 3 lines, with separator .
```

### Calculate sum of two or more columns for each row (calc_row_sum)

This tool takes tab-separated data. For each row, it calculates the sum of two or more columns. It adds a new, last column containing the sums.

Parameters to edit: `@cols` (which columns to add); input file(s); output file.

 perl -e " @cols=(1, 2, 3); while(<>) { s/\r?\n//; @F=split /\t/, \$_; \$sum = 0; foreach \$col (@cols) { \$sum += \$F[\$col] }; print qq~\$_\t\$sum\n~; } warn qq~\nSum of columns @cols for each line (\$. lines)\n\n~ " in.tab > out.tab

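Example: Sum the second, third, and fourth columns of `in.tab` by running the above script. The sums go in a new last column in `out.tab`.

To calculate the sum of all columns in the row, set `@cols=(0 .. \$#F)` inside the loop, right after the `split`. Setting it before the loop does not work, because `@F` is still empty at that point, so the list of columns would be empty too.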
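For readers who would rather script this step, the row-sum logic can be sketched in Python. This is only an illustration, not part of the Scriptome tool: the function name `add_row_sums` and the sample rows are made up, and it assumes the chosen columns hold integers.

```python
def add_row_sums(lines, cols=None):
    """Append the sum of the chosen columns to each row.

    cols is a sequence of 0-based column numbers; None means all columns.
    """
    out = []
    for line in lines:
        row = line.rstrip("\r\n")
        fields = row.split("\t")
        # The column list is resolved per row, after the split,
        # so "all columns" always matches the current row.
        chosen = cols if cols is not None else range(len(fields))
        total = sum(int(fields[c]) for c in chosen)
        out.append(f"{row}\t{total}")
    return out


# Hypothetical sample rows (not from the original page)
print(add_row_sums(["geneA\t1\t2\t3\n"], cols=(1, 2, 3)))
print(add_row_sums(["4\t5\t6\n"]))
```

The first call sums columns 1 to 3 and skips the gene name; the second sums every column of an all-numeric row.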


## Calculate statistics about groups of lines

These tools group lines together (for example, all lines that have the same value in a certain column) and then calculate statistics for each group.

### Calculate how many times each value appears in a given column (calc_repeats_for_each_value_in_col)

For a given column of a tab-separated file, count how many times each value appears in that column. Each line of the output will have a value, a tab, and the number of times it appears. Values will be printed in the order of their first appearance.

Parameters to edit: `\$col` (column to count repeats in); input file(s); output file.

 perl -e " \$col=1; while (<>) { s/\r?\n//; @F = split /\t/, \$_; \$val = \$F[\$col]; if (! exists \$count{\$val}) { push @order, \$val } \$count{\$val}++; } foreach \$val (@order) { print qq~\$val\t\$count{\$val}\n~ } warn qq~\nPrinted number of occurrences for ~, scalar(@order), qq~ values in \$. lines.\n\n~; " gene_go.txt > go_repeats.txt

Example: Given a list of genes with associated GO terms `gene_go.txt`, make a new file `go_repeats.txt` showing how many times each GO term is found by running the above script. This could be used to find whether a certain biological process is over-represented in a list of genes, for example. (Note: changing the column to 0 would calculate how many GO terms each gene was associated with.)

Input file (`gene_go.txt`):
```
Hsp90  GO:00171
apo1   GO:00012
apo1   GO:00233
apo1   GO:01234
glu7   GO:00012
glu7   GO:56785
```

Output file (`go_repeats.txt`):
```
GO:00171  1
GO:00012  2
GO:00233  1
GO:01234  1
GO:56785  1
```

Screen output:
```
Printed number of occurrences for 5 values in 6 lines.
```
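The counting logic can also be sketched in Python, which may help when checking what the one-liner does. The function name `count_values` is hypothetical, and the sketch assumes every line actually has the requested column. A Python dict keeps keys in insertion order, so it covers both the `%count` hash and the `@order` list of the Perl version.

```python
def count_values(lines, col=1):
    """Count how often each value appears in column `col` (0-based)."""
    counts = {}  # insertion-ordered, so values stay in order of first appearance
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        value = fields[col]
        counts[value] = counts.get(value, 0) + 1
    return counts


# The gene_go.txt example from above, as tab-separated lines
sample = [
    "Hsp90\tGO:00171\n",
    "apo1\tGO:00012\n",
    "apo1\tGO:00233\n",
    "apo1\tGO:01234\n",
    "glu7\tGO:00012\n",
    "glu7\tGO:56785\n",
]

for value, n in count_values(sample).items():
    print(f"{value}\t{n}")
```

Calling `count_values(sample, col=0)` instead counts how many GO terms each gene has.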

### Calculate sum of values in a column for each value in another column (calc_sum_of_col_for_groups_of_lines)

Find sets of rows that have the same value in column m (`\$value_col`). Then get the sum of column n (`\$sum_col`) for each of those sets of rows. Each line of the output will have a value, a tab, and the sum for that value. Values will be printed in the order of their first appearance.

Parameters to edit: `\$value_col` (column that determines the grouping of lines); `\$sum_col` (column to sum for each set of similar lines); input file(s); output file.

 perl -e " \$value_col=0; \$sum_col=2; while (<>) { s/\r?\n//; @F = split /\t/, \$_; \$val = \$F[\$value_col]; if (! exists \$sum{\$val}) { push @order, \$val } \$sum{\$val} += \$F[\$sum_col]; } foreach \$val (@order) { print qq~\$val\t\$sum{\$val}\n~ } warn qq~\nPrinted sum of column \$sum_col for each value in column \$value_col\nFound ~, scalar(@order), qq~ values in \$. lines\n\n~; " exon_length.tab > gene_length.tab

Example: Calculate the sum of the lengths of the exons for each gene in `exon_length.tab` by running the above script. The totals go in a new file, `gene_length.tab`.

Input file (`exon_length.tab`):
```
Hsp90  exon1  300
Hsp90  exon2  100
Hsp90  exon3  250
apo1   exon1  100
apo1   exon2  350
```

Output file (`gene_length.tab`):
```
Hsp90  650
apo1   450
```

Screen output:
```
Printed sum of column 2 for each value in column 0
Found 2 values in 5 lines
```
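As a cross-check on the grouping logic, here is a minimal Python sketch of the same idea. The function name `sum_by_group` is hypothetical, and the sketch assumes the summed column contains integers (unlike Perl, Python raises an error on non-numeric text instead of counting it as 0).

```python
def sum_by_group(lines, value_col=0, sum_col=2):
    """Sum column `sum_col` separately for each distinct value in `value_col`."""
    sums = {}  # insertion-ordered: groups stay in order of first appearance
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        key = fields[value_col]
        sums[key] = sums.get(key, 0) + int(fields[sum_col])
    return sums


# The exon_length.tab example from above, as tab-separated lines
sample = [
    "Hsp90\texon1\t300\n",
    "Hsp90\texon2\t100\n",
    "Hsp90\texon3\t250\n",
    "apo1\texon1\t100\n",
    "apo1\texon2\t350\n",
]

for gene, total in sum_by_group(sample).items():
    print(f"{gene}\t{total}")
```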

## Count lines or records in a file

### Count lines in a file (calc_num_lines)

Simply give a count of the number of lines in a file. (The result is printed to an output file as well as the screen.)

Parameters to edit: input file(s); output file.

 perl -e " while (<>) { } print qq~Counted \$. lines\n~; warn qq~\nCounted \$. lines\n\n~ " gene_list.txt > gene_count.txt

Example: Count how many genes are in file `gene_list.txt` by running the above script.

Input file (`gene_list.txt`):
```
Hsp90  Heat shock protein
apo1   apoptosis-related protein
glu7   Glucose metabolism
```

Output file (`gene_count.txt`):
```
Counted 3 lines
```

Screen output:
```
Counted 3 lines
```

UNIX/Mac users: also check out the `wc` command (`wc -l` counts lines).

### Count records in a FASTA file (calc_num_fasta_records)

Counts the number of records (and, for convenience, total sequence length) in a FASTA file. (The result is printed to an output file as well as the screen.)

Parameters to edit: input file(s); output file.

 perl -e " \$count=0; \$len=0; while(<>) { s/\r?\n//; if (/^>/) { \$count++; } else { \$len += length(\$_) } } print qq~Read \$count FASTA records in \$. lines. Total sequence length: \$len\n~; warn qq~\nRead \$count FASTA records in \$. lines. Total sequence length: \$len\n\n~; " seqs.fasta > seqs_count.txt

Example: See how many sequences are in `seqs.fasta` by running the above script. The counts are also written to `seqs_count.txt`.

Input file (`seqs.fasta`):
```
>CG123 A small sequence
ACGTTGCA
GTTACCAG
>EG12
ACCGGA
>DG124 A smaller sequence
GTTACCAG
```

Output file (`seqs_count.txt`):
```
Read 3 FASTA records in 7 lines. Total sequence length: 30
```

Screen output:
```
Read 3 FASTA records in 7 lines. Total sequence length: 30
```
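The counting rule is simple enough to sketch in Python as a sanity check: every line starting with `>` begins a new record, and every other line adds to the total sequence length. The function name `count_fasta` is hypothetical; like the Perl tool, the sketch does no validation of the sequence characters.

```python
def count_fasta(lines):
    """Return (number of records, total sequence length) for FASTA lines."""
    records = 0
    total_length = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            records += 1               # header line: start of a new record
        else:
            total_length += len(line)  # sequence line: add its length
    return records, total_length


# The seqs.fasta example from above
sample = [
    ">CG123 A small sequence\n",
    "ACGTTGCA\n",
    "GTTACCAG\n",
    ">EG12\n",
    "ACCGGA\n",
    ">DG124 A smaller sequence\n",
    "GTTACCAG\n",
]

records, total_length = count_fasta(sample)
print(f"Read {records} FASTA records in {len(sample)} lines. Total sequence length: {total_length}")
```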


## General Calculating Notes

As always, when in doubt, check your output files after each step!

All text strings that are not obviously a number are considered to have a value of 0. (E.g., 3 + 2 + hello + 7 = 12.) This can happen if you have some text data (like N/A) in a number field. Text that starts with a number is counted as that leading number (e.g., 12abc counts as 12).
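To see how much non-numeric text affects a sum, the rule can be modelled in a few lines of Python. This is a simplified model, not Perl's exact behavior: the hypothetical helper `to_number` counts anything that Python's `float()` cannot parse as 0.

```python
def to_number(text):
    """Return text as a float, or 0.0 when it is not a plain number (e.g. 'N/A')."""
    try:
        return float(text)
    except ValueError:
        return 0.0


values = ["3", "2", "hello", "7"]
total = sum(to_number(v) for v in values)
print(total)  # 12.0

# Listing the entries that were silently counted as 0 makes them easy to check
skipped = [v for v in values if to_number(v) == 0.0 and v.strip() not in ("0", "0.0")]
print(skipped)  # ['hello']
```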

## General Scriptome Notes

Scriptome tools are in blue or green boxes. Copy and paste the text of the tool into a terminal window. Then edit the line as needed. Things that will often need to be edited are highlighted in red. Input and output filenames will almost always need to be changed.

All scripts that work on tabular data assume the data is tab-separated. Use a Change script to change, e.g., comma-separated data to tab-separated before using these scripts.

When working with tabular data, remember that the first column is called column 0, the second column is column 1, etc. The last column can also be referred to as column -1, the second-to-last column as column -2, etc.
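Python lists happen to use the same numbering as the Perl arrays in these tools, so the convention can be tried out quickly. The sample row below is taken from the `exon_length.tab` example; the variable name `row` is just for illustration.

```python
# One tab-separated line, split into columns
row = "Hsp90\texon1\t300".split("\t")

print(row[0])   # first column (column 0): Hsp90
print(row[1])   # second column (column 1): exon1
print(row[-1])  # last column (column -1): 300
print(row[-2])  # second-to-last column (column -2): exon1
```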