


Calculate simple statistics

The tools in this section calculate very simple statistics about the data (often tabular data) in given input files.

To use a script, copy and paste the code from the light green or blue box into a terminal window, change the bold, red text as needed, and press Enter.

See More Information for notes on using these tools.

Calculate statistics about a column

Calculate sum of the nth column of tabular data (calc_col_sum)

(This tool should not be confused with calc_row_sum, which calculates the sum of all columns for each row.)

$col Column to sum
Input file(s)

perl -e ' $col=1; while(<>) { s/\r?\n//; @F=split /\t/, $_; $sum += $F[$col]; } warn "\nSum of column $col for $. lines\n\n"; print "$sum\n" ' file.tab

Example: Sum second column of file.tab by running the above script.

Input file
(file.tab)
 Fly    7
 Human  14
 Worm   28
 Yeast  35
Screen Output
 Sum of column 1 for 4 lines
 
 84

Calculate length of a given column on each line (calc_col_length)

For a given column of a tab-separated file, calculate the length of the text in that column. For each line, add a column at the end of the line with the length of the chosen column.

$col Column to measure length of
Input file(s)
Output file

perl -e ' $col=2; while (<>) { s/\r?\n//; @F = split /\t/, $_, -1; $len = defined $F[$col] ? length($F[$col]) : 0; print "$_\t$len\n" } warn "\nAdded column with length of column $col for $. lines.\n\n"; ' seqs.tab > seqs_length.tab

Example: Take a FASTA file seqs.tab that we've converted to tabular format using change_fasta_to_tab. Calculate the length of column 2 (the third column), which has the sequence in it, by running the above script. Create a new, 4-column, file seqs_length.tab.

Input file
(seqs.tab)
 SEQ1   First seq       ACTGACTG
 SEQ2   Second seq      ACTG
 SEQ3   Third seq       
 SEQ4   Fourth seq      ACTGACTGACTG
Output file
(seqs_length.tab)
 SEQ1   First seq       ACTGACTG        8
 SEQ2   Second seq      ACTG    4
 SEQ3   Third seq               0
 SEQ4   Fourth seq      ACTGACTGACTG    12
Screen Output
 Added column with length of column 2 for 4 lines.

Calculate statistics about each row/line

Insert line numbers (calc_line_numbers)

For each line in a file, print the line number followed by a separator (by default, a tab, represented by \t), and then the rest of the line.

$separator What to print between line number and rest of line - \t means tab
Input file(s)
Output file

perl -e ' $separator="\t"; while (<>) { print "$.$separator$_" } warn "\nInserted line numbers for $. lines, with separator $separator.\n\n" ' gene_list.txt > numbered_gene_list.txt

Example: Add line numbers to a list of genes gene_list.txt to generate a numbered list numbered_gene_list.txt by running the above script.

Input file
(gene_list.txt)
 Hsp90  Heat shock protein
 apo1   apoptosis-related protein
 glu7   Glucose metabolism
Output file
(numbered_gene_list.txt)
 1      Hsp90   Heat shock protein
 2      apo1    apoptosis-related protein
 3      glu7    Glucose metabolism
Screen Output
 Inserted line numbers for 3 lines, with separator      .

Calculate sum of two or more columns for each row (calc_row_sum)

This tool takes tab-separated data. For each row, it calculates the sum of two or more columns. It adds a new, last column containing the sums.

@cols Which column(s) to add
Input file(s)
Output file

perl -e ' @cols=(1, 2, 3); while(<>) { s/\r?\n//; @F=split /\t/, $_; $sum = 0; foreach $col (@cols) { $sum += $F[$col] }; print "$_\t$sum\n"; } warn "\nSum of columns @cols for each line ($. lines)\n\n" ' in.tab > out.tab

Example: Sum the second, third, and fourth columns (columns 1, 2, and 3) of in.tab by running the above script. The sums go in a new last column of out.tab.

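To calculate the sum of all columns in the row, set @cols=(0 .. $#F) inside the loop, right after the split. (At the start of the script @F is still empty, so setting @cols there would give an empty list and every sum would be 0.)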

TODO

Calculate statistics about groups of lines

These tools group lines together (for example, all lines that have the same value in a certain column) and then calculate statistics about each group of lines.

Calculate how many times each value appears in a given column (calc_repeats_for_each_value_in_col)

For a given column of a tab-separated file, count how many times each value appears in that column. Each line of the output will have a value, a tab, and the number of times it appears. Values will be printed in the order of their first appearance.

$col Column to count repeats in
Input file(s)
Output file

perl -e ' $col=1; while (<>) { s/\r?\n//; @F = split /\t/, $_; $val = $F[$col]; if (! exists $count{$val}) { push @order, $val } $count{$val}++; } foreach $val (@order) { print "$val\t$count{$val}\n" } warn "\nPrinted number of occurrences for ", scalar(@order), " values in $. lines.\n\n"; ' gene_go.txt > go_repeats.txt

Example: Given a list of genes with associated GO terms gene_go.txt, make a new file go_repeats.txt showing how many times each GO term is found by running the above script. This could be used to find whether a certain biological process is over-represented in a list of genes, for example. (Note: changing the column to 0 would calculate how many GO terms each gene was associated with.)

Input file
(gene_go.txt)
 Hsp90  GO:00171
 apo1   GO:00012
 apo1   GO:00233
 apo1   GO:01234
 glu7   GO:00012
 glu7   GO:56785
Output file
(go_repeats.txt)
 GO:00171       1
 GO:00012       2
 GO:00233       1
 GO:01234       1
 GO:56785       1
Screen Output
 Printed number of occurrences for 5 values in 6 lines.

Calculate sum of values in a column for each value in another column (calc_sum_of_col_for_groups_of_lines)

Find sets of rows that have the same value in column m. Then get the sum of column n for those sets of rows. Each line of the output will have a value, a tab, and the sum for that value. Values will be printed in the order of their first appearance.

$value_col Column to determine grouping of lines
$sum_col Column to sum for sets of similar lines
Input file(s)
Output file

perl -e ' $value_col=0; $sum_col=2; while (<>) { s/\r?\n//; @F = split /\t/, $_; $val = $F[$value_col]; if (! exists $sum{$val}) { push @order, $val } $sum{$val} += $F[$sum_col]; } foreach $val (@order) { print "$val\t$sum{$val}\n" } warn "\nPrinted sum of column $sum_col for each value in column $value_col\nFound ", scalar(@order), " values in $. lines\n\n"; ' exon_length.tab > gene_length.tab

Example: Calculate the sum of the length of the exons for each gene.

Input file
(exon_length.tab)
 Hsp90  exon1   300
 Hsp90  exon2   100
 Hsp90  exon3   250
 apo1   exon1   100
 apo1   exon2   350
Output file
(gene_length.tab)
 Hsp90  650
 apo1   450
Screen Output
 Printed sum of column 2 for each value in column 0
 Found 2 values in 5 lines

Count lines or records in a file

Count lines in a file (calc_num_lines)

Simply give a count of the number of lines in a file. (The result is printed to an output file as well as the screen.)

Input file(s)
Output file

perl -e ' while (<>) { } print "Counted $. lines\n"; warn "\nCounted $. lines\n\n" ' gene_list.txt > gene_count.txt

Example: Count how many genes are in file gene_list.txt by running the above script.

Input file
(gene_list.txt)
 Hsp90  Heat shock protein
 apo1   apoptosis-related protein
 glu7   Glucose metabolism
Output file
(gene_count.txt)
 Counted 3 lines
Screen Output
 Counted 3 lines

UNIX/Mac users: also check out the wc command.
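For example, a quick sketch with wc -l on the same gene list (the variable name num_lines is just for illustration):

```shell
# Recreate the example gene list.
printf 'Hsp90\tHeat shock protein\napo1\tapoptosis-related protein\nglu7\tGlucose metabolism\n' > gene_list.txt

# Reading from standard input makes wc print only the count, not the file name.
# tr removes the padding spaces that some versions of wc add.
num_lines=$(wc -l < gene_list.txt | tr -d ' ')
echo "Counted $num_lines lines"    # prints: Counted 3 lines
```

Note that wc -l counts newline characters, so a last line with no newline at the end is not counted.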

Count records in a FASTA file (calc_num_fasta_records)

Counts the number of records (and, for convenience, total sequence length) in a FASTA file. (The result is printed to an output file as well as the screen.)

Input file(s)
Output file

perl -e ' $count=0; $len=0; while(<>) { s/\r?\n//; if (/^>/) { $count++; } else { $len += length($_) } } print "Read $count FASTA records in $. lines. Total sequence length: $len\n"; warn "\nRead $count FASTA records in $. lines. Total sequence length: $len\n\n"; ' seqs.fasta > seqs_count.txt

Example: See how many sequences are in seqs.fasta.

Input file
(seqs.fasta)
 >CG123 A small sequence
 ACGTTGCA
 GTTACCAG
 >EG12
 ACCGGA
 >DG124  A smaller sequence
 GTTACCAG
Output file
(seqs_count.txt)
 Read 3 FASTA records in 7 lines. Total sequence length: 30
Screen Output
 Read 3 FASTA records in 7 lines. Total sequence length: 30

Calculate line numbers for each line

See calc_line_numbers above.


More Information

General Calculating Notes

As always, when in doubt, check your output files after each step!

Perl treats any text string that does not start with a number as having a value of 0. (E.g., 3 + 2 + hello + 7 = 12.) This can happen if you have some text data (like N/A) in a number field. A string that starts with a number, such as 12abc, counts as that leading number (12).

General Scriptome Notes

Scriptome tools are in blue or green boxes. Copy and paste the text of the tool into a terminal window, then edit the line as needed. Things that will often need to be edited are highlighted in red. Input and output filenames will almost always need to be changed.

All scripts that work on tabular data assume the data is tab-separated. Use a Change script to change, e.g., comma-separated data to tab-separated before using these scripts.

When working with tabular data, remember that the first column is called column 0, the second column is column 1, etc. The last column can also be referred to as column -1, second-to-last column is -2, etc.

 
