Contents

Calculate simple statistics

Calculate statistics about a column

Calculate statistics about each row/line

Calculate statistics about groups of lines

Count lines or records in a file

More Information

General Calculating Notes
General Scriptome Notes

Calculate simple statistics

The tools in this section calculate very simple statistics about the data (often tabular data) in given input files.

To use a script, copy and paste the code from the light green or blue box into a terminal window, change the bold red text as needed, and press Enter.

See More Information for notes on using these tools.

Calculate statistics about a column

Calculate sum of the nth column of tabular data (calc_col_sum)

(This tool should not be confused with calc_row_sum, which calculates the sum of all columns for each row.)

Example: Sum second column of file.tab by running the above script.

Input file
(file.tab)

 Fly    7
 Human  14
 Worm   28
 Yeast  35

Screen Output

 Sum of column 1 for 4 lines 84
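The column sum can be illustrated with a minimal Python sketch. This is not the Scriptome tool itself; the function names `column_sum` and `format_number` are hypothetical, and treating non-numeric or missing fields as 0 is an assumption based on the General Calculating Notes.

```python
def format_number(value):
    """Print whole-number floats without a trailing '.0'."""
    return str(int(value)) if value.is_integer() else str(value)


def column_sum(lines, col):
    """Sum column `col` (0-based) over tab-separated lines.

    Returns (total, number_of_lines). Fields that are missing or do not
    parse as numbers contribute 0 (an assumption of this sketch).
    """
    total = 0.0
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            total += float(fields[col])
        except (ValueError, IndexError):
            pass  # non-numeric or missing field counts as 0
        count += 1
    return total, count


# Data from the file.tab example; column 1 is the second column.
rows = ["Fly\t7", "Human\t14", "Worm\t28", "Yeast\t35"]
total, count = column_sum(rows, 1)
print(f"Sum of column 1 for {count} lines: {format_number(total)}")
```

To read from a real file, pass an open file handle (for example `open("file.tab")`) in place of the list.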

Calculate length of a given column on each line (calc_col_length)

For a given column of a tab-separated file, calculate the length of the text in that column. For each line, add a column at the end of the line with the length of the chosen column.

Example: Take a FASTA file that we've converted to tabular format, seqs.tab, using change_fasta_to_tab. Calculate the length of column 2 (the third column), which holds the sequence, by running the above script. This creates a new four-column file, seqs_length.tab.

Input file
(seqs.tab)

 SEQ1   First seq       ACTGACTG
 SEQ2   Second seq      ACTG
 SEQ3   Third seq       
 SEQ4   Third seq       ACTGACTGACTG

Output file
(seqs_length.tab)

 SEQ1   First seq       ACTGACTG        8
 SEQ2   Second seq      ACTG    4
 SEQ3   Third seq               0
 SEQ4   Third seq       ACTGACTGACTG    12

Screen Output

 Added column with length of column 2 for 4 lines.
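As a rough illustration of the same idea in Python (a sketch, not the Scriptome script; `add_column_length` is a hypothetical name), each line is split on tabs and the length of the chosen field is appended as a new last column:

```python
def add_column_length(lines, col):
    """Append the text length of column `col` (0-based) to each line."""
    out = []
    for line in lines:
        # Strip only the newline so a trailing empty field is preserved.
        fields = line.rstrip("\n").split("\t")
        out.append("\t".join(fields + [str(len(fields[col]))]))
    return out


# Rows from the seqs.tab example; SEQ3 has an empty sequence column.
rows = [
    "SEQ1\tFirst seq\tACTGACTG",
    "SEQ2\tSecond seq\tACTG",
    "SEQ3\tThird seq\t",
]
for row in add_column_length(rows, 2):
    print(row)
```

Stripping only the newline (rather than all whitespace) matters here: it keeps the empty third column of SEQ3, so its length is reported as 0 instead of raising an error.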

Calculate statistics about each row/line

Insert line numbers (calc_line_numbers)

For each line in a file, print the line number followed by a separator (by default, a tab, represented by \t), and then the rest of the line.

Example: Add line numbers to a list of genes gene_list.txt to generate a numbered list numbered_gene_list.txt by running the above script.

Input file
(gene_list.txt)

 Hsp90  Heat shock protein
 apo1   apoptosis-related protein
 glu7   Glucose metabolism

Output file
(numbered_gene_list.txt)

 1      Hsp90   Heat shock protein
 2      apo1    apoptosis-related protein
 3      glu7    Glucose metabolism

Screen Output

 Inserted line numbers for 3 lines, with separator      .
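A minimal Python sketch of line numbering, assuming numbering starts at 1 as in the example (the function name `number_lines` is hypothetical and this is not the Scriptome script):

```python
def number_lines(lines, sep="\t"):
    """Prefix each line with its 1-based line number and a separator."""
    return [f"{n}{sep}{line}" for n, line in enumerate(lines, start=1)]


genes = [
    "Hsp90\tHeat shock protein",
    "apo1\tapoptosis-related protein",
    "glu7\tGlucose metabolism",
]
for line in number_lines(genes):
    print(line)
```

Passing a different `sep`, such as `": "`, changes the separator without touching the rest of the line.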

Calculate sum of two or more columns for each row (calc_row_sum)

This tool takes tab-separated data. For each row, it calculates the sum of two or more columns. It adds a new, last column containing the sums.

Example: Sum the second, third, and fourth columns of file.tab by running the above script. The results go in a new, sixth column.

To calculate the sum of all columns in the row, set @cols=(0 .. $#F).
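The row sum can be sketched in Python as follows. This is an illustration under stated assumptions, not the Scriptome script: the names `add_row_sum`, `to_number`, and `format_number` are hypothetical, the sample row is invented for the sketch, and passing `cols=None` plays the role of selecting every column.

```python
def to_number(text):
    """Return text as a float, or 0.0 if it is not a number."""
    try:
        return float(text)
    except ValueError:
        return 0.0


def format_number(value):
    """Print whole-number floats without a trailing '.0'."""
    return str(int(value)) if value.is_integer() else str(value)


def add_row_sum(lines, cols=None):
    """Append the sum of the chosen 0-based columns to each line.

    With cols=None, every column in the row is summed.
    """
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        chosen = range(len(fields)) if cols is None else cols
        total = sum(to_number(fields[c]) for c in chosen)
        out.append("\t".join(fields + [format_number(total)]))
    return out


# Hypothetical five-column row; columns 1, 2, and 3 are the
# second, third, and fourth columns.
rows = ["geneA\t1\t2\t3\t4"]
print(add_row_sum(rows, cols=(1, 2, 3)))
print(add_row_sum(rows))  # all columns; the text field counts as 0
```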

Calculate statistics about groups of lines

The tools in this section group lines together (for example, lines that have the same value in a certain column) and then calculate statistics about each group.

Calculate how many times each value appears in a given column (calc_repeats_for_each_value_in_col)

For a given column of a tab-separated file, count how many times each value appears in that column. Each line of the output will have a value, a tab, and the number of times it appears. Values will be printed in the order of their first appearance.

Example: Given a list of genes with associated GO terms gene_go.txt, make a new file go_repeats.txt showing how many times each GO term is found by running the above script. This could be used to find whether a certain biological process is over-represented in a list of genes, for example. (Note: changing the column to 0 would calculate how many GO terms each gene was associated with.)

Input file
(gene_go.txt)

 Hsp90  GO:00171
 apo1   GO:00012
 apo1   GO:00233
 apo1   GO:01234
 glu7   GO:00012
 glu7   GO:56785

Output file
(go_repeats.txt)

 GO:00171       1
 GO:00012       2
 GO:00233       1
 GO:01234       1
 GO:56785       1

Screen Output

 Printed number of occurrences for 5 values in 6 lines.
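Counting values while keeping first-appearance order can be sketched in Python with a plain dictionary, which preserves insertion order in Python 3.7 and later. The function name `count_values` is hypothetical and this is not the Scriptome script.

```python
def count_values(lines, col):
    """Count how often each value occurs in column `col` (0-based).

    The returned dict lists values in order of first appearance.
    """
    counts = {}
    for line in lines:
        value = line.rstrip("\n").split("\t")[col]
        counts[value] = counts.get(value, 0) + 1
    return counts


# Rows from the gene_go.txt example.
rows = [
    "Hsp90\tGO:00171",
    "apo1\tGO:00012",
    "apo1\tGO:00233",
    "apo1\tGO:01234",
    "glu7\tGO:00012",
    "glu7\tGO:56785",
]
for value, count in count_values(rows, 1).items():
    print(f"{value}\t{count}")
```

Calling `count_values(rows, 0)` instead counts the GO terms associated with each gene, as described in the note above.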

Calculate sum of values in a column for each value in another column (calc_sum_of_col_for_groups_of_lines)

Find sets of rows that have the same value in column m. Then get the sum of column n for those sets of rows. Each line of the output will have a value, a tab, and the sum for that value. Values will be printed in the order of their first appearance.

Example: Calculate the sum of the length of the exons for each gene.

Input file
(exon_length.tab)

 Hsp90  exon1   300
 Hsp90  exon2   100
 Hsp90  exon3   250
 apo1   exon1   100
 apo1   exon2   350

Output file
(gene_length.tab)

 Hsp90  650
 apo1   450

Screen Output

 Printed sum of column 2 for each value in column 0
 Found 2 values in 5 lines
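A grouped sum follows the same pattern as counting, except that each group accumulates a column value instead of a tally. The Python sketch below is illustrative only; `sum_by_group` and `to_number` are hypothetical names, and treating non-numeric fields as 0 is an assumption based on the General Calculating Notes.

```python
def to_number(text):
    """Return text as a float, or 0.0 if it is not a number."""
    try:
        return float(text)
    except ValueError:
        return 0.0


def sum_by_group(lines, key_col, value_col):
    """Sum column `value_col` for each distinct value of `key_col`.

    Both columns are 0-based; groups appear in first-appearance order.
    """
    sums = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        key = fields[key_col]
        sums[key] = sums.get(key, 0.0) + to_number(fields[value_col])
    return sums


# Rows from the exon_length.tab example: group on column 0, sum column 2.
rows = [
    "Hsp90\texon1\t300",
    "Hsp90\texon2\t100",
    "Hsp90\texon3\t250",
    "apo1\texon1\t100",
    "apo1\texon2\t350",
]
print(sum_by_group(rows, 0, 2))
```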

Count lines or records in a file

Count lines in a file (calc_num_lines)

Simply give a count of the number of lines in a file. (The result is printed to an output file as well as the screen.)

Example: Count how many genes are in file gene_list.txt by running the above script.

Input file
(gene_list.txt)

 Hsp90  Heat shock protein
 apo1   apoptosis-related protein
 glu7   Glucose metabolism

Output file
(gene_count.txt)

 Counted 3 lines

Screen Output

 Counted 3 lines

UNIX/Mac users: also check out the wc command.

Count records in a FASTA file (calc_num_fasta_records)

Counts the number of records (and, for convenience, total sequence length) in a FASTA file. (The result is printed to an output file as well as the screen.)

Example: See how many sequences are in seqs.fasta.

Input file
(seqs.fasta)

 >CG123 A small sequence
 ACGTTGCA
 GTTACCAG
 >EG12
 ACCGGA
 >DG124  A smaller sequence
 GTTACCAG

Output file
(seqs_count.txt)

 Read 3 FASTA records in 7 lines. Total sequence length: 30

Screen Output

 Read 3 FASTA records in 7 lines. Total sequence length: 30
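One way to reproduce these counts is sketched below in Python. It assumes that every line starting with `>` begins a new record and that all other lines are sequence; `count_fasta` is a hypothetical name, and this is not the Scriptome script.

```python
def count_fasta(lines):
    """Return (records, total_lines, total_sequence_length)."""
    records = 0
    total_lines = 0
    sequence_length = 0
    for line in lines:
        total_lines += 1
        text = line.strip()
        if text.startswith(">"):
            records += 1  # header line starts a new record
        else:
            sequence_length += len(text)  # sequence line
    return records, total_lines, sequence_length


# Lines from the seqs.fasta example.
fasta = [
    ">CG123 A small sequence",
    "ACGTTGCA",
    "GTTACCAG",
    ">EG12",
    "ACCGGA",
    ">DG124  A smaller sequence",
    "GTTACCAG",
]
records, lines, length = count_fasta(fasta)
print(f"Read {records} FASTA records in {lines} lines. "
      f"Total sequence length: {length}")
```

The sequence lines in the example have lengths 8, 8, 6, and 8, which gives the total of 30 shown above.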

Calculate line numbers for each line

See Insert line numbers (calc_line_numbers) above.

More Information

General Calculating Notes

As always, when in doubt, check your output files after each step!

All text strings that are not obviously a number are considered to have a value of 0. (E.g., 3 + 2 + hello + 7 = 12.) This can happen if you have some text data (like N/A) in a number field.
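The effect of this rule, and a simple way to spot the affected lines before summing, can be sketched in Python. The helpers `to_number` and `non_numeric_lines` are hypothetical, and "not a number" is approximated here as "does not parse with `float()`", which may differ in edge cases from what the Scriptome tools do.

```python
def to_number(text):
    """Return text as a float, or 0.0 if it does not parse as a number."""
    try:
        return float(text)
    except ValueError:
        return 0.0


def non_numeric_lines(lines, col):
    """Return 1-based line numbers whose column `col` is not a number."""
    bad = []
    for number, line in enumerate(lines, start=1):
        field = line.rstrip("\n").split("\t")[col]
        try:
            float(field)
        except ValueError:
            bad.append(number)
    return bad


# The example from the text: "hello" silently counts as 0.
print(sum(to_number(v) for v in ["3", "2", "hello", "7"]))  # prints 12.0

# Hypothetical rows with an N/A in the number field.
rows = ["Fly\t7", "Human\tN/A", "Worm\t28"]
print(non_numeric_lines(rows, 1))  # prints [2]
```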

General Scriptome Notes

Scriptome tools are in blue or green boxes. Copy and paste the text of the tool into a terminal window, then edit the line as needed. Things that will often need to be edited are highlighted in red. Input and output filenames will almost always need to be changed.

All scripts that work on tabular data assume the data is tab-separated. Use a Change script to change, e.g., comma-separated data to tab-separated before using these scripts.

When working with tabular data, remember that the first column is called column 0, the second column is column 1, etc. The last column can also be referred to as column -1, second-to-last column is -2, etc.
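Python lists happen to use the same numbering, so the convention can be illustrated with a short sketch (the function name `get_column` is hypothetical):

```python
def get_column(line, col):
    """Return column `col` of a tab-separated line.

    Columns count from 0; negative numbers count back from the end.
    """
    return line.rstrip("\n").split("\t")[col]


line = "Hsp90\texon1\t300"
print(get_column(line, 0))   # first column: Hsp90
print(get_column(line, 1))   # second column: exon1
print(get_column(line, -1))  # last column: 300
print(get_column(line, -2))  # second-to-last column: exon1
```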