The tools in this section calculate very simple statistics about
the data (often tabular data) in given input files.
To use a script, cut and paste the code from the light green or blue box into a
terminal window, change the bold, red text as needed, and hit Enter.
See More Information for notes on using these tools.
This tool sums a single column over all lines of the file.
(It should not be confused with calc_row_sum, which calculates the sum
of all columns for each row.)
Example: Sum the second column of file.tab by running the above script.

Input file (file.tab):
    Fly 7
    Human 14
    Worm 28
    Yeast 35

Screen output:
    Sum of column 1 for 4 lines
    84
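The calculation in this example can be sketched in Python. This is not the Scriptome script itself, only an illustration of the same idea; the function names `to_number`, `fmt`, and `sum_column` are hypothetical, and the sketch assumes every line actually has the requested column.

```python
def to_number(text):
    """Return text as a float, or 0.0 if float() cannot parse it (e.g. N/A)."""
    try:
        return float(text)
    except ValueError:
        return 0.0


def fmt(x):
    """Print whole-number floats without a trailing '.0'."""
    return str(int(x)) if x.is_integer() else str(x)


def sum_column(lines, col):
    """Sum one column (0-based) over tab-separated lines.

    Returns (total, number_of_lines). Assumes each line has column `col`.
    """
    total = 0.0
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        total += to_number(fields[col])
        count += 1
    return total, count


# The rows of file.tab from the example, written with explicit tabs.
rows = ["Fly\t7", "Human\t14", "Worm\t28", "Yeast\t35"]
total, count = sum_column(rows, 1)
print(f"Sum of column 1 for {count} lines")
print(fmt(total))
```

To read from a real file, the list `rows` could be replaced by an open file handle, since the function only iterates over lines.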
For a given column of a tab-separated file, calculate the length of
the text in that column. For each line, add a column at the end of the line
with the length of the chosen column.
Example: Take seqs.tab, a FASTA file that we've converted to tabular format
using change_fasta_to_tab. Calculate the length of column 2
(the third column), which has the sequence in it,
by running the above script. This creates a new, 4-column file, seqs_length.tab.

Input file (seqs.tab):
    SEQ1 First seq ACTGACTG
    SEQ2 Second seq ACTG
    SEQ3 Third seq
    SEQ4 Third seq ACTGACTGACTG

Output file (seqs_length.tab):
    SEQ1 First seq ACTGACTG 8
    SEQ2 Second seq ACTG 4
    SEQ3 Third seq 0
    SEQ4 Third seq ACTGACTGACTG 12

Screen output:
    Added column with length of column 2 for 4 lines.
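A minimal Python sketch of the same operation follows. It is an illustration rather than the tool's own code, and `add_length_column` is a hypothetical name. Note that only the newline is stripped from each line, so an empty last column (as in SEQ3) is kept and gets length 0.

```python
def add_length_column(lines, col):
    """Yield each tab-separated line with the length of column `col` appended."""
    for line in lines:
        # Strip only the newline: stripping all whitespace would also remove
        # the trailing tab that marks an empty final column.
        fields = line.rstrip("\n").split("\t")
        fields.append(str(len(fields[col])))
        yield "\t".join(fields)


rows = [
    "SEQ1\tFirst seq\tACTGACTG",
    "SEQ2\tSecond seq\tACTG",
    "SEQ3\tThird seq\t",
]
for out_line in add_length_column(rows, 2):
    print(out_line)
```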
For each line in a file, print the line number followed by a separator (by
default, a tab, represented by \t), and then the rest of the line.
Example: Add line numbers to a list of genes gene_list.txt to
generate a numbered list numbered_gene_list.txt
by running the above script.

Input file (gene_list.txt):
    Hsp90 Heat shock protein
    apo1 apoptosis-related protein
    glu7 Glucose metabolism

Output file (numbered_gene_list.txt):
    1 Hsp90 Heat shock protein
    2 apo1 apoptosis-related protein
    3 glu7 Glucose metabolism

Screen output:
    Inserted line numbers for 3 lines, with separator .
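As a rough Python equivalent (a sketch, not the Scriptome script), line numbering with a configurable separator might look like this. The name `number_lines` and the `sep` parameter are hypothetical; the default separator is a tab, matching the description above.

```python
def number_lines(lines, sep="\t"):
    """Yield each line prefixed by its 1-based line number and a separator."""
    for number, line in enumerate(lines, start=1):
        yield f"{number}{sep}{line}"


genes = [
    "Hsp90\tHeat shock protein",
    "apo1\tapoptosis-related protein",
    "glu7\tGlucose metabolism",
]
for out_line in number_lines(genes):
    print(out_line)
```

Passing a different separator, for example `number_lines(genes, sep=": ")`, changes only the text between the number and the rest of the line.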
This tool takes tab-separated data. For each row, it calculates the sum
of two or more columns. It adds a new, last column containing the sums.
Example: Sum the second, third, and fourth columns of file.tab by running
the above script. The results go in a new, sixth column.
To calculate the sum of all columns in the row, set @cols=(0 .. $#F)
(in Perl, $#F is the index of the last column).
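The row-sum idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: the function names are hypothetical, the sample row is made up for the sketch, and passing `cols=None` plays the role of summing every column in the row.

```python
def to_number(text):
    """Return text as a float, or 0.0 if float() cannot parse it."""
    try:
        return float(text)
    except ValueError:
        return 0.0


def fmt(x):
    """Print whole-number floats without a trailing '.0'."""
    return str(int(x)) if x.is_integer() else str(x)


def add_row_sum(lines, cols=None):
    """Yield each tab-separated line with the sum of the chosen columns appended.

    `cols` is a sequence of 0-based column numbers; None means all columns.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        chosen = range(len(fields)) if cols is None else cols
        total = sum(to_number(fields[c]) for c in chosen)
        yield "\t".join(fields + [fmt(total)])


# Hypothetical 5-column row; columns 1, 2, and 3 are summed into a sixth column.
rows = ["geneA\t1\t2\t3\t4"]
for out_line in add_row_sum(rows, cols=(1, 2, 3)):
    print(out_line)
```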
The following tools group lines together (e.g., lines that have the same value
in a certain column) and then calculate statistics about each group of lines.
For a given column of a tab-separated file, count how many times each value
appears in that column. Each line of the output will have a value, a tab, and
the number of times it appears. Values will be printed in the order of their
first appearance.
Example: Given gene_go.txt, a list of genes with associated GO terms,
run the above script to make a new file, go_repeats.txt, showing how many
times each GO term is found. This could be used to find whether
a certain biological process is over-represented in a list of genes, for
example. (Note: changing the column to 0 would calculate how many GO terms
each gene was associated with.)

Input file (gene_go.txt):
    Hsp90 GO:00171
    apo1 GO:00012
    apo1 GO:00233
    apo1 GO:01234
    glu7 GO:00012
    glu7 GO:56785

Output file (go_repeats.txt):
    GO:00171 1
    GO:00012 2
    GO:00233 1
    GO:01234 1
    GO:56785 1

Screen output:
    Printed number of occurrences for 5 values in 6 lines.
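Counting values in order of first appearance can be sketched in Python with a plain dictionary, which keeps insertion order (Python 3.7 and later). This is an illustrative sketch, not the Scriptome script, and `count_values` is a hypothetical name.

```python
def count_values(lines, col):
    """Count how often each value appears in column `col` (0-based).

    Returns (counts, number_of_lines); `counts` keeps first-appearance order.
    """
    counts = {}
    line_count = 0
    for line in lines:
        value = line.rstrip("\n").split("\t")[col]
        counts[value] = counts.get(value, 0) + 1
        line_count += 1
    return counts, line_count


gene_go = [
    "Hsp90\tGO:00171",
    "apo1\tGO:00012",
    "apo1\tGO:00233",
    "apo1\tGO:01234",
    "glu7\tGO:00012",
    "glu7\tGO:56785",
]
counts, line_count = count_values(gene_go, 1)
for value, n in counts.items():
    print(f"{value}\t{n}")
print(f"Printed number of occurrences for {len(counts)} values in {line_count} lines.")
```

Calling `count_values(gene_go, 0)` instead counts the GO terms per gene, as in the note above.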
Find sets of rows that have the same value in column m. Then get the sum of
column n for those sets of rows. Each line of the output will have a value, a
tab, and the sum for that value. Values will be printed in the order of their
first appearance.
Example: Calculate the sum of the length of the exons for each gene.
Input file (exon_length.tab):
    Hsp90 exon1 300
    Hsp90 exon2 100
    Hsp90 exon3 250
    apo1 exon1 100
    apo1 exon2 350

Output file (gene_length.tab):
    Hsp90 650
    apo1 450

Screen output:
    Printed sum of column 2 for each value in column 0
    Found 2 values in 5 lines
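The grouping step (same value in column m, sum of column n) can be sketched in Python like this. It is an illustration only; `sum_by_group`, `key_col`, and `value_col` are hypothetical names standing in for the tool's columns m and n.

```python
def to_number(text):
    """Return text as a float, or 0.0 if float() cannot parse it."""
    try:
        return float(text)
    except ValueError:
        return 0.0


def sum_by_group(lines, key_col, value_col):
    """Sum `value_col` over all lines that share the same value in `key_col`.

    Returns a dict of sums in order of first appearance of each key.
    """
    sums = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        key = fields[key_col]
        sums[key] = sums.get(key, 0.0) + to_number(fields[value_col])
    return sums


exons = [
    "Hsp90\texon1\t300",
    "Hsp90\texon2\t100",
    "Hsp90\texon3\t250",
    "apo1\texon1\t100",
    "apo1\texon2\t350",
]
for gene, total in sum_by_group(exons, 0, 2).items():
    print(f"{gene}\t{total:.0f}")
```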
Simply give a count of the number of lines in a file.
(The result is printed to an output file as well as the screen.)
Example: Count how many genes are in file gene_list.txt by running the
above script.

Input file (gene_list.txt):
    Hsp90 Heat shock protein
    apo1 apoptosis-related protein
    glu7 Glucose metabolism

Output file (gene_count.txt):
    Counted 3 lines

Screen output:
    Counted 3 lines
UNIX/Mac users: also check out the wc command.
Counts the number of records (and, for convenience, total sequence length)
in a FASTA file.
(The result is printed to an output file as well as the screen.)
Example: See how many sequences are in seqs.fasta.
Input file (seqs.fasta):
    >CG123 A small sequence
    ACGTTGCA
    GTTACCAG
    >EG12
    ACCGGA
    >DG124 A smaller sequence
    GTTACCAG

Output file (seqs_count.txt):
    Read 3 FASTA records in 7 lines. Total sequence length: 30

Screen output:
    Read 3 FASTA records in 7 lines. Total sequence length: 30
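A short Python sketch of the same count is shown below. It is not the tool's code; `count_fasta` is a hypothetical name, and the sketch assumes that every line starting with `>` begins a new record and that all other lines are sequence.

```python
def count_fasta(lines):
    """Return (records, line_count, total_sequence_length) for FASTA lines."""
    records = 0
    line_count = 0
    seq_length = 0
    for line in lines:
        line_count += 1
        line = line.strip()
        if line.startswith(">"):
            records += 1          # header line starts a new record
        else:
            seq_length += len(line)  # sequence line
    return records, line_count, seq_length


fasta = [
    ">CG123 A small sequence",
    "ACGTTGCA",
    "GTTACCAG",
    ">EG12",
    "ACCGGA",
    ">DG124 A smaller sequence",
    "GTTACCAG",
]
records, line_count, seq_length = count_fasta(fasta)
print(f"Read {records} FASTA records in {line_count} lines. "
      f"Total sequence length: {seq_length}")
```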
As always, when in doubt, check your output files after each step!
All text strings that are not obviously a number are considered to
have a value of 0. (E.g., 3 + 2 + hello + 7 = 12.) This can happen if
you have some text data (like N/A) in a number field.
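The 3 + 2 + hello + 7 example can be reproduced with a small Python sketch. The helper `to_number` is hypothetical and only approximates the behavior described here: it treats anything `float()` cannot parse as 0. Perl's own rules differ in some cases (for instance, Perl reads a string such as `3abc` as 3, while this sketch would treat it as 0).

```python
def to_number(text):
    """Return text as a float, or 0.0 if float() cannot parse it."""
    try:
        return float(text)
    except ValueError:
        return 0.0


values = ["3", "2", "hello", "7"]
total = sum(to_number(v) for v in values)
print(total)  # 12.0: "hello" contributes 0

# Listing the entries that were silently counted as 0 can help spot
# text such as N/A in a number field.
suspicious = [v for v in values if to_number(v) == 0.0 and v.strip() not in ("0", "0.0")]
print(suspicious)  # ['hello']
```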
Scriptome tools are in blue or green boxes. Cut and paste the text of the
tool into a terminal window. Then edit the line as needed.
Things that will often need to be edited are highlighted in
red. Input and output filenames will almost always need to be changed.
All scripts that work on tabular data assume the data is tab-separated.
Use a Change script to change, e.g., comma-separated data
to tab-separated before using these scripts.
When working with tabular data, remember that the first column is called
column 0, the second column is column 1, etc. The last column can
also be referred to as column -1, the second-to-last column as column -2, etc.
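Python lists happen to use the same numbering, so the convention can be illustrated with a brief sketch. The function `get_column` is a hypothetical helper, not part of the Scriptome tools.

```python
def get_column(line, col):
    """Return column `col` of a tab-separated line; negative numbers count from the end."""
    return line.rstrip("\n").split("\t")[col]


line = "SEQ1\tFirst seq\tACTGACTG"
print(get_column(line, 0))   # first column: SEQ1
print(get_column(line, 1))   # second column: First seq
print(get_column(line, -1))  # last column: ACTGACTG
print(get_column(line, -2))  # second-to-last column: First seq
```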