The scripts in this section select certain information from an input file.
Some scripts choose lines that meet a certain criterion; others choose
certain columns from tabular data.
To use a script, cut and paste the code from the light green or blue box into a
terminal window, change the bold, red text as needed, and hit Enter.
See More Information for notes on using these tools.
Print all lines from a file, but print only certain columns from each line.
Print one or more columns from tab-separated data. The column numbers
can be given in any order, so this tool can also be used to rearrange
the column order.
Example: Get second, last, and third column (subject sequence, score and
percent identity) from blast tabular output file all_cols to get
some_cols_chosen. (Note: some columns removed from input rows for simplicity.)
Input file (all_cols):
NP_438174.1 NP_110118.1 43.41 26 331 31 331 1e-61 228
NP_438174.1 NP_110197.1 33.33 251 319 157 214 0.51 26.6
NP_438174.1 NP_110131.1 31.75 202 257 33 94 1.9 24.6
NP_438174.1 NP_110177.1 21.67 207 326 321 433 2.5 24.3
Output file (some_cols_chosen):
NP_110118.1 228 43.41
NP_110197.1 26.6 33.33
NP_110131.1 24.6 31.75
NP_110177.1 24.3 21.67
Screen output:
Chose columns 1, -1, 2 for 4 lines
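As a rough illustration of the column-choosing logic described above, here is a minimal Python sketch. It is not the Scriptome tool itself; the function name `choose_columns` is hypothetical, and the input is assumed to be tab-separated with 0-based column numbers (negative numbers count from the end).

```python
def choose_columns(lines, cols):
    """Return each tab-separated line reduced to the columns in `cols`, in that order."""
    chosen = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Python list indexing accepts negative numbers, so -1 is the last column.
        chosen.append("\t".join(fields[c] for c in cols))
    return chosen


rows = [
    "NP_438174.1\tNP_110118.1\t43.41\t26\t331\t31\t331\t1e-61\t228",
    "NP_438174.1\tNP_110197.1\t33.33\t251\t319\t157\t214\t0.51\t26.6",
]
for row in choose_columns(rows, [1, -1, 2]):
    print(row)
```

Because the columns are emitted in the order given, the same function can rearrange columns as well as select them.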
Example: Delete first, fourth through seventh column and last column
from a blast tabular output. Retain just subject sequence ID,
percent identity, and e-value.
(Note: some columns removed from input rows for clarity.) Run the following
script on all_cols to get some_cols_deleted.
Input file (all_cols):
NP_438174.1 NP_110118.1 43.41 26 331 31 331 1e-61 228
NP_438174.1 NP_110197.1 33.33 251 319 157 214 0.51 26.6
NP_438174.1 NP_110131.1 31.75 202 257 33 94 1.9 24.6
NP_438174.1 NP_110177.1 21.67 207 326 321 433 2.5 24.3
Output file (some_cols_deleted):
NP_110118.1 43.41 1e-61
NP_110197.1 33.33 0.51
NP_110131.1 31.75 1.9
NP_110177.1 21.67 2.5
Screen output:
Deleted columns 0, 3, 4, 5, 6, -1 for 4 lines
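Deleting columns is the mirror image of choosing them. The sketch below is a Python approximation under the same assumptions (tab-separated input, 0-based columns); `delete_columns` is a hypothetical name, not the name of the actual tool.

```python
def delete_columns(lines, cols):
    """Return each tab-separated line with the columns listed in `cols` removed."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Convert negative column numbers (-1 = last) to positive positions.
        drop = {c % len(fields) for c in cols}
        kept.append("\t".join(f for i, f in enumerate(fields) if i not in drop))
    return kept


row = "NP_438174.1\tNP_110118.1\t43.41\t26\t331\t31\t331\t1e-61\t228"
print(delete_columns([row], [0, 3, 4, 5, 6, -1])[0])
```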
To choose lines containing certain values in certain columns,
see Choose lines matching a specific text string or pattern.
Choose lines containing a specific string, like "blah" or "23" (the latter
will also match 123456), or lines matching a pattern, like
"line begins with the letters 'CG'".
Choose lines containing a text string, which should be placed inside the
q{} in the tool. If you input "bc", "abcd" will match.
If you input "23", "1.234" will match. Warning: Add a backslash to
match ' or }. I.e., to match "Joe's", type "Joe\'s".
Example: Get certain description lines from a FASTA file. Run the above
script on a file called a.fsa to get descs, which will have all lines
containing a '>CG'.
Input file (a.fsa):
>CG123
ACGTTGCA
GTTACCAG
>DG124
GTTACCAG
Output file (descs):
>CG123
Screen output:
Chose 1 lines with string [>CG] out of 5 total lines.
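The substring behaviour described above ("bc" matches "abcd", "23" matches "1.234") can be sketched in Python as follows. This is only an illustration with a hypothetical function name; it does a plain substring test, so the quoting rules for q{} in the actual tool do not apply here.

```python
def choose_lines_with_string(lines, text):
    """Return the lines that contain `text` anywhere (plain substring, not a pattern)."""
    return [line for line in lines if text in line]


fasta = [">CG123", "ACGTTGCA", "GTTACCAG", ">DG124", "GTTACCAG"]
hits = choose_lines_with_string(fasta, ">CG")
print(f"Chose {len(hits)} lines with string [>CG] out of {len(fasta)} total lines.")
```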
Choose lines not containing an exact text string.
Choose lines where the string in a given column equals a given text string.
(If 'bc' is given, 'abcd' will NOT match.)
Choose lines where the value in a given column equals a certain number. (If '23'
is given, '1234' will NOT match.)
Remove any lines in a file that are completely empty, as well as lines that
have spaces or tabs in them, but no other text.
Example: Remove empty lines from a FASTA file. Run the above script on a
file called a.fsa to get noblanks, which will have all non-blank lines.
Input file (a.fasta):
>CG123
ACGTTGCA
GTTACCAG
>DG124
GTTACCAG
Output file (noblanks.fasta):
>CG123
ACGTTGCA
GTTACCAG
>DG124
GTTACCAG
Screen output:
Removed 2 blank/whitespace lines out of 7 total lines.
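A minimal Python sketch of the blank-line filter, for illustration only. The function name is hypothetical, and the positions of the two blank lines in the sample data are made up for the demonstration.

```python
def remove_blank_lines(lines):
    """Drop lines that are empty or contain only spaces and tabs."""
    return [line for line in lines if line.strip()]


# Illustrative input: one empty line and one line holding only a space and a tab.
fasta = [">CG123", "ACGTTGCA", "", "GTTACCAG", " \t", ">DG124", "GTTACCAG"]
clean = remove_blank_lines(fasta)
removed = len(fasta) - len(clean)
print(f"Removed {removed} blank/whitespace lines out of {len(fasta)} total lines.")
```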
Choose all lines in a tab-separated file where the value in a given column is
numerically greater than a given limit.
Example: Filter a previously-run BLAST tabular output all.blast. Pick only
hits with bit score greater than 80 and put them in only_big.blast. Score is
in the last (-1) column. (Note: some columns removed from input rows for
simplicity.)
Input file (all.blast):
NP_438174.1 NP_110118.1 43.41 26 331 31 331 1e-61 228
NP_438174.1 NP_110197.1 33.33 251 319 157 214 0.51 26.6
NP_438174.1 NP_110131.1 31.75 202 257 33 94 1.9 24.6
NP_438174.1 NP_110177.1 21.67 207 326 321 433 2.5 24.3
Output file (only_big.blast):
NP_438174.1 NP_110118.1 43.41 26 331 31 331 1e-61 228
Screen output:
Chose 1 lines out of 4.
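The numeric filter can be sketched in Python like this. It is an approximation, not the tool's own code: `choose_lines_by_limit` is a hypothetical name, and every value in the chosen column is assumed to parse as a number. Passing a different comparison gives the "less than" variant.

```python
import operator


def choose_lines_by_limit(lines, col, limit, compare=operator.gt):
    """Return lines whose value in column `col` satisfies compare(value, limit)."""
    chosen = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if compare(float(fields[col]), limit):
            chosen.append(line)
    return chosen


rows = [
    "NP_438174.1\tNP_110118.1\t43.41\t26\t331\t31\t331\t1e-61\t228",
    "NP_438174.1\tNP_110197.1\t33.33\t251\t319\t157\t214\t0.51\t26.6",
    "NP_438174.1\tNP_110131.1\t31.75\t202\t257\t33\t94\t1.9\t24.6",
    "NP_438174.1\tNP_110177.1\t21.67\t207\t326\t321\t433\t2.5\t24.3",
]
big = choose_lines_by_limit(rows, -1, 80)                   # bit score > 80
small = choose_lines_by_limit(rows, -2, 1e-10, operator.lt)  # e-value < 1e-10
print(f"Chose {len(big)} lines out of {len(rows)}.")
```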
Choose all lines in a tab-separated file where the value in a given column is
numerically less than a given limit.
Example: Filter a previously-run BLAST tabular output all.blast. Pick only
hits with e-value less than 1e-10 and put them in only_small.blast.
E-value is in the second to last (-2) column. (Note: some columns removed from
input rows for simplicity.)
Input file (all.blast):
NP_438174.1 NP_110118.1 43.41 26 331 31 331 1e-61 228
NP_438174.1 NP_110197.1 33.33 251 319 157 214 0.51 26.6
NP_438174.1 NP_110131.1 31.75 202 257 33 94 1.9 24.6
NP_438174.1 NP_110177.1 21.67 207 326 321 433 2.5 24.3
Output file (only_small.blast):
NP_438174.1 NP_110118.1 43.41 26 331 31 331 1e-61 228
Screen output:
Chose 1 lines out of 4.
Find the maximum value in a given column, and print all lines that have that
value in that column. (There may be more than one line with the maximum value.)
Example: Given a list of genes and fold changes in fold_change.tab,
print the genes with the greatest fold change to col_max.tab. In this case,
two genes have the same high fold change.
Input file (fold_change.tab):
COX2 3.1
HSP90 1.7
FGFR2 1.1
AGFA 3.1
PERU 1.05
Output file (col_max.tab):
COX2 3.1
AGFA 3.1
Screen output:
Chose 2 lines out of 5
Maximum value: 3.1
Example: Set $col=0 to find the maximum in a simple list of numbers.
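For readers who want to see the logic of the maximum-value tool spelled out, here is a small Python sketch. The function name is hypothetical and the column is assumed to hold only numbers; swapping `max` for `min` gives the minimum-value variant.

```python
def choose_lines_with_max(lines, col):
    """Return (lines holding the maximum value in column `col`, that maximum)."""
    values = [float(line.rstrip("\n").split("\t")[col]) for line in lines]
    if not values:
        return [], None
    top = max(values)
    # Keep every line that ties for the maximum, in file order.
    return [line for line, value in zip(lines, values) if value == top], top


genes = ["COX2\t3.1", "HSP90\t1.7", "FGFR2\t1.1", "AGFA\t3.1", "PERU\t1.05"]
best, top = choose_lines_with_max(genes, 1)
print(f"Chose {len(best)} lines out of {len(genes)}")
print(f"Maximum value: {top}")
```

With a plain list of numbers (one per line), column 0 is the only column, which is why the example above sets the column to 0.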
Find the minimum value in a given column, and print all lines that have that
value in that column. (There may be more than one line with the minimum value.)
Example: TODO
Example: Set $col=0 to find the minimum in a simple list of numbers.
For each "name" (value in column m), find the maximum "score" (value in
column n). Print all lines pairing each name with its highest score. (There may be
more than one line for a given name that has the highest score.) The names
will be printed in order, based on when they first appear in the file. For
each name, the highest-scoring lines will be printed in the order they appear
in the file.
Example: Run the script on a BLAST hit table to find the BLAST hit for each
query (query names in the first column) with the highest percent identity
(in the third column).
TODO
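The ordering rules described above (names in order of first appearance, tied lines in file order) can be sketched in Python as below. This is an illustration with a hypothetical function name, assuming tab-separated input and numeric scores; the sample rows are the target/hit/score lines used elsewhere on this page, not BLAST output.

```python
def choose_lines_with_max_per_name(lines, name_col, score_col):
    """For each name, keep every line that has that name's highest score."""
    best = {}  # name -> (best score so far, lines with that score); keeps first-seen order
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        name, score = fields[name_col], float(fields[score_col])
        if name not in best or score > best[name][0]:
            best[name] = (score, [line])
        elif score == best[name][0]:
            best[name][1].append(line)
    return [line for _, kept in best.values() for line in kept]


hits = ["ap23\tap24\t25", "ap23\tmmm\t50", "CG2500\tmmm\t10", "cxb7\tgly1\t5"]
for line in choose_lines_with_max_per_name(hits, 0, 2):
    print(line)
```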
For each "name" (value in column m), find the minimum "score" (value in
column n). Print all lines pairing each name with its lowest score. (There may be
more than one line for a given name that has the lowest score.) The names
will be printed in order, based on when they first appear in the file. For
each name, the lowest-scoring lines will be printed in the order they appear
in the file.
Example: Run the script on a BLAST hit table to find the BLAST hit for each
query (query names in the first column) with the lowest e-value (in the second
to last column).
TODO
Print all lines for which column m equals column n. ".1" and "0.1" are
not considered equal. (See choose_lines_col_m_equals_col_n_num.)
Example: TODO
Print all lines for which column m equals column n. ".1" and "0.1" are
considered equal. (See choose_lines_col_m_equals_col_n_alpha.)
Example: TODO
Print all lines for which column m has a higher value than column n.
Example: TODO
Print the first n lines from a file. (Or just use the UNIX command "head".)
Example: TODO
Print the last n lines from a file. (Or just use the UNIX command "tail".)
Example: TODO
Remove the first n lines from a file.
Example: Remove headers from tabular data. TODO
Print lines from a file, removing duplicates. Only print the first
occurrence of any given line. Note: having the same first
column in tabular data is not the same as being a duplicate.
Example: Given a list of genes, remove any duplicates. Run the
above script on file genes to get a file called unique:
Input file (genes):
ap23
ap23
CG2500
cxb7
Output file (unique):
ap23
CG2500
cxb7
Screen output:
Chose 3 unique lines out of 4 total lines.
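A minimal Python sketch of first-occurrence de-duplication, for illustration only (the function name is hypothetical). Lines are compared as whole strings, so two lines that share a first column but differ elsewhere are both kept.

```python
def choose_unique_lines(lines):
    """Return lines in file order, keeping only the first occurrence of each."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


genes = ["ap23", "ap23", "CG2500", "cxb7"]
unique = choose_unique_lines(genes)
print(f"Chose {len(unique)} unique lines out of {len(genes)} total lines.")
```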
Print lines from a tab-separated file, removing duplicates, where a duplicate
is defined as a repeated value in a given column, even if other parts of the
line are different. Only print the first line where each value appears in that
given column. Note: 0.1 and .1 are not considered duplicates.
Example: Given a list containing target genes, hit genes, and scores,
take only one hit per target gene. (If you only want the best hit
for each target, use Choose lines with highest score for each name (choose_lines_with_max_per_name)).
Run the above script on file hits to get a file called unique_hits:
Input file (hits):
ap23 ap24 25
ap23 mmm 50
CG2500 mmm 10
cxb7 gly1 5
Output file (unique_hits):
ap23 ap24 25
CG2500 mmm 10
cxb7 gly1 5
Screen output:
Chose 3 unique lines out of 4 total lines.
Removed duplicates in column 0.
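De-duplicating on a single column follows the same pattern, except that only the chosen column is remembered. The Python sketch below uses a hypothetical function name and compares column values as text, which is why "0.1" and ".1" count as different values.

```python
def choose_unique_by_column(lines, col):
    """Keep only the first line for each distinct text value in column `col`."""
    seen = set()
    unique = []
    for line in lines:
        key = line.rstrip("\n").split("\t")[col]
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique


hits = ["ap23\tap24\t25", "ap23\tmmm\t50", "CG2500\tmmm\t10", "cxb7\tgly1\t5"]
unique_hits = choose_unique_by_column(hits, 0)
print(f"Chose {len(unique_hits)} unique lines out of {len(hits)} total lines.")
```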
Print FASTA records (descriptions and sequences) from a file. Print
only the first record for a given FASTA ID (everything on the description
line up to the first space). So if a record has an already-printed
ID with a different text description, it will still not be printed.
Example: Remove duplicate FASTAs. Run the
above script on file a.fsa to get a file called unique.fsa:
Input file (a.fsa):
>CG123
ACGTTGCA
GTTACCAG
>DG124
GTTACCAG
>CG123 second time
ACGTTGCA
GTTACCAG
Output file (unique.fsa):
>CG123
ACGTTGCA
GTTACCAG
>DG124
GTTACCAG
Screen output:
Chose 2 unique FASTA records out of 3 total.
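The record-level rule (the ID is the text after '>' up to the first space, and only the first record per ID is kept) can be sketched in Python as follows. The function name is hypothetical, and this version splits the description on any whitespace, which is a small simplification.

```python
def choose_unique_fasta(lines):
    """Keep only the first FASTA record (description plus sequence lines) per ID."""
    seen = set()
    kept = []
    keep_record = False
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            fasta_id = parts[0] if parts else ""
            keep_record = fasta_id not in seen
            seen.add(fasta_id)
        if keep_record:
            kept.append(line)
    return kept


fasta = [
    ">CG123", "ACGTTGCA", "GTTACCAG",
    ">DG124", "GTTACCAG",
    ">CG123 second time", "ACGTTGCA", "GTTACCAG",
]
for line in choose_unique_fasta(fasta):
    print(line)
```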
See the Merge page for more examples of this kind of choosing.
Given a table of tab-separated lines, choose only the rows where a given
column contains values from a list.
This is just a special case of merge_lines_based_on_shared_column. Use column 0 for the list - i.e., treat the list as a table with just one column.
Given a list of FASTA IDs and a FASTA file, print out the FASTA sequences
for the given IDs. Sequences will be printed out only once, even if the
ID/sequence appears more than once in either file. It is possible that
some or all of the IDs in the list won't be found in the FASTA file.
A '>' and/or description in the ID list will be ignored; only the ID is read.
Example: Run the above script on the list id_list and the FASTA file
a.fsa to get found.fsa.
Input list file (id_list):
CG123 A sequence I want
CGnot_here
Input FASTA file (a.fsa):
>CG123 First time
ACGTTGCA
GTTACCAG
>DG124
GTTACCAG
>CG123 second time
ACGTTGCA
GTTACCAG
Output file (found.fsa):
>CG123 First time
ACGTTGCA
GTTACCAG
Screen output:
Searched 3 FASTA records.
Found 1 IDs out of 2 in the ID list.
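Here is a Python sketch of the ID-list lookup, written under the assumptions stated above: a leading '>' and any description in the ID list are ignored, and each matching ID is printed once. The function names `read_ids` and `choose_fasta_by_id` are hypothetical.

```python
def read_ids(id_lines):
    """Collect IDs from a list file, ignoring a leading '>' and any description."""
    ids = set()
    for line in id_lines:
        parts = line.lstrip(">").split()
        if parts:
            ids.add(parts[0])
    return ids


def choose_fasta_by_id(id_lines, fasta_lines):
    """Return (lines of the first record for each wanted ID, set of IDs found)."""
    wanted = read_ids(id_lines)
    found = set()
    kept = []
    keep_record = False
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            fasta_id = parts[0] if parts else ""
            keep_record = fasta_id in wanted and fasta_id not in found
            if keep_record:
                found.add(fasta_id)
        if keep_record:
            kept.append(line)
    return kept, found


id_list = ["CG123 A sequence I want", "CGnot_here"]
fasta = [
    ">CG123 First time", "ACGTTGCA", "GTTACCAG",
    ">DG124", "GTTACCAG",
    ">CG123 second time", "ACGTTGCA", "GTTACCAG",
]
records, found = choose_fasta_by_id(id_list, fasta)
print(f"Found {len(found)} IDs out of {len(read_ids(id_list))} in the ID list.")
```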
Matching is complicated, and these scripts can't hide all of the complexity.
Certain characters may do strange things to Perl matching.
As always, when in doubt, check your output files after each step.
Duplicates will not be removed, unless noted otherwise.
Use tools to Remove duplicate lines or records from a file.
Lines will be printed out in the order that they appear in the input
files, unless noted otherwise.
When comparing numerically, .1 and 0.1 are equal, but not when comparing
"alphabetically". On the other hand, all text strings have a value of
0 when comparing numerically ("a" = "blah" = 0).
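The difference between the two kinds of comparison can be made concrete with a short Python sketch. It only approximates the behaviour described above: the helper names are hypothetical, and treating every non-numeric string as 0 is a simplification of how Perl converts text to numbers.

```python
def as_number(text):
    """Convert text to a number, treating non-numeric text as 0 (simplified)."""
    try:
        return float(text)
    except ValueError:
        return 0.0


def equal_as_text(a, b):
    """'Alphabetical' comparison: the strings must be identical."""
    return a == b


def equal_as_numbers(a, b):
    """Numeric comparison: compare the values the strings represent."""
    return as_number(a) == as_number(b)


print(equal_as_text(".1", "0.1"))      # different strings
print(equal_as_numbers(".1", "0.1"))   # same numeric value
print(equal_as_numbers("a", "blah"))   # both count as 0 numerically
```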
Scriptome tools are in blue or green boxes. Cut and paste the text of the
tool into a terminal window. Then edit the line as needed.
Things that will often need to be edited are highlighted in
red. Input and output filenames will almost always need to be changed.
All scripts that work on tabular data assume the data is tab-separated.
Use a Change script to change, e.g., comma-separated data
to tab-separated before using these scripts.
When working with tabular data, remember that the first column is called
column 0, the second column is column 1, etc. The last column can
also be referred to as column -1, second-to-last column is -2, etc.