Choose data from a file

The scripts in this section select certain information from an input file. Some scripts choose lines that meet a certain criterion; others choose certain columns from tabular data.

To use a script, cut and paste the code from the light green or blue box into a terminal window, change the bold, red text as needed, and hit Enter.

See More Information for notes on using these tools.

Choose columns

Print all lines from a file, but print only certain columns from each line.

Choose columns. Optionally reorder them (choose_cols)

Print one or more columns from tab-separated data. The column numbers can be given in any order, so this tool can also be used to rearrange the column order.

@cols Column(s) to choose (in desired order)
Input file(s)
Output file

perl -e ' @cols=(1, -1, 2); while(<>) { s/\r?\n//; @F=split /\t/, $_; print join("\t", @F[@cols]), "\n" } warn "\nChose columns ", join(", ", @cols), " for $. lines\n\n" ' all_cols > some_cols_chosen

Example: Get second, last, and third column (subject sequence, score and percent identity) from blast tabular output file all_cols to get some_cols_chosen. (Note: some columns removed from input rows for simplicity.)

Input file
(all_cols)
 NP_438174.1    NP_110118.1    43.41   26       331     31      331     1e-61    228
 NP_438174.1    NP_110197.1    33.33   251      319     157     214     0.51    26.6
 NP_438174.1    NP_110131.1    31.75   202      257     33      94      1.9     24.6
 NP_438174.1    NP_110177.1    21.67   207      326     321     433     2.5     24.3
Output file
(some_cols_chosen)
 NP_110118.1    228     43.41
 NP_110197.1    26.6    33.33
 NP_110131.1    24.6    31.75
 NP_110177.1    24.3    21.67
Screen Output
 Chose columns 1, -1, 2 for 4 lines

Delete columns (choose_cols_to_delete)

@del_col Column(s) to delete
Input file(s)
Output file

perl -e ' @del_col=(0, 3..6, -1); while(<>) { s/\r?\n//; @F=split /\t/, $_; %seen=(); @idx=grep { ! $seen{$_}++ } map { $_ < 0 ? $_ + @F : $_ } @del_col; foreach $col (sort { $b <=> $a } @idx) { splice @F, $col, 1 }; print join("\t", @F),"\n"; } warn "\nDeleted columns ", join(", ", @del_col), " for $. lines\n\n" ' all_cols > some_cols_deleted

Example: Delete the first column, the fourth through seventh columns, and the last column from a blast tabular output. Retain just subject sequence ID, percent identity, and e-value. (Note: some columns removed from input rows for clarity.) Run the above script on all_cols to get some_cols_deleted.

Input file
(all_cols)
 NP_438174.1    NP_110118.1     43.41   26      331     31      331     1e-61   228
 NP_438174.1    NP_110197.1     33.33   251     319     157     214     0.51    26.6
 NP_438174.1    NP_110131.1     31.75   202     257     33      94      1.9     24.6
 NP_438174.1    NP_110177.1     21.67   207     326     321     433     2.5     24.3
Output file
(some_cols_deleted)
 NP_110118.1    43.41   1e-61
 NP_110197.1    33.33   0.51
 NP_110131.1    31.75   1.9
 NP_110177.1    21.67   2.5
Screen Output
 Deleted columns 0, 3, 4, 5, 6, -1 for 4 lines

To choose lines containing certain values in certain columns, see Choose lines matching a specific text string or pattern.

Choose lines matching a specific text string or pattern

Choose lines containing a specific string, like "blah" or "23" (The latter will also match 123456), or lines matching a pattern, like "line begins with the letters 'CG'".

Choose lines containing a given text string (choose_lines_matching_text)

Choose lines containing a text string (which should be placed inside the q{} in the tool). If you input "bc", "abcd" will match. If you input "23", "1.234" will match. Warning: to match a lone }, add a backslash (type \}). A single quote cannot be backslash-escaped inside the single-quoted shell command; to match "Joe's", type Joe'\''s.

$string Text to match
Input file(s)
Output file

perl -e ' $string=q{>CG}; $count=0; while(<>) { if (/\Q$string\E/) { print $_; $count++ } } warn "\nChose $count lines with string [$string] out of $. total lines.\n"; ' a.fsa > descs

Example: Get certain description lines from a FASTA file. Run the above script on a file called a.fsa to get descs, which will have all lines containing a '>CG'.

Input file
(a.fsa)
 >CG123
 ACGTTGCA
 GTTACCAG
 >DG124
 GTTACCAG
Output file
(descs)
 >CG123
Screen Output
 Chose 1 lines with string [>CG] out of 5 total lines.

TODO: Choose lines NOT containing a given text string (choose_lines_not_matching_text)

Choose lines not containing an exact text string.

TODO: Choose lines where a given column is a given text string (choose_lines_col_equals_text)

Choose lines where the string in a given column equals a given text string. (If 'bc' is given, 'abcd' will NOT match.)

TODO: Choose lines where a given column is a given number (choose_lines_col_equals_number)

Choose lines where the value in a given column equals a certain number. (If '23' is given, '1234' will NOT match.)

Remove empty lines or lines containing just spaces (choose_nonempty_lines)

Remove any lines in a file that are completely empty, as well as lines that have spaces or tabs in them, but no other text.

Input file(s)
Output file

perl -e ' $count=0; while(<>) { if (/^\s*$/) { $count++ } else { print $_ } } warn "\nRemoved $count blank/whitespace lines out of $. total lines.\n\n" ' a.fasta > noblanks.fasta

Example: Remove empty lines from a FASTA file. Run the above script on a file called a.fasta to get noblanks.fasta, which will have all non-blank lines.

Input file
(a.fasta)
 >CG123
 ACGTTGCA
         
 GTTACCAG
 >DG124
 
 GTTACCAG
Output file
(noblanks.fasta)
 >CG123
 ACGTTGCA
 GTTACCAG
 >DG124
 GTTACCAG
Screen Output
 Removed 2 blank/whitespace lines out of 7 total lines.

Choose lines where a column matches some mathematical criterion

Choose lines where a given column is greater than a given limit (choose_lines_col_more_than_limit)

Choose all lines in a tab-separated file where the value in a given column is numerically greater than a given limit.

$col Column to limit
$limit Minimum value in column
Input file(s)
Output file

perl -e ' $col=-1; $limit=80; $count=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; if ($F[$col] > $limit) { $count++; print "$_\n" } } warn "\nChose $count lines out of $..\n\n" ' all.blast > only_big.blast

Example: Filter a previously-run BLAST tabular output all.blast. Pick only hits with bit score greater than 80 and put them in only_big.blast. Score is in the last (-1) column. (Note: some columns removed from input rows for simplicity.)

Input file
(all.blast)
 NP_438174.1    NP_110118.1     43.41   26      331     31      331     1e-61   228
 NP_438174.1    NP_110197.1     33.33   251     319     157     214     0.51    26.6
 NP_438174.1    NP_110131.1     31.75   202     257     33      94      1.9     24.6
 NP_438174.1    NP_110177.1     21.67   207     326     321     433     2.5     24.3
Output file
(only_big.blast)
 NP_438174.1    NP_110118.1     43.41   26      331     31      331     1e-61   228
Screen Output
 Chose 1 lines out of 4.

Choose lines where a given column is less than a given limit (choose_lines_col_less_than_limit)

Choose all lines in a tab-separated file where the value in a given column is numerically less than a given limit.

$col Column to limit
$limit Maximum value in column
Input file(s)
Output file

perl -e ' $col=-2; $limit=1e-10; $count=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; if ($F[$col] < $limit) { $count++; print "$_\n" } } warn "\nChose $count lines out of $..\n\n" ' all.blast > only_small.blast

Example: Filter a previously-run BLAST tabular output all.blast. Pick only hits with e-value less than 1e-10 and put them in only_small.blast. E-value is in the second to last (-2) column. (Note: some columns removed from input rows for simplicity.)

Input file
(all.blast)
 NP_438174.1    NP_110118.1     43.41   26      331     31      331     1e-61   228
 NP_438174.1    NP_110197.1     33.33   251     319     157     214     0.51    26.6
 NP_438174.1    NP_110131.1     31.75   202     257     33      94      1.9     24.6
 NP_438174.1    NP_110177.1     21.67   207     326     321     433     2.5     24.3
Output file
(only_small.blast)
 NP_438174.1    NP_110118.1     43.41   26      331     31      331     1e-61   228
Screen Output
 Chose 1 lines out of 4.

Choose lines with highest value in a column (choose_lines_with_col_max)

Find the maximum value in a given column, and print all lines that have that value in that column. (There may be more than one line with the maximum value.)

$col Column to find maximum in
Input file(s)
Output file

perl -e ' $col=1; while(<>) { s/\r?\n//; @F=split /\t/, $_; $s = $F[$col]; if (! defined($max) || $s > $max) { $max = $s; @best = () }; if ($s == $max) { push @best, "$_\n" }; } print @best; warn "\nChose ", scalar(@best), " lines out of $.\nMaximum value: $max\n\n"; ' fold_change.tab > col_max.tab

Example: Given a list of genes and fold changes in fold_change.tab, print the genes with the greatest fold change to col_max.tab. In this case, two genes have the same high fold change.

Input file
(fold_change.tab)
 COX2   3.1
 HSP90  1.7
 FGFR2  1.1
 AGFA   3.1
 PERU   1.05
Output file
(col_max.tab)
 COX2   3.1
 AGFA   3.1
Screen Output
 Chose 2 lines out of 5
 Maximum value: 3.1

Example: Set $col=0 to find the maximum in a simple list of numbers.

Choose lines with lowest value in a column (choose_lines_with_col_min)

Find the minimum value in a given column, and print all lines that have that value in that column. (There may be more than one line with the minimum value.)

$col Column to find minimum in
Input file(s)
Output file

perl -e ' $col=1; while(<>) { s/\r?\n//; @F=split /\t/, $_; $s = $F[$col]; if (! defined($min) || $s < $min) { $min = $s; @best = () }; if ($s == $min) { push @best, "$_\n" }; } print @best; warn "Chose ", scalar(@best), " lines out of $.\nMinimum value: $min\n\n"; ' blast.tab > col_min.tab

Example: TODO

Example: Set $col=0 to find the minimum in a simple list of numbers.

Choose lines with highest score for each name (choose_lines_with_max_per_name)

For each "name" (value in column m), find the maximum "score" (value in column n). Print every line that has the highest score for its name. (There may be more than one line for a given name that has the highest score.) The names will be printed in order, based on when they first appear in the file. For each name, the highest-scoring lines will be printed in the order they appear in the file.

$name_col Column with identifiers / names
$score_col Column with scores / values
Input file(s)
Output file

perl -e ' $name_col=0; $score_col=2; while(<>) { s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($max{$n})) { push @names, $n }; if (! exists($max{$n}) || $s > $max{$n}) { $max{$n} = $s; $best{$n} = () }; if ($s == $max{$n}) { $best{$n} .= "$_\n" }; } for $n (@names) { print $best{$n} } ' blast.tab > best_hits.tab

Example: Run the script on a BLAST hit table to find the BLAST hit for each query (query names in the first column) with the highest percent identity (in the third column).

TODO

Choose lines with lowest score for each name (choose_lines_with_min_per_name)

For each "name" (value in column m), find the minimum "score" (value in column n). Print every line that has the lowest score for its name. (There may be more than one line for a given name that has the lowest score.) The names will be printed in order, based on when they first appear in the file. For each name, the lowest-scoring lines will be printed in the order they appear in the file.

$name_col Column with identifiers / names
$score_col Column with scores / values
Input file(s)
Output file

perl -e ' $name_col=0; $score_col=-2; while (<>) { s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($min{$n})) { push @names, $n }; if (! exists($min{$n}) || $s < $min{$n}) { $min{$n} = $s; $best{$n} = () }; if ($s == $min{$n}) { $best{$n} .= "$_\n"; } } for $n (@names) { print $best{$n}; } ' blast.tab > best_hits.tab

Example: Run the script on a BLAST hit table to find the BLAST hit for each query (query names in the first column) with the lowest e-value (in the second to last column).

TODO

Choose lines by comparing values in two columns

Choose lines where column m has the same text string as column n (choose_lines_col_m_equals_col_n_alpha)

Print all lines for which column m equals column n. ".1" and "0.1" are not considered equal. (See choose_lines_col_m_equals_col_n_num.)

$colm Text column to compare
$coln Other text column to compare
Input file(s)
Output file

perl -e ' $colm=0; $coln=1; $count=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; if ($F[$colm] eq $F[$coln]) { print "$_\n"; $count++ } } warn "\nChose $count lines out of $. where column $colm had same text as column $coln\n\n"; ' infile > outfile

Example: TODO

Choose lines where column m numerically equals column n (choose_lines_col_m_equals_col_n_num)

Print all lines for which column m equals column n. ".1" and "0.1" are considered equal. (See choose_lines_col_m_equals_col_n_alpha.)

$colm Numerical column to compare
$coln Other numerical column to compare
Input file(s)
Output file

perl -e ' $colm=0; $coln=1; $count=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; if ($F[$colm] == $F[$coln]) { print "$_\n"; $count++ } } warn "\nChose $count lines out of $. where column $colm was numerically equal to column $coln\n\n"; ' infile > outfile

Example: TODO

Choose lines where column m is greater than column n (choose_lines_col_m_more_than_col_n)

Print all lines for which column m has a higher value than column n.

$colm Numerical column to compare
$coln Other numerical column to compare
Input file(s)
Output file

perl -e ' $colm=0; $coln=1; $count=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; if ($F[$colm] > $F[$coln]) { print "$_\n"; $count++ } } warn "\nChose $count lines out of $. where column $colm was greater than column $coln\n\n"; ' infile > outfile

Example: TODO

Choose numbered lines or records from a file

Choose first N lines (choose_first_n_lines)

Print the first n lines from a file. (Or just use the UNIX command "head".)

$num_lines Number of lines to choose
Input file(s)
Output file

perl -e ' $num_lines=2; while(<>) { if ($. <= $num_lines) { print $_ } else { exit } } ' infile > outfile

Example: TODO

Choose last N lines (choose_last_n_lines)

Print the last n lines from a file. (Or just use the UNIX command "tail".)

$num_lines Number of lines to choose
Input file(s)
Output file

perl -e ' $num_lines=2; while(<>) { push @save, $_; if (@save > $num_lines) { shift @save } } print @save ' infile > outfile

Example: TODO

Remove first N lines (choose_all_but_first_n_lines)

Remove the first n lines from a file.

$del_lines Number of lines to delete
Input file(s)
Output file

perl -e ' $del_lines=1; while(<>) { if ($. > $del_lines) { print $_ } } ' infile > outfile

Example: Remove headers from tabular data. TODO

TODO: Choose first N FASTA records

TODO: Choose Nth FASTA record

Remove duplicate lines or records from a file

Remove duplicate lines from a file (choose_unique_lines)

Print lines from a file, removing duplicates. Only print the first occurrence of any given line. Note: having the same first column in tabular data is not the same as being a duplicate.

Input file(s)
Output file

perl -e ' $unique=0; while(<>) { if (!($save{$_}++)) { print $_; $unique++ } } warn "\nChose $unique unique lines out of $. total lines.\n\n" ' genes > unique

Example: Given a list of gene names, remove any duplicates. Run the above script on file genes to get a file called unique:

Input file
(genes)
 ap23
 ap23
 CG2500
 cxb7
Output file
(unique)
 ap23
 CG2500
 cxb7
Screen Output
 Chose 3 unique lines out of 4 total lines.

Remove lines where values are repeated in a given column (choose_unique_lines_by_col)

Print lines from a tab-separated file, removing duplicates, where a duplicate is defined as a repeated value in a given column, even if other parts of the line are different. Only print the first line where each value appears in that given column. Note: 0.1 and .1 are not considered duplicates.

$column Column in which to look for unique values
Input file(s)
Output file

perl -e ' $column=0; $unique=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; if (! ($save{$F[$column]}++)) { print "$_\n"; $unique++ } } warn "\nChose $unique unique lines out of $. total lines.\nRemoved duplicates in column $column.\n\n" ' hits > unique_hits

Example: Given a list containing target genes, hit genes, and scores, take only one hit per target gene. (If you only want the best hit for each target, use Choose lines with highest score for each name (choose_lines_with_max_per_name)). Run the above script on file hits to get a file called unique_hits:

Input file
(hits)
 ap23   ap24    25
 ap23   mmm     50
 CG2500 mmm     10
 cxb7   gly1    5
Output file
(unique_hits)
 ap23   ap24    25
 CG2500 mmm     10
 cxb7   gly1    5
Screen Output
 Chose 3 unique lines out of 4 total lines.
 Removed duplicates in column 0.

Remove duplicate FASTA records from a file (choose_unique_FASTA)

Print FASTA records (descriptions and sequences) from a file. Print only the first record for a given FASTA ID (everything on the description line up to the first space). So if a record has an already-printed ID with a different text description, it will still not be printed.

Input file(s)
Output file

perl -e ' $unique=0; $total=0; while(<>) { if (/^>\S+/) { $total++; if (! ($seen{$&}++)) { $unique++; $print_it = 1 } else { $print_it = 0 } }; if ($print_it) { print $_ }; } warn "\nChose $unique unique FASTA records out of $total total.\n\n"; ' a.fsa > unique.fsa

Example: Remove duplicate FASTAs. Run the above script on file a.fsa to get a file called unique.fsa:

Input file
(a.fsa)
 >CG123
 ACGTTGCA
 GTTACCAG
 >DG124
 GTTACCAG
 >CG123 second time
 ACGTTGCA
 GTTACCAG
Output file
(unique.fsa)
 >CG123
 ACGTTGCA
 GTTACCAG
 >DG124
 GTTACCAG
Screen Output
 Chose 2 unique FASTA records out of 3 total.

Choose lines or records from a file, using a list in another file

See the Merge page for more examples of this kind of choosing.

Choose lines from a table based on a list (choose_table_rows_from_list)

Given a table of tab-separated lines, choose only the rows where a given column contains values from a list.

This is just a special case of merge_lines_based_on_shared_column. Use column 0 for the list - i.e., treat the list as a table with just one column.

Choose a set of FASTA sequences from a file (choose_FASTAs_from_list)

Given a list of FASTA IDs and a FASTA file, print out the FASTA sequences for the given IDs. Sequences will be printed out only once, even if the ID/sequence appears more than once in either file. It is possible that some or all of the IDs in the list won't be found in the FASTA file.

A '>' and/or description in the ID list will be ignored; only the ID is read.

Input file(s) File with FASTA identifiers (no > signs)
Input file(s) FASTA file
Output file

perl -e ' ($id,$fasta)=@ARGV; open(ID,$id); while (<ID>) { s/\r?\n//; /^>?(\S+)/; $ids{$1}++; } $num_ids = keys %ids; open(F, $fasta); $s_read = $s_wrote = $print_it = 0; while (<F>) { if (/^>(\S+)/) { $s_read++; if ($ids{$1}) { $s_wrote++; $print_it = 1; delete $ids{$1} } else { $print_it = 0 } }; if ($print_it) { print $_ } }; END { warn "Searched $s_read FASTA records.\nFound $s_wrote IDs out of $num_ids in the ID list.\n" } ' id_list a.fsa > found.fsa

Example: Run the above script on the list id_list and the FASTA file a.fsa to get found.fsa.

Input list file
(id_list)
 CG123 A sequence I want
 CGnot_here
Input FASTA file
(a.fsa)
 >CG123 First time
 ACGTTGCA
 GTTACCAG
 >DG124
 GTTACCAG
 >CG123 second time
 ACGTTGCA
 GTTACCAG
Output file
(found.fsa)
 >CG123 First time
 ACGTTGCA
 GTTACCAG
Screen Output
 Searched 3 FASTA records.
 Found 1 IDs out of 2 in the ID list.


More Information

General Choosing Notes

Matching is complicated, and these scripts can't hide all of the complexity. Certain characters may do strange things to Perl matching.

As always, when in doubt, check your output files after each step.

Duplicates will not be removed, unless noted otherwise. Use tools to Remove duplicate lines or records from a file.

Lines will be printed out in the order that they appear in the input files, unless noted otherwise.

When comparing numerically, .1 and 0.1 are equal, but not when comparing "alphabetically". On the other hand, text strings that do not start with a number have a value of 0 when comparing numerically ("a" = "blah" = 0).

General Scriptome Notes

Scriptome tools are in blue or green boxes. Cut and paste the text of the tool into a terminal window. Then edit the line as needed. Things that will often need to be edited are highlighted in red. Input and output filenames will almost always need to be changed.

All scripts that work on tabular data assume the data is tab-separated. Use a Change script to change, e.g., comma-separated data to tab-separated before using these scripts.

When working with tabular data, remember that the first column is called column 0, the second column is column 1, etc. The last column can also be referred to as column -1, second-to-last column is -2, etc.

 
