Scriptome Home
UNIX/Mac Home
Windows Home
Information
FAQ
Help
Overview
Principles
Resources
Tips
Tools
Calc
Change
Choose
Fetch
Merge
Sort
Protocols
Sequences
Microarray

Contents: Click a blue triangle to expand or collapse a list


Sequence Analysis Protocols

Protocols for data manipulation related to sequence analysis.

To use a protocol, just cut and paste the scripts in the colored/dashed-line boxes onto a UNIX/Mac or Windows command line. You can copy and run the individual tools one by one, or all at once if you're feeling lucky.

If you're happy with the filenames given in the protocol, just make sure that your starting file(s) have the right name. Otherwise, you can copy the whole protocol into an editor (vi, Notepad) and change the filenames, then cut and paste onto the command line.

Analyzing BLAST Results

Reciprocal Best Hit BLAST

A method for finding potential orthologs (evolutionarily related sequences).

Input: files A_B and B_A contain all-against-all blasts of genome A against genome B and vice versa. The blasts should be done with the -m8 option for tabular output.

Tool NameInput filesOutput
choose_lines_with_max_per_name A_B A_B.best
choose_lines_with_max_per_name B_A B_A.best
merge_lines_based_on_shared_column A_B.best B_A.best A_B_A
choose_lines_col_m_equals_col_n_alpha A_B_A A_B_A.recip
choose_cols A_B_A.recip RBHB.out

Note: if there are multiple hits with the same score for a given query, you may get multiple "best hits". In that case, you may want to use a tool to remove duplicates in the query columns.

perl -e '$name_col=0; $score_col=11; while(<>) {s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($max{$n})) {push @names, $n}; if (! exists($max{$n}) || $s > $max{$n}) {$max{$n} = $s; $best{$n} = ()}; if ($s == $max{$n}) {$best{$n} .= "$_\n"};} for $n (@names) {print $best{$n}}' A_B > A_B.best
perl -e '$name_col=0; $score_col=11; while(<>) {s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($max{$n})) {push @names, $n}; if (! exists($max{$n}) || $s > $max{$n}) {$max{$n} = $s; $best{$n} = ()}; if ($s == $max{$n}) {$best{$n} .= "$_\n"};} for $n (@names) {print $best{$n}}' B_A > B_A.best
perl -e '$col1=1; $col2=0;' -e '($f1,$f2)=@ARGV; open(F1,$f1); while (<F1>) {s/\r?\n//; @F=split /\t/, $_; $line1{$F[$col1]} .= "$_\n"} open(F2,$f2); while (<F2>) {s/\r?\n//;@F=split /\t/, $_; if ($x = $line1{$F[$col2]}) {$x =~ s/\n/\t$_\n/g; print $x}}' A_B.best B_A.best > A_B_A
perl -e '$colm=0; $coln=13; $count=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; if ($F[$colm] eq $F[$coln]) {print "$_\n"; $count++}} warn "\nChose $count lines out of $. where column $colm had same text as column $coln\n\n";' A_B_A > A_B_A.recip
perl -e '@cols=(0, 1, 11); while(<>) {s/\r?\n//; @F=split /\t/, $_; print join("\t", @F[@cols]), "\n"} warn "\nJoined columns ", join(", ", @cols), " for $. lines\n\n"' A_B_A.recip > RBHB.out

Formatting and Filtering FASTA Files

If you change FASTA files to tabular format, you can use all of the tools that filter and format tabular data, and change back to FASTA format at the end if desired.

Remove duplicate sequences in FASTA

Some FASTA files (e.g., ESTs) have sequences with different IDs that nonetheless have the same sequence. This protocol removes duplicate sequences based on the nucleotide / amino acid sequence, rather than ID.

Tool NameInput filesOutput
change_fasta_to_tab dup.fasta dup.tab
choose_unique_lines_by_col dup.tab unique.tab
change_tab_to_fasta unique.tab unique.fasta

perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";' dup.fasta > dup.tab
perl -e '$column = 2; $unique=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; if (! ($save{$F[$column]}++)) {print "$_\n"; $unique++}} warn "\nChose $unique unique lines out of $. total lines.\nRemoved duplicates in column $column.\n\n"' dup.tab > unique.tab
perl -e '$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) {print " $F[1]"} print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n";} warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n";' unique.tab > unique.fasta

Remove short sequences in FASTA

Some FASTA files have very short sequences (e.g., with N's or repeat elements removed), that will not provide meaningful results when running BLAST or other analyses. This protocol removes sequences whose lengths are less than 30 nucleotides / amino acids. (In order to change the length, cut and paste the whole protocol into an editor and change the $limit parameter in the second tool.)

Tool NameInput filesOutput
change_fasta_to_tab long_short.fasta long_short.tab
calc_col_length long_short.tab ls_length.tab
choose_lines_col_more_than_limit ls_length.tab only_long.tab
choose_cols only_long.tab only_long_three_cols.tab
change_tab_to_fasta only_long_three_cols.tab only_long.fasta

perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";' long_short.fasta > long_short.tab
perl -e '$col = 2;' -e 'while (<>) { s/\r?\n//; @F = split /\t/, $_; $len = length($F[$col]); print "$_\t$len\n" } warn "\nAdded column with length of column $col for $. lines.\n\n";' long_short.tab > ls_length.tab
perl -e '$col=3; $limit=30; while(<>) {BEGIN {$count=0} s/\r?\n//; @F=split /\t/, $_; if ($F[$col] > $limit) {$count++; print "$_\n"}} warn "\nChose $count lines out of $..\n\n"' ls_length.tab > only_long.tab
perl -e '@cols=(0, 1, 2); while(<>) {s/\r?\n//; @F=split /\t/, $_; print join("\t", @F[@cols]), "\n"} warn "\nJoined columns ", join(", ", @cols), " for $. lines\n\n"' only_long.tab > only_long_three_cols.tab
perl -e '$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) {print " $F[1]"} print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n";} warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n";' only_long_three_cols.tab > only_long.fasta

Change FASTA files to have sixty characters per line of sequence

FASTA files may have entire sequences (hundreds or thousands of characters) in a single line. A few programs cannot except these long sequence lines. To change a FASTA file to have only 60 characters per line, all you need to do is translate the FASTA file to tabular format and back.

Tool NameInput filesOutput
change_fasta_to_tab long_seq.fasta long_seq.tab
change_tab_to_fasta long_seq.tab short_seq.fasta

perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";' long_seq.fasta > long_seq.tab
perl -e '$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) {print " $F[1]"} print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n";} warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n";' long_seq.tab > short_seq.fasta

TODO: Choose FASTA sequences containing an exact subsequence

Note: If the ID or description column happen to contain the sequence, then choose_lines_matching_text will find them. This shouldn't happen very often. It would be nicer if you could search within a column, though.

 

HomeContact UsDirectoriesSearch