Sequence Analysis Protocols

Protocols for data manipulation related to sequence analysis.

To use a protocol, just cut and paste the scripts in the colored/dashed-line boxes onto a UNIX/Mac or Windows command line. You can copy and run the individual tools one by one, or all at once if you're feeling lucky.

If you're happy with the filenames given in the protocol, just make sure that your starting file(s) have the right name. Otherwise, you can copy the whole protocol into an editor (vi, Notepad) and change the filenames, then cut and paste onto the command line.

Analyzing BLAST Results

Reciprocal Best Hit BLAST

A method for finding potential orthologs (evolutionarily related sequences).

Input: files A_B and B_A contain all-against-all blasts of genome A against genome B and vice versa. The blasts should be done with the -m8 option for tabular output.

Tool Name	Input files	Output
choose_lines_with_max_per_name	A_B	A_B.best
choose_lines_with_max_per_name	B_A	B_A.best
merge_lines_based_on_shared_column	A_B.best B_A.best	A_B_A
choose_lines_col_m_equals_col_n_alpha	A_B_A	A_B_A.recip
choose_cols	A_B_A.recip	RBHB.out

Note: if there are multiple hits with the same score for a given query, you may get multiple "best hits". In that case, you may want to use a tool to remove duplicates in the query columns.

perl -e "$name_col=0; $score_col=11; while(<>) {s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($max{$n})) {push @names, $n}; if (! exists($max{$n}) || $s > $max{$n}) {$max{$n} = $s; $best{$n} = ()}; if ($s == $max{$n}) {$best{$n} .=  qq~$_\n~};} for $n (@names) {print $best{$n}}" A_B > A_B.best

perl -e "$name_col=0; $score_col=11; while(<>) {s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($max{$n})) {push @names, $n}; if (! exists($max{$n}) || $s > $max{$n}) {$max{$n} = $s; $best{$n} = ()}; if ($s == $max{$n}) {$best{$n} .=  qq~$_\n~};} for $n (@names) {print $best{$n}}" B_A > B_A.best

perl -e "$col1=1; $col2=0;" -e "($f1,$f2)=@ARGV; open(F1,$f1); while (<F1>) {s/\r?\n//; @F=split /\t/, $_; $line1{$F[$col1]} .=  qq~$_\n~} open(F2,$f2); while (<F2>) {s/\r?\n//;@F=split /\t/, $_; if ($x = $line1{$F[$col2]}) {$x =~ s/\n/\t$_\n/g; print $x}}" A_B.best B_A.best > A_B_A

perl -e "$colm=0; $coln=13; $count=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; if ($F[$colm] eq $F[$coln]) {print  qq~$_\n~; $count++}} warn  qq~\nChose $count lines out of $. where column $colm had same text as column $coln\n\n~;" A_B_A > A_B_A.recip

perl -e "@cols=(0, 1, 11); while(<>) {s/\r?\n//; @F=split /\t/, $_; print join( qq~\t~, @F[@cols]),  qq~\n~} warn  qq~\nJoined columns ~, join( qq~, ~, @cols),  qq~ for $. lines\n\n~" A_B_A.recip > RBHB.out

Formatting and Filtering FASTA Files

If you change FASTA files to tabular format, you can use all of the tools that filter and format tabular data, and change back to FASTA format at the end if desired.

Remove duplicate sequences in FASTA

Some FASTA files (e.g., ESTs) have sequences with different IDs that nonetheless have the same sequence. This protocol removes duplicate sequences based on the nucleotide / amino acid sequence, rather than ID.

Tool Name	Input files	Output
change_fasta_to_tab	dup.fasta	dup.tab
choose_unique_lines_by_col	dup.tab	unique.tab
change_tab_to_fasta	unique.tab	unique.fasta

perl -e "$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print  qq~\n~} s/ |$/\t/; $count++; $_ .=  qq~\t~;} else {s/ //g; $len += length($_)} print $_;} print  qq~\n~; warn  qq~\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n~;" dup.fasta > dup.tab

perl -e "$column = 2; $unique=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; if (! ($save{$F[$column]}++)) {print  qq~$_\n~; $unique++}} warn  qq~\nChose $unique unique lines out of $. total lines.\nRemoved duplicates in column $column.\n\n~" dup.tab > unique.tab

perl -e "$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print  qq~>$F[0]~; if (length($F[1])) {print  qq~ $F[1]~} print  qq~\n~; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print  qq~$s\n~;} warn  qq~\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n~;" unique.tab > unique.fasta

Remove short sequences in FASTA

Some FASTA files have very short sequences (e.g., with N's or repeat elements removed), that will not provide meaningful results when running BLAST or other analyses. This protocol removes sequences whose lengths are less than 30 nucleotides / amino acids. (In order to change the length, cut and paste the whole protocol into an editor and change the $limit parameter in the second tool.)

Tool Name	Input files	Output
change_fasta_to_tab	long_short.fasta	long_short.tab
calc_col_length	long_short.tab	ls_length.tab
choose_lines_col_more_than_limit	ls_length.tab	only_long.tab
choose_cols	only_long.tab	only_long_three_cols.tab
change_tab_to_fasta	only_long_three_cols.tab	only_long.fasta

perl -e "$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print  qq~\n~} s/ |$/\t/; $count++; $_ .=  qq~\t~;} else {s/ //g; $len += length($_)} print $_;} print  qq~\n~; warn  qq~\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n~;" long_short.fasta > long_short.tab

perl -e "$col = 2;" -e "while (<>) { s/\r?\n//; @F = split /\t/, $_; $len = length($F[$col]); print  qq~$_\t$len\n~ } warn  qq~\nAdded column with length of column $col for $. lines.\n\n~;" long_short.tab > ls_length.tab

perl -e "$col=3; $limit=30; while(<>) {BEGIN {$count=0} s/\r?\n//; @F=split /\t/, $_; if ($F[$col] > $limit) {$count++; print  qq~$_\n~}} warn  qq~\nChose $count lines out of $..\n\n~" ls_length.tab > only_long.tab

perl -e "@cols=(0, 1, 2); while(<>) {s/\r?\n//; @F=split /\t/, $_; print join( qq~\t~, @F[@cols]),  qq~\n~} warn  qq~\nJoined columns ~, join( qq~, ~, @cols),  qq~ for $. lines\n\n~" only_long.tab > only_long_three_cols.tab

perl -e "$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print  qq~>$F[0]~; if (length($F[1])) {print  qq~ $F[1]~} print  qq~\n~; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print  qq~$s\n~;} warn  qq~\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n~;" only_long_three_cols.tab > only_long.fasta

Change FASTA files to have sixty characters per line of sequence

FASTA files may have entire sequences (hundreds or thousands of characters) in a single line. A few programs cannot except these long sequence lines. To change a FASTA file to have only 60 characters per line, all you need to do is translate the FASTA file to tabular format and back.

Tool Name	Input files	Output
change_fasta_to_tab	long_seq.fasta	long_seq.tab
change_tab_to_fasta	long_seq.tab	short_seq.fasta

perl -e "$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print  qq~\n~} s/ |$/\t/; $count++; $_ .=  qq~\t~;} else {s/ //g; $len += length($_)} print $_;} print  qq~\n~; warn  qq~\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n~;" long_seq.fasta > long_seq.tab

perl -e "$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print  qq~>$F[0]~; if (length($F[1])) {print  qq~ $F[1]~} print  qq~\n~; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print  qq~$s\n~;} warn  qq~\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n~;" long_seq.tab > short_seq.fasta

TODO: Choose FASTA sequences containing an exact subsequence

Note: If the ID or description column happen to contain the sequence, then choose_lines_matching_text will find them. This shouldn't happen very often. It would be nicer if you could search within a column, though.