Protocols for data manipulation related to sequence analysis.
To use a protocol, copy and paste the scripts in the colored/dashed-line
boxes onto a UNIX/Mac or Windows command line. You can copy and run the
individual tools one by one, or all at once if you're feeling lucky.
If you're happy with the filenames given in the protocol, just make sure
that your starting file(s) have the right names. Otherwise, you can
copy the whole protocol into an editor (vi, Notepad), change the filenames,
and then copy and paste it onto the command line.
A method for finding potential orthologs (evolutionarily related sequences).
Input: files A_B and B_A contain the all-against-all BLAST results of genome
A against genome B and vice versa. The BLAST searches should be run with the
-m8 option for tabular output.
Tool Name | Input files | Output |
choose_lines_with_max_per_name | A_B | A_B.best |
choose_lines_with_max_per_name | B_A | B_A.best |
merge_lines_based_on_shared_column | A_B.best B_A.best | A_B_A |
choose_lines_col_m_equals_col_n_alpha | A_B_A | A_B_A.recip |
choose_cols | A_B_A.recip | RBHB.out |
Note: if there are multiple hits with the same score for a given query,
you may get multiple "best hits". In that case, you may want to use a tool
to remove duplicates in the query columns.
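The reciprocal-best-hit logic of this protocol can also be sketched in a few lines of Python. This is only an illustration, not the code behind the tools above. It assumes the standard 12-column -m8 layout (query ID in column 1, subject ID in column 2, bit score in column 12) and assumes that "best" means highest bit score. The function names are made up for this example. Unlike the tools, it keeps only the first hit when scores are tied.

```python
def best_hits(lines):
    """Map each query ID to (subject ID, bit score) of its top-scoring hit.

    Assumes tab-separated BLAST -m8 lines with the bit score in column 12.
    On ties, the first hit encountered is kept.
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        query, subject, score = cols[0], cols[1], float(cols[11])
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (A id, B id) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    pairs = []
    for a_id, (b_id, _score) in best_ab.items():
        if b_id in best_ba and best_ba[b_id][0] == a_id:
            pairs.append((a_id, b_id))
    return sorted(pairs)
```

To try it on real files, pass open file handles, for example `reciprocal_best_hits(open("A_B"), open("B_A"))`.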
If you change FASTA files to tabular format, you can use all of the tools
that filter and format tabular data, and change back to FASTA format
at the end if desired.
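For readers who want to see what the round trip involves, here is a minimal Python sketch. It assumes the tabular format has three columns (ID, description, sequence) and that sequences are written back 60 characters per line. The function names are invented for this example and the actual tools may handle edge cases differently.

```python
def fasta_to_tab(fasta_text):
    """Parse FASTA text into a list of (id, description, sequence) tuples."""
    rows = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                rows.append(header + ("".join(chunks),))
            # Everything up to the first space is the ID, the rest is the description.
            seq_id, _, desc = line[1:].partition(" ")
            header, chunks = (seq_id, desc), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        rows.append(header + ("".join(chunks),))
    return rows


def tab_to_fasta(rows, width=60):
    """Format (id, description, sequence) tuples as FASTA text."""
    out = []
    for seq_id, desc, seq in rows:
        out.append(">" + (seq_id + " " + desc).strip())
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)
```

Each tuple can be written to a file as one tab-separated line with `"\t".join(row)`, which gives the one-record-per-line layout that the filtering tools expect.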
Some FASTA files (e.g., ESTs) have sequences with different IDs that
nonetheless have the same sequence. This protocol removes duplicate
sequences based on the nucleotide / amino acid sequence, rather than ID.
Tool Name | Input files | Output |
change_fasta_to_tab | dup.fasta | dup.tab |
choose_unique_lines_by_col | dup.tab | unique.tab |
change_tab_to_fasta | unique.tab | unique.fasta |
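The middle step is the one that does the real work. A rough Python equivalent, assuming the tabular file has the sequence in the third column and that the first record with a given sequence is the one to keep, might look like this. The function name and the keep-first rule are assumptions for illustration only.

```python
def unique_lines_by_col(lines, col=2):
    """Keep only the first line for each distinct value in column `col`.

    `col` is zero-based, so col=2 is the third (sequence) column.
    Comparison is an exact, case-sensitive string match.
    """
    seen = set()
    kept = []
    for line in lines:
        row = line.rstrip("\n")
        key = row.split("\t")[col]
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```

Because the match is exact, sequences that differ only in upper/lower case count as different here. Convert the key with `.upper()` if that is not what you want.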
Some FASTA files have very short sequences (e.g., with N's or repeat elements
removed) that will not provide meaningful results when running BLAST or
other analyses. This protocol removes sequences whose lengths
are less than 30 nucleotides / amino acids. (In order to change the length,
copy and paste the whole protocol into an editor and change the $limit parameter
of the choose_lines_col_more_than_limit tool.)
Tool Name | Input files | Output |
change_fasta_to_tab | long_short.fasta | long_short.tab |
calc_col_length | long_short.tab | ls_length.tab |
choose_lines_col_more_than_limit | ls_length.tab | only_long.tab |
choose_cols | only_long.tab | only_long_three_cols.tab |
change_tab_to_fasta | only_long_three_cols.tab | only_long.fasta |
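The length calculation and the filter can be combined into one step, as in the Python sketch below. It assumes a three-column tabular file with the sequence in the third column, and it keeps sequences of 30 or more characters, following the description above. The function name and the `min_length` parameter are made up for this example and stand in for the $limit setting.

```python
def keep_long_sequences(lines, min_length=30, seq_col=2):
    """Keep tab-separated lines whose sequence column has at least min_length characters.

    `seq_col` is zero-based, so seq_col=2 is the third column.
    """
    kept = []
    for line in lines:
        row = line.rstrip("\n")
        fields = row.split("\t")
        if len(fields[seq_col]) >= min_length:
            kept.append(row)
    return kept
```

Since the length is computed on the fly, there is no extra length column to strip out afterwards, which is what the choose_cols step does in the protocol.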
FASTA files may have entire sequences (hundreds or thousands of characters)
in a single line. A few programs cannot accept these long sequence lines.
To change a FASTA file to have only 60 characters per line, all you
need to do is translate the FASTA file to tabular format and back.
Tool Name | Input files | Output |
change_fasta_to_tab | long_seq.fasta | long_seq.tab |
change_tab_to_fasta | long_seq.tab | short_seq.fasta |
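If you would rather not create the intermediate tabular file, the same re-wrapping can be done directly. The Python sketch below is an alternative to the two-tool protocol, not a description of how those tools work. The function name is hypothetical and the default width of 60 is taken from the text above.

```python
def rewrap_fasta(fasta_text, width=60):
    """Return FASTA text with every sequence re-wrapped to `width` characters per line."""
    out = []
    seq = []

    def flush():
        # Join the pieces collected for the current record and re-split them.
        joined = "".join(seq)
        out.extend(joined[i:i + width] for i in range(0, len(joined), width))
        seq.clear()

    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            flush()           # finish the previous record, if any
            out.append(line)  # header lines are copied unchanged
        elif line:
            seq.append(line)
    flush()
    return "".join(line + "\n" for line in out)
```

Header lines are never wrapped, so a very long description will still produce a long line.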
Note: If the ID or description column happens to contain the sequence, then
choose_lines_matching_text will find those lines as well. This shouldn't happen
very often. It would be nicer if you could search within a column, though.