This page features all sorts of tips on using specific tools, writing
protocols, and general data manipulation.
The most general tip is to use trial and error. If at all possible,
take small subsets of your real data (using tools like choose_cols
and choose_first_n_lines) and try using tools or sets of tools on
the data.
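As a rough illustration of working on a small subset first, here is a minimal Python sketch that keeps only the first few lines of some input. The function name first_n_lines is made up for this example and is not the Scriptome tool itself.

```python
from itertools import islice


def first_n_lines(lines, n):
    """Return at most the first n items of any iterable of lines.

    Works on a list or on an open file handle, and stops reading
    after n lines, so it stays fast on large files.
    """
    return list(islice(lines, n))


# Made-up sample data standing in for a large real file.
sample = ["line one", "line two", "line three", "line four"]
subset = first_n_lines(sample, 2)
```

Once a tool behaves as expected on the subset, the same step can be run on the full file.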
There are a few idiosyncrasies it's good to know about when choosing
parameters for tools. (Most of these have to do with the way the Scriptome
tools use Perl.) They are discussed below, along with some shortcuts.
When using tabular data, column zero is the first column, column
1 is the second column, etc. However, the first line or row of a file
is one. Yes, it's confusing. Sorry. Read the examples to see what
to expect, and always check your results.
When choosing columns in tabular data, the last column will be
-1, the second to last is -2, etc.
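The numbering conventions above can be tried out in a short Python sketch. Python lists happen to use the same zero-based and negative indexing, so this is only an analogy for how the tools count columns. The sample line is invented.

```python
# A tab-separated line with four columns (made-up sample data).
line = "gene1\tchr2\t100\t250"
cols = line.split("\t")

first_col = cols[0]        # column 0 is the first column
second_col = cols[1]       # column 1 is the second column
last_col = cols[-1]        # column -1 is the last column
second_to_last = cols[-2]  # column -2 is the second to last

# Lines (rows), by contrast, are counted starting from 1.
rows = ["header", "row a", "row b"]
numbered = list(enumerate(rows, start=1))
first_line_number = numbered[0][0]
```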
7..10 is the same as 7, 8, 9, 10. Use this shortcut for choose_cols,
for example. You can also mix and match single columns and ranges, like
2, 3, 7..10, 14, 15..19.
Note that backwards ranges like 4..1 don't work. Also, you can't give a range
like 3..-1 to go from column 3 through the last column.
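To see what a column list with ranges expands to, here is a minimal Python sketch. The function expand_columns is hypothetical and only imitates the behavior described above. It is not how the Scriptome tools parse their parameters.

```python
def expand_columns(spec):
    """Expand a spec like "2, 3, 7..10" into a list of column numbers."""
    columns = []
    for part in spec.split(","):
        part = part.strip()
        if ".." in part:
            start_text, end_text = part.split("..")
            start, end = int(start_text), int(end_text)
            # range() is empty when start > end, so a backwards range
            # such as 4..1 (or 3..-1) contributes no columns at all.
            columns.extend(range(start, end + 1))
        else:
            columns.append(int(part))
    return columns


mixed = expand_columns("2, 3, 7..10, 14, 15..19")
backwards = expand_columns("4..1")
to_last = expand_columns("3..-1")
```

In this sketch, the backwards range and the 3..-1 range both come out empty, which is one way to picture why they select nothing useful.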
Many Scriptome tools assume data is tab-separated. (See
Make Your Data Tabular.) But if you just have a list of genes,
values, whatever, you can still use some of the tools. Just set
the column number to 0, the first column.
Each Scriptome tool does a simple task. The idea is to put them together
to do your more complicated processing.
You don't necessarily need to plan every step of the processing from
the beginning. Since most Scriptome tools take only a few seconds to
run, even on large files, it's OK to just run a couple and see whether
the output looks more like the results you want.
Since each step in the process yields an intermediate output file, if
one step doesn't work, just back up a step and try to continue with
a different method.
If you make your data tabular, it will be easier to read and process.
Many Scriptome tools work with tabular data, and of course you can also
throw tab-separated lines into Excel or other programs.
Change your FASTA files to tables with change_fasta_to_tab.
Now you can sort, merge, and filter to your heart's
content. When you're done, leave the data in tables for analysis
(What's the average sequence length? How many sequences are from chimp?)
or change it back into FASTA to run sequence analysis tools on it.
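The following Python sketch shows the general idea of turning FASTA records into table rows and then answering the two questions above. It assumes a three-column layout of ID, description, and sequence. That layout, the function name fasta_to_rows, and the sample records are assumptions for illustration, and the real change_fasta_to_tab output may differ.

```python
def fasta_to_rows(fasta_text):
    """Turn FASTA text into a list of (id, description, sequence) rows."""
    rows = []
    seq_id, description, chunks = None, "", []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                rows.append((seq_id, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            seq_id = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        rows.append((seq_id, description, "".join(chunks)))
    return rows


# Made-up records, not real database entries.
sample = """>seq1 Homo sapiens example
ACGT
ACGT
>seq2 Pan troglodytes (chimp) example
ACGTAC
"""

rows = fasta_to_rows(sample)
average_length = sum(len(seq) for _, _, seq in rows) / len(rows)
chimp_count = sum(1 for _, desc, _ in rows if "chimp" in desc)
```

With the sequence on a single line per record, questions about length or species become simple per-row calculations.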
Change lots of other sequence formats into tabular format using
change_bio_format_to_bio_format with "tab" output format.
(Note that this tab format will have only two columns, not three.)
Change comma-separated, space-separated, or other kinds of tabular data
to tab-separated using
change_any_separator_to_tab.
Very often a single column of tabular data still holds more than
one piece of information. Separating that column into multiple
columns with change_any_separator_to_tab lets you get at those
smaller pieces.
The ID in a GenBank FASTA file actually contains several pieces of information,
separated by '|' characters. After changing the file to tabular format, you can
split the ID into its constituent parts. Now you can make a new FASTA file with
gi IDs. Or merge the table with another table containing gi IDs.
As another example, split a sequence "start-end" column on '-' to put start
and end in separate columns. Now you can choose lines based on sequence start
positions.
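Both splitting examples can be sketched in a few lines of Python. The helper split_column and the sample rows (including the ID, which is invented and not a real GenBank entry) are assumptions made for illustration.

```python
def split_column(row, col, separator):
    """Replace one column of a row with the pieces from splitting it."""
    if col < 0:
        col += len(row)  # let -1 mean the last column, -2 the one before
    pieces = row[col].split(separator)
    return row[:col] + pieces + row[col + 1:]


# Split a GenBank-style ID on '|' (made-up ID).
fasta_row = ["gi|12345|ref|XX_000001.1", "example protein", "MKV"]
id_parts_row = split_column(fasta_row, 0, "|")

# Split a "start-end" column on '-', then choose rows by start position.
position_rows = [["seq1", "120-450"], ["seq2", "15-90"]]
split_rows = [split_column(row, 1, "-") for row in position_rows]
late_starts = [row for row in split_rows if int(row[1]) >= 100]
```

After the split, the gi number and the start position each sit in a column of their own, so they can be used for merging or filtering.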
If you have a file that has data with the wrong accession numbers, and another
file that translates between accession numbers, use
merge_lines_based_on_shared_col to get a new file with the
data and the correct accession numbers.
(But be careful of getting multiple lines with similar data
if the mapping isn't one-to-one.)
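Here is a minimal Python sketch of merging two tables on a shared column, including what happens when the mapping isn't one-to-one. The function merge_on_shared_column, its output layout, and the accession values are all hypothetical. The actual merge_lines_based_on_shared_col tool may arrange its output differently.

```python
from collections import defaultdict


def merge_on_shared_column(data_rows, map_rows, data_col=0, map_col=0):
    """Join rows whose shared column matches.

    Column numbers are assumed to be zero or positive here. The shared
    value is kept once, followed by the other mapping-table columns.
    """
    lookup = defaultdict(list)
    for row in map_rows:
        lookup[row[map_col]].append(row)

    merged = []
    for row in data_rows:
        # One output line per matching mapping row, so a one-to-many
        # mapping repeats the same data on several lines.
        for match in lookup.get(row[data_col], []):
            merged.append(row + match[:map_col] + match[map_col + 1:])
    return merged


# Made-up accession numbers and values.
data_rows = [["OLD1", "3.2"], ["OLD2", "1.7"], ["OLD3", "0.4"]]
map_rows = [["OLD1", "NEW1"], ["OLD2", "NEW2a"], ["OLD2", "NEW2b"]]
merged = merge_on_shared_column(data_rows, map_rows)
```

In this sketch OLD2 maps to two new accessions, so its value appears on two output lines, and OLD3 has no mapping, so it is dropped. Comparing line counts before and after a merge is a quick way to spot both situations.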
After using the table of contents to get to a tool, you can bookmark
the tool itself, rather than bookmarking the whole page. (The same is
true for a protocol.)
Use the "more" command to look at a file page by page. Press the spacebar
to go to the next page, and "q" to quit.