Tips on Using the Scriptome

This page features all sorts of tips on using specific tools, writing protocols, and general data manipulation.

The most general tip is to use trial and error. If at all possible, take small subsets of your real data (using tools like choose_cols and choose_first_n_lines) and try using tools or sets of tools on the data.
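As a rough illustration of what "a small subset" means, here is a minimal Python sketch. It is not Scriptome code: the function name small_subset and the sample lines are made up, and it only mimics the idea of keeping the first few lines and a few columns of a tab-separated file.

```python
def small_subset(lines, n_lines, columns):
    """Keep the first n_lines lines and only the listed (0-based) columns."""
    subset = []
    for line in lines[:n_lines]:
        fields = line.rstrip("\n").split("\t")
        subset.append("\t".join(fields[c] for c in columns))
    return subset

# Hypothetical sample data: three lines, three columns each.
lines = ["a\t1\tx", "b\t2\ty", "c\t3\tz"]
sample = small_subset(lines, 2, [0, 2])
print(sample)
```

Trying a tool on a handful of lines like this makes it easy to check the output by eye before running it on the full file.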

Tips on Using the Tools

There are a few idiosyncrasies that are good to know about when choosing parameters for tools. (Most of these have to do with the way the Scriptome tools use Perl.) They are discussed below, along with some shortcuts.

Columns start counting at zero, rows/lines start counting at 1

When using tabular data, column zero is the first column, column 1 is the second column, etc. However, the first line or row of a file is line 1. Yes, it's confusing. Sorry. Read the examples to see what to expect, and always check your results.
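The two counting conventions can be sketched in Python as follows. This is only an illustration; the helper cell and the sample data are hypothetical and not part of the Scriptome.

```python
# Hypothetical tab-separated sample: three lines, three columns each.
text = "geneA\t12\tchr1\ngeneB\t7\tchr2\ngeneC\t30\tchrX\n"

def cell(text, line_number, column):
    """Return one field, using 1-based line numbers and 0-based columns."""
    lines = text.splitlines()
    fields = lines[line_number - 1].split("\t")  # line 1 is the first line
    return fields[column]                        # column 0 is the first column

print(cell(text, 1, 0))  # first line, first column
print(cell(text, 3, 1))  # third line, second column
```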

Last column is -1

When choosing columns in tabular data, the last column will be -1, the second to last is -2, etc.
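Python lists happen to count from the end in the same way, so a short sketch (with a made-up line of data) shows what to expect:

```python
# Hypothetical line with four tab-separated columns.
fields = "seq1\tHomo sapiens\t1500\tACGT".split("\t")

last = fields[-1]            # same column as fields[3]
second_to_last = fields[-2]  # same column as fields[2]
print(last, second_to_last)
```

Counting from the end is handy when lines have different numbers of columns but the value you want is always the final one.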

Give ranges of numbers like 7..10

7..10 is the same as 7, 8, 9, 10. Use this shortcut for choose_cols, for example. You can also mix and match single numbers and ranges, like 2, 3, 7..10, 14, 15..19. Note that backwards ranges like 4..1 don't work. Also, you can't give a range like 3..-1 to go from the third column through the last column.
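The sketch below shows one way such a column list could be expanded. It assumes the ranges behave like Perl's range operator, which produces nothing when the start is greater than the end; that would explain why 4..1 and 3..-1 give no columns. The function expand_columns is hypothetical and is not a Scriptome tool.

```python
def expand_columns(spec):
    """Expand a spec such as "2, 3, 7..10" into a list of column numbers."""
    columns = []
    for part in spec.split(","):
        part = part.strip()
        if ".." in part:
            low, high = (int(x) for x in part.split(".."))
            # Assumption: as with Perl's range operator, a range whose
            # start is greater than its end yields no numbers at all.
            columns.extend(range(low, high + 1))
        else:
            columns.append(int(part))
    return columns

print(expand_columns("2, 3, 7..10, 14, 15..19"))
print(expand_columns("4..1"))   # backwards range: empty
print(expand_columns("3..-1"))  # 3 is greater than -1: also empty
```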

Use tools for tabular data even with non-tabular data

Many Scriptome tools assume data is tab-separated. (See Make Your Data Tabular.) But if you just have a list of genes, values, whatever, you can still use some of the tools. Just set the column number to 0, the first column.
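To see why this works, note that a line without any tabs is simply a line with one column. A minimal Python sketch with a made-up gene list:

```python
# Hypothetical plain list, one gene name per line, no tabs anywhere.
gene_list = "BRCA1\nTP53\nMYC\n"

# Splitting on tabs leaves each line as a single field: column 0.
first_columns = [line.split("\t")[0] for line in gene_list.splitlines()]
print(first_columns)
```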

General Tips for Writing Protocols

Do a little at a time

Each Scriptome tool does a simple task. The idea is to put them together to do your more complicated processing.

Build with trial and error

You don't necessarily need to plan every step of the processing from the beginning. Since most Scriptome tools take only a few seconds to run, even on large files, it's OK to just run a couple and see whether the output looks more like the results you want.

Since each step in the process yields an intermediate output file, if one step doesn't work, just back up a step and try to continue with a different method.

Make your data tabular

If you make your data tabular, it will be easier to read and process. Many Scriptome tools work with tabular data, and of course you can also throw tab-separated lines into Excel or other programs.

Change your FASTA files to tables with change_fasta_to_tab. Now you can sort, merge, and filter to your heart's content. When you're done, leave the data in tables for analysis (What's the average sequence length? How many sequences are from chimp?) or change it back into FASTA to run sequence analysis tools on it.
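The following Python sketch illustrates the idea of treating FASTA records as table rows. The three-column layout (ID, description, sequence) is an assumption made for this example and may not match the exact output of change_fasta_to_tab; the function fasta_to_rows and the sample sequences are hypothetical.

```python
def fasta_to_rows(fasta_text):
    """Turn FASTA text into rows of [id, description, sequence]."""
    rows = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the ID is everything up to the first space.
            seq_id, _, description = line[1:].partition(" ")
            rows.append([seq_id, description, ""])
        elif rows:
            # Sequence line: append to the most recent record.
            rows[-1][2] += line
    return rows

# Hypothetical two-record FASTA file.
fasta = ">seq1 Homo sapiens\nACGT\nAC\n>seq2 Pan troglodytes\nGGG\n"
rows = fasta_to_rows(fasta)

average_length = sum(len(r[2]) for r in rows) / len(rows)
chimp_count = sum(1 for r in rows if "Pan troglodytes" in r[1])
print(average_length, chimp_count)
```

Once each sequence is one row, questions like average length or counts per species become simple column operations.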

Change lots of other sequence formats into tabular format using change_bio_format_to_bio_format with "tab" output format. (Note that this tab format will have only two columns, not three.)

Change comma-separated, space-separated, or other kinds of tabular data to tab-separated using change_any_separator_to_tab.
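As a rough sketch of what changing a separator involves, the Python below splits each line on a pattern and rejoins the pieces with tabs. The function to_tab_separated is hypothetical, and it does not handle quoted fields that contain the separator, so real comma-separated files may need more care.

```python
import re

def to_tab_separated(line, separator_pattern):
    """Split a line on a regular expression and rejoin the fields with tabs."""
    return "\t".join(re.split(separator_pattern, line.strip()))

print(to_tab_separated("geneA,12,chr1", ","))       # comma-separated input
print(to_tab_separated("geneA   12 chr1", r"\s+"))  # runs of spaces
```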

Split single columns

Very often a single column of tabular data still holds more than one piece of information. Separating a single column into multiple columns with change_any_separator_to_tab lets you get at those smaller pieces.

The ID in a GenBank FASTA file actually contains several pieces of information, separated by '|' characters. After changing the file to tabular format, you can split the ID into its constituent parts. Now you can make a new FASTA file with gi IDs. Or merge the table with another table containing gi IDs.

As another example, split a sequence "start-end" column on '-' to put start and end in separate columns. Now you can choose lines based on sequence start positions.
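Both examples can be sketched in Python as follows. The helper split_column, the ID, and the positions are invented for illustration; the sketch only shows the effect of splitting one column into several, not how the Scriptome tool does it.

```python
def split_column(row, column, separator):
    """Replace row[column] (0-based, non-negative) with its split pieces."""
    pieces = row[column].split(separator)
    return row[:column] + pieces + row[column + 1:]

# Hypothetical row: a '|'-separated ID and a "start-end" column.
row = ["gi|12345|ref|NM_000001.1", "100-250"]

step1 = split_column(row, 0, "|")    # ID becomes four columns
step2 = split_column(step1, 4, "-")  # "start-end" is now column 4
print(step2)

# With start in its own column (column 4), lines can be chosen by position.
rows = [step2, ["gi", "67890", "ref", "NM_000002.1", "20", "90"]]
late_starts = [r for r in rows if int(r[4]) >= 100]
print(late_starts)
```

Note that every split shifts the later columns to the right, so recheck column numbers after each step.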

Specific Tasks

Translate accession numbers

If you have a file that has data with the wrong accession numbers, and another file that translates between accession numbers, use merge_lines_based_on_shared_col to get a new file with the data and the correct accession numbers. (But be careful of getting multiple lines with similar data if the mapping isn't one-to-one.)
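The caution about mappings that are not one-to-one is easier to see with a small example. The Python sketch below assumes the merge keeps only lines whose key appears in both files and emits one output line per matching pair; this is an assumption about the behavior, not a description of how merge_lines_based_on_shared_col is implemented. All names and accession numbers are made up.

```python
def merge_on_shared_column(data_rows, mapping_rows, data_col=0, map_col=0):
    """Join two tables on a shared key column (one output row per match)."""
    lookup = {}
    for row in mapping_rows:
        lookup.setdefault(row[map_col], []).append(row)

    merged = []
    for row in data_rows:
        for match in lookup.get(row[data_col], []):
            # Append the mapping row's other columns to the data row.
            extra = [f for i, f in enumerate(match) if i != map_col]
            merged.append(row + extra)
    return merged

# Hypothetical data keyed by old accession numbers.
data = [["OLD1", "3.2"], ["OLD2", "0.7"]]
# Hypothetical translation table; OLD2 maps to two new accessions.
mapping = [["OLD1", "NEW_A"], ["OLD2", "NEW_B"], ["OLD2", "NEW_C"]]

merged = merge_on_shared_column(data, mapping)
for row in merged:
    print("\t".join(row))
```

Two input lines become three output lines here because OLD2 has two translations, which is exactly the duplication to watch for.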

Miscellaneous Tips

After using the table of contents to get to a tool, you can bookmark the tool itself, rather than bookmarking the whole page. (The same is true for a protocol.)

Use the "more" command to look at a file page by page: press the spacebar to go to the next page and "q" to quit.