About The Scriptome

Project overview

Problem: Many experimental biologists must edit large files by hand, or ask programmers for help, in order to explore or manipulate their data. Some may even give up because they don't have the tools to perform operations that are very simple to describe.

Vision: The Scriptome will empower experimental biologists to manipulate and explore their data, on their own, with minimal training.

Users: The main niche of targeted users will be biologists with little or no programming experience, who may be only occasional users. Another group that may find The Scriptome useful is biologists currently in the process of learning programming. Finally, experienced programmers can use it as a "cookbook" rather than writing their own scripts.

Scope: The Scriptome will provide tools to filter or format existing data, rather than performing complex data analysis (like Spotfire or Rosetta Resolver) or generating new data (like BLAST). Biologists' questions are expected to be diverse and to change over time.

(The following is from the abstract for a poster accepted for presentation at ISMB 2005, http://iscb.org/ismb2005, which expresses the above ideas in greater detail.)

The Scriptome: A minimal-learning toolbox for manipulating biological data

Motivation: Computational biology tools generate large files in a wide array of formats. Although analyzing the data in these files or reformatting them for other tools may be trivial for programmers, it often presents a real barrier for the majority of biologists, who are not programmers. Existing approaches are demanding: building universal graphical user interfaces requires a large investment of programming time; teaching biologists to program requires substantial training. Furthermore, given the rate at which the underlying data and data structures change, such efforts are not likely to abolish the need for text manipulation.

We present here the Scriptome project, designed to allow non-programmers to perform simple computational tasks such as formatting, filtering and low-level analysis of biological data. It combines the bioinformaticists' experience and skills with the biologists' way of thinking. Bioinformaticists create tiny, "atomic" tools, each performing simple operations on commonly used biological data structures. Biologists assemble "protocols" that use these tools, in an approach similar to constructing experimental protocols in the lab.

Results: We have developed a prototype of the Scriptome, released as a searchable web site containing single-line Perl scripts that perform common tasks. To use the Scriptome, a biologist copies and pastes these scripts onto the UNIX command line. We chose to use the command line rather than developing a dedicated interface, because it allows new tools to be built faster, avoids complicated installation and dependencies, and lowers the biologists' learning barrier. A significant component of the web site is documentation (including usage notes, sample protocols, and links to related tools), which helps biologists choose and use the tools effectively. In addition to the documentation, we offer a training session organized around solving representative biological problems by building protocols.

A typical example involves searching two BLAST output files, each with hundreds of hits, to create a non-redundant list of genes for further study. The problem is broken down into a protocol with these steps:

Select top N BLAST hits from a BLAST report. (Do once for each of the two BLAST output files.)
Combine the two gene lists, keeping only one copy of duplicated genes.
Get the descriptions of the genes from the FASTA formatted sequence database.

The biologist now selects a Scriptome tool for each step, pasting the Perl code onto a UNIX command line (and editing it, if needed).
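
To make the three steps of the protocol concrete, here is a minimal sketch of the same logic in Python. It is not Scriptome code (the actual tools are single-line Perl scripts on the web site), and it rests on several assumptions: the BLAST reports are in tab-separated tabular form with the hit (subject) ID in the second column, each report holds the hits for a single query with the best hits first, and each FASTA header has the form ">ID description". The function names top_hits, combine_unique and fasta_descriptions are hypothetical, chosen only for this illustration.

```python
def top_hits(blast_lines, n):
    """Step 1: return the subject IDs of the first n hits in a tabular BLAST report.

    Assumes tab-separated lines with the subject ID in column 2,
    already sorted with the best hits first.
    """
    ids = []
    for line in blast_lines:
        if len(ids) >= n:
            break
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        ids.append(line.rstrip("\n").split("\t")[1])
    return ids


def combine_unique(*id_lists):
    """Step 2: merge gene lists, keeping only the first copy of each ID."""
    seen = set()
    merged = []
    for ids in id_lists:
        for gene in ids:
            if gene not in seen:
                seen.add(gene)
                merged.append(gene)
    return merged


def fasta_descriptions(fasta_lines, wanted):
    """Step 3: map each wanted ID to the description on its FASTA header line.

    Assumes headers of the form ">ID description"; IDs that are not
    found in the FASTA data are simply absent from the result.
    """
    wanted = set(wanted)
    found = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].strip().split(None, 1)
        if parts and parts[0] in wanted:
            found[parts[0]] = parts[1] if len(parts) > 1 else ""
    return found


if __name__ == "__main__":
    # Tiny in-memory stand-ins for two BLAST reports and a FASTA database.
    report1 = ["query1\tgeneA\t99.0\n", "query1\tgeneB\t95.0\n"]
    report2 = ["query2\tgeneB\t98.0\n", "query2\tgeneD\t91.0\n"]
    database = [">geneA putative kinase\n", "ATGC\n", ">geneD\n", "CCGT\n"]

    genes = combine_unique(top_hits(report1, 2), top_hits(report2, 2))
    for gene, desc in fasta_descriptions(database, genes).items():
        print(gene, desc, sep="\t")
```

Each function corresponds to one step of the protocol, so the output of one step is the input of the next, much as the output of one Scriptome tool is fed to the next on the command line.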

Conclusions: The Scriptome offers an innovative approach to bridging the way bioinformaticists and biologists manipulate data. Relying on their existing skills, and with minimal training, it allows biologists to solve, on their own, data formatting and analysis problems that might otherwise be prohibitively difficult.

 Amir Karger, Christopher Botka, and Eitan Rubin
 Bauer Center for Genomics Research