Merging involves putting together the information in two (or sometimes more)
files into one output file. In some scripts, a line from
file1 and a line from file2 are joined together into just one line of output.
In others, you would get two lines in the output. See the documentation for
details.
To use a script, cut and paste the code from the light green or blue box into a
terminal window, change the bold, red text as needed, and hit Enter.
See More Information for notes on using these tools.
In this set of tools, lines are copied unchanged from the input files
into the output file. The scripts in this section differ based on
which lines from each file are used in the output file.
Use these tools when you want to merge the same kinds of data about
different things, for example two files in the same format that contain
annotations about different sets of genes. The output will contain
annotations about both sets of genes.
All lines appearing in either or both input files will be printed. Lines from
the first file are printed in the order they appear there, followed by the
remaining lines from the second file in the order they appear there.
Note: Even if a line appears more than once in a file, or appears in both
files, it will be printed only once. (Having the same first
column in tabular data is not the same as being a duplicate.)
Example: Given two gene lists, get all genes found in either list. Run the
above script on files file1.txt and file2.txt to get a file called union.txt:

First file (file1.txt):
ap23
ap23
CG2500
cxb7

Second file (file2.txt):
cxb7
CG12345
CG2500

Output file (union.txt):
ap23
CG2500
CG12345
cxb7

Screen output:
Read 7 lines.
Took union and removed duplicates, yielding 4 lines.

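
The union rule described above can be sketched in a few lines of Python. This is only an illustration of the stated behavior (each distinct line printed once, first-file lines first), not the Scriptome script itself; the function name `union_lines` is hypothetical.

```python
def union_lines(first, second):
    """Return each distinct line once, in order of first appearance."""
    seen = set()
    result = []
    for line in list(first) + list(second):
        if line not in seen:  # skip repeats within or across files
            seen.add(line)
            result.append(line)
    return result


file1 = ["ap23", "ap23", "CG2500", "cxb7"]
file2 = ["cxb7", "CG12345", "CG2500"]
print(union_lines(file1, file2))
# ['ap23', 'CG2500', 'cxb7', 'CG12345']
```

The sketch yields the same four genes as union.txt; its ordering is simply order of first appearance.
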
Any line that appears in the first file and also appears in the second file
will be printed. The lines will be printed in the order they appear in the
first file.
Note: Even if a line appears more than once, it will be printed
only once. (Having the same first
column in tabular data is not the same as being a duplicate line.)
Example: Given two gene lists, get only genes that are found in both
lists. Run the above script on files file1.txt and file2.txt to get a
file called intersection.txt:

First file (file1.txt):
ap23
ap23
CG2500
cxb7

Second file (file2.txt):
cxb7
CG12345
CG2500

Output file (intersection.txt):
CG2500
cxb7

Screen output:
Read 7 lines.
Took intersection and then removed duplicates, yielding 2 lines.

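
As a rough illustration of the intersection rule above (lines found in both files, printed once, in first-file order), here is a minimal Python sketch. It is not the Scriptome script, and the name `intersect_lines` is hypothetical.

```python
def intersect_lines(first, second):
    """Return lines present in both inputs, once each, in first-file order."""
    in_second = set(second)
    seen = set()
    result = []
    for line in first:
        if line in in_second and line not in seen:
            seen.add(line)
            result.append(line)
    return result


file1 = ["ap23", "ap23", "CG2500", "cxb7"]
file2 = ["cxb7", "CG12345", "CG2500"]
print(intersect_lines(file1, file2))
# ['CG2500', 'cxb7']
```
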
All lines that appear in the first file or in the second file, but not in
both, will be printed. The lines from the first file will be printed (in the
order they appear), followed by the lines from the second file.
Note: this tool will use more memory and run more slowly if the second file
is very large.
Example: Given two gene lists, get genes that are NOT shared between the
lists. Run the above script on files xor1 and xor2 to get a file
called not_shared:

First file (xor1):
ap23
ap23
CG2500
CG2500
cxb7

Second file (xor2):
cxb7
CG12345
CG12345
CG2500

Output file (not_shared):
ap23
ap23
CG12345
CG12345

Screen output:
Read 9 lines.
Printed 4 lines found in xor1 or xor2, but not both

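
The "one file or the other, but not both" rule can be sketched as follows. This Python version is an assumption-laden illustration rather than the Scriptome script: the name `xor_lines` is hypothetical, and duplicates are kept because the example output above keeps them.

```python
def xor_lines(first, second):
    """Return lines found in exactly one input: first-file lines, then second-file lines."""
    in_first = set(first)
    in_second = set(second)
    only_first = [line for line in first if line not in in_second]
    only_second = [line for line in second if line not in in_first]
    return only_first + only_second  # repeated lines are not collapsed


xor1 = ["ap23", "ap23", "CG2500", "CG2500", "cxb7"]
xor2 = ["cxb7", "CG12345", "CG12345", "CG2500"]
print(xor_lines(xor1, xor2))
# ['ap23', 'ap23', 'CG12345', 'CG12345']
```
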
All lines that appear in the first file but not in the second file will be
printed. (Make sure to give the files in the correct order!)
The lines will be printed in the order they appear in the first file.
Example: Given two gene lists, get genes that are only in the first list.
Run the above script on files yes_list and no_list to get a file
called only_yes:

First file (yes_list):
ap23
ap23
CG2500
cxb7

Second file (no_list):
cxb7
CG12345
CG2500

Output file (only_yes):
ap23
ap23

Screen output:
Read 4 lines from yes_list and 3 lines from no_list.
Printed 2 lines found in yes_list but not in no_list

If we give no_list as the first argument and yes_list as the second, then
the result will contain (only) CG12345.
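
To make the effect of argument order concrete, here is a minimal Python sketch of the "in first file but not in second" rule. It is an illustration only; the function name `lines_only_in_first` is hypothetical and the lists stand in for the files yes_list and no_list.

```python
def lines_only_in_first(first, second):
    """Return lines of the first input that never appear in the second, keeping order and repeats."""
    in_second = set(second)
    return [line for line in first if line not in in_second]


yes_list = ["ap23", "ap23", "CG2500", "cxb7"]
no_list = ["cxb7", "CG12345", "CG2500"]

print(lines_only_in_first(yes_list, no_list))  # ['ap23', 'ap23']
print(lines_only_in_first(no_list, yes_list))  # ['CG12345']
```
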
In this set of tools, a line from file1 and a line from file2 are merged
together to create just one line of output. The tools in this section differ
based on which lines are merged together in the output file.
Use these tools when you want to merge different kinds of data about the
same things. For example, one file contains disease associations about a
certain set of genes, the other contains mouse orthologs for the
same genes. The output file will contain disease associations and
orthologs for the genes.
Join tables in the tab-separated files file1 and file2. For all pairs of lines
where the mth column in file1 equals the nth column in file2, print the line
from file1, a tab, and the line from file2. This operation is similar to a SQL
join.
If a value appears more than once in the join column of either file, every
matching pair of lines is printed. A value appearing three times in one file
and twice in the other will yield six output lines.
The lines will be in the order that they appear in the first file.
Example: Given a table of potential fly-human orthologs, ortho.tab, and
a table of human genes' functions, human_func.tab, join the tables to
create a file, fly_func.tab, with potential functions of fly genes.

First file (ortho.tab):
Fly1 Hum11
Fly3 Hum7
Fly7 Hum36

Second file (human_func.tab):
Hum7 light-sensing 22
Hum11 overeating 4
Hum11 oversleeping 32
Hum17 blue eyes X

Output file (fly_func.tab):
Fly1 Hum11 Hum11 overeating 4
Fly1 Hum11 Hum11 oversleeping 32
Fly3 Hum7 Hum7 light-sensing 22

Screen output:
Joining ortho.tab column 1 with human_func.tab column 0
ortho.tab: 3 lines
human_func.tab: 4 lines
Merged file: 3 lines

Example: Given a list of gene names and a tab-separated table of
annotations, take only the lines where the fourth column has names from the
list. Simply treat the list as a table with only one column. Columns are
numbered from 0, so the list is column 0 and the fourth column of the table is
column 3. That is:
perl -e '$col1=0; $col2=3; ...' gene_names.list all_annot.tab > some_annot.tab
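
The column join described in this section can be sketched in Python as below. This is a hedged illustration, not the Perl one-liner itself: the function name `join_on_columns` and its parameters are hypothetical, columns are assumed to be numbered from 0 as in the screen output above, and rows too short to contain the join column are simply skipped.

```python
def join_on_columns(rows1, rows2, col1, col2, sep="\t"):
    """Pair every row of rows1 with every row of rows2 whose join columns match."""
    # Index the second table by its join column so each lookup is fast.
    index = {}
    for row in rows2:
        fields = row.split(sep)
        if col2 < len(fields):
            index.setdefault(fields[col2], []).append(row)

    joined = []
    for row in rows1:  # output follows first-file order
        fields = row.split(sep)
        if col1 < len(fields):
            for match in index.get(fields[col1], []):
                joined.append(row + sep + match)
    return joined


ortho = ["Fly1\tHum11", "Fly3\tHum7", "Fly7\tHum36"]
human_func = [
    "Hum7\tlight-sensing\t22",
    "Hum11\tovereating\t4",
    "Hum11\toversleeping\t32",
    "Hum17\tblue eyes\tX",
]

for line in join_on_columns(ortho, human_func, col1=1, col2=0):
    print(line)
```

With the sample tables this prints the three merged lines of fly_func.tab. Fly7 is dropped because Hum36 has no match, and Hum11 produces two lines because it appears twice in the second table.
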
Join a line from file1 and a line from file2 into a single line in the
output file.
Print a line from one file, a separator, and a line from the next file. (The
default separator is a tab, \t.) Given two tables, this will print corresponding
lines from each table next to each other, effectively joining the tables side
by side. The tool will print a warning if the files have a different number of
lines.
This tool can be useful if you remove a couple of columns from a file,
manipulate those columns with other Scriptome tools, and then want to
put the changed columns back into the original file.
Example: Your original file (abbreviated BLAST results) has long identifiers
like "gi|33504569|ref|NP_878312.1|" in the second column. You want just the
gi number (e.g., 33504569) and the Refseq identifier (NP_878312.1) to be
in separate columns. You used other Scriptome tools to pull out
the identifiers, but now they're in a separate file. Running the above script
will take the original file, annotations.tab, and add to each line a tab (\t)
and then the two columns from the identifier file, ids.tab, to yield the
combined table, annotations_and_ids.tab.

First file (annotations.tab):
gi|33504569|ref|NP_878312.1| Agene 6 123456
gi|33504561|ref|NP_878311.1| Bgene 1000 X 234567
gi|42476237|ref|NP_571328.2| Cgene 2500 Y 987654

Second file (ids.tab):
33504569 NP_878312.1
33504561 NP_878311.1
42476237 NP_571328.2

Output file (annotations_and_ids.tab):
gi|33504569|ref|NP_878312.1| Agene 6 123456 33504569 NP_878312.1
gi|33504561|ref|NP_878311.1| Bgene 1000 X 234567 33504561 NP_878311.1
gi|42476237|ref|NP_571328.2| Cgene 2500 Y 987654 42476237 NP_571328.2

Screen output:
Merged 3 lines side by side with separator $separator
Merged files annotations.tab and ids.tab side by side

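
The side-by-side merge can be sketched in Python as follows. This is only an approximation of the behavior described above: the function name `paste_side_by_side` is hypothetical, and padding the shorter file with empty strings when the line counts differ is an assumption, since the text only says that a warning is printed.

```python
import sys
from itertools import zip_longest


def paste_side_by_side(first, second, separator="\t"):
    """Join line i of the first input to line i of the second with a separator."""
    if len(first) != len(second):
        # The text only promises a warning; the padding below is an assumption.
        print(
            f"Warning: inputs have {len(first)} and {len(second)} lines",
            file=sys.stderr,
        )
    return [a + separator + b for a, b in zip_longest(first, second, fillvalue="")]


annotations = [
    "gi|33504569|ref|NP_878312.1|\tAgene\t6\t123456",
    "gi|33504561|ref|NP_878311.1|\tBgene\t1000\tX\t234567",
]
ids = ["33504569\tNP_878312.1", "33504561\tNP_878311.1"]

for line in paste_side_by_side(annotations, ids):
    print(line)
```
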
Note that the lines or columns being merged must be identical.
0 and 0.0 are not considered equivalent values when joining.
Extra spaces or differences in case will also break the merge. Use a tool to
delete extra spaces and change everything to the same case if you suspect this
might cause problems. Also (until things are fixed), if you try to
merge a UNIX file with a file copied from DOS, lines that SEEM the
same might not be. Use the dos2unix tool to convert the DOS file if you suspect
this is happening.
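
As a sketch of the cleanup suggested above, the following Python function strips DOS line endings, trims extra spaces around each tab-separated field, and lowercases everything before merging. The name `normalize_line` is hypothetical, and lowercasing every field is an assumption that may not suit data where case is meaningful.

```python
def normalize_line(line):
    """Drop CR/LF endings, trim spaces around each tab-separated field, and lowercase."""
    line = line.rstrip("\r\n")  # removes UNIX (\n) and DOS (\r\n) line endings
    fields = [field.strip().lower() for field in line.split("\t")]
    return "\t".join(fields)


print(normalize_line("CG2500 \r\n") == normalize_line("cg2500\n"))  # True
print(normalize_line("0") == normalize_line("0.0"))                 # False
```

The second comparison shows that text cleanup does not make 0 and 0.0 equal; numeric values would need to be converted separately.
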
Scriptome tools are in blue or green boxes. Cut and paste the text of the
tool into a terminal window. Then edit the line as needed.
Things that will often need to be edited are highlighted in
red. Input and output filenames will almost always need to be changed.
All scripts that work on tabular data assume the data is tab-separated.
Use a Change script to change, e.g., comma-separated data
to tab-separated before using these scripts.
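
For readers without a Change script at hand, the comma-to-tab conversion can be approximated with Python's standard csv module. This is a stand-in sketch, not the Scriptome Change tool; the function name `csv_to_tsv` is hypothetical, and fields that themselves contain tabs would come out quoted and need separate handling.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Re-write comma-separated text as tab-separated text, honoring quoted commas."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


print(csv_to_tsv('Hum11,overeating,4\nHum17,"blue, eyes",X\n'), end="")
```
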
- Choose: Merging is a kind of choosing, so there is some overlap between
the two toolsets.