Contents

Merge files together

Merge files without changing individual lines

Merge lines from separate files to create single lines

More Information

General Merging Notes
General Scriptome Notes
See Also

Merge files together

Merging involves putting together the information in two (or sometimes more) files into one output file. In some scripts, a line from file1 and a line from file2 are joined together into just one line of output. In others, you would get two lines in the output. See the documentation for details.

To use a script, cut and paste the code from the light green or blue box into a terminal window, change the bold, red text as needed, and hit Enter.

See More Information for notes on using these tools.

Merge files without changing individual lines

In this set of tools, lines are copied unchanged from the input files into the output file. The scripts in this section differ based on which lines from each file are used in the output file.

Use these tools when you want to merge the same kinds of data about different things. For example, you might have two files in the same format that contain annotations about different sets of genes. The output will contain annotations about both sets of genes.

Take lines in file 1 plus lines in file 2, removing duplicates (merge_files_union)

All lines appearing in either or both input files will be printed. Lines from the first file are printed in the order they appear there, followed by lines found only in the second file, in the order they appear there.

Note: Even if a line appears more than once in a file, or appears in both files, it will be printed only once. (Having the same first column in tabular data is not the same as being a duplicate.)

Example: Given two gene lists, get all genes found in either list. Run the above script on files file1.txt and file2.txt to get a file called union.txt:

First file (`file1.txt`)	ap23 ap23 CG2500 cxb7
Second file (`file2.txt`)	cxb7 CG12345 CG2500
Output file (`union.txt`)	ap23 CG2500 CG12345 cxb7
Screen Output	Read 7 lines. Took union and removed duplicates, yielding 4 lines.
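
For readers who want to see the logic spelled out, here is a minimal sketch of the union rule in Python. It is not the Scriptome script itself. The function name `union_lines` is hypothetical, and the files are assumed to have been read into lists with one entry per line. The ordering follows the rule stated in the text above.

```python
def union_lines(first, second):
    """Return every distinct line once: first-file order, then new lines from the second file."""
    seen = set()
    merged = []
    for line in list(first) + list(second):
        if line not in seen:  # a line counts as a duplicate only if the whole line matches
            seen.add(line)
            merged.append(line)
    return merged


# The gene lists from the example, one list entry per line of the file.
file1 = ["ap23", "ap23", "CG2500", "cxb7"]
file2 = ["cxb7", "CG12345", "CG2500"]
union = union_lines(file1, file2)
print(len(file1) + len(file2), "lines read,", len(union), "lines kept")
```

The sketch keeps the same four genes as union.txt in the example.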

Take any line that appears in file 1 and also appears in file 2 (merge_files_intersection)

Any line that appears in the first file and also appears in the second file will be printed. The lines will be printed in the order they appear in the first file.

Note: Even if a line appears more than once, it will be printed only once. (Having the same first column in tabular data is not the same as being a duplicate line.)

Example: Given two gene lists, get only genes that are found in both lists. Run the above script on files file1.txt and file2.txt to get a file called intersection.txt:

First file (`file1.txt`)	ap23 ap23 CG2500 cxb7
Second file (`file2.txt`)	cxb7 CG12345 CG2500
Output file (`intersection.txt`)	CG2500 cxb7
Screen Output	Read 7 lines. Took intersection and then removed duplicates, yielding 2 lines.
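
The intersection rule can be sketched the same way. This is an illustration in Python, not the Scriptome code. `intersect_lines` is a hypothetical name, and the inputs are assumed to be lists holding one line each.

```python
def intersect_lines(first, second):
    """Return lines found in both inputs, once each, in first-file order."""
    in_second = set(second)
    seen = set()
    shared = []
    for line in first:
        if line in in_second and line not in seen:
            seen.add(line)
            shared.append(line)
    return shared


file1 = ["ap23", "ap23", "CG2500", "cxb7"]
file2 = ["cxb7", "CG12345", "CG2500"]
print(intersect_lines(file1, file2))  # ['CG2500', 'cxb7']
```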

Take lines in file 1 or file 2, but not both (merge_files_lines_not_in_both)

All lines that appear in the first file or in the second file, but not in both, will be printed. The lines from the first file will be printed (in the order they appear), followed by the lines from the second file.

Note: this tool will use more memory and run more slowly if the second file is very large.

Example: Given two gene lists, get genes that are NOT shared between the lists. Run the above script on files xor1 and xor2 to get a file called not_shared:

First file (`xor1`)	ap23 ap23 CG2500 CG2500 cxb7
Second file (`xor2`)	cxb7 CG12345 CG12345 CG2500
Output file (`not_shared`)	ap23 ap23 CG12345 CG12345
Screen Output	Read 9 lines. Printed 4 lines found in xor1 or xor2, but not both
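
As a rough illustration of the "not in both" rule, the Python sketch below keeps repeated lines, as the example output does. It is not the Scriptome script. `lines_not_in_both` is a hypothetical name, and holding both files in memory is a simplification made for this sketch.

```python
def lines_not_in_both(first, second):
    """Return lines unique to the first input, then lines unique to the second."""
    in_first = set(first)
    in_second = set(second)
    only_first = [line for line in first if line not in in_second]
    only_second = [line for line in second if line not in in_first]
    return only_first + only_second


xor1 = ["ap23", "ap23", "CG2500", "CG2500", "cxb7"]
xor2 = ["cxb7", "CG12345", "CG12345", "CG2500"]
print(lines_not_in_both(xor1, xor2))  # ['ap23', 'ap23', 'CG12345', 'CG12345']
```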

Take lines in file 1 but NOT in file 2 (merge_files_lines_only_in_first)

All lines that appear in the first file but not in the second file will be printed. (Make sure to give the files in the correct order!) The lines will be printed in the order they appear in the first file.

Example: Given two gene lists, get genes that are only in the first list. Run the above script on files yes_list and no_list to get a file called only_yes:

First file (`yes_list`)	ap23 ap23 CG2500 cxb7
Second file (`no_list`)	cxb7 CG12345 CG2500
Output file (`only_yes`)	ap23 ap23
Screen Output	Read 4 lines from yes_list and 3 lines from no_list. Printed 2 lines found in yes_list but not in no_list

If we give no_list as the first argument and yes_list as the second, then the result will contain (only) CG12345.
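
The effect of argument order is easy to see in a short Python sketch. This is an illustration only, with a hypothetical function name `lines_only_in_first` and the files assumed to be lists of lines.

```python
def lines_only_in_first(first, second):
    """Return lines of the first input that never occur in the second, keeping repeats."""
    in_second = set(second)
    return [line for line in first if line not in in_second]


yes_list = ["ap23", "ap23", "CG2500", "cxb7"]
no_list = ["cxb7", "CG12345", "CG2500"]
print(lines_only_in_first(yes_list, no_list))  # ['ap23', 'ap23']
print(lines_only_in_first(no_list, yes_list))  # ['CG12345']
```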

Merge lines from separate files to create single lines

In this set of tools, a line from file1 and a line from file2 are merged together to create just one line of output. The tools in this section differ based on which lines are merged together in the output file.

Use these tools when you want to merge different kinds of data about the same things. For example, one file contains disease associations for a certain set of genes, and the other contains mouse orthologs for the same genes. The output file will contain disease associations and orthologs for the genes.

Join two tables based on columns sharing a value (merge_lines_based_on_shared_column)

Join tables in tab-separated files file1 and file2. For all lines where the mth column in file1 equals the nth column in file2, print the line from file1, a tab, and the line from file2. This operation is similar to a SQL join.

If a value appears more than once in either file, each line containing it is paired with every matching line in the other file. A value appearing three times in one file and twice in the other will yield six output lines. The lines will be in the order that they appear in the first file.

Example: Given a table of potential fly-human orthologs ortho.tab and a table of human genes' functions human_func.tab, join the tables to create a file fly_func.tab with potential functions of fly genes.

First file (`ortho.tab`)

 Fly1    Hum11
 Fly3    Hum7
 Fly7    Hum36

Second file (`human_func.tab`)

 Hum7    light-sensing   22
 Hum11   overeating      4
 Hum11   oversleeping    32
 Hum17   blue eyes       X

Output file (`fly_func.tab`)

 Fly1   Hum11   Hum11   overeating      4
 Fly1   Hum11   Hum11   oversleeping    32
 Fly3   Hum7    Hum7    light-sensing   22

Screen Output

 Joining ortho.tab column 1 with human_func.tab column 0
 ortho.tab: 3 lines
 human_func.tab: 4 lines
 Merged file: 3 lines

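Example: Given a list of gene names and a tab-separated table of annotations, take only the lines where the fourth column has a name from the list. Simply treat the list as a table with only one column. Columns are counted from 0, so the list's only column is $col1=0 and the table's fourth column is $col2=3. That is: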

 perl -e '$col1=0; $col2=3; ...' gene_names.list all_annot.tab > some_annot.tab
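
To make the join rule concrete, here is a minimal sketch in Python. It is not the Perl one-liner elided above. The function name `join_on_column` is hypothetical, rows are assumed to be tab-separated strings without trailing newlines, and columns are counted from 0 as in the screen output of the first example.

```python
from collections import defaultdict


def join_on_column(rows1, col1, rows2, col2):
    """Pair each row of rows1 with every row of rows2 whose col2 equals the row's col1."""
    index = defaultdict(list)
    for row in rows2:
        index[row.split("\t")[col2]].append(row)  # group second-file rows by join value
    joined = []
    for row in rows1:  # output follows first-file order
        key = row.split("\t")[col1]
        for match in index.get(key, []):
            joined.append(row + "\t" + match)
    return joined


ortho = ["Fly1\tHum11", "Fly3\tHum7", "Fly7\tHum36"]
human_func = [
    "Hum7\tlight-sensing\t22",
    "Hum11\tovereating\t4",
    "Hum11\toversleeping\t32",
    "Hum17\tblue eyes\tX",
]
for line in join_on_column(ortho, 1, human_func, 0):
    print(line)
```

Run on the ortho.tab and human_func.tab rows, the sketch produces the three lines shown for fly_func.tab. Because every match is paired, a value that occurs three times in one input and twice in the other gives six joined lines.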

Join two tables side by side (merge_lines_side_by_side)

Join a line from file1 and a line from file2 into a single line in the output file. Print a line from one file, a separator, and a line from the next file. (The default separator is a tab, \t.) Given two tables, this will print corresponding lines from each table next to each other, effectively joining the tables side by side. The tool will print a warning if the files have a different number of lines.

This tool can be useful if you remove a couple of columns from a file, manipulate those columns with other Scriptome tools, and then want to put the changed columns back into the original file.

Example: Your original file (abbreviated BLAST results) has long identifiers like "gi|33504569|ref|NP_878312.1|" in the second column. You want just the gi number (e.g., 33504569) and the RefSeq identifier (NP_878312.1) to be in separate columns. You used other Scriptome tools to pull out the identifiers, but now they're in a separate file. Running the above script will take each line of the original file, annotations.tab, and add a tab (\t) followed by the two columns from the identifier file, ids.tab, to yield the combined table, annotations_and_ids.tab.

First file (`annotations.tab`)

 gi|33504569|ref|NP_878312.1|   Agene   6   123456
 gi|33504561|ref|NP_878311.1|   Bgene   1000   X   234567
 gi|42476237|ref|NP_571328.2|   Cgene   2500   Y   987654

Second file (`ids.tab`)

 33504569   NP_878312.1
 33504561   NP_878311.1
 42476237   NP_571328.2

Output file (`annotations_and_ids.tab`)

 gi|33504569|ref|NP_878312.1|   Agene   6   123456   33504569   NP_878312.1
 gi|33504561|ref|NP_878311.1|   Bgene   1000   X   234567   33504561   NP_878311.1
 gi|42476237|ref|NP_571328.2|   Cgene   2500   Y   987654   42476237   NP_571328.2

Screen Output

 Merged 3 lines side by side with separator $separator
 Merged files annotations.tab and ids.tab side by side
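
The side-by-side merge can be sketched in Python as follows. This is an illustration, not the Scriptome script, and `side_by_side` is a hypothetical name. The text above only says that a warning is printed when the line counts differ. Padding the shorter input with empty strings is an assumption made for this sketch.

```python
import sys
from itertools import zip_longest


def side_by_side(first, second, separator="\t"):
    """Join corresponding lines of two inputs with a separator (tab by default)."""
    if len(first) != len(second):
        print("Warning: inputs have different numbers of lines", file=sys.stderr)
    # Assumption: when one input is shorter, its missing lines are treated as empty.
    return [a + separator + b for a, b in zip_longest(first, second, fillvalue="")]


annotations = ["gi|33504569|ref|NP_878312.1|\tAgene", "gi|33504561|ref|NP_878311.1|\tBgene"]
ids = ["33504569\tNP_878312.1", "33504561\tNP_878311.1"]
for line in side_by_side(annotations, ids):
    print(line)
```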

More Information

General Merging Notes

Note that the lines or columns that are being merged must be identical. 0 and 0.0 are not considered equivalent values when joining. Extra spaces or different case will also break the merge. Use a tool to delete extra spaces and change everything to the same case if you suspect this might cause problems. Also (until things are fixed), if you try to merge a UNIX file with a file copied from DOS, lines that SEEM the same might not be. Use the dos2unix tool to convert the DOS file if you suspect this is happening.
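
If you are comfortable with Python, the cleanup described above can be sketched in a few lines. This is not a Scriptome tool. `normalize_line` is a hypothetical helper, and lower-casing every field is an assumption that may not suit data where case carries meaning. It does not make 0 and 0.0 equal.

```python
def normalize_line(line):
    """Strip DOS/UNIX line endings, trim and collapse spaces in each tab-separated field, lower-case."""
    fields = line.rstrip("\r\n").split("\t")
    return "\t".join(" ".join(field.split()).lower() for field in fields)


print(normalize_line("CG2500 \r\n") == normalize_line("cg2500\n"))  # True
```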

General Scriptome Notes

Scriptome tools are in blue or green boxes. Cut and paste the text of the tool into a terminal window. Then edit the line as needed. Things that will often need to be edited are highlighted in red. Input and output filenames will almost always need to be changed.

All scripts that work on tabular data assume the data is tab-separated. Use a Change script to change, e.g., comma-separated data to tab-separated before using these scripts.
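
As an illustration of that conversion step, the Python sketch below turns comma-separated text into tab-separated text using the standard `csv` module. It is not the Scriptome Change script. `csv_to_tsv` is a hypothetical name, and the sketch assumes the fields themselves contain no tabs.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert comma-separated text to tab-separated text, honoring quoted fields."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


print(csv_to_tsv('Hum17,"blue eyes",X\n'), end="")
```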