Merge files together

Merging involves putting together the information in two (or sometimes more) files into one output file. In some scripts, a line from file1 and a line from file2 are joined together into just one line of output. In others, you would get two lines in the output. See the documentation for details.

To use a script, cut and paste the code from the light green or blue box into a terminal window, change the bold, red text as needed, and hit Enter.

See More Information for notes on using these tools.

Merge files without changing individual lines

In this set of tools, lines are copied unchanged from the input files into the output file. The scripts in this section differ based on which lines from each file are used in the output file.

Use these tools when you want to merge the same kinds of data about different things. For example, you might have two files in the same format that contain annotations about different sets of genes. The output will contain annotations about both sets of genes.

Take lines in file 1 plus lines in file 2, removing duplicates (merge_files_union)

All lines appearing in either or both input files will be printed. Lines from the first file are printed in the order they appear there, followed by any lines found only in the second file, in the order they appear there.

Note: Even if a line appears more than once in a file, or appears in both files, it will be printed only once. (Having the same first column in tabular data is not the same as being a duplicate.)

Input file(s) First file
Input file(s) Second file
Output file

perl -e ' $count=0; while (<>) { if (! ($save{$_}++)) { print $_; $count++; } } warn "\n\nRead $. lines.\nTook union and removed duplicates, yielding $count lines.\n" ' file1.txt file2.txt > union.txt

Example: Given two gene lists, get all genes found in either list. Run the above script on files file1.txt and file2.txt to get a file called union.txt:

First file
(file1.txt)
 ap23
 ap23
 CG2500
 cxb7
Second file
(file2.txt)
 cxb7
 CG12345
 CG2500
Output file
(union.txt)
 ap23
 CG2500
 cxb7
 CG12345
Screen Output
 Read 7 lines.
 Took union and removed duplicates, yielding 4 lines.

Take any line that appears in file 1 and also appears in file 2 (merge_files_intersection)

Any line that appears in the first file and also appears in the second file will be printed. The lines will be printed in the order they appear in the first file.

Note: Even if a line appears more than once, it will be printed only once. (Having the same first column in tabular data is not the same as being a duplicate line.)

Input file(s) First file
Input file(s) Second file
Output file

perl -e ' ($file1, $file2) = @ARGV; open F2, $file2 or die $!; while (<F2>) { $h2{$_}++ }; open F1, $file1 or die; $total=$.; $printed=0; while (<F1>) { $total++; if ($h2{$_}) { print $_; $h2{$_} = ""; $printed++; } } warn "\n\nRead $total lines.\nTook intersection and then removed duplicates, yielding $printed lines.\n" ' file1.txt file2.txt > intersection.txt

Example: Given two gene lists, get only genes that are found in both lists. Run the above script on files file1.txt and file2.txt to get a file called intersection.txt:

First file
(file1.txt)
 ap23
 ap23
 CG2500
 cxb7
Second file
(file2.txt)
 cxb7
 CG12345
 CG2500
Output file
(intersection.txt)
 CG2500
 cxb7
Screen Output
 Read 7 lines.
 Took intersection and then removed duplicates, yielding 2 lines.

Take lines in file 1 or file 2, but not both (merge_files_lines_not_in_both)

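All lines that appear in the first file or in the second file, but not in both, will be printed. The lines from the first file will be printed (in the order they appear), followed by the lines from the second file. Unlike the union and intersection tools, this tool does not remove duplicates: a line that appears twice in one file and not at all in the other is printed twice.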

Note: this tool will use more memory and run more slowly if the second file is very large.

Input file(s) First file
Input file(s) Second file
Output file

perl -e ' ($file1, $file2) = @ARGV; $printed = 0; $total = 0; open F2, $file2 or die "$file2: $!\n"; @lines2 = <F2>; $total += $.; foreach (@lines2) { $h2{$_}++; } open F1, $file1 or die "$file1: $!\n"; while (<F1>) { if (exists $h2{$_}) { $h2{$_} = 0; } else { print $_; $printed++; } } $total += $.; foreach (@lines2) { if ($h2{$_}) { print $_; $printed ++; } } warn "\nRead $total lines.\nPrinted $printed lines found in $file1 or $file2, but not both\n\n"; ' xor1 xor2 > not_shared

Example: Given two gene lists, get genes that are NOT shared between the lists. Run the above script on files xor1 and xor2 to get a file called not_shared:

First file
(xor1)
 ap23
 ap23
 CG2500
 CG2500
 cxb7
Second file
(xor2)
 cxb7
 CG12345
 CG12345
 CG2500
Output file
(not_shared)
 ap23
 ap23
 CG12345
 CG12345
Screen Output
 Read 9 lines.
 Printed 4 lines found in xor1 or xor2, but not both

Take lines in file 1 but NOT in file 2 (merge_files_lines_only_in_first)

All lines that appear in the first file but not in the second file will be printed. (Make sure to give the files in the correct order!) The lines will be printed in the order they appear in the first file, and duplicates are not removed.

Input file(s) Take lines in this file
Input file(s) Exclude lines in this file
Output file

perl -e ' ($file1, $file2) = @ARGV; $printed = 0; open F2, $file2 or die "$file2: $!\n"; while (<F2>) { $h2{$_}++ }; $count2 = $.; open F1, $file1 or die "$file1: $!\n"; while (<F1>) { if (! $h2{$_}) { print $_; $printed++; } } $count1 = $.; warn "\nRead $count1 lines from $file1 and $count2 lines from $file2.\nPrinted $printed lines found in $file1 but not in $file2\n\n" ' yes_list no_list > only_yes

Example: Given two gene lists, get genes that are only in the first list. Run the above script on files yes_list and no_list to get a file called only_yes:

First file
(yes_list)
 ap23
 ap23
 CG2500
 cxb7
Second file
(no_list)
 cxb7
 CG12345
 CG2500
Output file
(only_yes)
 ap23
 ap23
Screen Output
 Read 4 lines from yes_list and 3 lines from no_list.
 Printed 2 lines found in yes_list but not in no_list

If we give no_list as the first argument and yes_list as the second, then the result will contain (only) CG12345.

Merge lines from separate files to create single lines

In this set of tools, a line from file1 and a line from file2 are merged together to create just one line of output. The tools in this section differ based on which lines are merged together in the output file.

Use these tools when you want to merge different kinds of data about the same things. For example, one file might contain disease associations for a certain set of genes, while the other contains mouse orthologs for the same genes. The output file will contain both disease associations and orthologs for the genes.

Join two tables based on columns sharing a value (merge_lines_based_on_shared_column)

Join tables in tab-separated files file1 and file2. For all lines where the mth column in file 1 equals the nth column in file2, print the line from file1, a tab, and the line from file2. This operation is similar to a SQL join.

If a value appears more than once in either file, every matching pair of lines is printed. For example, a value appearing three times in the first file and twice in the second file will yield six output lines. The output lines will be in the order that the matching lines appear in the first file.

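$col1 Column in first file (columns are numbered from 0, so 0 is the first column)
$col2 Column in second file (columns are numbered from 0, so 0 is the first column)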
Input file(s) First file
Input file(s) Second file
Output file

perl -e ' $col1=1; $col2=0; $merged=0; ($f1,$f2)=@ARGV; open(F2,$f2) or die "$f2: $!\n"; while (<F2>) { s/\r?\n//; @F=split /\t/, $_; $line2{$F[$col2]} .= "$_\n" }; $count2 = $.; open(F1,$f1) or die "$f1: $!\n"; while (<F1>) { s/\r?\n//; @F=split /\t/, $_; $x = $line2{$F[$col1]}; if ($x) { $num_changes = ($x =~ s/^/$_\t/gm); print $x; $merged += $num_changes } } warn "\nJoining $f1 column $col1 with $f2 column $col2\n$f1: $. lines\n$f2: $count2 lines\nMerged file: $merged lines\n"; ' ortho.tab human_func.tab > fly_func.tab

Example: Given a table of potential fly-human orthologs ortho.tab and a table of human genes' functions human_func.tab, join the tables to create a file fly_func.tab with potential functions of fly genes.

First file
(ortho.tab)
 Fly1   Hum11
 Fly3   Hum7
 Fly7   Hum36
Second file
(human_func.tab)
 Hum7   light-sensing   22
 Hum11  overeating      4
 Hum11  oversleeping    32
 Hum17  blue eyes       X
Output file
(fly_func.tab)
 Fly1   Hum11   Hum11   overeating      4
 Fly1   Hum11   Hum11   oversleeping    32
 Fly3   Hum7    Hum7    light-sensing   22
Screen Output
 Joining ortho.tab column 1 with human_func.tab column 0
 ortho.tab: 3 lines
 human_func.tab: 4 lines
 Merged file: 3 lines

Example: Given a list of gene names and a tab-separated table of annotations, take only the lines where the fourth column has names from the list. Simply treat the list as a table with only one column. That is:

 perl -e '$col1=0; $col2=3; ...' gene_names.list all_annot.tab > some_annot.tab

Join two tables side by side (merge_lines_side_by_side)

Join a line from file1 and a line from file2 into a single line in the output file. Print a line from one file, a separator, and a line from the next file. (The default separator is a tab, \t.) Given two tables, this will print corresponding lines from each table next to each other, effectively joining the tables side by side. The tool will print a warning if the files have a different number of lines.

This tool can be useful if you remove a couple of columns from a file, manipulate those columns with other Scriptome tools, and then want to put the changed columns back into the original file.

$separator Separator between lines from file1 and file2
Input file(s) First file
Input file(s) Second file
Output file

perl -e ' $separator="\t"; ($file1, $file2) = @ARGV; open (F1, $file1) or die; open (F2, $file2) or die; while (<F1>) { if (eof(F2)) { warn "WARNING: File $file2 ended early\n"; last } $line2 = <F2>; s/\r?\n//; print "$_$separator$line2" } if (! eof(F2)) { warn "WARNING: File $file1 ended early\n"; } warn "\nMerged $. lines side by side with separator $separator\nMerged files $file1 and $file2 side by side\n\n" ' annotations.tab ids.tab > annotations_and_ids.tab

Example: Your original file (abbreviated BLAST results) has long identifiers like "gi|33504569|ref|NP_878312.1|" in the second column. You want just the gi number (e.g., 33504569) and the Refseq identifier (NP_878312.1) to be in separate columns. You used other Scriptome tools to pull out the identifiers, but now they're in a separate file. Running the above script will take each line of the original file, annotations.tab, add a tab (\t), and then add the two columns from the identifier file, ids.tab, to yield the combined table, annotations_and_ids.tab.

First file
(annotations.tab)
 gi|33504569|ref|NP_878312.1|   Agene 6 123456
 gi|33504561|ref|NP_878311.1|   Bgene   1000    X       234567
 gi|42476237|ref|NP_571328.2|   Cgene   2500    Y       987654
Second file
(ids.tab)
 33504569       NP_878312.1
 33504561       NP_878311.1
 42476237       NP_571328.2
Output file
(annotations_and_ids.tab)
 gi|33504569|ref|NP_878312.1|   Agene 6 123456  33504569        NP_878312.1
 gi|33504561|ref|NP_878311.1|   Bgene   1000    X       234567  33504561        NP_878311.1
 gi|42476237|ref|NP_571328.2|   Cgene   2500    Y       987654  42476237        NP_571328.2
Screen Output
 Merged 3 lines side by side with separator $separator
 Merged files annotations.tab and ids.tab side by side


More Information

General Merging Notes

Note that the lines or columns that are being merged must be identical. 0 and 0.0 are not considered equivalent values when joining. Extra spaces or different case will also break the merge. Use a tool to delete extra spaces and change everything to the same case if you suspect this might cause problems. Also (until things are fixed), if you try to merge a UNIX file with a file copied from DOS, lines that SEEM the same might not be. Use the dos2unix tool to convert the DOS file if you suspect this is happening.

General Scriptome Notes

Scriptome tools are in blue or green boxes. Cut and paste the text of the tool into a terminal window. Then edit the line as needed. Things that will often need to be edited are highlighted in red. Input and output filenames will almost always need to be changed.

All scripts that work on tabular data assume the data is tab-separated. Use a Change script to change, e.g., comma-separated data to tab-separated before using these scripts.

See Also

Choose

Merging is a kind of choosing, so there is some overlap between the two toolsets.

 
