tc314@xxxxxxxxxxx wrote:
I've got two similar large files with one word per line and they're
sorted.
Each file has a few words not in the other.
I typically identify the unique words in the file using diff, grep, and cut.
When the files are too big (2 GB), diff dies with "memory exhausted".
I want to search for the unique words in file1 but I might need to
ping-pong since neither file is a superset of the other.
I don't want to be limited by physical RAM as the file sizes exceed
RAM.
I assume I'm not the first to have this problem.
Can someone point me to perl code?
This appears to do what you require:
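#!/usr/bin/perl
use warnings;
use strict;

# Both files must be sorted in the order that lt uses
# (plain string order, e.g. the output of LC_ALL=C sort).
my ( $file1, $file2 ) = ( 'file1', 'file2' );

open my $F1, '<', $file1 or die "Cannot open '$file1' $!";
open my $F2, '<', $file2 or die "Cannot open '$file2' $!";

# Hold one line per file; undef marks a file that has run out.
my ( $first, $second ) = ( scalar <$F1>, scalar <$F2> );

# Loop on the lines themselves, not on eof, so the last line
# read from each file is still compared.
while ( defined $first or defined $second ) {
    if ( !defined $second or ( defined $first and $first lt $second ) ) {
        print "$file1: $first";      # only in file1
        $first = <$F1>;
    }
    elsif ( !defined $first or $first gt $second ) {
        print "$file2: $second";     # only in file2
        $second = <$F2>;
    }
    else {                           # same word in both files
        $first  = <$F1>;
        $second = <$F2>;
    }
}

__END__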
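
If you want the words from each file collected separately rather than printed as they are found, the same lock-step walk can be wrapped in a sub. What follows is only a sketch under a few assumptions: the names next_word and unique_words and the sample words are made up for illustration, it marks an exhausted file with undef, and it chomps each line so that a missing newline on the last line of one file does not cause a false mismatch. It reads from in-memory filehandles here so that it runs on its own; real filehandles work the same way.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Read the next word from a filehandle, without its newline.
# Returns undef once the file is exhausted.
sub next_word {
    my ($fh) = @_;
    my $line = readline($fh);
    chomp $line if defined $line;
    return $line;
}

# Hypothetical helper: walk two sorted filehandles in step and return
# two array refs: words only in the first, words only in the second.
sub unique_words {
    my ( $fh1, $fh2 ) = @_;
    my ( @only1, @only2 );
    my ( $first, $second ) = ( next_word($fh1), next_word($fh2) );

    while ( defined $first or defined $second ) {
        if ( !defined $second or ( defined $first and $first lt $second ) ) {
            push @only1, $first;
            $first = next_word($fh1);
        }
        elsif ( !defined $first or $first gt $second ) {
            push @only2, $second;
            $second = next_word($fh2);
        }
        else {    # same word in both files
            $first  = next_word($fh1);
            $second = next_word($fh2);
        }
    }
    return ( \@only1, \@only2 );
}

# Made-up sample data; the second list lacks a final newline on purpose.
my $data1 = "apple\nbanana\ncherry\n";
my $data2 = "banana\ncherry\ndate";
open my $fh1, '<', \$data1 or die "Cannot open in-memory file: $!";
open my $fh2, '<', \$data2 or die "Cannot open in-memory file: $!";

my ( $only1, $only2 ) = unique_words( $fh1, $fh2 );
print "only in first:  @$only1\n";
print "only in second: @$only2\n";
```

Only one line per file is held at a time, so file size is not limited by RAM. The two result lists do grow with the number of differences, which should be fine when each file has only a few words not in the other.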
John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall