How does one implement the Huffman tree algorithm in c plus plus?
As a student in 1951, David A. Huffman (1925-1999) was given the
option of sitting a final exam or submitting a term paper. He chose
the latter. The term paper outlined the problem: find the most
efficient binary code. In other words, given any amount of data,
find the most efficient method of encoding that data in the least
number of bits. When we encode data using standard 8-bit symbols,
every symbol is obviously the same length (8 bits). If we alter the
encoding so that we use shorter codes for symbols that appear most
often, and longer codes for those that appear less often, we can
encode the data more efficiently. For instance, if the most
frequent symbol could be encoded with just 1 bit (say, 0), then we
save 7 bits for every occurrence of that symbol. However, no other
symbol can use an encoding that begins with 0, they must all be
longer than 1 bit and must begin with a 1. Moreover, no symbol can
start with the same sequence of bits assigned to any other
symbol.
Huffman quickly ascertained that he was looking for a prefix
coding method. While efficient encoding methods certainly existed
at the time (Shannon-Fano being one), none could be guaranteed to
provide the most efficient coding. Indeed, Huffman had all but
given up and resigned himself to the final exam when the answer
finally dawned on him. If he used a frequency-sorted binary tree,
he could encode any symbol in the tree by tracing the path back to
the root. The symbols formed the leaves of the tree and if he
followed the left path to the leaf's parent, he prefixed a 0 bit,
otherwise he prefixed a 1 bit. Repeating this process all the way
to the root, every symbol could be assigned a unique prefix code.
Moreover, when reading the encoded data back, he simply started at
the root and followed the left or right paths according to the bits
in the prefix. When he reached a leaf node, he'd found his symbol,
and started at the root to find the next.
Huffman wasn't the first to discover this of course, but what
had eluded all the mathematicians and programmers of the day was
simply how to go about building the tree in the first place.
Normally we build trees from the top down, from the root. Huffman's
revelation was in building the tree from the bottom up, starting
with the leaves. With this simple change, not only did he complete
his term paper with honours, he exceeded even his professor's
expectations. His method proved to be the most efficient encoding
and Huffman's name was entered into the computing history
books.
Huffman started by creating a frequency table of symbols which
he then sorted in descending order of frequency. He then copied the
table and, while the copy contained more than one node, he
extracted (removed) the last two nodes and constructed a parent
node, with the last node attached to its left node, and its
predecessor to its right. The parent node's frequency was the sum
of its left and right node frequencies. The parent node was then
inserted back into the copy, maintaining the descending order of
frequency. When only one node remained in the copy, the tree was
complete. The one remaining node becomes the root of the Huffman
tree.
You then encode the tree. Initially, all nodes have no prefix
code but, starting from the root, you copy its prefix (an empty
string) to its left child and append a "0", and then recursively
encode the left child (thus its left child will be prefix "00"). Do
the same for the right child, but append a "1" instead. Thus the
child to the right of the root's left child will be prefix "01".
When you reach a leaf, its prefix code determines the precise route
you took from the root to the leaf (where "0" means go left, and
"1" means go right). Since the tree was a copy of the frequency
table, each leaf in the original frequency table will also be
encoded.
You are now ready to begin encoding the file. You begin by
outputing the original file size (so you know when to stop reading
when decoding the encoded file). You then output the binary
representation of the tree and its symbols. That is, starting from
the root node, output a 0 bit if it is a parent node, or a 1 bit
followed by the 8-bit unencoded symbol if it is a leaf. Repeat the
process for the node's left and right children if it has any (it
must have both or neither, it cannot have one but not the other).
Finally, read the original file one symbol at a time, locate it in
the frequency table, and output its prefix code. When writing
individual bits, you need to use a bitwriter that buffers up to 8
bits at a time and then write a complete byte. When you write the
final bit, pad any unused bits with zeroes to make a full byte.
To decode the encoded file, use a bitreader to read back 1 bit
at a time. Decoding is simply a matter of reading the file size,
then reconstructing the tree from its binary representation, and
then reading one bit a time to navigate the tree from the root.
when you reach a leaf, output the symbol, decrement the file size
and return to the root. Repeat until the file size is zero, at
which point the original file is restored.
The following implementation is not complete by any means. For
instance, if you were to encode a file containing the text "The
quick brown fox jumps over the lazy dog.", you will find that this
44 byte file compresses to a 66 byte file. This seems unimpressive,
but remember that the encoded file also contains the file size and
the Huffman tree as well as the encoded data. The encoded data is
actually smaller than the original file, but because the file is so
small to begin with, the overhead of the additional information
produces a larger file. Huffman encoding works best on larger files
(the bigger the better). An alternative approach is to use a static
table, thus eliminating the need to transmit the table along with
the encoded data. However this only works when the frequency table
is highly predictable -- it won't be suitable for all file types.
Another approach is to create several static tables, and use the
one that produces the smallest file. Again, this is less than
optimal, but if you have 255 different tables, then you only need
to transmit 8 additional bits to determine which table to use for
decoding (where table 0 implies no encoding, for those files that
cannot be reduced with any of the tables). But for larger files,
the overhead of the frequency table is more than outweighed by the
reduction in the encoded data. For instance, a file that contains
nothing but 0 bits can be reduced to a file a little over 1/8th the
size of the original file. Of course, such a file could easily be
compressed to a fraction of that simply by utilising run length
encoding, but you can also combine both methods, using RLE to
reduce the repeated bit sequences, and then Huffman encoding to
reduce the RLE file. These implementations (amongst many others,
many of which are patented) are left for the reader to
discover.
In the meantime, be aware that the following implementation
provides only the minimum functionality to both encode and decode
using Huffman's algorithm, and makes use of standard output for the
DESTINATION file (you may wish to make the destination file a
command-line option, and test the source and destination are
actually different files). Note also that besides some simple
error-checks (some of which are unnecessary with prudent assertions
and verifications), overall robustness has been sacrificed for the
sake of simplicity.
// Huffman.exe
// Copyright ©2013 PCForrest.
#include<iostream>
#include<list>
// The bitreader class facilitates the reading
// of binary files, 1 bit at a time.
class bitreader
{
public:
// The constructor accepts a file pointer to a file that is
// already opened for binary reading. The m_size and
m_buffer
// members are initially zero.
bitreader( FILE* file ): m_file( file ), m_buffer( 0 ), m_size(
0 ){}
~bitreader(){};
int readbit();
public:
FILE* m_file;
unsigned char m_buffer;
int m_size;
};
// Returns a single bit (0 or 1) from the file.
// Returns -1 if an error occurs.
int bitreader::readbit()
{
// Are there any bits in the buffer?
if( !m_size )
{
// Read 8 bits into the buffer.
size_t read = fread( &m_buffer, 1, 1, m_file );
if( read == 1 )
m_size = 8;
else
// Fatal error!
return( -1 );
}
// If we get this far, there is at least 1 bit in the
buffer.
// Extract the MSB from the buffer, shift the remaining bits
// left and decrement the size.
int bit = m_buffer&0x80?1:0;
m_buffer<<=1;
--m_size;
// Return the extracted bit.
return( bit );
}
// The bitwriter class facilitates the writing
// of binary files, 1 bit at a time.
class bitwriter
{
public:
// The constructor simply initialises the members.
bitwriter(): m_buffer( 0 ), m_unused_bits( 8 ){}
~bitwriter();
void writebit( unsigned char bit = 1 );
public:
char m_buffer; // The bit buffer.
int m_unused_bits; // Unused bits.
};
// The destructor pads any unwritten bits to make a full
byte.
bitwriter::~bitwriter()
{
while( m_unused_bits != 8 )
writebit( 0 );
}
// Writes a single bit to the buffer. When the buffer is
full,
// outputs the buffer to standard output and resets the
buffer.
void bitwriter::writebit( unsigned char bit /* = 1 */ )
{
// Bit must be 0 or 1.
if( bit > 1 )
return;
// Decrement the buffer length.
--m_unused_bits;
if( bit )
{
// Shift the bit left if necessary.
if( m_unused_bits )
bit <<= m_unused_bits;
// Combine bit with current buffer.
m_buffer |= bit;
}
// Do we have a complete byte?
if( !m_unused_bits )
{
// Write the byte and empty the buffer.
std::cout << m_buffer;
m_buffer = 0;
m_unused_bits = 8;
}
}
// A node object acts as both a leaf and a parent. Only the
// leafs store any symbols. Parents store both a left and
// right node. During encoding, leafs store the frequency
// of their symbols while parents store the sum of the
// frequencies of their two children. During decoding,
// the frequency is not used.
class node
{
// Used when sorting the Huffman table.
friend bool compare(const node* x, const node* y);
public:
node( unsigned char data );
node( node& left, node& right );
~node( void );
int encode( bitwriter& writer );
int decode( bitreader& reader );
char get_symbol( void ) const { return( m_symbol ); }
unsigned int get_freq( void ) const { return( m_freq ); }
const std::string& get_code( void ) const { return( m_code
); }
node* get_left( void ) const { return( m_left ); }
node* get_right( void ) const { return( m_right ); }
void inc_freq( void ){ ++m_freq; }
private:
unsigned char m_symbol;
unsigned int m_freq;
std::string m_code;
node* m_left;
node* m_right;
};
// Constructs a leaf.
node::node( unsigned char symbol )
: m_symbol( symbol )
, m_freq( 1 )
, m_code( "" )
, m_left( NULL )
, m_right( NULL ) {}
// Constructs a parent.
node::node( node& left, node& right )
: m_symbol( 0 )
, m_freq( left.m_freq + right.m_freq )
, m_code( "" )
, m_left( &left )
, m_right( &right ){}
// Recursive destructor.
node::~node( void )
{
delete( m_left );
delete( m_right );
}
// Recursively decodes (rebuilds) the tree from the given bit
reader.
int node::decode( bitreader& reader )
{
// Read the next bit.
int result = reader.readbit();
switch( result )
{
case( -1 ): // Fatal error!
return( result );
case( 0 ): // This node is a parent.
// Instantiate the left node.
if( m_left = new node(0) )
{
// Decode the left node (recursively).
result = m_left->decode( reader );
// Ensure success (zero).
if( !result )
// Instantiate the right node.
if( m_right = new node(0) )
// Decode the right node (recursively).
result = m_right->decode( reader );
}
// Return the result.
return( result );
case( 1 ): // This node is a leaf.
// Read the next 8 bits and store the resultant symbol.
for( int i=0; i<8; ++i )
{
m_symbol<<=1;
result = reader.readbit();
if( result == -1 )
return( result );
m_symbol |= result;
}
}
// Success!
return( 0 );
}
// Encodes the Huffman tree. If this node is a parent,
// write a 0 bit, otherwise write a 1 bit followed by
// the 8-bit unencoded symbol.
int node::encode(bitwriter& writer )
{
int error=-1;
// Is this node a leaf?
if( !m_left && !m_right )
{
// Write a 1 bit.
writer.writebit(1);
// Store the symbol locally.
unsigned char symbol=m_symbol;
// Write each bit from the MSB to the LSB.
for(int i=0; i<8; ++i, symbol<<=1)
writer.writebit(symbol&0x80?1:0);
// Success!
error=0;
}
// Or is this node a parent?
else if( m_left && m_right )
{
// Write a zero bit.
writer.writebit(0);
// Copy this node's code to the left node
// and append a 0.
m_left->m_code = m_code;
m_left->m_code += "0";
// Recurse for the left node.
error = m_left->encode( writer );
if( !error )
{
// Copy this node's code to the right node
// and append a 1.
m_right->m_code = m_code;
m_right->m_code += "1";
// Recurse for the right node.
error = m_right->encode( writer );
}
}
return( error );
}
// Compare method used to sort the Huffman table
// in descending order of frequency.
bool compare( const node* x, const node* y )
{
return( x->m_freq>y->m_freq );
}
// Decodes the given file name and redirects the
// decoded data to standard output.
int decode( const char* filename )
{
// Open the input file for binary reading.
FILE* input = fopen( filename,"rb" );
if( !input )
{
std::cerr<<"File access error:
"<<filename<<std::endl;
return( -1 );
}
// Instantiate a bit reader object.
bitreader reader( input );
int error = -1, bit=0;
// Determine the length of the decoded data.
unsigned int len=0;
for( int i=0; i<sizeof(unsigned int)*8; ++i )
{
len <<= 1;
bit = reader.readbit();
if( bit == -1 )
return( error );
len |= bit;
}
// Instantiate the Huffman tree root.
node root( 0 );
// Decode (rebuild) the tree.
root.decode( reader );
// Begin reading the encoded data until all symbols
// have been read.
while( len-- )
{
// Point to the root of the Huffman tree.
node* pnode = &root;
// Repeat while the current node is a parent node.
while( pnode->get_left() && pnode->get_right()
)
{
// Read the next bit.
bit = reader.readbit();
if( bit == error )
// Fatal error!
return( error );
// Navigate right if the bit is set, otherwise navigate
left.
pnode = bit?pnode->get_right():pnode->get_left();
}
// We've reached a leaf, output the data.
std::cout << pnode->get_symbol();
}
// Success!
fclose( input );
return(0);
}
// Encodes the given file name and redirects the encoded
// data to standard output.
int encode( const char* filename )
{
// Open the file for binary reading.
FILE* input=fopen( filename,"rb" );
if( !input )
{
std::cerr<<"File access error:
"<<filename<<std::endl;
return(-1);
}
// Instantiate the Huffman frequency table.
std::list<node*> table;
// Some variables.
unsigned int len=0;
char symbol=0;
// Read one symbol at a time.
while( fread( &symbol, 1, 1, input ) == 1 )
{
// Increment the file length.
++len;
// Locate the symbol in the Huffman frequency table.
std::list<node*>::iterator it=table.begin();
while( it!=table.end() )
{
// Reference the node.
node& rnode = *(*it);
// Check for equality.
if( rnode.get_symbol() == symbol )
{
// Symbol found, increment its frequency.
rnode.inc_freq();
break;
}
++it;
}
// Was the symbol found?
if( it==table.end() )
{
// No -- create a new node for the symbol
// and add it to the table.
node* pnode = new node( symbol );
table.push_back( pnode );
}
}
if( ferror( input ))
{
fclose( input );
std::cerr << "File read error: " << filename
<< std::endl;
return( -1 );
}
// Sort the Huffman frequency table in descending order of
frequency.
table.sort( compare );
// Instantiate the Huffman tree (copy the Huffman frequency
table).
std::list<node*> tree( table );
// Repeat while the Huffman tree has more than one node.
while( tree.size() > 1 )
{
// Extract the last two nodes.
std::list<node*>::reverse_iterator rit =
tree.rbegin();
node* left = *rit++;
node* right = *rit;
tree.pop_back();
tree.pop_back();
// Create a parent node for the two extracted nodes.
node* parent = new node( *left, *right );
// Add the parent node and resort the tree.
tree.push_back( parent );
tree.sort( compare );
}
// Instantiate a bit writer.
bitwriter writer;
// Write the file length (32 bits).
for(int i=0; i<sizeof( unsigned int )*8; ++i,
len<<=1)
writer.writebit( len&0x80000000?1:0 );
// Reference the root node of the Huffman tree.
node& root = *( *tree.begin() );
// Recursively encode the Huffman tree (also
// encodes the Huffman frequency table, since
// both contain pointers to the same nodes).
root.encode(writer);
// Read symbols from the start of the input file.
fseek( input, 0, SEEK_SET );
while( fread( &symbol, 1, 1, input ) == 1 )
{
// Locate the symbol in the Huffman frequency table.
for(std::list<node*>::iterator it=table.begin();
it!=table.end(); ++it )
{
// Reference the current node.
node& rnode = *(*it);
if( rnode.get_symbol() == symbol )
{
// Write the symbol's code.
for(unsigned int i=0; i<rnode.get_code().size(); ++i)
writer.writebit( rnode.get_code().at(i)=='1'?1:0 );
break;
}
}
}
if( ferror( input ))
{
fclose( input );
std::cerr << "File read error: " << filename
<< std::endl;
return( -1 );
}
// Finished with the input file.
fclose(input);
// Success!
return(0);
}
// The main function handles the command line.
int main(int argc, char* argv[])
{
if(argc==3)
{
std::string option( argv[2] );
if( !option.compare("-d") !option.compare("-D") )
if( decode(argv[1] ))
std::cerr<<"The input file cannot be
decoded.\n"<<std::endl;
else
return(0);
else if( !option.compare("-e") !option.compare("-E") )
if( encode(argv[1] ))
std::cerr<<"The input file could not be
encoded.\n"<<std::endl;
else
return(0);
}
std::cerr<<"Usage:\n\n"
<<argv[0]<<" SOURCE -switch > DEST\n"
<<"\nWhere:\n\n"
<<"\tSOURCE\t- source file name\n"
<<"\tDEST\t- destination file name\n\n"
<<"Switch:\n\n"
<<"\td\t- decode\n"
<<"\te\t- encode\n"<<std::endl;
return(-1);
}