Text files waste space. A lot of it. Every character takes the same number of bits, even though some characters appear far more often than others. The letter ‘e’ and the letter ‘z’ both consume 8 bits, despite ‘e’ appearing 57 times more frequently in English text.
David Huffman figured this out in 1952. Give common characters short codes and rare characters long codes. The average code length drops, and the file shrinks.
I wanted to understand this at the bit level. Not through a library, but by writing every bit operation myself. Compressor was that exercise.
What It Does
Compressor is a Linux command-line tool written in C. It compresses and decompresses files using Huffman coding. The interface is straightforward:
./compressor zip input.txt # Creates input.txt.rol
./compressor unzip input.txt.rol # Restores input.txt
The compressed files use a custom .rol extension. The tool reads any file type, but works best on text files where character frequency patterns create meaningful compression.
How Huffman Coding Works
The algorithm builds a binary tree in which every character is a leaf: frequently used characters sit close to the root, while rare characters sit deeper in the tree. Each character gets a unique binary code based on its path from root to leaf. Left branches add a 0. Right branches add a 1.
Consider a file containing only “AAABBC”. The frequency count shows: A appears 3 times, B appears 2 times, C appears 1 time. Without compression, this takes 48 bits (6 characters times 8 bits). With Huffman coding, A might get code 0 (1 bit), B gets 10 (2 bits), and C gets 11 (2 bits). The encoded data, 000 10 10 11, takes just 9 bits.
The Compression Pipeline
Compressing a file happens in four stages:
Stage 1: Frequency Analysis
The program reads the entire file and counts how many times each byte value appears. A linked list stores each unique character and its count.
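In sketch form, the counting pass looks something like this (I use a flat 256-entry array here for clarity; the project itself keeps the counts in a linked list):
#include <stdio.h>

/* Count how often each byte value appears in the file.
   Sketch only: the real project stores counts in a linked list. */
void count_frequencies(FILE *in, unsigned long counts[256])
{
    int c;
    for (int i = 0; i < 256; i++)
        counts[i] = 0;
    while ((c = fgetc(in)) != EOF)
        counts[c]++; /* fgetc returns the byte value 0..255 */
}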
Stage 2: Tree Construction
The algorithm sorts characters by frequency. It then repeatedly takes the two nodes with the lowest frequencies, creates a parent node with their combined frequency, and reinserts that parent into the list. This continues until one node remains. That node is the root of the Huffman tree.
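One merge step might look like the sketch below. The node struct and its field names are illustrative, not the project's actual definitions:
#include <stdlib.h>

struct node {
    unsigned long freq;
    unsigned char symbol;      /* meaningful only at leaves */
    struct node *left, *right; /* NULL at leaves */
};

/* Merge the two lowest-frequency nodes into a parent; the caller
   reinserts the parent into the frequency-sorted list. */
struct node *merge_lowest(struct node *a, struct node *b)
{
    struct node *parent = malloc(sizeof *parent);
    parent->freq = a->freq + b->freq; /* combined frequency */
    parent->symbol = 0;               /* internal node: no symbol */
    parent->left = a;
    parent->right = b;
    return parent;
}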
Stage 3: Code Generation
A recursive traversal of the tree assigns binary codes to each character. Going left appends a 0 bit. Going right appends a 1 bit. The codes get stored in a lookup table for fast encoding.
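The traversal could be sketched like this, reusing the hypothetical node struct from above (the initial call would be assign_codes(root, 0, 0, codes, lengths)):
/* Assign codes by walking the tree: 0 for left, 1 for right. */
void assign_codes(struct node *n, unsigned long code, int length,
                  unsigned long codes[256], int lengths[256])
{
    if (n->left == NULL && n->right == NULL) { /* leaf: record its code */
        codes[n->symbol] = code;
        lengths[n->symbol] = length;
        return;
    }
    assign_codes(n->left, code << 1, length + 1, codes, lengths);        /* left: append 0 */
    assign_codes(n->right, (code << 1) | 1, length + 1, codes, lengths); /* right: append 1 */
}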
Stage 4: Encoding
The program reads the original file again. For each character, it looks up the corresponding bit code and writes those bits to the output file. The compressed file also stores the code table at the beginning so the decompressor can rebuild the tree.
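As a sketch, the second pass is just a lookup-and-emit loop; write_bits is the bit writer sketched in the next section:
#include <stdio.h>
#include <stdint.h>

void write_bits(FILE *out, uint32_t code, int length); /* sketched below */

/* Second pass over the original file: emit each character's code. */
void encode_file(FILE *in, FILE *out,
                 const uint32_t codes[256], const int lengths[256])
{
    int c;
    while ((c = fgetc(in)) != EOF)
        write_bits(out, codes[c], lengths[c]);
}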
Bit Manipulation in C
This is where the project got interesting. C does not have a “write 3 bits” function. You work with bytes. Writing arbitrary bit sequences requires accumulating bits in a buffer and flushing full bytes to the file.
The encoder maintains a 32-bit accumulator and a count of how many bits it holds. When encoding a character, it shifts the accumulator left by the code length and ORs in the new bits. When the accumulator holds 8 or more bits, it writes the top byte to the file.
Shift operations do the heavy lifting:
- bits << n multiplies by 2^n, making room for new bits
- bits | code combines the accumulated bits with new ones
- bits >> n extracts the top bits for writing
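Put together, a minimal bit writer along the lines described above might look like this (the names and layout are my assumptions, not the project's exact code):
#include <stdio.h>
#include <stdint.h>

static uint32_t acc = 0; /* bit accumulator */
static int nbits = 0;    /* how many bits acc currently holds */

/* Append a code of the given length, flushing whole bytes as they
   fill up. Assumes length stays small enough that acc never
   overflows; any bits left at end-of-file must be padded and flushed. */
void write_bits(FILE *out, uint32_t code, int length)
{
    acc = (acc << length) | code; /* make room, then OR in the new bits */
    nbits += length;
    while (nbits >= 8) {
        fputc((acc >> (nbits - 8)) & 0xFF, out); /* write the top byte */
        nbits -= 8;
    }
}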
The decompressor reverses this. It reads bytes into a buffer, then pulls individual bits to navigate the Huffman tree. Each bit tells it whether to go left or right. When it reaches a leaf, it outputs that character and starts over from the root.
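A matching sketch of the read side, again with illustrative names:
#include <stdio.h>

struct node { unsigned long freq; unsigned char symbol;
              struct node *left, *right; }; /* as sketched earlier */

static unsigned char inbuf; /* current byte being consumed */
static int bits_left = 0;   /* unread bits remaining in inbuf */

/* Return the next bit from the stream (most significant first),
   or -1 at end of file. */
int next_bit(FILE *in)
{
    if (bits_left == 0) {
        int c = fgetc(in);
        if (c == EOF)
            return -1;
        inbuf = (unsigned char)c;
        bits_left = 8;
    }
    bits_left--;
    return (inbuf >> bits_left) & 1;
}

/* Walk the tree one bit at a time until a leaf is reached. */
int decode_symbol(FILE *in, struct node *root)
{
    struct node *n = root;
    while (n->left != NULL || n->right != NULL) {
        int b = next_bit(in);
        if (b < 0)
            return -1; /* ran out of input */
        n = b ? n->right : n->left;
    }
    return n->symbol; /* the caller restarts from the root for the next one */
}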
File Format
The compressed file starts with metadata:
- Number of unique characters (1 byte)
- Total character count in original file (8 bytes)
- For each unique character:
  - The character (1 byte)
  - Code length (1 byte)
  - The binary code (8 bytes)
After the header comes the compressed data stream. The decoder reads the header first, rebuilds the Huffman tree from the codes, then decodes the data using that tree.
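Writing that header could look like the following sketch. The function name and parameters are my assumptions; note that fwrite of raw integers is endian-dependent, which is exactly why the next section exists:
#include <stdio.h>
#include <stdint.h>

/* Write the header laid out above. Names are illustrative. */
void write_header(FILE *out, unsigned char unique, uint64_t total,
                  const unsigned char *chars,
                  const unsigned char *lengths, const uint64_t *codes)
{
    fputc(unique, out);               /* 1 byte: number of unique characters */
    fwrite(&total, 8, 1, out);        /* 8 bytes: total character count */
    for (int i = 0; i < unique; i++) {
        fputc(chars[i], out);         /* 1 byte: the character */
        fputc(lengths[i], out);       /* 1 byte: code length */
        fwrite(&codes[i], 8, 1, out); /* 8 bytes: the binary code */
    }
}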
Endianness Handling
Multi-byte integers are stored differently on different machines. x86 processors use little-endian order (least significant byte first). The code explicitly handles byte order when reading 32-bit values:
unsigned long maskA = bits & 0x000000FFUL; /* isolate the lowest byte */
unsigned long maskB = bits & 0x0000FF00UL; /* second byte */
unsigned long maskC = bits & 0x00FF0000UL; /* third byte */
unsigned long maskD = bits & 0xFF000000UL; /* highest byte */
maskA <<= 24; /* lowest byte moves to the top */
maskB <<= 8;  /* second byte moves up */
maskC >>= 8;  /* third byte moves down */
maskD >>= 24; /* highest byte moves to the bottom */
return maskA + maskB + maskC + maskD; /* bytes recombined in reversed order */
This reverses the byte order so the decoder interprets bits correctly regardless of the machine architecture.
What I Learned
Writing a compressor from scratch taught me how data really lives in memory. High-level languages hide the distinction between bytes and bits. In C, you control every shift and mask.
The Huffman algorithm itself is elegant. A simple frequency count and a greedy tree-building process produce provably optimal prefix codes. No character’s code is a prefix of another, so decoding is unambiguous.
The project also showed me why compression libraries exist. Edge cases multiply. What about empty files? Files with only one unique character? Binary files with all 256 byte values? Handling these correctly takes care.
I still use the mental model from this project when thinking about binary protocols, network packets, and file formats. Understanding bits makes the rest clearer.
GitHub - RolandoAndrade/compressor: a console application that compresses and decompresses files using the Huffman algorithm, built to practice bit manipulation operations.
If you are curious about low-level programming or data compression, building a Huffman encoder is a great weekend project. The algorithm is simple enough to implement in a few hundred lines, but the bit manipulation teaches lessons that stick.


