Luc Gommans/ blog

ZIP Made Really Easy

Written on 2016-09-29

HTTP Made Really Easy is one of my favorite old websites. I used it a lot because it just gets to the point. I don't even need it anymore because it did a great job at teaching.

Here I will try to do the same for ZIP files without compression, up to 4 gigabytes. This describes only the parts necessary to get you writing zip files in minutes (with the spec it took me 4 hours).

General format

The spec mentions lots of parts and fails to mention that many are optional. There are basically two parts:

  1. Zero or more file headers (they call it "local file headers"), followed by file data.
  2. The central directory: a list of all files in the zip, and some headers. In that order. Indeed, the header is after the file list. Yeah, I know right?

A file header contains fields like filename, file length, compression method, etc. It is followed by the file's data. Repeat for each file.

At the end comes the "central directory", which is an overview of all files in the zip. Each entry mentions where a file can be found, a bunch of useless fields, and a copy of the local file header.

The EOCD (End Of Central Directory) is the worst name ever. It means "central directory header" because it contains info about the CD (central directory), but since it's at the end, it would be weird to call it a header so they don't.

1. File headers

These consist of:

Magic bytes: 0x504b0304 (or "PK\x03\x04").

Minimum version to extract. Use 0x0a00 (2 bytes). That is the number 10 in little-endian, which the spec reads as version 1.0, and it is compatible with everything.

General purpose flag. Use 0x0000. I don't see anything in the spec to indicate that anything else is necessary.

Compression method. 2 bytes, little-endian. 0x0000 for no compression; deflate (method 8) is 0x0800; bzip2 (method 12) is 0x0c00.

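Date and time of the file: 2 bytes of time followed by 2 bytes of date, each little-endian. The spec refers to the MS-DOS date and time format without actually mentioning the format. I looked it up for you. For the time:

  Bits 0-4: seconds divided by 2
  Bits 5-10: minutes
  Bits 11-15: hours

For the date:

  Bits 0-4: day of the month (1-31)
  Bits 5-8: month (1-12)
  Bits 9-15: years since 1980

This means the format will stop working after 2107 (7 bits of year on top of 1980) and has a precision of 2 seconds. Extended file attributes can contain a more precise time, but we will not cover that here. If you don't care about date and time, leave it zeros.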
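As a rough sketch of the bit layout described above, here is how the two fields could be packed in Python. The function name dos_time_date is made up for this example, and the sample timestamp is arbitrary.

```python
import struct
from datetime import datetime

def dos_time_date(dt: datetime) -> bytes:
    """Pack a datetime as 2 bytes of MS-DOS time followed by 2 bytes of MS-DOS date."""
    if not 1980 <= dt.year <= 2107:
        raise ValueError("MS-DOS dates only cover 1980 through 2107")
    # Time: hours in bits 11-15, minutes in bits 5-10, seconds/2 in bits 0-4.
    dos_time = (dt.hour << 11) | (dt.minute << 5) | (dt.second // 2)
    # Date: years since 1980 in bits 9-15, month in bits 5-8, day in bits 0-4.
    dos_date = ((dt.year - 1980) << 9) | (dt.month << 5) | dt.day
    # "<HH" = two unsigned 16-bit integers, little-endian, no padding.
    return struct.pack("<HH", dos_time, dos_date)

# Arbitrary example timestamp (odd seconds would be rounded down).
stamp = dos_time_date(datetime(2016, 9, 29, 13, 37, 42))
```

The four bytes in stamp go straight into the header, time first and date second.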

A crc32 of the data. The spec just goes "someone copied the implementation from this excellent book about NetBIOS, you can look it up there. Oh, and we made it little-endian for your amusement."

In PHP, this is known as "crc32b": invert_endianness(hash_file('crc32b', $filename))

A very helpful comment mentions it's "the 32-bit Frame Check Sequence of ITU V.42 (used in Ethernet and popularised by PKZip)." It should compute cbf43926 for "the 'standard' crctest.txt (numbers 1 to 9 in sequence)".
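For comparison with the PHP one-liner, a minimal Python sketch: the standard library's zlib.crc32 computes this same CRC, and struct takes care of the little-endian part. The helper name crc32_field is made up for this example.

```python
import struct
import zlib

def crc32_field(data: bytes) -> bytes:
    """Return the CRC-32 of data as the 4 little-endian bytes the header expects."""
    return struct.pack("<I", zlib.crc32(data))

# The test vector from the quoted comment: the digits 1 to 9 in sequence.
check = zlib.crc32(b"123456789")
field = crc32_field(b"123456789")
```

Here check is the plain integer and field is what actually gets written to the file.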

The compressed size. In 4 bytes little-endian.

The uncompressed size. Again, 4 bytes little-endian.

The filename length. Little-endian, but 2 bytes for a twist.

The 'extra field' length. If you want to have even more headers, e.g. for a proper date and time field, this is your chance. Otherwise, set it to 0x0000 (so it's 2 bytes).

This is the end of the fixed-length header. You will need this header again, minus the magic bytes, for the central directory.

The filename. Just write the path here as normal bytes with forward slashes. No leading slash necessary.

The extra field. If you have more headers, put them here.

Finally, the (compressed or not) file data goes here.

Repeat for every file.
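Putting the fields of this section together, a local file header plus data might be built like this in Python. This is a sketch under a few assumptions: the name local_file_entry is made up, the file is stored without compression, date and time are left zero, and the filename is plain ASCII.

```python
import struct
import zlib

def local_file_entry(name: str, data: bytes) -> bytes:
    """Build one local file header followed by the (uncompressed) file data."""
    encoded_name = name.encode("ascii")  # assumption: ASCII-only paths
    header = struct.pack(
        "<4s5H3I2H",          # little-endian, no padding: 30 bytes in total
        b"PK\x03\x04",        # magic bytes
        10,                   # minimum version to extract (written as 0a 00)
        0,                    # general purpose flag
        0,                    # compression method: none
        0,                    # time, left zero
        0,                    # date, left zero
        zlib.crc32(data),     # crc32 of the data
        len(data),            # compressed size (same as uncompressed here)
        len(data),            # uncompressed size
        len(encoded_name),    # filename length
        0,                    # extra field length
    )
    return header + encoded_name + data

entry = local_file_entry("hello.txt", b"Hello, ZIP!\n")
```

Concatenating one such entry per file gives the first half of the archive.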

2. Central directory (file listing)

Magic: 0x504b0102 (or "PK\x01\x02").

Version made by. Which version this zip file was made by. Who cares? 0x1e03 seems to work (taken from another zip file).

The local file header. Copy all the fields, byte for byte, from the local file header. This means from "minimum version" until and including "extra field length". The spec doesn't mention any of this, it just makes you re-implement a few fields until you go "hey, this sounds familiar" and realize what's going on.

Comment length. 2 bytes, little-endian.

Disk number start. 0x0000. Nobody uses this.

Internal file attributes. 0x0000. The spec is phrased very confusingly about this, but it seems to be unnecessary.

External file attributes. 4 bytes. These are black magic. It's mentioned to be "platform specific" and can be filled in as desired, which is of no help whatsoever. 0x0000ed81 sets mode rwxr-xr-x on Linux (owner rwx, group read and execute, world read and execute).
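The 0x0000ed81 value looks less magical once you see where it comes from. As far as I can tell, Unix tools put the file mode in the upper 16 bits of this field, written little-endian like everything else. A small Python sketch of that reading (the helper name unix_external_attrs is made up for this example):

```python
import stat
import struct

def unix_external_attrs(mode: int) -> bytes:
    """Pack a Unix permission mode for a regular file into the 4-byte field."""
    # stat.S_IFREG marks a regular file; the combined mode sits in the high 16 bits.
    return struct.pack("<I", (stat.S_IFREG | mode) << 16)

attrs = unix_external_attrs(0o755)   # rwxr-xr-x
plain = unix_external_attrs(0o644)   # rw-r--r--
```

So attrs reproduces the bytes quoted above, and plain would be the variant for a non-executable file.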

Byte offset (in the zip) of this file's local file header. E.g. if you are referring to the first file in the zip archive, then it's simply zero (assuming you wrote the file and the preceding header to the beginning). This is 4 bytes, little-endian.

The filename.

Extra field. If you use this.

File comment. If you use this.

Repeat for every file, in no particular order.
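A central directory entry for the same kind of file could be sketched like this in Python. The same assumptions apply as before (no compression, zeroed date and time, ASCII filename, no extra field or comment), and the name central_directory_entry is made up for this example.

```python
import struct
import zlib

def central_directory_entry(name: str, data: bytes, local_header_offset: int) -> bytes:
    """Build the central directory entry for one uncompressed file."""
    encoded_name = name.encode("ascii")  # assumption: ASCII-only paths
    fixed = struct.pack(
        "<4s6H3I5H2I",         # little-endian, no padding: 46 bytes in total
        b"PK\x01\x02",         # magic
        0x031E,                # version made by (written as 1e 03)
        # Next ten fields: the local file header, repeated field for field.
        10, 0, 0, 0, 0,        # version needed, flag, method, time, date
        zlib.crc32(data),      # crc32
        len(data), len(data),  # compressed and uncompressed size
        len(encoded_name), 0,  # filename length, extra field length
        0,                     # comment length
        0,                     # disk number start
        0,                     # internal file attributes
        0o100755 << 16,        # external file attributes (written as 00 00 ed 81)
        local_header_offset,   # where this file's local header starts
    )
    return fixed + encoded_name

first = central_directory_entry("hello.txt", b"Hello, ZIP!\n", 0)
```

The only value that differs between otherwise identical files is the offset, so that is the one thing to keep track of while writing the local headers.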

The EOCD (central directory header)

The End Of Central Directory record consists of:

More magic: 0x504b0506 (or "PK\x05\x06")

Something about disks. 0x00000000. (Two 2-byte fields: the number of this disk, and the disk on which the CD starts.)

Total entries in CD. (CD = Central Directory.) Little-endian, 2 bytes. Technically number of entries in the CD of the current disk, but we aren't making multi disk zip files.

Total entries in CD. Little-endian, 2 bytes. Yes, just repeat the same value.

CD size. Little-endian, 4 bytes. This is just the size of the directory (that you wrote earlier, until this EOCD) in bytes.

CD offset. Where it starts, in bytes. Little-endian, 4 bytes.

Comment length. 2 bytes, little-endian.

Comment. If any.
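To tie the three parts together, here is a compact Python sketch that writes a complete archive in memory and then reads it back with Python's own zipfile module as a sanity check. It is not the PHP reference implementation, just an illustration under the same assumptions as this article: no compression, zeroed date and time, ASCII filenames, no comments. The name build_zip and the example files are made up.

```python
import io
import struct
import zipfile
import zlib

def build_zip(files: dict[str, bytes]) -> bytes:
    """Return a minimal uncompressed zip archive containing the given files."""
    body = b""       # local file headers and file data
    directory = b""  # central directory entries
    for name, data in files.items():
        encoded_name = name.encode("ascii")
        # The fields shared by the local header and the central directory entry.
        shared = struct.pack("<5H3I2H", 10, 0, 0, 0, 0,
                             zlib.crc32(data), len(data), len(data),
                             len(encoded_name), 0)
        offset = len(body)  # where this file's local header starts
        body += b"PK\x03\x04" + shared + encoded_name + data
        directory += (b"PK\x01\x02" + struct.pack("<H", 0x031E) + shared
                      # comment length, disk start, internal attrs, external attrs, offset
                      + struct.pack("<3H2I", 0, 0, 0, 0o100644 << 16, offset)
                      + encoded_name)
    eocd = struct.pack(
        "<4s4H2IH",
        b"PK\x05\x06",   # magic
        0, 0,            # something about disks
        len(files),      # entries in the CD on this disk
        len(files),      # total entries in the CD
        len(directory),  # CD size in bytes
        len(body),       # CD offset: it starts right after the file data
        0,               # comment length
    )
    return body + directory + eocd

archive = build_zip({"hello.txt": b"Hello, ZIP!\n", "dir/empty.txt": b""})

# Read it back with an independent implementation.
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    names = zf.namelist()
    content = zf.read("hello.txt")
    first_bad_file = zf.testzip()  # None means every CRC checked out
```

Writing archive to disk and opening it with your usual unzip tool is a good second check, since readers differ in how strict they are.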

Reference implementation.

I wrote a simple implementation in PHP: github.com/lgommans/PhpZipStream