Now this is something I didn’t know. I always thought that PDF files were just gibberish bytecode. That is not entirely wrong, but it is mostly not right… What actually happens is that PDF represents a document as structured data: a set of numbered objects (dictionaries, arrays, streams) that refer to each other.

Opening a PDF document produced by a PDF creator program is not the best way to see how information is stored in a PDF, since such tools usually compress the page content. I will provide the “Hello World” example here instead.

%PDF-1.2
%Hello World in Portable Document Format (PDF)
1 0 obj
<<
/Type /Page
/Parent 5 0 R
/Resources 3 0 R
/Contents 2 0 R
>>
endobj
2 0 obj
<<
/Length 51
>>
stream
BT
/F1 24 Tf
1 0 0 1 260 600 Tm
(Hello World)Tj
ET
endstream
endobj
3 0 obj
<<
/ProcSet[/PDF/Text]
/Font << /F1 4 0 R >>
>>
endobj
4 0 obj
<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont /Arial
>>
endobj
5 0 obj
<<
/Type /Pages
/Kids [ 1 0 R ]
/Count 1
/MediaBox
[ 0 0 612 792 ]
>>
endobj
6 0 obj
<<
/Type /Catalog
/Pages 5 0 R
>>
endobj
trailer
<<
/Root 6 0 R
>>

Be my guest: copy/paste the listing above into a text file and name it “whatever.pdf“. Nice, huh? Now this is awfully big for a simple “Hello World”, but consider that it would not get significantly bigger if we did more stuff. It’s the structure that takes all the space. I will not describe exactly what goes on in there because a) I am not quite familiar with it and b) there are some very nice specifications and APIs on the net. First, you can find RFC 3778, which registers the application/pdf media type, and second, and even better, Adobe’s “Portable Document Format Reference Manual“.
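
One thing the hand-typed listing skips is the cross-reference table (xref) that a complete PDF carries just before its trailer, giving the byte offset of every object. Lenient viewers usually rebuild it on their own, but stricter ones may complain. Below is a minimal sketch in Python of how the same document could be assembled with the offsets and the stream /Length computed instead of counted by hand. The function name build_hello_pdf is made up for illustration, and I swapped /Arial for /Helvetica because that is one of the 14 standard fonts; text containing parentheses or backslashes would need escaping, which this sketch does not do.

```python
def build_hello_pdf(text="Hello World"):
    """Assemble a one-page PDF as bytes, computing /Length and xref offsets.

    Hypothetical helper for illustration; `text` must be plain ASCII
    without parentheses or backslashes (no escaping is done here).
    """
    # The page content stream: same operators as in the listing above.
    content = (
        "BT\n/F1 24 Tf\n1 0 0 1 260 600 Tm\n(%s)Tj\nET\n" % text
    ).encode("ascii")

    # Object bodies in object-number order 1..6, mirroring the listing.
    bodies = [
        b"<< /Type /Page /Parent 5 0 R /Resources 3 0 R /Contents 2 0 R >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"endstream",
        b"<< /ProcSet [/PDF /Text] /Font << /F1 4 0 R >> >>",
        # Assumption: Helvetica swapped in for Arial (a standard-14 font).
        b"<< /Type /Font /Subtype /Type1 /Name /F1 /BaseFont /Helvetica >>",
        b"<< /Type /Pages /Kids [1 0 R] /Count 1 /MediaBox [0 0 612 792] >>",
        b"<< /Type /Catalog /Pages 5 0 R >>",
    ]

    out = bytearray(b"%PDF-1.2\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus the
    # mandatory free entry for object 0.
    xref_at = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    # Trailer points at the catalog; startxref points at the table.
    out += b"trailer\n<< /Size %d /Root 6 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out)


# Usage (writes the file in binary mode so the offsets stay exact):
# with open("whatever.pdf", "wb") as handle:
#     handle.write(build_hello_pdf())
```

The file has to be written in binary mode, because any newline translation would shift the bytes and make the recorded offsets wrong. For the default text the computed stream length comes out as 51, the same value the listing hard-codes.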