You used the split feature to split your PDF file (10 pages, 1MB) and the result is 10 PDF files of almost 1MB each… Something must be wrong: it should have been 10 PDF files of roughly 100kB each, right? Wrong.
How PDF works
Oversimplifying, a PDF page is just a set of draw operations, something like “draw this line here”, “write this text there using this font” or “draw this image here”. Each page has resources attached to it in something called the Resources Dictionary, a bucket of resources (fonts, images and so on) used by the page, so when the operation “draw this image here” is met, the image is found in the page’s Resources Dictionary.
Now let’s imagine there is the same company logo on each page. Does this mean each page has its own duplicated logo image in its Resources Dictionary? Of course not: each Resources Dictionary can point to the same image resource, which is shared among these pages. We could even go a bit further and have every page pointing to the same Resources Dictionary which, in this scenario, would act as a document-wide bucket of resources containing all the images and fonts used in the document.
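To make the sharing concrete, here is a toy model in Python. It is only a sketch: it does not parse a real PDF, and the names (`Im1`, `F1`) and sizes are made up for illustration. It shows the two layouts described above, per-page dictionaries pointing at the same objects and a single document-wide dictionary, and that in both cases the logo and font are stored only once.

```python
# Toy model, not real PDF parsing: resource objects are stored once and
# pages only hold references to them. Names and sizes are hypothetical.
logo = {"type": "image", "size_bytes": 600_000}
font = {"type": "font", "size_bytes": 300_000}

# Layout A: every page has its own Resources Dictionary,
# but all of them point at the same two objects.
pages_a = [{"resources": {"Im1": logo, "F1": font}} for _ in range(10)]

# Layout B: every page points at one document-wide Resources Dictionary.
shared_resources = {"Im1": logo, "F1": font}
pages_b = [{"resources": shared_resources} for _ in range(10)]


def stored_resource_bytes(pages):
    """Count each distinct resource object once, however many pages reference it."""
    unique = {id(obj): obj for page in pages for obj in page["resources"].values()}
    return sum(obj["size_bytes"] for obj in unique.values())


bytes_a = stored_resource_bytes(pages_a)
bytes_b = stored_resource_bytes(pages_b)
print(bytes_a, bytes_b)
```

Both layouts store the same number of resource bytes, because ten references to the logo still mean one logo in the file.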
What happens when we split
In our example we are splitting the 1MB PDF into 10 new files. Each of the 10 new files must contain all the resources needed to draw its page, meaning that each file must have its own copy of the company logo, the fonts and all the other resources. This explains why the files, even though they are just one page each, are almost the same size as the original 10-page document: they still need all the fonts and images that made up most of the original 1MB.
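The arithmetic can be sketched in a few lines of Python. The byte counts are assumptions chosen to match the 10 pages, 1MB example (900kB of shared logo and fonts, 10kB of draw operations per page); real files will differ.

```python
# Assumed sizes in bytes, picked to match the 10-page, 1MB example.
SHARED_RESOURCES = 900_000  # logo + fonts, stored once in the original
PAGE_CONTENT = 10_000       # draw operations for a single page
PAGE_COUNT = 10

# The original stores the shared resources once for all pages.
original = SHARED_RESOURCES + PAGE_COUNT * PAGE_CONTENT

# What you might expect from a split: the original divided by the page count.
naive_expectation = original // PAGE_COUNT

# What you actually get: one page of content plus a full copy of the resources.
split_file = SHARED_RESOURCES + PAGE_CONTENT

# All the split files together are much bigger than the original.
total_after_split = PAGE_COUNT * split_file

print(original, naive_expectation, split_file, total_after_split)
```

Under these assumptions each single-page file is 910kB rather than 100kB, and the ten files together take about 9.1MB.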
Is it always the case?
No, sometimes PDFsam is just not smart enough. Imagine page 5 (and only page 5) has a nice big image, and all the pages point to the same shared Resources Dictionary containing this image. When we split, we duplicate the resources and attach them to each of the 10 PDF files created by the task, but only the file containing page 5 needs the full Resources Dictionary; all the others don’t need the big image. PDFsam has an algorithm that tries (and most of the time succeeds) to identify this kind of situation and optimizes the resources attached to the resulting files by removing unused resources, in this case the big image for all but one of the generated files. This is why most of the time you get files of the size you would expect. This process can be slow, because we need to parse each page to figure out which resources are actually used, so we don’t always apply it; instead we try to identify files where resources are potentially unused. This is where PDFsam sometimes misses a valid candidate and skips the optimization. It happens very rarely, but it can happen.
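The idea of “parse the page to figure out which resources are actually used” can be sketched as follows. This is not PDFsam’s actual algorithm, only a simplified illustration: it relies on two standard content stream operators (`/Name Do` paints an image or other XObject, `/Name size Tf` selects a font), uses a regular expression where a real tool would need a proper tokenizer, and ignores nested resources. The resource names and page contents are made up.

```python
import re

# "/Name Do" paints an XObject (such as an image); "/Name size Tf" selects a font.
XOBJECT_USE = re.compile(r"/(\w+)\s+Do\b")
FONT_USE = re.compile(r"/(\w+)\s+[\d.]+\s+Tf\b")


def used_resource_names(content_stream):
    """Return the resource names referenced by a page's draw operations."""
    return set(XOBJECT_USE.findall(content_stream)) | set(FONT_USE.findall(content_stream))


def prune_resources(content_stream, resources):
    """Keep only the entries of the Resources Dictionary that the page uses."""
    used = used_resource_names(content_stream)
    return {name: obj for name, obj in resources.items() if name in used}


# Hypothetical document-wide Resources Dictionary shared by every page.
shared = {"F1": "font", "Im1": "company logo", "BigIm": "big image (page 5 only)"}

# Hypothetical, heavily simplified page contents.
page_1 = "BT /F1 12 Tf (Hello) Tj ET q /Im1 Do Q"
page_5 = "BT /F1 12 Tf (Figure) Tj ET q /Im1 Do Q q /BigIm Do Q"

page_1_resources = prune_resources(page_1, shared)
page_5_resources = prune_resources(page_5, shared)
print(sorted(page_1_resources), sorted(page_5_resources))
```

In this sketch only the file for page 5 keeps `BigIm`; the file for page 1 keeps just the font and the logo, which is the effect the optimization is after.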
Most of the splitters out there don’t even perform this kind of optimization so rest assured, PDFsam remains one of the best tools for the job 🙂
Hi, would be great to see an example here.
What kind of example?
I found a very strange one, and it happened with only one of several similar files. A collection of medical records, each PDF about 1,500 to 2,000 pages, maybe 300 MB in size, to be split into individual PDFs. This particular one created individual PDFs that were enormous to begin with and kept growing in size as the process went on. It took over an hour, and the last one-page PDF file was 33 MB. The resulting folder was in the hundreds of GB.
Even when I tried to extract using the PDF editor’s function (200 pages at a time, going backward), the same thing happened, and the last PDFs in the collection were much larger even though they were extracted first. So I concluded it had to be some anomaly in the original PDF itself, not in PDFsam’s function.
This was one of three original PDFs, created at the same time, totaling about 4,800 pages. The other two were properly split using PDFsam, with each individual PDF a couple of hundred KB in size.
I split a hundred+ page document into subsections. The total was about 10MB, and each split file is also ~10MB regardless of how many pages it contains. When splitting after every page, the single-page files are within 2% of the original file size, and the same is true when I split after specific page numbers, with a varying number of pages per output file.
Hi Rodney,
not sure how to help, it probably boils down to the internal structure of the file. If you want to send us the file we could take a look but, as stated in the post, there are some situations where we miss a valid candidate for optimization, and this might be one of them.
In my opinion that’s pretty useless. If you say “split at 50 meg chunks” that’s what it should do. So far it has always worked for me, but now I split a 163MB PDF into 4 PDFs and EACH ONE is 164MB... so each one is BIGGER and nowhere NEAR 50MB. That makes no sense at ALL!!!