While it's also the most compressed version (no extra fluff), it's missing things like formatting. And while 95% of the text normally doesn't need formatting, there are a few spots that may benefit from it. Stripping formatting, however, isn't that hard, as long as they didn't do something really wonky (I've seen a few very weird formatting choices). It also tends to get poor word wrap and non-ASCII support in most programs.
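For what it's worth, here is a minimal sketch of the stripping step in Python, assuming the source is HTML and nothing really wonky is going on. The names `TextOnly` and `strip_formatting` are made up for illustration, and the list of block-level tags is my own guess at what should start a new line.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only text content, ignoring every tag and attribute."""

    BLOCK_TAGS = {"p", "br", "div", "h1", "h2", "h3", "li"}  # assumed list
    SKIP_TAGS = {"script", "style"}  # their contents are not story text

    def __init__(self):
        super().__init__()  # default convert_charrefs=True decodes &amp; etc.
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")  # block tags become line breaks

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_formatting(html_text):
    """Return the plain text of html_text, one non-empty line per block."""
    parser = TextOnly()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = '<p><font size="3" face="Arial">Hello <b>world</b></font></p>'
    print(strip_formatting(sample))
```

This only uses the standard library. Anything that depends on layout (tables, text placed by position) would need more than this.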
There are 2 annoying formats I've seen in PDF that are hard to convert. The first is when they use Unicode and choose a secondary font that's a duplicate of the ASCII characters but much further into the character range. Converters don't like it, so you get blanks; the characters literally aren't there (Arm Cannon Academy comes to mind, which had an opening line of text at each chapter done this way). The second is where they specifically place each character (I believe) on the page, each letter with its own formatting, alignment, etc. This is where you get something like 40k for a single page in some JPBIG format, as it's technically seen as an image when I extract images. Visually it's readable, but you have to turn it into an image and OCR it if you want the text (the KLRXO the new king PDF was like this).
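For the first case, if the text does make it out of the PDF, a rough sketch of one possible cleanup is below. It assumes the duplicate letters are Unicode compatibility characters (fullwidth forms, mathematical alphanumerics and the like), which NFKC normalization folds back to plain ASCII. I don't know that this is what those PDFs actually did; if the font maps its glyphs into the Private Use Area, normalization won't help, and the second function just shows you which characters are the odd ones. Both function names are hypothetical.

```python
import unicodedata


def fold_lookalikes(text):
    """Fold compatibility characters back to their base forms (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def report_non_ascii(text):
    """List (character, Unicode name) for everything outside ASCII."""
    return [(ch, unicodedata.name(ch, "UNNAMED")) for ch in text if ord(ch) > 127]


if __name__ == "__main__":
    fullwidth = "\uff28\uff45\uff4c\uff4c\uff4f"  # fullwidth "Hello"
    print(report_non_ascii(fullwidth))
    print(fold_lookalikes(fullwidth))
```

Running `report_non_ascii` first is the safer habit, since NFKC also rewrites things you may want to keep, such as ligatures and superscripts.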
Reminds me of being 14 again in the gargoyles-fans fan-fiction section, where I downloaded stories for backup, and they were a meg in size... but only like 20k words. Look at the formatting in the HTML and every single line had like 100-200 characters of formatting for font, size, style, and how far to move over from the margin: things that... don't matter. Add to that that every line was its own paragraph and every line was about 80 characters long. (CSS would later let all of that be defined once and applied automatically to any tag, cleaning up HTML 4.0 by leaps and bounds.)
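To give an idea of how that kind of per-line bloat can be trimmed, here is a small sketch in Python that keeps only a handful of tags and throws away every attribute. The whitelist in `KEEP` and the names `KeepFewTags` and `slim_html` are my own picks for illustration, not anything a particular site or tool used.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: the few tags worth keeping in a story.
KEEP = {"p", "br", "b", "i", "em", "strong", "h1", "h2", "h3"}


class KeepFewTags(HTMLParser):
    """Re-emit the document with only whitelisted tags and no attributes."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.out.append(f"<{tag}>")  # attributes dropped on purpose

    def handle_endtag(self, tag):
        if tag in KEEP and tag != "br":  # <br> has no closing tag
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives decoded, so re-escape &, < and > on the way out.
        self.out.append(escape(data, quote=False))


def slim_html(html_text):
    parser = KeepFewTags()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    bloated = '<p style="margin-left:12px"><font face="Arial" size="3">Once upon <i>a</i> time</font></p>'
    print(slim_html(bloated))
```

It does not treat `<script>` or `<style>` specially, so their contents would come through as text; for old fan-fiction pages that is usually not an issue, but it is a limitation.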
If I had to choose a format, it would either be HTML (well, a handful of tags in HTML, I don't want the whole thing) or RTF. I remember 2009, when the first wave of e-paper devices and e-readers was coming out. I got one (still have it, but the screen doesn't refresh right so I can't use it) and it supported a number of formats, but I had to convert to RTF and set the paper size to A4 for the text to be readable, and for the file to be viewable at all.