Alvaro @social.graves.cl · 1 year ago

Calibre is far from ideal, so I wonder if there is a better way to convert a PDF into EPUB. Maybe a new AI tool exists for that purpose? What do you use?

Em Adespoton@lemmy.ca · 1 year ago

You’ve identified the main issue: PDF extraction. A PDF can lay out pages in an infinite number of ways.

My personal workflow is to take a PDF, run it through ClearType OCR, and save it as a web-friendly, accessibility-standard-compliant PDF. That step extracts all the text and re-lays it out so a screen reader can read it in the correct order.

After that, it’s a matter of exporting the PDF to HTML, chunking it, zipping the results with a CSS file and a manifest, and you’ve got an ePub.

And of course, there are Python libraries to do a lot of the conversion as well.
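
For anyone curious what the "zip the results with a CSS file and a manifest" step looks like, here is a minimal sketch using only the Python standard library. It assumes the HTML has already been exported and chunked into XHTML strings. The names `build_epub` and `wrap_chapter`, the `OEBPS/` folder, and the file names are all illustrative choices, not part of any particular tool, and a real book would need more metadata than this.

```python
import uuid
import zipfile
from datetime import datetime, timezone
from xml.sax.saxutils import escape

# Tells the reading system where the package (manifest) file lives.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def wrap_chapter(title, body_html):
    """Wrap an already well-formed XHTML fragment in a full chapter document."""
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>{escape(title)}</title>
<link rel="stylesheet" type="text/css" href="style.css"/></head>
<body>{body_html}</body>
</html>
"""


def build_epub(out_path, title, chapters, css="body { line-height: 1.4; }"):
    """Write a bare-bones EPUB 3 file.

    chapters: list of (file_name, xhtml_document) pairs, already chunked.
    File names are assumed to be simple and safe (e.g. "ch1.xhtml").
    """
    items = [
        '<item id="css" href="style.css" media-type="text/css"/>',
        '<item id="nav" href="nav.xhtml" '
        'media-type="application/xhtml+xml" properties="nav"/>',
    ]
    spine, links = [], []
    for i, (name, _) in enumerate(chapters):
        items.append(
            f'<item id="ch{i}" href="{name}" media-type="application/xhtml+xml"/>'
        )
        spine.append(f'<itemref idref="ch{i}"/>')
        links.append(f'<li><a href="{name}">Section {i + 1}</a></li>')

    manifest = "\n    ".join(items)
    spine_xml = "\n    ".join(spine)
    modified = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    opf = f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="uid">urn:uuid:{uuid.uuid4()}</dc:identifier>
    <dc:title>{escape(title)}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">{modified}</meta>
  </metadata>
  <manifest>
    {manifest}
  </manifest>
  <spine>
    {spine_xml}
  </spine>
</package>
"""

    nav = f"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>Contents</title></head>
<body><nav epub:type="toc"><ol>{''.join(links)}</ol></nav></body>
</html>
"""

    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry must come first and must be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", opf)
        zf.writestr("OEBPS/nav.xhtml", nav)
        zf.writestr("OEBPS/style.css", css)
        for name, xhtml in chapters:
            zf.writestr(f"OEBPS/{name}", xhtml)
```

Calling `build_epub("book.epub", "My Book", [("ch1.xhtml", wrap_chapter("One", "<p>Hello</p>"))])` should produce a file that readers can open, though running it through a validator is still a good idea.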

Wilker@lemmy.blahaj.zone · 1 year ago

Question: why is running OCR software more worthwhile than pulling the contents out with something like LibreOffice Draw?

DonnieDarkmode@lemm.ee · 1 year ago

Oh yeah, my actual workflow to create a book was horribly inefficient and time-consuming. How automated is that HTML export and chunking process? Are you still going through and manually adding in every last <p></p> and href?

I’m curious about your use case, because I was doing this with a book that was hundreds of pages long, full of photos and footnotes, which added lots of tedium.

Em Adespoton@lemmy.ca · 1 year ago

I just use a text editor and regex to add all the paras and hrefs. Done it for a few horribly mangled books I was converting. Whole process was “manually automated”.
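
As a rough illustration of the regex approach, the same kind of search-and-replace can be scripted in Python. This is only a sketch under assumptions that may not match a given book: paragraphs are separated by blank lines, hard line wraps inside a paragraph should be rejoined, and footnote markers look like `[12]`. The function name `text_to_html` and the `#fn12` anchor scheme are made up for the example.

```python
import html
import re

# Assumed footnote marker style: a number in square brackets, e.g. [12].
FOOTNOTE_REF = re.compile(r"\[(\d+)\]")


def text_to_html(raw):
    """Wrap blank-line-separated blocks in <p> tags and link footnote markers."""
    raw = raw.replace("\r\n", "\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        # Rejoin hard-wrapped lines and collapse runs of whitespace.
        text = " ".join(block.split())
        if not text:
            continue
        # Escape &, < and > before adding our own markup.
        text = html.escape(text, quote=False)
        text = FOOTNOTE_REF.sub(r'<a href="#fn\1">[\1]</a>', text)
        paragraphs.append(f"<p>{text}</p>")
    return "\n".join(paragraphs)
```

The patterns would need adjusting per book, which is probably why doing it by hand in a text editor stays attractive for badly mangled sources.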