I’m currently trying to extract text from a large number of German documents printed in Fraktur type. There is a commercial solution for Fraktur OCR from ABBYY, and there is the Google way. My primary goal is not perfect recognition but text that is good enough for full-text search. Luckily, Google has open-sourced its OCR engine, which is called Tesseract. Tesseract reads an image and converts it to UTF-8 text; it's as simple as that.
My source is a whole PDF document that is basically a series of page images, and I need to extract them. Again, there are commercial packages that do this, but there is also ImageMagick, which can read PDFs with the help of Ghostscript and write the pages out as a series of images. Exactly what I need.
So let's get ImageMagick from http://www.imagemagick.org/script/binary-releases.php#windows and the latest version of Ghostscript from http://sourceforge.net/projects/ghostscript/files/
Tesseract is available on Google Code: http://code.google.com/p/tesseract-ocr/downloads/list
Grab the newest setup exe and deu-frak.traineddata.gz, which has to be unzipped and moved to the tessdata folder.
If everything is installed, the PDF can be converted with:
[ImageMagick path]\convert.exe document.pdf document_%04d.jpg
This is what I used. It leaves the size of the images unchanged and appends an automatically incremented 4-digit number with leading zeros to each file name (the %04d is basically C printf syntax). convert has a bunch of optional parameters, but the defaults worked quite well for the purpose of OCR.
Note that this process may need huge chunks of disk space in the user temp folder! In my case it produced a 30+ MB temporary file for each extracted image, and it extracts all of them before starting to write the final images. I had to move the temp folder to my big user partition; Windows lets you do this in the System settings: http://answers.microsoft.com/en-us/windows/forum/windows_7-files/change-location-of-temp-files-folder-to-another/19f13330-dde1-404c-aa27-a76c0b450818
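To make the numbering pattern concrete, here is a minimal Python sketch of the conversion step. The function names (page_name, build_convert_command, convert_pdf) are made up for illustration, and the path to convert.exe is whatever it is on your machine. It only wraps the same convert call shown above and is not part of ImageMagick.

```python
import subprocess


def page_name(prefix, index):
    """Expand the printf-style pattern the same way '%04d' is expanded."""
    return "%s_%04d.jpg" % (prefix, index)


def build_convert_command(convert_exe, pdf, prefix):
    """Build the argument list for the convert call used in this post.

    The literal '%04d' is passed through unchanged; ImageMagick fills in
    the page number when it writes each image.
    """
    return [str(convert_exe), str(pdf), prefix + "_%04d.jpg"]


def convert_pdf(convert_exe, pdf, prefix):
    """Run the conversion and raise an error if convert fails."""
    subprocess.run(build_convert_command(convert_exe, pdf, prefix), check=True)
```

For example, page_name("document", 7) gives "document_0007.jpg". The leading zeros mean the file names sort in page order, which helps when the per-page text files are joined later.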
The second step is to run the Tesseract engine:
for %i in (*.jpg) do [tesseract path]\tesseract.exe %i %i -l deu-frak
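The same loop can be sketched in Python if you prefer a script over a one-liner. The names build_tesseract_command and ocr_folder are hypothetical helpers, not part of Tesseract. One deliberate difference from the command above: the sketch strips the .jpg extension from the output base name. Tesseract appends .txt to the output base by itself, so the command above yields files like document_0001.jpg.txt, while this sketch yields document_0001.txt.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(tesseract_exe, image, lang="deu-frak"):
    """Build the tesseract call for one page image."""
    image = Path(image)
    # Output base without extension; tesseract adds ".txt" on its own.
    output_base = image.with_suffix("")
    return [str(tesseract_exe), str(image), str(output_base), "-l", lang]


def ocr_folder(tesseract_exe, folder, lang="deu-frak"):
    """Run tesseract on every .jpg in a folder, in file name order."""
    for image in sorted(Path(folder).glob("*.jpg")):
        subprocess.run(build_tesseract_command(tesseract_exe, image, lang), check=True)
```

A call like ocr_folder(r"[tesseract path]\tesseract.exe", ".") would then process the current folder, with the placeholder replaced by your actual install path.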
I wrote a small C program to join the resulting text files.
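My C program is not reproduced here. As a rough equivalent, the joining step can be sketched in a few lines of Python. This is an illustration, not the original code: the function name join_text_files is made up, and it assumes the per-page files are UTF-8 (which is what Tesseract writes) and that their zero-padded names sort in page order.

```python
from pathlib import Path


def join_text_files(folder, output_file, pattern="*.txt"):
    """Concatenate all matching text files into one file, in name order.

    Returns the number of files that were joined.
    """
    output_path = Path(output_file)
    # Collect the inputs first and skip the output file itself, in case
    # it lives in the same folder and matches the pattern.
    parts = sorted(
        p for p in Path(folder).glob(pattern)
        if p.resolve() != output_path.resolve()
    )
    with output_path.open("w", encoding="utf-8") as out:
        for part in parts:
            text = part.read_text(encoding="utf-8")
            out.write(text)
            # Make sure each page ends with a newline so pages do not run together.
            if not text.endswith("\n"):
                out.write("\n")
    return len(parts)
```

Calling join_text_files(".", "document_all.txt") would merge every .txt file in the current folder into document_all.txt.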
This is the OCR result for the page shown above:
Baufül)ruugs-Kosteu der Befestignng von Ulm.
Personal und Kosten der weiteren Bauführung rechten Donauufers.
Pkc«tsidium. Das Personal und die Kosten der weiteren Baufül)rung für die
Befestigung von Ulin rechten Donauufers betreffend, waren in Gemäßheit der neuerlich
erhaltenen Zufertigung und unter Beziehung auf die Erlasse vom l8. April und 9. Mai
vorigen Jahres (S§. 166., 204., 4ö5. v. J. 1855) mit Schreiben an das Festungs-
gouvernement zu Ulm vom 14. Januar folgende Bestimmungen zu treffen (Alsg. S-dir.
23. v. J. 1856.):



