While reviewing #420 I realized that setting up OCR in the local dev environment requires two things:
- Tesseract data, in the form of language files, placed in the directory documentcloud/documents/processing/ocr/tesseract/tessdata/
- Running the Django management command upload_languages, which takes the language files stored in tessdata and uploads them to the ocr_languages bucket in Minio so they can be accessed at runtime.

This isn't documented anywhere. We need to add to the local dev environment instructions a sample command for downloading a language file (probably curl, since it is cross-platform), followed by running the upload_languages command.
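
As a starting point for those instructions, here is a minimal sketch of what the two steps might look like. It rests on several assumptions that should be checked before it goes into the docs: the download URL points at the upstream tesseract-ocr/tessdata repository on GitHub (main branch), eng is only an example language code, the helper names tessdata_url and download_language are hypothetical, and the exact way to invoke manage.py in the local dev environment (for example, through the project's Docker setup) may differ.

```shell
# Directory named in this issue; run these commands from the repository root.
TESSDATA_DIR="documentcloud/documents/processing/ocr/tesseract/tessdata"

# Hypothetical helper: build the download URL for a language code.
# Assumes the upstream tesseract-ocr/tessdata GitHub repository (main branch).
tessdata_url() {
  printf 'https://github.com/tesseract-ocr/tessdata/raw/main/%s.traineddata\n' "$1"
}

# Hypothetical helper: download one language file into TESSDATA_DIR.
# -f fails on HTTP errors, -L follows redirects, -o sets the output path.
download_language() {
  mkdir -p "$TESSDATA_DIR" &&
    curl -fL -o "$TESSDATA_DIR/$1.traineddata" "$(tessdata_url "$1")"
}

# Example usage (not run automatically):
#   download_language eng
#   python manage.py upload_languages   # invocation is an assumption
```

Keeping the URL construction separate from the download makes it easy to swap in a different source, such as the smaller tessdata_fast files, if the team prefers that.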