I’ve been trying to use Imagick to turn PDF files in my PHP application into PNGs so that I can get Tesseract OCR’s PHP library to scan only handwritten text in the documents. The handwritten text areas are surrounded by a black border in the documents, and there’s a chance that they could be slightly tilted since some of the PDFs are scanned.
Is there any way that I can use Imagick to create images of only the bordered boxes from the PDFs? I’ve tried looking at the Imagick docs and tried using despeckleImage() and then trimImage(), but I was only able to trim a little bit due to some fuzz pixels in the image.
The box with the handwritten text is the one I want to get an image of so that I can scan that text in it. This imgur link has both scans I’ve been working with. The first one has no fuzz at all, but the second one is a scan with fuzz. I’m not sure how to even approach the problem since there are so many functions in the library, so if you guys have any ideas it would be much appreciated.
Answer
You can do that on the ImageMagick command line using connected components. I do not know whether PHP Imagick can do that, but you can check. Otherwise, just use PHP exec() to run my command.
The following is Unix syntax.
bbox=`convert Ef82Whf_d.webp -threshold 75% -type bilevel \
-define connected-components:exclude-header=true \
-define connected-components:area-threshold=500 \
-define connected-components:keep-top=1 \
-define connected-components:verbose=true \
-define connected-components:mean-color=true \
-connected-components 8 null: | grep "gray(255)" | tail -n +2 | awk '{print $2}'`
convert Ef82Whf_d.webp -crop $bbox +repage -shave 5x5 textbox_crop.png
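If the first command finds no matching white region (or more than one), $bbox ends up empty or spans several lines, and the crop step will fail or misbehave. Here is a minimal sketch of a guard around the crop step. The function names is_geometry and crop_box are my own, not part of ImageMagick, and the check assumes the bounding box arrives in the usual WxH+X+Y geometry form.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if $1 is a single WxH+X+Y geometry
# string such as 640x480+12+34.
is_geometry() {
  # Reject values that contain a newline (more than one region matched).
  case $1 in
    *'
'*) return 1 ;;
  esac
  printf '%s\n' "$1" | grep -Eq '^[0-9]+x[0-9]+\+[0-9]+\+[0-9]+$'
}

# Hypothetical wrapper around the crop command from the answer above.
# Usage: crop_box input-image bbox output-image
crop_box() {
  src=$1
  bbox=$2
  out=$3
  if ! is_geometry "$bbox"; then
    echo "no usable bounding box: '$bbox'" >&2
    return 1
  fi
  # Same crop as above: cut out the box, reset the page, shave the border.
  convert "$src" -crop "$bbox" +repage -shave 5x5 "$out"
}
```

After computing bbox as above, you would call it as crop_box Ef82Whf_d.webp "$bbox" textbox_crop.png and check the exit status, which is also what PHP exec() reports back through its result code argument.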