183 Commits

Author SHA1 Message Date
c7ba52dd68 Fixing background whiting out.
Whites out the background pretty well. Changed it
to an adaptive threshold first and then use contours to get a mask.
Also using morphology to clean up said mask.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
abe1b2358d First testing steps towards dewarping.
Too hard. High level math. For later.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
95a922ce84 Need to fix Python auto-formatting
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:13:52 -04:00
4a8917bb84 Tiny bit of cleanup after text clarification
Title^^^

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-29 23:50:38 -04:00
fa57b17169 Updated text clarifier
Changed the technique it uses. Seems to work a little better.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-29 23:43:51 -04:00
e77b2a31be Updating text clarifier again
Just adding a little bit of complexity to try and remove some of the
random clumps and spots that appear.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-29 00:56:58 -04:00
23aaae51a2 Fixing whitedbackground with inpaint.
The mask used for inpainting wasn't correct (it seems).
Updated it to use the correct mask for inpainting.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-28 18:52:27 -04:00
38eded961e Updated text clarifier
Using just OTSU thresholding with some morphology as it's similar
quality but a lot faster.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-28 18:44:49 -04:00
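
For readers unfamiliar with the Otsu step mentioned in the commit above, here is a minimal, dependency-free sketch of what Otsu's method computes: the grey level that maximises the between-class variance of dark and light pixels. It is an illustration only, not the repository's code. The commit most likely relies on OpenCV's built-in Otsu flag, and the names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximises between-class variance.

    `pixels` is a flat list of integer grey values in 0..255.
    Values <= t form the dark class, values > t the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels in the dark class so far
    sum_dark = 0  # sum of grey values in the dark class so far
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue          # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, t):
    """Map a 2-D list of grey values to 0 (dark) / 255 (light) using threshold t."""
    return [[255 if p > t else 0 for p in row] for row in image]


# Tiny hypothetical "receipt": dark text values near 20, page values near 200.
image = [[20, 25, 200],
         [22, 210, 205]]
t = otsu_threshold([p for row in image for p in row])
clean = binarize(image, t)
```

Because the threshold is derived from a single global histogram, this is much cheaper than a per-neighbourhood adaptive threshold, which fits the "similar quality but a lot faster" remark. The trade-off is that uneven lighting is handled less well.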
651309a6cb Small adjustment to bruteforce rect processing.
Just removed unnecessary sorting from the function.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-28 13:53:20 -04:00
b32da17431 Quick line remover cleanup.
Title^^^^

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-28 13:51:43 -04:00
d83ba20d9a Removing horizontal and vertical lines from receipt
Exactly as the title says.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-28 13:45:38 -04:00
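
The commit above does not say how the lines are removed. One simple way to think about it, sketched here in plain Python under that assumption, is to erase every horizontal or vertical run of ink that is too long to belong to a character. With OpenCV the same idea is usually expressed as a morphological opening with a long, thin structuring element. The function name `remove_long_runs` and the `min_len` parameter are hypothetical.

```python
def remove_long_runs(binary, min_len):
    """Return a copy of `binary` with long straight runs of ink erased.

    `binary` is a 2-D list of 0/1 values (1 = ink). Any horizontal or
    vertical run of consecutive 1s that is at least `min_len` pixels long
    is treated as a ruled line and set to 0.
    """
    h = len(binary)
    w = len(binary[0]) if h else 0
    erase = [[False] * w for _ in range(h)]

    # Mark long horizontal runs.
    for y in range(h):
        x = 0
        while x < w:
            if binary[y][x]:
                start = x
                while x < w and binary[y][x]:
                    x += 1
                if x - start >= min_len:
                    for i in range(start, x):
                        erase[y][i] = True
            else:
                x += 1

    # Mark long vertical runs.
    for x in range(w):
        y = 0
        while y < h:
            if binary[y][x]:
                start = y
                while y < h and binary[y][x]:
                    y += 1
                if y - start >= min_len:
                    for i in range(start, y):
                        erase[i][x] = True
            else:
                y += 1

    # Runs are detected on the original image, so crossing lines are both removed.
    return [[0 if erase[y][x] else binary[y][x] for x in range(w)]
            for y in range(h)]


# One horizontal rule (row 1), one vertical rule (column 1), one short stroke.
page = [
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
cleaned = remove_long_runs(page, min_len=4)
```

`min_len` has to be longer than the widest or tallest glyph stroke, otherwise parts of letters are erased along with the rules.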
ad3c748e35 Updated textClarifying function for new background whiteout
As the title says, but also adjusted the demoing and speckle
thresholding functions so that they work a bit better.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-28 13:42:24 -04:00
bf262d9200 Fixed edge merging from background whiteout.
Instead of blurring the edge, I now use inpainting
to fill in the background with the page colour so it's uniform.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-28 13:32:16 -04:00
df27778a88 Blurring the edge of background whiteout
Doesn't work well when text is near the edge.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-28 13:22:01 -04:00
2d40ca2455 Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into autocropper-test 2023-10-27 11:36:34 -04:00
63ba388f95 Adding some more details for custom Rectangle class.
Added some getter and setter functions in rectangle.h file.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-27 11:27:36 -04:00
a3a6fe9474 Fixing background whiting out.
Whites out the background pretty well. Changed it
to an adaptive threshold first and then use contours to get a mask.
Also using morphology to clean up said mask.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-26 21:50:08 -04:00
65d28173a4 First steps towards implementing shared libraries.
Currently have Rectangle library set up; just need to implement
the actual functions in the .cpp file. Also set Line library to
begin creating the class and functions. Set up the CMake as well
which was a bit of a pain. Also added the libraries branch to
the setup scripts.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-26 00:45:57 -04:00
960f6aa7ba Incomplete first steps towards cleaning up helper implementations.
Stopped because there is little point recoding a cleaner Python version if I plan to redo it in C.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-24 16:42:35 -04:00
fc07d9ad91 Merge pull request 'V1 of line extractor' (#9) from textextractor into main
Reviewed-on: #9
2023-10-24 15:09:18 -04:00
6c80e5661c Merge pull request 'V1 of line extractor' (#8) from textextractor-test into textextractor
Reviewed-on: #8
2023-10-24 15:08:23 -04:00
2706935750 First implementation of line isolator.
Isolates and returns grayscale images of each line at original
resolution (each image is much smaller, since it is only a small
selected part of the original image).

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-24 14:36:13 -04:00
031313dba0 First testing steps towards dewarping.
Too hard. High level math. For later.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-24 11:48:48 -04:00
9250037998 First steps toward isolating lines as pictures.
Clustering but need to deskew before subclustering.
Looking to dewarp page now (done in the autocropper branch).

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-23 17:56:56 -04:00
df56a41ac2 Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into textextractor 2023-10-21 19:31:35 -04:00
84bbb4e939 Merge pull request 'Adding Autocropper V2 to main' (#4) from autocropper into main
Reviewed-on: #4
2023-10-21 19:30:29 -04:00
0eb2ec34c0 Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into autocropper 2023-10-21 19:29:05 -04:00
83306830ac Separating text extractor and text classifier.
Exactly what the title says.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-21 19:27:46 -04:00
a4d75fc6bd Updating scripts to work with this branch
Just making textextractor one of the branches that can be chosen
in the scripts.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-18 23:30:35 -04:00
4015328160 Initial prep for textextractor branch.
Making the docker file and an early Jupyter
Notebook for test work.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-18 23:13:18 -04:00
1260625061 Update .gitignore
Title. Just a quick update.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-18 23:10:34 -04:00
c0d5373f1f Pulled .gitignore from autocropper and cleanup
Removing the default Dockerfile since it doesn't
do anything.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-18 23:06:23 -04:00
f91fb1f9b1 Merge pull request 'Merging Py V1 of the completed autocropper/cleanup' (#2) from autocropper-test into autocropper
Reviewed-on: #2
2023-10-18 22:54:19 -04:00
423b511dd9 Cleanup commit
Moving around the testing notebooks. Autocropping is about done,
with the exception of any new versions or converting the stuff to C
code.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-18 22:48:24 -04:00
e0ce309a0e Merge branch 'autocropper' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into autocropper-test 2023-10-18 22:35:06 -04:00
2888e40ee2 Updated Plan for autocropper
Title. Plan is to just convert all the houghline
processing into C since it should be faster than the python opencv.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-18 21:52:29 -04:00
eb04834d66 Updating gitignore for python function file.
As title says. The python file seems to make
a cache directory so we are ignoring it.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-18 21:39:39 -04:00
b25ffb8602 V2.1 Tuning update
Just tuning the threshold a bit so that the
background whiteout works better.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-18 20:11:18 -04:00
518cea9968 Cleaning up and putting functions into python file
As title says, cleaning up and putting the used/important
functions into myfunctions.py file so they could be easily used.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-18 17:25:17 -04:00
011acea572 V2 houghline preprocessing
Final version of V2 of houghline preprocessing.
May need to make changes to this version but it's complete
and ready for the OCR and actual ML part now.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-18 17:15:14 -04:00
3a98ad0d1c Quick checkpoint 2 with post processing.
I believe the idea is to use morphology to clean
up the merged threshold version and then get bounding boxes
for the letters, get a mask from those bounding boxes, and
then apply the merged thresh onto a white page using the mask.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-13 22:34:23 -04:00
15b36b2de0 Quick checkpoint with postprocessing.
Still struggling with getting the text set well while
removing all the other noise and such.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-11 21:57:03 -04:00
264f8fa80a V1 of final postprocessing for houghline
It works, but it still needs to be
tuned/adjusted so that it works well.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-11 16:07:12 -04:00
882892cb53 Complete houghline deskewing and cropping.
Now to implement cleaning up the resulting image so that
the outer area and page are white and the text is black.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-11 01:29:23 -04:00
834f562604 Cropped original image using functions.
Houghline crop now returns a cropped version of
the original image (not a thresholded/black-and-white or
shrunken, smaller-size picture). Just the base image cropped
to what it should be. Need to now add post processing
to make the background white and sharpen the image (make the text
a hard black and the page a hard white).

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-11 01:02:16 -04:00
7f796c7a7d Third checkpoint in making houghline crop and deskew
Currently breaking it down into more digestible
functions and then need to work on post processing so
that the main piece is kept, the background is set to
white, and the main piece of paper is set to completely white
with the text being full black.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-10 17:05:40 -04:00
9c2d801be3 Second iteration of houghline cropping and deskewing.
Now have deskewing and cropping as one function.
Still need to modify it so it refines the photo even more
and also returns just the rotated rectangle of the original image
instead of the image affected by morphology.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-09 23:56:23 -04:00
4c7bb2c54f First implementation of houghline cropping.
Need to fix because deskewing still has some issues.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-09 17:20:18 -04:00
b2f3e89014 First complete implementation of hough line deskewing
Now to work on hough line cropping.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-09 00:28:26 -04:00
a62f628cc1 Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into autocropper-test 2023-10-08 13:46:34 -04:00