Commit Graph

183 Commits

Author SHA1 Message Date
70cabaabd4 Improved background whiteout
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-14 19:06:34 -05:00
c06408c783 Updating the text clarification for a more faithful output
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-14 17:04:38 -05:00
ac7f0edf1c Merge pull request 'Starting to make libraries but mainly updating scripts' (#24) from utilities into main
Reviewed-on: #24
2023-11-14 16:25:38 -05:00
527362ac0f Working on dataset making/preprocessing
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-14 16:20:07 -05:00
b7ebbb21bd Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into utility-maker-test
2023-11-14 13:31:28 -05:00
8e75a7e0ce Merge pull request 'Updating textextractor' (#23) from textextractor into main
Reviewed-on: #23
2023-11-13 23:25:17 -05:00
ae555d0660 Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into textextractor
2023-11-13 23:24:35 -05:00
637ea908fa Merge pull request 'autocropper update' (#22) from autocropper into main
Reviewed-on: #22
2023-11-13 23:24:03 -05:00
6f1b499f57 Merge pull request 'Updating text clarification' (#21) from autocropper-test into autocropper
Reviewed-on: #21
2023-11-13 23:23:15 -05:00
ade8ca1e73 Another update to text clarifier
Tried with kernel based off of character size but
it didn't work. Also removed old adaptive kernel
which was based off of image size.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:23:15 -05:00
87517abeb8 Merge branch 'textextractor' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into textextractor
2023-11-13 23:05:55 -05:00
d1c9cb8947 Merge pull request 'Updated text extractor' (#16) from textextractor-test into textextractor
Reviewed-on: #16
2023-11-13 23:01:52 -05:00
849224ee7a Finishing this iteration of the text extractor.
Still extracts barcodes and stuff, not just text.
That's the one remaining thing I can see to fix.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
d1147eb988 Checkpoint for working text extraction
Implemented first and sub line clustering. A few touchups to do

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
56f01257df Checkpoint in line isolation
In between testing out using a line morphology.
Can't seem to dial it in for the subclusters.
Want to try combining the old and new technique. That is,
the line morphology for the first full receipt and then the letter
technique for the subclustering.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
4a228c531e Adding a line morphology to grab tiny bits of characters.
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
140bb5c2aa Working line extractor with an asterisk
It does not work if there are small dots near the edge.
For example, if a small bit of a character is detached or
a colon is really low/high.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
5113b278e7 Another update to text clarifier
Tried with kernel based off of character size but
it didn't work. Also removed old adaptive kernel
which was based off of image size.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
643a73c2c7 Updating the text clarifier to try and connect letters better.
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
762de3602f Working towards updating line extractor
Need to update text clarifier so that lines aren't
merged together on characters but have updated it so that
there is a deskew in between line clustering.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
714a10a499 Adding file with helpful links
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
864b45b2fa Some changes for implementing a model to extract text
Plan to try and train a Donut model.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
9962789a3b Beginning model implementation.
Tried easyOCR but it was pretty bad so I'm going to try
the pytorch based model TrOCR which uses the MIT Licence.
6f60612e7c/LICENSE

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
80316ff83d Updating and adding test images.
Updated refined images and also added images of the extracted lines
using the new autocropper and line extraction functions respectively.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
d81460d7c0 Updated/tweaked line extractor.
Generally extracts the lines well. There might be some errors
in the future so it needs to be checked.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
f2aeccd3ab Small changes. Just switching branches.
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
6b53c82334 Merge pull request 'Small text clarifier tweak' (#15) from autocropper into main
Reviewed-on: #15
2023-11-06 18:56:14 -05:00
7b4f1a7a2b Merge pull request 'Small text clarifier tweak' (#14) from autocropper-test into autocropper
Reviewed-on: #14
2023-11-06 18:55:15 -05:00
2aabfdcfd5 Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into autocropper-test
2023-11-04 10:26:11 -04:00
346c4f3cdd Adding file with helpful links
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-04 10:25:53 -04:00
07e6de44c3 Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into utility-maker-test
2023-10-30 21:29:54 -04:00
2f50b048ac Updating the text clarifier to try and connect letters better.
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 16:59:29 -04:00
59ac6965cd Merge pull request 'Adjusting/improving line removal' (#13) from autocropper into main
Reviewed-on: #13
2023-10-30 14:51:31 -04:00
b75c7696de Merge pull request 'Adjusting/improving line removal' (#12) from autocropper-test into autocropper
Reviewed-on: #12
2023-10-30 14:49:59 -04:00
0c6187619e Fixing the line removal.
It used to leave little scraps of the line. Adjusted it so it doesn't.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 14:49:05 -04:00
825e8f75cb Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into autocropper-test
2023-10-30 10:16:50 -04:00
64c5e7a1fa Merge pull request 'Updating text refiner' (#11) from autocropper into main
Reviewed-on: #11
2023-10-30 00:38:08 -04:00
8cb999e610 Merge pull request 'Updating text refining' (#10) from autocropper-test into autocropper
Reviewed-on: #10
2023-10-30 00:36:17 -04:00
2cdf553d6d Need to fix Python auto-formatting
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
0e81ce0405 Tiny bit of cleanup after text clarification
Title^^^

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
f9a86b22e5 Updated text clarifier
Changed the technique it uses. Seems to work a little better.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
22ac088c5d Updating text clarifier again
Just adding a little bit of complexity to try and remove some of the
random clumps and spots that appear.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
182446f952 Fixing whitedbackground with inpaint.
The mask used for inpainting wasn't correct (it seems).
Updated it to use the correct mask for inpainting.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
136105b2d7 Updated text clarifier
Using just Otsu thresholding with some morphology as it's similar
quality but a lot faster.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
f8f4c6a761 Small adjustment to bruteforce rect processing.
Just removed unnecessary sorting from the function.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
dc49d5cf64 Quick line remover cleanup.
Title^^^^

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
1f1a4c3000 Removing horizontal and vertical lines from receipt
Exactly as the title says.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
a0e15fada6 Updated textClarifying function for new background whiteout
As the title says but also adjusted the demoing and speckle
thresholding functions so that they work a bit better.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
eab6fdc8e1 Fixed edge merging from background whiteout.
Instead of blurring the edge, I now use inpainting
to use the page colour to fill in the background so it's uniform.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
2b4de89208 Blurring the edge of background whiteout
Doesn't work well when text is near the edge.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 00:36:17 -04:00
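
Note on commit 136105b2d7: the message says the text clarifier switched to Otsu thresholding plus some morphology. The repository's code is not shown in this log, so the following is only a minimal sketch of what Otsu's method computes, written in plain Python on a flat list of grey values. The function names otsu_threshold and binarize are hypothetical and chosen for illustration; a real pipeline would more likely call an image library, and the morphology step is not covered here.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int], levels: int = 256) -> int:
    """Return the grey level t that maximises between-class variance.

    Pixels with value <= t form the dark class; pixels > t form the light class.
    (Hypothetical helper, not taken from the repository.)
    """
    # Histogram of grey levels.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the dark class
    sum0 = 0  # sum of grey values in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map each pixel to 255 (above the threshold) or 0 (at or below it)."""
    return [255 if p > threshold else 0 for p in pixels]
```

For a receipt scan, the dark class would typically be the printed text and the light class the paper, so a single global threshold of this kind separates the two without any per-image tuning, which is consistent with the commit's remark that it is fast.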