Updated text extractor #16

Merged
ewellenr merged 19 commits from textextractor-test into textextractor 2023-11-13 23:01:54 -05:00

19 Commits

Author SHA1 Message Date
c5e2ef3634 Finishing this iteration of the text extractor.
Still extracts barcodes and other non-text elements, not just text.
That's the one remaining thing I see left to fix.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 22:52:32 -05:00
7cf80d2f8a Checkpoint for working text extraction
Implemented first and sub line clustering. A few touchups to do

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 22:44:02 -05:00
a2e6fe715b Checkpoint in line isolation
Partway through testing a line morphology.
Can't seem to dial it in for the subclusters.
Want to try combining the old and new techniques. That is,
the line morphology for the first full receipt and then the letter
technique for the subclustering.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-10 23:51:03 -05:00
3b60d26c30 Adding a line morphology to grab tiny bits of characters.
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-09 22:33:09 -05:00
ec8df21fa6 Working line extractor with an asterisk
It does not work if there are small dots near the edge.
For example, if a small bit of a character is detached or
a colon is really low/high.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-09 11:43:49 -05:00
85935e13f1 Merge branch 'autocropper-test' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into textextractor-test 2023-11-08 18:16:16 -05:00
589133069d Another update to text clarifier
Tried with kernel based off of character size but
it didn't work. Also removed old adaptive kernel
which was based off of image size.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-08 18:07:02 -05:00
a54ca827cf Working towards updating line extractor
The line extractor now has a deskew in between line clustering.
Still need to update the text clarifier so that lines aren't
merged together on characters.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-07 23:30:57 -05:00
12eff9c27c Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into textextractor-test 2023-11-04 10:27:09 -04:00
2aabfdcfd5 Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into autocropper-test 2023-11-04 10:26:11 -04:00
346c4f3cdd Adding file with helpful links
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-04 10:25:53 -04:00
1a706e19d1 Some changes for implementing a model to extract text
Plan to try and train a Donut model.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-04 10:24:39 -04:00
c6062a6d93 Beginning model implementation.
Tried EasyOCR but it was pretty bad, so I'm going to try
the PyTorch-based model TrOCR, which uses the MIT Licence.
6f60612e7c/LICENSE

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 21:26:01 -04:00
e3c5650f0a Updating and adding test images.
Updated refined images and also added images of the extracted lines
using the new autocropper and line extraction functions respectively.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 17:03:33 -04:00
2f50b048ac Updating the text clarifier to try and connect letters better.
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 16:59:29 -04:00
9d4dd9b08b Updated/tweaked line extractor.
Generally extracts the lines well. There might be some errors
in the future so it needs to be checked.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-30 16:57:44 -04:00
599b9bc437 Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into textextractor-test 2023-10-30 14:53:11 -04:00
d0bf58a21e Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into textextractor-test 2023-10-30 00:40:42 -04:00
b40d7379fc Small changes. Just switching branches.
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-10-24 18:09:05 -04:00