Updating textextractor #23

Merged
ewellenr merged 17 commits from textextractor into main 2023-11-13 23:25:17 -05:00

17 Commits

Author SHA1 Message Date
ae555d0660 Merge branch 'main' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into textextractor 2023-11-13 23:24:35 -05:00
87517abeb8 Merge branch 'textextractor' of ssh://ssh.git.ewellenr.ca:2222/ewellenr/receipt_indexer into textextractor 2023-11-13 23:05:55 -05:00
d1c9cb8947 Merge pull request 'Updated text extractor' (#16) from textextractor-test into textextractor
Reviewed-on: #16
2023-11-13 23:01:52 -05:00
849224ee7a Finishing this iteration of the text extractor.
It still extracts barcodes and other non-text elements, not just text.
That's the only thing I can see that's left to fix.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
d1147eb988 Checkpoint for working text extraction
Implemented first-pass and sub-line clustering. A few touch-ups still to do.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
56f01257df Checkpoint in line isolation
Partway through testing a line morphology.
Can't seem to dial it in for the subclusters.
Want to try combining the old and new techniques: the line morphology
for the first pass over the full receipt, then the letter technique
for the subclustering.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
4a228c531e Adding a line morphology to grab tiny bits of characters.
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
140bb5c2aa Working line extractor with an asterisk
It does not work if there are small dots near the edge.
For example, if a small bit of a character is detached or
a colon is really low/high.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
5113b278e7 Another update to text clarifier
Tried a kernel based on character size, but
it didn't work. Also removed the old adaptive kernel,
which was based on image size.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
643a73c2c7 Updating the text clarifier to try and connect letters better.
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
762de3602f Working towards updating line extractor
Still need to update the text clarifier so that lines aren't
merged together on characters, but I have updated it so that
a deskew runs between line-clustering steps.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
714a10a499 Adding file with helpful links
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
864b45b2fa Some changes for implementing a model to extract text
Plan to try and train a Donut model.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
9962789a3b Beginning model implementation.
Tried EasyOCR, but it was pretty bad, so I'm going to try
the PyTorch-based model TrOCR, which uses the MIT Licence.
6f60612e7c/LICENSE

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
80316ff83d Updating and adding test images.
Updated refined images and also added images of the extracted lines
using the new autocropper and line extraction functions respectively.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
d81460d7c0 Updated/tweaked line extractor.
Generally extracts the lines well. Some errors may still turn up
on other inputs, so the output needs to be checked.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00
f2aeccd3ab Small changes. Just switching branches.
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
2023-11-13 23:01:52 -05:00