Updated text extractor #16

Merged
ewellenr merged 19 commits from textextractor-test into textextractor 2023-11-13 23:01:54 -05:00
Owner

An improved version with specialized subclustering for lines.
ewellenr added 19 commits 2023-11-13 23:01:17 -05:00
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
Generally extracts the lines well. Errors may still show up on
future inputs, so the output needs to be checked.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
Updated refined images and also added images of the extracted lines
using the new autocropper and line extraction functions respectively.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
Tried EasyOCR, but the results were pretty bad, so I'm going to try
TrOCR, a PyTorch-based model that uses the MIT License.
6f60612e7c/LICENSE

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
Plan to try and train a Donut model.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
Still need to update the text clarifier so that lines aren't
merged together on characters, but it now runs a deskew
between line clustering passes.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
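
The deskew step mentioned in the commit above could be estimated with a projection profile. The following is a minimal sketch, not the code from this branch: `estimate_skew`, its arguments, and the scoring rule are all assumptions made for illustration. It takes foreground pixel coordinates, undoes each candidate slope, and keeps the angle that packs the pixels into the fewest rows.

```python
import math


def estimate_skew(points, candidate_angles):
    """Return the candidate angle (degrees) that best flattens the text rows.

    points: iterable of (x, y) foreground pixel coordinates (hypothetical input).
    candidate_angles: iterable of angles in degrees to try.
    """
    best_angle, best_score = None, -1
    for angle in candidate_angles:
        slope = math.tan(math.radians(angle))
        counts = {}
        for x, y in points:
            # Undo the candidate slope, then bucket the pixel by its row.
            row = round(y - x * slope)
            counts[row] = counts.get(row, 0) + 1
        # Sum of squares is highest when pixels concentrate in few rows.
        score = sum(n * n for n in counts.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

Once an angle is found, the image would be rotated by its negative before the next clustering pass. The candidate range and step are tuning choices that depend on how skewed the receipts are.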
Tried a kernel sized from the character size, but
it didn't work. Also removed the old adaptive kernel,
which was sized from the image size.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
It does not work if there are small dots near the edge.
For example, if a small bit of a character is detached or
a colon is really low/high.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
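
One possible way around the small-dot failure described above is to ignore tiny connected components when computing a line's vertical extent, while leaving the pixels in the image. This is a sketch under assumptions: `components`, `line_extent`, and the `min_area` threshold are hypothetical names and values, not part of this branch.

```python
from collections import deque


def components(grid):
    """Return the 4-connected foreground components as lists of (row, col)."""
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    found = []
    for r in range(height):
        for c in range(width):
            if not grid[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, comp = deque([(r, c)]), []
            while queue:
                cr, cc = queue.popleft()
                comp.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if 0 <= nr < height and 0 <= nc < width and grid[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            found.append(comp)
    return found


def line_extent(grid, min_area=3):
    """Return (top, bottom) rows using only components of at least min_area pixels."""
    rows = [r for comp in components(grid) if len(comp) >= min_area for r, _ in comp]
    return (min(rows), max(rows)) if rows else None
```

A real colon or the dot of an "i" is also small, so the specks are only skipped for boundary finding here; they could be re-attached to the nearest line afterwards.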
Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
Midway through testing a line morphology.
Can't seem to dial it in for the subclusters.
Want to try combining the old and new techniques: that is,
the line morphology for the first pass over the full receipt, then the
letter technique for the subclustering.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
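
The combined idea above (a coarse first pass over the full receipt, then letter-based subclustering) might look roughly like the sketch below. It is an illustration under assumptions, not this branch's code: a plain row projection stands in for the line morphology, letter boxes are assumed to be given as `(top, bottom, left, right)`, and `gap_ratio` is a hypothetical tuning value.

```python
from statistics import median


def row_bands(grid):
    """First pass: group consecutive rows that contain any foreground."""
    bands, start = [], None
    for r, row in enumerate(grid):
        if any(row):
            if start is None:
                start = r
        elif start is not None:
            bands.append((start, r - 1))
            start = None
    if start is not None:
        bands.append((start, len(grid) - 1))
    return bands


def center(box):
    """Vertical center of a (top, bottom, left, right) letter box."""
    top, bottom, _left, _right = box
    return (top + bottom) / 2


def subcluster(boxes, gap_ratio=0.5):
    """Second pass: split one band wherever letter centers jump vertically."""
    if not boxes:
        return []
    limit = gap_ratio * median(b[1] - b[0] + 1 for b in boxes)
    ordered = sorted(boxes, key=center)
    lines = [[ordered[0]]]
    for prev, box in zip(ordered, ordered[1:]):
        if center(box) - center(prev) > limit:
            lines.append([box])  # gap too large: start a new line
        else:
            lines[-1].append(box)
    return lines


def two_pass_lines(grid, letter_boxes):
    """Run the coarse pass, then subcluster the letters inside each band."""
    lines = []
    for top, bottom in row_bands(grid):
        inside = [b for b in letter_boxes if top <= center(b) <= bottom]
        lines.extend(subcluster(inside))
    return lines
```

The second pass is what separates two text lines that touch or overlap slightly, which the coarse pass alone reports as a single band.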
Implemented first-pass and sub-line clustering. A few touch-ups left to do.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
Still extracts barcodes and other non-text regions, not just text.
That's the one thing I can see that's possibly left to fix.

Signed-off-by: Ethan Wellenreiter <ewellenreiter@gmail.com>
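
For the leftover barcode extraction noted above, one heuristic that could be tried is to reject crops whose columns are almost all solid (each column entirely ink or entirely background) and that alternate many times along a row. This is only a sketch: `looks_like_barcode` and both thresholds are hypothetical and would need tuning on real receipt crops.

```python
def looks_like_barcode(crop, min_transitions=8, min_uniform=0.9):
    """Guess whether a binary crop (list of rows of 0/1) is a 1D barcode."""
    height, width = len(crop), len(crop[0])
    # Count columns whose pixels are all the same value (solid bar or solid gap).
    uniform = sum(
        1 for c in range(width) if len({crop[r][c] for r in range(height)}) == 1
    )
    # Count ink/background flips along the middle row.
    middle = crop[height // 2]
    transitions = sum(1 for a, b in zip(middle, middle[1:]) if a != b)
    return uniform / width >= min_uniform and transitions >= min_transitions
```

Text crops tend to fail the column-uniformity test because letter strokes change from row to row, while a solid rule line fails the transition test.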
ewellenr merged commit d1c9cb8947 into textextractor 2023-11-13 23:01:54 -05:00
ewellenr deleted branch textextractor-test 2023-11-13 23:02:00 -05:00
Reference: ewellenr/receipt_indexer#16