You often come across images with text superimposed on them, and you may want to extract that text. In some cases, annotations on graphs are stored only as text superimposed on the graph image. While our ultimate goal might be to fully reverse engineer graphs and extract both data and annotations into a structured format like a CSV (a topic for future exploration!), a crucial first step, and the focus of this post, is extracting text and its precise location from an image. This can be done with OCR (optical character recognition).
In Python, you can use the OCR library pytesseract. Setup:
- We start with the image library PIL. If you don't have it, run "pip install Pillow"; "pip install PIL" may give an error, because the package is distributed under the name Pillow.
- From PIL you can import "Image".
- You also need the "pytesseract" library ("pip install pytesseract"). It is a Python wrapper around the Tesseract OCR engine, whose main repository is here: GitHub - tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository)
- On Windows, the Tesseract engine itself is installed separately; I downloaded an executable installer from this site: Home · UB-Mannheim/tesseract Wiki · GitHub
- I added the executable's folder to my system's PATH environment variable (a quick sanity check follows this list).
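As a quick sanity check that both the Python wrapper and the Tesseract engine are wired up, you can ask pytesseract for the engine version. This is a minimal sketch; the Windows path below is the UB-Mannheim installer default and is only needed if tesseract.exe is not already on your PATH.

import pytesseract

# Point pytesseract at the Tesseract executable (skip if it is already on PATH).
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Prints the installed engine version; raises an error if Tesseract cannot be found.
print(pytesseract.get_tesseract_version())

With the setup confirmed, the full script below opens the image, runs OCR, and writes each confidently recognized word and its position to a text file.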
from PIL import Image
import pytesseract

# Tell pytesseract where the Tesseract executable lives (default Windows install path)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Load the image and run word-level OCR, returning a dictionary of lists
image = Image.open("LibreViewOneDay.jpg")
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

# Open a file to write the results
with open("extracted_notes.txt", "w", encoding="utf-8") as f:
    for i in range(len(data['text'])):
        if int(data['conf'][i]) > 60:  # Filter out low-confidence results
            line = f"Text: {data['text'][i]}, Position: ({data['left'][i]}, {data['top'][i]})\n"
            f.write(line)
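Each accepted word ends up as one line in extracted_notes.txt, following the pattern of the f-string above; for example (the word and coordinate values here are made up purely for illustration):

Text: Notes, Position: (132, 481)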
Understanding the Output
The heart of the extraction is this single call:
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
The data variable now holds a dictionary with a list for each type of information: text, left, top, width, height, conf (confidence), level, page_num, block_num, par_num, line_num, and word_num. Each index i in these lists corresponds to one detected "word" or text element. By iterating through this data dictionary, we can access:
- data['text'][i]: The extracted text string.
- data['left'][i]: The x-coordinate of the top-left corner of the bounding box.
- data['top'][i]: The y-coordinate of the top-left corner of the bounding box.
- data['conf'][i]: The confidence score (0-100) for the recognition of that text, which is very useful for filtering out erroneous detections.
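The block_num, par_num, and line_num fields let you reassemble the word-level results into whole lines of text. Here is a minimal sketch, assuming data was produced by the image_to_data call above:

from collections import defaultdict

# Group recognized words by (block, paragraph, line) so they read as full lines.
lines = defaultdict(list)
for i in range(len(data['text'])):
    word = data['text'][i].strip()
    if word:  # skip the empty entries Tesseract emits for structural levels
        key = (data['block_num'][i], data['par_num'][i], data['line_num'][i])
        lines[key].append(word)

for key in sorted(lines):
    print(' '.join(lines[key]))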
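Since width and height are also returned, a natural way to verify the coordinates is to draw the detected bounding boxes back onto the image with PIL's ImageDraw. A minimal sketch, reusing the same image and the same confidence filter as the script above (the output filename annotated.jpg is just an example):

from PIL import Image, ImageDraw

img = Image.open("LibreViewOneDay.jpg").convert("RGB")
draw = ImageDraw.Draw(img)

for i in range(len(data['text'])):
    # Same confidence filter as before; conf is -1 for non-word structural entries.
    if int(data['conf'][i]) > 60 and data['text'][i].strip():
        x, y = data['left'][i], data['top'][i]
        w, h = data['width'][i], data['height'][i]
        draw.rectangle([x, y, x + w, y + h], outline="red", width=2)

img.save("annotated.jpg")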