Multilingual text recognition with translation

By admin, 27 September, 2024
Joshua Hoehne | unsplash.com

Sometimes you have a printed document on your desk in a language you don't speak, and typing the whole thing into a translation service would take a long time. You might also have received the document as an image, even though it is digitized. It may be a confidential document that you must not upload to any third-party service. And even if you were allowed to type it in and upload it, that can be difficult when the language uses a character set that is not available on your keyboard or system. Either way, you have a problem translating this document, and this post tries to provide a solution. Here you will learn how to quickly build a multilingual text recognition and translation web application based on Tesseract-OCR, LibreTranslate, the Apache 2 web server and its PHP module, running on Ubuntu Server 22.04 LTS. We are not going to reinvent the wheel: we will use every existing resource we can find to keep the task simple and finish it as quickly as possible.

What we need:

  • An up-to-date Linux server (I am using Ubuntu 22.04 LTS and this post is based on that; Debian or other distributions should work as well, although the steps for installing the necessary packages may differ);
  • Tesseract-OCR package (I am going to install all available languages for long term use; you may want to install only the languages you actually need);
  • LibreTranslate package (we are going to install it as a service so it will always start with the system so the web server can connect to it anytime);
  • Apache 2 web server (this will be the main web server we are going to use to connect to LibreTranslate service so our connection will be secure);
  • PHP module (this will be the interpreter that runs our web application).

I am not going to cover how to install Apache 2 and PHP on your system; this may be different when you are not using a Debian-based distribution but this step should be easy enough to assume it's already done and you have a working web server with PHP.

Installing dependencies

First we install the text recognition software itself: the Tesseract-OCR package. On Ubuntu/Debian, open a terminal and update your installed packages first by running the following commands:

sudo apt-get update
sudo apt-get upgrade

Now install Tesseract-OCR package with all available languages:

sudo apt-get install tesseract-ocr tesseract-ocr-all

You can check if the installation was successful by running the following command:

tesseract --version

To get a list of the installed languages, run the following command:

tesseract --list-langs
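The listing starts with a header line, which the web application further down has to skip before it can fill its language dropdown. As a rough sketch of the same filtering on the command line (the function name filter_langs is made up for this example, and it assumes the header is the only line containing the word "languages"):

```shell
# Print only the language codes from `tesseract --list-langs` style output.
# Reads the listing on stdin, drops the header line (assumed to be the only
# line containing the word "languages") and blank lines, then sorts the rest.
filter_langs() {
  grep -v -i 'languages' | grep -v '^[[:space:]]*$' | sort
}

# Usage against a real installation:
#   tesseract --list-langs | filter_langs
```

This mirrors what the PHP code later in this post does with explode(), sort() and a preg_match() on each line.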

Next, we are going to install the LibreTranslate service. LibreTranslate runs on Python and is installed with pip, so install pip if it's not installed yet:

sudo apt-get install python3-pip

You can check if it's installed by running the following command:

pip3 --version

Now you can install LibreTranslate package:

sudo pip install libretranslate

To install language models, you need Argos Translate package, which you can install this way:

sudo pip install argos-translate

Now you can list all available language models:

argos-translate-cli list

Then, install your preferred language model like this:

sudo argos-translate-cli install-model <source_language_code> <target_language_code>

For example, to install the Japanese to English model, run this command:

sudo argos-translate-cli install-model ja en

I may need all language models so I installed all of them. Note: installing all language models may take a long time and a lot of disk space, so be prepared for that. You can verify your installed language models with the following command:

argos-translate-cli installed

LibreTranslate will automatically detect the installed models, so you can start the service with this command:

libretranslate [args]

where [args] are the command-line parameters you may want to supply to customize your libretranslate service. For the complete list of arguments and additional information, please check the project's website:

https://github.com/LibreTranslate/LibreTranslate

The service now runs until you stop it. It won't start with your system, which is not what we want, so we have to create a systemd service that starts LibreTranslate on boot:

sudo nano /etc/systemd/system/libretranslate.service

The content of the file should look like this:

[Unit]
Description=LibreTranslate Service
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/libretranslate --host 0.0.0.0 --port 5000
Restart=on-failure
User=root
WorkingDirectory=/root
[Install]
WantedBy=multi-user.target

I created this service on my local virtual machine that nobody else has access to, so running it as root and listening on all interfaces (0.0.0.0) is OK for me. In a production environment, you should not run the service as root, and you should restrict port 5000 in your firewall so that only local access is allowed to the port LibreTranslate is listening on.

Now reload systemd so that it finds the new service:

sudo systemctl daemon-reload

Then enable LibreTranslate service:

sudo systemctl enable libretranslate.service

Once it's enabled, you can start the service:

sudo systemctl start libretranslate.service

You can check if it's actually running with the following command:

sudo systemctl status libretranslate.service
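The unit above runs as root and listens on every interface. For anything beyond a private virtual machine, a tighter variant might bind to the loopback address and run under an unprivileged account. The following is only a sketch under assumptions: the function name and the account name libretranslate are hypothetical, the account needs a home directory, and as far as I know language models installed as root are stored per user, so they would have to be installed again for that account.

```shell
# Write a more restrictive LibreTranslate unit file.
# $1 = destination path, $2 = service user (hypothetical dedicated account,
#      defaults to "libretranslate").
write_libretranslate_unit() {
  unit_path=$1
  run_as=${2:-libretranslate}
  cat > "$unit_path" <<EOF
[Unit]
Description=LibreTranslate Service
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/libretranslate --host 127.0.0.1 --port 5000
Restart=on-failure
User=$run_as

[Install]
WantedBy=multi-user.target
EOF
}

# Possible usage (not executed here):
#   sudo useradd --system --create-home libretranslate
#   write_libretranslate_unit /tmp/libretranslate.service libretranslate
#   sudo install -m 644 /tmp/libretranslate.service /etc/systemd/system/
#   sudo systemctl daemon-reload && sudo systemctl restart libretranslate.service
```

With --host 127.0.0.1 the service is reachable only from the server itself, which is all the PHP application needs because it connects to 127.0.0.1:5000.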

Depending on your configuration, it may take a few minutes for LibreTranslate to start properly. With the above configuration, LibreTranslate listens on port 5000. If the port is not blocked for remote access, you can test it in your browser and see if it's working properly:

http://<your-server-ip-or-domain-name>:5000/
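If you prefer the command line, the same check can be done with curl from the server itself. The sketch below sends the same JSON fields to /translate that the PHP code later in this post uses (q, source, target, format). The helper names and the LT_URL variable are made up for this example, and the JSON escaping is deliberately minimal: it assumes single-line text without control characters.

```shell
# Escape backslashes and double quotes so the text can sit inside a JSON string.
# Minimal on purpose: newlines and other control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body for /translate. $1 = text, $2 = target language code.
build_payload() {
  printf '{"q":"%s","source":"auto","target":"%s","format":"text"}' \
    "$(json_escape "$1")" "$2"
}

# Send the request. LT_URL is a hypothetical override for the base URL.
lt_translate() {
  curl -s -X POST \
    -H 'Content-Type: application/json' \
    --data "$(build_payload "$1" "$2")" \
    "${LT_URL:-http://127.0.0.1:5000}/translate"
}

# Possible usage (not executed here):
#   curl -s http://127.0.0.1:5000/languages
#   lt_translate 'Guten Morgen' en
```

A JSON response containing a translatedText field indicates that the service and the required language model are working.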

Building the web application

Now that we have all the dependencies, we can start building our web application. This will probably be the simplest frontend you've ever built: a single index.php file and an uploads directory! We still want it to look nice and be responsive, so we add Bootstrap and some custom styles as well.

The application works the following way: you navigate to index.php on your server, upload an image file that contains some text, select the language the text was written in, then (optionally, if you want translation and not just text recognition) select the language you want the recognized text to be translated to. Text recognition is handled by the Tesseract-OCR package we installed in the first step, which PHP calls through a shell_exec statement (so its parameters have to be validated properly). Once Tesseract-OCR has created a text file that contains the recognized text, we output it to the browser and optionally pass its content to the LibreTranslate service for translation, then send the translated content returned by LibreTranslate to the browser. LibreTranslate detects the language of the recognized text automatically. Lastly, we clean up all uploaded files and wait for the next file upload on the same page, which now contains the result of the previous upload.
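Before wiring this into PHP, the recognition step can be tried by hand. The sketch below is an illustration, not the application's code: sanitize_lang and ocr_image are made-up names, and the whitelist is the same one the PHP code applies with preg_replace (letters, digits, hyphen, underscore). Passing stdout as the output base makes tesseract print the text instead of writing a <base>.txt file.

```shell
# Keep only characters that may appear in a language code:
# letters, digits, hyphen and underscore.
sanitize_lang() {
  printf '%s' "$1" | tr -cd 'A-Za-z0-9_-'
}

# OCR an image and print the recognized text to the terminal.
# $1 = image path, $2 = Tesseract language code (for example eng or jpn).
ocr_image() {
  tesseract "$1" stdout -l "$(sanitize_lang "$2")"
}

# Possible usage (not executed here; scan.png is a placeholder file name):
#   ocr_image scan.png jpn
```

Stripping unexpected characters before the value reaches the shell is the reason the application validates the language parameter in addition to quoting it.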

First, locate your web root directory (usually /var/www/html on modern Ubuntu/Debian systems) and create a new directory for our web application, tesseract, with a subdirectory named uploads. In the tesseract directory, create an index.php file with this content:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>OCR with Tesseract</title>
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.1/css/bootstrap.min.css">
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.7.1/jquery.min.js"></script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.1/js/bootstrap.min.js"></script>
    <style>
    body {
      background: url('<your-own-nice-background-image>') no-repeat center center fixed, linear-gradient(135deg, #add8e6, #87cefa, #00bfff);
      background-size: cover;
      color: #000000;
      display: flex;
      flex-direction: column;
      justify-content: center;
      align-items: center;
      min-height: 100vh;
      margin: 0;
      font-family: 'Arial', sans-serif;
    }
    .container {
      background: rgba(255, 255, 255, 0.7);
      border-radius: 15px;
      padding: 30px;
      box-shadow: 0 4px 30px rgba(0, 0, 0, 0.1);
      backdrop-filter: blur(10px);
      border: 1px solid rgba(255, 255, 255, 0.3);
      width: 100%;
      max-width: 600px;
    }
    h1 {
      text-align: center;
      color: #007bff;
      margin-bottom: 20px;
    }
    h2 {
      text-align: center;
      color: #007bff;
      margin-top: 20px;
      margin-bottom: 20px;
    }
    .form-control {
      border-radius: 10px;
    }
    .btn-primary {
      background-color: #007bff;
      border: none;
      border-radius: 10px;
      padding: 10px 20px;
    }
    .btn-primary:hover {
      background-color: #0056b3;
    }
    textarea {
      margin-top: 20px;
      background: rgba(255, 255, 255, 0.8);
      border-radius: 10px;
      border: none;
      padding: 15px;
      width: 100%;
      font-size: 1rem;
      resize: none;
    }
    .result {
      background: rgba(255, 255, 255, 0.5);
      border-radius: 15px;
      padding: 20px;
    }
    </style>
  </head>
  <body>
    <div class="container">
      <h1>OCR with Tesseract</h1>
      <form action="<?php echo basename($_SERVER['SCRIPT_NAME']); ?>" method="post" enctype="multipart/form-data" class="form-group">
        <input type="hidden" id="image_uploaded" name="image_uploaded" value="1">
        <div class="form-group">
          <label for="image">Select image to upload:</label>
          <input type="file" name="image" id="image" accept="image/*" class="form-control" required>
        </div>
        <div class="form-group">
          <label for="language">Select language:</label>
          <select name="language" id="language" class="form-control" required>
            <?php
            // Get selected language if provided
            $language = isset($_POST['language']) ? preg_replace('/[^a-z0-9\-_]/i', '', $_POST['language']) : null;
            // Run Tesseract command to list available languages
            $output = shell_exec('tesseract --list-langs');
            $languages = explode("\n", trim($output));
            // Sort the languages alphabetically
            sort($languages);
            // Loop through the languages and create options for the select dropdown
            if ($languages) {
              foreach ($languages as $lang) {
                if (!$lang) { continue; }
                if (preg_match('/languages/i', $lang)) { continue; }
                // Preselect the posted language, or English when nothing was posted yet
                $selected = (strtolower($lang) == strtolower($language ?: 'eng')) ? ' selected' : '';
                $lang_text = strtoupper($lang);
                echo "<option value=\"{$lang}\"{$selected}>{$lang_text}</option>";
              }
            }
            ?>
          </select>
        </div>
        <?php
        $translate_request = !empty($_POST['translate']);
        $translate_checked = $translate_request ? ' checked' : '';
        // Function to get available languages from LibreTranslate
        function get_available_languages() {
          $ch = curl_init();
          curl_setopt($ch, CURLOPT_URL, 'http://127.0.0.1:5000/languages');
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
          curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
          curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 600);
          curl_setopt($ch, CURLOPT_TIMEOUT, 600);
          $response = curl_exec($ch);
          if (curl_errno($ch)) {
            echo '<p class="alert alert-danger">Error fetching languages: ' . htmlspecialchars(curl_error($ch)) . '</p>';
          }
          curl_close($ch);
          return json_decode($response, true);
        }
        // Fetch available languages from LibreTranslate
        $available_languages = get_available_languages();
        $target_language = isset($_POST['target_language']) ? preg_replace('/[^a-z0-9\-_]/i', '', $_POST['target_language']) : 'en';
        ?>
        <div class="form-group form-check">
          <input type="checkbox" class="form-check-input" id="translate" name="translate" value="1"<?php echo $translate_checked; ?>>
          <label class="form-check-label" for="translate">Translate to the selected language:</label>
        </div>
        <div class="form-group">
          <select name="target_language" id="target_language" class="form-control" disabled>
            <?php
            // Populate the target language options dynamically
            if ($available_languages) {
              foreach ($available_languages as $current_language) {
                if (!isset($current_language['code'])) { continue; }
                if (!isset($current_language['name'])) { continue; }
                $language_selected = (strtolower(trim($current_language['code'])) == strtolower($target_language)) ? ' selected' : '';
                echo '<option value="' . htmlentities(trim($current_language['code'])) . "\"{$language_selected}>" . htmlentities(trim($current_language['name'])) . '</option>';
              }
            }
            ?>
          </select>
        </div>
        <div class="form-group text-center">
          <input type="submit" name="submit" value="Upload and Process" class="btn btn-primary">
        </div>
      </form>
      <?php
      if (!empty($_POST['image_uploaded']) && isset($_FILES['image'])) {
        // Set max execution time to 10 minutes
        ini_set('max_execution_time', 600);
        // Check time zone for DateTime codes and fix it if necessary
        $php_timezone = trim(ini_get('date.timezone'));
        if (!$php_timezone) {
          $php_timezone = 'UTC';
          ini_set('date.timezone', $php_timezone);
        }
        $current_datetime = new DateTime();
        // Set target directory
        $target_dir = 'uploads/';
        // Define path for uploaded file
        $target_file = $target_dir . md5(rand()) . '_' . $current_datetime->format('YmdHis') . '_' . preg_replace('/[^a-z0-9\-_\.]/i', '', basename($_FILES['image']['name']));
        $imageFileType = strtolower(pathinfo($target_file, PATHINFO_EXTENSION));
        // Allow only certain file formats (jpg, png, gif, tiff)
        $allowed_types = array('jpg', 'jpeg', 'png', 'gif', 'tiff');
        if (in_array($imageFileType, $allowed_types)) {
          if (move_uploaded_file($_FILES['image']['tmp_name'], $target_file)) {
            // Get selected language from the form
            // Quote the selected language for the shell (fall back to English if none was posted)
            $language = escapeshellarg($language ?: 'eng');
            // Run Tesseract with the selected language; it appends ".txt" to the output base itself
            $output_file = $target_file . '.txt';
            $command = 'tesseract ' . escapeshellarg($target_file) . ' ' . escapeshellarg($target_file) . " -l {$language}";
            shell_exec($command);
            // Read and display the OCR result
            if (file_exists($output_file)) {
              $ocr_result = trim(file_get_contents($output_file));
              // If the user requested translation, call LibreTranslate API
              $label_for_original = 'OCR Result';
              if (($translate_request) && (strlen($ocr_result) >= 1)) {
                $label_for_original = 'Original:';
                $post_data = array(
                  'q' => $ocr_result,
                  'source' => 'auto',
                  'target' => $target_language,
                  'format' => 'text',
                  'api_key' => '');
                // Convert the data to JSON format
                $post_data_json = json_encode($post_data);
                // Call LibreTranslate service
                $ch = curl_init();
                curl_setopt($ch, CURLOPT_URL, 'http://127.0.0.1:5000/translate');
                curl_setopt($ch, CURLOPT_POST, true);
                curl_setopt($ch, CURLOPT_HTTPHEADER, array(
                  'Content-Type: application/json',
                  'Content-Length: ' . strlen($post_data_json)));
                curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data_json);
                curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
                curl_setopt($ch, CURLOPT_VERBOSE, true);
                curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 600);
                curl_setopt($ch, CURLOPT_TIMEOUT, 600);
                // Execute the request and get the response
                $translated_result = curl_exec($ch);
                // Check for cURL errors
                $curl_error = null;
                if (curl_errno($ch)) {
                  $curl_error = curl_error($ch);
                }
                curl_close($ch);
                // Decode the JSON response to get the translation
                $response = json_decode($translated_result, true);
                $translatedText = null;
                if (isset($response['translatedText'])) {
                  $translatedText = trim($response['translatedText']);
                } else {
                  // Handle case where translation failed
                  $translatedText = "Translation failed. Please check LibreTranslate service. cURL Error: {$curl_error}. Response: {$translated_result}";
                }
                echo "<h2>OCR Result</h2>\n";
                echo "<div class=\"result\">\n";
                echo "<h3>Translated:</h3>\n";
                echo "<textarea readonly rows='10'>" . htmlspecialchars($translatedText) . "</textarea>\n";
                echo "</div>";
              }
              echo "<div class=\"result\">\n";
              echo "<h3>{$label_for_original}</h3>\n";
              echo "<textarea readonly rows='10'>" . htmlspecialchars($ocr_result) . "</textarea>\n";
              echo "</div>";
              // Clean up text file
              unlink($output_file);
            } else {
              echo "<p class='alert alert-danger'>Sorry, there was an error creating your file.</p>";
            }
            // Clean up the uploaded image
            if (file_exists($target_file)) { unlink($target_file); }
          } else {
            echo "<p class='alert alert-danger'>Sorry, there was an error uploading your file.</p>";
          }
        } else {
          echo "<p class='alert alert-warning'>Invalid file format. Please upload a JPG, PNG, GIF, or TIFF file.</p>";
        }
      }
      ?>
    </div>
    <script>
    // Function to enable/disable the target language dropdown
    function toggleTargetLanguage() {
      var targetLanguageSelect = document.getElementById('target_language');
      var translateCheckbox = document.getElementById('translate');
      // Toggle the disabled attribute based on checkbox status
      if (translateCheckbox.checked) {
        targetLanguageSelect.disabled = false;
      } else {
        targetLanguageSelect.disabled = true;
      }
    }
    // Check the state of the checkbox when the page loads
    window.onload = function() {
      toggleTargetLanguage();  // Set initial state based on whether the checkbox is checked
    };
    // Add event listener to the checkbox to enable/disable the select on change
    document.getElementById('translate').addEventListener('change', toggleTargetLanguage);
    </script>
  </body>
</html>

You can see the most important steps in the comments. Make sure your web server has read and write access to the uploads directory; otherwise it won't be able to handle your uploaded files. Once everything is done, test the web application in your browser:

http://<your-server-ip-or-domain-name>/tesseract/
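If the upload step fails, the permissions of the uploads directory are a likely cause. On Ubuntu/Debian, Apache normally runs as the www-data user, so one way to prepare the directory could look like the sketch below. The function name is made up for this example, and the 750 mode is just one reasonable choice.

```shell
# Create the uploads directory and restrict it to its owner and group.
# $1 = directory path, $2 = optional owner in user:group form.
prepare_uploads() {
  mkdir -p "$1" || return 1
  chmod 750 "$1" || return 1
  if [ -n "${2:-}" ]; then
    # Changing the owner usually requires root privileges.
    chown "$2" "$1"
  fi
}

# Possible usage as root (not executed here):
#   prepare_uploads /var/www/html/tesseract/uploads www-data:www-data
```

The web server then owns the directory and can create and delete the temporary image and text files, while other local users have no access to them.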

Based on the code above, our website should look like this:

[Screenshot: OCR with Tesseract + LibreTranslate]

The base PHP code was generated by ChatGPT and then audited and refactored by me. Please note that this application is just a quick and dirty solution to our problem. On a production server, you may need better input validation, and you may also want to move the stylesheets and JS code to separate files and include them in the header so the code is more readable and easier to manage.

Enjoy :)

Update

After upgrading from Ubuntu 22.04 LTS to Ubuntu 24.04 LTS, half of my system was messed up due to incorrect or missing packages. Among many other things, I had to reinstall the entire LibreTranslate package. On Ubuntu 24.04 LTS, installing LibreTranslate requires installing cmake first:

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install cmake

Then you need to install LibreTranslate as described on its GitHub page (assuming you already have Python and pip installed):

sudo pip install libretranslate

If Python complains about the environment, you can use the --break-system-packages argument at your own risk:

sudo pip install libretranslate --break-system-packages

I am working on a virtual machine with snapshots, so I can restore a previous state any time I want; I really didn't care what breaks, I just wanted to install it quickly. I also ran into a lot of errors caused by obsolete packages when I reinstalled LibreTranslate. To get past them, I usually forced the installation of their new versions by running the following command at each error:

pip install packagename --force-reinstall --no-deps --ignore-installed --break-system-packages
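On a machine where breaking system packages is not acceptable, a Python virtual environment is a less invasive alternative that I have not used for this setup. The sketch below assumes the python3-venv package is installed (sudo apt-get install python3-venv); the function names and the /opt/libretranslate-venv location are hypothetical.

```shell
# Create a virtual environment and install LibreTranslate into it.
# $1 = directory for the virtual environment.
install_libretranslate_venv() {
  python3 -m venv "$1" && "$1/bin/pip" install libretranslate
}

# Print the path of the libretranslate executable inside that environment.
# This is the path ExecStart in the systemd unit would have to point to,
# instead of /usr/local/bin/libretranslate.
venv_libretranslate_bin() {
  printf '%s/bin/libretranslate\n' "$1"
}

# Possible usage (not executed here):
#   sudo sh -c 'python3 -m venv /opt/libretranslate-venv && /opt/libretranslate-venv/bin/pip install libretranslate'
#   venv_libretranslate_bin /opt/libretranslate-venv
```

Because pip then installs into the environment's own directory, the system-wide Python packages stay untouched and the --break-system-packages argument should not be needed.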

Once LibreTranslate is installed, start it without arguments to get all the language models installed:

sudo libretranslate

That should be enough to get LibreTranslate package working. To make a service for it that starts with the system, the steps should be the same as above. For further information, refer to LibreTranslate's GitHub page:

https://github.com/LibreTranslate/LibreTranslate

Tags

  • Tesseract-OCR
  • LibreTranslate
  • Apache2
  • PHP
  • Ubuntu
  • Linux
