依楼听风雨
Watching the clouds gather and scatter with a smile, calmly observing the tides rise and fall

# Using Tesseract for OCR from Java on a Mac

Converting images to text in Java rarely works on the first try. In my case, developing on a Mac with an M1 chip, I ran into the errors covered below.

## Part 1: Using Tesseract directly {#一、直接使用}

### 1. Install with brew {#使用-brew-进行安装}

```shell
brew install tesseract
```

On other operating systems, follow the instructions at https://tesseract-ocr.github.io/tessdoc/Installation.html.

### 2. Check the version {#查看版本}

```
nukix@nukixPC ~ % tesseract --version
tesseract 5.5.0
 leptonica-1.85.0
  libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.44 : libtiff 4.7.0 : zlib 1.2.12 : libwebp 1.4.0 : libopenjp2 2.5.2
 Found NEON
 Found libarchive 3.7.7 zlib/1.2.12 liblzma/5.6.3 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.6
 Found libcurl/8.7.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.12 nghttp2/1.62.0
```

### 3. Find the install path {#查看安装路径}

```shell
brew list tesseract
```

In my case the install path is `/opt/homebrew/Cellar/tesseract/5.5.0`, which is used again below.

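### 4. Recognize text in an image (English) {#在图片中识别文字(英文)}

```shell
tesseract abc.jpg out -l eng
```

The command format is `tesseract <image path> <output base> -l <language code>`. Tesseract appends `.txt` to the output base, so the command above writes the recognized text to `out.txt`.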
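The same command line can be assembled and launched from Java without any wrapper library. The following is a minimal sketch, assuming `tesseract` is on the `PATH`; the class and method names (`TesseractCli`, `buildCommand`, `outputFile`, `run`) are made up for illustration.

```java
import java.io.IOException;
import java.util.List;

class TesseractCli {
    /** Builds the argument list: tesseract <image> <outputBase> -l <lang>. */
    static List<String> buildCommand(String image, String outputBase, String lang) {
        return List.of("tesseract", image, outputBase, "-l", lang);
    }

    /** Tesseract appends ".txt" to the output base when writing plain text. */
    static String outputFile(String outputBase) {
        return outputBase + ".txt";
    }

    /** Starts the process, forwards its console output, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> command = buildCommand("abc.jpg", "out", "eng");
        System.out.println(String.join(" ", command));
        System.out.println("result file: " + outputFile("out"));
        // Only launch the real binary when asked to, since it may not be installed.
        if (args.length > 0 && args[0].equals("--run")) {
            System.exit(run(command));
        }
    }
}
```

Without arguments the program only prints the command it would run; pass `--run` to actually invoke Tesseract.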

Language codes are listed at https://tesseract-ocr.github.io/tessdoc/Data-Files.

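### 5. Download language packs {#语言包下载}

Language packs: https://github.com/tesseract-ocr/tessdata

Download the packs you need. For example, the Traditional Chinese pack is https://github.com/tesseract-ocr/tessdata/raw/refs/heads/main/chi_tra.traineddata (the code for Simplified Chinese is `chi_sim`).

After downloading, put the file in the `tessdata` directory. Mine is `/opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata`.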
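To check whether a language pack is in place before running OCR, a small helper along these lines may be useful. It is only a sketch: the class `LanguageData` and its methods are hypothetical, the directory is the one from my machine, and the download URL pattern is assumed to hold for codes other than `chi_tra`.

```java
import java.nio.file.Files;
import java.nio.file.Path;

class LanguageData {
    // Base URL taken from the chi_tra link; assumed to follow the same pattern for other codes.
    static final String BASE_URL =
            "https://github.com/tesseract-ocr/tessdata/raw/refs/heads/main/";

    /** Download URL for a language code such as "eng" or "chi_tra". */
    static String downloadUrl(String lang) {
        return BASE_URL + lang + ".traineddata";
    }

    /** Location where Tesseract expects the pack inside the tessdata directory. */
    static Path dataFile(Path tessdataDir, String lang) {
        return tessdataDir.resolve(lang + ".traineddata");
    }

    /** True if the pack exists as a regular file in the tessdata directory. */
    static boolean isInstalled(Path tessdataDir, String lang) {
        return Files.isRegularFile(dataFile(tessdataDir, lang));
    }

    public static void main(String[] args) {
        // Path from this article; adjust it to your own install.
        Path tessdata = Path.of("/opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata");
        for (String lang : new String[] {"eng", "chi_tra"}) {
            String status = isInstalled(tessdata, lang)
                    ? "installed"
                    : "missing, download from " + downloadUrl(lang);
            System.out.println(lang + ": " + status);
        }
    }
}
```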

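## Part 2: Calling Tesseract from Java {#二、java-api-调用}

### 2.1 Add the Maven dependency {#导入-maven-库}

Tesseract is a native C++ library (it also exposes a C API), so Java cannot call it directly. The official documentation lists third-party wrappers for other languages at https://tesseract-ocr.github.io/tessdoc/AddOns.html#tesseract-wrappers; the one for Java is https://github.com/nguyenq/tess4j.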

#### Gradle (short) {#gradleshort}

```groovy
implementation 'net.sourceforge.tess4j:tess4j:5.13.0'
```

#### Maven {#maven}

```xml
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.13.0</version>
</dependency>
```

### 2.2 Calling from Java {#java-调用}

Following the example at https://tess4j.sourceforge.net/tutorial/, the call looks like this:

```java
package tess4j.example;

import java.io.File;
import net.sourceforge.tess4j.*;

public class TesseractExample {
    public static void main(String[] args) {
        // System.setProperty("jna.library.path", "32".equals(System.getProperty("sun.arch.data.model")) ? "lib/win32-x86" : "lib/win32-x86-64");

        File imageFile = new File("eurotext.tif");
        ITesseract instance = new Tesseract();  // JNA Interface Mapping
        // ITesseract instance = new Tesseract1(); // JNA Direct Mapping
        // File tessDataFolder = LoadLibs.extractTessResources("tessdata"); // Maven build bundles English data
        // instance.setDatapath(tessDataFolder.getPath());

        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

### 2.3 java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract' {#java.lang.unsatisfiedlinkerror-unable-to-load-library-tesseract}

#### Problem {#问题}

You will most likely hit the following exception: the native tesseract library is not on any of the paths searched at runtime.

```
java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract':
dlopen(libtesseract.dylib, 0x0009): tried: 'libtesseract.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OSlibtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/./libtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/../lib/libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache), 'libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache)
dlopen(libtesseract.dylib, 0x0009): tried: 'libtesseract.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OSlibtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/./libtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/../lib/libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache), 'libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache)
dlopen(/Users/nukix/Library/Frameworks/tesseract.framework/tesseract, 0x0009): tried: '/Users/nukix/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/nukix/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/Users/nukix/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file, not in dyld cache)
dlopen(/Library/Frameworks/tesseract.framework/tesseract, 0x0009): tried: '/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file, not in dyld cache)
dlopen(/System/Library/Frameworks/tesseract.framework/tesseract, 0x0009): tried: '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/System/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file, not in dyld cache)
Native library (darwin-aarch64/libtesseract.dylib) not found in resource path (/Users/nukix/AndroidStudioProjects/nukix/TestWeb/build/classes/java/main:/Users/nukix/AndroidStudioProjects/nukix/TestWeb/build/resources/main)
    at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:325) ~[jna-5.14.0.jar:5.14.0 (b0)]
    at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:481) ~[jna-5.14.0.jar:5.14.0 (b0)]
    at com.sun.jna.Library$Handler.<init>(Library.java:197) ~[jna-5.14.0.jar:5.14.0 (b0)]
    at com.sun.jna.Native.load(Native.java:618) ~[jna-5.14.0.jar:5.14.0 (b0)]
    at com.sun.jna.Native.load(Native.java:592) ~[jna-5.14.0.jar:5.14.0 (b0)]
    at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:83) ~[tess4j-5.13.0.jar:5.13.0]
    at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:42) ~[tess4j-5.13.0.jar:5.13.0]
    at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:353) ~[tess4j-5.13.0.jar:5.13.0]
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:202) ~[tess4j-5.13.0.jar:5.13.0]
    at net.sourceforge.tess4j.ITesseract.doOCR(ITesseract.java:59) ~[tess4j-5.13.0.jar:5.13.0]
```

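#### Solution {#解决}

Copy the native library to one of the paths listed in the error. In my case one of them is `/Users/nukix/Library/Frameworks/tesseract.framework/tesseract`.

First create the directory:

```shell
mkdir -p /Users/nukix/Library/Frameworks/tesseract.framework
```

Then copy the library file to that path. The loader expects the library itself at this location, so the source is `libtesseract.dylib`, which a Homebrew install normally places under `lib` in the install path (confirm the location with `brew list tesseract`):

```shell
cp /opt/homebrew/Cellar/tesseract/5.5.0/lib/libtesseract.dylib /Users/nukix/Library/Frameworks/tesseract.framework/tesseract
```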
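An alternative to copying files is to point JNA at the directory that already holds the library, using the same `jna.library.path` property that appears (commented out) in the tutorial example. The sketch below is an assumption-laden illustration rather than the tess4j-recommended setup: the class `JnaLibraryPath`, the helper `firstDirContaining`, and the candidate directories are all hypothetical, and the property has to be set before the first Tesseract call.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

class JnaLibraryPath {
    /** Returns the first directory in the list that contains the named file. */
    static Optional<Path> firstDirContaining(List<Path> dirs, String fileName) {
        return dirs.stream()
                .filter(dir -> Files.isRegularFile(dir.resolve(fileName)))
                .findFirst();
    }

    public static void main(String[] args) {
        // Assumed locations for a Homebrew install on Apple Silicon; adjust as needed.
        List<Path> candidates = List.of(
                Path.of("/opt/homebrew/lib"),
                Path.of("/opt/homebrew/Cellar/tesseract/5.5.0/lib"));

        Optional<Path> dir = firstDirContaining(candidates, "libtesseract.dylib");
        if (dir.isPresent()) {
            // JNA consults this property when it searches for native libraries.
            System.setProperty("jna.library.path", dir.get().toString());
            System.out.println("jna.library.path=" + dir.get());
        } else {
            System.out.println("libtesseract.dylib not found in " + candidates);
        }
    }
}
```

In a real program the `System.setProperty` call would go at the top of `main`, before `new Tesseract()` is used.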



     
### 2.4 Error opening data file ./eng.traineddata {#error-opening-data-file-.eng.traineddata}

#### Problem {#问题-1}

Next, you will most likely be told that the language pack cannot be found.

```
Error opening data file ./eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
```



         
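#### Solution {#解决-1}

The fix is to set an environment variable. If you develop in IDEA, set it under `Run -> Edit Configurations... -> Environment variables` in the menu bar.

In my case the value is `TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata`, the tessdata directory mentioned earlier.

Run the program again and it should print the recognized text.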
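If you would rather not depend on the run configuration alone, the data directory can also be resolved in code and handed to `instance.setDatapath(...)`, the call shown commented out in the tutorial example. The following is a minimal sketch of that lookup; the class `TessdataPath` and its `resolve` method are hypothetical, and the fallback is simply the path from my machine.

```java
import java.util.Map;

class TessdataPath {
    /** Prefers TESSDATA_PREFIX from the given environment, else the fallback. */
    static String resolve(Map<String, String> env, String fallback) {
        String fromEnv = env.get("TESSDATA_PREFIX");
        return (fromEnv == null || fromEnv.isBlank()) ? fallback : fromEnv;
    }

    public static void main(String[] args) {
        String datapath = resolve(System.getenv(),
                "/opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata");
        // With tess4j on the classpath this value would be passed on:
        // instance.setDatapath(datapath);
        System.out.println("tessdata directory: " + datapath);
    }
}
```

Taking the environment as a `Map` parameter keeps the lookup easy to exercise without touching real environment variables.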

         
References {#参考文献}
------------

* https://tesseract-ocr.github.io/tessdoc/Installation.html
* https://tesseract-ocr.github.io/tessdoc/AddOns.html#tesseract-wrappers
* https://tess4j.sourceforge.net/tutorial/
* https://segmentfault.com/a/1190000042039342
