Using Tesseract for OCR in Java on a Mac

Converting images to text in Java rarely works on the first try. In my case, developing on a Mac (M1 chip), I ran into errors right away.


I. Using Tesseract Directly {#一、直接使用}
------------------------------

### 1. Install with brew {#使用-brew-进行安装}

```shell
brew install tesseract
```



 
On other operating systems, follow the installation guide at https://tesseract-ocr.github.io/tessdoc/Installation.html.

 
### 2. Check the version {#查看版本}


```shell
nukix@nukixPC ~ % tesseract --version
tesseract 5.5.0
leptonica-1.85.0
libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.44 : libtiff 4.7.0 : zlib 1.2.12 : libwebp 1.4.0 : libopenjp2 2.5.2
Found NEON
Found libarchive 3.7.7 zlib/1.2.12 liblzma/5.6.3 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.6
Found libcurl/8.7.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.12 nghttp2/1.62.0
```
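If you want a Java program to confirm that the `tesseract` binary is installed before going further, something like the following could work. This is only a sketch: the class name `TesseractVersion` and its `parse` helper are made up for illustration, and it assumes the first line of the output has the form `tesseract <version>`, as in the listing above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tesseract or Tess4J.
final class TesseractVersion {

    /** Extracts the version string from the first line of `tesseract --version` output. */
    static String parse(String versionOutput) {
        String firstLine = versionOutput.strip().lines().findFirst().orElse("");
        String prefix = "tesseract ";
        if (!firstLine.startsWith(prefix)) {
            throw new IllegalArgumentException("Unexpected version output: " + firstLine);
        }
        return firstLine.substring(prefix.length()).strip();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Runs the same command as above; stderr is merged into stdout.
        Process process = new ProcessBuilder("tesseract", "--version")
                .redirectErrorStream(true)
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        process.waitFor();
        System.out.println("Tesseract version: " + parse(output));
    }
}
```

If the binary is not on the `PATH`, `ProcessBuilder.start()` throws an `IOException`, which is itself a useful early signal that the installation step did not succeed.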

### 3. Find the install path {#查看安装路径}

```shell
brew list tesseract
```

My install path, for example, is `/opt/homebrew/Cellar/tesseract/5.5.0`. It is used again in later steps.

### 4. Recognize text in an image (English) {#在图片中识别文字(英文)}

```shell
tesseract abc.jpg out.txt -l eng
```

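The general form is `tesseract <image path> <output base> -l <language code>`. Tesseract appends `.txt` to the output base, so the command above writes its result to `out.txt.txt`; pass `out` as the output base if you want the file to be named `out.txt`.

Language codes are listed at https://tesseract-ocr.github.io/tessdoc/Data-Files.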
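The same command can also be launched from Java without any wrapper library. Below is a minimal sketch, assuming `tesseract` is on the `PATH`; the class `TesseractCli` and its methods are hypothetical names for illustration. It builds the argument list in the order described above and accounts for the `.txt` suffix that Tesseract adds to the output base.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper that mirrors the command-line form shown above.
final class TesseractCli {

    /** Builds: tesseract <image path> <output base> -l <language code> */
    static List<String> command(String imagePath, String outputBase, String languageCode) {
        return List.of("tesseract", imagePath, outputBase, "-l", languageCode);
    }

    /** Tesseract appends ".txt" to the output base for plain-text output. */
    static String outputFileName(String outputBase) {
        return outputBase + ".txt";
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        List<String> cmd = command("abc.jpg", "out", "eng");
        // inheritIO() forwards Tesseract's own messages to this process's console.
        int exitCode = new ProcessBuilder(cmd).inheritIO().start().waitFor();
        System.out.println("Exit code " + exitCode + ", result expected in " + outputFileName("out"));
    }
}
```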

### 5. Download language packs {#语言包下载}

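Language packs are available at https://github.com/tesseract-ocr/tessdata

Download the ones you need. The Traditional Chinese pack, for example, is https://github.com/tesseract-ocr/tessdata/raw/refs/heads/main/chi_tra.traineddata (the Simplified Chinese pack is `chi_sim.traineddata` in the same repository).

Put the downloaded file in the `tessdata` directory. Mine is `/opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata`.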
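To double-check that a pack landed in the right place, a small Java check along these lines may help. It is a sketch under the assumption that each language is a single `<code>.traineddata` file directly inside the `tessdata` directory; `LanguagePacks` is a hypothetical class name and the path in `main` is the one from my machine.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for checking which language packs are present.
final class LanguagePacks {

    /** Returns the expected location of a language pack, e.g. tessdata/chi_tra.traineddata. */
    static Path trainedDataFile(Path tessdataDir, String languageCode) {
        return tessdataDir.resolve(languageCode + ".traineddata");
    }

    /** True only if the pack exists as a regular file in the given directory. */
    static boolean isInstalled(Path tessdataDir, String languageCode) {
        return Files.isRegularFile(trainedDataFile(tessdataDir, languageCode));
    }

    public static void main(String[] args) {
        // Assumption: adjust this to your own tessdata directory.
        Path tessdata = Path.of("/opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata");
        for (String lang : new String[] {"eng", "chi_tra"}) {
            System.out.println(lang + " installed: " + isInstalled(tessdata, lang));
        }
    }
}
```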

II. Calling Tesseract from the Java API {#二、java-api-调用}
------------------------------

### 2.1 Add the Maven library {#导入-maven-库}

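`tesseract` is a native C++ library (with a C API), so Java cannot call it directly. The official documentation lists third-party wrappers for other languages at https://tesseract-ocr.github.io/tessdoc/AddOns.html#tesseract-wrappers; the one for `Java` is https://github.com/nguyenq/tess4j.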

#### Gradle (short) {#gradleshort}

```groovy
implementation 'net.sourceforge.tess4j:tess4j:5.13.0'
```

#### Maven {#maven}

```xml
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.13.0</version>
</dependency>
```

### 2.2 Calling from Java {#java-调用}

Following the example at https://tess4j.sourceforge.net/tutorial/, the call looks like this:
```java
package tess4j.example;

import java.io.File;
import net.sourceforge.tess4j.*;

public class TesseractExample {
    public static void main(String[] args) {
        // System.setProperty("jna.library.path", "32".equals(System.getProperty("sun.arch.data.model")) ? "lib/win32-x86" : "lib/win32-x86-64");

        File imageFile = new File("eurotext.tif");
        ITesseract instance = new Tesseract();  // JNA Interface Mapping
        // ITesseract instance = new Tesseract1(); // JNA Direct Mapping
        // File tessDataFolder = LoadLibs.extractTessResources("tessdata"); // Maven build bundles English data
        // instance.setDatapath(tessDataFolder.getPath());

        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

### 2.3 java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract' {#java.lang.unsatisfiedlinkerror-unable-to-load-library-tesseract}

#### Problem {#问题}

You will most likely hit the following exception: the native `tesseract` library is not in any of the locations the runtime searches.
```text
java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract':
dlopen(libtesseract.dylib, 0x0009): tried: 'libtesseract.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OSlibtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/./libtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/../lib/libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache), 'libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache)
dlopen(libtesseract.dylib, 0x0009): tried: 'libtesseract.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OSlibtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/./libtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/../lib/libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache), 'libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache)
dlopen(/Users/nukix/Library/Frameworks/tesseract.framework/tesseract, 0x0009): tried: '/Users/nukix/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/nukix/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/Users/nukix/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file, not in dyld cache)
dlopen(/Library/Frameworks/tesseract.framework/tesseract, 0x0009): tried: '/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file, not in dyld cache)
dlopen(/System/Library/Frameworks/tesseract.framework/tesseract, 0x0009): tried: '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/System/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file, not in dyld cache)
Native library (darwin-aarch64/libtesseract.dylib) not found in resource path (/Users/nukix/AndroidStudioProjects/nukix/TestWeb/build/classes/java/main:/Users/nukix/AndroidStudioProjects/nukix/TestWeb/build/resources/main)
	at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:325) ~[jna-5.14.0.jar:5.14.0 (b0)]
	at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:481) ~[jna-5.14.0.jar:5.14.0 (b0)]
	at com.sun.jna.Library$Handler.<init>(Library.java:197) ~[jna-5.14.0.jar:5.14.0 (b0)]
	at com.sun.jna.Native.load(Native.java:618) ~[jna-5.14.0.jar:5.14.0 (b0)]
	at com.sun.jna.Native.load(Native.java:592) ~[jna-5.14.0.jar:5.14.0 (b0)]
	at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:83) ~[tess4j-5.13.0.jar:5.13.0]
	at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:42) ~[tess4j-5.13.0.jar:5.13.0]
	at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:353) ~[tess4j-5.13.0.jar:5.13.0]
	at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:202) ~[tess4j-5.13.0.jar:5.13.0]
	at net.sourceforge.tess4j.ITesseract.doOCR(ITesseract.java:59) ~[tess4j-5.13.0.jar:5.13.0]
```

#### Solution {#解决}

Copy the library file to one of the paths listed in the error. In my case, one of those paths is `/Users/nukix/Library/Frameworks/tesseract.framework/tesseract`.

First, create the directory:
```shell
mkdir -p /Users/nukix/Library/Frameworks/tesseract.framework
```

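Then copy the library file:

```shell
# Copy the native library itself (not the tessdata directory) to the path from the error.
# Assumption: the dylib sits under lib/ of the install path; confirm the exact name with `brew list tesseract`.
cp /opt/homebrew/Cellar/tesseract/5.5.0/lib/libtesseract.dylib /Users/nukix/Library/Frameworks/tesseract.framework/tesseract
```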
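An alternative that avoids copying files is to point JNA at the directory that already contains the library, through the `jna.library.path` system property (the same property that appears in the commented-out line of the example in 2.2). The sketch below is an assumption-laden illustration: `JnaLibraryPath` is a hypothetical class, and the candidate directories are the Homebrew locations on my machine. The property has to be set before Tess4J is first used.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for locating the directory that holds libtesseract.dylib.
final class JnaLibraryPath {

    /** Returns the first candidate directory that contains the given library file. */
    static Optional<Path> firstDirContaining(List<Path> candidateDirs, String libraryFileName) {
        return candidateDirs.stream()
                .filter(dir -> Files.isRegularFile(dir.resolve(libraryFileName)))
                .findFirst();
    }

    public static void main(String[] args) {
        // Assumption: typical Homebrew locations on Apple Silicon; adjust to your install path.
        List<Path> candidates = List.of(
                Path.of("/opt/homebrew/lib"),
                Path.of("/opt/homebrew/Cellar/tesseract/5.5.0/lib"));

        firstDirContaining(candidates, "libtesseract.dylib").ifPresentOrElse(
                dir -> {
                    // JNA consults this property when it searches for native libraries.
                    System.setProperty("jna.library.path", dir.toString());
                    System.out.println("jna.library.path set to " + dir);
                },
                () -> System.err.println("libtesseract.dylib not found in any candidate directory"));
    }
}
```

The same effect can be had without code by passing `-Djna.library.path=<directory>` as a JVM option.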



                     
### 2.4 Error opening data file ./eng.traineddata {#error-opening-data-file-.eng.traineddata}


                     
#### Problem {#问题-1}


                     
Next, you will most likely be told that the language pack cannot be found.

                     
                                            
```text
Error opening data file ./eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
```



                         
#### Solution {#解决-1}


                         
This calls for an environment variable. If you develop in IDEA, set it from the menu bar under `Run->Edit Configurations...->Environment variables`.

                         
In my case the setting is `TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata`. This is the tessdata path mentioned earlier, so I won't repeat how to find it.

                         
Run the program again and it should print the recognized text.
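If the error persists, it can be worth checking what the JVM actually sees, since a variable set in a terminal is not necessarily visible to a run configuration. The following is a rough sketch; `TessdataPrefix` is a hypothetical class, and the fallback path is simply the one from my machine.

```java
import java.util.Map;

// Hypothetical helper for deciding which tessdata directory to use.
final class TessdataPrefix {

    /** Returns TESSDATA_PREFIX from the given environment, or the fallback if it is unset or blank. */
    static String resolve(Map<String, String> env, String fallbackDir) {
        String value = env.get("TESSDATA_PREFIX");
        return (value == null || value.isBlank()) ? fallbackDir : value;
    }

    public static void main(String[] args) {
        // Assumption: the fallback is the Homebrew tessdata directory used in this post.
        String dir = resolve(System.getenv(), "/opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata");
        System.out.println("tessdata directory: " + dir);
        // With Tess4J, the same directory could be passed to instance.setDatapath(dir),
        // as hinted by the commented-out lines in the example from section 2.2.
    }
}
```

Taking the environment as a `Map` parameter keeps `resolve` easy to exercise with made-up values instead of depending on the real process environment.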

                         
References {#参考文献}
------------


                         
                                               

                          
                                                
* https://tesseract-ocr.github.io/tessdoc/Installation.html
* https://tesseract-ocr.github.io/tessdoc/AddOns.html#tesseract-wrappers
* https://tess4j.sourceforge.net/tutorial/
* https://segmentfault.com/a/1190000042039342
