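Image-to-text (OCR) work in Java rarely goes smoothly the first time. In my case, developing on a Mac with an M1 chip, I ran into errors as soon as I called the library.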
I. Using Tesseract directly {#一、直接使用}
------------------------------

### 1. Install with brew {#使用-brew-进行安装}

```
brew install tesseract
```

On other systems, follow https://tesseract-ocr.github.io/tessdoc/Installation.html instead.
### 2. Check the version {#查看版本}

```
nukix@nukixPC ~ % tesseract --version
tesseract 5.5.0
leptonica-1.85.0
libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.44 : libtiff 4.7.0 : zlib 1.2.12 : libwebp 1.4.0 : libopenjp2 2.5.2
Found NEON
Found libarchive 3.7.7 zlib/1.2.12 liblzma/5.6.3 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.6
Found libcurl/8.7.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.12 nghttp2/1.62.0
```
### 3. Find the install path {#查看安装路径}

```
brew list tesseract
```

On my machine the install path is `/opt/homebrew/Cellar/tesseract/5.5.0`; it is used in the steps below.
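### 4. Recognize text in an image (English) {#在图片中识别文字(英文)}

```
tesseract abc.jpg out -l eng
```

The command has the form `tesseract <image path> <output base> -l <language code>`. Tesseract appends `.txt` to the output base, so the command above writes the recognized text to `out.txt`.

Language codes are listed at https://tesseract-ocr.github.io/tessdoc/Data-Files.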
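The same command line can also be launched from Java without any wrapper library. The following is only a minimal sketch: the class name `TesseractCli` and its helper methods are made up for illustration, and it assumes the `tesseract` binary is on the `PATH`. It passes an output base of `out`, which Tesseract turns into `out.txt`.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper class, not part of Tesseract or Tess4J.
class TesseractCli {
    /** Builds the argument list for: tesseract <image> <outputBase> -l <lang>. */
    static List<String> buildCommand(Path image, String outputBase, String lang) {
        return List.of("tesseract", image.toString(), outputBase, "-l", lang);
    }

    /** Runs the CLI and returns its exit code; assumes `tesseract` is on the PATH. */
    static int run(Path image, String outputBase, String lang)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase, lang))
                .inheritIO() // forward tesseract's own messages to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            int exitCode = run(Path.of("abc.jpg"), "out", "eng");
            System.out.println("tesseract exited with code " + exitCode);
        } catch (IOException e) {
            // Typically means the tesseract binary could not be started.
            System.err.println("Could not start tesseract: " + e.getMessage());
        }
    }
}
```

Keeping the argument list in its own method makes it easy to check the command without actually running OCR.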
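### 5. Download language packs {#语言包下载}

Language packs: https://github.com/tesseract-ocr/tessdata

Download the packs you need. For example, the Traditional Chinese pack is https://github.com/tesseract-ocr/tessdata/raw/refs/heads/main/chi_tra.traineddata (Simplified Chinese is `chi_sim.traineddata` in the same repository).

Put the downloaded file in the `tessdata` directory; on my machine that is `/opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata`.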
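Before running OCR it can help to confirm that the pack for a language code really is in the `tessdata` directory. The sketch below relies only on the `<language code>.traineddata` naming seen in the download URL; the class `LanguagePackCheck` and its methods are hypothetical, and the default directory is simply the path from my machine.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of Tesseract or Tess4J.
class LanguagePackCheck {
    /** Returns the file expected for the given language code, e.g. eng.traineddata. */
    static Path trainedDataFile(Path tessdataDir, String lang) {
        return tessdataDir.resolve(lang + ".traineddata");
    }

    /** True if the language pack file exists in the given tessdata directory. */
    static boolean isInstalled(Path tessdataDir, String lang) {
        return Files.isRegularFile(trainedDataFile(tessdataDir, lang));
    }

    public static void main(String[] args) {
        // Assumed default: the Homebrew path from this post; pass another one as args[0].
        Path tessdata = Path.of(args.length > 0
                ? args[0]
                : "/opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata");
        for (String lang : new String[] {"eng", "chi_tra"}) {
            String state = isInstalled(tessdata, lang) ? "installed" : "missing";
            System.out.println(lang + ": " + state);
        }
    }
}
```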
II. Calling from the Java API {#二、java-api-调用}
------------------------------

### 2.1 Add the dependency {#导入-maven-库}

Tesseract is a native library (written in C++, with a C API), so Java cannot call it directly. The official documentation lists third-party wrappers for other languages at https://tesseract-ocr.github.io/tessdoc/AddOns.html#tesseract-wrappers; the one for Java is https://github.com/nguyenq/tess4j.
#### Gradle (short) {#gradleshort}

```
implementation 'net.sourceforge.tess4j:tess4j:5.13.0'
```
#### Maven {#maven}

```
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.13.0</version>
</dependency>
```
### 2.2 Calling it from Java {#java-调用}

Following the example at https://tess4j.sourceforge.net/tutorial/, the call looks like this.

```
package tess4j.example;

import java.io.File;
import net.sourceforge.tess4j.*;

public class TesseractExample {
    public static void main(String[] args) {
        // System.setProperty("jna.library.path", "32".equals(System.getProperty("sun.arch.data.model")) ? "lib/win32-x86" : "lib/win32-x86-64");

        File imageFile = new File("eurotext.tif");
        ITesseract instance = new Tesseract(); // JNA Interface Mapping
        // ITesseract instance = new Tesseract1(); // JNA Direct Mapping
        // File tessDataFolder = LoadLibs.extractTessResources("tessdata"); // Maven build bundles English data
        // instance.setDatapath(tessDataFolder.getPath());
        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
```
### 2.3 java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract' {#java.lang.unsatisfiedlinkerror-unable-to-load-library-tesseract}

#### Problem {#问题}

You will most likely hit the following exception: JNA cannot find the native `tesseract` library on any of its search paths.

```
java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract':
dlopen(libtesseract.dylib, 0x0009): tried: 'libtesseract.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OSlibtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/./libtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/../lib/libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache), 'libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache)
dlopen(libtesseract.dylib, 0x0009): tried: 'libtesseract.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OSlibtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/./libtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/../lib/libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache), 'libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache)
dlopen(/Users/nukix/Library/Frameworks/tesseract.framework/tesseract, 0x0009): tried: '/Users/nukix/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/nukix/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/Users/nukix/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file, not in dyld cache)
dlopen(/Library/Frameworks/tesseract.framework/tesseract, 0x0009): tried: '/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file, not in dyld cache)
dlopen(/System/Library/Frameworks/tesseract.framework/tesseract, 0x0009): tried: '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/System/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file, not in dyld cache)
Native library (darwin-aarch64/libtesseract.dylib) not found in resource path (/Users/nukix/AndroidStudioProjects/nukix/TestWeb/build/classes/java/main:/Users/nukix/AndroidStudioProjects/nukix/TestWeb/build/resources/main)
    at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:325) ~[jna-5.14.0.jar:5.14.0 (b0)]
    at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:481) ~[jna-5.14.0.jar:5.14.0 (b0)]
    at com.sun.jna.Library$Handler.<init>(Library.java:197) ~[jna-5.14.0.jar:5.14.0 (b0)]
    at com.sun.jna.Native.load(Native.java:618) ~[jna-5.14.0.jar:5.14.0 (b0)]
    at com.sun.jna.Native.load(Native.java:592) ~[jna-5.14.0.jar:5.14.0 (b0)]
    at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:83) ~[tess4j-5.13.0.jar:5.13.0]
    at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:42) ~[tess4j-5.13.0.jar:5.13.0]
    at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:353) ~[tess4j-5.13.0.jar:5.13.0]
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:202) ~[tess4j-5.13.0.jar:5.13.0]
    at net.sourceforge.tess4j.ITesseract.doOCR(ITesseract.java:59) ~[tess4j-5.13.0.jar:5.13.0]
```

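#### Fix {#解决}

Copy the native library to one of the paths listed in the error. In my case I used `/Users/nukix/Library/Frameworks/tesseract.framework/tesseract`.

Create the folder first:

```
mkdir -p /Users/nukix/Library/Frameworks/tesseract.framework
```

Then copy the library file there under the name `tesseract`. The path below is from my install; the exact file name under `lib` can be confirmed with `brew list tesseract`.

```
cp /opt/homebrew/Cellar/tesseract/5.5.0/lib/libtesseract.dylib /Users/nukix/Library/Frameworks/tesseract.framework/tesseract
```
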
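As an alternative to copying the file, JNA also honors the `jna.library.path` system property, which the commented-out line in the tutorial code hints at. The following is a rough sketch of that idea, not the approach I actually used: the class `JnaLibraryPath` and its helper are hypothetical, and the two candidate directories are the usual Homebrew library locations, which is an assumption about your setup.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

// Hypothetical helper class, not part of JNA or Tess4J.
class JnaLibraryPath {
    /** Returns the first candidate directory that contains the named library file. */
    static Optional<Path> findLibraryDir(List<Path> candidates, String libraryFile) {
        return candidates.stream()
                .filter(dir -> Files.exists(dir.resolve(libraryFile)))
                .findFirst();
    }

    public static void main(String[] args) {
        List<Path> candidates = List.of(
                Path.of("/opt/homebrew/lib"), // Homebrew prefix on Apple Silicon (assumed)
                Path.of("/usr/local/lib"));   // Homebrew prefix on Intel Macs (assumed)
        Optional<Path> dir = findLibraryDir(candidates, "libtesseract.dylib");
        // Must be set before the first Tess4J call, because the stack trace shows
        // the native library being loaded when TessAPI is first initialized.
        dir.ifPresent(d -> System.setProperty("jna.library.path", d.toString()));
        System.out.println("jna.library.path = " + System.getProperty("jna.library.path"));
    }
}
```

The same value can be passed on the command line as `-Djna.library.path=...`, which avoids touching the code at all.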
### 2.4 Error opening data file ./eng.traineddata {#error-opening-data-file-.eng.traineddata}
#### Problem {#问题-1}

Once the library loads, Tesseract will most likely report that it cannot find the language pack.

```
Error opening data file ./eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
```

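#### Fix {#解决-1}

Set the `TESSDATA_PREFIX` environment variable. In IDEA this is under `Run -> Edit Configurations... -> Environment variables`.

In my case the setting is `TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata`, the same `tessdata` directory mentioned earlier.

Run the program again and it should print the recognized text.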
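
If you would rather not depend on the run configuration, the data directory can also be resolved in code and handed to Tess4J through `setDatapath`, the method that appears in the commented-out tutorial lines. This is only a sketch under assumptions: `TessdataPath` and `resolve` are hypothetical names, and the fallback is the Homebrew path from my machine.

```java
import java.util.Map;

// Hypothetical helper class, not part of Tess4J.
class TessdataPath {
    // Assumed fallback: the tessdata directory from this post's Homebrew install.
    static final String DEFAULT_TESSDATA =
            "/opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata";

    /** Prefers TESSDATA_PREFIX from the given environment, else the fallback. */
    static String resolve(Map<String, String> env, String fallback) {
        String fromEnv = env.get("TESSDATA_PREFIX");
        return (fromEnv == null || fromEnv.isBlank()) ? fallback : fromEnv;
    }

    public static void main(String[] args) {
        String dataPath = resolve(System.getenv(), DEFAULT_TESSDATA);
        System.out.println("tessdata directory: " + dataPath);
        // With Tess4J on the classpath, this value could be passed to
        // instance.setDatapath(dataPath) before calling doOCR.
    }
}
```

Taking the environment as a `Map` parameter keeps the lookup easy to exercise without changing real environment variables.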
References {#参考文献}
------------
* https://tesseract-ocr.github.io/tessdoc/Installation.html
* https://tesseract-ocr.github.io/tessdoc/AddOns.html#tesseract-wrappers
* https://tess4j.sourceforge.net/tutorial/
* https://segmentfault.com/a/1190000042039342