Write to Iceberg/Glue table from local PySpark session
Question {#heading}
I want to read from and write to an Iceberg table registered in AWS Glue, from my local machine, using Python.
I have already:
- Created an Iceberg table and registered it on AWS Glue
- Populated the Iceberg table with limited data using Athena
I can access (read-only) the remote Iceberg table from my local laptop using PyIceberg, and now I want to write data to it. The problem is that Athena imposes some strict limits on write operations, and at the end of the day I'd like to write to the Iceberg table using a dataframe-like interface from Python, and the only option seems to be PySpark for now.
So I'm trying to do it by running a local PySpark session on my laptop, using the configuration I found in these references:
- https://github.com/developer-advocacy-dremio/quick-guides-from-dremio/blob/main/icebergpyspark.md#aws-glue
- https://www.youtube.com/watch?v=ogP-HUmpmPk&ab_channel=Dremio
The setup code seems to run fine, with the prints very similar to the reference video:
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, FloatType, LongType, StructType,StructField, StringType
import pyspark
import os
conf = (
    pyspark.SparkConf()
        .setAppName('luiz-session')
        # Packages
        .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1,software.amazon.awssdk:bundle:2.20.18,software.amazon.awssdk:url-connection-client:2.20.18,org.apache.spark:spark-hadoop-cloud_2.12:3.2.0')
        # SQL extensions
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
        # Catalog configuration
        .set('spark.sql.catalog.glue', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.glue.catalog-impl', 'org.apache.iceberg.aws.glue.GlueCatalog')
        .set('spark.sql.catalog.glue.warehouse', "s3://my-bucket/iceberg-data")
        .set('spark.sql.catalog.glue.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
        # AWS credentials
        .set('spark.hadoop.fs.s3a.access.key', os.environ.get("AWS_ACCESS_KEY_ID"))
        .set('spark.hadoop.fs.s3a.secret.key', os.environ.get("AWS_SECRET_ACCESS_KEY"))
)
## Start Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark Running")
Now, when I try to run a query using this:
spark.sql("SELECT * FROM glue.iceberg_table LIMIT 10;").show()
I get the following error:
IllegalArgumentException: Cannot initialize Catalog implementation org.apache.iceberg.aws.glue.GlueCatalog: Cannot find constructor for interface org.apache.iceberg.catalog.Catalog
Missing org.apache.iceberg.aws.glue.GlueCatalog [java.lang.NoClassDefFoundError: software/amazon/awssdk/services/glue/model/InvalidInputException]
I've been trying to fix this by changing the conf and copying the Iceberg jar releases to the Spark home folder, but no luck so far. Overall, it has been a difficult and frustrating experience with Spark/Iceberg/Glue.
I hope someone can help me.
Answer 1 {#1}
Score: 0
In the end, the only way I found to develop locally with Glue and Iceberg was to use the amazon/aws-glue-libs:glue_libs_4.0.0_image_01 Docker image with DATALAKE_FORMATS=iceberg, and to remove the packages setting from the Spark configuration.
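As a rough illustration of what "removing the packages setting" could look like, here is a minimal sketch of the question's configuration with the `spark.jars.packages` entry made optional. It assumes the Glue container already puts the Iceberg and AWS SDK jars on the classpath when DATALAKE_FORMATS=iceberg is set, and that AWS credentials reach the session through the environment. The helper names `build_glue_iceberg_conf` and `create_session` are hypothetical, not part of any library.

```python
from typing import Dict, Optional


def build_glue_iceberg_conf(
    warehouse: str,
    catalog: str = "glue",
    packages: Optional[str] = None,
) -> Dict[str, str]:
    """Return Spark settings for an Iceberg catalog backed by AWS Glue.

    Hypothetical helper: the keys and values are the ones from the question.
    `packages` is left out by default, on the assumption that the Glue
    Docker image already ships the required jars.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    conf = {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }
    if packages:
        # Only needed when the jars are not already on the classpath.
        conf["spark.jars.packages"] = packages
    return conf


def create_session(conf: Dict[str, str], app_name: str = "glue-iceberg-local"):
    """Build a SparkSession from the settings above (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily; provided by the container

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Inside the container, `create_session(build_glue_iceberg_conf("s3://my-bucket/iceberg-data"))` would then replace the `SparkConf` block from the question; the bucket path is the placeholder used there.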
Refs:
- https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html
- https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/