Spark driver stopped unexpectedly (Databricks)
Question {#heading}
I have a Python notebook in Azure Databricks which performs a for loop with 137 iterations. For each iteration, it calls another Scala notebook using `dbutils.notebook.run`. The Scala notebook creates a DataFrame from a query to a MongoDB database. I create a global temporary view in the Scala notebook using `df.createOrReplaceGlobalTempView("<<view_name>>")` because I need to recover this data from the Python notebook and keep processing it. The code that reads it from the Python notebook looks like this:
```python
global_temporary_database = spark.conf.get("spark.sql.globalTempDatabase")

for _ in range(137):
    dbutils.notebook.run(path="<<path_to_scala_notebook>>", timeout_seconds=600, arguments=<<current_configuration>>)
    # Recover the data and drop the global temporary view
    df = table(f"{global_temporary_database}.<<view_name>>")
    spark.catalog.dropGlobalTempView("<<view_name>>")
    # Do some processing like filtering rows and renaming columns
```
This works for a limited number of iterations. However, when I try to run the whole loop, I get the following error:

```
The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached
```

I've tried using `time.sleep()` to add some delay between iterations to avoid overloading the cluster. I've also tried calling `spark.catalog.clearCache()` after each iteration, but it doesn't work.
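Concretely, both attempts were placed inside the loop roughly like this (a sketch; the 5-second delay is an arbitrary illustrative value):

```python
import time

for _ in range(137):
    dbutils.notebook.run(path="<<path_to_scala_notebook>>", timeout_seconds=600, arguments=<<current_configuration>>)
    # ... read the global temp view back, drop it, and process it as above ...
    spark.catalog.clearCache()  # clear cached data after each iteration (did not help)
    time.sleep(5)               # small delay between iterations (arbitrary value, did not help)
```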
Below are the cluster specs:
- Databricks Runtime Version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
- 2 Workers: 61 GB Memory, 8 Cores
- 1 Driver: 16 GB Memory, 4 Cores
Unfortunately, I need to use 2 notebooks because I'm using a Scala library to apply some operations to the DataFrames, and then I need to share this data between notebooks, so there's no way to avoid that part.
Any help would be appreciated.
Answer 1 {#1}
Score: 1
I got the same error in my environment while replicating the issue.

I removed `spark.stop()` from the code and ran it again. It ran successfully without any error.

For more information you can check the page below:
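To illustrate the pattern (a sketch of my replication, not the asker's exact code): in Databricks the `spark` session is shared and managed by the platform, so stopping it inside the loop brings the driver down.

```python
for i in range(137):
    df = spark.range(10)                          # stand-in for the real per-iteration work
    df.createOrReplaceGlobalTempView("demo_view")
    processed = spark.table("global_temp.demo_view").filter("id > 5")
    spark.catalog.dropGlobalTempView("demo_view")
    spark.stop()  # <-- stopping the shared session reproduces the error; removing this call fixed it
```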
Answer 2 {#2}
Score: 0
I resolved this by calling my Scala notebook only once and moving all the loop logic into the Scala notebook, which was a bit tedious. But I guess it's not a good idea to call a notebook 137 times, as it produces a lot of overhead.
The execution time has also improved significantly: it now takes around 20 minutes, whereas previously it ran for about 1 hour before the driver eventually stopped, without finishing.
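After the restructuring, the Python side reduces to a single call. The sketch below only illustrates the shape; the consolidated view name, timeout, and `<<all_configurations>>` argument are placeholders, not my actual code:

```python
global_temporary_database = spark.conf.get("spark.sql.globalTempDatabase")

# Single call: the 137 iterations now live inside the Scala notebook,
# which publishes one consolidated global temp view when it is done.
dbutils.notebook.run(
    path="<<path_to_scala_notebook>>",
    timeout_seconds=3600,                # longer timeout, since one run now covers all iterations
    arguments=<<all_configurations>>,    # placeholder: pass whatever the Scala loop needs
)

df = spark.table(f"{global_temporary_database}.<<consolidated_view_name>>")
spark.catalog.dropGlobalTempView("<<consolidated_view_name>>")
# ... filtering and renaming as before ...
```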