Does ANALYZE TABLE work with Spark 3.5 and are statistics exposed to AQE?

fred31330 81 Reputation points
2025-11-24T11:28:28.5+00:00

I am running a notebook in a Spark 3.5 cluster in which I am basically using MERGE to incrementally add to my delta tables.

While I was monitoring my spark log execution, I noticed messages like this:

2025-11-24 08:36:36,924 INFO EnsureOptimalPartitioningHelper [Thread-1477]: column stats for List(Z_ENTRY_DATE#692746) does not exist

2025-11-24 08:36:36,924 INFO EnsureOptimalPartitioningHelper [Thread-1477]: stats doesn't allow to use List(Z_ENTRY_DATE#692746), returning default shuffle keys

So I tried to compute statistics manually with the ANALYZE TABLE command. However, I am not sure whether those statistics are actually computed, since I cannot see anything with DESC EXTENDED or SHOW TBLPROPERTIES - only numFiles and sizeInBytes are shown.

So while ANALYZE TABLE executes successfully, does this command actually work?

If it does, how can I ensure that such statistics are exposed to Spark’s Adaptive Query Execution (AQE)?

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

4 answers

  1. Q&A Assist
    2025-11-24T11:28:39.84+00:00
    AI generated content: this answer was created with AI from Q&A Assist.

    The ANALYZE TABLE command works in Spark 3.5 and collects statistics about a specific table (or all tables in a given schema). These statistics help the query optimizer generate better plans. However, what the command collects and where the results are visible depend on the specific configuration and the context in which it is used.

    In your case, if you do not see the expected statistics after running ANALYZE TABLE, it may be due to limitations in how statistics are collected or exposed in your Spark environment. In particular, the statistics may not feed into Adaptive Query Execution (AQE) if they are not properly maintained or updated.

    To ensure that the statistics are exposed to AQE, run ANALYZE TABLE after any significant change to the table, such as after a MERGE operation. Additionally, for Unity Catalog managed tables, enabling predictive optimization runs ANALYZE automatically and keeps statistics up to date, which benefits AQE.
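    A minimal sketch of that workflow (the table, column, and key names below are placeholders, not taken from the question):

    ```sql
    -- Placeholder names, for illustration only.
    -- 1. Incremental load into the Delta table.
    MERGE INTO target_table AS t
    USING staging_updates AS s
      ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;

    -- 2. Recompute statistics afterwards so the optimizer sees the new data.
    ANALYZE TABLE target_table COMPUTE STATISTICS FOR ALL COLUMNS;
    ```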

    If you continue to experience issues with missing statistics, you may want to check the configurations related to statistics collection and AQE in your Spark setup, as well as ensure that the statistics are being computed correctly during the ANALYZE TABLE execution.



  2. Pilladi Padma Sai Manisha 495 Reputation points Microsoft External Staff Moderator
    2025-11-24T12:49:22.26+00:00

    Hi fred31330,
    Welcome to Microsoft Q&A!
    It sounds like you’re trying to understand whether the ANALYZE TABLE command works with Spark 3.5 and how to expose its statistics to Adaptive Query Execution (AQE).
    Yes, ANALYZE TABLE works in Spark 3.5 and is a key tool for improving query performance. When you run this command, Spark collects important statistics about your table: things like the number of rows, file sizes, and detailed stats on specific columns if you request them. While you may not see these detailed stats right away using commands like DESC EXTENDED or SHOW TBLPROPERTIES (which show only basic info such as file count), Spark uses these statistics internally to make smarter decisions when running your queries.

    To get the most out of ANALYZE TABLE and its statistics for Spark’s Adaptive Query Execution (AQE), here’s what you should do:
    • Run column statistics collection first:

      ANALYZE TABLE your_table COMPUTE STATISTICS FOR ALL COLUMNS;

    • If your data changes a lot, update the stats regularly so they stay accurate.
    • Make sure AQE is turned on by setting spark.sql.adaptive.enabled = true; this allows Spark to automatically optimize query plans using these stats.
    • Also enable related features like dynamic partition pruning to speed things up further.
    • If you’re using Unity Catalog with managed tables, predictive optimization can be enabled to keep stats fresh and automatically help performance.
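    The settings from the list above can be applied per session, for example (a sketch; check your runtime’s defaults, as some of these may already be enabled):

    ```sql
    SET spark.sql.adaptive.enabled = true;                          -- Adaptive Query Execution
    SET spark.sql.cbo.enabled = true;                               -- cost-based optimizer, which consumes ANALYZE stats
    SET spark.sql.optimizer.dynamicPartitionPruning.enabled = true; -- dynamic partition pruning
    ```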

    Even if you don’t see the stats directly, Spark is using them behind the scenes to help things like join choices and shuffle strategies during MERGE operations or other queries. Without these stats, Spark falls back to less efficient defaults, which explains the "column stats does not exist" message you saw. Running ANALYZE TABLE properly and keeping stats updated ensures your incremental writes and queries run smoothly and faster in Spark 3.5.
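    One way to check whether column-level statistics were actually recorded is the column variant of DESCRIBE EXTENDED, which prints min, max, num_nulls, distinct_count and so on once they exist (the table name below is a placeholder; the column name is taken from your log message):

    ```sql
    ANALYZE TABLE your_table COMPUTE STATISTICS FOR COLUMNS Z_ENTRY_DATE;

    -- Column-level stats: min, max, num_nulls, distinct_count, ...
    DESCRIBE EXTENDED your_table Z_ENTRY_DATE;
    ```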

    To assist you better, could you clarify:

    1. Are you executing ANALYZE TABLE on Delta tables?
    2. Is there any specific error or message you received that led you to question if statistics are computed correctly?
    3. Are there any specific queries or performance issues you are experiencing that you think are related to missing statistics?

    Hope this helps! Let me know if you have more questions.


  3. fred31330 81 Reputation points
    2025-11-24T17:05:06.4866667+00:00

    Hi @Pilladi Padma Sai Manisha

    Thanks for your reply.

    1. Yes - In case that makes any difference, my tables are not managed (i.e. I control the LOCATION where they are stored).
    2. I do not have any error - just the message that I mentioned above. From what I've read, AQE would somehow leverage stats that are stored as part of the metadata of the table itself, and since Synapse does not expose them, I was wondering whether this was working.
    3. It's just about the fact that logs keep on telling me about reverting to "default shuffle keys".

    I will launch it again and revert with my results tomorrow.


  4. fred31330 81 Reputation points
    2025-11-27T16:47:46.4133333+00:00

    Hi @Pilladi Padma Sai Manisha

    I was not able to query Synapse's Hive metastore at all, and the _delta_log did not show any statistics anywhere.

    I can, however, now see in stderr that statistics are being computed.

    Moreover, I noticed that ANALYZE TABLE does not refresh the cache - I need to subsequently call REFRESH TABLE to load the latest statistics so that they can be used downstream.
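    Sketched out, the order of operations described here is (the table name is a placeholder):

    ```sql
    ANALYZE TABLE my_table COMPUTE STATISTICS FOR ALL COLUMNS;

    -- ANALYZE alone did not invalidate the cached relation in my case;
    -- REFRESH TABLE reloads it so downstream queries see the new stats.
    REFRESH TABLE my_table;
    ```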

    Anyway, thanks for your support!

