Does ANALYZE TABLE work with Spark 3.5 and are statistics exposed to AQE?

fred31330 81 Reputation points
2025-11-24T11:28:28.5+00:00

I am running a notebook in a Spark 3.5 cluster in which I am basically using MERGE to incrementally add to my delta tables.

While I was monitoring my spark log execution, I noticed messages like this:

2025-11-24 08:36:36,924 INFO EnsureOptimalPartitioningHelper [Thread-1477]: column stats for List(Z_ENTRY_DATE#692746) does not exist

2025-11-24 08:36:36,924 INFO EnsureOptimalPartitioningHelper [Thread-1477]: stats doesn't allow to use List(Z_ENTRY_DATE#692746), returning default shuffle keys

So I tried to compute statistics manually with the ANALYZE TABLE command. However, I am not sure whether those statistics are actually computed, since I cannot see anything with DESC EXTENDED or SHOW TBLPROPERTIES - only numFiles and sizeInBytes are shown.

So while ANALYZE TABLE executes successfully, does this command actually work?

If it does, how can I ensure that such statistics are exposed to Spark’s Adaptive Query Execution (AQE)?

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

4 answers

  1. Q&A Assist
    2025-11-24T11:28:39.84+00:00
    AI generated content: this answer was created with AI from Q&A Assist.

    The ANALYZE TABLE command works in Spark 3.5 and collects statistics about a specific table (or all tables in a given schema). These statistics help the query optimizer generate better plans. However, what the command collects and where the results are visible depend on the specific configuration and the context in which it is used.

    In your case, if you do not see the expected statistics after running ANALYZE TABLE, it may be due to limitations in how statistics are collected or exposed in your Spark environment. In particular, the statistics may not feed into Adaptive Query Execution (AQE) if they are not properly maintained or updated.

    To ensure that the statistics are exposed to AQE, run ANALYZE TABLE after any significant change to the table, such as after a MERGE operation. Additionally, for Unity Catalog managed tables, enabling predictive optimization runs ANALYZE automatically and keeps statistics up to date, which benefits AQE.
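    A minimal sketch of that workflow (the table, column, and key names below are placeholders, not taken from the question):

    ```sql
    -- Placeholder names, for illustration only.
    -- 1. Incremental load into the Delta table.
    MERGE INTO target_table AS t
    USING staging_updates AS s
      ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;

    -- 2. Recompute statistics afterwards so the optimizer sees the new data.
    ANALYZE TABLE target_table COMPUTE STATISTICS FOR ALL COLUMNS;
    ```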

    If you continue to experience issues with missing statistics, you may want to check the configurations related to statistics collection and AQE in your Spark setup, as well as ensure that the statistics are being computed correctly during the ANALYZE TABLE execution.



  2. Pilladi Padma Sai Manisha 495 Reputation points Microsoft External Staff Moderator
    2025-11-24T12:49:22.26+00:00

    Hi fred31330,
    Welcome to Microsoft Q&A!
    It sounds like you’re trying to understand whether the ANALYZE TABLE command works with Spark 3.5 and how to expose its statistics to Adaptive Query Execution (AQE).
    Yes, ANALYZE TABLE works in Spark 3.5 and is a key tool for improving query performance. When you run this command, Spark collects important statistics about your table: things like the number of rows, file sizes, and detailed stats on specific columns if you request them. While you may not see these detailed stats right away using commands like DESC EXTENDED or SHOW TBLPROPERTIES (which show only basic info such as file count), Spark uses these statistics internally to make smarter decisions when running your queries.

    To get the most out of ANALYZE TABLE and its statistics for Spark’s Adaptive Query Execution (AQE), here’s what you should do:
    • Run column statistics collection first:

      ANALYZE TABLE your_table COMPUTE STATISTICS FOR ALL COLUMNS;

    • If your data changes a lot, update the stats regularly so they stay accurate.
    • Make sure AQE is turned on by setting spark.sql.adaptive.enabled = true; this allows Spark to automatically optimize query plans using these stats.
    • Also enable related features like dynamic partition pruning to speed things up further.
    • If you’re using Unity Catalog with managed tables, predictive optimization can be enabled to keep stats fresh and automatically help performance.
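    The settings from the list above can be applied per session, for example (a sketch; check your runtime’s defaults, as some of these may already be enabled):

    ```sql
    SET spark.sql.adaptive.enabled = true;                          -- Adaptive Query Execution
    SET spark.sql.cbo.enabled = true;                               -- cost-based optimizer, which consumes ANALYZE stats
    SET spark.sql.optimizer.dynamicPartitionPruning.enabled = true; -- dynamic partition pruning
    ```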

    Even if you don’t see the stats directly, Spark is using them behind the scenes to help things like join choices and shuffle strategies during MERGE operations or other queries. Without these stats, Spark falls back to less efficient defaults, which explains the "column stats does not exist" message you saw. Running ANALYZE TABLE properly and keeping stats updated ensures your incremental writes and queries run smoothly and faster in Spark 3.5.
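    One way to check whether column-level statistics were actually recorded is the column variant of DESCRIBE EXTENDED, which prints min, max, num_nulls, distinct_count and so on once they exist (the table name below is a placeholder; the column name is taken from your log message):

    ```sql
    ANALYZE TABLE your_table COMPUTE STATISTICS FOR COLUMNS Z_ENTRY_DATE;

    -- Column-level stats: min, max, num_nulls, distinct_count, ...
    DESCRIBE EXTENDED your_table Z_ENTRY_DATE;
    ```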

    To assist you better, could you clarify:

    1. Are you executing ANALYZE TABLE on Delta tables?
    2. Is there any specific error or message you received that led you to question if statistics are computed correctly?
    3. Are there any specific queries or performance issues you are experiencing that you think are related to missing statistics?

    Hope this helps! Let me know if you have more questions.


  3. fred31330 81 Reputation points
    2025-11-24T17:05:06.4866667+00:00

    Hi @Pilladi Padma Sai Manisha

    Thanks for your reply.

    1. Yes - In case that makes any difference, my tables are not managed (i.e. I control the LOCATION where they are stored).
    2. I do not have any error - just the message that I mentioned above. From what I've read, AQE would somehow leverage stats that are stored as part of the metadata of the table itself, and since Synapse does not expose them, I was wondering whether this was working.
    3. It's just about the fact that logs keep on telling me about reverting to "default shuffle keys".

    I will launch it again and revert with my results tomorrow.


  4. fred31330 81 Reputation points
    2025-11-27T16:47:46.4133333+00:00

    Hi @Pilladi Padma Sai Manisha

    I was not able to query Synapse's Hive metastore at all, and the _delta_log did not show any statistics anywhere.

    I can, however, now see in stderr that statistics are being computed.

    Moreover, I noticed that ANALYZE TABLE does not refresh the cache - I need to subsequently call REFRESH TABLE to load the latest statistics so that they can be used downstream.
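    Sketched out, the order of operations described here is (the table name is a placeholder):

    ```sql
    ANALYZE TABLE my_table COMPUTE STATISTICS FOR ALL COLUMNS;

    -- ANALYZE alone did not invalidate the cached relation in my case;
    -- REFRESH TABLE reloads it so downstream queries see the new stats.
    REFRESH TABLE my_table;
    ```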

    Anyway, thanks for your support!

