Should I worry about a high number of extents?

Question

Should I worry about a high number of extents?

Pål Kristian Halle 110

Hello ADX experts.

I am currently doing a POC to see if ADX will work for us. So far I enjoy the ADX developer experience.

Since I haven't found a clear recommendation for how to partition my table, I have created two tables where I ingest identical data. Then I do query tests on both tables.

Since this approach requires more resources and effort, I would like to conclude which partitioning settings to use (and drop one of the tables) before doing further tests.

Here is the table schema:

.create table [table] (
    MetricId: int,
    Timestamp: datetime,
    Value: real
)

Table A has two partitioning keys: MetricId (Hash, Uniform) and Timestamp Table B has one partitioning key: Timestamp Table A must define MetricId as string since a Hash key only accepts string and guid.

The query performance is noticeably better on Table A for the simple tests I have done so far with the current data. This is probably because all queries have the following filter:

| where MetricId in (...)

This will also be the case for future queries.

Here are the extent stats for both tables:

TableName  Rows            Extents       OriginalSize  CompressedSize  IndexSize  ExtentSize
Table_A    14,813,100,000  432,054       1.28 TB       48.37 GB          1.63 GB  50.00 GB
Table_B    14,813,100,000   13,299       1.30 TB       54.20 GB        256.25 MB  54.45 GB

As you can see, the number of extents and index size are considerably higher for Table A, but the total size is smaller.

Although it seems that Table A wins with regards to both query performance and storage size, there are also some cons with Table A:

MetricId must be string and has to be converted to/from int in application code
Range filters like between can't be used on strings in queries. This is however something I can live with.
Some control commands, e.g. .show table [table] extents takes about 30x longer than for Table B. This is probably due to the high number of extents.

So finally, here are my questions:

Are the high number of extents and large index size for Table A something to worry about? In this post we can see that a warning was given when the number of extents exceeded 500K.
The test tables currently contain 500 unique MetricIds with 30M unique timestamps each. If we decide to use this in production, the number of metrics will be in the range of 200K - 1M, but the number of timestamps per metric will be the same. I guess this will change the extent stats quite a bit. Based on this, is it possible to foresee whether we should go for Table A or Table B?
Why can't a hash partition key be a numeric type (int)?

Any feedback is greatly appreciated.

Accepted answer

0 additional answers

Your answer

Answer 1

@Pål Kristian Halle

Thanks for using MS Q&A platform and posting your query.

I'll address your questions one by one:

Should you worry about a high number of extents?

In general, a high number of extents can lead to: * Increased storage overhead due to additional metadata and index storage. * Slower performance for certain operations, like .show table [table] extents, as you've observed. * Potential issues with query performance, especially if the extent count grows exponentially.

However, in your case, the query performance is better on Table A, which has a higher extent count. This is likely due to the efficient filtering on MetricId using the hash partition key.

The warning threshold of 500K extents mentioned in the post you linked is not a hard limit. It's a guideline to encourage optimization and monitoring of extent growth. If your extent count continues to grow, it's essential to monitor performance and adjust your partitioning strategy as needed.

How will the extent stats change with more metrics and timestamps?

With more metrics (200K - 1M) and the same number of timestamps per metric (30M), the extent count will likely increase. However, the growth rate will depend on the distribution of data across the metrics.

If the data is evenly distributed across metrics, the extent count might not grow exponentially. On the other hand, if some metrics have a significantly larger number of timestamps, the extent count could increase more rapidly.

To mitigate this, you can consider: * Using a more granular partition key, like a combination of MetricId and a timestamp bucket (e.g., MetricId + Timestamp truncated to the hour or day level). * Implementing a data retention policy to remove older data and maintain a stable extent count.

Why can't a hash partition key be a numeric type (int)?

Hash partition keys in ADX are designed to work with string or GUID types. This is because the hash function used to distribute data across partitions is optimized for these data types.

Using a numeric type like int as a hash partition key could lead to uneven data distribution, as the hash function might not be able to effectively spread the data across partitions. This could result in poor query performance and increased storage overhead.

If you need to use a numeric type as a partition key, consider using a range-based partition key (like Timestamp) or a combination of columns (like MetricId + Timestamp).

In conclusion, while a high extent count can be a concern, it's not necessarily a deal-breaker in your case, given the better query performance on Table A. Monitor your extent growth and adjust your partitioning strategy as needed to ensure optimal performance and storage efficiency.

Hope this helps. Do let us know if you any further queries.

If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

Pål Kristian Halle 110 Reputation points

2024-08-13T19:51:21.5666667+00:00

Thank you for your detailed answer. There are so many parameters to play with, so your guidance is very helpful.

Interesting approach of using a partition key with a truncated datetime. I'm not sure how I can do that other than adding a new column and truncating it before ingesting.

When you say a combination, do you mean two separate partition keys or a single string key that is the concatenation of let's say MetricId + bin(Timestamp, 1d).

In either case I suspect that my queries would be much less effective since filtering will not be done on the partitioned column(s)?

I tried adjusting the RangeSize parameter for the datetime key from 1 to 12 days, but that didn't affect the number of extents (or any other extent stats).

One must-have feature is to have the retention policy honor the Timestamp column (not ingestion time), and it seems that the most efficient way to do this is with a datetime partition key that has OverrideCreationTime=true.
phemanth 15,765 Reputation points Microsoft External Staff Moderator

2024-08-14T09:10:15.35+00:00

@Pål Kristian Halle

You're correct that using a truncated datetime as a partition key would require adding a new column and truncating it before ingesting. This approach can be useful if you want to group data by a specific time granularity (e.g., daily, hourly).

When I mentioned a combination of columns, I meant a single string key that is the concatenation of multiple columns, such as MetricId + bin(Timestamp, 1d). This approach can help distribute data more evenly across partitions, but as you suspected, it might affect query performance if the filtering is not done on the partitioned column(s).

In your case, since you have a strong filtering requirement on MetricId, using a combination key might not be the best approach. Sticking with the two separate partition keys, MetricId and Timestamp, might be a better choice.

Regarding the RangeSize parameter, it's used to control the size of the ranges for range-based partition keys. Adjusting this parameter can affect the number of extents, but it might not have a significant impact in your case, especially if the data is relatively evenly distributed.

I completely agree with you on the importance of having a retention policy that honors the Timestamp column. Using a datetime partition key with OverrideCreationTime=true is a great approach to achieve this. This way, you can ensure that data is retained based on the Timestamp column, rather than the ingestion time.

Overall, it seems like you're on the right track with your current partitioning strategy. Monitoring extent growth and adjusting your partitioning strategy as needed will be crucial to ensure optimal performance and storage efficiency.
Pål Kristian Halle 110 Reputation points

2024-08-14T13:18:37.8433333+00:00

Perfect. Thanks again!

Share via

Should I worry about a high number of extents?

0 additional answers

Your answer