Migrate from Azure SQL Hyperscale - Is ADX suitable?

Question

Hi.

We want to migrate our time series table from Azure SQL Database (Hyperscale tier), and ADX looks promising.

I would like to do a POC first to see if it fits our needs, and I would be very grateful if someone can give me suggestions for different parameters for a good trade-off between performance, storage size and cost.

This is our current table:

CREATE TABLE [Timeseries] (
    [DeviceId]   int             NOT NULL, -- Unique across all tenants
    [Timestamp]  datetime2(3)    NOT NULL, -- Typically 10 seconds interval
    [Value]      float           NULL,     -- Often aggregated in queries
    [Details]    varchar(255)    NULL,     -- Usually null
    [InsertedAt] datetime2(3)    NOT NULL,
    CONSTRAINT   [PK_Timeseries] PRIMARY KEY CLUSTERED ([DeviceId], [Timestamp])
)

CREATE NONCLUSTERED INDEX [IX_Timeseries_InsertedAt]
ON [Timeseries] ([DeviceId], [InsertedAt])
INCLUDE ([Timestamp], [Value], [Details])

The table contains 200 billion rows spread on 80 000 unique DeviceIds. We have about 100 tenants, but are expanding fast. Most tenants have 100-500 devices, but a few have around 20 000 each.

Requirements:

10 years retention period based on Timestamp
- Historical values may be backfilled, therefore retention by ingestion time is undesirable
Delay between ingestion and visibility in query results must not be more than a few seconds
There must be some sort of confirmation from the client SDK that ingestion was successful
- The data loggers are programmed to retransmit if the server was unable to save the data
Values may be modified afterwards, e.g. preliminary values are sent before final ones with the same timestamps
- In such cases only the final (last ingested) value must be returned in query results (aggregated and non-aggregated)
We are only looking to migrate this specific table, so there will be no joins etc. in queries

The most typical queries are:

Get raw (non-aggregated) values filtered on DeviceId and Timestamp
Get hourly, daily and monthly aggregated values filtered on DeviceId and Timestamp
Get n last values filtered on DeviceId (order by Timestamp desc)
Get raw values filtered on DeviceId and InsertedAt

Questions:

Is ADX suitable for this?
Should we go with a single, multi-tenant database or one per tenant?
Should we partition by DeviceId and/or Timestamp, or introduce a new TenantId column (if single db)? *
What kind of partitioning? Hash, uniform range?
Is the InsertedAt column needed when we have ingestion_time() ?

* Since DeviceId is unique across all tenants, it is not needed in our current design, but it could be added if it benefits performance/scaling.

I appreciate any feedback. Thanks.

Accepted Answer

@Pål Kristian Halle

Thanks for using MS Q&A platform and posting your query.

Based on your requirements, ADX (Azure Data Explorer) seems like a suitable choice for migrating your time series table from Azure SQL Database (Hyperscale tier).

Here's a detailed response to your questions:

Is ADX suitable for this?

Yes, ADX is designed for handling large-scale time series data and provides features that align with your requirements, such as:

High-performance ingestion and query capabilities
Support for time series data with high granularity (10-second intervals)
Efficient data retention and purging mechanisms
Ability to handle large amounts of data (200 billion rows)

Should we go with a single, multi-tenant database or one per tenant?

Considering your expansion plans and the varying number of devices per tenant, a single, multi-tenant database might be a better choice. This approach allows for:

Simplified management and maintenance
Better resource utilization
Easier scalability

However, if you anticipate significant performance or security concerns, you may want to consider a separate database per tenant.

Should we partition by DeviceId and/or Timestamp, or introduce a new TenantId column (if single db)?

Partitioning by DeviceId and Timestamp can improve query performance. You can use a combination of both columns as the partition key. Introducing a new TenantId column can help with data organization and filtering, but it might not be necessary if DeviceId is unique across all tenants.

What kind of partitioning? Hash, uniform range?

Uniform range partitioning might be a better choice for your use case, as it allows for efficient querying and data retrieval based on the Timestamp column. Hash partitioning can lead to hotspots and reduced performance.

Is the InsertedAt column needed when we have ingestion_time()?

You can use the ingestion_time() function to track the ingestion time, making the InsertedAt column redundant. However, if you need to store the original insertion time for auditing or other purposes, you may want to keep the InsertedAt column.

Hope this helps. Do let us know if you any further queries.

If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

Share via

Migrate from Azure SQL Hyperscale - Is ADX suitable?

0 additional answers

Your answer