Migrate from Azure SQL Hyperscale - Is ADX suitable?

Pål Kristian Halle 65 Reputation points
2024-08-05T22:20:57.7633333+00:00

Hi.

We want to migrate our time series table from Azure SQL Database (Hyperscale tier), and ADX looks promising.

I would like to do a POC first to see if it fits our needs, and I would be very grateful if someone can give me suggestions for different parameters for a good trade-off between performance, storage size and cost.

This is our current table:

CREATE TABLE [Timeseries] (
    [DeviceId]   int             NOT NULL, -- Unique across all tenants
    [Timestamp]  datetime2(3)    NOT NULL, -- Typically 10 seconds interval
    [Value]      float           NULL,     -- Often aggregated in queries
    [Details]    varchar(255)    NULL,     -- Usually null
    [InsertedAt] datetime2(3)    NOT NULL,
    CONSTRAINT   [PK_Timeseries] PRIMARY KEY CLUSTERED ([DeviceId], [Timestamp])
)

CREATE NONCLUSTERED INDEX [IX_Timeseries_InsertedAt]
ON [Timeseries] ([DeviceId], [InsertedAt])
INCLUDE ([Timestamp], [Value], [Details])

The table contains 200 billion rows spread on 80 000 unique DeviceIds. We have about 100 tenants, but are expanding fast. Most tenants have 100-500 devices, but a few have around 20 000 each.

Requirements:

  • 10 years retention period based on Timestamp
    • Historical values may be backfilled, therefore retention by ingestion time is undesirable
  • Delay between ingestion and visibility in query results must not be more than a few seconds
  • There must be some sort of confirmation from the client SDK that ingestion was successful
    • The data loggers are programmed to retransmit if the server was unable to save the data
  • Values may be modified afterwards, e.g. preliminary values are sent before final ones with the same timestamps
    • In such cases only the final (last ingested) value must be returned in query results (aggregated and non-aggregated)
  • We are only looking to migrate this specific table, so there will be no joins etc. in queries

The most typical queries are:

  • Get raw (non-aggregated) values filtered on DeviceId and Timestamp
  • Get hourly, daily and monthly aggregated values filtered on DeviceId and Timestamp
  • Get n last values filtered on DeviceId (order by Timestamp desc)
  • Get raw values filtered on DeviceId and InsertedAt

Questions:

  1. Is ADX suitable for this?
  2. Should we go with a single, multi-tenant database or one per tenant?
  3. Should we partition by DeviceId and/or Timestamp, or introduce a new TenantId column (if single db)? *
  4. What kind of partitioning? Hash, uniform range?
  5. Is the InsertedAt column needed when we have ingestion_time() ?

* Since DeviceId is unique across all tenants, it is not needed in our current design, but it could be added if it benefits performance/scaling.

I appreciate any feedback. Thanks.

Azure Data Explorer
Azure Data Explorer
An Azure data analytics service for real-time analysis on large volumes of data streaming from sources including applications, websites, and internet of things devices.
524 questions
0 comments No comments
{count} votes

Accepted answer
  1. phemanth 10,330 Reputation points Microsoft Vendor
    2024-08-06T09:53:21.33+00:00

    @Pål Kristian Halle

    Thanks for using MS Q&A platform and posting your query.

    Based on your requirements, ADX (Azure Data Explorer) seems like a suitable choice for migrating your time series table from Azure SQL Database (Hyperscale tier).

    Here's a detailed response to your questions:

    Is ADX suitable for this?

    Yes, ADX is designed for handling large-scale time series data and provides features that align with your requirements, such as:

    • High-performance ingestion and query capabilities
    • Support for time series data with high granularity (10-second intervals)
    • Efficient data retention and purging mechanisms
    • Ability to handle large amounts of data (200 billion rows)

    Should we go with a single, multi-tenant database or one per tenant?

    Considering your expansion plans and the varying number of devices per tenant, a single, multi-tenant database might be a better choice. This approach allows for:

    • Simplified management and maintenance
    • Better resource utilization
    • Easier scalability

    However, if you anticipate significant performance or security concerns, you may want to consider a separate database per tenant.

    Should we partition by DeviceId and/or Timestamp, or introduce a new TenantId column (if single db)?

    Partitioning by DeviceId and Timestamp can improve query performance. You can use a combination of both columns as the partition key. Introducing a new TenantId column can help with data organization and filtering, but it might not be necessary if DeviceId is unique across all tenants.

    What kind of partitioning? Hash, uniform range?

    Uniform range partitioning might be a better choice for your use case, as it allows for efficient querying and data retrieval based on the Timestamp column. Hash partitioning can lead to hotspots and reduced performance.

    Is the InsertedAt column needed when we have ingestion_time()?

    You can use the ingestion_time() function to track the ingestion time, making the InsertedAt column redundant. However, if you need to store the original insertion time for auditing or other purposes, you may want to keep the InsertedAt column.

    Hope this helps. Do let us know if you any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

    2 people found this answer helpful.

0 additional answers

Sort by: Most helpful

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.