Query acceleration SQL language reference

Query acceleration supports an ANSI SQL-like language for expressing queries over blob contents. The query acceleration SQL dialect is a subset of ANSI SQL, with a limited set of supported data types, operators, etc., but it also expands on ANSI SQL to support queries over hierarchical semi-structured data formats such as JSON.

SELECT Syntax

The only SQL statement supported by query acceleration is the SELECT statement. This example returns every row for which expression returns true.

SELECT * FROM table [WHERE expression] [LIMIT limit]

For CSV-formatted data, table must be BlobStorage. This means that the query will run against whichever blob was specified in the REST call. For JSON-formatted data, table is a "table descriptor." See the Table Descriptors section of this article.

In the following example, for each row for which the WHERE expression returns true, this statement will return a new row that is made from evaluating each of the projection expressions.

SELECT expression [, expression ...] FROM table [WHERE expression] [LIMIT limit]

You can specify one or more specific columns as part of the SELECT expression (for example, SELECT Title, Author, ISBN).

Note

The maximum number of specific columns that you can use in the SELECT expression is 49. If you need your SELECT statement to return more than 49 columns, then use a wildcard character (*) for the SELECT expression (For example: SELECT *).

The following example returns an aggregate computation (For example: the average value of a particular column) over each of the rows for which expression returns true.

SELECT aggregate_expression FROM table [WHERE expression] [LIMIT limit]

The following example returns suitable offsets for splitting a CSV-formatted blob. See the Sys.Split section of this article.

SELECT sys.split(split_size)FROM BlobStorage

Data Types

Data Type Description
INT 64-bit signed integer.
FLOAT 64-bit ("double-precision") floating point.
STRING Variable-length Unicode string.
TIMESTAMP A point in time.
BOOLEAN True or false.

When reading values from CSV-formatted data, all values are read as strings. String values may be converted to other types using CAST expressions. Values may be implicitly cast to other types depending on context. For more information, see Data type precedence (Transact-SQL).

Expressions

Referencing fields

For JSON-formatted data, or CSV-formatted data with a header row, fields may be referenced by name. Field names can be quoted or unquoted. Quoted field names are enclosed in double-quote characters ("), may contain spaces, and are case-sensitive. Unquoted field names are case-insensitive, and may not contain any special characters.

In CSV-formatted data, fields may also be referenced by ordinal, prefixed with an underscore (_) character. For example, the first field may be referenced as _1, or the eleventh field may be referenced as _11. Referencing fields by ordinal is useful for CSV-formatted data that does not contain a header row, in which case the only way to reference a particular field is by ordinal.

Operators

The following standard SQL operators are supported:

Operator Description
= Compares the equality of two expressions (a comparison operator).
!= Tests whether one expression is not equal to another expression (a comparison operator).
<> Compares two expressions for not equal to (a comparison operator).
< Compares two expressions for lesser than (a comparison operator).
<= Compares two expressions for lesser than or equal (a comparison operator).
> Compares two expressions for greater than (a comparison operator).
>= Compares two expressions for greater than or equal (a comparison operator).
+ Adds two numbers. This addition arithmetic operator can also add a number, in days, to a date.
- Subtracts two numbers (an arithmetic subtraction operator).
/ Divides one number by another (an arithmetic division operator).
* Multiplies two expressions (an arithmetic multiplication operator).
% Returns the remainder of one number divided by another.
AND Performs a bitwise logical AND operation between two integer values.
OR Performs a bitwise logical OR operation between two specified integer values as translated to binary expressions within Transact-SQL statements.
NOT Negates a Boolean input.
CAST Converts an expression of one data type to another.
BETWEEN Specifies a range to test.
IN Determines whether a specified value matches any value in a subquery or a list.
NULLIF Returns a null value if the two specified expressions are equal.
COALESCE Evaluates the arguments in order and returns the current value of the first expression that initially doesn't evaluate to NULL.

If data types on the left and right of an operator are different, then automatic conversion will be performed according to the rules specified here: Data type precedence (Transact-SQL).

The query acceleration SQL language supports only a very small subset of the data types discussed in that article. See the Data Types section of this article.

Casts

The query acceleration SQL language supports the CAST operator, according to the rules here: Data type conversion (Database Engine).

The query acceleration SQL language supports only a tiny subset of the data types discussed in that article. See the Data Types section of this article.

String functions

The query acceleration SQL language supports the following standard SQL string functions:

Function Description
CHAR_LENGTH Returns the length in characters of the string expression, if the string expression is of a character data type; otherwise, returns the length in bytes of the string expression (the smallest integer not less than the number of bits divided by 8). (This function is the same as the CHARACTER_LENGTH function.)
CHARACTER_LENGTH Returns the length in characters of the string expression, if the string expression is of a character data type; otherwise, returns the length in bytes of the string expression (the smallest integer not less than the number of bits divided by 8). (This function is the same as the CHAR_LENGTH function
LOWER Returns a character expression after converting uppercase character data to lowercase.
UPPER Returns a character expression with lowercase character data converted to uppercase.
SUBSTRING Returns part of a character, binary, text, or image expression in SQL Server.
TRIM Removes the space character char(32) or other specified characters from the start and end of a string.
LEADING Removes the space character char(32) or other specified characters from the start of a string.
TRAILING Removes the space character char(32) or other specified characters from the end of a string.

Here's a few examples:

Function Example Result
CHARACTER_LENGTH SELECT CHARACTER_LENGTH('abcdefg') from BlobStorage 7
CHAR_LENGTH SELECT CHAR_LENGTH(_1) from BlobStorage 1
LOWER SELECT LOWER('AbCdEfG') from BlobStorage abcdefg
UPPER SELECT UPPER('AbCdEfG') from BlobStorage ABCDEFG
SUBSTRING SUBSTRING('123456789', 1, 5) 23456
TRIM TRIM(BOTH '123' FROM '1112211Microsoft22211122') Microsoft

Date functions

The following standard SQL date functions are supported:

  • DATE_ADD
  • DATE_DIFF
  • EXTRACT
  • TO_STRING
  • TO_TIMESTAMP

Currently, all date formats of standard IS08601 are converted.

DATE_ADD function

The query acceleration SQL language supports year, month, day, hour, minute, second for the DATE_ADD function.

Examples:

DATE_ADD(datepart, quantity, timestamp)
DATE_ADD('minute', 1, CAST('2017-01-02T03:04:05.006Z' AS TIMESTAMP)

DATE_DIFF function

The query acceleration SQL language supports year, month, day, hour, minute, second for the DATE_DIFF function.

DATE_DIFF(datepart, timestamp, timestamp)
DATE_DIFF('hour','2018-11-09T00:00+05:30','2018-11-09T01:00:23-08:00') 

EXTRACT function

For EXTRACT other than date part supported for the DATE_ADD function, the query acceleration SQL language supports timezone_hour and timezone_minute as date part.

Examples:

EXTRACT(datepart FROM timestampstring)
EXTRACT(YEAR FROM '2010-01-01T')

TO_STRING function

Examples:

TO_STRING(TimeStamp , format)
TO_STRING(CAST('1969-07-20T20:18Z' AS TIMESTAMP),  'MMMM d, y')

This table describes strings that you can use to specify the output format of the TO_STRING function.

Format string Output
yy Year in 2 digit format - 1999 as '99'
y Year in 4 digit format
yyyy Year in 4 digit format
M Month of year - 1
MM Zero padded Month - 01
MMM Abbreviated month of the Year - JAN
MMMM Full month - May
d Day of the month (1-31)
dd Zero padded day of the month (01-31)
a AM or PM
h Hour of the day (1-12)
hh Zero padded Hours of the day (01-12)
H Hour of the day (0-23)
HH Zero Padded hour of the day (00-23)
m Minute of the hour (0-59)
mm Zero padded minute (00-59)
s Second of of minutes (0-59)
ss Zero padded seconds (00-59)
S Fraction of seconds (0.1-0.9)
SS Fraction of seconds (0.01-0.99)
SSS Fraction of seconds (0.001-0.999)
X Offset in hours
XX or XXXX Offset in hours and minutes (+0430)
XXX or XXXXX Offset in hours and minutes (-07:00)
x Offset in hours (7)
xx or xxxx Offset in hour and minute (+0530)
Xxx or xxxxx Offset in hour and minute (+05:30)

TO_TIMESTAMP function

Only IS08601 formats are supported.

Examples:

TO_TIMESTAMP(string)
TO_TIMESTAMP('2007T')

Note

You can also use the UTCNOW function to get the system time.

Aggregate Expressions

A SELECT statement may contain either one or more projection expressions or a single aggregate expression. The following aggregate expressions are supported:

Expression Description
COUNT(*) Returns the number of records which matched the predicate expression.
COUNT(expression) Returns the number of records for which expression is non-null.
AVG(expression) Returns the average of the non-null values of expression.
MIN(expression) Returns the minimum non-null value of expression.
MAX(expression Returns the maximum non-null value of expression.
SUM(expression) Returns the sum of all non-null values of expression.

MISSING

The IS MISSING operator is the only non-standard that the query acceleration SQL language supports. For JSON data, if a field is missing from a particular input record, the expression field IS MISSING will evaluate to the Boolean value true.

Table Descriptors

For CSV data, the table name is always BlobStorage. For example:

SELECT * FROM BlobStorage

For JSON data, additional options are available:

SELECT * FROM BlobStorage[*].path

This allows queries over subsets of the JSON data.

For JSON queries, you can mention the path in part of the FROM clause. These paths will help to parse the subset of JSON data. These paths can reference to JSON Array and Object values.

Let's take an example to understand this in more detail.

This is our sample data:

{
  "id": 1,
  "name": "mouse",
  "price": 12.5,
  "tags": [
    "wireless",
    "accessory"
  ],
  "dimensions": {
    "length": 3,
    "width": 2,
    "height": 2
  },
  "weight": 0.2,
  "warehouses": [
    {
      "latitude": 41.8,
      "longitude": -87.6
    }
  ]
}

You might be interested only in the warehouses JSON object from the above data. The warehouses object is a JSON array type, so you can mention this in the FROM clause. Your sample query can look something like this.

SELECT latitude FROM BlobStorage[*].warehouses[*]

The query gets all fields but selects only the latitude.

If you wanted to access only the dimensions JSON object value, you could use refer to that object in your query. For example:

SELECT length FROM BlobStorage[*].dimensions

This also limits your access to members of the dimensions object. If you want to access other members of JSON fields and inner values of JSON objects, then you might use a query such as shown in the following example:

SELECT weight,warehouses[0].longitude,id,tags[1] FROM BlobStorage[*]

Note

BlobStorage and BlobStorage[*] both refer to the whole object. However, if you have a path in the FROM clause, then you'll need to use BlobStorage[*].path

Sys.Split

This is a special form of the SELECT statement, which is available only for CSV-formatted data.

SELECT sys.split(split_size) FROM BlobStorage

Use this statement in cases where you want to download and then process CSV data records in batches. That way you can process records in parallel instead of having to download all records at one time. This statement doesn't return records from the CSV file. Instead, it returns a collection of batch sizes. You can then use each batch size to retrieve a batch of data records.

Use the split_size parameter to specify the number of bytes that you want each batch to contain. For example, if you want to process only 10 MB of data at a time, your statement would look like this: SELECT sys.split(10485760)FROM BlobStorage because 10 MB is equal to 10,485,760 bytes. Each batch will contain as many records as can fit into those 10 MB.

In most cases, the size of each batch will be slightly higher than the number that you specify. That's because a batch cannot contain a partial record. If the last record in a batch starts before the end of your threshold, the batch will be larger so that it can contain the complete record. The size of the last batch will likely be smaller than the size that you specify.

Note

The split_size must be at least 10 MB (10485760).

See also