SQL Server 2016: storing foreign language data

Question

Thursday, May 23, 2019 9:50 PM

In a SQL Server 2016 database, I want to store data in one column for the following languages: Burmese, Arabic, Nepali, Spanish, Karen, and potentially English. I will declare the field nvarchar(max) so the foreign languages can be saved. The intent is to store letters written to different customers, all in the one field.

I would like to know the following for my requirements:

1. Is storing the information in nvarchar(max) best, or should I use a blob or some other data type?

2. Will the collation need to be changed at some level, such as the database or server level? If so, what needs to change, and can you show me how I would accomplish the change?

3. Do I need to change some 'fonts' at the database or server level? I am asking because 'fonts' need to change when working with some of these languages in Word 2016. If the fonts need to change, would you show me how to change them?

4. Is there anything else I would need to do to store any of the foreign languages I listed above? If so, what would I need to change?

All replies (8)

Friday, May 24, 2019 9:04 AM ✅Answered

Hi Wendy,

First, let's confirm that collation is very important when we use Unicode as well. It is important to remember that collation is a general parameter that actually represents multiple settings governing how the data is used, including how the data is sorted and how values are compared. The options associated with a collation are language rules, case sensitivity, accent sensitivity, kana sensitivity, width sensitivity, and variation-selector sensitivity.

The following example uses two different collations at the column level. When we execute the same query, sorting by the same column, the order of the result set is different:

/*************************************************** DDL */
DROP TABLE IF EXISTS MultipleLanguages1
DROP TABLE IF EXISTS MultipleLanguages2
GO
CREATE TABLE MultipleLanguages1 (
    id INT IDENTITY(1,2),
    UnicodeData NVARCHAR(200) Collate Hebrew_100_BIN
)
GO
CREATE TABLE MultipleLanguages2 (
    id INT IDENTITY(1,2),
    UnicodeData NVARCHAR(200) Collate Hebrew_CI_AS
)
GO
/*************************************************** DML */
INSERT INTO MultipleLanguages1 (UnicodeData)
VALUES(N'A 1'), (N'B 3'), (N'a 2')
GO
INSERT INTO MultipleLanguages2 (UnicodeData)
VALUES (N'A 1'), (N'B 3'), (N'a 2')
GO
/***************************************************  */
select * from MultipleLanguages1 ORDER BY UnicodeData
select * from MultipleLanguages2 ORDER BY UnicodeData
GO -- Order is different since the COLLATE is different!
/***************************************************  */

It is not "just" a matter of order! Different COLLATE might return totally different result SET, as you can see in the following example:

DROP TABLE IF EXISTS MyTbl;
Create Table MyTbl(ID Int,Coteret NVarchar(20) Collate Hebrew_CI_AI);
Go
Insert Into MyTbl Values(1,N'תן');
Insert Into MyTbl Values(2,N'תנין');
Insert Into MyTbl Values(3,N'תּן');
Insert Into MyTbl Values(4,N'תַּן');
Insert Into MyTbl Values(5,N'תנִין');
Insert Into MyTbl Values(6,N'תנ');
Go

Select * From MyTbl Where Coteret Collate Hebrew_CI_AI Like N'%תן%' Collate Hebrew_CI_AI --Return ID: 1,2,3,4,5,6
Select * From MyTbl Where Coteret Collate Hebrew_CS_AI Like N'%תן%' Collate Hebrew_CS_AI --Return ID: 1,3,4
Select * From MyTbl Where Coteret Collate Hebrew_CI_AS Like N'%תן%' Collate Hebrew_CI_AS --Return ID: 1,2,6
Select * From MyTbl Where Coteret Collate Hebrew_CS_AS Like N'%תן%' Collate Hebrew_CS_AS --Return ID: 1
GO

OK, now that it is clear that collation is important for Unicode as well, we can go back to the last question: which collation should we use to support multiple languages, and can we apply a collation at the row/value level?

SQL Server is a tabular database server, which means the data is logically organized in table structures. The logical hierarchy of entities is basically: server instance -> database -> table -> row -> column -> value. SQL Server, like other tabular databases, enforces the same properties for all values in a column. The column is the lowest entity in the table structure that carries such properties, which makes sense: the values in a column change dynamically, so they cannot be tied to the table structure individually. But this does not mean that we cannot control these properties at the value level. During execution we can use a different configuration at the query level in order to control how the values are treated.

The following article brings my tricks on how to sort data in multiple languages using different collations:

https://social.technet.microsoft.com/wiki/contents/articles/31194.t-sql-sort-data-by-multiple-languages.aspx
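
This is not the full trick from the article, just a minimal sketch of the mechanism it builds on, reusing the MultipleLanguages2 table from the example above: the COLLATE clause can be applied inside a single query, so the same NVARCHAR column can be sorted with different language rules without touching the table definition. (Arabic_100_CI_AS and Nepali_100_CI_AS are example collation names; check sys.fn_helpcollations() for the exact names available on your instance.)

-- Sort the same NVARCHAR column with different language rules per query;
-- the table-level collation is left unchanged.
SELECT UnicodeData FROM MultipleLanguages2 ORDER BY UnicodeData COLLATE Arabic_100_CI_AS
SELECT UnicodeData FROM MultipleLanguages2 ORDER BY UnicodeData COLLATE Nepali_100_CI_AS
GO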

Note! As I explained in my previous answer in this thread, "The collation in NVARCHAR type does not impact the content since it does not impact the encoding or the code page", which means that in theory we can use any collation we want at the table level, and still implement my tricks for multiple languages at the query level. With that being said, for most scenarios like this I would recommend choosing a binary collation, since it can improve performance in some cases.

  Ronen Ariely
 [Personal Site]    [Blog]    [Facebook]    [Linkedin]

Thursday, May 23, 2019 10:29 PM

Hi wendy elizabeth,

IMHO, answers for you:

  1. NVARCHAR(MAX) is the best choice.
  2. Collation can be defined at the database level and, most importantly for you, at the individual column level (a sketch follows this list). You would need to test how different collations affect ORDER BY and string functions.
  3. Fonts are installed at the operating system (OS) level. SSMS, or any other IDE, uses them from the OS.
  4. Nothing else, unless real life throws a monkey wrench.
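
A minimal sketch of point 2, with hypothetical database and table names (CustomerLetters, CustomerLetter): the collation can be set when the database is created and overridden per column.

-- Hypothetical names; the column-level collation overrides the database default
-- for that column only.
CREATE DATABASE CustomerLetters COLLATE Latin1_General_100_CI_AS
GO
USE CustomerLetters
GO
CREATE TABLE dbo.CustomerLetter (
    LetterId   INT IDENTITY(1,1) PRIMARY KEY,
    LetterText NVARCHAR(MAX) COLLATE Modern_Spanish_100_CI_AS  -- differs from the DB default
)
GO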

Additionally, the upcoming SQL Server 2019 adds support for UTF-8. It will allow using the VARCHAR(MAX) data type and, by doing that, could potentially reduce storage requirements by a factor of two.
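
If you do end up on SQL Server 2019, a minimal sketch of what that looks like (the _UTF8 collations exist only in SQL Server 2019 and later; the table name is made up):

-- SQL Server 2019+ only: a VARCHAR(MAX) column with a UTF-8 collation stores Unicode text.
CREATE TABLE dbo.LetterUtf8 (
    LetterText VARCHAR(MAX) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO dbo.LetterUtf8 (LetterText) VALUES (N'Hola'), (N'नमस्ते')
SELECT LetterText, DATALENGTH(LetterText) AS Bytes FROM dbo.LetterUtf8
GO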

Big thanks go to Ronen Ariely for clarifying the UTF-8 code point ranges and their impact on storage.


Thursday, May 23, 2019 11:09 PM


Good day wendy,

>> Is storing the information in a nvarchar(max) the best

No one in the forum can tell you what is best without being fully familiar with your system, your needs, your resources, and so on...

The only thing we can do in the forum is give golden rules and answer specific (focused) questions.

>> should I use a blob, or some other data type

In most cases, to support multiple languages you should use Unicode, and before SQL Server 2019 this means using the NVARCHAR/NCHAR types. With SQL Server 2019 you can also use VARCHAR with a UTF-8 collation.

>> Will the collation sequence at some level like the database, server level need to be changed?

The collation of an NVARCHAR column does not impact the content, since it does not impact the encoding or the code page, but it does impact other behaviors such as the sorting rules and the case and accent sensitivity.
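
If you still decide to change the default collation (point 2 of the question), a sketch using hypothetical names (LettersDB database, dbo.Letters table). Note that changing the database collation only affects objects created afterwards; existing columns keep their collation until they are altered one by one:

-- Change the database default collation (requires exclusive access to the database).
ALTER DATABASE LettersDB COLLATE Modern_Spanish_100_CI_AS
GO
-- An existing column keeps its collation until it is altered explicitly
-- (this fails if indexes or constraints depend on the column).
ALTER TABLE dbo.Letters
    ALTER COLUMN LetterText NVARCHAR(MAX) COLLATE Modern_Spanish_100_CI_AS
GO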

>> Do I need to change some 'fonts' in the database or server level?

There is no "font" in the database or server level. Fonts are symbols (images). Fonts are SET of composed of glyphs that share common design features. The symbols can represent characters and are related to the display of the data and not to the data. In other words FONTs related to the application side.

>> I am asking that question since 'fonts' need to change when accessing some of the languages in Word 2016. If the fonts need to change, would you show me how to change the fonts?

This depends on the application you develop. For example, if you develop a web application and you use HTML4, then you configure the font using the <font> tag. This has nothing to do with the database, except that you can store the name of the font in the database; for example, you can have a column holding the name of the font you want to use for each row.

>> Is there anything else I would need to do to store any of the foreign languages that I listed above?

That depends on your application and the way you use the data.

When you work with multiple languages, you should take into consideration issues related to non-deterministic behavior that can arise as a result of different formats, differences in configuration between languages (like the first day of the week), and so on... There is a lot to take into account when working with multiple languages, but these issues are not related to the column type you use for the string.
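
For example, here is one such language-dependent behavior (the first day of the week) that changes with the session language setting, completely independent of the column's collation:

-- SET LANGUAGE changes date formats, error-message language and the
-- first day of the week (@@DATEFIRST) for the session.
SET LANGUAGE us_english
SELECT @@DATEFIRST AS FirstDayOfWeek   -- 7 = Sunday
SET LANGUAGE Spanish
SELECT @@DATEFIRST AS FirstDayOfWeek   -- 1 = Monday
GO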

  Ronen Ariely
 [Personal Site]    [Blog]    [Facebook]    [Linkedin]

Friday, May 24, 2019 12:11 AM

Additionally, the upcoming SQL Server 2019 adds support for UTF-8. It will allow to use VARCHAR(MAX) datatype and by doing that reduce storage requirements by the factor of two.

Hi Yitzhak,

This is far from being accurate. The point is summarized on slide 23 of my lecture at the PASS SQLSaturday event:

As you can see there, only for the first 128 code points do you use half of the storage size, while for code points 128-2047 you do not save anything, since the size is exactly the same. More interesting, for all the code points between 2048 and 65535 you actually use MORE space if you move to UTF-8 from the old UTF-16.

* You can download all the demo code and the presentation from the lecture in the Microsoft gallery: https://gallery.technet.microsoft.com/UTF-8-in-SQL-Server-2019-8d97cca2
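
A small sketch of that point, assuming a SQL Server 2019 instance where a _UTF8 collation is available (the table name is made up): DATALENGTH shows exactly where UTF-8 wins and where it loses against NVARCHAR (UTF-16).

-- Compare storage of the same text as UTF-16 (NVARCHAR) and UTF-8 (VARCHAR + _UTF8).
CREATE TABLE dbo.EncodingSizes (
    Utf16Txt NVARCHAR(100),
    Utf8Txt  VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO dbo.EncodingSizes
VALUES (N'Hello', N'Hello'),     -- code points below 128
       (N'नमस्ते', N'नमस्ते')        -- Devanagari, code points 2048-65535
SELECT Utf16Txt,
       DATALENGTH(Utf16Txt) AS Utf16Bytes,  -- 2 bytes per character
       DATALENGTH(Utf8Txt)  AS Utf8Bytes    -- 1 byte (ASCII) or 3 bytes (Devanagari) per character
FROM dbo.EncodingSizes
GO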

  Ronen Ariely
 [Personal Site]    [Blog]    [Facebook]    [Linkedin]

Friday, May 24, 2019 2:51 AM

What should the collation value be for each of the following languages:

1. Burmese

2. Arabic

3. Nepali

4. Spanish

5. Karen

6. English

Can you have more than one collation on a database, and/or on the server? If so, can you show me how to set up the multiple collations?


Friday, May 24, 2019 6:10 AM

Collations are defined on column Level and matters for ANSI text = data type varchar; for Unicode = nvarchar the collation has no effect.

Olaf Helper

[ Blog] [ Xing] [ MVP]


Friday, May 24, 2019 9:49 PM

Collations are defined on column Level and matters for ANSI text = data type varchar; for Unicode = nvarchar the collation has no effect.

This is grossly wrong!

Collation affects sort order, comparison, upper/lower and a few more things, like which characters are defined. So collation matters a lot for nvarchar too.
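
A one-line illustration of that point, using an accent-insensitive versus an accent-sensitive collation (any such pair will do):

-- The same NVARCHAR comparison gives different answers under different collations.
SELECT CASE WHEN N'resume' = N'résumé' COLLATE Latin1_General_100_CI_AI
            THEN 'equal' ELSE 'different' END AS AccentInsensitive,   -- equal
       CASE WHEN N'resume' = N'résumé' COLLATE Latin1_General_100_CI_AS
            THEN 'equal' ELSE 'different' END AS AccentSensitive      -- different
GO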

Wendy's original questions have already been answered, but I would like to give my answers as well.

1) Yes, you need to use nvarchar. You only need MAX, if you want to permit more than 4000 characters to be added.

Then again, this presumes that you want to store the data as text only. Later you talk about fonts. If you want to store formatted data, it depends on the formatting you want to use. For instance, if you want to store PDFs, you should use varbinary(MAX). nvarchar assumes that you are storing text only, or possibly HTML-formatted text.

2) The choice of collation is not a trivial one. If you want to use a single column, you will need to pick a one-size-fits-all collation, and for the languages you have indicated I think Modern_Spanish_100_CI_AS is the best pick; why, I will return to in a second.

If you want to have a separate column per language, you could have Latin1_General_CI_AS for English, Modern_Spanish for Spanish, an Arabic collation for Arabic, and a Nepali collation for Nepali. But then it gets difficult, because I can't seem to find any collations for Burmese or Karen. I don't know why there aren't any, and I don't know if it matters. I know that Burmese has its own script, so presumably Burmese is always sorted the same, no matter the collation. How Karen is written I had no idea at all, but since the Karen people live in Burma, maybe they use the same script. (Checking Wikipedia, I find that this is indeed the case.)

So why did I suggest that Modern_Spanish_100_CI_AS would be the best pick? It will sort Spanish data correctly, and it will do a decent job with English too, since the only thing that is special is that ñ is a letter of its own in Spanish, and that character is rare in English text. The 100 is important, because in the 80 collations (those with no version number at all), the code points for the Nepali script have no sort weights defined, and I would assume that this applies to Burmese and Karen as well. (If you are really out of luck, Burmese and Karen may not be supported even in 100 collations. You can check what your instance actually offers; see the query at the end of this reply.)

3) As others have answered, in nvarchar(MAX) you are supposed to store plain text. No fonts or anything.

4) Almost certainly. Multi-language handling is not really a walk in the park, not least with such a disparate group of languages as you have. If you don't have a basic understanding of these languages, I recommend that you acquire that understanding. (And by that I don't mean that you learn to say "Hello" and "Thank you", but rather how the scripts work and what they look like. For instance, Arabic is written right-to-left, which certainly can cause some challenges.)
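
As mentioned under point 2, to check what a given instance actually offers, sys.fn_helpcollations() lists every installed collation; a quick sketch:

-- List the version-100 Spanish collations and check whether any Nepali collation exists.
SELECT name, description
FROM sys.fn_helpcollations()
WHERE name LIKE 'Modern_Spanish_100%'
   OR name LIKE 'Nepali%'
ORDER BY name
GO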

Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se


Monday, August 5, 2019 5:25 AM

Hi Erland,

>> You only need MAX, if you want to permit more than 4000 characters to be added.

This is not correct. 

The parameter n in NVARCHAR(n) defines the string size in byte-pairs, NOT the number of characters. When we use NVARCHAR, characters in the range 65536-1114111 use 4 bytes, which is two byte-pairs. Therefore, if for example you want to store only 2001 characters that are all in the range 65536-1114111, then you must still use NVARCHAR(MAX).

Let's try to insert 2001 such characters into an NVARCHAR(4000) column => this returns an error when the characters are in the range 65536-1114111:

CREATE DATABASE HebrewDB COLLATE Hebrew_100_CS_AS_SC
GO
use HebrewDB
GO

DROP TABLE IF EXISTS T
GO
CREATE TABLE T (Txt NVARCHAR(4000))
GO

DECLARE @T NVARCHAR(MAX) = nchar(78178)
SELECT LEN(REPLICATE(@T, 2001)) -- clearly only 2001 characters
INSERT T (Txt) values(REPLICATE(@T, 2001))
GO
-- String or binary data would be truncated in table 'HebrewDB.dbo.T', column 'Txt'...
-- We need to use MAX in this case even though we do not want to permit more than 4000 characters;
-- we want only 2001 characters

  Ronen Ariely
 [Personal Site]    [Blog]    [Facebook]    [Linkedin]