Jump to content
TNG Community
Joyaa

Database collation. French accents. Polish Lettering

Recommended Posts

Joyaa

I would like to include both French accents (eg ç) and Polish lettering (eg Ł) in the letters that display and can be collated (eg to c and L) on my site.

My TNG Admin > Languages are English-UTF8 and French-UTF8.
My TNG Admin > Setup > Configuration > General Settings > Language is ... Folder = English-UTF8, character set = UTF8

Question 1:  When I look in my phpMyAdmin, I can see a "server connection collation", and a "server charset" and then also a tab called "charsets".   Which do I adjust to achieve the correct outcome?

a)   “Server connection collation”, which is currently set (not by me) at “utf8mb4_unicode_ci”
b)   “Server charset”, which is “UTF-8 Unicode (utf8)”
c)   Lastly, there is a “tab” in phpMyAdmin called “charsets”.  This shows a lengthy table of Collations and accompanying Descriptions; within this table, various collations are highlighted, including: this one for utf8:   utf8_general_ci                 Unicode (multilingual), case-insensitive;    and this one for utf8mb4:   utf8mb4_general_ci      Unicode (multilingual), case-insensitive.

Do I need to change the server connection collation OR the server charset?
Do I need to make sure a particular "charset" in the long table of charset is highlighted  under the "charset" tab?
Question:  where the wiki advises using a certain collation for the lettering, is it referring to a) b) or c), or some combination of this above? 

Question 2: Which is the correct "collation sequence"?  I have looked through a number of the options given in a chart on the "Database Collation - Explain Choosing" page on the wiki -  http://collation-charts.org/mysql60/ - but to no avail.

Thank you for your help.  

myPhpAdmin_1.jpg

myPhpAdmin_2.jpg

Share this post


Link to post
Share on other sites
Ken Roy

The Collation Sequence refers to the Collation in the database structure for tables and the individual fields when you look at the structure for the individual database tables.

The server connection sequence and other fields you mentioned should not be changed.  Those are set by your hosting service for how they have MySQL server configuredl

Share this post


Link to post
Share on other sites
Joyaa

Many thanks Ken.  That starts to get me ontrack ...

The database collation is utf8_general_ci   
But I can see from the attached that I have added latin1_swedish_ci, presumably in order to collate French accents.

1.  This hasn't worked, and I imagine I should have chosen "utf8_swedish_ci".   (Yes?)
However, whilst this may help with French accents, it doesn't cover the Polish alphabet ... and if push comes to shove, I can see that I'll have to drop my desire for Polish as well as French.

(I presume that I can only apply a single collation sequence, rather than both 'swedish' and 'polish').

2.  Re. "individual database tables" ... aaah - thank you.  I can see these haven't been changed.

Hopefully, someone will be able to advise me on my need for Polish AND French.  Once done, I will revisit the wiki page on "Changing to UTF-8" and apply.
https://tng.lythgoes.net/wiki/index.php?title=Changing_to_UTF-8   

myPhpAdmin_3.jpg

Share this post


Link to post
Share on other sites
Ken Roy

You have to choose a collation sequence that will cover both your desire for Polish and French.  And yes, it must be utf8_xxxxxxxx_ci

See the TNG Wiki article on Database Collation - Explain Choosing  also look at the Mr. Barkov's Collation Charts link in the article

Share this post


Link to post
Share on other sites
Joyaa

Thanks Ken.  I'm aware of both the article and Mr Barkov's charts.  They don't give a Polish AND French option so far as I have managed to ascertain.   I'll keep digging.  And I'll also rework my way through "https://tng.lythgoes.net/wiki/index.php?title=Changing_to_UTF-8" over the weekend.   Once done, I'll report back.

Share this post


Link to post
Share on other sites
Joyaa

Hi Brent,
Thanks for looking ...

That's an extensive collation, but it looks to me like the 

L
4C
004C
l
6C
006C

  is not the same as the 
 

Ł
C581
0141
ł
C582
0142

 

 

Share this post


Link to post
Share on other sites
bhemph

Yes, that is true that the L with the stroke looks like it is sorted after the L without the stroke and not treated as being the same.  In the Polish collation http://collation-charts.org/mysql60/mysql604.utf8_polish_ci.html it looks like that is the same sorting treatment though.  If you look at the Wikipedia article on the Polish alphabet, the English article says "Note that (unlike in languages such as French) Polish letters with diacritics are treated as fully independent letters in alphabetical ordering. For example, być comes after bycie. The diacritic letters also have their own sections in dictionaries (words beginning with ć are not usually listed under c)."  When looking at the article in Polski, it seems to be that the language treats the letters in the way described in the English article by saying there are 32 letters.  The defined collations generally are trying to sort according to the alphabets of the languages and their rules so that people can set the sorting to work for their alphabet natively.  That unicode collation I found is trying to be as diverse as possible with the support of European languages, but because of the alphabet sorting conflict between a few languages that treat the same character in different ways cannot be perfect.  Therefore finding a collation that treats Polish letters under alphabetic rules for French is going to be difficult.  So it looks like if you want L and L stroke to be treated the same in the collation sorting, you may have to define and implement a new collation, or change all your L stroke characters to just be L.

Share this post


Link to post
Share on other sites
Joyaa

I'm digging away at this and not getting far at present.   Issues noted are:

1. This Wiki page -  https://tng.lythgoes.net/wiki/index.php?title=TNG_charset  
refers to   $charset = "UTF-8";
under UTF-8 / TNG7.
It may not be relevant but my config.php file has the lines:

$language = "English-UTF8";
$charset = "UTF8";

So the $charset line differs, which is confusing.  Which is correct?

2.  I used the rather aptly named PHOCA tool, trying Brent's suggestion of  http://collation-charts.org/mysql60/mysql604.utf8_unicode_ci.european.html


myPhpAdmin_5.jpg

 

Then I went through via phpMyAdmin and selected a raft of about a dozen tables, including tng_places, and selected the "Structure" for each, one at a time, and then selected fields with "Type" of "blob" or "text", and some of "varbinary( [a number] )" and manually changed the collation to utf8_unicode_ci.
NOTE: I couldn't change individual fields to utf8_unicode_ci, European alphabets as this option wasn't available.

Nevertheless, once I had done this, I noticed two things.
Firstly, that the "Collation" didn't record the new setting EXCEPT if the Type was "text" (which is very rare).  See below, which was saved AFTER I set the collation for the individual fields;  notice that the "collation" column is still full of blank entries.
myPhpAdmin_6.jpg

 

Secondly, that although each table still seems to have a setting of utf8_general_ci, there is a "bottom line" to the screenshot below that refers to "utf8_unicode_ci" (it is in bold).  This suggests to me that the database has acknowledged utf8_unicode_ci in some way.  (What way?).

myPhpAdmin_4.jpg

However, at the end of all of this, I seem to have got nowhere.

The current/existing problems are that searching on places and names etc are case specific, which is a nightmare for my users.   Searching on paris nets 6 places, each of which have the word 'parish' in the place notes.   Searching on Paris nets 20 places with the string 'Paris' either in the place name (i.e. the city in France) or the notes (usually a reference to Parish).  

Secondly, the collation simply isn't working.  If I search on 14ème I get places in Paris in the 14th arrondissement.    But if I search on 14eme I get nothing. 

Thoughts / ideas / suggestions gratefully received.

Share this post


Link to post
Share on other sites
Ken Roy

The TNG  charset in Admin > Setup > General Settings > Language section should have UTF-8 and the config.php have $charset = "UTF-8";

Share this post


Link to post
Share on other sites
Joyaa
22 minutes ago, Ken Roy said:

The TNG  charset in Admin > Setup > General Settings > Language section should have UTF-8 and the config.php have $charset = "UTF-8";

Thanks Ken.  Have implemented that change.  Have corrected the French language characters so now says Français.   Not done anything else yet.

Still seeing that searching for places & names requires me to be CaSE spECIfic ... which it shouldn't of course.
Also, searching places for     4eme  doesn't bring up  4ème   which I would expect that it should.

www.zausmerforest.com

user    -  TNG guest
passw  - tng101

Open to thoughts & suggestions.  I suspect there are clues in my post immediately upthread, but I'm not seeing the way through.    
 

Share this post


Link to post
Share on other sites
Ken Roy

Joyaa,

Years ago when I first implemented French on my site, I had to use utf8_swedish_ci  to get the search to ignore the French accents.

I don't know whether the accents get ignored when using utf8_general_ci.  I can only speak to the French part since I know nothing about the Polish language.

Share this post


Link to post
Share on other sites
Joyaa
16 hours ago, Ken Roy said:

// ... snip ... I don't know whether the accents get ignored when using utf8_general_ci.  I can only speak to the French part since I know nothing about the Polish language. ... snip ... //

Following Brent's advice, I am no longer looking for a collation sequence that might collate L-stroke with L.   So, forget about Polish.  Just considering French.

Re. utf8_general_ci --- I have no wish to use this.   I was trying to use utf8_unicode_ci, European alphabets (MySQL 6.0.4), however this has failed.

I now realise that my site's current inability to distinguish between upper case and lower case when searching names, places etc shows that my efforts at collating have failed even at this very basic level.

I am in the midst of having a go with utf8_swedish_ci.  I will report back very shortly.

Share this post


Link to post
Share on other sites
Joyaa

Using the Phoca script to apply "utf8_swedish_ci" DEFINITELY gives a much cleaner result than when I used Phoca to apply "utf8_unicode_ci, European alphabets".

I am posting TWO PAIRS of images below to help anyone who encounters a similar problem in the future.   Following these images is some potentially advice from my hosting provider, who suggest that I need to change the TNG database structure... !!   Comments on that please.

The first image used "utf8_unicode_ci, European alphabets" (believe it or not)  - 


myPhpAdmin_3.jpg

 

and this one used "utf8_swedish_ci":

myPhpAdmin_3-swedish.jpg

 

Here is the second pair.  Again, this used  "utf8_unicode_ci, European alphabets" 

 

myPhpAdmin_4.jpg

 

and this one used "utf8_swedish_ci":

 

myPhpAdmin_4-swedish.jpg

The bottom line seems to be that the Swedish collation sequence is properly recognised by the version of phpMyAdmin / MySQL / or something that I am using.  The unicode collation sequence was not recognised.

HOWEVER,  none of this has done anything for my collation error.  My searches still will not collate 'smith' with 'SMITH' or 'SMITH'.  Same issue with places.

I have posted a support ticket to my hosting provider, ICDSoft.  They have replied - and I would appreciate help from the Forum on this - 

 we noticed a potential problem with your database design, that appears to be causing the problems you are having - your text fields are of type "varbinary". Here is an excerpt from the MySQL manual, showing how searches work on varbinary fields:

This means they have the binary character set and collation, and comparison and sorting are based on the numeric values of the bytes in the values.


https://dev.mysql.com/doc/refman/5.7/en/binary-varbinary.html


For a test, I copied your database into a new database and changed the type of the "lastname" field in the "tng_people" table to "CHAR". After that, performing a sample test search:


 
  1. SELECT * FROM `tng_people` WHERE `lastname` LIKE "%ántares%"


returns the results as expected - it collates "á", "a", "A" and so on, so it properly shows the "ANTARES" name as result.

Our recommendation would be to convert your fields from the "VARBINARY" type, to a more suitable string type. 

This sounds drastic!   Has anyone else ever encountered advice like this?

Thanks, Joyaa

Share this post


Link to post
Share on other sites
Ken Roy

Joyaa,

What is the definition of lastname in tng_people? 

I see no varbinary fields in the TNG database.  The text fields are varchar or text for those that do not have limits on the string length.  So if you have varbinary fields, how did they get to be that way?

Share this post


Link to post
Share on other sites
Joyaa

Hi Ken,

Screenshot below shows varbinary all over the place (not just in this table either).  No idea how they got that way.   Absolutely none.   How does one best deal with a situation once it has got like this?   (I am guessing there are a no. of options).   Joyaa

myPhpAdmin_7 (varbinary).jpg

Share this post


Link to post
Share on other sites
Ken Roy

Joyaa,

You might check whether you have a database backup where the field was varchar instead of varbinary.

Alternatively, you might need to try creating a new database and see if you can restore the data content using the TNG Utilities

It is possible that one of your attempts to use Phoca to create the http://collation-charts.org/mysql60/mysql604.utf8_unicode_ci.european.html

caused the problem.  So check whether your hosting service has a database backup prior to Saturday it looks like in the above post sequences.

TNG does not use varbinary but rather varchar

Share this post


Link to post
Share on other sites
Joyaa

Thanks Ken.  Really appreciate your help.  

Re. cause - yes, I suppose using the Phoca script with unicode/European may have caused it ...   or perhaps when I had the language set in TNG to "UTF8" instead of "UTF-8"  - just a thought.

I use ICDSoft for hosting.  Isn't this the same as you?   Their backups only last 3 days, I think, but in any case, I performed a "permanent" back-up of the MySQL database on Saturday before I began this current round of tinkering.  I have also downloaded it as a mysql5.tar.gz file.   

(My last TNG>Utilities>backup was 4th Feb.  Rather too long ago, to be honest).

Re. your suggestion of creating a new database to see if my Saturday backup will have the same VARBINARY or not:
Obviously, it's no good my using phpMyAdmin to copy my existing database structure in phpMyAdmin, and restoring the data to this one, because the existing structure is the very problem. 
I am not experienced in setting up a new database, or where to set it up.   Using TNG's Setup>creation is clearly not the way to go, as it will erase my current database.  There's no wiki article on this (that I can see).  May I need to call for help from someone who uses TNG offline? (WAMP or similar? Is that the right thing?).  

I agree with you re VARBINARY.  TNG doesn't use it.  

Thanks again.  Joyaa

Share this post


Link to post
Share on other sites
Joyaa

I have now restored my back-up from 3 days ago.  This back-up preceded my move to change the character collation, but didn't precede a much earlier move to change my character set from ISO... to UTF-8.  

The back-up contains the same VARBINARY issues as I have now.  What to do next?

Route 1:
Can I simply use phpMyAdmin to run through each table, one by one, and change the field "types" to whatever they are meant to be (for which I imagine I will need to ask Darrin or someone for a list)?  Or might there be a certain "sequence" in which field types need to be changed, and might this require someone very familiar with the database structure to do this?

Route 2:
Back-up my TNG data using TNG utilities (as the previous one was done too long ago).   Then do a fresh install of the TNG database, and then restore the just-backed-up data.  (This is outside of my capability set, and I'll enlist someone's help with the fresh installation, how to work my modified home page plus the small no. of Mods I use into this install, and how to restore data from the old database into the new one).

Route 3:
Dunno.  But expertise here might.

Share this post


Link to post
Share on other sites
theKiwi
3 hours ago, Joyaa said:

The back-up contains the same VARBINARY issues as I have now.  What to do next?

There are no TNG tables that should be varbinary - this is why your searches have become case sensitive.

It might be possible to create a GEDCOM export of your site, then create a new set of tables using utf8_general_ci or maybe utf8_unicode_ci and then reimport the exported GEDCOM file to restore your data.

Roger

Share this post


Link to post
Share on other sites
Ken Roy

The only problem with a TNG GEDCOM export, is that TNG does not export everything that it supports.  If you have Cemetery entries in TNG, they do not get exported. 

Roger,

Do you know if utf8_general_ci or utf8_unicode_ci handles the French accented characters correctly.  I know that utf8_swedish_ci does since that is what I am using.

Share this post


Link to post
Share on other sites
theKiwi
7 hours ago, Ken Roy said:

Do you know if utf8_general_ci or utf8_unicode_ci handles the French accented characters correctly.  I know that utf8_swedish_ci does since that is what I am using.

The should, because accented French characters are part of the Unicode set of characters.

Whether it sorts them by the French rules would be a different matter - and that's where it gets hard, if not impossible, to select a collation that obeys more than one set of rules - e.g. sorting characters by say French and Polish - it is probably too much to expect that any one collation could do that as expected for both languages.

https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci

has some notes that suggest that unicode might be a better one to use.

I might tomorrow setup a new TNG install and try a utf8mb4_unicode_ci collation and see how that goes with TNG too.

Roger

Share this post


Link to post
Share on other sites
Joyaa
9 hours ago, Ken Roy said:

The only problem with a TNG GEDCOM export, is that TNG does not export everything that it supports.  If you have Cemetery entries in TNG, they do not get exported. 

Roger,

Do you know if utf8_general_ci or utf8_unicode_ci handles the French accented characters correctly.  I know that utf8_swedish_ci does since that is what I am using.

I use TNG as my only database / tool.  So yes, lots of cemetery entries , plus anything else that gedcoms might not handle.

To be clear on the collation sequence I am now aiming for: utf8_swedish_ci

I did try putting "utf8_unicode_ci, European alphabets" through the PHOCA script, but whilst this *appeared* to work, at the time of script running, I am unsure it really did.  I won't be re-trying that.   ((http://collation-charts.org/mysql60/mysql604.utf8_unicode_ci.european.html))

13 hours ago, Joyaa said:

Route 1:
Can I simply use phpMyAdmin to run through each table, one by one, and change the field "types" to whatever they are meant to be ...

I will post a separate thread on the forum or list about this in the hope that a MySQL expert can pick it put & run with it.  If it's possible, it would be far simpler - as my data is fine (so far as I can see), just the field types that are at fault.  Unfortunately my last back-up of table structures was 4 months ago, taken immediately prior to converting from TNG 10 to TNG 11, so I won't use that.

1 hour ago, theKiwi said:

...

I might tomorrow setup a new TNG install and try a utf8mb4_unicode_ci collation and see how that goes with TNG too.

Roger


I am still a little confused about this.  

At the start of the thread I wrote that my
a)   “Server connection collation”, which is currently set (not by me) at “utf8mb4_unicode_ci”
b)   “Server charset”, which is “UTF-8 Unicode (utf8)”

Am I correct in thinking that both "server connection collation" and "server charset" are different to the "collation" to the vast majority of the above discussion refers to?

Share this post


Link to post
Share on other sites
theKiwi
18 minutes ago, Joyaa said:

Am I correct in thinking that both "server connection collation" and "server charset" are different to the "collation" to the vast majority of the above discussion refers to?

Yes, they are different - the collation being talked about is how data is stored and sorted in the database.

The server connection collation is how the server talks to the database.

See https://dev.mysql.com/doc/refman/5.6/en/charset-connection.html

for some stuff that will make your head hurt LOL

Roger

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×