Community Council | Posts: 4,920 | Thanked: 12,867 times | Joined on May 2012 @ Southern Finland
#21
Originally Posted by ljo View Post
Yes, I agree the Suomi-24 corpus is the best to start with.
Wouldn't that be a bit biased... taken from a forum which is full of halfwits banging their heads on marginal topics?
I am pretty sure we'd get a Tay.ai type prediction engine out of that corpus!
__________________
Dave999: Meateo balloons. What’s so special with em? Is it a ballon?
 
Posts: 36 | Thanked: 118 times | Joined on Nov 2018
#22
Originally Posted by juiceme View Post
Wouldn't that be a bit biased... taken from a forum which is full of halfwits banging their heads on marginal topics?
I am pretty sure we'd get a Tay.ai type prediction engine out of that corpus!
I don't know how it will work out. I have downloaded the files and uploaded them to the drive (65 GB). I can share a link if someone wants to try it out. If not, then I might try with the National Library's journal's Finnish n-grams by myself, because it is easier that way.
 

The Following User Says Thank You to FlyingAntero For This Useful Post:
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#23
Originally Posted by FlyingAntero View Post
I don't know how it will work out. I have downloaded the files and uploaded them to the drive (65 GB). I can share a link if someone wants to try it out. If not, then I might try with the National Library's journal's Finnish n-grams by myself, because it is easier that way.
OK. I bought a larger hard drive today since I have been hitting the storage limit over and over for a few weeks. So I could give it a try in a few days when I have migrated to the new drive.
 

The Following 2 Users Say Thank You to ljo For This Useful Post:
Posts: 1,414 | Thanked: 7,547 times | Joined on Aug 2016 @ Estonia
#24
With such a huge file, we may have to split it into smaller parts. Otherwise RAM will probably become an issue.
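A minimal sketch of the splitting idea, assuming the corpus is plain text with one sentence or post per line. The function names and the `lines_per_chunk` parameter are made up for illustration and are not part of presage or any existing tool. Each chunk is counted on its own, so only one chunk's worth of n-grams has to fit in RAM at a time; the partial counts could then be written to disk and merged later.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(lines: Iterable[str], lines_per_chunk: int) -> Iterator[List[str]]:
    """Yield successive lists of at most lines_per_chunk lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, lines_per_chunk))
        if not chunk:
            return
        yield chunk


def count_ngrams(lines: Iterable[str], n: int) -> Counter:
    """Count whitespace-tokenized n-grams; n-grams never cross a line break."""
    counts: Counter = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def chunked_ngram_counts(
    lines: Iterable[str], n: int, lines_per_chunk: int
) -> Iterator[Counter]:
    """Yield one partial n-gram Counter per chunk instead of one huge Counter."""
    for chunk in iter_chunks(lines, lines_per_chunk):
        yield count_ngrams(chunk, n)
```

An open text file can be passed as `lines`, since iterating over it yields one line at a time without loading the whole file.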
 

The Following 3 Users Say Thank You to rinigus For This Useful Post:
Posts: 36 | Thanked: 118 times | Joined on Nov 2018
#25
Originally Posted by ljo View Post
OK. I bought a larger hard drive today since I have been hitting the storage limit over and over for a few weeks. So I could give it a try in a few days when I have migrated to the new drive.
Nice! Here are the files:
 

The Following 3 Users Say Thank You to FlyingAntero For This Useful Post:
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#26
Originally Posted by FlyingAntero View Post
Nice! Here are the files:
Thanks, I will get on it as soon as my hard drive is replaced.
 

The Following 4 Users Say Thank You to ljo For This Useful Post:
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#27
Originally Posted by ljo View Post
Thanks, I will get on it as soon as my hard drive is replaced.
So, now there is something to test. I noticed some hyphenation here and there that felt a bit strange, but most of the words I typed were predicted. And it learns fast, so I can't make the same tests twice ...
I might need to adjust the dictionary size a bit, but as a non-native speaker I await your opinions before doing something more for Finnish.
I will try to find some time to continue to work on the hyphenation problems that are really annoying in Swedish at least.
 

The Following 3 Users Say Thank You to ljo For This Useful Post:
Posts: 36 | Thanked: 118 times | Joined on Nov 2018
#28
I had time to test it this morning and it seems to work pretty well after quick testing. I can confirm that there is a hyphenation problem with some words. However, it is not a big problem in normal use since the issue seems to be linked to compound words. Here are a few examples:
English: Finnish: my input: text-prediction
  • text input: tekstinsyöttö: tekstinsyö: tekstin-syö
  • shoe rack: kenkäteline: kenkäte: kenkä-te
  • (space) alien: avaruusolio: avaruusoli: avaruus-oli
I think that most Finns write compound words separately (tekstin and syöttö) and remove the space later (if they aren't too lazy). If you do that, the prediction knows those separate words.

I compared the text prediction with an Android phone, and both predictions worked quite similarly with the most common words. Sometimes the most obvious conjugation is among the last words in the list, but I believe that will improve with use (in Sailfish).

Also, the prediction knows every bad word in Finnish and some name-calling slang words. I believe that is not a surprise, since the corpus was from a forum.

EDIT: And I almost forgot: huge thanks to you, tusen tack!
 

The Following User Says Thank You to FlyingAntero For This Useful Post:
Posts: 1,414 | Thanked: 7,547 times | Joined on Aug 2016 @ Estonia
#29
Profanity is an issue, and it would be great to get rid of it. I had the same problem when composing the database for English; a large fraction of the time was spent on that. I would suggest filtering the database and removing all n-grams that include any of the words classified as "bad". For that, we need a list of the words (possibly as substrings). That would have to be provided by native speakers, though. Maybe such a list has already been composed somewhere...
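A rough sketch of that filtering step, assuming the n-grams are held as a mapping from word tuples to counts. The names `is_bad` and `filter_ngrams` and the mild placeholder words are hypothetical; the real database format and word list would differ. Exact words and substrings are kept as two separate lists, because substring matching catches more forms but can also hit innocent words.

```python
from typing import Dict, Iterable, Set, Tuple

Ngram = Tuple[str, ...]


def is_bad(word: str, bad_words: Set[str], bad_substrings: Iterable[str]) -> bool:
    """True if the word is on the exact list or contains a listed substring.

    Both lists are assumed to be lowercase already.
    """
    w = word.lower()
    return w in bad_words or any(s in w for s in bad_substrings)


def filter_ngrams(
    counts: Dict[Ngram, int],
    bad_words: Set[str],
    bad_substrings: Iterable[str] = (),
) -> Dict[Ngram, int]:
    """Drop every n-gram in which at least one word is classified as bad."""
    subs = tuple(bad_substrings)
    return {
        ngram: count
        for ngram, count in counts.items()
        if not any(is_bad(word, bad_words, subs) for word in ngram)
    }


# Placeholder data: "darn" is matched as a whole word, "heck" as a substring.
counts = {
    ("what", "the", "heck"): 3,
    ("oh", "hecking", "no"): 1,
    ("Darn", "it"): 2,
    ("darning", "socks"): 4,
    ("nice", "weather"): 5,
}
clean = filter_ngrams(counts, bad_words={"darn"}, bad_substrings=["heck"])
```

Here `("darning", "socks")` survives because "darn" is only on the exact-word list, while both n-grams containing "heck" are removed.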
 

The Following 2 Users Say Thank You to rinigus For This Useful Post:
Posts: 36 | Thanked: 118 times | Joined on Nov 2018
#30
Originally Posted by rinigus View Post
Profanity is an issue, and it would be great to get rid of it. I had the same problem when composing the database for English; a large fraction of the time was spent on that. I would suggest filtering the database and removing all n-grams that include any of the words classified as "bad". For that, we need a list of the words (possibly as substrings). That would have to be provided by native speakers, though. Maybe such a list has already been composed somewhere...
I can try to find that kind of list or make it myself. Should the list also include every inflected form of a specific word? Finnish words have dozens of inflected forms. Here are a few examples:
Word: run = juosta
  • I run = Minä juoksen
  • You run = Sinä juokset
  • He/she runs = Hän juoksee
Word: box = laatikko
  • The color of a box = Laatikon väri
  • Look at that box = Katso tuota laatikkoa
  • The cat went inside the box = Kissa meni laatikkoon
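One possible way around listing every form is to match on stems rather than full words. The sketch below uses the harmless example words above; `match_stems` is a made-up helper, and the stems were picked by hand. It also shows the catch: a single stem may not cover every form, since the stem "juoks" misses the basic form "juosta".

```python
from typing import Dict, Iterable, List


def match_stems(words: Iterable[str], stems: Iterable[str]) -> Dict[str, List[str]]:
    """Map each stem to the words that start with it (case-insensitive)."""
    stem_list = [s.lower() for s in stems]
    matches: Dict[str, List[str]] = {s: [] for s in stem_list}
    for word in words:
        w = word.lower()
        for s in stem_list:
            if w.startswith(s):
                matches[s].append(word)
    return matches


forms = [
    "juosta", "juoksen", "juokset", "juoksee",
    "laatikko", "laatikon", "laatikkoa", "laatikkoon",
]
matched = match_stems(forms, ["juoks", "laatik"])
```

With these stems, "laatik" catches all four forms of laatikko, while "juoks" catches the three conjugated forms but not "juosta", so some words would need more than one stem on the list.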
 

The Following User Says Thank You to FlyingAntero For This Useful Post:
Tags
predictive text, presage, text-prediction
