Originally Posted by mattiviljanen
First, huge thanks to @eber42 for making OKboard!

I'm working on a Finnish dictionary, but it's really hard to find quality corpora. I'm currently experimenting with Wikipedia-based and news-based ones, but it seems I need bigger and better corpora... Does anyone know any good sources?

I did manage to get the thing to build (by cruelly skipping the very last step, which fails and suggests a bigger corpus - I wanted a proof of concept and won't be skipping that test in the release version), but there are problems. I cut the original word list in half, but I'm still getting a rather huge (12...30 MB) fi.tre file; predict-fi.db is 26 kB and predict-fi.ng is 813 kB. In comparison, the English en.tre is below two megabytes... As a result, the delay between the gesture and the word appearing is, to put it mildly, very noticeable. What would be a good size to aim at?

Thanks all!
For Estonian, I contacted an academic lab and got the corpus from them. I expect you could do the same for Finnish: find a Finnish language institute and they may be able to help you.

The formats are different, but for presage in its new Marisa-based format (trie and counts stored separately), I have a 12 MB database for Estonian. For English, it's 6 MB. So it's expected that languages such as Finnish end up with a larger database.
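To make the "trie and counts stored separately" idea concrete, here is a minimal sketch using the Python marisa_trie package. This only illustrates the general technique; presage's actual on-disk format may differ, and the sample n-grams and counts are made up:

import array
import marisa_trie

# Hypothetical n-gram counts (space-joined keys).
ngram_counts = {
    "tere": 120,
    "tere tulemast": 35,
    "head aega": 18,
}

# The trie stores only the keys; MARISA assigns each key a stable id.
trie = marisa_trie.Trie(ngram_counts.keys())

# Counts live in a separate plain array, indexed by the trie's key id.
counts = array.array("L", [0] * len(trie))
for key, count in ngram_counts.items():
    counts[trie[key]] = count

def lookup(ngram):
    """Return the stored count, or 0 for an unseen n-gram."""
    return counts[trie[ngram]] if ngram in trie else 0

print(lookup("tere tulemast"))  # -> 35

A lookup is then two cheap steps - a trie query for the id, then an array read - which keeps the trie itself compact even for morphologically rich languages.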

To regulate the size of the database, you would have to increase or decrease the n-gram cut-off count: n-grams seen fewer times than the cut-off are dropped. Keep the corpus at full size and just change that parameter.
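Here is a toy illustration of how raising the cut-off shrinks the database (at the cost of prediction coverage). This is not presage's build pipeline - the corpus file name and the cut-off values are just examples:

from collections import Counter

def count_ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def apply_cutoff(ngrams, cutoff):
    """Keep only n-grams seen at least `cutoff` times."""
    return {g: c for g, c in ngrams.items() if c >= cutoff}

# Assumes a whitespace-tokenizable corpus in corpus.txt (hypothetical path).
tokens = open("corpus.txt", encoding="utf-8").read().split()
bigrams = count_ngrams(tokens, 2)
for cutoff in (1, 2, 5, 10):
    kept = apply_cutoff(bigrams, cutoff)
    print(f"cutoff={cutoff}: {len(kept)} of {len(bigrams)} bigrams kept")

Since rare n-grams dominate the tail of the distribution, even a small increase in the cut-off typically removes a large fraction of entries.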

An additional note - we are missing Finnish among the languages supported by the presage-based predictor. Would you mind generating an n-gram database for that too?
 
