Posts: 10 | Thanked: 12 times | Joined on Jun 2013
#276
Originally Posted by ferlanero
The steps to do that are these:

- You need a Linux environment (I'm using Arch Linux, but Ubuntu or another distro works too)

- You need to download the tarball first: http://git.tuxfamily.org/okboard/okb...master.tar.bz2 and extract it in your home directory

- You need the dictionaries. I took mine from https://github.com/titoBouzout/Dictionaries but it needs to be adjusted, so I've attached the already processed file (see Spanish.dic.txt.zip on this post)

- You need corpus files for your language (e.g. Spanish):
http://corpora2.informatik.uni-leipzig.de/download.html
http://www.cs.upc.edu/~nlp/wikicorpus/
http://opus.lingfil.uu.se/OpenSubtitles2016.php
http://www.lllf.uam.es/ESP/Corlec.html
https://tatoeba.org/spa/downloads

- You need the "aspell-es" package (in the case of Spanish) installed from your distro's repos.

- You need the "lbzip2" package installed on your system too.

- You need "rsync" installed on your system.

- You need Qt 5 installed on your system.

- Now you need to create a folder somewhere and put the dictionary inside (e.g. /home/username/okboard/langs)

- If you have several corpus files, concatenate them into one:

Code:
cat file1 file2 file3 file4 file5 > corpus-es.txt
- Open a terminal window

- And set the two environment variables:

Code:
export CORPUS_DIR=/home/username/okboard/langs
Code:
export WORK_DIR=/home/username/okboard/langs
- You can see those variables with

Code:
echo $VARIABLE_NAME
if you're curious

- You need to compress the file (Spanish.dic.txt) that you put earlier in /home/username/okboard/langs:

Code:
bzip2 Spanish.dic.txt
- It should now be named corpus-$LANG.txt.bz2, in our case corpus-es.txt.bz2 because the language is Spanish

- There should be a single file inside.

- Next, in the same terminal window, change to the directory with the OKBoard files, in our case "/home/username/okb-engine-master/". This is where OKBoard's source code is.

Code:
cd /home/username/okb-engine-master/
- In the 'db' folder you must first create a lang-es.cf file. You can copy it from another .cf file in the same folder (e.g. copy lang-en.cf and rename the copy to lang-es.cf)

- And leave only ASCII characters in those files:

Code:
lbzip2 -d < corpus.txt.bz2 | clean_corpus.py | lbzip2 > new_corpus.txt.bz2
- Execute
Code:
db/build.sh es
("es" in case of Spanish)

- After this, the script creates the dictionaries for OKBoard, producing the following list of files:

add-words-fr.txt
es-predict.dict
lang-fr.cf
clusters-es.log
es-test.txt.bz2
lang-nl.cf
clusters-es.txt
es.tre
predict-es.db
corpus-es.txt.bz2
grams-es-full.csv.bz2
predict-es.ng
db.version
grams-es-learn.csv.bz2
predict-es.rpt.bz2
es-full.dict
grams-es-test.csv.bz2
predict-es.txt.bz2
es-full.tre
lang-en.cf
words-es.txt
es-learn.txt.bz2
lang-es.cf

- So now we have the Spanish dictionary created.

After this, I don't know what to do with these files, so any help is welcome.
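
For anyone who wants to script the preparation part, here is a rough sketch of the steps above. The tool names and the corpus-es.txt file name come from the steps; the temporary directory, the file1/file2 names and the sample text are made-up placeholders, and the compress and build steps are left exactly as described above.

```shell
# Sketch only: directory, input file names and sample text are placeholders.

# Print every required tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

echo "Missing tools: $(missing_tools aspell lbzip2 rsync | tr '\n' ' ')"

# The steps use /home/username/okboard/langs; a temporary directory keeps this harmless.
CORPUS_DIR=$(mktemp -d)
WORK_DIR=$CORPUS_DIR
export CORPUS_DIR WORK_DIR

# Stand-ins for the downloaded corpus files.
printf 'hola mundo\n' > "$CORPUS_DIR/file1"
printf 'buenos dias\n' > "$CORPUS_DIR/file2"

# Merge them into the single corpus file that gets compressed afterwards.
cat "$CORPUS_DIR/file1" "$CORPUS_DIR/file2" > "$CORPUS_DIR/corpus-es.txt"
wc -l < "$CORPUS_DIR/corpus-es.txt"
```

If the first line reports any missing tools, install them from your distro's repos before going on.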

-----------------------------------


I'm trying to add Finnish support to OKBoard, but I have to ask you guys for some tips. I'm not experienced in stuff like this. Anyway, here is my current checklist:

1) I have a Linux distribution to use

2) I've downloaded the OKBoard tarball

3) Dictionaries... There's no dictionary file for Finnish at the link provided.

4) Corpus file. I first tried to use http://www.corpora.heliohost.org/download.html but the file (the 2016 version) has a CRC error, so I ended up getting the Finnish version from here instead: http://opus.lingfil.uu.se/OpenSubtitles2016.php

5) I think Finnish spellchecking doesn't use aspell, but the Malaga-based Voikko: http://voikko.puimula.org/ and if I'm not mistaken, Voikko is used by ispell, for example. But how to get that Finnish dictionary file is still unclear to me.
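
About the CRC error in item 4: a download can be tested before unpacking it. A minimal sketch, assuming the file is gzip-compressed (I haven't confirmed the format of that particular download; for .bz2 files "bzip2 -t" plays the same role). The file names are made up, and the broken file is simulated by cutting a good one short.

```shell
# Assumption: gzip-compressed files; names and content are placeholders.
tmp=$(mktemp -d)
printf 'testi\n' | gzip > "$tmp/good.gz"

# Simulate a broken download by keeping only the first 10 bytes.
head -c 10 "$tmp/good.gz" > "$tmp/broken.gz"

# gzip -t tests the compressed file's integrity without writing any output.
check_archive() {
    if gzip -t "$1" 2>/dev/null; then
        echo "ok"
    else
        echo "corrupt"
    fi
}

check_archive "$tmp/good.gz"
check_archive "$tmp/broken.gz"
```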

Once all this is done I could try to move forward, but there still seems to be a lot of work. Also, what do you think: would it be good to include some additional sources too, such as more official ones ( http://kielitoimistonsanakirja.fi / http://kaino.kotus.fi/sanat/nykysuomi/ )? And with multiple sources, how can duplicates be removed easily?
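
For the duplicates, maybe something like this would do. It is only a sketch, assuming each source has already been turned into a plain text file with one word per line; the file names and the sample words are made up.

```shell
# Assumption: one word per line in each source file; names are placeholders.
tmp=$(mktemp -d)
printf 'talo\nkissa\nkoira\n' > "$tmp/source1.txt"
printf 'kissa\nvene\ntalo\n'  > "$tmp/source2.txt"

# sort -u merges the files and keeps one copy of each line.
# LC_ALL=C gives a plain byte-order sort, so the result doesn't depend on the locale.
LC_ALL=C sort -u "$tmp/source1.txt" "$tmp/source2.txt" > "$tmp/merged.txt"
cat "$tmp/merged.txt"
```

This only catches exact duplicates, so words that differ in case or spacing would still need to be normalized first.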
 

The Following 2 Users Say Thank You to uggeli For This Useful Post: