Poll: What advanced text entry method(s) would you like to see on Sailfish?
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#231
Originally Posted by ferlanero View Post
Currently working on the Swedish language
Err, why? I already maintain the Swedish language resources, and I published them during New Year's weekend.
 
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain
#232
Originally Posted by ljo View Post
Err, why? I already maintain the Swedish language resources, and I published them during New Year's weekend.
Ha, ha! True, I hadn't noticed that! Sorry. I'm focusing on Portuguese now.
 
Posts: 27 | Thanked: 35 times | Joined on Jan 2016 @ Sweden
#233
Originally Posted by ljo View Post
@spidernik84 et al, this should rather be between 0.7 and 1.8 million word forms, but not much more, based on the 92034 stems (roughly what we count as words). That is about the size of a standard working vocabulary in other Latin-script languages such as French (0.63 million aspell word forms). So there is something wrong with the assumptions in the expansion processing.
I think you are right. Another generation attempt just failed (it ran out of 20 GB of RAM plus 5 GB of swap...).
I did a comparison with the English language, this is what I see:

Code:
nico@hendrix:~/aspell/aspell6-it-2.4-20070901-0$ aspell -l en dump master | aspell -l en expand | wc
 119789  119789 1153336
nico@hendrix:~/aspell/aspell6-it-2.4-20070901-0$ aspell -l it dump master | aspell -l it expand | wc
  95193 36636439 655315062
The number of words generated for the Italian language is INSANE.
You seem to know a lot about this. Do you have any idea what can be done to keep the dictionary smaller? I've been searching for alternative aspell dictionaries with no luck...
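
As a rough reading of the two wc outputs above: wc prints lines, words and bytes, and the output suggests that aspell expand puts each base entry and its expanded forms on one line, so words divided by lines approximates the average number of forms per entry. A small Python sketch of that arithmetic follows; expansion_ratio is only an illustrative helper name, and the one-entry-per-line reading is an assumption.

```python
def expansion_ratio(wc_output: str) -> float:
    """Return words per line from a `wc` output line (lines, words, bytes)."""
    lines, words, _bytes = (int(field) for field in wc_output.split())
    return words / lines


# Figures copied from the two wc runs quoted in the post.
english = expansion_ratio("119789  119789 1153336")
italian = expansion_ratio("95193 36636439 655315062")
print(f"en: {english:.1f} forms per entry, it: {italian:.1f} forms per entry")
```

Under that assumption, the English list yields exactly one form per entry, while the Italian list averages roughly 385 forms per entry, which is where the 36 million words come from.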

Thanks. I surely hope we don't need to rent a Cray cluster to generate this dict...
 

The Following 3 Users Say Thank You to spidernik84 For This Useful Post:
Posts: 86 | Thanked: 362 times | Joined on Dec 2007 @ Paris / France
#234
As discussed with spidernik84, the Italian aspell dictionary contains 34M words once affixes are expanded (using the affix expansion support that was added for Spanish).
Try this:
Code:
aspell -l it dump master | aspell -l it expand | wc -w
In the current process, aspell is used to filter out misspelled words (because the available texts sometimes contain errors).

Even if we fix the corpus reader script, the keyboard has not been built to work with this volume: my largest language (French) contains ~100k words (and only 45k are used by the word prediction engine; the others are in "best effort" mode).

From a quick look I see the following causes for the large size:
  • Lots of words are repeated with prefixes such as dall', sull'. At the moment my model handles words joined by apostrophes or hyphens as single words, so words with different prefixes are treated as different words. The OKBoard roadmap contains an item for managing prefixes and suffixes (explicit ones with punctuation signs, or linked together as in German) as distinct words, but I don't know when (or if) I will work on it.
  • Some words are repeated multiple times with odd capitalization. Are these different words: Sull'Acclimatatele, sull'Acclimatatele, sull'acclimatatele? At the moment words with different capitalizations are treated as different words (unless they are at the beginning of a sentence), but the case of a word with two different capitalizations is not handled very well.


Spidernik84's text corpus only contains 315k words (counting only those that aspell also knows), so my short-term suggestion is to add an option either to provide a (smaller) dictionary instead of aspell's, or to trust the input corpus to be flawless.
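
The two options in that suggestion can be sketched in a few lines of Python. This is only an illustration of the idea, not OKBoard's corpus reader: count_corpus_words and its dictionary parameter are hypothetical names, and the corpus is assumed to be already split into tokens.

```python
from collections import Counter


def count_corpus_words(tokens, dictionary=None):
    """Count corpus tokens, optionally keeping only known words.

    dictionary=None means "trust the corpus": every token is counted.
    Otherwise only tokens present in the supplied word list are kept.
    """
    if dictionary is None:
        return Counter(tokens)
    allowed = set(dictionary)
    return Counter(token for token in tokens if token in allowed)


tokens = ["il", "gatto", "gato", "gatto"]  # "gato" stands in for a typo
filtered = count_corpus_words(tokens, dictionary=["il", "gatto"])
trusted = count_corpus_words(tokens)
print(filtered)
print(trusted)
```

With a small custom word list the typo is dropped; without one it is counted like any other word, which is the risk of trusting the corpus.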

What do you think?

Edit: ouch, spidernik84 was faster with wc
 

The Following 5 Users Say Thank You to eber42 For This Useful Post:
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#235
Originally Posted by spidernik84 View Post
The number of words generated for the Italian language is INSANE.
You seem to know a lot about this. Do you have any idea what can be done to keep the dictionary smaller? I've been searching for alternative aspell dictionaries with no luck...
I reduced the size by 3/4 by removing different capitalisations of the same words in the Italian dictionary. It is true that some small fraction might actually be different words, but the majority are just differences between a lowercase and an uppercase initial letter. Comment out the %-full.dict target in the db/makefile and put the filtered word list directly in your it-full.dict file (trim it further if it is still too large).
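
A minimal Python sketch of this kind of capitalisation filter, assuming the expanded word list is available as a sequence of strings. It is not the script actually used here; dedupe_case is a hypothetical helper, and preferring the all-lowercase variant is an assumption.

```python
def dedupe_case(words):
    """Keep one spelling per case-insensitive word, preferring lowercase.

    The first spelling seen is kept unless an all-lowercase variant
    turns up later, in which case the lowercase one replaces it.
    Order of first appearance is preserved.
    """
    chosen = {}
    for word in words:
        key = word.casefold()
        current = chosen.get(key)
        if current is None or (word == word.lower() and current != current.lower()):
            chosen[key] = word
    return list(chosen.values())


variants = ["Sull'Acclimatatele", "sull'Acclimatatele", "sull'acclimatatele", "Roma"]
print(dedupe_case(variants))
```

Words that only ever appear capitalised (proper nouns such as Roma) survive unchanged, while the three spellings of the same form collapse into one. Genuinely different words that differ only by case would be merged too, which is the small fraction mentioned above.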
 

The Following 4 Users Say Thank You to ljo For This Useful Post:
Posts: 27 | Thanked: 35 times | Joined on Jan 2016 @ Sweden
#236
Originally Posted by eber42 View Post
From a quick look I see the following causes for the large size:
  • Lots of words are repeated with prefixes such as dall', sull'. At the moment my model handles words joined by apostrophes or hyphens as single words, so words with different prefixes are treated as different words. The OKBoard roadmap contains an item for managing prefixes and suffixes (explicit ones with punctuation signs, or linked together as in German) as distinct words, but I don't know when (or if) I will work on it.
  • Some words are repeated multiple times with odd capitalization. Are these different words: Sull'Acclimatatele, sull'Acclimatatele, sull'acclimatatele? At the moment words with different capitalizations are treated as different words (unless they are at the beginning of a sentence), but the case of a word with two different capitalizations is not handled very well.
Hello Eber!
I never heard those words before
I can tell you that the forms dall' and sull' are certainly correct, if a bit too formulaic. Also, those are "articulated prepositions" placed in front of nouns, so they should be treated as words in their own right. Example:

dall'anima
dall'oceano

The nouns are "anima" and "oceano", while "dall'" is the preposition. That does not justify creating a word for each preposition+word combination!
There are additional rules, naturally: for instance, that form is only used with words starting with a vowel...
Good that you are thinking of handling this situation.
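
Splitting the elided preposition from the noun could look roughly like the Python sketch below. It is an illustration only: split_elision is a hypothetical helper, the prefix list is deliberately incomplete, and the vowel check is a simplification of the rule mentioned above (it ignores nouns starting with a silent h, for instance).

```python
# Illustrative, incomplete list of elided articles and articulated prepositions.
ELIDED_PREFIXES = ("dall'", "sull'", "dell'", "nell'", "all'", "l'", "un'")
VOWELS = "aeiouàèéìòù"


def split_elision(token):
    """Split "dall'anima" into ["dall'", "anima"]; leave other tokens whole."""
    lowered = token.lower()
    for prefix in ELIDED_PREFIXES:
        rest = token[len(prefix):]
        # Only split when a vowel-initial word actually follows the prefix.
        if lowered.startswith(prefix) and rest and rest[0].lower() in VOWELS:
            return [token[:len(prefix)], rest]
    return [token]


for token in ["dall'anima", "dall'oceano", "Sull'Oceano", "gatto"]:
    print(token, "->", split_elision(token))
```

With a split like this, the dictionary only needs one entry per preposition and one per noun instead of one per combination.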

As for the capitalization: I would not consider it common for a word to have capitalised variants. Most words are either capitalised or not, so I'd prioritise lowercase words when multiple variants are found.


Originally Posted by eber42 View Post
Spidernik84's text corpus only contains 315k words (counting only those that aspell also knows), so my short-term suggestion is to add an option either to provide a (smaller) dictionary instead of aspell's, or to trust the input corpus to be flawless.

What do you think?

Edit: ouch, spidernik84 was faster with wc
We can try to skip aspell just for my language, for sure... I'm afraid of the results though: spelling mistakes are definitely common
It's worth a shot, I'll see what happens. Thanks for your help.
 

The Following 2 Users Say Thank You to spidernik84 For This Useful Post:
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#237
Originally Posted by eber42 View Post
1) But the case of a word with two different capitalizations is not handled very well.
...
2) so my short-term suggestion is to add an option either to provide a (smaller) dictionary instead of aspell's, or to trust the input corpus to be flawless.

What do you think?
1) That is definitely true. I saw this with the Spanish dictionary too when I did the full corpus.

2) Yes, providing an alternative dictionary is good. Maybe just keep the dict if it's there, and instruct the user to do a clean build otherwise? Assuming the corpus is flawless is not too bad either, since people still write a lot of stuff that is not covered by the aspell dictionary.
 
Posts: 529 | Thanked: 988 times | Joined on Mar 2015
#238
Sorry guys, I don't know how many of you are Italian, but I am.
Dall', sull' and similar forms could just be inserted as single words.
When you write sentences you actually leave a space between the preposition and the following word.
So I think it would be better to have two separate words:
dall/dall' (showing both options when swiping d-a-l-l) and anima, for example.
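
Offering both options for one swipe could be sketched in Python as below. This is a hypothetical illustration rather than OKBoard's matching engine: build_swipe_index is an invented name, and the sketch assumes a swipe only yields the letters of a word, so apostrophes are dropped from the lookup key.

```python
from collections import defaultdict


def build_swipe_index(vocabulary):
    """Map each letters-only key to every dictionary entry that shares it."""
    index = defaultdict(list)
    for word in vocabulary:
        # A swipe passes over letters only, so punctuation is ignored.
        key = "".join(ch for ch in word.lower() if ch.isalpha())
        index[key].append(word)
    return index


index = build_swipe_index(["dall", "dall'", "anima", "sull'"])
print(index["dall"])  # both candidates for the swipe d-a-l-l
```

The keyboard could then show every entry stored under the swiped key and let the user pick the form with or without the apostrophe.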

However, sull' Acclimatatele, for example, doesn't make sense.
Sull' is a preposition that precedes a noun and means over/on/regarding. For example, Sull' Oceano literally means "over the ocean".
The ' is inserted just because Oceano starts with a vowel!

Also, acclimatatele is a very unusual word in everyday speech. "Acclimatare" means "to get used to some climate condition" (for example, when you have been out in the cold winter and come home, you spend the first few minutes just "getting used" to the warmer conditions).

"Acclimatate" is one of the possible conjugations (participio passato) of this verb, used when referring to feminine plural nouns (a group of women, for example).

"AcclimatateLE" literally means "make them acclimatized/settled in".

So I mean, those are words not used very frequently in speech.
Sorry for my bad teaching skills.
 

The Following 2 Users Say Thank You to itdoesntmatt For This Useful Post:
Posts: 27 | Thanked: 35 times | Joined on Jan 2016 @ Sweden
#239
Originally Posted by itdoesntmatt View Post
Sorry guys, I don't know how many of you are Italian, but I am.
Dall', sull' and similar forms could just be inserted as single words.
When you write sentences you actually leave a space between the preposition and the following word.
Ciao!
I am pretty confident there should be no space between articles and nouns, or between articulated prepositions and nouns. This is the only input I can give.
 
Posts: 529 | Thanked: 988 times | Joined on Mar 2015
#240
Hi guys, and thank you very much for your efforts!


I know, but I explained myself badly.
When you write a sentence:
example: il gatto e' sull'Amaca
I swipe it this way: .. I-L..G-A-T-T-O... E'.. S-U-L-L(') ..A-M-A-C-A

It is not comfortable to swipe S-U-L-L-'-A-M-A-C-A,
because we think of them as separate words. Sull'Amaca is treated just like Sul Letto, as two separate words, even if formally you shouldn't leave a space.
Moreover, in everyday written language (including SMS, chat and so on) it really makes no difference if you leave a space between the preposition with ' and the following word.
I don't know how to explain it better; I hope it is understandable.

Last edited by itdoesntmatt; 2016-01-25 at 21:22.
 

The Following User Says Thank You to itdoesntmatt For This Useful Post:
Tags
bettertxtentry, huntnpeck sucks, okboard, sailfish, swype


 