Title: INTERNET EVOLUTION AND PROGRESS IN FULL AUTOMATIC FRENCH LANGUAGE MODELLING
Authors: Dominique Vaufreydaz, Mathias Géry
Abstract:
The World Wide Web is the largest information space seen so far: it is distributed all over the world, written in many languages, and covers a wide variety of topics. In the first part of this paper, we study the evolution of a French subset of this space over the last 3 years. During this time, the amount of text automatically extracted for language modelling was multiplied by 6.5, and the French coverage grew from 140,000 to 200,000 lexical forms. We thus show that more and more reliable data can be obtained to train our trigram models. Finally, recognition experiments on a French 'state of the art' evaluation set show that word accuracy increases from 51% to 62.30% between two models automatically computed from Web corpora, the first gathered at the beginning of 1999 and the second at the end of 2000.