DictImport (dict)

dict.prepare
Class DictImport

java.lang.Object
  dict.prepare.DictImport

All Implemented Interfaces:: IDictParserHandler

public class DictImport
extends java.lang.Object
implements IDictParserHandler

Builds files and a directory structure suitable for using a dictionary in a midlet. The class does the following tasks:

Splits a dictionary file into multiple files, each close to MAXFILESIZE in size (presently 1 MB), and converts the character encoding to UTF-8. Depending on the dictionary type, the source dictionary file may be encoded as WINDOWS-1250 (dict.cc dictionaries), UTF-8 (universal dictionary database files) or ISO-8859-1 or ISO-8859-2 (thesaurus files). Mobile devices generally accept only UTF-8, with very few exceptions. For example, a WINDOWS-1250 encoded source file of more than 18 MB for a German-English dictionary obtained from dict.cc, named deen.txt, will be split into 19 files deen.txt0 ... deen.txt18, each with the same format and encoded as UTF-8.
Builds a number of index files spread across directories, with both the directory and the file name derived from the hash code of the index keyword. The source dictionary files differ slightly in format, so the keywords are extracted depending on the dictionary type using the method processLine() of the class DictParser (dict.cc), UDDLParser (universal dictionary database) or ThesParser (thesaurus files from www.openthesaurus.de or synonimy.ux.pl). Each keyword is handled like a key in a hashtable whose value is an array of dictionary entry references (see also class dict.common.DictEntryRef). Each reference holds the index of the dictionary file (e.g. 0 for deen.txt0, ..., 18 for deen.txt18), the start position of the explanation of the entry in the dictionary file in bytes, and the end position of the explanation. One keyword may have up to dict.common.DictIndex.MAXENTRIES = 150 explanations located at different positions in different dictionary files. The keyword and the array of dictionary references are then appended in binary form to a file whose name is determined, according to dict.common.DictIndex.mkpath(), as:
- directory - bits 8, 9 and 10 of the hash code of the index keyword in hex
- file - bits 0 - 7 of the hash code of the index keyword in hex
For example, a keyword JUNGE with hash code 0x43a9c81 and explanations in deen.txt0 from bytes 90 to 120 and in deen.txt10 from bytes 30 to 70 will be stored as follows:
- bits 8, 9 and 10 of the hash code give the number 4, which is the directory name.
- bits 0-7 give the hex number 81, which is the file name.
- the whole path is prefixed with dict.common.DictIndex.BASEDIR "/d_/"
- The resulting full name of the file is then /d_/4/81
- The following information is appended to this file:
  - The keyword Junge (DataOutputStream.writeUTF())
  - Number of dictionary references, here 2 (writeInt())
  - The index of the dictionary file of the first reference, 0 for deen.txt0 (writeInt())
  - The begin position 90 of the first reference (writeInt())
  - The length of the entry, 120 - 90 = 30 (writeInt())
  - Likewise for the 2nd reference: the index of the dictionary file, 10 for deen.txt10
  - Begin position 30
  - Length 70 - 30 = 40
In fact, each index file also contains the number of keywords stored in it as its first number (writeInt()). Remember that many keywords may share the same bits 0-10 of the hash code, so a file usually contains data for many different keywords. To determine the number of such keywords, the class first prepares the data structures in memory and then writes them to the file in one pass using the algorithm described above. Usually you'll get 8 * 256 such files (8 directories with 256 files each), hopefully of similar size (not guaranteed, but in practice the sizes are very comparable).
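The path derivation can be sketched as follows. This is a minimal illustration, not the actual dict.common.DictIndex.mkpath() code: it assumes the hash is Java's String.hashCode() (which does yield 0x43a9c81 for JUNGE), that the directory comes from bits 8-10 and the file name from bits 0-7, and that both are written as unpadded lowercase hex. The class and method names are hypothetical.

```java
// Sketch only: derives the index file path from the hash code of a keyword.
class IndexPathSketch {
    // Mirrors the documented value of dict.common.DictIndex.BASEDIR.
    static final String BASEDIR = "/d_/";

    /** Hypothetical stand-in for dict.common.DictIndex.mkpath(). */
    static String mkpath(String keyword) {
        int hash = keyword.hashCode();  // assumption: String.hashCode() is the hash used
        int dir = (hash >> 8) & 0x7;    // bits 8..10 -> 8 possible directories
        int file = hash & 0xFF;         // bits 0..7  -> 256 possible file names
        return BASEDIR + Integer.toHexString(dir) + "/" + Integer.toHexString(file);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString("JUNGE".hashCode())); // 43a9c81
        System.out.println(mkpath("JUNGE"));                         // /d_/4/81
    }
}
```

With a 3-bit directory and an 8-bit file name, at most 8 * 256 distinct paths can occur.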
Finally, the class prepares an index configuration file /d_/idx.bin (dict.common.DictIndex.INDEXFILE). This file contains the following information:
- Dictionary type, one of the constants dict.common.DictType.DICTCC, UDDL, THES, stored in binary form using DataOutputStream.writeInt().
- Number of index keywords in the dictionary, e.g. the number of German keywords in the German-English dictionary, writeInt().
- Sorted index keywords, one after another using writeUTF().
The long list of keywords is used for the "Search similar" function in the JavaME frontend, which checks each keyword for whether it contains the string entered by the user.
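The layout of idx.bin and the "Search similar" check can be illustrated with an in-memory sketch. The class name, the method names and the numeric value chosen for DICTCC are assumptions made for this example; the real constants live in dict.common.DictType, and the real frontend runs on JavaME rather than on the desktop collections used here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Sketch only: writes and scans data laid out like the documented idx.bin.
class IdxBinSketch {
    static final int DICTCC = 0; // hypothetical value, not taken from DictType

    /** Writes: dictionary type, keyword count, then the sorted keywords. */
    static byte[] write(int type, Collection<String> keywords) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            List<String> sorted = new ArrayList<>(keywords);
            Collections.sort(sorted);
            out.writeInt(type);
            out.writeInt(sorted.size());
            for (String keyword : sorted) {
                out.writeUTF(keyword);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Returns every stored keyword that contains the given string. */
    static List<String> searchSimilar(byte[] idx, String needle) {
        List<String> hits = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(idx))) {
            in.readInt();              // dictionary type, not needed for the search
            int count = in.readInt();  // number of keywords that follow
            for (int i = 0; i < count; i++) {
                String keyword = in.readUTF();
                if (keyword.contains(needle)) {
                    hits.add(keyword);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        byte[] idx = write(DICTCC, Arrays.asList("JUNGE", "ALT", "JUNG"));
        System.out.println(searchSimilar(idx, "JUNG")); // [JUNG, JUNGE]
    }
}
```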

This class uses a lot of memory, because all data is cached in internal hash tables before being dumped into files. The reverse algorithm for reading this information in the JavaME frontend is implemented in the class dict.common.DictIndex.
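To make the index file layout and its reading counterpart concrete, here is a round-trip sketch under stated assumptions. It follows the record layout described above (keyword count first, then per keyword: writeUTF keyword, reference count, and file index, begin position and length for each reference). The class IndexFileSketch and its nested Ref type are hypothetical stand-ins, not the actual DictImport, DictIndex or DictEntryRef code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: writes one index file in memory and looks a keyword up again.
class IndexFileSketch {
    /** Hypothetical stand-in for dict.common.DictEntryRef. */
    static final class Ref {
        final int fileno, begin, length;
        Ref(int fileno, int begin, int length) {
            this.fileno = fileno;
            this.begin = begin;
            this.length = length;
        }
    }

    /** Writes the keyword count followed by one record per keyword. */
    static byte[] write(Map<String, List<Ref>> entries) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(entries.size());
            for (Map.Entry<String, List<Ref>> e : entries.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeInt(e.getValue().size());
                for (Ref r : e.getValue()) {
                    out.writeInt(r.fileno);
                    out.writeInt(r.begin);
                    out.writeInt(r.length);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    /** Scans the records in order; returns an empty list if the keyword is absent. */
    static List<Ref> lookup(byte[] data, String keyword) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int keywords = in.readInt();
            for (int i = 0; i < keywords; i++) {
                String key = in.readUTF();
                int count = in.readInt();
                List<Ref> refs = new ArrayList<>();
                for (int j = 0; j < count; j++) {
                    refs.add(new Ref(in.readInt(), in.readInt(), in.readInt()));
                }
                if (key.equals(keyword)) {
                    return refs;
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        Map<String, List<Ref>> entries = new LinkedHashMap<>();
        entries.put("JUNGE", Arrays.asList(new Ref(0, 90, 30), new Ref(10, 30, 40)));
        for (Ref r : lookup(write(entries), "JUNGE")) {
            System.out.println(r.fileno + " " + r.begin + " " + r.length);
        }
    }
}
```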

Version:: $Revision: 21 $
Author:: Daniel Stoinski

Field Summary
`private static int`	`BUFLEN` Internal buffer length used for parsing files.
`private java.lang.String`	`m_basedir` Base directory for the resulting files.
`private java.lang.String`	`m_dstdictbasefn` Pattern for basenames of the resulting UTF-8 dictionary files.
`private java.lang.String`	`m_dstdictfullfn` Pattern for full names of the resulting UTF-8 dictionary files.
`private java.util.Hashtable`	`m_index` The map of imported dictionary entries.
`private java.lang.String`	`m_srcdictfn` The name of the dictionary file to import.
`private int`	`m_type` The type of the dictionary file.
`private static int`	`MAXFILESIZE` Maximal size of a single file containing dictionary data.

Constructor Summary
`DictImport(java.lang.String aDictFileName, java.lang.String aType, java.lang.String aBasedir)` Initializes the object for the given file name and dictionary type.

Method Summary
`boolean`	`addIndex(java.lang.String index, int fileno, long from, long to)` Adds the given keyword and description to the map.
`private static int`	`conv(java.lang.String anSrcFile, java.lang.String aDstFilePat, java.lang.String anSrcEnc, java.lang.String aDstEnc, int maxsize)` Copies the source dictionary file to a set of destination dictionary files, changing its character encoding and splitting the original file.
`java.util.Hashtable`	`get()` Returns the index created during the import process.
`private static java.lang.String`	`getEncoding(int aDictType)` Returns the character encoding for the input dictionary file.
`void`	`saveIdx(int fileno)` Creates the index files and the index configuration file.
`void`	`start()` Does the whole job.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

MAXFILESIZE

private static final int MAXFILESIZE

Maximal size of a single file containing dictionary data.

See Also:: Constant Field Values

BUFLEN

private static final int BUFLEN

Internal buffer length used for parsing files.

See Also:: Constant Field Values

m_index

private java.util.Hashtable m_index

The map of imported dictionary entries. Keys are index keywords (String), values are arrays (ArrayList) of DictEntryRef objects with the explanations of the keywords.

m_srcdictfn

private java.lang.String m_srcdictfn

The name of the dictionary file to import.

m_dstdictbasefn

private java.lang.String m_dstdictbasefn

Pattern for basenames of the resulting UTF-8 dictionary files. The pattern follows MessageFormat syntax. JavaME needs the files in UTF-8, and the resulting index is built from these files. One dictionary file must be split into multiple smaller files to make processing faster, and to make it possible at all for huge dictionary files.

m_dstdictfullfn

private java.lang.String m_dstdictfullfn

Pattern for full names of the resulting UTF-8 dictionary files. The pattern follows MessageFormat syntax. It combines the installation base directory with m_dstdictbasefn.
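As an illustration of how a MessageFormat pattern can yield the numbered file names, consider the sketch below. The pattern strings "deen.txt{0}" and "/tmp/out/d_/deen.txt{0}" are assumptions chosen to match the deen.txt0 ... deen.txt18 example; the patterns actually held in m_dstdictbasefn and m_dstdictfullfn are not shown in this documentation.

```java
import java.text.MessageFormat;

// Sketch only: expands a MessageFormat pattern into a numbered file name.
class DictFileNameSketch {
    static String fileName(String pattern, int fileno) {
        // Passing the number as a String avoids locale-specific number
        // formatting (for example digit grouping) in the result.
        return MessageFormat.format(pattern, Integer.toString(fileno));
    }

    public static void main(String[] args) {
        System.out.println(fileName("deen.txt{0}", 18));            // deen.txt18
        System.out.println(fileName("/tmp/out/d_/deen.txt{0}", 0)); // /tmp/out/d_/deen.txt0
    }
}
```

Note that MessageFormat treats single quotes and curly braces specially, so a base directory containing those characters would need escaping before being embedded in a pattern.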

m_basedir

private java.lang.String m_basedir

Base directory for the resulting files.

m_type

private int m_type

The type of the dictionary file.

Constructor Detail

DictImport

public DictImport(java.lang.String aDictFileName,
                  java.lang.String aType,
                  java.lang.String aBasedir)

Initializes the object for the given file name and dictionary type. Will prepare the split dictionary files, index files and the index configuration file in aBasedir/d_. Remember that you must call start() to start the creation of the files.

Parameters:: aDictFileName - name of the dictionary file to import into the map.; aType - the dictionary type, string DICTCC, UDDL or THES.; aBasedir - base directory for resulting files.

Method Detail

addIndex

public final boolean addIndex(java.lang.String index,
                              int fileno,
                              long from,
                              long to)

Adds the given keyword and description to the map. If the map already contains a key equal to the given index keyword, the DictEntryRef object created for the given from and to positions is appended to the array of objects for this key. Otherwise a new key is created in the map and the DictEntryRef object is put as the first and only element of the array for the key.

Specified by:: addIndex in interface IDictParserHandler

Parameters:: index - the index keyword used as a key in the map.; from - the start position of the keyword explanation in the dictionary file. Used for instantiating a new DictEntryRef object.; to - the end position of the keyword explanation in the dictionary file. Used for instantiating a new DictEntryRef object.; fileno - index of the dictionary file for which the index entry reference is to be created.
Returns:: always true, meaning we don't want to break the import process.
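The append-or-create behaviour described for addIndex() can be sketched like this. The nested Ref class is a hypothetical stand-in for dict.common.DictEntryRef, whose real constructor and fields are not documented here, and the generic type parameters are an addition for readability.

```java
import java.util.ArrayList;
import java.util.Hashtable;

// Sketch only: keyword -> list of entry references, as described for m_index.
class AddIndexSketch {
    /** Hypothetical stand-in for dict.common.DictEntryRef. */
    static final class Ref {
        final int fileno;
        final long from, to;
        Ref(int fileno, long from, long to) {
            this.fileno = fileno;
            this.from = from;
            this.to = to;
        }
    }

    private final Hashtable<String, ArrayList<Ref>> index = new Hashtable<>();

    /** Appends to the existing list for the keyword, or starts a new one. */
    boolean addIndex(String keyword, int fileno, long from, long to) {
        ArrayList<Ref> refs = index.get(keyword);
        if (refs == null) {
            refs = new ArrayList<>();
            index.put(keyword, refs);
        }
        refs.add(new Ref(fileno, from, to));
        return true; // never asks the parser to stop
    }

    Hashtable<String, ArrayList<Ref>> get() {
        return index;
    }

    public static void main(String[] args) {
        AddIndexSketch sketch = new AddIndexSketch();
        sketch.addIndex("JUNGE", 0, 90, 120);
        sketch.addIndex("JUNGE", 10, 30, 70);
        System.out.println(sketch.get().get("JUNGE").size()); // 2
    }
}
```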

getEncoding

private static java.lang.String getEncoding(int aDictType)

Returns the character encoding for the input dictionary file.

Parameters:: aDictType - dictionary type, one of the constants defined in DictType
Returns:: encoding for the dictionary type.

conv

private static int conv(java.lang.String anSrcFile,
                        java.lang.String aDstFilePat,
                        java.lang.String anSrcEnc,
                        java.lang.String aDstEnc,
                        int maxsize)
                 throws java.io.IOException

Copies the source dictionary file to a set of destination dictionary files, changing its character encoding and splitting the original file. The input file is assumed to be a text file with lines separated by CR, LF or both, and to have the correct encoding.

Parameters:: anSrcFile - the source file; aDstFilePat - pattern for the destination files to create, following MessageFormat syntax.; anSrcEnc - encoding of the source file; aDstEnc - encoding of the destination file; maxsize - size after which to create a new destination file.
Returns:: number of destination files created.
Throws:: java.io.IOException - on read or write errors.
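One way to split and re-encode a text file line by line is sketched below. It is not the actual conv() implementation: it assumes a new destination file is started once at least maxsize bytes have been written to the current one, that output lines end with LF, and that the pattern takes the file number as {0}. The demo() helper and its sample data are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.MessageFormat;

// Sketch only: re-encodes a text file and splits it at line boundaries.
class ConvSketch {
    static int conv(String srcFile, String dstPattern, String srcEnc,
                    String dstEnc, int maxsize) throws IOException {
        Charset dstCharset = Charset.forName(dstEnc);
        int fileno = 0;    // number of destination files started so far
        long written = 0;  // bytes written to the current destination file
        Writer out = null;
        try (BufferedReader in =
                 Files.newBufferedReader(Paths.get(srcFile), Charset.forName(srcEnc))) {
            String line;
            while ((line = in.readLine()) != null) { // accepts CR, LF or CRLF
                if (out == null || written >= maxsize) {
                    if (out != null) {
                        out.close();
                    }
                    String name = MessageFormat.format(dstPattern, Integer.toString(fileno++));
                    out = Files.newBufferedWriter(Paths.get(name), dstCharset);
                    written = 0;
                }
                String text = line + "\n";
                out.write(text);
                written += text.getBytes(dstCharset).length;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return fileno;
    }

    /** Converts a tiny ISO-8859-1 sample with maxsize 10; returns the file count. */
    static int demo() {
        try {
            Path dir = Files.createTempDirectory("convsketch");
            Path src = dir.resolve("sample.txt");
            String sample = "Junge\tboy\nM\u00e4dchen\tgirl\n\u00e4lter\tolder\n";
            Files.write(src, sample.getBytes(StandardCharsets.ISO_8859_1));
            return conv(src.toString(), src.toString() + "{0}", "ISO-8859-1", "UTF-8", 10);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 3, one destination file per sample line
    }
}
```

Splitting only at line boundaries keeps every dictionary entry whole, at the cost of files that slightly exceed maxsize.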

start

public final void start()
                 throws java.io.IOException

Does the whole job.

- Creates the split dictionary files using conv()
- Creates the internal data structures using ParserBase.read()
- Writes the index files and the index configuration file using saveIdx()

Throws:: java.io.IOException - on read errors.

get

public final java.util.Hashtable get()

Returns the index created during the import process.

Returns:: m_index

saveIdx

public final void saveIdx(int fileno)
                   throws java.io.IOException

Creates the index files and the index configuration file. Creates the files in the hash based directory structure and the file idx.bin.

Parameters:: fileno - number of dictionary files created
Throws:: java.io.IOException - on write errors