dict.prepare
Class DictImport

java.lang.Object
  extended by dict.prepare.DictImport
All Implemented Interfaces:
IDictParserHandler

public class DictImport
extends java.lang.Object
implements IDictParserHandler

Builds files and a directory structure suitable for using a dictionary in a midlet. The class does the following tasks:

  1. Splits a dictionary file into multiple files, each with a size close to MAXFILESIZE (presently 1MB), and changes the character encoding to UTF-8. Depending on the dictionary type, the encoding of the source dictionary file may be WINDOWS-1250 for dict.cc dictionaries, UTF-8 for universal dictionary database files, and ISO-8859-1 or ISO-8859-2 for the thesaurus files. Mobile devices generally accept only UTF-8, with very few exceptions. For example, a WINDOWS-1250 encoded source file of more than 18MB for a German-English dictionary obtained from dict.cc, named deen.txt, will be split into 19 files deen.txt0 ... deen.txt18 with the same format, each encoded as UTF-8.
  2. Builds a number of index files spread across directories, with both the directory and the file name corresponding to the hash code of the index keyword. The source dictionary files have slightly different formats, so the keywords of a particular dictionary file are extracted depending on the dictionary type, using the method processLine() of the class DictParser (dict.cc), UDDLParser (universal dictionary database) or ThesParser (thesaurus files from www.openthesaurus.de or synonimy.ux.pl). Each keyword is handled like a key in a hashtable whose value is an array of dictionary entry references (see also class dict.common.DictEntryRef). Each reference holds the index of the dictionary file (e.g. 0 for deen.txt0, ..., 18 for deen.txt18), the start position of the explanation of the entry in the dictionary file in bytes, and the end position of the explanation. One keyword may have up to dict.common.DictIndex.MAXENTRIES = 150 explanations located at different positions in different dictionary files. The keyword and the array of dictionary references are then appended in binary form to a file whose name is determined, according to dict.common.DictIndex.mkpath(), as: For example, a keyword JUNGE having hash code 0x43a9c81 and having explanations in dict.txt0 from bytes 90 to 120 and in dict.txt10 from bytes 30 to 70 will be stored as follows: In fact, each index file also contains the number of keywords stored in it as the first number (writeInt()). Please remember that many keywords may have the same bytes 0-10 of the hash code, and therefore the file created usually contains data for many different keywords. In order to know the number of such keywords, the class first prepares the data structures in memory and then writes them to the file at once, using the algorithm described above. Usually you will get exactly 8 * 255 such files, each hopefully of similar size (not guaranteed, but in practice the sizes are very comparable).
  3. Finally, the class prepares an index configuration file /d_/idx.bin (dict.common.DictIndex.INDEXFILE). This file contains the following information: The long list of keywords is used for the "Search similar" function in the JavaME frontend, which checks whether a keyword contains the string entered by the user.
This class uses a lot of memory, because all data is cached in internal hash tables before it is dumped into files. The reverse algorithm for reading this information in the JavaME frontend is implemented in the class dict.common.DictIndex.
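
The "Search similar" check mentioned in step 3 can be pictured as a plain substring filter over the keyword list. The following is only a minimal sketch: the class name SimilarSearchSketch and the method searchSimilar() are hypothetical, and the case-sensitive comparison is an assumption, since the actual frontend code is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SimilarSearchSketch {

    /**
     * Returns every keyword that contains the query string.
     * Hypothetical helper; the comparison is case-sensitive by assumption.
     */
    static List<String> searchSimilar(List<String> keywords, String query) {
        List<String> hits = new ArrayList<>();
        for (String keyword : keywords) {
            if (keyword.contains(query)) {
                hits.add(keyword);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("JUNGE", "JUNGFRAU", "MANN");
        // Lists the keywords containing the user's input "JUNG".
        System.out.println(searchSimilar(keywords, "JUNG"));
    }
}
```

A linear scan like this touches every keyword, which may be one reason the keyword list is kept in a single configuration file.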

Version:
$Revision: 21 $
Author:
Daniel Stoinski

Field Summary
private static int BUFLEN
          Internal buffer length used for parsing files.
private  java.lang.String m_basedir
          Base directory for the resulting files.
private  java.lang.String m_dstdictbasefn
          Pattern for basenames of the resulting UTF-8 dictionary files.
private  java.lang.String m_dstdictfullfn
          Pattern for full names of the resulting UTF-8 dictionary files.
private  java.util.Hashtable m_index
          The map of imported dictionary entries.
private  java.lang.String m_srcdictfn
          The name of the dictionary file to import.
private  int m_type
          The type of the dictionary file.
private static int MAXFILESIZE
          Maximal size of a single file containing dictionary data.
 
Constructor Summary
DictImport(java.lang.String aDictFileName, java.lang.String aType, java.lang.String aBasedir)
          Initializes the object for the given file name and dictionary type.
 
Method Summary
 boolean addIndex(java.lang.String index, int fileno, long from, long to)
          Adds the given keyword and description to the map.
private static int conv(java.lang.String anSrcFile, java.lang.String aDstFilePat, java.lang.String anSrcEnc, java.lang.String aDstEnc, int maxsize)
          Copies the source dictionary file to a set of destination dictionary files, changing its character encoding and splitting the original file.
 java.util.Hashtable get()
          Returns the index created during the import process.
private static java.lang.String getEncoding(int aDictType)
          Returns the character encoding for the input dictionary file.
 void saveIdx(int fileno)
          Creates the index files and the index configuration file.
 void start()
          Does the whole job.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAXFILESIZE

private static final int MAXFILESIZE
Maximal size of a single file containing dictionary data.

See Also:
Constant Field Values

BUFLEN

private static final int BUFLEN
Internal buffer length used for parsing files.

See Also:
Constant Field Values

m_index

private java.util.Hashtable m_index
The map of imported dictionary entries. Keys are index keywords (String), values are arrays (ArrayList) of DictEntryRef objects with the explanations of the keywords.


m_srcdictfn

private java.lang.String m_srcdictfn
The name of the dictionary file to import.


m_dstdictbasefn

private java.lang.String m_dstdictbasefn
Pattern for basenames of the resulting UTF-8 dictionary files. The pattern follows the MessageFormat syntax. The UTF-8 files are needed for JavaME, and the resulting index is built from them. One dictionary file must be split into multiple smaller files in order to make the processing faster, and to make it possible at all for huge dictionary files.


m_dstdictfullfn

private java.lang.String m_dstdictfullfn
Pattern for full names of the resulting UTF-8 dictionary files. The pattern follows the MessageFormat syntax and consists of the installation base directory plus m_dstdictbasefn.


m_basedir

private java.lang.String m_basedir
Base directory for the resulting files.


m_type

private int m_type
The type of the dictionary file.

Constructor Detail

DictImport

public DictImport(java.lang.String aDictFileName,
                  java.lang.String aType,
                  java.lang.String aBasedir)
Initializes the object for the given file name and dictionary type. The split dictionary files, the index files and the index configuration file will be prepared in aBasedir/d_. Remember that you must call start() in order to start the creation of the files.

Parameters:
aDictFileName - name of the dictionary file to import into the map.
aType - the dictionary type, string DICTCC, UDDL or THES.
aBasedir - base directory for resulting files.
Method Detail

addIndex

public final boolean addIndex(java.lang.String index,
                              int fileno,
                              long from,
                              long to)
Adds the given keyword and description to the map. If the map already contains a key equal to the given index keyword, the DictEntryRef object created for the given from and to positions is appended to the array of objects for this key. Otherwise a new key is created in the map and the DictEntryRef object is put as the first and only element of the array for the key.

Specified by:
addIndex in interface IDictParserHandler
Parameters:
index - the index keyword used as a key in the map.
from - the position of the beginning of the keyword explanation in the dictionary file. Used for instantiating a new DictEntryRef object.
to - the position of the end of the keyword explanation in the dictionary file. Used for instantiating a new DictEntryRef object.
fileno - index of the dictionary file, for which the index entry reference has to be created.
Returns:
always true, meaning we don't want to break the import process.
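
The append-or-create behaviour described above can be sketched as follows. This is a simplified illustration, not the actual implementation: AddIndexSketch and the nested EntryRef class are hypothetical stand-ins for DictImport and dict.common.DictEntryRef, and generics are used only for readability.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

public class AddIndexSketch {

    /** Hypothetical stand-in for dict.common.DictEntryRef. */
    static final class EntryRef {
        final int fileno; // index of the split dictionary file
        final long from;  // start of the explanation, in bytes
        final long to;    // end of the explanation, in bytes

        EntryRef(int fileno, long from, long to) {
            this.fileno = fileno;
            this.from = from;
            this.to = to;
        }
    }

    // Keyword -> list of references, mirroring the role of m_index.
    private final Hashtable<String, List<EntryRef>> index = new Hashtable<>();

    boolean addIndex(String keyword, int fileno, long from, long to) {
        List<EntryRef> refs = index.get(keyword);
        if (refs == null) {
            // First occurrence of the keyword: create its list.
            refs = new ArrayList<>();
            index.put(keyword, refs);
        }
        refs.add(new EntryRef(fileno, from, to));
        return true; // never interrupts the import
    }

    Hashtable<String, List<EntryRef>> get() {
        return index;
    }

    public static void main(String[] args) {
        AddIndexSketch sketch = new AddIndexSketch();
        // The JUNGE example from the class description.
        sketch.addIndex("JUNGE", 0, 90, 120);
        sketch.addIndex("JUNGE", 10, 30, 70);
        System.out.println(sketch.get().get("JUNGE").size());
    }
}
```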

getEncoding

private static java.lang.String getEncoding(int aDictType)
Returns the character encoding for the input dictionary file.

Parameters:
aDictType - dictionary type, one of the constants defined in DictType
Returns:
encoding for the dictionary type.

conv

private static int conv(java.lang.String anSrcFile,
                        java.lang.String aDstFilePat,
                        java.lang.String anSrcEnc,
                        java.lang.String aDstEnc,
                        int maxsize)
                 throws java.io.IOException
Copies the source dictionary file to a set of destination dictionary files, changing its character encoding and splitting the original file. The input file is assumed to be a text file in the given source encoding, with lines separated by CR, LF or both.

Parameters:
anSrcFile - the source file
aDstFilePat - pattern for the destination files created. The pattern follows the MessageFormat syntax.
anSrcEnc - encoding of the source file
aDstEnc - encoding of the destination file
maxsize - size after which to create a new destination file.
Returns:
number of destination files created.
Throws:
java.io.IOException - on read or write errors.
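
A split-and-re-encode routine of this kind might look like the sketch below. It is written under assumptions and is not the original code: the class name ConvSketch is hypothetical, a new destination file is started as soon as at least maxsize bytes have been written to the current one, lines are always terminated with LF, and the file number is passed to MessageFormat as a string to avoid locale-dependent digit grouping.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.MessageFormat;

public class ConvSketch {

    /** Returns the number of destination files created. */
    static int conv(String srcFile, String dstFilePat, String srcEnc,
                    String dstEnc, int maxsize) throws IOException {
        Charset dstCharset = Charset.forName(dstEnc);
        int fileCount = 0;
        long written = 0;
        BufferedWriter out = null;
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get(srcFile), Charset.forName(srcEnc))) {
            String line;
            // readLine() accepts CR, LF and CRLF as line terminators.
            while ((line = in.readLine()) != null) {
                if (out == null || written >= maxsize) {
                    if (out != null) {
                        out.close();
                    }
                    // A pattern such as "deen.txt{0}" yields deen.txt0, deen.txt1, ...
                    String name = MessageFormat.format(
                            dstFilePat, Integer.toString(fileCount));
                    out = Files.newBufferedWriter(Paths.get(name), dstCharset);
                    fileCount++;
                    written = 0;
                }
                out.write(line);
                out.write('\n');
                // The line feed is counted as one byte, which holds for UTF-8.
                written += line.getBytes(dstCharset).length + 1;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return fileCount;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ConvSketch <source file> <destination pattern>");
            return;
        }
        int n = conv(args[0], args[1], "windows-1250", "UTF-8", 1024 * 1024);
        System.out.println(n + " file(s) written");
    }
}
```

Splitting only at line boundaries keeps every dictionary entry within a single destination file, so the resulting files are close to, rather than exactly, maxsize bytes long.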

start

public final void start()
                 throws java.io.IOException
Does the whole job.
  1. Creates the split dictionary files using conv()
  2. Creates the internal data structures using ParserBase.read()
  3. Writes the index files and the index configuration file using saveIdx()

Throws:
java.io.IOException - on read errors.

get

public final java.util.Hashtable get()
Returns the index created during the import process.

Returns:
m_index

saveIdx

public final void saveIdx(int fileno)
                   throws java.io.IOException
Creates the index files and the index configuration file. Creates the files in the hash based directory structure and the file idx.bin.

Parameters:
fileno - number of dictionary files created
Throws:
java.io.IOException - on write errors