dict.prepare
Class DictImport
java.lang.Object
dict.prepare.DictImport
- All Implemented Interfaces:
- IDictParserHandler
public class DictImport
- extends java.lang.Object
- implements IDictParserHandler
Builds files and a directory structure suitable for using a dictionary
in a midlet.
The class does the following tasks:
- Splits a dictionary file into multiple files, each close in size
to MAXFILESIZE (presently 1MB), and changes the character encoding
to UTF-8. Depending on the dictionary type, the encoding of the
source dictionary file may be WINDOWS-1250 for dict.cc dictionaries,
UTF-8 for universal dictionary database files and ISO-8859-1 or
ISO-8859-2 for the thesaurus files. With very few exceptions, mobile
devices generally accept only UTF-8. For example, a source file
deen.txt of more than 18MB, obtained from dict.cc for a
German-English dictionary and encoded in WINDOWS-1250, will be split
into 19 files deen.txt0 ... deen.txt18, each with the same format
and encoded as UTF-8.
- Builds a number of index files spread across directories, with
both the directory and the file name derived from the hash code of
the index keyword. The source dictionary files differ slightly in
format, so the keywords of a particular dictionary file are
extracted, depending on the dictionary type, by the method
processLine() of the class DictParser (dict.cc), UDDLParser
(universal dictionary database) or ThesParser (thesaurus files
from www.openthesaurus.de or synonimy.ux.pl). Each keyword is
handled like a key in a hashtable whose value is an array of
dictionary entry references (see also class
dict.common.DictEntryRef). Each reference holds the index of the
dictionary file (e.g. 0 for deen.txt0, ..., 18 for deen.txt18), the
start position in bytes of the explanation of the entry in the
dictionary file, and the end position of the explanation. One
keyword may have up to dict.common.DictIndex.MAXENTRIES = 150
explanations located at different positions in different
dictionary files.
The keyword and its array of dictionary references are then appended
in binary form to a file whose name is determined, according to
dict.common.DictIndex.mkpath(), as:
- directory - bits 8, 9, 10 of the hash code of the index
keyword in hex
- file - bits 0 - 7 of the hash code of the index keyword
in hex
For example, a keyword JUNGE having hash code 0x43a9c81 and having
explanations in dict.txt0 from bytes 90 to 120 and in dict.txt10 from
bytes 30 to 70 will be stored as follows:
- bits 8, 9 and 10 of the hash code result in the number 4 and this
is the directory name.
- bits 0-7 result in the hex number 81 and this is the file name
- the whole path is prefixed with dict.common.DictIndex.BASEDIR
"/d_/"
- The resulting full name of the file is then /d_/4/81
- The following information is appended to this file:
- The keyword Junge (DataOutputStream.writeUTF())
- Number of dictionary references, here 2 (writeInt())
- The index of the dictionary file of the first reference,
0 for dict.txt0 (writeInt())
- The begin position 90 of the first reference (writeInt())
- The length of the entry, 120 - 90 = 30 (writeInt())
- Likewise for the 2nd reference: the index of the dictionary
file, 10 for dict.txt10
- Begin position 30
- Length 70 - 30 = 40
In fact, each index file also contains the number of keywords stored
in it as the first number (writeInt()). Remember that many
keywords may share the same bits 0-10 of the hash code, and therefore
a file usually contains data for many different keywords.
In order to know the number of such keywords, the class first prepares
the data structures in memory and then writes them to the file in one
pass, using the algorithm described above.
Usually you'll get exactly 8 * 255 such files, hopefully all of
quite similar size (not guaranteed, but in practice the sizes are
very comparable).
- Finally the class prepares an index configuration file /d_/idx.bin
(dict.common.DictIndex.INDEXFILE).
This file contains the following information:
- Dictionary type, one of the constants
dict.common.DictType.DICTCC, UDDL, THES, stored in binary form using
DataOutputStream.writeInt().
- Number of index keywords in the dictionary, e.g. the number of
German keywords in the German-English dictionary, writeInt().
- Sorted index keywords, one after another using writeUTF().
The long list of keywords is used for the function
"Search similar" in the JavaME frontend, which checks for each
keyword whether it contains the string entered by the user.
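The layout of idx.bin listed above can be sketched as follows. This is
only an illustration under stated assumptions, not the code of
DictImport: the class IdxConfigSketch, the constant TYPE_DICTCC with
its value 1, and the method searchSimilar() are hypothetical names, and
the data is kept in a byte array instead of a file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

class IdxConfigSketch {
    /** Hypothetical value; the real DictType.DICTCC constant is not given here. */
    static final int TYPE_DICTCC = 1;

    /** Serializes the dictionary type, the keyword count and the sorted keywords. */
    static byte[] write(int dictType, Collection<String> keywords) {
        TreeSet<String> sorted = new TreeSet<>(keywords); // sorts, drops duplicates
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(dictType);
            out.writeInt(sorted.size());
            for (String keyword : sorted) {
                out.writeUTF(keyword);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the keyword list back and keeps the keywords containing the query. */
    static List<String> searchSimilar(byte[] idx, String query) {
        List<String> hits = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(idx))) {
            in.readInt();                 // dictionary type, not needed for the search
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String keyword = in.readUTF();
                if (keyword.contains(query)) {
                    hits.add(keyword);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        byte[] idx = write(TYPE_DICTCC, Arrays.asList("JUNGE", "HAUS", "JUNG"));
        System.out.println(searchSimilar(idx, "JUNG"));
    }
}
```

Because the keywords are stored sorted, a substring scan like this
returns its hits in sorted order without any extra work on the device.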
This class uses a lot of memory, because all data is cached in internal
hash tables before it is dumped into files.
The inverse algorithm for reading this information in the JavaME
frontend is implemented in the class dict.common.DictIndex.
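The path rule and the record layout of an index file can be sketched as
follows. This is a minimal sketch under assumptions, not the code of
DictIndex.mkpath(): it assumes that the hash code is String.hashCode()
(which does give 0x43a9c81 for JUNGE), that the directory is taken from
bits 8-10 and the file name from bits 0-7, and it omits the leading
keyword count of each index file. The class IndexPathSketch and its
methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class IndexPathSketch {
    /** Value quoted for dict.common.DictIndex.BASEDIR in the class description. */
    static final String BASEDIR = "/d_/";

    /** Assumed rule: directory = bits 8-10, file = bits 0-7, both in hex. */
    static String mkpath(String keyword) {
        int hash = keyword.hashCode();   // assumption: plain String.hashCode()
        int dir = (hash >> 8) & 0x7;     // three bits, so at most 8 directories
        int file = hash & 0xFF;          // low eight bits
        return BASEDIR + Integer.toHexString(dir) + "/" + Integer.toHexString(file);
    }

    /** Serializes one keyword record; each reference is {fileIndex, begin, end}. */
    static byte[] record(String keyword, int[][] refs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(keyword);               // the keyword
            out.writeInt(refs.length);           // number of references
            for (int[] ref : refs) {
                out.writeInt(ref[0]);            // index of the dictionary file
                out.writeInt(ref[1]);            // begin position
                out.writeInt(ref[2] - ref[1]);   // length = end - begin
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(mkpath("JUNGE"));
        byte[] rec = record("JUNGE", new int[][] {{0, 90, 120}, {10, 30, 70}});
        System.out.println(rec.length + " bytes");
    }
}
```

With these assumptions mkpath("JUNGE") gives /d_/4/81, and the record
for the two references of the example takes 35 bytes.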
- Version:
- $Revision: 21 $
- Author:
- Daniel Stoinski
Field Summary
private static int BUFLEN
- Internal buffer length used for parsing files.
private java.lang.String m_basedir
- Base directory for the resulting files.
private java.lang.String m_dstdictbasefn
- Pattern for basenames of the resulting UTF-8 dictionary files.
private java.lang.String m_dstdictfullfn
- Pattern for full names of the resulting UTF-8 dictionary files.
private java.util.Hashtable m_index
- The map of imported dictionary entries.
private java.lang.String m_srcdictfn
- The name of the dictionary file to import.
private int m_type
- The type of the dictionary file.
private static int MAXFILESIZE
- Maximal size of a single file containing dictionary data.
Constructor Summary
DictImport(java.lang.String aDictFileName,
java.lang.String aType,
java.lang.String aBasedir)
- Initializes the object for the given file name and dictionary type.
Method Summary
boolean addIndex(java.lang.String index,
int fileno,
long from,
long to)
- Adds the given keyword and description to the map.
private static int conv(java.lang.String anSrcFile,
java.lang.String aDstFilePat,
java.lang.String anSrcEnc,
java.lang.String aDstEnc,
int maxsize)
- Copies the source dictionary file to a set of destination dictionary
files, changing its character encoding and splitting the original file.
java.util.Hashtable get()
- Returns the index created during the import process.
private static java.lang.String getEncoding(int aDictType)
- Returns the character encoding for the input dictionary file.
void saveIdx(int fileno)
- Creates the index files and the index configuration file.
void start()
- Does the whole job.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
MAXFILESIZE
private static final int MAXFILESIZE
- Maximal size of a single file containing dictionary data.
- See Also:
- Constant Field Values
BUFLEN
private static final int BUFLEN
- Internal buffer length used for parsing files.
- See Also:
- Constant Field Values
m_index
private java.util.Hashtable m_index
- The map of imported dictionary entries.
Keys are index keywords (String), values are arrays (ArrayList) of
DictEntryRef objects with the explanations of the keywords.
m_srcdictfn
private java.lang.String m_srcdictfn
- The name of the dictionary file to import.
m_dstdictbasefn
private java.lang.String m_dstdictbasefn
- Pattern for basenames of the resulting UTF-8 dictionary files.
Pattern according to MessageFormat.
Files in UTF-8 are needed for JavaME. The resulting index will be
built from these files. One dictionary file must be split into
multiple smaller files in order to make the processing faster
and to make it possible at all for huge dictionary files.
m_dstdictfullfn
private java.lang.String m_dstdictfullfn
- Pattern for full names of the resulting UTF-8 dictionary files.
Pattern according to MessageFormat.
m_dstdictbasefn + the installation base directory.
m_basedir
private java.lang.String m_basedir
- Base directory for the resulting files.
m_type
private int m_type
- The type of the dictionary file.
DictImport
public DictImport(java.lang.String aDictFileName,
java.lang.String aType,
java.lang.String aBasedir)
- Initializes the object for the given file name and dictionary type.
Will prepare the split dictionary files, index files and the
index configuration file in aBasedir/d_.
Remember that you must call start() in order to start the
creation of the files.
- Parameters:
aDictFileName
- name of the dictionary file to import into the map.
aType
- the dictionary type, string DICTCC, UDDL or THES.
aBasedir
- base directory for resulting files.
addIndex
public final boolean addIndex(java.lang.String index,
int fileno,
long from,
long to)
- Adds the given keyword and description to the map.
If the map already contains a key equal to the given index keyword,
then the DictEntryRef object created for the given from and to positions
will be appended to the array of objects for this key. Otherwise a new
key is created in the map and the DictEntryRef object is put as the
first and only element of the array for the key.
- Specified by:
addIndex
in interface IDictParserHandler
- Parameters:
index
- the index keyword used as a key in the map.
from
- the position of the beginning of the keyword explanation
in the dictionary file. Used for instantiating a new
DictEntryRef object.
to
- the position of the end of the keyword explanation
in the dictionary file. Used for instantiating a new
DictEntryRef object.
fileno
- index of the dictionary file for which the index entry
reference has to be created.
- Returns:
- always true, meaning we don't want to break the import process.
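The append-or-create behaviour described for addIndex() can be sketched
as follows. This is an illustration only: AddIndexSketch, the nested
class EntryRef (a stand-in for dict.common.DictEntryRef, whose real
fields are not shown here) and the helper refs() are hypothetical
names.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

class AddIndexSketch {
    /** Hypothetical stand-in for dict.common.DictEntryRef. */
    static final class EntryRef {
        final int fileno;
        final long from;
        final long to;

        EntryRef(int fileno, long from, long to) {
            this.fileno = fileno;
            this.from = from;
            this.to = to;
        }
    }

    /** Keys are index keywords, values are the references collected so far. */
    private final Hashtable<String, List<EntryRef>> index = new Hashtable<>();

    /** Appends to the existing list for the keyword, or creates a new one. */
    boolean addIndex(String keyword, int fileno, long from, long to) {
        List<EntryRef> refs = index.get(keyword);
        if (refs == null) {
            refs = new ArrayList<>();
            index.put(keyword, refs);
        }
        refs.add(new EntryRef(fileno, from, to));
        return true; // never asks the parser to stop
    }

    /** Returns the references stored for a keyword, or null if there are none. */
    List<EntryRef> refs(String keyword) {
        return index.get(keyword);
    }

    public static void main(String[] args) {
        AddIndexSketch sketch = new AddIndexSketch();
        sketch.addIndex("JUNGE", 0, 90, 120);
        sketch.addIndex("JUNGE", 10, 30, 70);
        System.out.println(sketch.refs("JUNGE").size() + " references");
    }
}
```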
getEncoding
private static java.lang.String getEncoding(int aDictType)
- Returns the character encoding for the input dictionary file.
- Parameters:
aDictType
- dictionary type, one of the constants defined
in DictType
- Returns:
- encoding for the dictionary type.
conv
private static int conv(java.lang.String anSrcFile,
java.lang.String aDstFilePat,
java.lang.String anSrcEnc,
java.lang.String aDstEnc,
int maxsize)
throws java.io.IOException
- Copies the source dictionary file to a set of destination dictionary
files, changing its character encoding and splitting the original file.
The input file is assumed to be a text file in the correct encoding
with lines separated by CR, LF or both.
- Parameters:
anSrcFile
- the source file
aDstFilePat
- pattern for the destination files created.
Pattern according to MessageFormat.
anSrcEnc
- encoding of the source file
aDstEnc
- encoding of the destination file
maxsize
- size after which to create a new destination file.
- Returns:
- number of destination files created.
- Throws:
java.io.IOException
- on read or write errors.
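The re-encoding and splitting done by conv() might look roughly like
the following sketch. It is not the original implementation: it works
on byte arrays instead of files, assumes that a new part is started on
a line boundary once maxsize bytes have been written, and ends every
output line with LF. The class ConvSketch and the methods partName()
and split() are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.List;

class ConvSketch {
    /** Expands a MessageFormat pattern such as "deen.txt{0}" for part n. */
    static String partName(String pattern, int n) {
        // The number is passed as a string to avoid locale-dependent formatting.
        return MessageFormat.format(pattern, String.valueOf(n));
    }

    /** Re-encodes the source line by line and starts a new part after maxsize bytes. */
    static List<byte[]> split(byte[] source, Charset srcEnc, Charset dstEnc, int maxsize) {
        List<byte[]> parts = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(source), srcEnc))) {
            String line;
            while ((line = in.readLine()) != null) {   // accepts CR, LF or CR LF
                current.write((line + "\n").getBytes(dstEnc));
                if (current.size() >= maxsize) {
                    parts.add(current.toByteArray());
                    current = new ByteArrayOutputStream();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.size() > 0) {
            parts.add(current.toByteArray());          // last, possibly short part
        }
        return parts;
    }

    public static void main(String[] args) {
        byte[] src = "Junge\tboy\r\nM\u00e4dchen\tgirl\r\n".getBytes(StandardCharsets.ISO_8859_1);
        List<byte[]> parts = split(src, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8, 8);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println(partName("deen.txt{0}", i) + ": " + parts.get(i).length + " bytes");
        }
    }
}
```

Splitting only on line boundaries keeps every entry whole within one
part, which is why each part can end up slightly larger than maxsize.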
start
public final void start()
throws java.io.IOException
- Does the whole job.
- Creates the split dictionary files using conv()
- Creates internal data structures using ParserBase.read()
- Writes the index files and the index configuration file
using saveIdx()
- Throws:
java.io.IOException
- on read errors.
get
public final java.util.Hashtable get()
- Returns the index created during the import process.
- Returns:
- m_index
saveIdx
public final void saveIdx(int fileno)
throws java.io.IOException
- Creates the index files and the index configuration file.
Creates the files in the hash based directory structure
and the file idx.bin.
- Parameters:
fileno
- number of dictionary files created
- Throws:
java.io.IOException
- on write errors