William wrote:
>
> Thanks for the reply.
>
>> Have you also modified the index.noun file to account for your changes?
>
>> index.noun contains a list of byte offsets into data.noun, and any changes to
>> the latter mean the former is invalid.
>
> I have modified the index.noun too,
>
>> Alternatively, I wonder what platform you are working on? Records in the
>> WordNet files must be terminated by just a single "\x0A". If you are working
>> on a non-Unix platform that uses a multi-character record separator then the
>> records will be a different length, so invalidating the index file.
>
> I am working on Linux william-pc 2.6.24-16-generic #1 SMP Thu Apr 10 13:23:42
> UTC 2008 i686 GNU/Linux
>
> OK,
> I have to admit something: only today, after learning about the seek
> function, did I realize how the synset ID is actually determined. It is
> the byte offset that you mentioned. Before this I thought the synset ID
> was some kind of database auto-increment ID / primary key
> thing. lol.
>
> Now I realize that when I add, let's say, 3 characters to the first line
> and the seek function then tries seek(FH, 00001930, 0),
> I get
> g)\n00001930 03 n 01 physical_entity 0 007 @ 00001740 n 0000 ~ 00002452 n
> 0000 ~ 00002684 n 0000 ~ 00007347 n 0000 ~ 00020827 n 0000 ~ 00029677 n
> 0000 ~ 14580597 n 0000 | an entity that has physical existence
>
> 00001740 03 n 02 entity 0 003 ~ 00001930 n 0000 ~ 00002137 n 0000 ~ 04424418
> n 0000 | that which is perceived or known or inferred to have its own
> distinct existence (living or nonliving)
> 00001930 03 n 01 physical_entity 0 007 @ 00001740 n 0000 ~ 00002452 n 0000 ~
> 00002684 n 0000 ~ 00007347 n 0000 ~ 00020827 n 0000 ~ 00029677 n 0000 ~
> 14580597 n 0000 | an entity that has physical existence
>
> No wonder it's invalid.
>
> I wonder why the database is arranged in this way? Is it to make
> lookups faster? And what is the index.noun file used for
> when all the information in it is also in data.noun?
>
> So how can I add new synonyms to the WordNet database without
> invalidating the original byte offsets?
You clearly haven't come across file indexing before! Using seek() to jump
straight to a record is far faster than reading through the file from the start
until you find the data you need.
Using the file offset as a record ID is a good idea because
- it is bound to be unique
- it is easy to verify that the data hasn't been corrupted: each record begins
  with its own offset, so after a seek() you can check that the two match
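To make both points concrete, here is a minimal sketch. It is in Python purely for illustration, and read_record is a hypothetical helper, not part of any WordNet library. It relies only on what the records quoted above show: each one starts with its own 8-digit offset.

```python
def read_record(path, offset):
    """Seek straight to `offset` and return the record that starts there."""
    # Binary mode, because offsets are byte counts, not character counts.
    with open(path, "rb") as fh:
        fh.seek(offset)
        line = fh.readline()
    # Every record begins with its own 8-digit offset. If the bytes found
    # here do not match, the file has been edited without its offsets
    # being recomputed.
    if line[:8] != b"%08d" % offset:
        raise ValueError("no record starts at byte %d" % offset)
    return line.decode("utf-8").rstrip("\n")
```

With an unmodified data.noun, read_record("data.noun", 1930) should return the physical_entity record. After three characters are added to the first line, the same call lands in the tail of the previous record and raises ValueError.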
The separate index.noun file is there to make it quick to find all records in
data.noun that apply to a given word.
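As a rough sketch of how an index line leads to those records (Python for illustration only; offsets_for is a hypothetical name, and the field layout is my reading of the WordNet database documentation, so check it against your copy): each line starts with the lemma, the part of speech and a synset count, and ends with that many synset offsets.

```python
def offsets_for(index_line):
    """Return (lemma, [offsets]) for one line of an index file."""
    fields = index_line.split()
    lemma = fields[0]
    synset_cnt = int(fields[2])   # number of synsets containing the lemma
    # The offsets are the last `synset_cnt` fields of the line; each is
    # the byte position of one record in the matching data file.
    offsets = [int(field) for field in fields[-synset_cnt:]]
    return lemma, offsets
```

For a line such as "entity n 1 1 ~ 1 0 00001740" this gives one offset, 1740, which can be passed straight to seek() on data.noun.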
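Editing the database is a non-trivial task. You've found the documentation
already, so take a look at that and write something that allows you to move data
around while keeping the record IDs valid: every offset in data.noun, every
pointer that refers to one, and every entry in index.noun has to be updated
together.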
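One possible shape for such a tool, sketched in Python for illustration only (renumber and POINTER are hypothetical names, and the pointer layout is assumed from the records quoted above). The idea is to edit records freely while leaving each record's old ID in its first field as a label, then recompute every ID in one go. Because IDs are a fixed eight digits, rewriting them never changes a record's length, so the new positions can be worked out before anything is rewritten.

```python
import re

# Assumed pointer layout: "<symbol> <offset> <pos> <source/target>".
POINTER = re.compile(r"(\d{8}) ([nvars]) ([0-9a-f]{4})")

def renumber(lines, pos="n"):
    """Return (new_lines, old_to_new) with same-file offsets recomputed.

    `lines` are the file's lines in order, each ending in "\n".
    New records need a placeholder ID that no other record uses.
    """
    # Pass 1: map each record's old ID to its new byte position.
    old_to_new = {}
    position = 0
    for line in lines:
        if not line.startswith(" "):      # header lines start with spaces
            old_to_new[line[:8]] = "%08d" % position
        position += len(line.encode("utf-8"))

    def fix_pointer(match):
        offset, ptr_pos, rest = match.groups()
        if ptr_pos == pos:                # pointer into this same file
            offset = old_to_new[offset]   # KeyError exposes a dangling pointer
        return "%s %s %s" % (offset, ptr_pos, rest)

    # Pass 2: rewrite the ID field and the pointers, leaving the gloss alone.
    new_lines = []
    for line in lines:
        if line.startswith(" "):
            new_lines.append(line)
            continue
        data, sep, gloss = line.partition(" | ")
        data = old_to_new[data[:8]] + POINTER.sub(fix_pointer, data[8:])
        new_lines.append(data + sep + gloss)
    return new_lines, old_to_new
```

This covers only one data file. The returned old_to_new map would still have to be applied to index.noun and to any pointers in the other data files that refer to nouns, and the output must be written with bare "\x0A" line endings.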
Rob