WhiteDB command line utilities
==============================

wgdb - general database management
----------------------------------

This is a simple command-line utility that allows creating and freeing
a database, dump, import, run some tests and more.

Usage:

 wgdb [shmname] <command> [command arguments]

The shared memory name identifies the database to use and is an arbitrary
numeric value. If it is omitted, the default ("1000") will be used.

Commands commonly available:

 help (or "-h") - display this text.
 version (or "-v") - display libwgdb version.
 free - free shared memory.
 export [-f] <filename> - write memory dump to disk
       (-f: force dump even if unable to get lock)
 import [-l] <filename> - read memory dump from disk. Overwrites  existing
       memory contents (-l: enable logging after import).
 exportcsv <filename> - export data to a CSV file.
 importcsv <filename> - import data from a CSV file.
 replay <filename> - replay a journal file.
 test - run quick database tests.
 fulltest - run in-depth database tests.
 header - print header data.
 fill <nr of rows> [asc | desc | mix] - fill db with integer data.
 add <value1> .. - store data row (only int or str recognized)
 select <number of rows> [start from] - print db contents.
 query <col> "<cond>" <value> .. - basic query.
 del <col> "<cond>" <value> .. - like query. Matching rows are deleted from database.
 server [-l] [size b] - provide persistent shared memory for other processes (Windows).
        (-l: enable logging in the database).
 create [-l] [size] - create empty db of given size (non-Windows).
        (-l: enable logging in the database).

Importing and exporting data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Data may be exported to and imported from text files. This provides a way
to exchange data between WhiteDB on different platforms or other data sources.
The simplest format that is always available is CSV (comma separated values).

Since there is no straightforward mapping between most of WhiteDB types and
the CSV format and as CSV is not standardized, only limited support for
data types is available. The following data types are recognized when
importing from CSV:

 - NULL - empty string
 - int - plain integer
 - double - floating point number in fixed decimal notation
 - date - ISO8601 date
 - time - ISO8601 time+fractions of second.
 - string - input data that does not match the above types

The field separator is ',' (comma). The decimal point separator is '.' (dot).

WhiteDB may also provide RDF (if libraptor is available) and JSON (ongoing
development, undocumented) support.

Memory dumps and journal logs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the default configuration, WhiteDB runs in shared memory only. There
are two methods of providing data persistence: memory dumps, which are
snapshots of the current database contents and journal logs which contain
incremental updates to the database contents. Logs are only available
if the database is configured with the `./configure --enable-logging`
option and need to be explicitly enabled.

Managing memory dumps
^^^^^^^^^^^^^^^^^^^^^

A memory dump is the memory image of a WhiteDB database, saved into a file.
It stores everything except concurrency control and journal file metadata.
Images are not compatible between systems of different endianness or word
size (the most common case would probably be 32-bit vs 64-bit systems).
Also, different versions of the database library may use different format
for the memory image, therefore WhiteDB automatically refuses to import
an image made with a different library version.

Type `wgdb -v` to list the compatibility information of the database
library. It will display something like this:

  libwgdb version: 0.7.0
  byte order: little endian
  compile-time features:
  64-bit encoded data: yes
  queued locks: yes
  chained nodes in T-tree: yes
  record backlinking: yes
  child databases: no
  index templates: yes

Memory dumps are suitable for creating snapshots of the database for
backup purposes. Type

 wgdb export imagename.bin

to create a backup of the current database (since the shared memory name
was omitted, the `wgdb` utility will use "1000" by default).

 wgdb import imagename.bin

Will restore the image from disk. It will completely overwrite the current
memory contents. Note that if there is an existing memory segment, it needs
to be large enough to fit the image. Otherwise the import will fail with an
error message (and the shared memory segment will not be modified).

If the `wgdb` tool is unable to access the memory image (for example, due
to a programming error that causes the database to become permanently locked),
a "rescue" dump can be created with

 wgdb export -f rescue.bin

Note however, that in such cases care must be taken, as it is unknown what
type of errors the image may contain.

Managing journal logs
^^^^^^^^^^^^^^^^^^^^^

Journal logs provide a way of keeping a continuous backup of the database.
All of the changes to the shared memory are logged incrementally and the
journal logs may be played back to repeat all of those changes, bringing
the database to the same state as it was at the time the log file ended.

To enable journal logging, use

  ./configure --enable-logging --with-logdir=./logs

during the building of the database. Replace './logs' with wherever you'd
like the library to store the journal files. This location and the names
of the journal files are not changeable during runtime due to security
reasons. If you do not specify the log directory, '/tmp' (or '\windows\temp')
will be used by default, which may work for testing, but is probably
undesirable in the long term.

Journaling must be enabled at the time of database creation. To do this from
the command line, supply the `-l` switch to either the `wgdb create` command
or when importing an image with the `wgdb import` command. This will cause a
journal file to be created. It will be placed in 'logdir/wgdb.journal.shmname'
where shmname is the database name, for example, "1000". The journal is then
incrementally written, until one of the three things happens:

 1. A dump file is created, either by `wgdb export` or by calling the
    `wg_dump()` function
 2. A dump file is imported (again, by command line or API)
 3. A journal file is replayed.

In each of these cases, the current journal file is backed up, appending
'.0' to its name (or '.1', if '.0' already exists and so forth) and a fresh
journal file is started.

The first case can be considered normal usage, as it creates a snapshot of
the database. Since this snapshot contains everything that the journal has
logged up to this point, the journal is no longer necessary for recovery
and a fresh one will be used.

Importing the dump file will either be part of a recovery process or to
simply work with a new image. In either case, the previous journal has
become irrelevant to the current database contents, making it necessary
to start a new one.

Finally, a journal replay itself will be not logged in a journal. Therefore,
the database contents after the replay and the journal file that was in
use during the replay have become inconsistent, similarly than whan happens
with importing a memory dump. Generally, journal backups caused by these
recovery actions should be cleaned up or moved away. The user is expected
to handle this manually, case by case.

Journal log example
^^^^^^^^^^^^^^^^^^^

Assuming we use the './logs' directory to store the journal files and the
database has been compiled with journal support, let's start by enabling the
journal and adding data to the database:

 wgdb create -l
 wgdb 1011 add 1 2 3

The contents of the './logs' directory will now be:

 -rw-rw-rw- 1 user group 27 Dec  7 22:00 wgdb.journal.1011

And the contents of the database, by typing `wgdb 1011 select 10`:

 [1,2,3]

Let's create a memory dump of this database. Make sure you don't have
'example.bin' already, as it will be overwritten.

 wgdb 1011 export example.bin

Now we have these log files:

 -rw-rw-rw- 1 user group  4 Dec  7 22:04 wgdb.journal.1011
 -rw-rw-rw- 1 user group 27 Dec  7 22:00 wgdb.journal.1011.0

Note that the original file received the suffix '.0' and a new one was
created in it's place. Let's add more data:

 wgdb 1011 add Hello world

The database now contains:

 [1,2,3]
 ["Hello","world"]

Assume next that something destroyed our database. Try `wgdb 1011 free`.
We now know that we have a recent dump called 'example.bin' and some
journals:

 -rw-rw-rw- 1 user group 47 Dec  7 22:06 wgdb.journal.1011
 -rw-rw-rw- 1 user group 27 Dec  7 22:00 wgdb.journal.1011.0

The newest journal is what we're interested in. However, *keep in mind* that
importing the dump restarts the journal. Nothing will happen to our precious
log file, but in order to avoid confusion, let's move it somewhere safe
first:

 mv -i ./logs/wgdb.journal.1011 ./logs/recover.me.1011

Import the dump (note that the '-l' switch is used to re-enable journaling so
that our updates will be logged again after we've completed the recovery):

 wgdb 1011 import -l example.bin

Try to list the database contents (`wgdb 1011 select 10`):

 [1,2,3]

Finally, recover the log:

 wgdb 1011 replay ./logs/recover.me.1011

And the database contents after this will be

 [1,2,3]
 ["Hello","world"]

We managed to restore our latest state of the database by first importing
the dump file, then replaying the journal file that was written after
the dump was created. If we would now continue to modify the database, the
state could be recovered by importing 'example.bin' and then replaying
'recover.me.1011' and the latest journal file 'wgdb.journal.1011' in that
order.

The recovery process creates a number of intermediate journals which we
may now clean up by `rm ./logs/wgdb.journal.1011.?`. CAUTION: in this case,
we moved the log file that we cared about, away first. This is why this
command is safe to use here. In general, however, it is better to make
sure that the deleted log files do not contain anything not backed up
elsewhere. The easiest way would be to verify that the database is
healthy and create a fresh memory dump.

An automated backup procedure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A reasonable procedure that helps keeping track of image dumps and
journals could be implemented with a following script:

  1. dump the memory segment into a file such as 'backup.YYYYMMDD.shmname.bin'
  2. check that the dump was successful
  3. move 'logdir/wgdb.journal.shmname.0' to 'journal.YYYYMMDD.shmname'

This way, 'logdir' always contains only the current journal. Assuming that
YYYYMMDD represents the current date, memory dumps and journals can be
archived and accessed by date. If recovery is needed, the database
can be restored from the latest image and the current journal
in 'logdir/wgdb.journal.shmname' (which should first be archived separately).
Any journal backups in 'logdir' should then be removed after the restore
is successful, to ensure that step 3. archives the correct journal file
next time.

indextool - index management
----------------------------

`indextool` is for listing and managing database indexes. It takes the shared
memory name as an argument. If it is omitted, the default ("1000") will be
used. The `logtree` and `dumphash` commands are for debugging and may produce
a lot of output.

 indextool [shmname] createindex <column> - create ttree index
 indextool [shmname] createhash <columns> - create hash index (JSON support)
 indextool [shmname] dropindex <index id> - delete ttree index
 indextool [shmname] list - list all indexes in database
 indextool [shmname] logtree <index id> [filename] - log tree
 indextool [shmname] dumphash <index id> - print hash table
 
 
dserve - simple REST queries with json 
--------------------------------------

This is a simple REST service tool taking a query described with cgi parameters 
and printing json or csv output. 

`dserve` is useful both as a ready-made tool and as an example/template for making
your own tools for WhiteDB data handling. 
`dserve` is not compiled by default. Find it under the Examples folder as a 
single source file dserve.c, modify and compile yourself by doing:

  gcc dserve.c -o dserve -O2 -lwgdb
  
Copy the resulting executable under the cgi folder of the Apache server.

You can use `dserve` as a cgi program taking a few GET parameters like this

  http://myserver.com/cgi-bin/dserve?op=search&from=0&count=5

or as a command line program: on the command line simply pass the 
urlencoded query string as a single argument, for example

  dserve 'op=search&from=2&count=3'

This will print rows 2..4 of the database, like this:
----
content-length: 110
content-type: application/json

[
[1,1.1,"a simple string",[10,[],"point to me"],"2013-10-24","23:17:36.68"],
[12,"an:uri",-2000],
[23,-12]
]
----
Observe that the fourth field contains a pointer to another record, 
printed as a sublist. The same example with a small addition:

  dserve 'op=search&from=2&count=3&showid=yes'
  
The added `showid=yes` parameter prepends automatic record id-s 
(encoded offsets) to each record as a first element:  
----
content-length: 134
content-type: application/json

[
[23368,1,1.1,"a simple string",[23304,10,[],"point to me"],"2013-10-24","23:17:36.68"],
[23408,12,"an:uri",-2000],
[23432,23,-12]
]
----
The third example asks for three first records with field 0 less than 20:
----
dserve 'op=search&field=0&value=20&compare=lessthan&count=3'

content-length: 178
content-type: application/json

[
[10,[],"point to me"],
[0,0.1,"a simple string",[10,[],"point to me"],"2013-10-24","23:17:36.68"],
[1,1.1,"a simple string",[10,[],"point to me"],"2013-10-24","23:17:36.68"]
]
----

Dserve does not facilitate adding or updating records in the
database. We may add such a cgi utility in the future, but anyway, it is 
a good idea to write your own utilities which fit your own needs exactly.

Most of the dserve parameters are optional, i.e. have sensible defaults. Limits,
error messages etc are all configurable by changing the macro definitions at
the beginning of the C source.


Query by field values
~~~~~~~~~~~~~~~~~~~~~

You have to indicate the field numbers and the values
to compare the contents of these fields with. 
The whole set of cgi parameters is as follows:
 
  - *db* : numeric database name. Default 1000.
  - *op* : either `search` or `recids`. Here we assume `search`, 
    will cover `recids` later.
  - *field* : field number (0,1,2, etc) to check against the compared value.
  - *value* : value to check the field contents against. 
    Examples: value=32, value=sometext.
  - *type* : datatype of the value: null, int, double, str, char or record.
    Guessed from the value by default.
  - *compare* : equal, not_equal, lessthan, greater, ltequal or gtequal. 
    Default `equal`.
  - *from* : skip initial matching records, start from the result nr given here. 
    Default 0.
  - *count* : max number of matching records output. Default 10000.
  - *depth* : max depth of nested trees of records. Default 100, hard limit 10000.
    Set to 0 for hiding all sublists. 
    Modify the macro MAX_DEPTH_HARD to change the hard limit.
  - *format* : either `json` or `csv`. Default `json`.
  - *escape* : escaping special characters in json strings, 
    either `no` (replace nothing) , `url` (urlencode %, " and all 
    non-printable and non-ascii characters, i.e. under 32 and over 126, to be 
    completely ascii-safe) or `json` (not ascii-safe: replace only the 
    minimal set indicated in the rfc). 
    Default `json`. NB! this parameter has no effect for csv, 
    where only " gets replaced with "".
  - *showid : `yes` or `no`, if `yes`, print the record offset (automatic id) as the
    first element of each record, moving all the other fields one position to the right.
    Default `no`.

All the input values are assumed to be urlencoded.

The json output is a list of matching records. Each record is also presented as a list. 
The list elements are integers, doubles, strings or records (again represented as lists) 
pointed to from the field.  Null is printed as an empty list [].

All the other datatypes, like dates, times, URI-s, strings with a language attribute etc
in WhiteDB are converted to standard strings in a fairly intuitive manner. 
Blobs are url-encoded. For csv and too deep json branches the full internal record 
is represented by its offset (encoded pointer, i.e. automatic id).

Notice that for a graph database the json output can be a complex tree of records. Cycles are
possible, but can be inhibited by the depth parameter.

Errors are reported by printing an error string as a single element of the output list, both
for json and csv, like this:
----
content-length: 48
content-type: application/json

["unrecognized op: use op=search or op=recids"]
----
On Linux `dserve` should be able to free locks and detach the database even in case of
hard errors like segfaults. 

Importantly, you can give several sets of field/value/compare/count parameters to perform a 
complex and-query. Example:

  dserve 'op=search&field=1&value=100&type=int&compare=lessthan&field=2&value=10.3&type=double&compare=greater'

In case you use several fields in the query, you have to fill the otherwise optional type
and compare parameters for all the fields indicated. BTW, it is OK to put the same field
into several comparisons.

You can also have no fields in the query at all. In this case the full database will be traversed
and printed according to the parameters given.



Query by record id-s
~~~~~~~~~~~~~~~~~~~~

The second way to query is to simply indicate a list of record id-s (offsets: encoded record
pointers) like this:

  dserve 'op=recids&recids=12000,10236,22458'

You can learn the record id-s by using the `showid=yes` parameter described before.

The database selection parameter `db` and the generic formatting parameters 
`depth`,`format`,`escape` and `showid` function exactly as described above.


Good to know
~~~~~~~~~~~~

Things to note and useful code examples in dserve:

- does not require additional libraries except wgdb
- uses readlocks, does not use writelocks
- cgi parameter parsing
- various printing routines for WhiteDB values
- text is printed into a an automatically growing string buffer
- error and timeout handling with signals: when a signal arrives, free the readlock, 
  detach database and output ["internal error"] or ["timeout"]. Available on Linux only. 



