WARNING

This text was automatically converted from troff me macros to HTML. Information may have been lost, added, or changed in the process. Lars Aronsson and Lysator do not guarantee the correctness of this HTML document.

"SECTION 7 *- LARGE OBJECTS"

NAME

Large Object Interface *- interface to POSTGRES large objects

DESCRIPTION

In POSTGRES, data values are stored in tuples, and individual tuples cannot span multiple data pages. Since the size of a data page is 8192 bytes, the upper limit on the size of a data value is relatively low. To support the storage of larger atomic values, POSTGRES provides a "large object" interface. This interface provides file-oriented access to user data that has been explicitly declared to be a large type.

Version 4 of POSTGRES supports two different implementations of large objects. These two implementations allow users to trade off speed of access against transaction protection and crash recovery on large object data. Applications that can tolerate lost data may store object data in conventional files that are fast to access, but cannot be recovered in the case of system crashes. For applications that require stricter guarantees of durability, a transaction-protected large object implementation is available. This section describes the two implementations and the programmatic and query language interfaces to large object data.

Unlike the BLOB support provided by most commercial relational database management systems, POSTGRES allows users to define specific large object types. POSTGRES large objects are first-class objects in the database, and any operation that can be applied to a conventional (small) abstract data type (ADT) may also be applied to a large one. For example, two different large object types, such as image and voice , may be created. Functions that operate on image data, and other functions that operate on voice data, may be declared to the database system. The data manager will distinguish between image and voice data automatically, and will allow users to invoke the appropriate functions on values of each of these types. In addition, indices may be created large data values, or on functions of them. Finally, operators may be defined that operate on large values. Users may invoke these functions and operators from the query language. The database system will enforce type restrictions on large object data values.

The POSTGRES large object interface is modeled after the Unix file system interface, with analogs of open(), read(), write(), lseek(), etc. User functions call these routines to retrieve only the data of interest from a large object. For example, if a large object type called mugshot existed that stored photographs of faces, then a function called beard could be declared on mugshot data. Beard could look at the lower third of a photograph, and determine the color of the beard that appeared there, if any. The entire large object value need not be buffered, or even examined, by the beard function. As mentioned above, POSTGRES supports functional indices on large object data. In this example, the results of the beard function could be stored in a B-tree index to provide fast searches for people with red beards.

UNIX FILES AS LARGE OBJECT ADTS

The simplest large object interface supplied with POSTGRES is also the least robust. It does not support transaction protection, crash recovery, or time travel. On the other hand, it can be used on existing data files (such as word-processor files) that must be accessed simultaneously by the database system and existing application programs.

This implementation stores large object data in a UNIX file, and stores only the file name in the database. Importing a large object into the database is as simple as storing the file name in a distinguished large object name relation. Interface routines allow the database system to open, seek, read, write, and close these UNIX files by an internal large object identifier.

The functions lo_filein and lo_fileout convert between UNIX filenames and internal large object identifiers. These functions are POSTGRES registered functions, meaning they can be used directly in Postquel queries as well as from dynamically loaded C functions. If you are defining a simple large object ADT, these functions can be used as your input and output functions (see "define type" and the POSTGRES Manual sections concerning user-defined types for details).


char *lo_filein(filename)
 char *filename;

Import a new UNIX file storing large object data into the database system. This routine stores the filename in a large object naming relation and assigns it a unique large object identifier.

char * lo_fileout (object) LargeObject *object;

This routine returns the UNIX filename associated with a large object.

The file storing the large object must be accessible on the machine on which POSTGRES is running. The data is not copied into the database system, so if the file is later removed, it is unrecoverable.

Large objects are accessible from both the POSTGRES backend, using dynamically-loaded functions, and from the front-end, using the LIBPQ interface. These interfaces will be described in detail below.

INVERSION LARGE OBJECTS

In contrast to UNIX files as large objects, the Inversion large object implementation guarantees transaction protection, crash recovery, and time travel on user large object data. This implementation breaks large objects up into chunks and stores the chunks in tuples in the database. A B-tree index guarantees fast searches for the correct chunk number when doing random access reads and writes.

If a transaction that has made changes to an Inversion large object subsequently aborts, the changes are backed out in the normal way. Inversion large objects are stored in the database, and so are not directly accessible to other programs. Only programs that use the POSTGRES data manager can read and write Inversion large objects.

To use Inversion large objects, a new large object should be created using the LOcreat() interface, defined below. Afterwards, the name of the large object can be stored in an ordinary tuple.

The next section describes the programmatic interface to both UNIX and Inversion large objects.

BACKEND INTERFACE TO LARGE OBJECTS

Large object data is accessible from front-end programs linked with the LIBPQ library, and from dynamically-loaded routines that execute in the POSTGRES backend. This section describes access from dynamically loaded C functions.

Creating New Large Objects

The routine


int LOcreat(path, mode, objtype)
    char *path;
    int mode;
    int objtype;
creates a new large object.

The pathname is a slash-separated list of components, and must be a unique pathname in the POSTGRES large object namespace. There is a virtual root directory (``/'') in which objects may be placed.

The objtype parameter can be one of Inversion or Unix, which are symbolic constants defined in


~postgres/src/lib/H/catalog/pg_lobj.h
The interpretation of the mode argument depends on the objtype selected.

For UNIX files, mode is the mode used to protect the file on the UNIX file system. On creation, the file is open for reading and writing.

For Inversion large objects, mode is a bitmask describing several different attributes of the new object. The symbolic constants listed here are defined in


~postgres/src/lib/H/tmp/libpq-fs.h
The access type (read, write, or both) is controlled by OR'ing together the bits INV_READ and INV_WRITE. If the large object should be archived *- that is, if historical versions of it should be moved periodically to a special archive relation *- then the INV_ARCHIVE bit should be set. The low-order sixteen bits of mask are the storage manager number on which the large object should reside*.


* In the distributed version of POSTGRES, only the magnetic disk storage manager is supported. For users running POSTGRES at UC Berkeley, additional storage managers are available.

For sites other than Berkeley, these bits should always be zero. At Berkeley, storage manager zero is magnetic disk, storage manager one is a Sony optical disk jukebox, and storage manager two is main memory.

The commands below open large objects of the two types for writing and reading. The Inversion large object is not archived, and is located on magnetic disk:


unix_fd = LOcreat("/my_unix_obj", 0600, Unix);

inv_fd = LOcreat("/my_inv_obj", INV_READ|INV_WRITE, Inversion);

Opening Large Objects

Existing large objects may be opened for reading or writing by calling the routine


int LOopen(path, mode)
    char *path;
    int mode;
The path argument specifies the large object's pathname, and is the same as the pathname used to create the object. The mode argument is interpreted by the two implementations differently. For UNIX large objects, values should be chosen from the set of mode bits passed to the open system call; that is, O_CREAT, O_RDONLY, O_WRONLY, O_RDWR, and O_TRUNC. For Inversion large objects, only the bits INV_READ and INV_WRITE have any meaning.

To open the two large objects created in the last example, a programmer would issue the commands


unix_fd = LOopen("/my_unix_obj", O_RDWR);

inv_fd = LOopen("/my_inv_obj", INV_READ|INV_WRITE);

If a large object is opened before it has been created, then a new large object is created using the UNIX implementation, and the new object is opened.

Seeking on Large Objects

The command


int
LOlseek(fd, offset, whence)
    int fd;
    int offset;
    int whence;
moves the current location pointer for a large object to the specified position. The fd parameter is the file descriptor returned by either LOcreat or LOopen. Offset is the byte offset in the large object to which to seek. The only legal value for whence in the current release of the system is L_SET, as defined in <sys/files.h>.

UNIX large objects allow holes to exist in objects; that is, a program may seek well past the end of the object and write bytes. Intervening blocks will not be created; reading them will return zero-filled blocks. Inversion large objects do not support holes.

The following code seeks to byte location 100000 of the example large objects:


unix_status = LOlseek(unix_fd, 100000, L_SET);

inv_status = LOlseek(inv_fd, 100000, L_SET);

On error, LOlseek returns a value less than zero. On success, the new offset is returned.

Writing to Large Objects

Once a large object has been created, it may be filled by calling


int
LOwrite(fd, wbuf)
    int fd;
    struct varlena *wbuf;
Here, fd is the file descriptor returned by LOcreat or LOopen, and wbuf describes the data to write. The varlena structure in POSTGRES consists of four bytes in which the length of the datum is stored, followed by the data itself. The four length bytes include themselves.

For example, to write 1024 bytes of zeroes to the sample large objects:


struct varlena *vl;

vl = (struct varlena *) palloc(1028);
VARSIZE(vl) = 1028;
bzero(VARDATA(vl), 1024);

nwrite_unix = LOwrite(unix_fd, vl);

nwrite_inv = LOwrite(inv_fd, vl);

LOwrite returns the number of bytes actually written, or a negative number on error. For Inversion large objects, the entire write is guaranteed to succeed or fail. That is, if the number of bytes written is non-negative, then it equals VARSIZE(vl).

The VARSIZE() and VARDATA() macros are declared in the file


~postgres/src/lib/H/tmp/postgres.h

Reading from Large Objects

Data may be read from large objects by calling the routine


struct varlena *
LOread(fd, len)
    int fd;
    int len;
This routine returns the byte count actually read and the data in a varlena structure. For example,

struct varlena *unix_vl, *inv_vl;
int nread_ux, nread_inv;
char *data_ux, *data_inv;

unix_vl = LOread(unix_fd, 100);
nread_ux = VARSIZE(unix_vl);
data_ux = VARDATA(unix_vl);

inv_vl = LOread(inv_fd, 100); nread_inv = VARSIZE(inv_vl); data_inv = VARDATA(inv_vl);

The returned varlena structures have been allocated by the POSTGRES memory manager palloc, and may be pfreed when they are no longer needed.

Closing a Large Object

Once a large object is no longer needed, it may be closed by calling

int
LOclose(fd)
    int fd;
where fd is the file descriptor returned by LOopen or LOcreat. On success, LOclose returns zero. A negative return value indicates an error.

For example,


if (LOclose(unix_fd) < 0)
    /* error */;

if (LOclose(inv_fd) < 0) /* error */

LIBPQ LARGE OBJECT INTERFACE

Large objects may also be accessed from database client programs that link the LIBPQ library. This library provides a set of routines that support opening, reading, writing, closing, and seeking on large objects. The interface is similar to that provided via the backend, but rather than using varlena structures, a more conventional UNIX-style buffer scheme is used.

In version 4 of POSTGRES, large object operations must be enclosed in a transaction block. This is true even for UNIX large objects, which are not transaction-protected. This is due to a shortcoming in the memory management scheme for large objects, and will be rectified in version 4.1. The end of this section shows a short example program that correctly transaction-protects its file system operations.

This section describes the LIBPQ interface in detail.

Creating a Large Object

The routine


int
p_creat(path, mode, objtype)
    char *path;
    int mode;
    int objtype;
creates a new large object. The path argument specifies a large-object system pathname.

The objtype parameter can be one of Inversion or Unix, which are symbolic constants defined in


~postgres/src/lib/H/catalog/pg_lobj.h
The interpretation of the mode argument depends on the objtype selected.

For UNIX files, mode is the mode used to protect the file on the UNIX file system. On creation, the file is open for reading and writing.

For Inversion large objects, mode is a bitmask describing several different attributes of the new object. The symbolic constants listed here are defined in


~postgres/src/lib/H/tmp/libpq-fs.h
The access type (read, write, or both) is controlled by OR'ing together the bits INV_READ and INV_WRITE. If the large object should be archived *- that is, if historical versions of it should be moved periodically to a special archive relation *- then the INV_ARCHIVE bit should be set. The low-order sixteen bits of mask are the storage manager number on which the large object should reside. For sites other than Berkeley, these bits should always be zero. At Berkeley, storage manager zero is magnetic disk, storage manager one is a Sony optical disk jukebox, and storage manager two is main memory.

The commands below open large objects of the two types for writing and reading. The Inversion large object is not archived, and is located on magnetic disk:


unix_fd = p_creat("/my_unix_obj", 0600, Unix);

inv_fd = p_creat("/my_inv_obj", INV_READ|INV_WRITE, Inversion);

Opening an Existing Large Object

To open an existing large object, call


int
p_open(path, mode)
    char *path;
    int mode;

The path argument specifies the large object pathname for the object to open. The mode bits control whether the object is opened for reading, writing, or both. For UNIX large objects, the appropriate flags are O_CREAT, O_RDONLY, O_WRONLY, O_RDWR, and O_TRUNC. For Inversion large objects, only INV_READ and INV_WRITE are recognized.

If a large object is opened before it is created, it is created by default using the UNIX file implementation.

Writing Data to a Large Object

The routine


int
p_write(fd, buf, len)
    int fd;
    char *buf;
    int len;
writes len bytes from buf to large object fd. The fd argument must have been returned by a previous p_creat or p_open.

The number of bytes actually written is returned. In the event of an error, the return value is negative.

Reading Data from a Large Object

The routine


int
p_read(fd, buf, nbytes)
    int fd;
    char *buf;
    int nbytes;
reads nbytes bytes into buffer buf from the large object descriptor fd. The number of bytes actually read is returned. In the event of an error, the return value is less than zero.

Seeking on a Large Object

To change the current read or write location on a large object, call


int
p_lseek(fd, offset, whence)
    int fd;
    int offset;
    int whence;
This routine moves the current location pointer for the large object described by fd to the new location specified by offset. For this release of POSTGRES, only L_SET is a legal value for whence.

Closing a Large Object

A large object may be closed by calling


int
p_close(fd)
    int fd;
where fd is a large object descriptor returned by p_creat or p_open. On success, p_close returns zero. On error, the return value is negative.

SAMPLE LARGE OBJECT PROGRAMS

The POSTGRES large object implementation serves as the basis for a file system (the Inversion file system) built on top of the data manager. This file system provides time travel, transaction protection, and fast crash recovery to clients of ordinary file system services. It uses the Inversion large object implementation to provide these services.

The programs that comprise the Inversion file system are included in the POSTGRES source distribution, in directories


$POSTGRESHOME/test/postfs
$POSTGRESHOME/test/postfs.usr.bin
These directories contain a set of programs for manipulating files and directories. These programs are based on the Berkeley Software Distribution NET-2 release.

These programs are useful in manipulating inversion files, but they also serve as examples of how to code large object accesses in LIBPQ. All of the programs are LIBPQ clients, and all use the interfaces that have been described in this section.

Interested readers should refer to the files in the postfs directories for in-depth examples of the use of large objects. Below, a more terse example is provided. This code fragment creates a new large object managed by Inversion, fills it with data from a UNIX file, and closes it.


#include "tmp/c.h"
#include "tmp/libpq-fe.h"
#include "tmp/libpq-fs.h"
#include "catalog/pg_lobj.h"

#define MYBUFSIZ 1024

main()
{
 int inv_fd;
 int fd;
 char *qry_result;
 char buf[MYBUFSIZ];
 int nbytes;
 int tmp;

 PQsetdb("mydatabase");

 /* large object accesses must be */
        /* transaction-protected         */
 qry_result = PQexec("begin");

 if (*qry_result == 'E') /* error */
  exit (1);

 /* open the unix file */
 fd = open("/my_unix_file", O_RDONLY, 0666);
 if (fd < 0) /* error */
  exit (1);

 /* open the inversion file */
 inv_fd = p_open("/inv_file", INV_WRITE, Inversion);
 if (inv_fd < 0) /* error */
  exit (1);

 /* copy the unix file to the inversion */
        /* large object                        */
 while ((nbytes = read(fd, buf, MYBUFSIZ)) > 0)
 {
  tmp = p_write(inv_fd, buf, nbytes);
  if (tmp < nbytes) /* error */
   exit (1);
 }

 (void) close(fd);
 (void) close(inv_fd);

 /* commit the transaction */
 qry_result = PQexec("end");

 if (*qry_result == 'E') /* error */
  exit (1);

 /* by here, success */
 exit (0);
}

BUGS

Shouldn't have to distinguish between Inversion and UNIX large objects when you open an existing large object. The system knows which implementation was used. The flags argument should be the same in these two cases.

SEE ALSO

define type(commands), define function(commands), load (commands).