Short analysis of the organization of the storage directory

Created by: Lester Caine, Last modification: 14 Jul 2008 (11:31 UTC)
This page tries to pull together the various documents on how 'storage' is managed. The starting point is probably Kernel_Storage, but that is now somewhat out of date, and its replacement, StoragePackage, does not have any content as yet. Tutorial - Liberty Plugins describes the R2 basis for storage management, which used the LibertyAttachments storage plugins, but these were ripped out in R2.1 and replaced by the LibertyMime mime plugins.
Before adding any documentation to StoragePackage, I think it is worth outlining how the system currently works and the alternatives that could be included.
Basically, any file that needs to be accessed in addition to the content generated via LibertyPackage is stored in 'storage', and this area needs to be visible in addition to the site scripts. One of the design goals is that the 'storage' area could be at a different URL to the main site, allowing load sharing over multiple machines. This adds a level of complexity, but it is essentially easy to manage as long as the storage path is generated using the core tools. As long as the path to the data is visible from the server, anything is possible, and duplicating the data across multiple machines is also practical.
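As a rough illustration of why generating every storage link through one helper keeps the 'different URL' option cheap, here is a minimal Python sketch. It is not the bitweaver core code: the function name storage_url, the example host names and the example relative path are all hypothetical, chosen only to show that moving storage to another machine changes one base setting and nothing else.

```python
from urllib.parse import urljoin


def storage_url(storage_base_url: str, relative_path: str) -> str:
    """Join a configured storage base URL and a path relative to 'storage'.

    Hypothetical helper for illustration only; the real core tools differ.
    """
    # Force exactly one slash between base and path so urljoin appends
    # the relative path instead of replacing the last base segment.
    base = storage_base_url.rstrip("/") + "/"
    return urljoin(base, relative_path.lstrip("/"))


# Storage served from the main site (assumed host name).
print(storage_url("https://www.example.org/storage", "example/file.pdf"))

# Storage moved to a separate machine: only the base setting changes.
print(storage_url("https://files.example.org/storage/", "/example/file.pdf"))
```

Any code that builds its own links by hand would have to be found and changed separately, which is the complexity the core tools are meant to hide.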
HOW data is located in storage is a little anarchic at the moment, and this causes problems at times. The main storage mechanism is via Liberty and is now handled by the mime plugins, but some packages have their own areas and store 'private' data there, using the core liberty code directly. Things like images for articles and for users bypass some of the basic framework simply because they are not directly linked to a content_id.
Most of this has come about because different developers view things differently and there has been no base plan to work to. The underlying problem is that each of us is used to different operating methods and tends to follow the way things are done on other projects. So what I am proposing is a set of guidelines that will help future development.
LibertyAttachments (LA) and LibertyMime (LM) produce a complex storage plan which is ideal for some types of site but causes difficulty with others. All of the files relating to an attachment are stored in a single sub-directory, identified by the file_id, at the lowest level. These directories are then stored in a tree structure to prevent any one directory getting too big. The basis of the existing structure is the idea of keeping uploads from each user ring-fenced, so you need to know who uploaded an attachment before you can find the sub-directory it is stored in. This is fine on one level, since the liberty_files table will give you the location of the directory, but it is only really practical for sites where the bulk of the content relates to each user. LM has added a further complication by including the file type in the tree, which can be misleading when other file formats are also stored with the base file. This layout is currently provided by mime.default.php and is calculated as a 5 level tree:
'users' / user_id%1000 / user_id / <type> / file_id / <files>
and the only way of finding the data relating to a particular attachment is by a lookup in the combined liberty_attachments and liberty_files tables.
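The 5 level tree above can be sketched in a few lines of Python. This is only an illustration of the path arithmetic as described on this page, not the code in mime.default.php; the function name default_branch and the example ids are hypothetical.

```python
from pathlib import PurePosixPath


def default_branch(user_id: int, file_type: str, file_id: int) -> PurePosixPath:
    """Relative directory for one attachment in the per-user layout.

    Mirrors: 'users' / user_id%1000 / user_id / <type> / file_id
    """
    return PurePosixPath(
        "users",
        str(user_id % 1000),  # bucket that keeps the top level small
        str(user_id),         # the uploader must be known to get this far
        file_type,            # type level added by LM
        str(file_id),         # one directory per attachment
    )


print(default_branch(1234, "pdf", 56789))  # users/234/1234/pdf/56789
```

Note that three separate values are needed to build the path, which is why a file_id on its own is not enough and the table lookup cannot be avoided.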
One setup where this configuration fails is where a small number of users create the data for a large number of users to access. A single user uploading 90000 PDF files will find them all referenced from the one pdf directory, which is not very practical when trying to access those PDFs.
mime.flatdefault.php provides a 'flatter' but more consistent storage model based purely on the file_id, and this can be tailored to limit the maximum number of sub-directories in any one branch. The limit is currently set to 1000, which gives:
'attachments' / file_id%1000 / file_id
Any data can be found simply by knowing the file_id.
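A minimal Python sketch of the flat layout follows, together with a quick count of how the 90000 uploads from the earlier example would spread out. It is an illustration only, not the code in mime.flatdefault.php; the names flat_branch and BRANCH_LIMIT are hypothetical, and the count assumes the file_ids are handed out sequentially.

```python
from collections import Counter
from pathlib import PurePosixPath

BRANCH_LIMIT = 1000  # the limit quoted in the text above


def flat_branch(file_id: int, limit: int = BRANCH_LIMIT) -> PurePosixPath:
    """Relative directory for one attachment: 'attachments' / file_id%limit / file_id."""
    return PurePosixPath("attachments", str(file_id % limit), str(file_id))


print(flat_branch(56789))  # attachments/789/56789

# Assumption: 90000 uploads with sequential ids 1..90000.
per_bucket = Counter(file_id % BRANCH_LIMIT for file_id in range(1, 90001))
print(len(per_bucket), max(per_bucket.values()))  # 1000 buckets, 90 directories in each
```

Under that assumption no directory holds more than 90 entries, compared with 90000 in the single pdf directory of the per-user layout.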
A related change that is also implemented in flatdefault is the 'flattening' of the identification ids. Rather than having separate sequence generators for content_id, attachment_id and file_id, a single generator is used, so every file_id matches the attachment_id, and both match the content_id if there is a linked content item. The advantage of this is that any attached data can be found simply by looking up the content_id in storage, since this matches the file_id of the directory. A simple list of thumbnails can be accessed without needing to produce complex sql queries.
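To show what 'looking up the content_id in storage' might look like in practice, here is a small Python sketch that lists the files for a set of content items using nothing but the filesystem. It assumes the shared sequence described above, so content_id and file_id are the same number. The function names and the thumb.jpg file name are hypothetical, and the demo builds its own temporary storage tree rather than touching a real site.

```python
import tempfile
from pathlib import Path

BRANCH_LIMIT = 1000  # same limit as the flat layout


def attachment_dir(storage_root: Path, content_id: int) -> Path:
    """Directory for a content item, assuming content_id == file_id."""
    return storage_root / "attachments" / str(content_id % BRANCH_LIMIT) / str(content_id)


def list_files(storage_root: Path, content_ids):
    """Map each content_id that has a storage directory to its file names."""
    found = {}
    for content_id in content_ids:
        directory = attachment_dir(storage_root, content_id)
        if directory.is_dir():
            found[content_id] = sorted(p.name for p in directory.iterdir() if p.is_file())
    return found


def make_demo_storage(storage_root: Path) -> None:
    """Create two attachment directories, each holding a hypothetical thumbnail."""
    for content_id in (2001, 2002):
        directory = attachment_dir(storage_root, content_id)
        directory.mkdir(parents=True)
        (directory / "thumb.jpg").touch()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    make_demo_storage(root)
    # 2003 has no attachment directory, so it is simply absent from the result.
    print(list_files(root, [2001, 2002, 2003]))
```

No database query is involved at any point; a content item without attached data just has no directory.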
This storage mechanism is also used in BitcommercePackage to store images related to the products on offer. The 'attachment' itself is not used to store the images, because it may well hold a program, document or suchlike that is being sold. Even in THAT case, though, I can't see why the treasury facility to include a different thumbnail with a product is not used, so that the one attachment folder could be used. One of the changes that I have made to commerce is to use images already loaded in fisheye/treasury as the source of these product images. Accessing these via LM would involve several additional sql queries, while the zencart code avoids that by simply looking up the picture number in its own flat format storage area. So by using flatdefault with a LINKED_ATTACHEMENT, commerce can be tidily integrated into bitweaver, so that fisheye or treasury can be used to provide management data relating to the images used elsewhere.
To complete THIS model, a switch to disable 'unlinked' uploads is useful but not essential. By restricting uploading to controlled areas such as a fisheye gallery, all content can be managed and authenticated before it is publicly available for use in other content.