Using collection indexes to find items
January 24th, 2006 at 3:03 pm (2 years, 3 months ago) by Andi Vajda under chandlerdbDuring last week’s sprints I was asked to see how easy it was to import mail into Chandler from a local mailbox. Thanks to Python’s mailbox and email packages, the mailbox parsing was trivial. Similarly, Chandler’s domain model can represent email items and has a number of APIs that make creating such items from a raw email string very easy.
Unexpectedly, importing mail messages was very slow however, and was getting worse as more messages were added to the repository. The first cause of slowness was easily resolved by turning off ‘bz2′ compression of all the text LOBs being created.
The second cause of slowness was more surprising though. It had to do with looking up existing email address items from the email addresses occurring in the headers of the messages being imported. The EmailAddress item class defined here
has a class method called getEmailAddress() that will linearly iterate all EmailAddress instances every time such an item is sought. The mailbox I was trying to import contains about 6,500 mails with 800 unique email addresses. Linearly searching them won’t scale.
In the early days of the repository we had planned on having a query system and Ted even implemented a query language not unlike one you’d encounter in an SQL database. Last year, while working on Chandler 0.6, it occurred to us that a formal query language in a Python-based system was somewhat redundant and that we should be using computed collections instead of queries altogether to accomplish the same goals . Hence abstract sets and their item wrappers were implemented.
By combining abstract sets, collections and indexes - something I had added to bi-directional reference collections during the Chandler 0.5 release - we realized we could get very decent item look-up performance as well as cache the computed aspect of abstract sets making membership tests and iteration also considerably faster.
This technique can be illustrated with the example below taken from the EmailAddress index I added during last week’s sprints.
- First, a collection needs to be setup. This can be as simple as a bi-directional reference collection, an abstract collection of all items of a given kind or a more complicated computed collection combining or filtering other collections. In the EmailAddress example, I just created a KindCollection instance based on the EmailAddress kind. In the app parcel I added yet another collection called
emailAddressCollection.
emailAddressCollection = \ KindCollection.update(parcel, 'emailAddressCollection', kind=pim.mail.EmailAddress.getKind(view), recursive=True) - Then, a sorted index needs be added to the collection. I wanted to be able to find existing email addresses such that no two email address items in the collection have the same lowercase internet email address string. For this purpose, I added the
_compareAddr()method to the EmailAddress class in the mail parcel:
def _compareAddr(self, other): return cmp(self.emailAddress.lower(), other.emailAddress.lower())
and the index creation call after the code creating the collection:emailAddressCollection.rep.addIndex('emailAddress', 'compare', compare='_compareAddr')which creates a ‘compare’ index called ‘emailAddress’, an index calling a method called ‘_compareAddr’ on the items it is comparing.
- And finally to use this collection and the index to find items I added a
findEmailAddress()class method to the EmailAddress class:@classmethod def findEmailAddress(cls, view, emailAddress): collection = schema.ns("osaf.app", view).emailAddressCollection.rep emailAddress = emailAddress.lower() def compareAddr(key): return cmp(emailAddress, view[key].emailAddress.lower()) uuid = collection.findInIndex('emailAddress', 'exact', compareAddr) if uuid is None: return None return view[uuid]The
findInIndex()call looks for an exact match as per thecompareAddrlocal function which compares lowercase versions of internet address strings.








