With the framework being in a stable form now and Siegfrieds work on a “dataprovider register” (he will blog about it soon i hope) I decided to take a stab at something new that has not really been used on the desktop before except in GNOME-Do.
However what we are doing is a bit more complicated.
With Zeitgeist knowing about your activities and events i tried to apply the classic “A Priori” Algorithm to detect associated files based on usage.
Unlike our current implementation of finding files related to other explicit given files. This one looks at a whole timerange and creates sets of files used together. It is kind of like detecting usage patterns.
Taken from wikipedia here is a little example:
A large supermarket tracks sales data by SKU (item), and thus is able to know what items are typically purchased together. Apriori is a moderately efficient way to build a list of frequent purchased item pairs from this data. Let the database of transactions consist of the sets {1,2,3,4}, {2,3,4}, {2,3}, {1,2,4}, {1,2,3,4}, and {2,4}. Each number corresponds to a product such as “butter” or “water”. The first step of Apriori to count up the frequencies, called the supports, of each member item separately:
Item Support
1 | 3
2 | 6
3 | 4
4 | 5
We can define a minimum support level to qualify as “frequent,” which depends on the context. For this case, let min support = 3. Therefore, all are frequent. The next step is to generate a list of all 2-pairs of the frequent items. Had any of the above items not been frequent, they wouldn’t have been included as a possible member of possible 2-item pairs. In this way, Apriori prunes the tree of all possible sets..
Item Support
{1,2} | 3
{1,3} | 2
{1,4} | 3
{2,3} | 4
{2,4} | 5
{3,4} | 3
This is counting up the occurrences of each of those pairs in the database. Since minsup=3, we don’t need to generate 3-sets involving {1,3}. This is due to the fact that since they’re not frequent, no supersets of them can possibly be frequent. Keep going:
Item Support
{1,2,4} | 3
{2,3,4} | 3
In much larger datasets, especially those with huge numbers of items present in low quantities and small numbers of items present in large quantities, the gains made by pruning the possible pairs tree like this can be very large.
Now imagine the supermarket being all files on ur computer and the transactions being sets of events happening to the files per 30 minutes.
Using this we will be able to determine alot of cool associations between files and with some Tracker magic we could do this on a metadata level.
Now i pushed the implementation with this test case and another into a branch waiting for the team to review it and expose it properly over D-Bus. The test cases work though which kinda gets me excited to work on new algorithms and generate adaptive transaction since right now i assume every 30 minutes is a transaction. I have a lot of ideas and also I will be applying two other algorithms Winepi and Minepi to compare results.
Expect some cool stuff to come up soon…
Releases of the engine and GAJ are being prepared…
Cheers
That is awesome! Well done!
This sort of thing is totally my definition of cool. Well, as cool as it gets in the Ubuntu world. Which is cool for me, but for most others probably incredibly boring haha
Seif, why are you linking codethink.co.uk for Tracker? Codethink is not the only company working on Tracker. Codeminded AND Lanedo have core developers there too.
I really like what you are doing with zeitgeist and certainly these kind of functionalities.
The supermarket idea is a nice data mining example but is not fully comparable with event streams.
Because you assume that a transaction occurs every 30 minutes it is possible that events of different tasks will occur in one transaction.
I believe a better approach is to use instead algorithms for frequent episodes.
The bigger the timerange the lesser the errors.
We are going to use an adaptive bucket size estimation based on the density of events. I think this would help out alot
Yo sup bro.. this was a cool article
I actually got it ^_^ But I still think this will take a long ass time if you got a proper database. But that’s just because it’s pretty much an NP-complete problem, right? I did some pruning last year when we built a computer AI for the orthello game. If I recall correctly it gave us like a 2-3 time speed up there
But I can imagine that with this type of problem pruning can even be more rewarding, especially when working on the data set u described.
Peace out bro
Hey
If you want i can show you the code and u can help us find a better way of pruning! right now i prune by looking just for a minimum tolerance for the frequency of item and items in transactions. I think there can be a better way to do so
Hi Seif, nice job!
Just wanted to mention that apriori is not the best choice, since it isn’t particularly developed for stream data mining. In [1] you will find some recent approaches to stream rule mining. There is also some work in clustering analysis of stream data that might be interesting for the project.
Jiang, N. and Gruenwald, L. 2006. Research issues in data stream association rule mining. SIGMOD Rec. 35, 1 (Mar. 2006), 14-19. DOI= http://doi.acm.org/10.1145/1121995.1121998