The first piece of functionality I implemented in #dindexer is the part that looks for files and directories and calculates their hashes. That’s indeed the heart of the project, and almost everything else is built around that central idea. It all happens in the `hash_dir()` function, and “all” is definitely too much. Let me explain: in spite of its name, that function does many other things, such as traversing the directory tree (literally the traversal logic itself) and getting the mime type of files. That’s no surprise, since the project is very young and it’s still moving away from being a prototype. Over the past months I’ve been adding new values to the DB and new features to the program, and more than once that central function was the quickest place to put the new things. Now I’m at the point where dindexer as a concept is working for me, so I’m planning to keep the development going and add new features (which I will discuss here in the future).

This looks like the right moment to refactor that code, and I had to think about how best to do it. I should mention at this point that I’m considering the possibility of using more than one hashing algorithm for each item being indexed in order to minimize the collision probability, but this is a larger topic and I’m not so sure about the whole idea anymore, so let’s leave this discussion for another time. Anyway, with that in mind, my first idea was to implement the different operations that `hash_dir()` is currently doing as jobs, in order to take advantage of multi-core systems. Running several hashing algorithms in parallel sounded like a good idea, but a friend of mine talked me out of that, so it’s going to stay single-threaded and single-hash for now.

The next idea I had was to split the disk scanning process into tasks, and have a manager execute whatever tasks are registered with it. That’s very convenient because it will trim a lot of crap out of `main()` and will also let me easily add or remove tasks from the manager (for example, if I add command line switches that enable or disable parts of the scanning process). My very first approach was just as described: a manager plus a base task class with a virtual `run_task()` method.
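To make that first approach a bit more concrete, here is a minimal sketch of what it could look like. Only `run_task()` comes from the description above; `Task`, `TaskManager`, `LogTask` and `run_scan()` are hypothetical names made up for illustration, and the tasks just record that they ran instead of doing any real scanning.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class: every task exposes the same virtual entry point.
class Task {
public:
    virtual ~Task() = default;
    virtual void run_task() = 0;
};

// Hypothetical manager: owns the registered tasks and runs them in order.
class TaskManager {
public:
    void add_task(std::unique_ptr<Task> task) {
        assert(task != nullptr);
        m_tasks.push_back(std::move(task));
    }

    void run_all() {
        for (auto& task : m_tasks)
            task->run_task();
    }

private:
    std::vector<std::unique_ptr<Task>> m_tasks;
};

// Stand-in task: appends its name to a log instead of doing real work.
class LogTask : public Task {
public:
    LogTask(std::string name, std::vector<std::string>& log) :
        m_name(std::move(name)), m_log(log) {}

    void run_task() override { m_log.push_back(m_name); }

private:
    std::string m_name;
    std::vector<std::string>& m_log;
};

// Registers the hashing task only when asked to, the way a command line
// switch might, then returns the names of the tasks that actually ran.
std::vector<std::string> run_scan(bool with_hashing) {
    std::vector<std::string> log;
    TaskManager manager;
    manager.add_task(std::make_unique<LogTask>("list_files", log));
    if (with_hashing)
        manager.add_task(std::make_unique<LogTask>("hash_files", log));
    manager.run_all();
    return log;
}
```

Note that nothing in this sketch stops you from registering `hash_files` without `list_files`: the manager has no idea that one task needs the output of another.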

Still, there is a bit of dependency management involved. For example, hashing files requires a list of files to go through in the first place, and detecting the content type of a disk needs both that same list and the media type. Those are hard dependencies, so it’s not like you can just skip one task and still expect everything to work. My way of keeping the task-based approach while still enforcing compulsory dependencies is to give up on the task manager and the common base class, and just have a completely different class for each task. This turns out to be very convenient since each task produces something different (a list of files, an enum, a list of hashes...), and I can have each task require the tasks it depends on at construction time, so I get build errors if some key dependency is missing. How about the manager class? That’s not needed anymore either. Once I’ve instantiated the last object in the task chain, the one that returns the full list of data to be sent to the DB, I just have to call its `get_or_create()` method and it will go up the chain and collect all the bits and pieces it needs to do its own part.
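Here is a rough sketch of that idea with just two tasks. Apart from `get_or_create()`, every name here (`DirTree`, `Hashing`, the `FileList` alias) is an assumption of mine for the sake of the example, and the "hash" is only `std::hash` applied to the path string, a stand-in for hashing the real file contents.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias: the list of paths found on disk.
using FileList = std::vector<std::string>;

// Hypothetical first task in the chain. A real one would walk the
// directory tree; this stand-in just returns the list it was given.
class DirTree {
public:
    explicit DirTree(FileList files) : m_files(std::move(files)) {}

    const FileList& get_or_create() { return m_files; }

private:
    FileList m_files;
};

// Hypothetical second task. It cannot be constructed without a DirTree,
// so forgetting the dependency is a compile error, not a runtime surprise.
class Hashing {
public:
    explicit Hashing(DirTree& tree) : m_tree(tree) {}

    // Pulls the file list from the dependency the first time it is
    // called, then keeps returning the cached result.
    const std::vector<std::size_t>& get_or_create() {
        if (!m_done) {
            for (const auto& path : m_tree.get_or_create())
                m_hashes.push_back(std::hash<std::string>{}(path));
            m_done = true;
        }
        return m_hashes;
    }

private:
    DirTree& m_tree;
    std::vector<std::size_t> m_hashes;
    bool m_done = false;
};
```

With this layout, calling `get_or_create()` on the `Hashing` object is all it takes: it asks the `DirTree` for its list on its own, which is why no manager is needed to drive the chain.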

That’s still not the entire story: some parts of the new tasks could still benefit from being put into a common base class, and I still need to be able to swap tasks for unit testing. For example, let’s say I want to test the content-detection functions, which depend on having a list of files and a given media type. For the sake of clarity, let’s say I want to test whether video DVDs are being detected correctly. I will need a list of files containing a VIDEO_TS and an AUDIO_TS directory, plus some VOB, IFO etc., and I need the media type to be a DVD. The base class I came up with is templated over the return type of its `get_or_create()` method, so by declaring the constructor in `scantask::ContentType` as `ContentType ( Base<FileList>&, Base<MediaTypes>& )` I leave the way open to replacing the tasks above it in the dependency tree with whatever a test needs.
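As a sketch of how that could work in a test: `Base`, `ContentType`, `FileList`, `MediaTypes` and the constructor signature are the ones mentioned above, while everything else is assumed for illustration, namely the enum values, the `ContentTypes` result type, returning by const reference, the `Fixed` test double, and the deliberately naive detection rule (a DVD with a VIDEO_TS entry).

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

namespace scantask {
    using FileList = std::vector<std::string>;
    enum class MediaTypes { CDRom, DVD, Other };      // assumed values
    enum class ContentTypes { Generic, VideoDVD };    // assumed result type

    // Base class templated over what get_or_create() returns.
    template <typename T>
    class Base {
    public:
        virtual ~Base() = default;
        virtual const T& get_or_create() = 0;
    };

    // Depends only on the Base<> interfaces, not on concrete tasks.
    class ContentType : public Base<ContentTypes> {
    public:
        ContentType(Base<FileList>& files, Base<MediaTypes>& media) :
            m_files(files), m_media(media) {}

        // Naive stand-in rule: a DVD holding a VIDEO_TS entry is a video DVD.
        const ContentTypes& get_or_create() override {
            const FileList& files = m_files.get_or_create();
            const bool has_video_ts =
                std::find(files.begin(), files.end(), "VIDEO_TS") != files.end();
            const bool is_dvd = (m_media.get_or_create() == MediaTypes::DVD);
            m_result = (is_dvd && has_video_ts) ?
                ContentTypes::VideoDVD : ContentTypes::Generic;
            return m_result;
        }

    private:
        Base<FileList>& m_files;
        Base<MediaTypes>& m_media;
        ContentTypes m_result = ContentTypes::Generic;
    };

    // Hypothetical test double: a "task" that returns a canned value.
    template <typename T>
    class Fixed : public Base<T> {
    public:
        explicit Fixed(T value) : m_value(std::move(value)) {}
        const T& get_or_create() override { return m_value; }

    private:
        T m_value;
    };
} // namespace scantask
```

A test can then build a `Fixed<FileList>` holding VIDEO_TS, AUDIO_TS and a few VOB/IFO paths, a `Fixed<MediaTypes>` set to DVD, pass both to `ContentType`, and check the result without touching a real disk.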

At this point the changes I’m working on are on a separate branch. Feel free to look at the *hashdir_refactoring* branch if you want to see the work-in-progress!

As usual, you can find [dindexer on bitbucket.org](https://bitbucket.org/King_DuckZ/dindexer).

\#opensource #linux #dindexer #cpp