mirror of https://github.com/KingDuckZ/dindexer.git synced 2024-11-29 01:33:46 +00:00

The first piece of functionality I implemented in #dindexer is the part that looks for files and directories and calculates their hashes. That's indeed the heart of the project, and almost everything else is built around that central idea. It all happens in the hash_dir() function. And “all” is definitely too much. Let me explain: in spite of the name, many things are done in that function, such as traversing the directory tree (the traversal logic itself) and getting the mime type of files. That's no surprise, since the project is very young and it's still moving away from being a prototype. During the past months I've been adding new values to the DB and new functionality to the program, and more than once that central function was the quick way to add the new things. Now I'm at the point where dindexer as a concept is working for me, so I'm planning to keep the development going and add new features (which I will discuss here in the future).

This looks like the right moment to refactor that code, and I had to think about how best to do it. I should mention at this point that I'm considering the possibility of using more than one hashing algorithm for each item being indexed in order to minimize the collision probability, but this is a larger topic and I'm not so sure about the whole idea anymore, so let's leave this discussion for another time. Anyway, with that in mind my first idea was to implement the different operations that hash_dir() is currently doing as jobs, in order to take advantage of multi-core systems. Running several hashing algorithms in parallel sounded like a good idea, but a friend of mine talked me out of that, so it's going to stay single-threaded and single-hash for now.

The next idea I had was to split the disk scanning process into tasks, and have a manager executing whatever tasks you registered with it. That's very convenient because it will trim a lot of crap out of main() and will also let me easily add or remove jobs from the manager (for example if there are command line switches that enable or disable parts of the scanning process). My very first approach was just as described: a manager plus a base task class with a run_task() virtual method.
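To make that first approach concrete, here is a minimal sketch of what a manager plus a base task class could look like. Only run_task() comes from the description above; Task, TaskManager, NamedTask, run_scan() and the task names are hypothetical and are not taken from the dindexer sources.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class: every scanning step implements run_task().
class Task {
public:
    virtual ~Task() = default;
    virtual void run_task() = 0;
};

// Hypothetical manager: owns the registered tasks and runs them
// in registration order.
class TaskManager {
public:
    void register_task(std::unique_ptr<Task> task) {
        assert(task != nullptr);
        m_tasks.push_back(std::move(task));
    }

    void run_all() {
        for (auto& task : m_tasks)
            task->run_task();
    }

    std::size_t task_count() const { return m_tasks.size(); }

private:
    std::vector<std::unique_ptr<Task>> m_tasks;
};

// Placeholder task that only records its own name, standing in for
// real work such as listing files or hashing them.
class NamedTask : public Task {
public:
    NamedTask(std::string name, std::vector<std::string>& log) :
        m_name(std::move(name)), m_log(log) {}

    void run_task() override { m_log.push_back(m_name); }

private:
    std::string m_name;
    std::vector<std::string>& m_log;
};

// A (made up) command line switch decides whether hashing is registered.
std::vector<std::string> run_scan(bool enable_hashing) {
    std::vector<std::string> log;
    TaskManager manager;
    manager.register_task(std::make_unique<NamedTask>("list_files", log));
    if (enable_hashing)
        manager.register_task(std::make_unique<NamedTask>("hash_files", log));
    manager.run_all();
    return log;
}
```

Note that nothing in this design stops you from registering the hashing task without the file listing task, and a plain run_task() has no obvious way to hand its result to the next task.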

Still, there is a bit of dependency management involved. For example, hashing files requires a list of files to go through in the first place, and detecting the content type of a disk needs both that same list and the media type. And those are hard dependencies, so it's not like you can just skip one task and still expect everything to work. My way of keeping the task-based approach while still enforcing compulsory dependencies is to give up on the task manager and the common base class approach and just have a completely different class for each task. This turns out to be very convenient since each task produces something different (a list of files, an enum, a list of hashes...), and I can have each task require the tasks it depends on at construction time, so I get build errors if some key dependency is missing. How about the manager class? That's also not needed anymore. Once I've instantiated the last object in the task chain, the one that returns the full list of data to be sent to the DB, I'll just have to call its get_or_create() method and it will go up the chain and collect all the bits and pieces it needs to do its own part.
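Here is a rough sketch of that idea with two unrelated task classes. Only get_or_create() comes from the text; FileListTask, HashingTask, the fake file names and the use of std::hash as a stand-in for a real hashing algorithm are assumptions made for the sake of the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using FileList = std::vector<std::string>;
using HashList = std::vector<std::size_t>;

// Hypothetical first task in the chain: produces the list of files.
class FileListTask {
public:
    explicit FileListTask(std::string root) : m_root(std::move(root)) {
        assert(!m_root.empty());
    }

    const FileList& get_or_create() {
        if (!m_done) {
            // Stand-in for a real directory traversal.
            m_files = {m_root + "/a.txt", m_root + "/b.txt"};
            m_done = true;
        }
        return m_files;
    }

private:
    std::string m_root;
    FileList m_files;
    bool m_done = false;
};

// Hypothetical second task: it cannot be constructed without a
// FileListTask, so forgetting the dependency is a build error.
class HashingTask {
public:
    explicit HashingTask(FileListTask& files) : m_files(files) {}

    const HashList& get_or_create() {
        if (!m_done) {
            // Pulls the file list from the task above in the chain.
            // std::hash on the path is only a placeholder for hashing
            // the actual file contents.
            for (const auto& path : m_files.get_or_create())
                m_hashes.push_back(std::hash<std::string>{}(path));
            m_done = true;
        }
        return m_hashes;
    }

private:
    FileListTask& m_files;
    HashList m_hashes;
    bool m_done = false;
};

// Only the last task is asked for its result; it collects the rest.
std::size_t hash_count_for(const std::string& root) {
    FileListTask files(root);
    HashingTask hashing(files); // "HashingTask hashing;" would not compile
    return hashing.get_or_create().size();
}
```

Each get_or_create() caches its result in this sketch, so calling it again (or having two tasks depend on the same file list) does not redo the work.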

That's still not the entire story: some parts of the new tasks could still benefit from being put into a common base class, and I still need to be able to swap tasks for unit testing. For example, let's say I want to test the content-detection functions, which depend on having a list of files and a given media type. For the sake of clarity, let's say you want to test whether video DVDs are being detected correctly. You will need a list of files containing a VIDEO_TS and an AUDIO_TS directory, plus some VOB, IFO, etc. files. And you need the media type to be a DVD. The base class I came up with is templated over the return type of its get_or_create() method, so by declaring the constructor in scantask::ContentType as ContentType ( Base<FileList>&, Base<MediaTypes>& ) I leave the way open to replacing the tasks above it in the dependency tree.
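The video DVD test could be sketched roughly like this. Base, ContentType, FileList, MediaTypes, get_or_create() and the constructor signature come from the text; the enum values, the FixedValue test double, the detect() helper and the very naive detection rule are all assumptions, and the real classes on the branch may well look different.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

namespace scantask {
    using FileList = std::vector<std::string>;
    enum class MediaTypes { CDRom, DVD, Other };   // hypothetical values
    enum class ContentTypes { Generic, VideoDVD }; // hypothetical values

    // Base class templated over what get_or_create() returns.
    template <typename T>
    class Base {
    public:
        virtual ~Base() = default;
        virtual T& get_or_create() = 0;
    };

    // Hypothetical test double: a "task" that returns a canned value.
    template <typename T>
    class FixedValue : public Base<T> {
    public:
        explicit FixedValue(T value) : m_value(std::move(value)) {}
        T& get_or_create() override { return m_value; }

    private:
        T m_value;
    };

    class ContentType : public Base<ContentTypes> {
    public:
        ContentType(Base<FileList>& files, Base<MediaTypes>& media) :
            m_files(files), m_media(media) {}

        ContentTypes& get_or_create() override {
            if (!m_done) {
                // Deliberately naive rule, for illustration only:
                // a DVD with a VIDEO_TS entry counts as a video DVD.
                const FileList& files = m_files.get_or_create();
                const bool has_video_ts =
                    std::find(files.begin(), files.end(), "VIDEO_TS") != files.end();
                const bool is_dvd = m_media.get_or_create() == MediaTypes::DVD;
                m_result = (is_dvd && has_video_ts)
                    ? ContentTypes::VideoDVD : ContentTypes::Generic;
                m_done = true;
            }
            return m_result;
        }

    private:
        Base<FileList>& m_files;
        Base<MediaTypes>& m_media;
        ContentTypes m_result = ContentTypes::Generic;
        bool m_done = false;
    };
}

// Helper that wires canned inputs into ContentType, no disk needed.
scantask::ContentTypes detect(scantask::FileList files, scantask::MediaTypes media) {
    scantask::FixedValue<scantask::FileList> file_task(std::move(files));
    scantask::FixedValue<scantask::MediaTypes> media_task(media);
    scantask::ContentType content(file_task, media_task);
    return content.get_or_create();
}

void test_video_dvd_detection() {
    using namespace scantask;
    const FileList dvd_files = {"VIDEO_TS", "AUDIO_TS", "VIDEO_TS/VTS_01_1.VOB"};
    assert(detect(dvd_files, MediaTypes::DVD) == ContentTypes::VideoDVD);
}
```

Since ContentType only sees Base<FileList>& and Base<MediaTypes>&, it doesn't care whether the objects it gets scan a real disk or just return canned data.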

At this point the changes I'm working on are on a separate branch. Feel free to look at the hashdir_refactoring branch if you want to see the work in progress!

As usual, you can find dindexer on bitbucket.org.

#opensource #linux #dindexer #cpp