A tool to locate and store entries in Reddit's large .log files.
Created for a coding challenge posted by the Reddit staff in 2011 Feb. Resurrected from an old machine and uploaded to Github in 2020.
This program works off the assumption that the distance between two newlines is more than fifteen characters.
The binary search running on the file works on the assumption that the file is
large and that there is at least one entry for every second of the 24-hour
period. If there isn't, the search provides a 'closest value' to what was being
looked for. However, since the means by which the search closes the search field
at each iteration is slightly dodgy (dividing memory addresses, lseeks) some
values which are actually present in the list can be missed, and another (albeit
very close) value will be given. In either case, the caller method
(get_lower_bound()) calls first_instance() to clean up the mess by
confirming if the position given by the bin search was the first occurance of
that time or not.
The method circular_bin_search() performs a hacked sort of binary search on a
list (file, in this case) where the entries are in order, but not linearly. That
is to say, one end of the list has been cut off and moved to the other side.
Naturally the ideal format of a list for binary search would be:
lowest value -> exact middle -> highest value
but in the case of the test files generated by the Perl script, we have something like:
6:00 -> 18:00 -> 6:00
which, clearly, would confuse any normal binary search.
We can imagine that the array/list/file/what-have-you has had its two ends tied together, hence the list can be called "circular". I imagine a metal ring with a visible welding mark. Each iteration is a rotation of the ring to the left and right.
The basic logic of the circular bin is as follows:
- Find the middle value as normal.
- If middle value == target value: Done.
- If target value is between the lower bound and middle value, or the middle value and the upper bound, then we constrict the data set like a normal binary search.
- If all these fail, we need to find where the "link" in the circle occurs. Fortunately we only need to find out which side of the middle value the link occurs on. This is not hard.
- Luckily for us, the target value will always be on the same side as the link if normal binary search logic failed.