28 December 2015

Handling long paths on Windows

Recently I was having a tough time trying to add support for long paths to some piece of software running on Windows OS. If you did some work with file I/O on Windows you would definitely know that file paths have a length limit, namely, MAX_PATH, which is 260. If we exclude the drive letter, colon, the backslash after colon and the terminating null character at the very end of the path, then 256 characters remain for the path on the drive: "<drive letter>:\\<max 256 characters long path>\0".

  However, if again you worked with I/O APIs, you would notice that some of them come in two different versions - ANSI and UNICODE, ending with 'A' and 'W', respectively. A good example is GetCurrentDirectory, with GetCurrentDirectoryW and GetCurrentDirectoryA specialized versions. Turns out that newer file systems on Windows always store the paths in Unicode so they can store any extended character and there's no need to truncate the path. That said, a Unicode path can be up to 32,767 characters long. Note, that even with Unicode each of the path components cannot be longer than some limit, which is to the best of my knowledge 255 (can be found in lpMaximumComponentLength out parameter of GetVolumeInformation function)  So the idea behind the Unicode versions of APIs is to support longer paths. A good start would be to read this article on MSDN to familiarize yourselves with the notions and basic details of file and path names.

  The main technique to learn is prefixing the full path name with '\\?\', or with escaped string "\\\\?\\", or if you are a C++ fan, LR"(\\?\)". This will make the path treated as a Unicode path which does not have the MAX_PATH limit. However, this prefix can be used with full paths only, so be sure to call GetFullPathNameW and prefix the value obtained from it. Beware, the MSDN page I linked to has some wrong guideline, it says if you want to get the Unicode version without the MAX_PATH restriction, prefix the input path string with \\?\, however, as I already mentioned, this prefix is valid only for full paths, so providing a relative path prefixed with it won't ever work. Just make sure you call the Unicode version of the function. After this point, any path like ..\a\b\myFile.txt will be converted to something like C:\work\a\b\myFile.txt, even if the last path has more than MAX_PATH characters, and after prefixing it will look like \\?\C:\work\a\b\myFile.txt. Now you can use this path to open a stream and read the file contents (make sure you always use the Unicode versions of APIs, there are some that do not have it, e.g. OpenFile). You can correctly guess, that GetFullPathName does not need the path to represent an existing file, it can be something made up, and the relative root will be considered the current working directory, that's how the parent nodes will be identified.

  If you need to deal with not only local paths, but also remote shares, there is a slight difference you need to remember. The prefix for remote paths is not '\\?\', but rather '\\?\UNC\'. The full path returned by GetFullPathName function (or its Unicode or ANSI versions) will have the form <Drive letter>:\<directory path>\<filename> for local paths. For network shares it will have this form: \\<Server name>\<share path>\<filename>. So to determine which prefix you need to add, you just need to check the first two characters of the full path (assuming valid paths, but not necessarily existing). One last small but important thing; when you get a server share path, you cannot just prefix it with '\\?\UNC\', you need to remove the leading '\\' first, so that you do not end up with \\?\UNC\\\<Server name>... (note the three consecutive backslashes after UNC).

  The second big piece to know about is short paths. The most common way of creating short paths on Windows is the 8.3 pattern coming from DOS. It's called 8.3 because there are 8 characters for file name a point and then 3 characters for extension. Any path that is longer than MAX_PATH, can be represented as a short version of it by modifying one or more (until the total length is not greater than MAX_PATH) of its components to match the 8.3 pattern (note, that this is not the only possible way, so do not rely on the pattern itself, but use the APIs). I.e. for instance if we have a file file.txt under directory aaa...aa consisting of 100 'a'-s (<100a> from now on), which itself is under <100b> and the latter is under <100c> on drive D:, then the full path will contain wcslen(L"file.txt") + 3*100 + wcslen(L"D:\") + 1 (this last one for the terminating null character), which is 309, which obviously is greater than MAX_PATH. So shortening this will be take the first path component that does not match the 8.3 patten, which is <100a>, and shorten it to aaaaaa~1, as you can see it's 8 characters now, and with just one component modification the total length is reduced to 309-92=217, which is good to go.

  If somewhere in your program you have a file path which is coming as an input, always consider that it can be in a short form, and calling GetFullPathNameW on it, won't actually expand the short components to full ones, it will just get the full path from the root to the file, but containing the same short component names, i.e. aaaaaa~1 will become D:\aaaaaa~1. In order to get the expanded version of full path, call GetLongPathNameW function, but remember to prefix the full path with \\?\. In contrast to GetFullPathName, GetLongPathName requires that the path represent an actual file/directory on disc.

  That's it, after this point all the APIs that are internally supporting long paths, will work fine with the full expanded version of file name. I managed to open a stream on a file whose path length was 457. It took me a lot of time to create that file. Windows Explorer does not support long paths, but seems there are some bugs, so you can get to it. If you try to create a file whose full path name will be longer than MAX_PATH, it will not allow you. But if you create it shorter, then start renaming the containing directories, you can make it. If you try to copy the full path from the address bar in windows explorer, or do a Shift+right click on the file, and 'Copy As Path' from context menu, then it will store the short version of the full path in the clipboard.

  Since I used a lot of different APIs and in most of them I converted the path, I also added a check to my conversion function to consider the input path full if it already starts with the long path prefix (\\?\).

  As I already mentioned, some functions do not support long paths. One important instance is CreateProcessW, Although you can specify a command line parameter up to 32767 characters long, the executable path part is limited to MAX_PATH. The best option I found out is to get the short version of the full path, which will implicitly support ~MAX_PATH/9*255 path length, since each component in short version might contain 8 characters only, so we could have MAX_PATH/9 (8 characters for path component and 1 separator - slash or backslash) components, but in the long version each of them can be up to 255 characters long. This estimate is far from being precise, since it does not consider the drive name, the file name, and also implies too long component names which rarely happens. So in some specific cases shortening the full path helps, but that is not a portable solution, since not all file systems support short names and even the ones that do do not always have the same pattern for shortening.

  Seems this is all I wanted to share, I spent a lot of time researching the web to get a working model, and I did a lot of tests myself, before I updated the original code, so thought some people might find my experience useful.

P.S. A side note on file/path APIs

Whenever there is an API that accepts a buffer to fill and the number of characters to fill, it generally accepts a null and 0, respectively, for those arguments, and just returns the required number of characters to store the to-be-returned string. So you can just call with null and 0 first, and then allocate a buffer of correct size, not to do it twice. Let's look at one of the functions discussed above.

// consider an input fileName of type PCWSTR (const wchar_t*)
DWORD requiredBufferLength = GetFullPathNameW(fileName, 0, nullptr, nullptr);

if (0 == requiredBufferLength) // means failure
{
       return GetLastError();
}

wchar_t* buffer = new wchar_t[requiredBufferLength];

DWORD result = GetFullPathNameW(fileName, requiredBufferLength, buffer, nullptr);

if (0 == result)
{
       return GetLastError();
}

// buffer now contains the full path name of fileName, use it.

26 August 2015

Find occurrences of an integer in a sorted array

    Recently I was given a problem to solve during a technical interview: we have an array of sorted integers as an input and a single integer; we need to find the starting and ending positions of the occurrences of that integer in the array. For example, if the array is {0, 1, 1, 2, 3, 4, 4, 4, 5} and the input integer is 1, the program should return (1, 2). If the input integer is 4 and the array is the same, the program should return (5, 7). And finally, if no occurrence found, say if the input is 7, the program should return (-1, -1). I couldn't get a clean code in the assigned time, that's why I decided to try it at home and make sure my algorithm works. So here is just a simple solution. For simplicity, I used std::vector<int> as an input array, but obviously it should be a generalized function with a template parameter. Anyway, here is my solution:

// The actual interface method which is the solution to the problem
std::pair<int, int> getRange(const std::vector<int>& numbers, int value)
{
       return getRange(numbers, 0, numbers.size(), value, -1, -1);
}

As you can see, in the function above I am just calling an overloaded version of the same function, which contains the actual algorithm implementation. I added lots of explanatory comments so no more text, just the code:

// Obviously this should be just an implementation detail
// so you don't want this be part of an interface of your
// class/lib.
// Note the currentLow and currentUp parameters, they are
// meant to keep track of the so-far-found range of
// occurrences, to pass down to the same function through
// recursive calls.
std::pair<int, int> getRange(
       const std::vector<int>& numbers,
       int begin,
       int end,
       int value,
       int currentLow = -1,
       int currentUp = -1)
{
       // First of all, since the original array is sorted,
       // we don't need to look for anything if the element
       // we're looking for is less than the first
       // element of the array or greater than the last,
       // meaning that it's not there for sure. So just
       // returning the not found value.
       if (value < numbers[begin] || value > numbers[end - 1])
       {
              return std::make_pair(currentLow, currentUp);
       }

       // On the other hand, if the first and last elements
       // of the array are equal to the one we're looking for,
       // it means all the elements between are also equal
       // because the array is sorted. Hence, we just return
       // the whole range.
       if (numbers[begin] == value && numbers[end - 1] == value)
       {
              return std::make_pair(begin, end - 1);
       }

       // If none of the easy cases happen, then we just
       // split the array in two, and try to look for the
       // range in a binary search fashion.
       int middle = begin + (end - begin) / 2;

       // If the middle element is greater than the value, then
       // obviously, if the element is in the array, it's in the
       // first half (left from middle), so we just call the
       // function recursively with the cut range.
       if (value < numbers[middle])
       {
              // look in first half
              return getRange(
                     numbers,
                     begin,
                     middle,
                     value,
                     currentLow,
                     currentUp);
       }
       // Same here - but looking in the other half.
       else if (value > numbers[middle])
       {
              // look in 2nd half
              return getRange(
                     numbers,
                     middle,
                     end,
                     value,
                     currentLow,
                     currentUp);
       }
       // This is the most interesting case, when middle element
       // is equal to the given value, it means there might be
       // occurrences both on left and right sides. Now we call
       // the function recursively for both left and right
       // sections, but we only take the lower bound from left
       // section and the upper bound from the right.
       else
       {
              currentLow = (currentLow > middle || currentLow == -1) ?
                                         middle : currentLow;

              currentUp = (currentUp < middle) ? middle : currentUp;
             
              // look in both
              int lowerBound = getRange(
                     numbers,
                     begin,
                     middle,
                     value,
                     currentLow,
                     currentUp).first;

              int upperBound = getRange(
                     numbers,
                     middle,
                     end,
                     value,
                     currentLow,
                     currentUp).second;

              return std::make_pair(lowerBound, upperBound);
       }
}

Don't forget to include the required headers.

26 July 2014

How to convert string to DateTime and vice versa

Recently I was looking for a convenient way, to parse a string representation of DateTime to integral value representing epoch time. And I couldn't find an existing class to make that easily, for example giving the string as an input to constructor. After about a couple of hours of research I found a brief way to achieve that using a mix of C and C++ tools.

First of all, there is an std::tm struct defined in <ctime> header, which represents a calendar date and a time with all the relevant members, the date is reckoned from start of year 1900, thus, if we try to represent current year (2014) the year field of tm structure will be equal to 114.
Second, in the same <ctime> header there is an auxiliary function, mktime, which accepts a pointer to std::tm object and returns std::time_t. The latter is a typedef not defined by C++ standard specification. However, it is almost always an integral type, so we can store it in an std::int64_t to be on the safe side. std::time_t represents the number of seconds elapsed since epoch (Jan 1 1900, 00:00:00), the same as POSIX time.

Alright, now that we have found a good type to store date and time in, we need the last step, getting all this info from a string, and that is where the most difficult part comes. std::tm is a POD struct, so no parse methods for users. std::time_t is just a typedef of a built-in type, so again, not much use.
Fortunately, there is a great set of tools that help developers deal with streams, and one of them is i/o manipulators. The one that we are going to use here is std::get_time from <iomanip> header. If passed as an argument to input stream's operator>>, it formats the stream buffer with the provided format string and writes the resulting dateTime into an std::tm structure. And that solved my problem.

Enough text, let's look into the code:

#include <ctime>
#include <iomanip>
#include <iostream>
#include <sstream>

// Converts UTC time string to a time_t value.
std::time_t getEpochTime(const std::wstring& dateTime)
{
       // Let's consider we are getting all the input in
       // this format: '2014-07-25T20:17:22Z' (T denotes
       // start of Time part, Z denotes UTC zone).
       // A better approach would be to pass in the format as well.
       static const std::wstring dateTimeFormat{ L"%Y-%m-%dT%H:%M:%SZ" };

       // Create a stream which we will use to parse the string,
       // which we provide to constructor of stream to fill the buffer.
       std::wistringstream ss{ dateTime };

       // Create a tm object to store the parsed date and time.
       std::tm dt;

       // Now we read from buffer using get_time manipulator
       // and formatting the input appropriately.
       ss >> std::get_time(&dt, dateTimeFormat.c_str());

       // Convert the tm structure to time_t value and return.
       return std::mktime(&dt);
}

int main(void)
{
       // Test to see if our function works properly.
       std::wstring dateTimeStr{ L"2014-07-25T20:17:22Z" };
       std::wcout << L"The epoch time value representing " << dateTimeStr
              << L" is " << getEpochTime(dateTimeStr) << std::endl;

       return 0;
}

If you want to verify that get_time correctly populated the tm object, just add a couple of outputs  after operator>> line, something like this:

       std::wcout << dt.tm_year << std::endl;
       std::wcout << dt.tm_mon << std::endl;
       std::wcout << dt.tm_mday << std::endl;
       std::wcout << dt.tm_hour << std::endl;
       std::wcout << dt.tm_min << std::endl;
       std::wcout << dt.tm_sec << std::endl;

To convert the epoch time value back to std::tm structure you can use either of std::localtime and std::gmtime functions from <ctime> header. And finally, if you want to convert std::tm to string, there is an analogous manipulator std::put_time which is used with output stream's operator<<.

Thanks for reading, I hope some people will use this approach when they encounter the same problem.