28 December 2015

Handling long paths on Windows

Recently I had a tough time adding long path support to a piece of software running on Windows. If you have done any file I/O on Windows, you know that file paths have a length limit, namely MAX_PATH, which is 260. If we exclude the drive letter, the colon, the backslash after the colon, and the terminating null character at the very end of the path, 256 characters remain for the path on the drive: "<drive letter>:\<max 256 characters long path><NUL>".

  However, if you have worked with the I/O APIs, you will have noticed that some of them come in two versions, ANSI and Unicode, ending with 'A' and 'W', respectively. A good example is GetCurrentDirectory, with its specialized versions GetCurrentDirectoryA and GetCurrentDirectoryW. It turns out that newer file systems on Windows always store paths in Unicode, so they can hold any extended character and there is no need to truncate the path. A Unicode path can be up to 32,767 characters long. Note that even with Unicode, each path component cannot be longer than some limit, which to the best of my knowledge is 255 (the exact value is returned in the lpMaximumComponentLength out parameter of the GetVolumeInformation function). So one idea behind the Unicode versions of the APIs is to support longer paths. A good start would be to read this article on MSDN to familiarize yourself with the notions and basic details of file and path names.

  The main technique to learn is prefixing the full path name with '\\?\' (written in source code as the escaped string "\\\\?\\" or, if you are a C++ fan, as the raw string literal LR"(\\?\)"). This makes the path be treated as a Unicode path that is not subject to the MAX_PATH limit. However, the prefix can be used with full paths only, so be sure to call GetFullPathNameW first and prefix the value obtained from it. Beware: the MSDN page I linked to has a misleading guideline. It says that if you want the Unicode version to work without the MAX_PATH restriction, you should prefix the input path string with \\?\; however, as I already mentioned, the prefix is valid only for full paths, so a relative path prefixed with it will never work. Just make sure you call the Unicode version of the function. After this step, a path like ..\a\b\myFile.txt is converted to something like C:\work\a\b\myFile.txt, even if the result has more than MAX_PATH characters, and after prefixing it looks like \\?\C:\work\a\b\myFile.txt. Now you can use this path to open a stream and read the file contents (make sure you always use the Unicode versions of the APIs; some functions, e.g. OpenFile, do not have one). As you might guess, GetFullPathName does not need the path to refer to an existing file; it can be something made up. A relative path is resolved against the current working directory, which is how the parent directories are determined.

  If you need to deal not only with local paths but also with remote shares, there is a slight difference to remember. The prefix for remote paths is not '\\?\' but '\\?\UNC\'. The full path returned by GetFullPathName (in either its Unicode or ANSI version) has the form <Drive letter>:\<directory path>\<filename> for local paths. For network shares it has the form \\<Server name>\<share path>\<filename>. So to determine which prefix to add, you just need to check the first two characters of the full path (assuming valid paths, though not necessarily existing ones). One last small but important thing: when you get a server share path, you cannot just prefix it with '\\?\UNC\'; you need to remove the leading '\\' first, so that you do not end up with \\?\UNC\\\<Server name>... (note the three consecutive backslashes after UNC).
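  The prefixing rules for local and network paths can be sketched as a small string helper. This is only an illustration under a few assumptions: AddLongPathPrefix is a hypothetical name, not a Windows API; the input is expected to be a full path of the kind GetFullPathNameW returns; and returning an empty string for anything else is just one possible way to report the problem. The helper does pure string work and never touches the file system.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not a Windows API. Expects a full path such as the
// one produced by GetFullPathNameW and returns it with the matching
// long-path prefix. Returns an empty string if the input is not a full path.
std::wstring AddLongPathPrefix(const std::wstring& fullPath)
{
    const std::wstring localPrefix = L"\\\\?\\";     // escaped long-path prefix
    const std::wstring uncPrefix = L"\\\\?\\UNC\\";  // escaped prefix for shares

    // Already prefixed: treat the path as full and leave it untouched.
    if (fullPath.compare(0, localPrefix.size(), localPrefix) == 0)
    {
        return fullPath;
    }

    // Network share: the full path starts with two backslashes.
    if (fullPath.size() > 2 && fullPath[0] == L'\\' && fullPath[1] == L'\\')
    {
        // Drop the two leading backslashes before adding the UNC prefix,
        // so that no run of three backslashes appears after "UNC".
        return uncPrefix + fullPath.substr(2);
    }

    // Local path: a drive letter, a colon and a backslash.
    const bool startsWithLetter =
        !fullPath.empty() &&
        ((fullPath[0] >= L'A' && fullPath[0] <= L'Z') ||
         (fullPath[0] >= L'a' && fullPath[0] <= L'z'));
    if (startsWithLetter && fullPath.size() >= 3 &&
        fullPath[1] == L':' && fullPath[2] == L'\\')
    {
        return localPrefix + fullPath;
    }

    // Relative or otherwise unexpected input: the prefix would not work.
    return std::wstring();
}
```

  In a real program the argument would come from GetFullPathNameW, and the result would be passed to the Unicode ('W') version of whichever file API is needed.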

  The second big piece to know about is short paths. The most common way of creating short paths on Windows is the 8.3 pattern coming from DOS. It is called 8.3 because there are 8 characters for the file name, a dot, and then 3 characters for the extension. A path longer than MAX_PATH can often be represented in a short form by modifying one or more of its components (until the total length is no greater than MAX_PATH) to match the 8.3 pattern. Note that this is not the only possible way of shortening, so do not rely on the pattern itself; use the APIs. For instance, suppose we have a file file.txt under a directory aaa...aa consisting of 100 'a'-s (<100a> from now on), which itself is under <100b>, which in turn is under <100c> on drive D:. The full path D:\<100c>\<100b>\<100a>\file.txt then takes wcslen(L"D:\\") + 3*100 + 3 (one backslash after each directory) + wcslen(L"file.txt") + 1 (for the terminating null character) = 3 + 300 + 3 + 8 + 1 = 315 characters, which is obviously greater than MAX_PATH. Shortening takes a path component that does not match the 8.3 pattern, say <100a>, and replaces it with aaaaaa~1, which is 8 characters long. With just this one component modified, the total length drops to 315 - 92 = 223, which is good to go.
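  To double-check the arithmetic of this example, the path can simply be built and measured. Counting every character, including the three backslashes that follow the directories, the example needs 315 characters, and 223 once <100a> is replaced by aaaaaa~1. The sketch below is only an illustration: ExamplePathBufferSize is a hypothetical name, and the short name aaaaaa~1 is hard-coded rather than obtained from the file system, which in real code should be done through the APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper for the example above. Builds the path
// D:\<100c>\<100b>\<100a>\file.txt, optionally with <100a> replaced by the
// 8.3-style name aaaaaa~1, and returns the buffer size it needs in
// characters, including the terminating null.
std::size_t ExamplePathBufferSize(bool shortenOneComponent)
{
    const std::wstring a = shortenOneComponent ? std::wstring(L"aaaaaa~1")
                                               : std::wstring(100, L'a');
    const std::wstring path = L"D:\\" + std::wstring(100, L'c') + L"\\" +
                              std::wstring(100, L'b') + L"\\" + a +
                              L"\\file.txt";
    return path.size() + 1; // +1 for the terminating null
}
```

  ExamplePathBufferSize(false) gives the size of the original path, and ExamplePathBufferSize(true) the size after one component is shortened.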

  If somewhere in your program a file path comes in as input, always consider that it may be in short form. Calling GetFullPathNameW on it will not expand the short components to their long names; it only produces the full path from the root to the file, with the same short component names, i.e. aaaaaa~1 will become D:\aaaaaa~1. To get the expanded version of the full path, call the GetLongPathNameW function, but remember to prefix the full path with \\?\ first. In contrast to GetFullPathName, GetLongPathName requires that the path refer to an actual file or directory on disk.

  That's it. After this point, all the APIs that internally support long paths will work fine with the full expanded version of the file name. I managed to open a stream on a file whose path length was 457. It took me a lot of time to create that file. Windows Explorer does not support long paths, but it seems to have some bugs that let you get there anyway. If you try to create a file whose full path name would be longer than MAX_PATH, it will not allow you. But if you create it with a shorter path and then start renaming the containing directories, you can make it. If you copy the full path from the address bar in Windows Explorer, or Shift+right-click the file and choose 'Copy as path' from the context menu, the short version of the full path is stored in the clipboard.

  Since I used a lot of different APIs and converted the path in most of them, I also added a check to my conversion function: it treats the input path as already full if it starts with the long path prefix (\\?\).

  As I already mentioned, some functions do not support long paths. One important instance is CreateProcessW: although you can specify a command line up to 32,767 characters long, the executable path part is limited to MAX_PATH. The best option I found is to get the short version of the full path, which implicitly supports a long path of roughly MAX_PATH/9*255 characters: each component in the short version may contain only 8 characters, so we can have MAX_PATH/9 components (8 characters for the path component and 1 separator, slash or backslash), while in the long version each of them can be up to 255 characters long. This estimate is far from precise, since it does not account for the drive name or the file name, and it assumes very long component names, which rarely happen. So in some specific cases shortening the full path helps, but it is not a portable solution, since not all file systems support short names, and even the ones that do do not always use the same pattern for shortening.
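  Plugging numbers into the rough estimate above gives an idea of its order of magnitude. The sketch below is only that estimate written out with integer division, under the same simplifications (no drive name, no file name, every component 8 characters short and 255 characters long); the constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kMaxPath = 260;          // value of MAX_PATH
constexpr std::size_t kShortComponent = 8 + 1; // 8 characters plus one separator
constexpr std::size_t kMaxLongComponent = 255; // assumed per-component limit

// Rough upper bound on the long path length reachable through short names:
// the number of short components that fit into MAX_PATH, each one standing
// for a long component of maximum length.
constexpr std::size_t EstimateReachableLongPathLength()
{
    return kMaxPath / kShortComponent * kMaxLongComponent;
}
```

  With these values the estimate is 28 components of 255 characters each, i.e. 7,140 characters, still well below the 32,767 limit of Unicode paths.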

  It seems this is all I wanted to share. I spent a lot of time researching the web to get a working model and ran a lot of tests myself before I updated the original code, so I thought some people might find my experience useful.

P.S. A side note on file/path APIs

Whenever an API accepts a buffer to fill and the size of that buffer in characters, it generally also accepts null and 0, respectively, for those arguments, and just returns the number of characters required to store the to-be-returned string. So you can call it with null and 0 first and then allocate a buffer of exactly the right size, instead of guessing a size and retrying. Let's look at one of the functions discussed above.

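// Requires <windows.h> and <vector>.
// Consider an input fileName of type PCWSTR (const wchar_t*).
// With a null buffer and zero length, the call returns the required
// buffer size in characters, including the terminating null.
DWORD requiredBufferLength = GetFullPathNameW(fileName, 0, nullptr, nullptr);

if (0 == requiredBufferLength) // means failure
{
       return GetLastError();
}

// std::vector releases the memory on every return path, so nothing leaks.
std::vector<wchar_t> buffer(requiredBufferLength);

// On success the return value is the number of characters written,
// not counting the terminating null.
DWORD result = GetFullPathNameW(fileName, requiredBufferLength, buffer.data(), nullptr);

if (0 == result)
{
       return GetLastError();
}

if (result >= requiredBufferLength)
{
       // The required size grew between the two calls,
       // e.g. because the current directory changed.
       return ERROR_INSUFFICIENT_BUFFER;
}

// buffer now contains the full path name of fileName, use it.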