From 91cdaffcc1698de59a86cc3a1d99643a82e87b80 Mon Sep 17 00:00:00 2001
From: albertony <12441419+albertony@users.noreply.github.com>
Date: Mon, 1 Nov 2021 13:46:23 +0100
Subject: [PATCH] docs: add faq section explaining why rclone changes fullwidth characters in file names

---
 docs/content/faq.md      |  18 ++++++
 docs/content/local.md    |   4 ++
 docs/content/overview.md | 121 ++++++++++++++++++++++++++++++++-------
 3 files changed, 121 insertions(+), 22 deletions(-)

diff --git a/docs/content/faq.md b/docs/content/faq.md
index 43641a35e..497fa4f0e 100644
--- a/docs/content/faq.md
+++ b/docs/content/faq.md
@@ -209,3 +209,21 @@ The most common cause of rclone using lots of memory is a single
 directory with thousands or millions of files in. Rclone has to load
 this entirely into memory as rclone objects. Each rclone object takes
 0.5k-1k of memory.
+
+### Rclone changes fullwidth Unicode punctuation marks in file names
+
+For example: On a Windows system, you have a file with name `Test：1.jpg`,
+where `：` is the Unicode fullwidth colon symbol. When using rclone
+to copy this to your Google Drive, you will notice that the file
+gets renamed to `Test:1.jpg`, where `:` is the regular (halfwidth) colon.
+
+The reason for such renames is the way rclone handles different
+[restricted filenames](/overview/#restricted-filenames) on different
+cloud storage systems. It tries to avoid ambiguous file names as much
+as possible and allow moving files between many cloud storage systems
+transparently, by replacing invalid characters with similar looking
+Unicode characters when transferring to one storage system, and replacing
+back again when transferring to a different storage system where the
+original characters are supported. When the same Unicode characters
+are intentionally used in file names, this replacement strategy leads
+to unwanted renames. Read more [here](/overview/#restricted-filenames-caveats).
diff --git a/docs/content/local.md b/docs/content/local.md
index d09268a82..3eed168a3 100644
--- a/docs/content/local.md
+++ b/docs/content/local.md
@@ -46,6 +46,10 @@ Local file system at .: Replacing invalid UTF-8 characters in "gro\xdf"
 
 #### Restricted characters
 
+With the local backend, restrictions on the characters that are usable in
+file or directory names depend on the operating system. To check what
+rclone will replace by default on your system, run `rclone help flags local-encoding`.
+
 On non Windows platforms the following characters are replaced when
 handling file names.
 
diff --git a/docs/content/overview.md b/docs/content/overview.md
index ca2daf23a..ef510b823 100644
--- a/docs/content/overview.md
+++ b/docs/content/overview.md
@@ -138,7 +138,8 @@ Some cloud storage systems might have restrictions on the characters
 that are usable in file or directory names.
 When `rclone` detects such a name during a file upload, it will
 transparently replace the restricted characters with similar looking
-Unicode characters.
+Unicode characters. To handle the different sets of restricted characters
+for different backends, rclone uses something it calls [encoding](#encoding).
 
 This process is designed to avoid ambiguous file names as much as
 possible and allow to move files between many cloud storage systems
@@ -150,14 +151,60 @@ to ensure correct formatting and not necessarily the actual name used
 on the cloud storage.
 
 This transformation is reversed when downloading a file or parsing
-`rclone` arguments.
-For example, when uploading a file named `my file?.txt` to Onedrive
-will be displayed as `my file?.txt` on the console, but stored as
-`my file？.txt` (the `?` gets replaced by the similar looking `？`
-character) to Onedrive.
-The reverse transformation allows to read a file`unusual/name.txt`
-from Google Drive, by passing the name `unusual／name.txt` (the `/` needs
-to be replaced by the similar looking `／` character) on the command line.
+`rclone` arguments. For example, when uploading a file named `my file?.txt`
+to Onedrive, it will be displayed as `my file?.txt` on the console, but
+stored as `my file？.txt` to Onedrive (the `?` gets replaced by the similar
+looking `？` character, the so-called "fullwidth question mark").
+The reverse transformation allows to read a file `unusual/name.txt`
+from Google Drive, by passing the name `unusual／name.txt` on the command line
+(the `/` needs to be replaced by the similar looking `／` character).
+
+#### Caveats {#restricted-filenames-caveats}
+
+The filename encoding system works well in most cases, at least
+where file names are written in English or similar languages.
+You might not even notice it: It just works. In some cases it may
+lead to issues, though, e.g. when file names are written in Chinese
+or Japanese, where it is always the Unicode fullwidth variants of the
+punctuation marks that are used.
+
+On Windows, the characters `:`, `*` and `?` are examples of restricted
+characters. If these are used in filenames on a remote that supports it,
+Rclone will transparently convert them to their fullwidth Unicode
+variants `＊`, `？` and `：` when downloading to Windows, and back again
+when uploading. This way files with names that are not allowed on Windows
+can still be stored.
+
+However, if you have files on your Windows system originally with these same
+Unicode characters in their names, they will be included in the same conversion
+process. E.g. if you create a file in your Windows filesystem with name
+`Test：1.jpg`, where `：` is the Unicode fullwidth colon symbol, and use
+rclone to upload it to Google Drive, which supports regular `:` (halfwidth
+colon), rclone will replace the fullwidth `：` with the
+halfwidth `:` and store the file as `Test:1.jpg` in Google Drive. Since
+both Windows and Google Drive allow the name `Test：1.jpg`, it would
+probably be better if rclone just kept the name as is in this case.
+
+Consider the opposite situation: you have a file named `Test:1.jpg`
+in your Google Drive, e.g. uploaded from a Linux system where `:` is valid
+in file names. If you later use rclone to copy this file to your Windows
+computer, you will notice that on your local disk it gets renamed
+to `Test：1.jpg`. The original filename is not legal on Windows, due to
+the `:`, and rclone therefore renames it to make the copy possible.
+That is all good. However, this can also lead to an issue: If you already
+had a *different* file named `Test：1.jpg` on Windows, and then use rclone
+to copy either way, rclone will treat the file originally named
+`Test:1.jpg` on Google Drive and the file originally named `Test：1.jpg`
+on Windows as the same file, and replace the contents from one with the other.
+
+It's virtually impossible to handle all cases like these correctly in all
+situations, but by customizing the [encoding option](#encoding), changing the
+set of characters that rclone should convert, you should be able to
+create a configuration that works well for your specific situation.
+See also the [example](/overview/#encoding-example-windows) below.
+
+(Windows was used as an example of a file system with many restricted
+characters, and Google Drive as a storage system with few.)
 
 #### Default restricted characters {#restricted-characters}
 
@@ -230,7 +277,7 @@ names in a different encoding than UTF-8 or UTF-16, like latin1. See the
 
 #### Encoding option {#encoding}
 
-Most backends have an encoding options, specified as a flag
+Most backends have an encoding option, specified as a flag
 `--backend-encoding` where `backend` is the name of the backend, or as
 a config parameter `encoding` (you'll need to select the Advanced
 config in `rclone config` to see it).
@@ -240,17 +287,17 @@ such a way as to preserve the maximum number of characters (see
 above).
 
 However this can be incorrect in some scenarios, for example if you
-have a Windows file system with characters such as `＊` and `？` that
-you want to remain as those characters on the remote rather than being
-translated to `*` and `?`.
+have a Windows file system with Unicode fullwidth characters
+`＊`, `？` or `：`, that you want to remain as those characters on the
+remote rather than being translated to regular (halfwidth) `*`, `?` and `:`.
 
 The `--backend-encoding` flags allow you to change that. You can
 disable the encoding completely with `--backend-encoding None` or set
 `encoding = None` in the config file.
 
 Encoding takes a comma separated list of encodings. You can see the
-list of all available characters by passing an invalid value to this
-flag, e.g. `--local-encoding "help"` and `rclone help flags encoding`
+list of all possible values by passing an invalid value to this
+flag, e.g. `--local-encoding "help"`. The command `rclone help flags encoding`
 will show you the defaults for the backends.
 
 | Encoding | Characters |
@@ -263,7 +310,7 @@ will show you the defaults for the backends.
 | Ctl | All control characters 0x00-0x1F |
 | Del | DEL 0x7F |
 | Dollar | `$` |
-| Dot | `.` |
+| Dot | `.` or `..` as entire string |
 | DoubleQuote | `"` |
 | Hash | `#` |
 | InvalidUtf8 | An invalid UTF-8 character (e.g. latin1) |
@@ -283,6 +330,8 @@ will show you the defaults for the backends.
 | Slash | `/` |
 | SquareBracket | `[`, `]` |
 
+##### Encoding example: FTP
+
 To take a specific example, the FTP backend's default encoding is
 
     --ftp-encoding "Slash,Del,Ctl,RightSpace,Dot"
@@ -300,14 +349,42 @@ to the existing ones, giving:
 This can be specified using the `--ftp-encoding` flag or using an
 `encoding` parameter in the config file.
 
-Or let's say you have a Windows server but you want to preserve `＊`
-and `？`, you would then have this as the encoding (the Windows
-encoding minus `Asterisk` and `Question`).
+##### Encoding example: Windows
 
-    Slash,LtGt,DoubleQuote,Colon,Pipe,BackSlash,Ctl,RightSpace,RightPeriod,InvalidUtf8,Dot
+As another example, take a Windows system where there is a file with
+name `Test：1.jpg`, where `：` is the Unicode fullwidth colon symbol.
+When using rclone to copy this to a remote which supports `:`,
+the regular (halfwidth) colon (such as Google Drive), you will notice
+that the file gets renamed to `Test:1.jpg`.
 
-This can be specified using the `--local-encoding` flag or using an
-`encoding` parameter in the config file.
+To avoid this you can change the set of characters rclone should convert
+for the local filesystem, using the command-line argument `--local-encoding`.
+Rclone's default behavior on Windows corresponds to
+
+```
+--local-encoding "Slash,LtGt,DoubleQuote,Colon,Question,Asterisk,Pipe,BackSlash,Ctl,RightSpace,RightPeriod,InvalidUtf8,Dot"
+```
+
+If you want to use fullwidth characters `：`, `＊` and `？` in your filenames
+without rclone changing them when uploading to a remote, then set the same as
+the default value but without `Colon,Question,Asterisk`:
+
+```
+--local-encoding "Slash,LtGt,DoubleQuote,Pipe,BackSlash,Ctl,RightSpace,RightPeriod,InvalidUtf8,Dot"
+```
+
+Alternatively, you can disable the conversion of any characters with `--local-encoding None`.
+
+Instead of using the command-line argument `--local-encoding`, you may also set it
+as the [environment variable](/docs/#environment-variables) `RCLONE_LOCAL_ENCODING`,
+or [configure](/docs/#configure) a remote of type `local` in your config,
+and set the `encoding` option there.
+
+The risk of doing this is that if you have a filename with the regular (halfwidth)
+`:`, `*` or `?` in your cloud storage, and you try to download
+it to your Windows filesystem, this will fail. These characters are not
+valid in filenames on Windows, and you have told rclone not to work around
+this by converting them to valid fullwidth variants.
 
 ### MIME Type ###
 