docs: add faq section explaining why rclone changes fullwidth characters in file names

This commit is contained in:
albertony 2021-11-01 13:46:23 +01:00 committed by GitHub
parent 33bf9b4923
commit 91cdaffcc1
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 121 additions and 22 deletions

View File

@ -209,3 +209,21 @@ The most common cause of rclone using lots of memory is a single
directory with thousands or millions of files in. Rclone has to load
this entirely into memory as rclone objects. Each rclone object takes
0.5k-1k of memory.
### Rclone changes fullwidth Unicode punctuation marks in file names
For example: On a Windows system, you have a file with name `Test1.jpg`,
where `` is the Unicode fullwidth colon symbol. When using rclone
to copy this to your Google Drive, you will notice that the file
gets renamed to `Test:1.jpg`, where `:` is the regular (halfwidth) colon.
The reason for such renames is the way rclone handles different
[restricted filenames](/overview/#restricted-filenames) on different
cloud storage systems. It tries to avoid ambiguous file names as
much and allow moving files between many cloud storage systems
transparently, by replacing invalid characters with similar looking
Unicode characters when transferring to one storage system, and replacing
back again when transferring to a different storage system where the
original characters are supported. When the same Unicode characters
are intentionally used in file names, this replacement strategy leads
to unwanted renames. Read more [here](/overview/#restricted-filenames-caveats).

View File

@ -46,6 +46,10 @@ Local file system at .: Replacing invalid UTF-8 characters in "gro\xdf"
#### Restricted characters
With the local backend, restrictions on the characters that are usable in
file or directory names depend on the operating system. To check what
rclone will replace by default on your system, run `rclone help flags local-encoding`.
On non Windows platforms the following characters are replaced when
handling file names.

View File

@ -138,7 +138,8 @@ Some cloud storage systems might have restrictions on the characters
that are usable in file or directory names.
When `rclone` detects such a name during a file upload, it will
transparently replace the restricted characters with similar looking
Unicode characters.
Unicode characters. To handle the different sets of restricted characters
for different backends, rclone uses something it calls [encoding](#encoding).
This process is designed to avoid ambiguous file names as much as
possible and allow to move files between many cloud storage systems
@ -150,14 +151,60 @@ to ensure correct formatting and not necessarily the actual name used
on the cloud storage.
This transformation is reversed when downloading a file or parsing
`rclone` arguments.
For example, when uploading a file named `my file?.txt` to Onedrive
will be displayed as `my file?.txt` on the console, but stored as
`my file.txt` (the `?` gets replaced by the similar looking ``
character) to Onedrive.
The reverse transformation allows to read a file`unusual/name.txt`
from Google Drive, by passing the name `unusualname.txt` (the `/` needs
to be replaced by the similar looking `` character) on the command line.
`rclone` arguments. For example, when uploading a file named `my file?.txt`
to Onedrive, it will be displayed as `my file?.txt` on the console, but
stored as `my file.txt` to Onedrive (the `?` gets replaced by the similar
looking `` character, the so-called "fullwidth question mark").
The reverse transformation allows to read a file `unusual/name.txt`
from Google Drive, by passing the name `unusualname.txt` on the command line
(the `/` needs to be replaced by the similar looking `` character).
#### Caveats {#restricted-filenames-caveats}
The filename encoding system works well in most cases, at least
where file names are written in English or similar languages.
You might not even notice it: It just works. In some cases it may
lead to issues, though. E.g. when file names are written in Chinese,
or Japanese, where it is always the Unicode fullwidth variants of the
punctuation marks that are used.
On Windows, the characters `:`, `*` and `?` are examples of restricted
characters. If these are used in filenames on a remote that supports it,
Rclone will transparently convert them to their fullwidth Unicode
variants ``, `` and `` when downloading to Windows, and back again
when uploading. This way files with names that are not allowed on Windows
can still be stored.
However, if you have files on your Windows system originally with these same
Unicode characters in their names, they will be included in the same conversion
process. E.g. if you create a file in your Windows filesystem with name
`Test1.jpg`, where `` is the Unicode fullwidth colon symbol, and use
rclone to upload it to Google Drive, which supports regular `:` (halfwidth
question mark), rclone will replace the fullwidth `:` with the
halfwidth `:` and store the file as `Test:1.jpg` in Google Drive. Since
both Windows and Google Drive allows the name `Test1.jpg`, it would
probably be better if rclone just kept the name as is in this case.
With the opposite situation; if you have a file named `Test:1.jpg`,
in your Google Drive, e.g. uploaded from a Linux system where `:` is valid
in file names. Then later use rclone to copy this file to your Windows
computer you will notice that on your local disk it gets renamed
to `Test1.jpg`. The original filename is not legal on Windows, due to
the `:`, and rclone therefore renames it to make the copy possible.
That is all good. However, this can also lead to an issue: If you already
had a *different* file named `Test1.jpg` on Windows, and then use rclone
to copy either way. Rclone will then treat the file originally named
`Test:1.jpg` on Google Drive and the file originally named `Test1.jpg`
on Windows as the same file, and replace the contents from one with the other.
Its virtually impossible to handle all cases like these correctly in all
situations, but by customizing the [encoding option](#encoding), changing the
set of characters that rclone should convert, you should be able to
create a configuration that works well for your specific situation.
See also the [example](/overview/#encoding-example-windows) below.
(Windows was used as an example of a file system with many restricted
characters, and Google drive a storage system with few.)
#### Default restricted characters {#restricted-characters}
@ -230,7 +277,7 @@ names in a different encoding than UTF-8 or UTF-16, like latin1. See the
#### Encoding option {#encoding}
Most backends have an encoding options, specified as a flag
Most backends have an encoding option, specified as a flag
`--backend-encoding` where `backend` is the name of the backend, or as
a config parameter `encoding` (you'll need to select the Advanced
config in `rclone config` to see it).
@ -240,17 +287,17 @@ such a way as to preserve the maximum number of characters (see
above).
However this can be incorrect in some scenarios, for example if you
have a Windows file system with characters such as `` and `` that
you want to remain as those characters on the remote rather than being
translated to `*` and `?`.
have a Windows file system with Unicode fullwidth characters
``, `` or ``, that you want to remain as those characters on the
remote rather than being translated to regular (halfwidth) `*`, `?` and `:`.
The `--backend-encoding` flags allow you to change that. You can
disable the encoding completely with `--backend-encoding None` or set
`encoding = None` in the config file.
Encoding takes a comma separated list of encodings. You can see the
list of all available characters by passing an invalid value to this
flag, e.g. `--local-encoding "help"` and `rclone help flags encoding`
list of all possible values by passing an invalid value to this
flag, e.g. `--local-encoding "help"`. The command `rclone help flags encoding`
will show you the defaults for the backends.
| Encoding | Characters |
@ -263,7 +310,7 @@ will show you the defaults for the backends.
| Ctl | All control characters 0x00-0x1F |
| Del | DEL 0x7F |
| Dollar | `$` |
| Dot | `.` |
| Dot | `.` or `..` as entire string |
| DoubleQuote | `"` |
| Hash | `#` |
| InvalidUtf8 | An invalid UTF-8 character (e.g. latin1) |
@ -283,6 +330,8 @@ will show you the defaults for the backends.
| Slash | `/` |
| SquareBracket | `[`, `]` |
##### Encoding example: FTP
To take a specific example, the FTP backend's default encoding is
--ftp-encoding "Slash,Del,Ctl,RightSpace,Dot"
@ -300,14 +349,42 @@ to the existing ones, giving:
This can be specified using the `--ftp-encoding` flag or using an `encoding` parameter in the config file.
Or let's say you have a Windows server but you want to preserve ``
and ``, you would then have this as the encoding (the Windows
encoding minus `Asterisk` and `Question`).
##### Encoding example: Windows
Slash,LtGt,DoubleQuote,Colon,Pipe,BackSlash,Ctl,RightSpace,RightPeriod,InvalidUtf8,Dot
As a nother example, take a Windows system where there is a file with
name `Test1.jpg`, where `` is the Unicode fullwidth colon symbol.
When using rclone to copy this to a remote which supports `:`,
the regular (halfwidth) colon (such as Google Drive), you will notice
that the file gets renamed to `Test:1.jpg`.
This can be specified using the `--local-encoding` flag or using an
`encoding` parameter in the config file.
To avoid this you can change the set of characters rclone should convert
for the local filesystem, using command-line argument `--local-encoding`.
Rclone's default behavior on Windows corresponds to
```
--local-encoding "Slash,LtGt,DoubleQuote,Colon,Question,Asterisk,Pipe,BackSlash,Ctl,RightSpace,RightPeriod,InvalidUtf8,Dot"
```
If you want to use fullwidth characters ``, `` and `` in your filenames
without rclone changing them when uploading to a remote, then set the same as
the default value but without `Colon,Question,Asterisk`:
```
--local-encoding "Slash,LtGt,DoubleQuote,Pipe,BackSlash,Ctl,RightSpace,RightPeriod,InvalidUtf8,Dot"
```
Alternatively, you can disable the conversion of any characters with `--local-encoding None`.
Instead of using command-line argument `--local-encoding`, you may also set it
as [environment variable](/docs/#environment-variables) `RCLONE_LOCAL_ENCODING`,
or [configure](/docs/#configure) a remote of type `local` in your config,
and set the `encoding` option there.
The risk by doing this is that if you have a filename with the regular (halfwidth)
`:`, `*` and `?` in your cloud storage, and you try to download
it to your Windows filesystem, this will fail. These characters are not
valid in filenames on Windows, and you have told rclone not to work around
this by converting them to valid fullwidth variants.
### MIME Type ###