PEP 672 — Unicode-related Security Considerations for Python
Most of this information comes directly from Python.org, specifically from PEP 672. But I felt that the more people who read this, the quicker this issue will be resolved. Some of the information here can be daunting and complicated, but if you read carefully you will understand it completely. Also, while much of what follows may apply to other programming languages, this article is specifically about Python, as Python is our preferred language here at IA.
The Python community has recognized a potential security problem: the freedom that Unicode offers can let a malicious individual write code that fools a person conducting a code review of something that has just been introduced to a system.
Unicode is the universal character encoding standard that provides the basis for processing, storing, and interchanging text data on platforms throughout the world. It performs this function regardless of language and regardless of the software being used. Unicode encompasses all characters and all writing systems used throughout the world, old systems as well as new. The intention of the Unicode Standard is to support the needs of all types of users, in business or academia, using either mainstream or minority scripts. As you can imagine, the number of languages and their variations would be difficult to keep track of, so Unicode encodes the characters for a script, not the languages themselves. But it essentially covers the entire world of script and text.
| PEP | 672 |
|---|---|
| Title | Unicode-related Security Considerations for Python |
| Author | Petr Viktorin <encukou at gmail.com> |
| Status | Active |
| Type | Informational |
| Created | 01-Nov-2021 |
| Post-History | 01-Nov-2021 |
If you wish to see which scripts are supported in more detail, the table below lists the scripts added in each version of Unicode.
| Version (Year) | Scripts Added | Total | Scripts |
|---|---|---|---|
| 1.1 (1993) | 23 | 23 | Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari, Georgian, Greek, Gujarati, Gurmukhi, Han, Hangul, Hebrew, Hiragana, Kannada, Katakana, Lao, Latin, Malayalam, Oriya, Tamil, Telugu, Thai |
| 2.0 (1996) | +1 | 24 | Tibetan |
| 3.0 (1999) | +13 | 37 | Braille (patterns), Canadian Syllabics, Cherokee, Ethiopic, Khmer, Mongolian, Myanmar, Ogham, Runic, Sinhala, Syriac, Thaana, Yi |
| 3.1 (2001) | +3 | 40 | Deseret, Gothic, Old Italic |
| 3.2 (2002) | +4 | 44 | Buhid, Hanunóo, Tagalog, Tagbanwa |
| 4.0 (2003) | +7 | 51 | Cypriot, Limbu, Linear B, Osmanya, Shavian, Tai Le, Ugaritic |
| 4.1 (2005) | +8 | 59 | Buginese, Coptic, Glagolitic, Kharoshthi, New Tai Lue, Old Persian Cuneiform, Syloti Nagri, Tifinagh |
| 5.0 (2006) | +5 | 64 | Balinese, N’Ko, Phags-pa, Phoenician, Sumero-Akkadian Cuneiform |
| 5.1 (2008) | +11 | 75 | Carian, Cham, Kayah Li, Lepcha, Lycian, Lydian, Ol Chiki, Rejang, Saurashtra, Sundanese, Vai |
| 5.2 (2009) | +15 | 90 | Avestan, Bamum, Egyptian Hieroglyphs, Imperial Aramaic, Inscriptional Pahlavi, Inscriptional Parthian, Javanese, Kaithi, Lisu, Meetei Mayek, Old South Arabian, Old Turkic, Samaritan, Tai Tham, Tai Viet |
| 6.0 (2010) | +3 | 93 | Batak, Brahmi, Mandaic |
| 6.1 (2012) | +7 | 100 | Chakma, Meroitic Cursive, Meroitic Hieroglyphs, Miao, Sharada, Sora Sompeng, Takri |
| 7.0 (2014) | +23 | 123 | Bassa Vah, Caucasian Albanian, Duployan (shorthand), Elbasan, Grantha, Khojki, Khudawadi, Linear A, Mahajani, Manichaean, Mende Kikakui, Modi, Mro, Nabataean, Old North Arabian, Old Permic, Pahawh Hmong, Palmyrene, Pau Cin Hau, Psalter Pahlavi, Siddham, Tirhuta, Warang Citi |
| 8.0 (2015) | +6 | 129 | Ahom, Anatolian Hieroglyphs, Hatran, Multani, Old Hungarian, Sutton SignWriting |
| 9.0 (2016) | +6 | 135 | Adlam, Bhaiksuki, Marchen, Newa, Osage, Tangut |
| 10.0 (2017) | +4 | 139 | Masaram Gondi, Nushu, Soyombo, Zanabazar Square |
| 11.0 (2018) | +7 | 146 | Dogra, Gunjala Gondi, Hanifi Rohingya, Makasar, Medefaidrin, Old Sogdian, Sogdian |
| 12.0 (2019) | +4 | 150 | Elymaic, Nandinagari, Nyiakeng Puachue Hmong, Wancho |
| 13.0 (2020) | +4 | 154 | Chorasmian, Dives Akuru, Khitan Small Script, Yezidi |
| 14.0 (2021) | +5 | 159 | Cypro-Minoan, Old Uyghur, Tangsa, Toto, Vithkuqi |
In addition to the scripts listed above, a large number of other collections of characters are also encoded by Unicode. These collections include the following:
- Numbers
- General Diacritics
- General Punctuation
- General Symbols
- Mathematical Symbols (Western and Arabic)
- Musical Symbols (Western, Byzantine, Ancient Greek, and other)
- Technical Symbols
- Emoji: For details, see Emoji Versions
- Dingbats
- Arrows, Blocks, Box Drawing Forms, and Geometric Shapes
- Game Symbols
- Miscellaneous Symbols
- Presentation Forms
- Kangxi and other CJK radicals
Keep in mind, Unicode works with scripts, not languages. Its power to encompass virtually any language via written scripts naturally makes it a broadly used and universally accepted system for encoding languages. It really is a great resource.
According to the Unicode website: when writing systems for more than one language share sets of graphical symbols that have historically related derivations, the union of all of them is treated as a single collection of characters for encoding and is identified as a single script, which in turn serves as an inventory of graphical symbols drawn upon by the writing systems of any particular language. So you will see, on occasion, a written language that uses only one script because the symbols are understood to be specific to that language; Hangul is one such case, typically used only to write Korean. Whereas the writing systems of some languages use more than one script: Japanese, for example, traditionally uses the Han (Kanji), Hiragana, and Katakana scripts, and modern Japanese usage commonly mixes in Latin script as well.
So one can see just how very versatile and ingenious Unicode really is.
As advanced and truly impressive as Unicode is, much work is still being done to complete it. It still requires development of supporting tools such as keyboards and fonts, as well as language data (date formats and the like) and translations of that data. So it is still a work in progress, if you can believe that.
So the very thing that gives Unicode its versatility and power is the same thing that allows a malicious actor to take advantage of security flaws: because it allows writing virtually any kind of text as code, it can be used to confuse a code reviewer. And if malicious code is able to masquerade as something harmless, it is possible that the person reviewing it will mistakenly approve it for their system or systems.
Apparently the best course of action is not for the Python developers to rewrite Python itself, as this would impose far too many restrictions and would probably make the language much less usable with many different data sets and codebases. Awareness seems to be the correct strategy here. That is, the programmers and the people reviewing the code should approach this by watching for the issues that malicious code can introduce and by enforcing project-specific policies through their code editors and tooling.
This makes perfect sense – if you ask me.
Malicious code that at first glance appears to be harmless can be missed by the individual reviewing it. This is a possibility, not the norm. Awareness of this possibility was first raised as a potential problem by CVE-2021-42574:
An issue was discovered in the Bidirectional Algorithm in the Unicode Specification through 14.0. It permits the visual reordering of characters via control sequences, which can be used to craft source code that renders different logic than the logical ordering of tokens ingested by compilers and interpreters. Adversaries can leverage this to encode source code for compilers accepting Unicode such that targeted vulnerabilities are introduced invisibly to human reviewers.
Now, one must understand that this does not mean that anything like this has taken place or that there are individuals out there working on something that targets this vulnerability specifically – this is only an indication that it is possible.
As you can imagine, if you put all of the languages of the world together, there will be many differences in the scripts that translate each individual language into Unicode, so some confusion and some mismatches in the way a script is interpreted are bound to occur. One other thing that needs to be mentioned once again: we are referring specifically to Python and how this affects Python in particular. Although the security issue may be germane to other languages as well, the underlying problem is for the most part the same one everywhere, and the response is the same: programmers, code editors, and code reviewers need to stay on high alert and be on the lookout for code that is poorly written by accident or deliberately written in a malicious way. Either can be harmful. Since the root of the issue is that one symbol can resemble another so closely that the difference goes unnoticed by an individual, it behooves us to solve this problem in the code editor during development: just add to the vast capabilities already present in code editors. This will make the job of anyone involved so much easier.
Some characters look alike. Before the age of computers, many mechanical typewriters lacked the keys for the digits 0 and 1: users typed O (capital o) and l (lowercase L) instead. Human readers could tell them apart by context only. In programming languages, however, distinction between digits and letters is critical — and most fonts designed for programmers make it easy to tell them apart.
Similarly, in fonts designed for human languages, the uppercase “I” and lowercase “l” can look similar. Or the letters “rn” may be virtually indistinguishable from the single letter “m”. Again, programmers’ fonts make these pairs of confusables noticeably different.
However, what is “noticeably” different always depends on the context. Humans tend to ignore details in longer identifiers: the variable name accessibi1ity_options can still look indistinguishable from accessibility_options, while they are distinct for the compiler. The same can be said for plain typos: most humans will not notice the typo in responsbility_chain_delegate.
accessibi1ity_options
accessibility_options
Control Characters
Python generally considers CR (\r), LF (\n), and CR-LF pairs (\r\n) as end-of-line characters. Most code editors do as well, but there are editors that display “non-native” line endings as unknown characters (or as nothing at all) rather than ending the line, displaying this example:
# Don't call this function: fire_the_missiles()
as a harmless comment like:
# Don't call this function:⬛fire_the_missiles()
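The underlying behavior is easy to verify. Here is a minimal sketch, assuming a Python 3 interpreter; the strings and the printed message are purely illustrative:

# A lone CR really does end a logical line for Python's tokenizer,
# even if an editor renders the whole thing as one harmless comment.
source = '# Do not call this function: \rprint("called anyway")\n'
exec(compile(source, '<demo>', 'exec'))  # prints: called anyway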
CPython may treat the control character NUL (\0) as end of input, but many editors simply skip it, possibly showing code that Python will not run as a regular part of a file.
Some characters can be used to hide/overwrite other characters when source is listed in common terminals. For example:
- BS (\b, Backspace) moves the cursor back, so the character after it will overwrite the character before.
- CR (\r, carriage return) moves the cursor to the start of the line, so subsequent characters overwrite the beginning of the line.
- SUB (\x1A, Ctrl+Z) means “End of text” on Windows. Some programs (such as type) ignore the rest of the file after it.
- ESC (\x1B) commonly initiates escape codes which allow arbitrary control of the terminal.
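Since editors and terminals may hide these characters, a simple pre-review check can flag them. The following is a minimal sketch, not an official tool; the SUSPICIOUS set and the script itself are illustrative assumptions:

import sys

# C0 controls except tab, LF and CR (CR is allowed here for the sake of
# CR-LF files; a stricter checker would flag a lone CR), plus DEL and
# the C1 control range.
SUSPICIOUS = {c for c in map(chr, range(0x20)) if c not in '\t\n\r'}
SUSPICIOUS |= set(map(chr, range(0x7F, 0xA0)))

for path in sys.argv[1:]:
    text = open(path, encoding='utf-8', errors='replace').read()
    lineno = 1
    for ch in text:
        if ch == '\n':
            lineno += 1
        elif ch in SUSPICIOUS:
            print(f'{path}:{lineno}: control character U+{ord(ch):04X}')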
Confusable Characters in Identifiers
Python is not limited to ASCII. It allows characters of all scripts – Latin letters to ancient Egyptian hieroglyphs – in identifiers (such as variable names). See PEP 3131 for details and rationale. Only “letters and numbers” are allowed, so while γάτα is a valid Python identifier, 🐱 is not. (See Identifiers and keywords for details.)
Non-printing control characters are also not allowed in identifiers.
However, within the allowed set there is a large number of “confusables”. For example, the uppercase versions of the Latin b, Greek β (Beta), and Cyrillic в (Ve) often look identical: B, Β and В, respectively.
This allows identifiers that look the same to humans, but not to Python. For example, all of the following are distinct identifiers:
- scope (Latin, ASCII-only)
- scоpe (with a Cyrillic о)
- scοpe (with a Greek ο)
- ѕсоре (all Cyrillic letters)
Additionally, some letters can look like non-letters:
- The letter for the Hawaiian ʻokina looks like an apostrophe; ʻHelloʻ is a Python identifier, not a string.
- The East Asian word for ten looks like a plus sign, so 十= 10 is a complete Python statement. (The “十” is a word: “ten” rather than “10”.)
Note
The converse also applies – some symbols look like letters – but since Python does not allow arbitrary symbols in identifiers, this is not an issue.
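One way to make such lookalikes visible during review is to ask Unicode what each character actually is. Here is a minimal sketch using the standard unicodedata module; the describe helper is an illustrative name, not from the PEP:

import unicodedata

def describe(identifier):
    # Printing each character's official Unicode name exposes a Cyrillic
    # or Greek letter hiding in an otherwise Latin identifier.
    for ch in identifier:
        print(f'{ch!r}: {unicodedata.name(ch, "<unnamed>")}')

describe('scоpe')  # the third character is CYRILLIC SMALL LETTER O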
Confusable Digits
Numeric literals in Python only use the ASCII digits 0-9 (and non-digits such as . or e).
However, when numbers are converted from strings, such as in the int and float constructors or by the str.format method, any decimal digit can be used. For example ߅ (NKO DIGIT FIVE) or ௫ (TAMIL DIGIT FIVE) work as the digit 5.
Some scripts include digits that look similar to ASCII ones, but have a different value. For example:
>>> int('৪୨')
42
>>> '{٥}'.format('zero', 'one', 'two', 'three', 'four', 'five')
'five'
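When numbers arrive from untrusted text, one defensive option is to insist on ASCII before converting. A minimal sketch, assuming Python 3.7+ for str.isascii(); the helper name is illustrative:

def parse_ascii_int(text):
    # int() happily accepts decimal digits from any script; rejecting
    # non-ASCII input keeps '৪୨' from sneaking through as 42.
    if not text.isascii():
        raise ValueError(f'non-ASCII characters in number: {text!r}')
    return int(text)

parse_ascii_int('42')   # returns 42
parse_ascii_int('৪୨')  # raises ValueError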
Bidirectional Text
Some scripts, such as Hebrew or Arabic, are written right-to-left. Phrases in such scripts interact with nearby text in ways that can be surprising to people who aren’t familiar with these writing systems and their computer representation.
The exact process is complicated, and explained in Unicode Standard Annex #9, Unicode Bidirectional Algorithm.
Consider the following code, which assigns a 100-character string to the variable s:
s = "X" * 100 # "X" is assigned
When the X is replaced by the Hebrew letter א, the line becomes:
s = "א" * 100 # "א" is assigned
This command still assigns a 100-character string to s, but when displayed as general text following the Bidirectional Algorithm (e.g. in a browser), it appears as s = "א" followed by a comment.
Other surprising examples include:
- In the statement ערך = 23, the variable ערך is set to the integer 23.
- In the statement قيمة = ערך, the variable قيمة is set to the value of ערך.
- In the statement قيمة - (ערך ** 2), the value of ערך is squared and then subtracted from قيمة. The opening parenthesis is displayed as ).
Bidirectional Marks, Embeddings, Overrides and Isolates
Default reordering rules do not always yield the intended direction of text, so Unicode provides several ways to alter it.
The most basic are directional marks, which are invisible but affect text as a left-to-right (or right-to-left) character would. Continuing with the s = "X" example above, in the next example the X is replaced by the Latin x followed or preceded by a right-to-left mark (U+200F). This assigns a 200-character string to s (100 copies of x interspersed with 100 invisible marks), but under Unicode rules for general text, it is rendered as s = "x" followed by an ASCII-only comment:
s = "x" * 100 # "x" is assigned
The directional embedding, override and isolate characters are also invisible, but affect the ordering of all text after them until either ended by a dedicated character, or until the end of line. (Unicode specifies the effect to last until the end of a “paragraph” (see Unicode Bidirectional Algorithm), but allows tools to interpret newline characters as paragraph ends (see Unicode Newline Guidelines). Most code editors and terminals do so.)
These characters essentially allow arbitrary reordering of the text that follows them. Python only allows them in strings and comments, which does limit their potential (especially in combination with the fact that Python’s comments always extend to the end of a line), but it doesn’t render them harmless.
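Because these characters are invisible, tooling is the practical defense here as well. This minimal sketch flags the explicit directional formatting characters; the set follows the controls named in Unicode Standard Annex #9, but the script itself is an illustrative assumption:

import sys

# Directional marks (LRM, RLM, ALM), embeddings/overrides
# (LRE, RLE, PDF, LRO, RLO) and isolates (LRI, RLI, FSI, PDI).
BIDI_CONTROLS = set('\u200e\u200f\u061c'
                    '\u202a\u202b\u202c\u202d\u202e'
                    '\u2066\u2067\u2068\u2069')

for path in sys.argv[1:]:
    for lineno, line in enumerate(open(path, encoding='utf-8'), start=1):
        for ch in line:
            if ch in BIDI_CONTROLS:
                print(f'{path}:{lineno}: bidi control U+{ord(ch):04X}')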
Normalizing identifiers
Python strings are collections of Unicode codepoints, not “characters”.
For reasons like compatibility with earlier encodings, Unicode often has several ways to encode what is essentially a single “character”. For example, all these are different ways of writing Å as a Python string, each of which is unequal to the others.
- "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}" (1 codepoint)
- "\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}" (2 codepoints)
- "\N{ANGSTROM SIGN}" (1 codepoint, but different)
For another example, the ligature ﬁ has a dedicated Unicode codepoint, even though it has the same meaning as the two letters fi.
Also, common letters frequently have several distinct variations. Unicode provides them for contexts where the difference has some semantic meaning, like mathematics. For example, some variations of n are:
- n (LATIN SMALL LETTER N)
- 𝐧 (MATHEMATICAL BOLD SMALL N)
- 𝘯 (MATHEMATICAL SANS-SERIF ITALIC SMALL N)
- ｎ (FULLWIDTH LATIN SMALL LETTER N)
- ⁿ (SUPERSCRIPT LATIN SMALL LETTER N)
Unicode includes algorithms to normalize variants like these to a single form, and Python identifiers are normalized. (There are several normal forms; Python uses NFKC.)
For example, xn and xⁿ are the same identifier in Python:
>>> xⁿ = 8
>>> xn
8
… as are ﬁ and fi, and as are the different ways to encode Å.
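The normalization itself is exposed by the standard unicodedata module, so the equivalences above can be checked directly. A minimal sketch:

import unicodedata

# NFKC folds compatibility variants into one canonical form, which is
# exactly what Python applies to identifiers.
print(unicodedata.normalize('NFKC', 'xⁿ') == 'xn')      # True
print(unicodedata.normalize('NFKC', '\ufb01') == 'fi')  # True (fi ligature)
print(unicodedata.normalize('NFKC', '\N{ANGSTROM SIGN}')
      == '\N{LATIN CAPITAL LETTER A WITH RING ABOVE}')  # True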
This normalization applies only to identifiers, however. Functions that treat strings as identifiers, such as getattr, do not perform normalization:
>>> class Test:
...     def ﬁnalize(self):
...         print('OK')
...
>>> Test().ﬁnalize()
OK
>>> Test().finalize()
OK
>>> getattr(Test(), 'ﬁnalize')
Traceback (most recent call last):
  ...
AttributeError: 'Test' object has no attribute 'ﬁnalize'
This also applies when importing:
- import ﬁnalization performs normalization, and looks for a file named finalization.py (and other finalization.* files).
- importlib.import_module("ﬁnalization") does not normalize, so it looks for a file named ﬁnalization.py.
Some filesystems independently apply normalization and/or case folding. On some systems, finalization.py, ﬁnalization.py and FINALIZATION.py are three distinct filenames; on others, some or all of these name the same file.
Source Encoding
The encoding of Python source files is given by a specific regex on the first two lines of a file, as per Encoding declarations. This mechanism is very liberal in what it accepts, and thus easy to obfuscate.
This can be misused in combination with Python-specific special-purpose encodings (see Text Encodings). For example, with encoding: unicode_escape, characters like quotes or braces can be hidden in an (f-)string, with many tools (syntax highlighters, linters, etc.) considering them part of the string. For example:
# For writing Japanese, you don't need an editor that supports
# UTF-8 source encoding: unicode_escape sequences work just as well.
import os
message = '''
This is "Hello World" in Japanese:
\u3053\u3093\u306b\u3061\u306f\u7f8e\u3057\u3044\u4e16\u754c
This runs `echo WHOA` in your shell:
\u0027\u0027\u0027\u002c\u0028\u006f\u0073\u002e
\u0073\u0079\u0073\u0074\u0065\u006d\u0028
\u0027\u0065\u0063\u0068\u006f\u0020\u0057\u0048\u004f\u0041\u0027
\u0029\u0029\u002c\u0027\u0027\u0027
'''
Here, encoding: unicode_escape in the initial comment is an encoding declaration. The unicode_escape encoding instructs Python to treat \u0027 as a single quote (which can start/end a string), \u002c as a comma (punctuator), etc.
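A reviewer-side guard can read the declared encoding without executing the file, using the standard tokenize module; anything other than a UTF-8 family encoding deserves a closer look. A minimal sketch (the UTF-8-only policy is an illustrative choice, not a PEP requirement):

import sys
import tokenize

for path in sys.argv[1:]:
    with open(path, 'rb') as f:
        # detect_encoding applies the same first-two-lines rules as CPython.
        encoding, _ = tokenize.detect_encoding(f.readline)
    if encoding.lower() not in ('utf-8', 'utf-8-sig'):
        print(f'{path}: suspicious source encoding {encoding!r}')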
In conclusion: Python is a very diverse programming language, and Unicode is a universally accepted character encoding standard that is just as diverse and broad. Their very capability creates an innate possibility for confusion to become a vulnerability. However, this does not mean that exploitation is the norm, and if we simply attack this from the programmer and tooling side of things, we should keep the bad code to a minimum.
Thank you
driven