What is the best way to remove accents (normalize) in a Python Unicode string?

Here is a short function which lowercases a string and strips its diacritics, but keeps the non-Latin characters. Most cases (e.g., "à" -> "a") are handled by unicodedata (standard library), but several (e.g., "æ" -> "ae") rely on the two parallel strings LATIN and ASCII.

Code

from unicodedata import combining, normalize

LATIN = "ä æ ǽ đ ð ƒ ħ ı ł ø ǿ ö œ ß ŧ ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.lower().translate(outliers)) if not combining(c))

NB. The default argument outliers is evaluated only once, when the function is defined, and is not meant to be provided by the caller.
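To see why the function needs both NFD normalization and a translation table, it can help to look at what NFD produces for a single character. The sketch below is only an illustration; the helper name describe() is mine and not part of the function above.

```python
from unicodedata import combining, name, normalize

def describe(text):
    """Return (character name, combining class) for each code point of the NFD form."""
    return [(name(ch), combining(ch)) for ch in normalize("NFD", text)]

# "à" splits into a base letter plus a combining mark (non-zero combining class),
# so filtering on combining() removes the accent.
print(describe("à"))

# "æ" has no canonical decomposition: it stays a single code point,
# which is why it has to be handled by the LATIN/ASCII table instead.
print(describe("æ"))
```

Characters whose combining class is 0 are kept by the filter, so anything NFD cannot split must be mapped explicitly.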

Intended usage

As a key to sort a list of strings in a more “natural” order:

sorted(['cote', 'coteau', "crottez", 'crotté', 'côte', 'côté'], key=remove_diacritics)

Output:

['cote', 'côte', 'côté', 'coteau', 'crotté', 'crottez']

If your strings mix text and numbers, you may be interested in composing remove_diacritics() with the function string_to_pairs() I give elsewhere.
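Since string_to_pairs() is not shown here, the following is only a rough sketch of what such a composition could look like. Both text_number_pairs() and strip_marks() are hypothetical stand-ins written for this example (the latter is a simplified remove_diacritics() without the outliers table), not the functions referred to above.

```python
import re
from unicodedata import combining, normalize

def strip_marks(s):
    """Simplified stand-in for remove_diacritics(): lowercase, then drop combining marks."""
    return "".join(c for c in normalize("NFD", s.lower()) if not combining(c))

def text_number_pairs(s):
    """Hypothetical helper: split s into (text, number) pairs.

    A chunk without digits gets -1 as its number, so every pair has the
    shape (str, int) and pairs can always be compared with each other.
    """
    pairs = re.findall(r"(\D*)(\d*)", s)
    return [(text, int(digits) if digits else -1) for text, digits in pairs if text or digits]

def natural_key(s):
    """Sort key: strip diacritics first, then compare embedded numbers numerically."""
    return text_number_pairs(strip_marks(s))

names = ["été 10", "ete 2", "Été 1"]
print(sorted(names, key=natural_key))
```

With this key, "été 10" sorts after "ete 2" because 10 is compared as a number rather than character by character.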

Tests

To make sure the behavior meets your needs, take a look at the pangrams below:

examples = [
    ("hello, world", "hello, world"),
    ("42", "42"),
    ("你好,世界", "你好,世界"),
    (
        "Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.",
        "des noel, ou un zephyr hai me vet de glacons wuermiens, je dine d’exquis rotis de boeuf au kir, a l’ay d’age mur, &caetera.",
    ),
    (
        "Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.",
        "falsches ueben von xylophonmusik quaelt jeden groesseren zwerg.",
    ),
    (
        "Љубазни фењерџија чађавог лица хоће да ми покаже штос.",
        "љубазни фењерџија чађавог лица хоће да ми покаже штос.",
    ),
    (
        "Ljubazni fenjerdžija čađavog lica hoće da mi pokaže štos.",
        "ljubazni fenjerdzija cadavog lica hoce da mi pokaze stos.",
    ),
    (
        "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Walther spillede på xylofon.",
        "quizdeltagerne spiste jordbaer med flode, mens cirkusklovnen walther spillede pa xylofon.",
    ),
    (
        "Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa.",
        "kaemi ny oexi her ykist þjofum nu baedi vil og adrepa.",
    ),
    (
        "Glāžšķūņa rūķīši dzērumā čiepj Baha koncertflīģeļu vākus.",
        "glazskuna rukisi dzeruma ciepj baha koncertfligelu vakus.",
    ),
]

for (given, expected) in examples:
    assert remove_diacritics(given) == expected

Case-preserving variant

LATIN = "ä æ ǽ đ ð ƒ ħ ı ł ø ǿ ö œ ß ŧ ü Ä Æ Ǽ Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö Œ ẞ Ŧ Ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))


    Accents (diacritics) are marks added to letters in many languages, as in é, č, or ä. In this article, we are going to remove accents from a string.

    Examples:

    Input: orčpžsíáýd

    Output: orcpzsiayd

    Input: stävänger

    Output: stavanger

    We can remove accents from a string by using a Python module called Unidecode. It is a third-party package (installable with pip install Unidecode), and its unidecode() function takes a Unicode string and returns an ASCII approximation of it, with the accents removed.

    Syntax:

    output_string = unidecode.unidecode(target_string)

    Below are some examples which show how to remove accents from a string:

    Example 1:

    Python3

    import unidecode

    string = "orčpžsíáýd"

    print('\nOriginal String:', string)

    outputString = unidecode.unidecode(string)

    print('\nNew String:', outputString)

    Output:

    Original String: orčpžsíáýd

    New String: orcpzsiayd

    Example 2:

    Python3

    import unidecode

    string = "stävänger"

    print('\nOriginal String:', string)

    outputString = unidecode.unidecode(string)

    print('\nNew String:', outputString)

    Output:

    Original String: stävänger

    New String: stavanger

    Example 3:

    Python3

    import unidecode

    stringList = ["hell°",  "tromsø",  "stävänger", "ölut"]

    print('\nOriginal List of Strings:\n', stringList)

    for i in range(len(stringList)):

        stringList[i] = unidecode.unidecode(stringList[i])

    print('\nNew List of Strings:\n', stringList)

    Output:

    Original List of Strings:
     ['hell°', 'tromsø', 'stävänger', 'ölut']

    New List of Strings:
     ['helldeg', 'tromso', 'stavanger', 'olut']


    How do I remove all special characters from a string in Python?

    Use the re module with the pattern [^A-Za-z0-9], which matches every character except ASCII letters and digits. Replacing each match with an empty string removes everything except those letters and digits.
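As a minimal sketch of the pattern described above, assuming re.sub() is the function used for the replacement (the helper name keep_alphanumerics() is made up for this example):

```python
import re

def keep_alphanumerics(text):
    """Remove every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", text)

print(keep_alphanumerics("Hello, World! 42"))  # HelloWorld42

# Accented letters are not in A-Z or a-z, so they are deleted, not converted.
print(keep_alphanumerics("c\u00f4t\u00e9"))  # ct
```

Note that this deletes accented letters outright, so it is not a substitute for removing only the accents.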

    How do you escape a Unicode character in Python?

    You can use the escapes \u and \U to specify Unicode characters with 4 and 8 hexadecimal digits respectively. You can also specify a Unicode character by name using the \N{name} escape sequence.
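The three escape forms can be compared directly in a short example. The variable names are only illustrative.

```python
# Three ways to write the same character, "é" (U+00E9).
short_escape = "\u00e9"       # \u takes exactly 4 hexadecimal digits
long_escape = "\U000000e9"    # \U takes exactly 8 hexadecimal digits
named_escape = "\N{LATIN SMALL LETTER E WITH ACUTE}"

print(short_escape == long_escape == named_escape)  # True

# The same letter can also be written as a base letter plus a combining mark.
# It usually looks identical on screen but is two code points long.
combining_form = "e\N{COMBINING ACUTE ACCENT}"
print(len(short_escape), len(combining_form))  # 1 2
```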

    How do you change an accented character to a regular character?

    One approach is to strip every character that is not a letter or digit, for example with replace(/[^a-z0-9]/gi,'') in JavaScript. However, a more intuitive solution (at least for the user) is to replace accented characters with their "plain" equivalents, e.g. turn á into a and ç into c.
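In Python, one common way to get "plain" equivalents uses only the standard library: decompose the string, then drop everything that is not ASCII. This is a sketch of that technique rather than the method from the snippet above, and the name to_plain_ascii() is made up for this example.

```python
import unicodedata

def to_plain_ascii(text):
    """Decompose characters (NFKD), then discard every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_plain_ascii("\u00e1\u00e7"))  # ac  (precomposed á and ç)
print(to_plain_ascii("a\u0301"))       # a   (a followed by a combining acute accent)

# Letters without a decomposition, such as ø, are dropped entirely.
print(to_plain_ascii("troms\u00f8"))   # troms
```

The last line shows the trade-off: this approach silently discards letters such as ø, as well as all non-Latin text.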
