What is the best way to remove accents (normalize) in a Python Unicode string?

Here is a short function that strips diacritics but keeps non-Latin characters. Note that it also lowercases its input. Most cases (e.g., "à" -> "a") are handled by unicodedata (standard library), but several (e.g., "æ" -> "ae") rely on the given parallel strings.

Code

from unicodedata import combining, normalize

LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.lower().translate(outliers)) if not combining(c))

NB. The default argument outliers is evaluated once and not meant to be provided by the caller.
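To see why "à" needs no table entry while "æ" does, it helps to look at what NFD normalization actually produces. The following is a small illustrative sketch using only the standard library; the helper name strip_marks is made up for this example and is not part of the function above.

```python
from unicodedata import combining, normalize

def strip_marks(s):
    # Hypothetical helper: decompose with NFD, then drop combining marks.
    return "".join(c for c in normalize("NFD", s) if not combining(c))

# "à" decomposes into a base letter plus a combining grave accent (U+0300).
decomposed = normalize("NFD", "à")
print([hex(ord(c)) for c in decomposed])  # ['0x61', '0x300']

# The combining mark is filtered out, leaving the base letter.
print(strip_marks("à"))    # a

# "æ" and "ø" have no canonical decomposition, so NFD leaves them untouched.
# This is why they need an explicit translation table.
print(strip_marks("æ ø"))  # æ ø
```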

Intended usage

As a key to sort a list of strings in a more “natural” order:

sorted(['cote', 'coteau', "crottez", 'crotté', 'côte', 'côté'], key=remove_diacritics)

Output:

['cote', 'côte', 'côté', 'coteau', 'crotté', 'crottez']
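For comparison, here is a quick sketch of what the default sort does with the same list. Without a key, Python compares strings by code point, so every word containing "ô" lands after all the unaccented "co" and "cr" words.

```python
words = ['cote', 'coteau', 'crottez', 'crotté', 'côte', 'côté']

# No key: comparison is by Unicode code point, and "ô" (U+00F4) and
# "é" (U+00E9) sort after every ASCII letter.
print(sorted(words))
# ['cote', 'coteau', 'crottez', 'crotté', 'côte', 'côté']
```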

If your strings mix text and numbers, you may be interested in composing remove_diacritics() with the function string_to_pairs() I give elsewhere.

Tests

To make sure the behavior meets your needs, take a look at the pangrams below:

examples = [
    ("hello, world", "hello, world"),
    ("42", "42"),
    ("你好，世界", "你好，世界"),
    (
        "Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.",
        "des noel, ou un zephyr hai me vet de glacons wuermiens, je dine d’exquis rotis de boeuf au kir, a l’ay d’age mur, &caetera.",
    ),
    (
        "Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.",
        "falsches ueben von xylophonmusik quaelt jeden groesseren zwerg.",
    ),
    (
        "Љубазни фењерџија чађавог лица хоће да ми покаже штос.",
        "љубазни фењерџија чађавог лица хоће да ми покаже штос.",
    ),
    (
        "Ljubazni fenjerdžija čađavog lica hoće da mi pokaže štos.",
        "ljubazni fenjerdzija cadavog lica hoce da mi pokaze stos.",
    ),
    (
        "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Walther spillede på xylofon.",
        "quizdeltagerne spiste jordbaer med flode, mens cirkusklovnen walther spillede pa xylofon.",
    ),
    (
        "Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa.",
        "kaemi ny oexi her ykist þjofum nu baedi vil og adrepa.",
    ),
    (
        "Glāžšķūņa rūķīši dzērumā čiepj Baha koncertflīģeļu vākus.",
        "glazskuna rukisi dzeruma ciepj baha koncertfligelu vakus.",
    ),
]

for (given, expected) in examples:
    assert remove_diacritics(given) == expected

Case-preserving variant

# The key for "SS" must be the single character ẞ (U+1E9E, capital sharp s):
# str.maketrans raises ValueError on multi-character keys such as "SS".
LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü  Ä  Æ  Ǽ  Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö  Œ  ẞ  Ŧ Ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))


Accents (diacritics) are marks added to letters in many languages, as in "é" or "č". In this article, we are going to remove accents from a string.

Examples:

Input: orčpžsíáýd
Output: orcpzsiayd
Input: stävänger
Output: stavanger

We can remove accents from a string by using a third-party Python package called Unidecode (installable with pip install Unidecode). The package provides a function, unidecode(), that takes a Unicode string and returns an ASCII string without accents.

Syntax:

output_string = unidecode.unidecode(target_string)

Below are some examples that show how to remove accents from a string:

Example 1:

Python3

import unidecode

string = "orčpžsíáýd"

print('\nOriginal String:', string)

outputString = unidecode.unidecode(string)

print('\nNew String:', outputString)

Output:

Original String: orčpžsíáýd
New String: orcpzsiayd

Example 2:

Python3

import unidecode

string = "stävänger"

print('\nOriginal String:', string)

outputString = unidecode.unidecode(string)

print('\nNew String:', outputString)

Output:

Original String: stävänger
New String: stavanger

Example 3:

Python3

import unidecode

stringList = ["hell°", "tromsø", "stävänger", "ölut"]

print('\nOriginal List of Strings:\n', stringList)

for i in range(len(stringList)):
    stringList[i] = unidecode.unidecode(stringList[i])

print('\nNew List of Strings:\n', stringList)

Output:

Original List of Strings:
 ['hell°', 'tromsø', 'stävänger', 'ölut']
New List of Strings:
 ['helldeg', 'tromso', 'stavanger', 'olut']
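If adding a third-party dependency is not an option, a rough standard-library approximation is sketched below. This is not how Unidecode works internally; it is a different technique (NFKD decomposition followed by discarding every non-ASCII character), and the function name ascii_fold is invented for this example.

```python
import unicodedata

def ascii_fold(s):
    # Hypothetical helper, not part of Unidecode.
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops anything non-ASCII.
    decomposed = unicodedata.normalize("NFKD", s)
    return decomposed.encode("ascii", "ignore").decode("ascii")

for word in ["hell°", "tromsø", "stävänger", "ölut"]:
    print(word, "->", ascii_fold(word))
```

Characters with no decomposition, such as "°" and "ø", simply vanish with this approach, whereas Unidecode transliterates them to "deg" and "o" in the output above.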


How do I remove all special characters from a string in Python?

Using the re module: the pattern "[^A-Za-z0-9]" matches every character except ASCII letters and digits. All of the matched characters are replaced with an empty string, so everything except letters and digits is removed.
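A minimal sketch of that approach, assuming re.sub is the function meant. Note that this pattern deletes accented letters, spaces, and punctuation outright rather than converting them.

```python
import re

text = "Héllo, wörld! 42"

# [^A-Za-z0-9] matches every character that is NOT an ASCII letter or digit;
# each match is replaced with the empty string.
cleaned = re.sub(r"[^A-Za-z0-9]", "", text)
print(cleaned)  # Hllowrld42
```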

How do you escape a Unicode character in Python?

You can use the escapes \u and \U to specify Unicode characters with 4 and 8 hexadecimal digits respectively. You can also specify a Unicode character by name using the \N{name} escape sequence.
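As a short illustration of these escapes in Python string literals, the sketch below writes the same character three ways; the variable names are arbitrary.

```python
# The same character, LATIN SMALL LETTER E WITH ACUTE, written three ways.
a = "\u00e9"                               # \u followed by 4 hex digits
b = "\U000000e9"                           # \U followed by 8 hex digits
c = "\N{LATIN SMALL LETTER E WITH ACUTE}"  # \N{name} uses the Unicode name

print(a, b, c)
print(a == b == c)  # True

# \U is required for code points beyond U+FFFF, which do not fit in 4 digits.
smiley = "\U0001F600"
print(len(smiley))  # 1
```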

How do you change an accented character to a regular character?

A regular-expression replace such as replace(/[^a-z0-9]/gi,'') simply deletes accented characters. A more intuitive solution (at least for the user) is to replace accented characters with their "plain" equivalents, e.g. turn á into a and ç into c.
