Hướng dẫn python natural sort - trăn tự nhiên

Question

Sắp xếp danh sách (list), mảng (array) là một trong những thao tác thông dụng với hầu hết các ứng dụng. Tuy nhiên, với đa số các thư viện mặc định, chẳng hạn như Python built-in library, cách thức sắp xếp sẽ khó đáp ứng tất cả nhu cầu của người dùng. Một trong những thứ tự sắp xếp phổ biến nhưng ít khi được hỗ trợ là sắp xếp tự nhiên, hay còn gọi là natural sort.

Natural sort là sắp xếp mà ở đó, các đối tượng sẽ được sắp xếp theo trình tự mà chúng ta thường hay sử dụng trong cuộc sống:

Original:['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
Sorted: ['image1.jpg', 'image3.jpg', 'image12.jpg', 'image15.jpg']

Với việc thực hiện sắp xếp mặc định bằng thư viện của Python, chúng ta sẽ có kết quả như sau:

>>> L = ['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
>>> L.sort()
['image1.jpg', 'image12.jpg', 'image15.jpg', 'image3.jpg']

Để có thể sắp xếp theo thứ tự tự nhiên như mong muốn, chúng ta sẽ cần định nghĩa lại cách sắp xếp bằng natural_key như sau:

def natural_key(string_):
  return [int(s) if s.isdigit() else s for s in re.split(r'(\d+)', string_)]

>>> L = ['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
>>> L.sort(key=natural_key)
['image1.jpg', 'image3.jpg', 'image12.jpg', 'image5.jpg']

Như vậy, để sắp xếp danh sách các chuỗi như trên, ta có thể đơn giản định nghĩa cách sắp xếp của riêng mình với tham số “key” của phương thức “sort()”.

Bên cạnh đó, chúng ta đều biết rằng luôn có nhiều hơn một cách để giải quyết vấn đề. Hiển nhiên, sắp xếp cũng vậy. Ta cũng có thể sử dụng các thư viện của bên thứ ba (third party library) như natsort để giải quyết vấn đề này.

Để có góc nhìn đa dạng hơn, ta có thể tham khảo tại:

StackOverflow
Python documentation

Post navigation

Nhiều khả năng

>>> L = ['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
>>> L.sort()
['image1.jpg', 'image12.jpg', 'image15.jpg', 'image3.jpg']

0 gắn chặt với việc triển khai cơ bản của loại Python. Bên cạnh đó, tham số CMP là di sản. Cách hiện đại là chuyển đổi các mục đầu vào thành các đối tượng hỗ trợ các hoạt động so sánh phong phú mong muốn.

Theo Cpython 2.x, các đối tượng của các loại khác nhau có thể được đặt hàng ngay cả khi các toán tử so sánh phong phú tương ứng chưa được thực hiện. Theo CPython 3.x, các đối tượng thuộc các loại khác nhau phải hỗ trợ rõ ràng cho việc so sánh. Xem làm thế nào để python so sánh chuỗi và int? mà liên kết đến các tài liệu chính thức. Hầu hết các câu trả lời phụ thuộc vào thứ tự ngầm này. Chuyển sang Python 3.x sẽ yêu cầu một loại mới để thực hiện và thống nhất các so sánh giữa các số và chuỗi.

Python 2.7.12 (default, Sep 29 2016, 13:30:34) 
>>> (0,"foo") < ("foo",0)
True

Python 3.5.2 (default, Oct 14 2016, 12:54:53) 
>>> (0,"foo") < ("foo",0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  TypeError: unorderable types: int() < str()

Có ba cách tiếp cận khác nhau. Đầu tiên sử dụng các lớp lồng nhau để tận dụng thuật toán so sánh ____11 của Python. Thứ hai hủy bỏ việc làm tổ này thành một lớp duy nhất. Các phân lớp thứ ba được phân lớp

>>> L = ['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
>>> L.sort()
['image1.jpg', 'image12.jpg', 'image15.jpg', 'image3.jpg']

2 để tập trung vào hiệu suất. Tất cả đều có thời gian; Lần thứ hai nhanh gấp đôi trong khi nhanh hơn gần sáu lần. Không cần thiết

>>> L = ['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
>>> L.sort()
['image1.jpg', 'image12.jpg', 'image15.jpg', 'image3.jpg']

2, và có lẽ là một ý tưởng tồi ngay từ đầu, nhưng nó đi kèm với một số tiện ích nhất định.

Các ký tự sắp xếp được nhân đôi để buộc đặt hàng theo trường hợp và được trao đổi trường hợp để buộc chữ thường sắp xếp trước; Đây là định nghĩa điển hình của "loại tự nhiên". Tôi không thể quyết định loại nhóm; Một số có thể thích những điều sau đây, điều này cũng mang lại lợi ích hiệu suất đáng kể:

d = lambda s: s.lower()+s.swapcase()

Khi được sử dụng, các toán tử so sánh được đặt thành

>>> L = ['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
>>> L.sort()
['image1.jpg', 'image12.jpg', 'image15.jpg', 'image3.jpg']

4 nên chúng sẽ không bị bỏ qua bởi

>>> L = ['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
>>> L.sort()
['image1.jpg', 'image12.jpg', 'image15.jpg', 'image3.jpg']

5.

import functools
import itertools


@functools.total_ordering
class NaturalStringA(str):
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , super().__repr__()
            )
    d = lambda c, s: [ c.NaturalStringPart("".join(v))
                        for k,v in
                       itertools.groupby(s, c.isdigit)
                     ]
    d = classmethod(d)
    @functools.total_ordering
    class NaturalStringPart(str):
        d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
        d = staticmethod(d)
        def __lt__(self, other):
            if not isinstance(self, type(other)):
                return NotImplemented
            try:
                return int(self) < int(other)
            except ValueError:
                if self.isdigit():
                    return True
                elif other.isdigit():
                    return False
                else:
                    return self.d(self) < self.d(other)
        def __eq__(self, other):
            if not isinstance(self, type(other)):
                return NotImplemented
            try:
                return int(self) == int(other)
            except ValueError:
                if self.isdigit() or other.isdigit():
                    return False
                else:
                    return self.d(self) == self.d(other)
        __le__ = object.__le__
        __ne__ = object.__ne__
        __gt__ = object.__gt__
        __ge__ = object.__ge__
    def __lt__(self, other):
        return self.d(self) < self.d(other)
    def __eq__(self, other):
        return self.d(self) == self.d(other)
    __le__ = object.__le__
    __ne__ = object.__ne__
    __gt__ = object.__gt__
    __ge__ = object.__ge__

import functools
import itertools


@functools.total_ordering
class NaturalStringB(str):
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , super().__repr__()
            )
    d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
    d = staticmethod(d)
    def __lt__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
        zipped = itertools.zip_longest(*groups)
        for s,o in zipped:
            if s is None:
                return True
            if o is None:
                return False
            s_k, s_v = s[0], "".join(s[1])
            o_k, o_v = o[0], "".join(o[1])
            if s_k and o_k:
                s_v, o_v = int(s_v), int(o_v)
                if s_v == o_v:
                    continue
                return s_v < o_v
            elif s_k:
                return True
            elif o_k:
                return False
            else:
                s_v, o_v = self.d(s_v), self.d(o_v)
                if s_v == o_v:
                    continue
                return s_v < o_v
        return False
    def __eq__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
        zipped = itertools.zip_longest(*groups)
        for s,o in zipped:
            if s is None or o is None:
                return False
            s_k, s_v = s[0], "".join(s[1])
            o_k, o_v = o[0], "".join(o[1])
            if s_k and o_k:
                s_v, o_v = int(s_v), int(o_v)
                if s_v == o_v:
                    continue
                return False
            elif s_k or o_k:
                return False
            else:
                s_v, o_v = self.d(s_v), self.d(o_v)
                if s_v == o_v:
                    continue
                return False
        return True
    __le__ = object.__le__
    __ne__ = object.__ne__
    __gt__ = object.__gt__
    __ge__ = object.__ge__

import functools
import itertools
import enum


class OrderingType(enum.Enum):
    PerWordSwapCase         = lambda s: s.lower()+s.swapcase()
    PerCharacterSwapCase    = lambda s: "".join(c.lower()+c.swapcase() for c in s)


class NaturalOrdering:
    @classmethod
    def by(cls, ordering):
        def wrapper(string):
            return cls(string, ordering)
        return wrapper
    def __init__(self, string, ordering=OrderingType.PerCharacterSwapCase):
        self.string = string
        self.groups = [ (k,int("".join(v)))
                            if k else
                        (k,ordering("".join(v)))
                            for k,v in
                        itertools.groupby(string, str.isdigit)
                      ]
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , self.string
            )
    def __lesser(self, other, default):
        if not isinstance(self, type(other)):
            return NotImplemented
        for s,o in itertools.zip_longest(self.groups, other.groups):
            if s is None:
                return True
            if o is None:
                return False
            s_k, s_v = s
            o_k, o_v = o
            if s_k and o_k:
                if s_v == o_v:
                    continue
                return s_v < o_v
            elif s_k:
                return True
            elif o_k:
                return False
            else:
                if s_v == o_v:
                    continue
                return s_v < o_v
        return default
    def __lt__(self, other):
        return self.__lesser(other, default=False)
    def __le__(self, other):
        return self.__lesser(other, default=True)
    def __eq__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        for s,o in itertools.zip_longest(self.groups, other.groups):
            if s is None or o is None:
                return False
            s_k, s_v = s
            o_k, o_v = o
            if s_k and o_k:
                if s_v == o_v:
                    continue
                return False
            elif s_k or o_k:
                return False
            else:
                if s_v == o_v:
                    continue
                return False
        return True
    # functools.total_ordering doesn't create single-call wrappers if both
    # __le__ and __lt__ exist, so do it manually.
    def __gt__(self, other):
        op_result = self.__le__(other)
        if op_result is NotImplemented:
            return op_result
        return not op_result
    def __ge__(self, other):
        op_result = self.__lt__(other)
        if op_result is NotImplemented:
            return op_result
        return not op_result
    # __ne__ is the only implied ordering relationship, it automatically
    # delegates to __eq__

Phân loại tự nhiên là cả khá phức tạp vừa được định nghĩa là một vấn đề. Đừng quên chạy

>>> L = ['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
>>> L.sort()
['image1.jpg', 'image12.jpg', 'image15.jpg', 'image3.jpg']

6 trước và xem xét sử dụng

>>> L = ['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
>>> L.sort()
['image1.jpg', 'image12.jpg', 'image15.jpg', 'image3.jpg']

7 thay vì

>>> L = ['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
>>> L.sort()
['image1.jpg', 'image12.jpg', 'image15.jpg', 'image3.jpg']

8. Có lẽ có những vấn đề mã hóa tinh tế mà tôi chưa xem xét. Vì vậy, tôi dự kiến giới thiệu Thư viện Natsort. Tôi liếc nhanh vào kho lưu trữ GitHub; Việc bảo trì mã đã được xuất sắc.

Tất cả các thuật toán tôi đã thấy phụ thuộc vào các thủ thuật như sao chép và hạ thấp ký tự, và trường hợp hoán đổi. Mặc dù điều này tăng gấp đôi thời gian chạy, một giải pháp thay thế sẽ yêu cầu một thứ tự tự nhiên hoàn toàn trên bộ ký tự đầu vào. Tôi không nghĩ rằng đây là một phần của đặc điểm kỹ thuật Unicode và vì có nhiều chữ số Unicode hơn

>>> L = ['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
>>> L.sort()
['image1.jpg', 'image12.jpg', 'image15.jpg', 'image3.jpg']

9, việc tạo ra một sự sắp xếp như vậy sẽ cũng đáng ngại như nhau. Nếu bạn muốn so sánh nhận biết địa phương, hãy chuẩn bị các chuỗi của bạn với

def natural_key(string_):
  return [int(s) if s.isdigit() else s for s in re.split(r'(\d+)', string_)]

>>> L = ['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
>>> L.sort(key=natural_key)
['image1.jpg', 'image3.jpg', 'image12.jpg', 'image5.jpg']

0 mỗi Python sắp xếp cách làm thế nào.

programming python Natsort Natsort key Natsort Pandas Conda install natsort