Extracting Words from a Text File in Python

A simple Python script that extracts the most common words from a text corpus without relying on heavy text-processing libraries

Also note that, in this example, the full path to the Python interpreter is specified. Depending on your setup, you may not need to be this explicit. On some systems, however, both Python 2 and Python 3 may be installed, and the "default" interpreter that runs when no path is given may not be the correct one. It is also assumed that Sample Data.txt is in the same directory as Extractor.py. Because this file name contains a space, it must be enclosed in quotation marks to be recognized as a single parameter. This applies to both Windows and Linux systems.
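To see why the quotation marks matter, here is a minimal sketch that uses the standard library's shlex.split to mimic POSIX-style shell word splitting. The command strings and the interpreter name are illustrative assumptions, and the Windows Command Prompt has its own parsing rules, although double quotes group an argument containing spaces there as well.

```python
import shlex

# Hypothetical command lines; the interpreter name is an assumption.
quoted = 'python3 Extractor.py "Sample Data.txt"'
unquoted = 'python3 Extractor.py Sample Data.txt'

quoted_args = shlex.split(quoted)
unquoted_args = shlex.split(unquoted)

print(quoted_args)    # the file name arrives as a single argument
print(unquoted_args)  # the file name is split into two arguments
```

In the unquoted case the script would receive "Sample" as its file name and fail to open it.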

Before continuing, make sure that every SSN from the sample file appears in the output. A common bug in implementations like this is forgetting to process the final record manually once the loop has ended.
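The last-record pitfall can be shown in isolation. The sketch below is not the article's script: group_by_key is a hypothetical helper and the sample data is synthetic. It only illustrates that a record is emitted when the key changes, so the final record needs an explicit flush after the loop.

```python
def group_by_key(lines, key_of):
    """Collect consecutive lines that share a key into (key, lines) records."""
    records = []
    previous_key = None
    current = []
    for line in lines:
        key = key_of(line)
        # A change of key means the previous record is complete.
        if previous_key is not None and key != previous_key:
            records.append((previous_key, current))
            current = []
        previous_key = key
        current.append(line)
    # Without this final flush, the last record would be silently dropped.
    if previous_key is not None:
        records.append((previous_key, current))
    return records

# Synthetic data: the key is the text before the colon.
sample = ["A:1", "A:2", "B:1", "C:1", "C:2"]
records = group_by_key(sample, lambda s: s.split(":")[0])
print(records)
```

Removing the flush after the loop would leave the "C" record out of the result, which is exactly the symptom to look for when an SSN is missing from the output.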

Extracting Text from a File with Python

Now that the SSNs are being parsed correctly, the remaining items can be extracted by adding the appropriate logic.

Full-Extractor.py

# Full-Extractor.py

# For command-line arguments
import sys

def main(argv):
    try:
        # Is there one command line param and is it a file?
        if len(sys.argv) < 2:
            raise IndexError("There must be a filename specified.")
        with open(sys.argv[1]) as input_file:
            # Create variables to hold each output record's information,
            # along with corresponding values to hold the previous record's
            # information.
            currentSSN = ""
            previousSSN = ""

            currentName = ""

            currentMonthlyAmount = ""

            currentYearlyAmount = ""

            # This time, we need to know if we are processing the first record.  If we don't keep track of this, the
            # first record will process incorrectly and each subsequent record will be wrong.
            firstRecord = True
            
            # Handle the file as an enumerable object, split by newlines 
            for x, line in enumerate(input_file):
                # Strip newlines from right (trailing newlines)
                currentLine = line.rstrip()
                
                # For this example, a single record is composed of 4 lines.
                # We need to make sure we get each piece of information
                # before we move on to the next record.

                # Python strings are 0-indexed, so the 13th character is
                # at position 12, and we must add the length of 11 to 12
                # to get position 23 to complete the substring function.
                currentSSN = currentLine[12:23]
                if (True == firstRecord):
                    previousSSN = currentSSN
                    firstRecord = False

                # For the first record, previousSSN would be blank and currentSSN would have a value, and this condition would be true.
                # We do not want this, so we need the logic above to set the values to be the same for the first record.
                if (previousSSN != currentSSN):
                    # We are at a new record, and hopefully the completed
                    # record's information is stored in the "previous"
                    # versions of all these variables.  Note that on the
                    # first iteration of this loop, the previous versions
                    # of these variables will all be blank.

                    # Also note the "Disconnect" between the previous and current notation.
                    if ("" != previousSSN):
                        print ("Found record with SSN ["+previousSSN+"], name ["+currentName+"], monthly amount [" + currentMonthlyAmount+
                               "] yearly amount [" + currentYearlyAmount + "]")

                    # Reset for the next record.  This logic needs to come before the remaining data extractions, or you will have
                    # "off by one" errors.
                    previousSSN = currentSSN

                    # Blank out the "current" versions of the variables (except the SSN!) so the conditions above will be true again.
                    currentName = ""
                    currentMonthlyAmount = ""
                    currentYearlyAmount = ""

                # Get the name if we do not already have it.  This condition prevents us from overwriting the name.  Note that if the
                # data was structured in a way that there was more than one piece of information at this position in the file, you would
                # need additional logic to determine what it is you are parsing out.  In this example, the simplistic logic of checking if
                # the first character is present and that a comma is in the substring is the "test".

                if currentName == "" and not currentLine[33:].startswith(' ') and ',' in currentLine[33:]:
                    # Also note that the name can go to the end of the line, so
                    # no ending position is included here.
                    currentName = currentLine[33:]

                # Follow the same logic for extracting the other information.  In this case, make sure the string contains only
                # numeric values.  In the case of the monthly amount, we only want to process lines that end in "2022" as these are
                # the only lines which contain this information.
                if ("" == currentMonthlyAmount) and (currentLine.endswith("2022")):
                    currentMonthlyAmount = currentLine[51:57]

                if ("" == currentYearlyAmount) and currentLine[57:62].isdigit():
                    currentYearlyAmount = currentLine[57:62]

            # Note that at the end of the loop, the last record's information
            # will be in the previous versions of the variables.  We need to
            # manually run this logic to get them.
            if ("" != previousSSN):
                print ("Last record with SSN ["+previousSSN+"], name [" + currentName +"], monthly amount [" + currentMonthlyAmount+
                       "] yearly amount [" + currentYearlyAmount + "]")
            #print(str(x+1)+" lines read.")
        return 0
    except IndexError as ex:
        print(str(ex))
    except FileNotFoundError as ex:
        print("The file ["+ sys.argv[1] + "] cannot be read.")
    return 1

# Call the "main" program.
if __name__ == "__main__":
    main(sys.argv[1:])

The placement of the logic that detects a record boundary, in this case the change from one SSN to the next, is critical: the other data would be "off by one" if that logic remained at the end of the loop. Equally important, for the first record the previous SSN must "match" the current one, so that this logic is not executed on the first iteration.
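For comparison, the same boundary detection can be sketched with itertools.groupby, which handles both the first and the last record without extra flags. This is a swapped-in alternative, not the approach used in the script above. It assumes that every line of a record carries the SSN in columns 12 to 22, as the script appears to assume, and make_line is a hypothetical helper that builds synthetic fixed-width lines.

```python
from itertools import groupby

def ssn_of(line):
    # Same fixed-width slice the script above uses for the SSN.
    return line[12:23]

def make_line(ssn, rest=""):
    # Hypothetical helper: SSN starts at index 12, other text at index 33.
    return " " * 12 + ssn + " " * 10 + rest

# Synthetic input: two records of two lines each.
lines = [
    make_line("111-11-1111", "Doe,John"),
    make_line("111-11-1111"),
    make_line("222-22-2222", "Roe,Jane"),
    make_line("222-22-2222"),
]

# groupby starts a new group whenever the key changes between
# consecutive lines, and it always yields the final group.
grouped = [
    (ssn, [line[33:] for line in group])
    for ssn, group in groupby(lines, key=ssn_of)
]
print(grouped)
```

The explicit loop in the script remains easier to extend with per-field conditions. The groupby form is mainly useful for checking that the record boundaries come out where expected.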

Running this code produces the following output

Note the highlighted record, which has no name or amounts associated with it. This "error" stems from data that is missing in the original text file, and it is being displayed correctly. The extraction process must not add or remove data. Instead, it must represent the data exactly as it exists in the original source, or at least indicate that there is some kind of error in the data. That way, users can go back to the original data source in the ERP to find out why this information is missing.

Exporting Data to a CSV File with Python

Now that the data can be extracted programmatically, it is time to write it out in a friendly format. In this case, that will be a simple CSV file. First, we need a CSV file to write to; here it has the same name as the input file, with the extension changed to ".csv". The full code, including the CSV output, is below.
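The renaming step described above can also be sketched with pathlib. This is an alternative to splitting the name on "." by hand, not what the article's script does, and csv_name_for is a hypothetical helper name.

```python
from pathlib import Path

def csv_name_for(input_name):
    """Return the input file name with its extension replaced by .csv."""
    return str(Path(input_name).with_suffix(".csv"))

print(csv_name_for("Sample Data.txt"))
print(csv_name_for("archive.2022.txt"))
# A name without any extension simply gains one.
print(csv_name_for("Sample Data"))
```

One practical difference: splitting on "." and replacing the last part turns an input name with no extension, such as "Sample Data", into just "csv", whereas with_suffix appends ".csv" to it.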

Full-Extractor-Export.py

 
# Full-Extractor-Export.py

# For command-line arguments
import sys

def main(argv):
    try:
        # Is there one command line param and is it a file?
        if len(sys.argv) < 2:
            raise IndexError("There must be a filename specified.")

        fileNameParts = sys.argv[1].split(".")
        fileNameParts[-1] = "csv"
        outputFileName = ".".join(fileNameParts)
        outputLines = ""
        with open(sys.argv[1]) as input_file:
            # Create variables to hold each output record's information,
            # along with corresponding values to hold the previous record's
            # information.
            currentSSN = ""
            previousSSN = ""

            currentName = ""

            currentMonthlyAmount = ""

            currentYearlyAmount = ""

            # This time, we need to know if we are processing the first record.  If we don't keep track of this, the
            # first record will process incorrectly and each subsequent record will be wrong.
            firstRecord = True
            
            # Handle the file as an enumerable object, split by newlines 
            for x, line in enumerate(input_file):
                # Strip newlines from right (trailing newlines)
                currentLine = line.rstrip()
                
                # For this example, a single record is composed of 4 lines.
                # We need to make sure we get each piece of information
                # before we move on to the next record.

                # Python strings are 0-indexed, so the 13th character is
                # at position 12, and we must add the length of 11 to 12
                # to get position 23 to complete the substring function.
                currentSSN = currentLine[12:23]
                if (True == firstRecord):
                    previousSSN = currentSSN
                    firstRecord = False

                # For the first record, previousSSN would be blank and currentSSN would have a value, and this condition would be true.
                # We do not want this, so we need the logic above to set the values to be the same for the first record.
                if (previousSSN != currentSSN):
                    # We are at a new record, and hopefully the completed
                    # record's information is stored in the "previous"
                    # versions of all these variables.  Note that on the
                    # first iteration of this loop, the previous versions
                    # of these variables will all be blank.

                    # Also note the "Disconnect" between the previous and current notation.
                    if ("" != previousSSN):
                        print ("Found record with SSN ["+previousSSN+"], name ["+currentName+"], monthly amount [" + currentMonthlyAmount+
                               "] yearly amount [" + currentYearlyAmount + "]")
                        # Because CSV is a trivial format to write, plain string processing can be used.  Note that we also need to split
                        # the name into the first and last names.
                        nameParts = currentName.split(",")
                        # This is trivial error checking.  Ideally a more robust error response system should be here.
                        firstName = "Error"
                        lastName = "Error"
                        if (2 == len(nameParts)):
                            firstName = nameParts[1]
                            lastName = nameParts[0]
                        # Should there be any quotation marks in these strings, they need to be escaped in the CSV file by using
                        # double quotation marks.  Strings in CSV files should always be delimited with quotation marks.
                        outputLines += ("\"" + previousSSN.replace("\"", "\"\"") + "\",\"" + lastName.replace("\"", "\"\"") + "\",\"" +
                            firstName.replace("\"", "\"\"") + "\",\"" + currentMonthlyAmount.replace("\"", "\"\"") + "\",\"" +
                            currentYearlyAmount.replace("\"", "\"\"") +"\"\r\n")
                    # Reset for the next record.  This logic needs to come before the remaining data extractions, or you will have
                    # "off by one" errors.
                    previousSSN = currentSSN

                    # Blank out the "current" versions of the variables (except the SSN!) so the conditions above will be true again.
                    currentName = ""
                    currentMonthlyAmount = ""
                    currentYearlyAmount = ""

                # Get the name if we do not already have it.  This condition prevents us from overwriting the name.  Note that if the
                # data was structured in a way that there was more than one piece of information at this position in the file, you would
                # need additional logic to determine what it is you are parsing out.  In this example, the simplistic logic of checking if
                # the first character is present and that a comma is in the substring is the "test".

                if currentName == "" and not currentLine[33:].startswith(' ') and ',' in currentLine[33:]:
                    # Also note that the name can go to the end of the line, so
                    # no ending position is included here.
                    currentName = currentLine[33:]

                # Follow the same logic for extracting the other information.  In this case, make sure the string contains only
                # numeric values.  In the case of the monthly amount, we only want to process lines that end in "2022" as these are
                # the only lines which contain this information.
                if ("" == currentMonthlyAmount) and (currentLine.endswith("2022")):
                    currentMonthlyAmount = currentLine[51:57]

                if ("" == currentYearlyAmount) and currentLine[57:62].isdigit():
                    currentYearlyAmount = currentLine[57:62]

            # Note that at the end of the loop, the last record's information
            # will be in the previous versions of the variables.  We need to
            # manually run this logic to get them.
            if ("" != previousSSN):
                print ("Last record with SSN ["+previousSSN+"], name [" + currentName +"], monthly amount [" + currentMonthlyAmount+
                       "] yearly amount [" + currentYearlyAmount + "]")
                nameParts = currentName.split(",")
                firstName = "Error"
                lastName = "Error"
                if (2 == len(nameParts)):
                    firstName = nameParts[1]
                    lastName = nameParts[0]
                # Write the last record even when its name is missing, matching the handling inside the loop.
                outputLines += ("\"" + previousSSN.replace("\"", "\"\"") + "\",\"" + lastName.replace("\"", "\"\"") + "\",\"" +
                    firstName.replace("\"", "\"\"") + "\",\"" + currentMonthlyAmount.replace("\"", "\"\"") + "\",\"" +
                    currentYearlyAmount.replace("\"", "\"\"") +"\"\r\n")
            # As the string already contains newlines, make sure to blank these out.
            outputFile = open(outputFileName, "w", newline="")
            outputFile.write(outputLines)
            outputFile.close()
            print ("Wrote to [" + outputFileName + "]")
            #print(str(x+1)+" lines read.")
        return 0
    except IndexError as ex:
        print(str(ex))
    except FileNotFoundError as ex:
        print("The file ["+ sys.argv[1] + "] cannot be read.")
    return 1

# Call the "main" program.
if __name__ == "__main__":
    main(sys.argv[1:])

For maximum compatibility, every data element in a CSV file should be enclosed in double quotation marks, with any double quotation marks inside those strings escaped by doubling them. The code above reflects this.
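The same quoting rules are available in Python's standard csv module, which may be worth considering once the output grows beyond a handful of columns. The sketch below is an alternative to the manual string building above, and the row is synthetic data invented for illustration. It writes to an in-memory buffer so that the result can be inspected directly.

```python
import csv
import io

# Synthetic record; the embedded quotation marks exercise the escaping rule.
rows = [
    ("123-45-6789", "Doe", 'John "JD"', "001000", "12000"),
]

buffer = io.StringIO(newline="")
# QUOTE_ALL wraps every field in double quotes; embedded quotes are doubled.
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
writer.writerows(rows)

text = buffer.getvalue()
print(text)
```

When writing to a real file, opening it with newline="" as the script above does keeps the "\r\n" line endings from being translated a second time.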

Running the code produces this output

The command above uses the caret (^) convention of the Windows Command Prompt to split the command across lines for readability. It can be entered on a single line if you prefer.

The output as shown in Notepad++

Of course, the whole point of this exercise is to view the output in Excel, so now open the file there.

Now that the data is in Excel, all of the tools that the application brings to the table can be applied to it. Any errors in the ERP-generated data can now be investigated and corrected properly, without the worry of sending bad data out before it can be verified.

Note that the "Error" values in the cells above are intentional, as they result from the original data being blank. At a high level, this helps a reviewer quickly recognize that there is a problem.

Another approach would be to write a more informative error message into a neighboring cell in the same row.
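One way this could look is sketched below. The field names and the describe_problems helper are hypothetical and not part of the article's script. The idea is simply to build a note that names each empty field, which can then be written as an extra column in the same CSV row.

```python
def describe_problems(record):
    """Return a note naming every empty field, or '' when nothing is missing."""
    missing = [name for name, value in record.items() if not value.strip()]
    if not missing:
        return ""
    return "Missing in source: " + ", ".join(missing)

# Synthetic records with hypothetical field names.
complete = {"ssn": "111-11-1111", "name": "Doe,John",
            "monthly": "001000", "yearly": "12000"}
incomplete = {"ssn": "222-22-2222", "name": "",
              "monthly": "", "yearly": ""}

complete_note = describe_problems(complete)
incomplete_note = describe_problems(incomplete)
print(repr(complete_note))
print(incomplete_note)
```

The note only reports what is absent. It does not guess at replacement values, which keeps the export faithful to the source data.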

Final Thoughts on Text Scraping in Python

The beauty of this solution is that, although the code looks complex, especially next to a regular expression that would be even more complex, it can be reused and adapted more easily for files with a similar structure. This kind of solution can become an indispensable tool in the arsenal of anyone tasked with verifying this kind of information from a "black box" data source such as an ERP, and beyond.