How to read a PDF file in Python using pandas

Question

Is it possible to open PDF files and read them in using Python pandas, or do I have to use the pandas clipboard for this?

Asked Apr 25, 2014 at 5:24

There is a new version of tabula called tabula-py:

pip install tabula-py

The read_pdf method works the same as in the old version; the documentation is here: https://pypi.org/project/tabula-py/

Answered Apr 29, 2019 at 12:23 by Mark

If this is a one-off, you can copy the data from your PDF table into a text file, format it (using search-and-replace, Notepad++ macros, or a script), save it as a CSV file, and load it into pandas.
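
As a rough sketch of the formatting step, assuming the pasted text has one table row per line with columns separated by two or more spaces (the sample rows and the helper name text_to_csv are made up for illustration), a short script could do the search-and-replace and produce CSV text:

```python
import csv
import io
import re

def text_to_csv(raw_text):
    """Convert lines whose columns are separated by 2+ spaces into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from copy/paste
        # Two or more whitespace characters mark a column boundary,
        # so single spaces inside a cell ("Blue pen") are preserved.
        writer.writerow(re.split(r"\s{2,}", line))
    return out.getvalue()

# Hypothetical rows copied from a PDF table
pasted = """
Name        Qty   Price
Blue pen    10    1.50
Notebook    3     4.25
"""
csv_text = text_to_csv(pasted)
print(csv_text)
```

The resulting text can be saved to a .csv file and then loaded with pandas.read_csv.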

If you need to do this in a scalable way, you might try this product: http://tabula.technology/. I have not used it, so I don't know how well it works, but you can explore it if you need to.

Answered Jan 27, 2016 at 4:58 by Matija Han

I have done some tests with Camelot (https://camelot-py.readthedocs.io/en/master/), and it works very well in many situations. You can also try adjusting some parameters if the default ones don't work.

It is similar to Tabula, but it uses different algorithms (Tabula uses the vector data in the PDF and rasterizes the lines of the table; Camelot uses the Hough transform), so you can try both to find the best one.

Both have a web version, so you can try them on some examples to decide which is best for your application.

Answered Jan 16, 2019 at 8:59 by joselquin

This is not possible. PDF is a data format for printing. The table structure is lost. With some luck, you can extract the text with pypdf and guess the former table columns.
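
One way to "guess the former table columns" can be sketched as follows, assuming the extracted text keeps its fixed-width layout so that some character positions are blank on every line (the sample data and the helper names guess_column_spans and split_by_spans are invented for illustration; real extracted text is often messier):

```python
def guess_column_spans(lines):
    """Return (start, end) character spans that hold text in at least one line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A position is a gap only if every line has a space there
    is_gap = [all(row[i] == " " for row in padded) for i in range(width)]
    spans, start = [], None
    for i, gap in enumerate(is_gap):
        if not gap and start is None:
            start = i                  # a column begins
        elif gap and start is not None:
            spans.append((start, i))   # a column ends at the gap
            start = None
    if start is not None:
        spans.append((start, width))
    return spans

def split_by_spans(lines, spans):
    """Cut every line at the guessed spans and trim the cells."""
    return [[line[a:b].strip() for a, b in spans] for line in lines]

# Simulated fixed-width text, as a PDF text extractor might return it
rows = [("City", "Year", "Count"), ("Hanoi", "2014", "12"), ("Da Nang", "2015", "7")]
lines = ["{:<10}{:<6}{:>5}".format(*r) for r in rows]

spans = guess_column_spans(lines)
table = split_by_spans(lines, spans)
print(table)
```

Because a gap must be blank on every line, a cell such as "Da Nang" stays in one column as long as another row has text at the same position.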

Answered Apr 25, 2014 at 6:27 by Daniel

Copy the table data from the PDF and paste it into an Excel file (it usually gets pasted as a single column rather than multiple columns). Then use Flash Fill (available in Excel 2016; not sure about earlier Excel versions) to separate the data into the columns originally shown in the PDF. The process is fast and easy. Then use pandas to wrangle the Excel data.

Answered Dec 14, 2016 at 1:49

I use the tabula library. To install it, run:

pip install tabula-py

To read the tables inside a PDF from a link, for example:

import tabula
df = tabula.io.read_pdf(url, pages='all')

You will then get several tables, and you can access each one by index, just like getting an element from a list, for example:

# ex
df[0]
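
A possible way to work with the returned list, sketched under the assumption that read_pdf returns one DataFrame per detected table (the helper names read_tables and non_empty are made up here, and plain lists stand in for DataFrames in the demo so that it runs without tabula-py or Java):

```python
def read_tables(path_or_url):
    """Read every table from a PDF; needs tabula-py and a Java runtime."""
    import tabula  # imported lazily so the rest of the sketch runs without it
    return tabula.read_pdf(path_or_url, pages="all")

def non_empty(tables):
    """Keep only tables that have at least one row."""
    return [t for t in tables if len(t) > 0]

# Stand-in data: each inner list plays the role of one extracted table
tables = [[("a", 1), ("b", 2)], [], [("c", 3)]]
kept = non_empty(tables)
print(len(kept))
```

With real data, the same filter could be applied to the result of read_tables before indexing into it.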

More info here: https://pypi.org/project/tabula-py/

Answered Sep 22, 2021 at 10:13

All of you must be familiar with what PDFs are. In fact, they are one of the most important and widely used digital media. PDF stands for Portable Document Format. It uses the .pdf extension. It is used to present and exchange documents reliably, independent of software, hardware, or operating system.
Invented by Adobe, PDF is now an open standard maintained by the International Organization for Standardization (ISO). PDFs can contain links and buttons, form fields, audio, video, and business logic.
In this article, we will learn how we can do various operations like:

Extracting text from PDF
Rotating PDF pages
Merging PDFs
Splitting PDF
Adding watermark to PDF pages

using simple Python scripts!

Installation

We will be using a third-party module, PyPDF2.
PyPDF2 is a Python library built as a PDF toolkit. It is capable of:

Extracting document information (title, author, ...)
Splitting documents page by page
Merging documents page by page
Cropping pages
Merging multiple pages into a single page
Encrypting and decrypting PDF files
and more!

To install PyPDF2, run the following command from the command line:

 pip3 install PyPDF2

The name of this module is case-sensitive, so make sure the y is lowercase and everything else is uppercase. All the code and PDF files used in this tutorial/article are available here.

1. Extracting text from PDF file

# Note: this is the older PyPDF2 API; PyPDF2 3.0 renamed these calls
# (PdfReader, len(reader.pages), reader.pages[0], extract_text()).

# importing required modules
import PyPDF2

# creating a pdf file object
pdfFileObj = open('example.pdf', 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# printing number of pages in pdf file
print(pdfReader.numPages)

# creating a page object
pageObj = pdfReader.getPage(0)

# extracting text from page
print(pageObj.extractText())

# closing the pdf file object
pdfFileObj.close()

The output of the above program looks like this:

20
PythonBasics
S.R.Doty
August27,2008
Contents

1Preliminaries
4
1.1WhatisPython?...................................
..4
1.2Installationanddocumentation....................
.........4 [and some more lines...]

Let us try to understand the above code in chunks:

pdfFileObj = open('example.pdf', 'rb')

We opened example.pdf in binary mode and saved the file object as pdfFileObj.

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

Here, we create an object of the PdfFileReader class of the PyPDF2 module, passing it the PDF file object, and get a PDF reader object.

print(pdfReader.numPages)

The numPages property gives the number of pages in the PDF file. For example, in our case, it is 20 (see the first line of the output).

pageObj = pdfReader.getPage(0)

Now, we create an object of the PageObject class of the PyPDF2 module. The PDF reader object has a getPage() function, which takes the page number (starting from index 0) as its argument and returns the page object.

print(pageObj.extractText())

The page object has an extractText() function to extract text from the PDF page.

pdfFileObj.close()

Finally, we close the PDF file object.

Note: While PDF files are great for laying out text in a way that is easy for people to print and read, they are not straightforward for software to parse into plain text. As such, PyPDF2 might make mistakes when extracting text from a PDF and may even be unable to open some PDFs at all. Unfortunately, there isn't much you can do about this: PyPDF2 may simply be unable to work with some of your particular PDF files.
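
One inexpensive check before handing a file to a parser, sketched here under the assumption that a well-formed PDF begins with the bytes %PDF- (the helper name looks_like_pdf is hypothetical and is not part of PyPDF2), is to read the first few bytes in binary mode:

```python
import os
import tempfile

def looks_like_pdf(path):
    """Return True if the file starts with the '%PDF-' marker."""
    with open(path, "rb") as f:  # binary mode; 'with' closes the file for us
        return f.read(5) == b"%PDF-"

# Demo with two throwaway files (contents are made up)
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.4\n% rest of file omitted\n")
    with open(bad, "wb") as f:
        f.write(b"<html>not a pdf</html>")
    results = (looks_like_pdf(good), looks_like_pdf(bad))

print(results)
```

This only filters out files that are clearly not PDFs, such as a saved HTML error page. A file that passes the check may still fail to open or may yield poor text.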

2. Rotating PDF pages

This is how rotated_example.pdf looks (right image) after rotation:

For writing to a new PDF, we use an object of the PdfFileWriter class of the PyPDF2 module.

We get each page object with the getPage() method of the PDF reader class. Next, we rotate the page with the rotateClockwise() method of the page object class. Then, we add the page to the PDF writer object using the addPage() method of the PDF writer class, passing it the rotated page object.
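
As a small illustration of what the rotation step amounts to, assuming a page's rotation is tracked as a clockwise angle that must be a multiple of 90 degrees (this is plain arithmetic, not PyPDF2 code, and new_rotation is a made-up name):

```python
def new_rotation(current, clockwise_degrees):
    """Return the page rotation after turning it further clockwise."""
    if clockwise_degrees % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    # Angles wrap around: 270 + 180 is the same orientation as 90
    return (current + clockwise_degrees) % 360

print(new_rotation(0, 270))
print(new_rotation(270, 180))
```

Rotating an upright page by 270 degrees clockwise has the same effect as rotating it by 90 degrees counterclockwise.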

Now, we have to write the PDF pages to a new PDF file. First, we open the new file object and write the PDF pages to it using the write() method of the PDF writer object. Finally, we close the original PDF file object and the new file object.

3. Merging PDF files

The output of the above program is a combined PDF, combined_example.pdf, obtained by merging example.pdf and rotated_example.pdf.

Let us have a look at the important aspects of this program:

For merging, we use the pre-built class PdfFileMerger of the PyPDF2 module.
Here, we create an object pdfMerger of the PDF merger class.

Now, we append the file object of each PDF to the PDF merger object using the append() method.

Finally, we write the PDF pages to an output PDF file using the write method of the PDF merger object.
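
The same append-then-write flow can be sketched with the newer pypdf package instead of PyPDF2's PdfFileMerger. This assumes a recent pypdf release in which PdfWriter offers append(); the function name merge_pdfs is made up, and pypdf is imported lazily so that only an actual merge requires it:

```python
def merge_pdfs(input_paths, output_path):
    """Append every page of each input PDF, in order, to one output PDF."""
    if not input_paths:
        raise ValueError("no input PDFs given")
    from pypdf import PdfWriter  # third-party: pip install pypdf

    writer = PdfWriter()
    for path in input_paths:
        writer.append(path)  # adds all pages of this file
    with open(output_path, "wb") as out:
        writer.write(out)

# Intended usage with the files from the article (not run here):
# merge_pdfs(["example.pdf", "rotated_example.pdf"], "combined_example.pdf")
```

The order of the input list determines the page order of the combined file.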

4. Splitting PDF file

The output is three new PDF files: split 1 (pages 0 and 1), split 2 (pages 2 and 3), and split 3 (page 4 to the end).
No new function or class has been used in the above Python program. Using simple logic and iterations, we created the splits of the passed PDF according to the passed list splits.
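
The "simple logic" can be sketched as follows, assuming the splits list holds the page indices at which a new part begins, for example [2, 4]. The exact values and the name split_ranges are assumptions, since the original program is not reproduced here:

```python
def split_ranges(num_pages, splits):
    """Turn split points into (start, end) page ranges; end is exclusive."""
    ranges = []
    start = 0
    # The total page count acts as the final split point
    for end in list(splits) + [num_pages]:
        ranges.append((start, end))
        start = end
    return ranges

# A 6-page document split at pages 2 and 4
print(split_ranges(6, [2, 4]))
```

Each range would then be copied page by page into its own writer object and saved as a separate file.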

5. Adding watermark to PDF pages

Each page object is passed through the add_watermark() function. Let us try to understand the add_watermark() function:

First, we create a PDF reader object for watermark.pdf. To the passed page object, we apply the mergePage() function, passing it the page object of the first page of the watermark PDF reader object. This overlays the watermark on the passed page object.

Now, you can easily create your own PDF manager!

This article is contributed by Nikhil Kumar.

How can I read a PDF in pandas?

Extracting data from PDF to pandas:

Step 1: Convert the PDF to a text file.

Step 2: Group the text into logical blocks.

Step 3: Use regular expressions.

Step 4: Convert the list into a pandas DataFrame.
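
Steps 2 to 4 can be sketched roughly as follows, assuming the text has already been extracted and that each data row looks like "item quantity price" (the sample text, the pattern, and the name parse_rows are invented for illustration):

```python
import re

# One data row: free-text item, integer quantity, price with two decimals
ROW = re.compile(r"^(?P<item>.+?)\s+(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})$")

def parse_rows(text):
    """Collect the lines that match the row pattern as a list of dicts."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:  # headers, totals, and other lines are skipped
            rows.append({
                "item": m["item"],
                "qty": int(m["qty"]),
                "price": float(m["price"]),
            })
    return rows

# Hypothetical text extracted from a PDF
text = """Invoice 2021-09
Blue pen 10 1.50
Notebook 3 4.25
Total 2 lines
"""
rows = parse_rows(text)
print(rows)
```

A list of dictionaries like this can be passed directly to pandas.DataFrame(rows) to complete step 4.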

Can Python read PDF files?

It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs as well as merge entire files together. pdfrw is a Python library and utility that reads and writes PDF files: version 0.4 is tested and works on Python 2.6, 2.7, 3.3, 3.4, 3.5, and 3.6.
