Hướng dẫn cosine similarity between two sentences python sklearn - cosine tương tự giữa hai câu python sklearn

Question

Xem thảo luận

Nội dung chính Show

Tính toán tương đồng theo cặp
Diễn giải kết quả
Làm thế nào để bạn tìm thấy sự tương đồng cosine giữa hai câu trong Python?
Làm thế nào để bạn tìm thấy sự tương đồng cosine với Sklearn?
Làm thế nào để bạn tìm thấy sự tương đồng giữa hai câu?
Làm thế nào để bạn tìm thấy sự tương đồng giữa hai văn bản trong Python?

Cải thiện bài viết

Lưu bài viết

Đọc

Bàn luận

Xem thảo luận

Cải thiện bài viết

Lưu bài viết

Đọc is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.
Similarity = (A.B) / (||A||.||B||) where A and B are vectors.

Bàn luận

1. Open terminal(Linux).
2. sudo pip3 install nltk
3. python3
4. import nltk
5. nltk.download(‘all’)

Độ tương tự cosine là thước đo sự tương đồng giữa hai vectơ khác không của không gian sản phẩm bên trong đo cosin của góc giữa chúng. B là vectơ.

Tương tự cosine và mô -đun công cụ NLTK được sử dụng trong chương trình này. Để thực hiện chương trình này, NLTK phải được cài đặt trong hệ thống của bạn. Để cài đặt mô -đun NLTK, hãy làm theo các bước bên dưới - It is used for tokenization. Tokenization is the process by which big quantity of text is divided into smaller parts called tokens. word_tokenize(X) split the given sentence X into words and return list.
Các chức năng được sử dụng: In this program, it is used to get a list of stopwords. A stop word is a commonly used word (such as “the”, “a”, “an”, “in”).

nltk.tokenize: Nó được sử dụng để mã hóa. Mã thông báo là quá trình mà số lượng lớn văn bản được chia thành các phần nhỏ hơn được gọi là mã thông báo. word_tokenize(X) Chia câu X đã cho thành các từ và danh sách trả về.

nltk.corpus: Trong chương trình này, nó được sử dụng để có được một danh sách các từ dừng. Một từ dừng là một từ thường được sử dụng (chẳng hạn như là The The The, một, A A, một, một trong những người khác).

Dưới đây là triển khai Python -

similarity:  0.2886751345948129

7

similarity:  0.2886751345948129

8

similarity:  0.2886751345948129

9

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

0

similarity:  0.2886751345948129

8

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

2

from

similarity:  0.2886751345948129

0

similarity:  0.2886751345948129

1

similarity:  0.2886751345948129

2

from

similarity:  0.2886751345948129

4

similarity:  0.2886751345948129

1

similarity:  0.2886751345948129

6

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

3

similarity:  0.2886751345948129

8

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

5

>>> corpus = ["I'd like an apple", 
...           "An apple a day keeps the doctor away", 
...           "Never compare an apple to an orange", 
...           "I prefer scikit-learn to Orange", 
...           "The scikit-learn docs are Orange and Blue"]                                                                                                                                                                                                   
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")                                                                                                                                                                                                   
>>> tfidf = vect.fit_transform(corpus)                                                                                                                                                                                                                       
>>> pairwise_similarity = tfidf * tfidf.T

4

similarity:  0.2886751345948129

8

>>> corpus = ["I'd like an apple", 
...           "An apple a day keeps the doctor away", 
...           "Never compare an apple to an orange", 
...           "I prefer scikit-learn to Orange", 
...           "The scikit-learn docs are Orange and Blue"]                                                                                                                                                                                                   
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")                                                                                                                                                                                                   
>>> tfidf = vect.fit_transform(corpus)                                                                                                                                                                                                                       
>>> pairwise_similarity = tfidf * tfidf.T

6

similarity:  0.2886751345948129

8

>>> corpus = ["I'd like an apple", 
...           "An apple a day keeps the doctor away", 
...           "Never compare an apple to an orange", 
...           "I prefer scikit-learn to Orange", 
...           "The scikit-learn docs are Orange and Blue"]                                                                                                                                                                                                   
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")                                                                                                                                                                                                   
>>> tfidf = vect.fit_transform(corpus)                                                                                                                                                                                                                       
>>> pairwise_similarity = tfidf * tfidf.T

8

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

6

similarity:  0.2886751345948129

8

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

8

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

9

similarity:  0.2886751345948129

8

>>> corpus = ["I'd like an apple", 
...           "An apple a day keeps the doctor away", 
...           "Never compare an apple to an orange", 
...           "I prefer scikit-learn to Orange", 
...           "The scikit-learn docs are Orange and Blue"]                                                                                                                                                                                                   
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")                                                                                                                                                                                                   
>>> tfidf = vect.fit_transform(corpus)                                                                                                                                                                                                                       
>>> pairwise_similarity = tfidf * tfidf.T

1

>>> corpus = ["I'd like an apple", 
...           "An apple a day keeps the doctor away", 
...           "Never compare an apple to an orange", 
...           "I prefer scikit-learn to Orange", 
...           "The scikit-learn docs are Orange and Blue"]                                                                                                                                                                                                   
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")                                                                                                                                                                                                   
>>> tfidf = vect.fit_transform(corpus)                                                                                                                                                                                                                       
>>> pairwise_similarity = tfidf * tfidf.T

2

>>> corpus = ["I'd like an apple", 
...           "An apple a day keeps the doctor away", 
...           "Never compare an apple to an orange", 
...           "I prefer scikit-learn to Orange", 
...           "The scikit-learn docs are Orange and Blue"]                                                                                                                                                                                                   
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")                                                                                                                                                                                                   
>>> tfidf = vect.fit_transform(corpus)                                                                                                                                                                                                                       
>>> pairwise_similarity = tfidf * tfidf.T

3

>>> corpus = ["I'd like an apple", 
...           "An apple a day keeps the doctor away", 
...           "Never compare an apple to an orange", 
...           "I prefer scikit-learn to Orange", 
...           "The scikit-learn docs are Orange and Blue"]                                                                                                                                                                                                   
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")                                                                                                                                                                                                   
>>> tfidf = vect.fit_transform(corpus)                                                                                                                                                                                                                       
>>> pairwise_similarity = tfidf * tfidf.T

9

similarity:  0.2886751345948129

8

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

1

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

2

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

3

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

4

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

3__

>>> pairwise_similarity.toarray()                                                                                                                                                                                                                            
array([[1.        , 0.17668795, 0.27056873, 0.        , 0.        ],
       [0.17668795, 1.        , 0.15439436, 0.        , 0.        ],
       [0.27056873, 0.15439436, 1.        , 0.19635649, 0.16815247],
       [0.        , 0.        , 0.19635649, 1.        , 0.54499756],
       [0.        , 0.        , 0.16815247, 0.54499756, 1.        ]])

1

similarity:  0.2886751345948129

8

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

1

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

2

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

3

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

4

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

6__

>>> import numpy as np     
                                                                                                                                                                                                                                  
>>> arr = pairwise_similarity.toarray()     
>>> np.fill_diagonal(arr, np.nan)                                                                                                                                                                                                                            
                                                                                                                                                                                                                 
>>> input_doc = "The scikit-learn docs are Orange and Blue"                                                                                                                                                                                                  
>>> input_idx = corpus.index(input_doc)                                                                                                                                                                                                                      
>>> input_idx                                                                                                                                                                                                                                                
4

>>> result_idx = np.nanargmax(arr[input_idx])                                                                                                                                                                                                                
>>> corpus[result_idx]                                                                                                                                                                                                                                       
'I prefer scikit-learn to Orange'

3

similarity:  0.2886751345948129

8

>>> import numpy as np     
                                                                                                                                                                                                                                  
>>> arr = pairwise_similarity.toarray()     
>>> np.fill_diagonal(arr, np.nan)                                                                                                                                                                                                                            
                                                                                                                                                                                                                 
>>> input_doc = "The scikit-learn docs are Orange and Blue"                                                                                                                                                                                                  
>>> input_idx = corpus.index(input_doc)                                                                                                                                                                                                                      
>>> input_idx                                                                                                                                                                                                                                                
4

>>> result_idx = np.nanargmax(arr[input_idx])                                                                                                                                                                                                                
>>> corpus[result_idx]                                                                                                                                                                                                                                       
'I prefer scikit-learn to Orange'

5

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

0

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

8

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

9word_tokenize(X)0word_tokenize(X)1

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

2

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

3

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

4

>>> import numpy as np     
                                                                                                                                                                                                                                  
>>> arr = pairwise_similarity.toarray()     
>>> np.fill_diagonal(arr, np.nan)                                                                                                                                                                                                                            
                                                                                                                                                                                                                 
>>> input_doc = "The scikit-learn docs are Orange and Blue"                                                                                                                                                                                                  
>>> input_idx = corpus.index(input_doc)                                                                                                                                                                                                                      
>>> input_idx                                                                                                                                                                                                                                                
4

>>> result_idx = np.nanargmax(arr[input_idx])                                                                                                                                                                                                                
>>> corpus[result_idx]                                                                                                                                                                                                                                       
'I prefer scikit-learn to Orange'

9

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

0

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

8from1word_tokenize(X)0word_tokenize(X)1

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

0

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

6

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

3

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

4

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

4

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

55____76

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

0

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

6

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

3

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

4 word_tokenize(X)6

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

5word_tokenize(X)1

from4

similarity:  0.2886751345948129

8 word_tokenize(X)0

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

2 from8

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

4

similarity:  0.2886751345948129

00

similarity:  0.2886751345948129

01

similarity:  0.2886751345948129

022

similarity:  0.2886751345948129

26

similarity:  0.2886751345948129

01

similarity:  0.2886751345948129

28

similarity:  0.2886751345948129

29

Output:

similarity:  0.2886751345948129

Cách phổ biến để làm điều này là chuyển đổi các tài liệu thành các vectơ TF-IDF và sau đó tính toán độ tương tự cosin giữa chúng. Bất kỳ sách giáo khoa về Truy xuất thông tin (IR) bao gồm điều này. Xem ESP. Giới thiệu về truy xuất thông tin, miễn phí và có sẵn trực tuyến.

Tính toán tương đồng theo cặp

TF-IDF (và các phép biến đổi văn bản tương tự) được triển khai trong các gói Python Gensim và Scikit-learn. Trong gói sau, sự tương đồng về cosine tính toán dễ dàng như

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

Hoặc, nếu các tài liệu là chuỗi đơn giản,

>>> corpus = ["I'd like an apple", 
...           "An apple a day keeps the doctor away", 
...           "Never compare an apple to an orange", 
...           "I prefer scikit-learn to Orange", 
...           "The scikit-learn docs are Orange and Blue"]                                                                                                                                                                                                   
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")                                                                                                                                                                                                   
>>> tfidf = vect.fit_transform(corpus)                                                                                                                                                                                                                       
>>> pairwise_similarity = tfidf * tfidf.T

Mặc dù Gensim có thể có nhiều lựa chọn hơn cho loại nhiệm vụ này.

Xem thêm câu hỏi này.

[Tuyên bố miễn trừ trách nhiệm: Tôi đã tham gia vào việc triển khai Scikit-Learn TF-IDF.]

Diễn giải kết quả

Từ trên cao,

similarity:  0.2886751345948129

30 là một ma trận thưa thớt SCIPY có hình vuông, với số lượng hàng và cột bằng với số lượng tài liệu trong kho văn bản.

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

Bạn có thể chuyển đổi mảng thưa thớt thành một mảng numpy thông qua

similarity:  0.2886751345948129

31 hoặc

similarity:  0.2886751345948129

32:

>>> pairwise_similarity.toarray()                                                                                                                                                                                                                            
array([[1.        , 0.17668795, 0.27056873, 0.        , 0.        ],
       [0.17668795, 1.        , 0.15439436, 0.        , 0.        ],
       [0.27056873, 0.15439436, 1.        , 0.19635649, 0.16815247],
       [0.        , 0.        , 0.19635649, 1.        , 0.54499756],
       [0.        , 0.        , 0.16815247, 0.54499756, 1.        ]])

Giả sử chúng tôi muốn tìm tài liệu tương tự như tài liệu cuối cùng, "Các tài liệu Scikit-Learn có màu cam và màu xanh". Tài liệu này có Chỉ số 4 trong

similarity:  0.2886751345948129

33. Bạn có thể tìm thấy chỉ mục của tài liệu tương tự nhất bằng cách lấy argmax của hàng đó, nhưng trước tiên bạn sẽ cần che dấu 1, thể hiện sự giống nhau của mỗi tài liệu với chính nó. Bạn có thể thực hiện sau này thông qua

similarity:  0.2886751345948129

34 và cái trước thông qua

similarity:  0.2886751345948129

35:taking the argmax of that row, but first you'll need to mask the 1's, which represent the similarity of each document to itself. You can do the latter through

similarity:  0.2886751345948129

34, and the former through

similarity:  0.2886751345948129

35:

>>> import numpy as np     
                                                                                                                                                                                                                                  
>>> arr = pairwise_similarity.toarray()     
>>> np.fill_diagonal(arr, np.nan)                                                                                                                                                                                                                            
                                                                                                                                                                                                                 
>>> input_doc = "The scikit-learn docs are Orange and Blue"                                                                                                                                                                                                  
>>> input_idx = corpus.index(input_doc)                                                                                                                                                                                                                      
>>> input_idx                                                                                                                                                                                                                                                
4

>>> result_idx = np.nanargmax(arr[input_idx])                                                                                                                                                                                                                
>>> corpus[result_idx]                                                                                                                                                                                                                                       
'I prefer scikit-learn to Orange'

Lưu ý: Mục đích của việc sử dụng ma trận thưa thớt là để tiết kiệm (một lượng không gian đáng kể) cho một kho văn bản & từ vựng lớn. Thay vì chuyển đổi thành một mảng numpy, bạn có thể làm:

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

Làm thế nào để bạn tìm thấy sự tương đồng cosine giữa hai câu trong Python?

Sự tương tự cosine là thước đo sự tương đồng giữa hai vectơ khác không của không gian sản phẩm bên trong đo cosin của góc giữa chúng. Sự tương đồng = (A.B) / (|| a ||. || b ||) trong đó a và b là vectơ.Similarity = (A.B) / (||A||. ||B||) where A and B are vectors.

Làm thế nào để bạn tìm thấy sự tương đồng cosine với Sklearn?

Hướng dẫn: Cách tính toán tương tự cosin..

Phương pháp 1: Sử dụng các hàm chấm và tiêu chuẩn của Numpy ..

Phương pháp 2: Sử dụng chức năng cosine tích hợp của Scipy ..

Phương pháp 3: Sử dụng sklearn để tính toán ma trận tương tự cosine giữa các vectơ ..

Sử dụng sự tương đồng cosine để phát hiện đạo văn ..

Làm thế nào để bạn tìm thấy sự tương đồng giữa hai câu?

Sự tương đồng về thứ tự từ là một cách để đánh giá sự tương tự của câu xem xét thứ tự của các từ.Hai câu thường giống nhau nếu cùng một từ tồn tại trong cả hai câu theo cùng một thứ tự.Tuy nhiên, các câu nên được coi là không hoàn toàn giống nhau nếu các từ của một câu có thứ tự khác nhau như câu khác.Two sentences are typically similar if same words exist in both sentences in the same order. However, sentences should be considered as not completely similar if words of a sentence have dif- ferent order as the other sentence.

Làm thế nào để bạn tìm thấy sự tương đồng giữa hai văn bản trong Python?

Thuật toán của chúng tôi để xác nhận độ tương tự tài liệu sẽ bao gồm ba bước cơ bản: chia các tài liệu trong Words.compute các tần số từ. Kết hợp sản phẩm chấm của các vectơ tài liệu.Split the documents in words. Compute the word frequencies. Calculate the dot product of the document vectors.

programming python Cosine similarity Python Cosine similarity formula TF-IDF cosine similarity