Soft cosine similarity in Python: a guide

Soft Cosine Measure

Demonstrates using Gensim's implementation of the SCM.

Soft Cosine Measure (SCM) is a promising new tool in machine learning that allows us to submit a query and return the most relevant documents. This tutorial introduces SCM and shows how you can compute the SCM similarity between two documents using the inner_product method.

Soft Cosine Measure basics

Soft Cosine Measure (SCM) is a method that allows us to assess the similarity between two documents in a meaningful way, even when they have no words in common. It uses a measure of similarity between words, which can be derived [2] using word2vec [4] vector embeddings of words. It has been shown to outperform many state-of-the-art methods in the semantic text similarity task in the context of community question answering [2].

SCM is illustrated below for two very similar sentences. The sentences have no words in common, but by modeling synonymy, SCM is able to accurately measure the similarity between the two sentences. The method also uses the bag-of-words vector representation of the documents (simply put, the frequencies of the words in the documents). The intuition behind the method is that we compute the standard cosine similarity assuming that the document vectors are expressed in a non-orthogonal basis, where the angle between two basis vectors is derived from the angle between the word2vec embeddings of the corresponding words.
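The intuition above can be made concrete with a small, dependency-free sketch. It is not Gensim's implementation: the three-word vocabulary and the similarity values in S are assumptions chosen for illustration. The soft cosine of two bag-of-words vectors x and y is computed as x^T S y divided by the product of sqrt(x^T S x) and sqrt(y^T S y), where S holds word-to-word similarities. With S set to the identity matrix, the formula reduces to the ordinary cosine similarity.

```python
import math


def soft_inner(x, y, s):
    """Return x^T S y for dense vectors x, y and a term similarity matrix s."""
    return sum(x[i] * s[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def soft_cosine(x, y, s):
    """Soft cosine: x^T S y / (sqrt(x^T S x) * sqrt(y^T S y))."""
    denom = math.sqrt(soft_inner(x, x, s)) * math.sqrt(soft_inner(y, y, s))
    return soft_inner(x, y, s) / denom if denom else 0.0


# Hypothetical three-word vocabulary: ["media", "press", "fruit"].
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Assumed similarities: "media" and "press" are close, "fruit" is unrelated.
S = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]

doc_media = [1.0, 0.0, 0.0]  # a document containing only "media"
doc_press = [0.0, 1.0, 0.0]  # a document containing only "press"
doc_fruit = [0.0, 0.0, 1.0]  # a document containing only "fruit"

hard = soft_cosine(doc_media, doc_press, identity)  # ordinary cosine
soft = soft_cosine(doc_media, doc_press, S)         # soft cosine
unrelated = soft_cosine(doc_media, doc_fruit, S)
print(hard, soft, unrelated)  # prints: 0.0 0.8 0.0
```

The two one-word documents share no terms, so the ordinary cosine is zero, while the soft cosine picks up the assumed similarity between "media" and "press".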

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img = mpimg.imread('scm-hello.png')
imgplot = plt.imshow(img)
plt.axis('off')
plt.show()

This method was perhaps first introduced in the article "Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model" by Grigori Sidorov, Alexander Gelbukh, Helena Gomez-Adorno, and David Pinto.

In this tutorial, we will learn how to use Gensim's SCM functionality, which consists of the inner_product method for one-off computation, and the SoftCosineSimilarity class for corpus-based similarity queries.

Important

If you use Gensim's SCM functionality, please consider citing [1], [2] and [3].

Computing the Soft Cosine Measure

To use SCM, you need some existing word embeddings. You could train your own Word2Vec model, but that is beyond the scope of this tutorial (check out the Word2Vec Model tutorial if you're interested). For this tutorial, we'll be using an existing Word2Vec model.

Let's take some sentences to compute the distance between.

# Initialize logging.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
sentence_orange = 'Oranges are my favorite fruit'

The first two sentences have very similar content, and as such the SCM should be high. By contrast, the third sentence is unrelated to the first two and the SCM should be low.

Before we compute the SCM, we want to remove stopwords ("the", "to", etc.), as these do not contribute a lot to the information in the sentences.

# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download
download('stopwords')  # Download stopwords list.
stop_words = stopwords.words('english')

def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]

sentence_obama = preprocess(sentence_obama)
sentence_president = preprocess(sentence_president)
sentence_orange = preprocess(sentence_orange)

Out:

[nltk_data] Downloading package stopwords to /home/witiko/nltk_data...
[nltk_data] Package stopwords is already up-to-date!

Next, we will build a dictionary and a TF-IDF model, and we will convert the sentences to the bag-of-words format.

from gensim.corpora import Dictionary
documents = [sentence_obama, sentence_president, sentence_orange]
dictionary = Dictionary(documents)

sentence_obama = dictionary.doc2bow(sentence_obama)
sentence_president = dictionary.doc2bow(sentence_president)
sentence_orange = dictionary.doc2bow(sentence_orange)

from gensim.models import TfidfModel
documents = [sentence_obama, sentence_president, sentence_orange]
tfidf = TfidfModel(documents)

sentence_obama = tfidf[sentence_obama]
sentence_president = tfidf[sentence_president]
sentence_orange = tfidf[sentence_orange]
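To show roughly what the dictionary, the bag-of-words conversion, and the TF-IDF weighting do, here is a minimal sketch in plain Python. It is not Gensim's code: the toy documents, the helper names to_bow and tfidf, the base-2 logarithm for the inverse document frequency, and the unit-length normalization are all assumptions made for illustration, and Gensim's TfidfModel weighting is configurable.

```python
import math
from collections import Counter

# Hypothetical tokenized documents (stopwords already removed).
docs = [
    ["president", "speaks", "media"],
    ["president", "greets", "press"],
    ["oranges", "favorite", "fruit"],
]

# Dictionary: map every distinct token to an integer id.
vocab = sorted({w for doc in docs for w in doc})
token2id = {w: i for i, w in enumerate(vocab)}


def to_bow(doc):
    """Convert a token list into a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[w] for w in doc)
    return sorted(counts.items())


def tfidf(bow, doc_freq, num_docs):
    """Weight counts by log2(N / df) and scale the vector to unit length."""
    weighted = [(i, tf * math.log2(num_docs / doc_freq[i])) for i, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(i, w / norm) for i, w in weighted] if norm else weighted


bows = [to_bow(d) for d in docs]
# Document frequency: in how many documents each token id occurs.
doc_freq = Counter(i for bow in bows for i, _ in bow)
weighted_docs = [tfidf(b, doc_freq, len(docs)) for b in bows]

for doc in weighted_docs:
    print(doc)
```

In this toy corpus "president" occurs in two documents, so it receives a lower weight than words that occur in only one.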

Now, as mentioned earlier, we will be using some downloaded pre-trained embeddings. We load these into a Gensim Word2Vec model class and we build a term similarity matrix using the embeddings.

Important

The embeddings we have chosen here require a lot of memory.

import gensim.downloader as api
model = api.load('word2vec-google-news-300')

from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex
termsim_index = WordEmbeddingSimilarityIndex(model)
termsim_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, tfidf)
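As a rough illustration of how a term similarity matrix can be derived from word embeddings, the sketch below computes pairwise cosine similarities between toy vectors, zeroes out similarities that do not exceed a threshold, and raises the rest to a power. The two-dimensional vectors, the helper names, and the threshold and exponent values are assumptions for illustration. This is not Gensim's implementation, which stores the matrix sparsely.

```python
import math

# Hypothetical 2-dimensional embeddings; the real model uses 300 dimensions.
embeddings = {
    "media": [0.9, 0.1],
    "press": [0.8, 0.3],
    "fruit": [-0.2, 1.0],
}


def cosine(u, v):
    """Ordinary cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def term_similarity_matrix(terms, vectors, threshold=0.0, exponent=2.0):
    """Build a dense term-by-term similarity matrix with ones on the diagonal.

    Similarities at or below `threshold` are set to zero; the others are
    raised to `exponent` (both values are illustrative assumptions).
    """
    matrix = []
    for t1 in terms:
        row = []
        for t2 in terms:
            if t1 == t2:
                row.append(1.0)
            else:
                sim = cosine(vectors[t1], vectors[t2])
                row.append(sim ** exponent if sim > threshold else 0.0)
        matrix.append(row)
    return matrix


terms = ["media", "press", "fruit"]
S = term_similarity_matrix(terms, embeddings)
for term, row in zip(terms, S):
    print(term, [round(value, 3) for value in row])
```

With these toy vectors, "media" and "press" end up strongly related, while the negative similarity between "media" and "fruit" is clipped to zero.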

So let's compute the SCM using the inner_product method.

similarity = termsim_matrix.inner_product(sentence_obama, sentence_president, normalized=(True, True))
print('similarity = %.4f' % similarity)

Let's try the same thing with two completely unrelated sentences. Notice that the similarity is smaller.

similarity = termsim_matrix.inner_product(sentence_obama, sentence_orange, normalized=(True, True))
print('similarity = %.4f' % similarity)

Total running time of the script: (0 minutes 56.707 seconds)

Estimated memory usage: 7701 MB