CypherNLP
== Text Cleanup ==
From GPT
Simplifying text by removing or replacing emojis, Unicode characters, and other non-standard text elements can be important for certain text processing tasks, especially when these elements don't add meaningful content or can introduce noise. Here's how you can handle these elements:
=== 1. Removing Emojis ===
Emojis are Unicode characters, so they have specific code points. You can use a regular expression to filter out these code points.
<pre>
import re

def remove_emojis(text):
    emoji_pattern = re.compile(
        "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F700-\U0001F77F"  # alchemical symbols
        u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        u"\U0001FA00-\U0001FA6F"  # Chess Symbols
        u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        u"\U00002702-\U000027B0"  # Dingbats
        u"\U000024C2-\U0001F251"
        "]+",
        flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
</pre>
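A quick check on a made-up string (the sample text is just an illustration, not from the article):
<pre>
print(remove_emojis("Great job 👍 see you tomorrow!"))
# The thumbs-up emoji is stripped; the surrounding text (and its whitespace) is kept.
</pre>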
=== 2. Handling Unicode Characters ===
If you want to keep only ASCII characters (and remove all other Unicode characters):
<pre>
def keep_ascii_only(text):
    return text.encode('ascii', 'ignore').decode('ascii')
</pre>
If you want to replace non-ASCII characters with a placeholder or space:
<pre>
def replace_non_ascii(text, replacement=' '):
    return ''.join([char if ord(char) < 128 else replacement for char in text])
</pre>
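A quick side-by-side of the two helpers on a made-up string (the example text is illustrative only):
<pre>
sample = "Café déjà vu – naïve"
print(keep_ascii_only(sample))    # accented characters and the dash are dropped entirely
print(replace_non_ascii(sample))  # each non-ASCII character becomes a single space
</pre>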
=== 3. Using Libraries ===
Libraries like unidecode can be used to convert Unicode text into its closest ASCII representation. This is especially useful for transliterating characters from other languages into English characters.
<pre>
from unidecode import unidecode

text = "Mëtàl"
print(unidecode(text))  # Output: "Metal"
</pre>
Installation:
<pre>
pip install unidecode
</pre>
=== 4. Stripping or Replacing Specific Unicode Categories ===
Sometimes, you might want to remove or replace specific categories of Unicode characters, like non-breaking spaces, mathematical symbols, etc. Python's unicodedata module can help identify categories.
<pre>
import unicodedata

def strip_unicode_categories(text, categories):
    return ''.join([char for char in text if unicodedata.category(char) not in categories])
</pre>
Then, to remove, for instance, characters in the "Lo" (Letter, other) and "Sm" (Symbol, math) categories, you can call:
<pre>
cleaned_text = strip_unicode_categories(text, ["Lo", "Sm"])
</pre>
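As a small illustration (the input string is made up), stripping the "Sm" category removes mathematical symbols while leaving letters and spaces untouched:
<pre>
sample = "x ± y ≤ z"
print(strip_unicode_categories(sample, ["Sm"]))  # the ± and ≤ symbols are removed
</pre>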
Remember to consider the context and purpose of your text processing to determine which simplifications are necessary and how aggressive you want to be.
== TF-IDF ==
From GPT
Here's a step-by-step approach to go from tokenizations to cosine similarity scores using TF-IDF vectors:
# Convert tokenizations into bag-of-words token counts.
# Convert bag-of-words counts into TF-IDF vectors.
# Compute the cosine similarity between documents.
For this example, I'll be using the scikit-learn library. If you haven't installed it yet, you can do so with:
<pre>
pip install scikit-learn
</pre>
Here's a Python code example:
<pre>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assuming you have the texts tokenized like this (just for demonstration):
tokenized_texts = [
    [1, 2, 3, 4, 5],
    [2, 3, 4, 6, 7],
    [1, 3, 4, 8, 9]
]

# Convert token IDs back to strings (or words) to work with TfidfVectorizer.
# In practice, you might have a mapping from token IDs to words.
# For this example, the numbers are simply converted to strings.
texts = [' '.join(map(str, text)) for text in tokenized_texts]

# Convert tokenized texts into TF-IDF vectors.
# The default token_pattern drops single-character tokens, which would discard
# these one-digit demo "words", so it is relaxed here to keep every token.
tfidf_vectorizer = TfidfVectorizer(use_idf=True, token_pattern=r"(?u)\b\w+\b")
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)

# Compute the cosine similarity between texts
cosine_sim = cosine_similarity(tfidf_matrix)
print(cosine_sim)
</pre>
Here's a breakdown of the steps:
# We first convert the tokenized texts into plain texts. In practice, you might have a mapping from your token IDs to actual words. For the sake of simplicity, the numbers are just converted to strings.
# TfidfVectorizer from scikit-learn is used to convert the texts into TF-IDF vectors (with a relaxed token_pattern so the single-character demo tokens are not discarded).
# Finally, the cosine_similarity function from scikit-learn computes the cosine similarity between these vectors.
This will give you a matrix of cosine similarity scores, where each row and column correspond to a text, and the value at (i, j) is the cosine similarity score between text i and text j.
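If you also want to pull out the most similar pair of texts programmatically, here is a small follow-up sketch (it assumes the cosine_sim matrix from the example above and that NumPy is installed):
<pre>
import numpy as np

# Copy the matrix and blank out the diagonal (each text is trivially similar to itself).
sim = cosine_sim.copy()
np.fill_diagonal(sim, -1.0)

# Locate the highest remaining score.
i, j = np.unravel_index(np.argmax(sim), sim.shape)
print(f"Most similar pair: text {i} and text {j} (score {sim[i, j]:.3f})")
</pre>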
=== Stop Words / Long Tail ===
This configuration filters both common and very rare terms: stop_words='english' drops built-in English stop words, max_df=0.85 ignores terms that appear in more than 85% of the documents, and min_df=2 ignores terms that appear in fewer than 2 documents.
<pre>
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.85, min_df=2)
</pre>
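A minimal sketch of the effect on vocabulary size, using a made-up corpus (the documents and the resulting counts are illustrative, not from the article):
<pre>
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a bird watched the cat and the dog",
]

# Default settings keep every term that passes the default tokenizer.
full_vocab = TfidfVectorizer().fit(docs).get_feature_names_out()

# Pruned: drop English stop words, terms in more than 85% of documents,
# and terms seen in fewer than 2 documents.
pruned_vocab = TfidfVectorizer(stop_words='english', max_df=0.85, min_df=2).fit(docs).get_feature_names_out()

print(len(full_vocab), "terms before pruning,", len(pruned_vocab), "terms after")
</pre>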
=== Fixed Vocabulary ===
You can fit a '''TfidfVectorizer''' on a set of documents to define the term mapping (vocabulary) and then use that same mapping for future documents without refitting. This is especially useful when you want to ensure the same feature representation is used for both your training and test datasets or any subsequent datasets.
==== 1. Fit the TfidfVectorizer on the Initial Set of Documents ====
This will determine the vocabulary based on the provided documents.
<pre>
from sklearn.feature_extraction.text import TfidfVectorizer

# Initial set of documents
initial_docs = ["This is a sample document.", "Another example document.", "Yet another example."]

# Initialize and fit the vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix_initial = tfidf_vectorizer.fit_transform(initial_docs)
</pre>
==== 2. Transform Future Documents Using the Same Vocabulary ====
After fitting on the initial set of documents, you can transform new/future documents using the '''transform''' method without refitting the vectorizer. The vocabulary remains fixed, and any new terms in the future documents that are not in the initial vocabulary will be ignored.
<pre>
# Future set of documents
future_docs = ["This is a new document.", "Some other sample text.", "A completely different example."]

# Transform future documents using the existing vocabulary
tfidf_matrix_future = tfidf_vectorizer.transform(future_docs)
</pre>
Any new terms in the future documents that are not present in the initial vocabulary will simply be ignored during the transformation.
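As a quick sanity check that both matrices share the same columns, you can compare their shapes and inspect the fitted vocabulary (a small sketch, assuming the variables from the two snippets above):
<pre>
# Both matrices have one column per term in the *initial* vocabulary.
print(tfidf_matrix_initial.shape[1] == tfidf_matrix_future.shape[1])  # True

# The fixed term mapping learned during fit_transform
print(tfidf_vectorizer.get_feature_names_out())
</pre>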