CypherNLP
Text Cleanup
From GPT
Removing or replacing emojis, non-ASCII characters, and other non-standard text elements can matter for text processing tasks where these elements add no meaningful content or introduce noise. Here's how you can handle them:
- 1. Removing Emojis
- Emojis are Unicode characters, so they have specific code points. You can use a regular expression to filter out these code points.
import re

def remove_emojis(text):
    # Character class built from adjacent string literals, one Unicode range per line.
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F700-\U0001F77F"  # alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "\U000024C2"             # circled Latin capital letter M
        "\U0001F200-\U0001F251"  # enclosed ideographic symbols
        "]+",
        flags=re.UNICODE,
    )
    # A single range from U+24C2 to U+1F251 would also match CJK, Hangul,
    # and other ordinary text, so the two ends are listed separately.
    return emoji_pattern.sub("", text)
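When deciding which ranges belong in a pattern like this, it can help to look up the code point and Unicode name of the characters you actually see in your data. The sketch below uses only the standard library; the helper name `describe_code_points` and the sample string are made up for this illustration.

```python
import unicodedata

def describe_code_points(text):
    """Return (character, 'U+XXXX', Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# "\U0001F600" is the grinning face emoji, "\U0001F680" is the rocket emoji.
for ch, code_point, name in describe_code_points("ok \U0001F600\U0001F680"):
    print(code_point, name)
```

A character whose code point falls outside every listed range will pass through the regular expression untouched, so this kind of inspection is a quick way to spot ranges that are missing.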
- 2. Handling Unicode Characters
- If you want to keep only ASCII characters (and remove all other Unicode characters):
def keep_ascii_only(text):
    # Characters that cannot be encoded as ASCII are silently dropped.
    return text.encode('ascii', 'ignore').decode('ascii')
- If you want to replace non-ASCII characters with a placeholder or space:
def replace_non_ascii(text, replacement=' '):
    # Code points 0-127 are ASCII; everything else becomes the replacement.
    return ''.join([char if ord(char) < 128 else replacement for char in text])
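As a rough illustration of how the two helpers above differ, the sketch below runs both on the same string. It also shows one standard-library middle ground: decomposing accented letters with `unicodedata.normalize` before dropping non-ASCII code points, so the base letter survives. The name `strip_accents_to_ascii` and the sample string are assumptions made for this example.

```python
import unicodedata

def keep_ascii_only(text):
    return text.encode("ascii", "ignore").decode("ascii")

def replace_non_ascii(text, replacement=" "):
    return "".join(ch if ord(ch) < 128 else replacement for ch in text)

def strip_accents_to_ascii(text):
    # NFKD splits "é" into "e" plus a combining accent; the accent is
    # non-ASCII and gets dropped, while the plain "e" is kept.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

sample = "caf\u00e9 na\u00efve"  # "café naïve"
print(keep_ascii_only(sample))         # caf nave
print(replace_non_ascii(sample, "?"))  # caf? na?ve
print(strip_accents_to_ascii(sample))  # cafe naive
```

Decomposition only helps for characters that have an ASCII base letter; characters from other scripts are still removed entirely.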
- 3. Using Libraries
- Libraries like unidecode can be used to convert Unicode text into its closest ASCII representation. This is especially useful for transliterating characters from other languages into plain Latin (ASCII) characters.
from unidecode import unidecode

text = "Mëtàl"
print(unidecode(text))  # Output: "Metal"
- Installation:
pip install unidecode
- 4. Stripping or Replacing Specific Unicode Categories
- Sometimes, you might want to remove or replace specific categories of Unicode characters, like non-breaking spaces, mathematical symbols, etc. Python's unicodedata module can help identify categories.
import unicodedata

def strip_unicode_categories(text, categories):
    # Keep only characters whose two-letter general category is not listed.
    return ''.join([char for char in text if unicodedata.category(char) not in categories])
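Then, to remove, for instance, "other letters" (category Lo, which covers letters from scripts without upper and lower case, such as CJK ideographs) and mathematical symbols (category Sm), you can call: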
cleaned_text = strip_unicode_categories(text, ["Lo", "Sm"])
Remember to consider the context and purpose of your text processing to determine which simplifications are necessary and how aggressive you want to be.
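To decide which categories to pass, it can help to look up the category of a few sample characters first. The sketch below does that with the standard `unicodedata` module and then strips mathematical symbols from a short string; the sample characters are arbitrary choices for illustration.

```python
import unicodedata

def strip_unicode_categories(text, categories):
    return "".join(ch for ch in text if unicodedata.category(ch) not in categories)

# Inspect the general category of a few characters:
# less-than-or-equal sign, plus sign, euro sign, a CJK ideograph, non-breaking space.
for ch in ["\u2264", "+", "\u20ac", "\u4e2d", "\u00a0"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))

# Both "\u2264" and "+" are in category Sm, so both are removed.
print(strip_unicode_categories("x \u2264 y + 1", {"Sm"}))  # x  y  1
```

Note that a regular space and a non-breaking space share the category Zs, so stripping Zs removes ordinary spaces as well.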
TF-IDF
From GPT
Here's a step-by-step approach to go from tokenized texts to cosine similarity scores using TF-IDF vectors:
- Convert tokenizations into bag-of-words token counts.
- Convert bag-of-words counts into TF-IDF vectors.
- Compute the cosine similarity between documents.
For this example, I'll be using the scikit-learn library. If you haven't installed it yet, you can do so with:
pip install scikit-learn
Here's a Python code example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Assuming you have the texts tokenized like this (just for demonstration):
tokenized_texts = [
[1, 2, 3, 4, 5],
[2, 3, 4, 6, 7],
[1, 3, 4, 8, 9]
]
# Convert token IDs back to strings (or words) to work with TfidfVectorizer
# In practice, you might have a mapping from token IDs to words.
# For this example, I'm simply converting the numbers to strings.
texts = [' '.join(map(str, text)) for text in tokenized_texts]
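# Convert tokenized texts into TF-IDF vectors.
# The default token_pattern ignores single-character tokens such as "1",
# so use a pattern that keeps tokens of any length.
tfidf_vectorizer = TfidfVectorizer(use_idf=True, token_pattern=r"(?u)\b\w+\b")
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)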
# Compute the cosine similarity between texts
cosine_sim = cosine_similarity(tfidf_matrix)
print(cosine_sim)
Here's a breakdown of the steps:
- We first convert the tokenized texts into plain texts. In practice, you might have a mapping from your token IDs to actual words. For the sake of simplicity, I just converted the numbers to strings.
- TfidfVectorizer from scikit-learn counts the tokens in each text and converts those counts into TF-IDF vectors in a single step.
- Finally, the cosine_similarity function from scikit-learn computes the cosine similarity between these vectors.
This will give you a matrix of cosine similarity scores, where each row and column correspond to a text, and the value at (i, j) is the cosine similarity score between text i and text j.
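To make the three steps (token counts, TF-IDF weighting, cosine similarity) visible without a library, here is a minimal pure-Python sketch. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is how scikit-learn documents its default settings; the function names `tfidf_vectors` and `cosine` are made up for this example, and the results are meant to illustrate the idea rather than to replace the library.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_texts):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    n_docs = len(tokenized_texts)
    # Step 1: bag-of-words counts per document.
    counts = [Counter(doc) for doc in tokenized_texts]
    vocab = sorted({tok for doc in tokenized_texts for tok in doc})
    # Step 2: document frequency and smoothed IDF per token (assumed formula).
    df = {tok: sum(1 for c in counts if tok in c) for tok in vocab}
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for c in counts:
        vec = [c[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

def cosine(u, v):
    """Step 3: cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

tokenized_texts = [
    [1, 2, 3, 4, 5],
    [2, 3, 4, 6, 7],
    [1, 3, 4, 8, 9],
]
vocab, vectors = tfidf_vectors(tokenized_texts)
similarity = [[cosine(u, v) for v in vectors] for u in vectors]
for row in similarity:
    print([round(value, 3) for value in row])
```

Because this version works on token IDs directly, no conversion back to strings is needed. The first text shares three tokens with each of the others, while the second and third share only two, so the off-diagonal entries in the first row come out higher than the similarity between the second and third texts.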