CypherNLP

From Traxel Wiki

Text Cleanup

From GPT

Removing or replacing emojis, non-ASCII Unicode characters, and other non-standard text elements can matter for text processing tasks where these elements add noise rather than meaningful content. Here's how you can handle them:

1. Removing Emojis
  • Emojis are Unicode characters, so they have specific code points. You can use a regular expression to filter out these code points.
import re

def remove_emojis(text):
    # Match runs of characters from the common emoji and pictograph blocks.
    # Note: one wide range from U+24C2 to U+1F251 would also strip CJK, Hangul,
    # and most other non-Latin text, so the enclosed characters are listed narrowly.
    emoji_pattern = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F700-\U0001F77F"  # alchemical symbols
                           "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                           "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                           "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                           "\U0001FA00-\U0001FA6F"  # Chess Symbols
                           "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                           "\U00002702-\U000027B0"  # Dingbats
                           "\U000024C2"             # circled Latin capital letter M
                           "\U0001F200-\U0001F251"  # Enclosed Ideographic Supplement
                           "]+")
    return emoji_pattern.sub('', text)
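As a quick check of the approach, here is a minimal sketch that uses a reduced pattern (only three of the blocks listed above) and confirms that ordinary non-Latin text is left alone. The names EMOJI_PATTERN and strip_emojis and the sample string are made up for illustration.

```python
import re

# Reduced, illustrative pattern: only three of the emoji blocks.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "]+"
)

def strip_emojis(text):
    """Delete every run of characters matched by EMOJI_PATTERN."""
    return EMOJI_PATTERN.sub("", text)

# A rocket and a grinning face, followed by Japanese text that must survive.
sample = "Launch day \U0001F680\U0001F600 in \u6771\u4eac"
cleaned = strip_emojis(sample)
print(cleaned)
```

Removing an emoji leaves the surrounding spaces in place, so you may want to collapse repeated whitespace afterwards.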
2. Handling Unicode Characters
  • If you want to keep only ASCII characters (and remove all other Unicode characters):
def keep_ascii_only(text):
    return text.encode('ascii', 'ignore').decode('ascii')
  • If you want to replace non-ASCII characters with a placeholder or space:
def replace_non_ascii(text, replacement=' '):
    return ''.join([char if ord(char) < 128 else replacement for char in text])
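The difference between the two functions is easiest to see side by side. The sketch below repeats them and adds one hypothetical helper, fold_to_ascii, which is not part of the original notes: it assumes you would rather keep the base letter of an accented character, and uses the standard library's NFKD normalization to split off the accent before dropping it.

```python
import unicodedata

def keep_ascii_only(text):
    return text.encode("ascii", "ignore").decode("ascii")

def replace_non_ascii(text, replacement=" "):
    return "".join(ch if ord(ch) < 128 else replacement for ch in text)

def fold_to_ascii(text):
    # Hypothetical helper (not in the original): NFKD decomposes accented
    # letters into base letter + combining mark, so only the mark is dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# "Mëtàl café" written with escapes so the accents are precomposed code points.
sample = "M\u00ebt\u00e0l caf\u00e9"
dropped = keep_ascii_only(sample)
replaced = replace_non_ascii(sample, "?")
folded = fold_to_ascii(sample)
print(dropped, replaced, folded, sep=" | ")
```

Dropping characters outright loses letters, replacing keeps the string length, and folding keeps the text readable for Latin-based scripts; which one is appropriate depends on the downstream task.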
3. Using Libraries
  • Libraries like unidecode can be used to convert Unicode text into its closest ASCII representation. This is especially useful for transliterating characters from other languages into English characters.
from unidecode import unidecode

text = "Mëtàl"
print(unidecode(text))  # Output: "Metal"
  • Installation: pip install unidecode
4. Stripping or Replacing Specific Unicode Categories
  • Sometimes, you might want to remove or replace specific categories of Unicode characters, like non-breaking spaces, mathematical symbols, etc. Python's unicodedata module can help identify categories.
import unicodedata

def strip_unicode_categories(text, categories):
    return ''.join([char for char in text if unicodedata.category(char) not in categories])

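Then, to remove, for instance, "Other Letter" characters (category Lo, which covers caseless scripts such as CJK ideographs, Hebrew, and Arabic) and mathematical symbols (category Sm), you can call: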

cleaned_text = strip_unicode_categories(text, ["Lo", "Sm"])
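Before choosing categories to strip, it can help to print the category of a few representative characters. The following is a small sketch along those lines; the sample characters and the variable names are chosen only for illustration.

```python
import unicodedata

def strip_unicode_categories(text, categories):
    return "".join(ch for ch in text if unicodedata.category(ch) not in categories)

# Inspect the two-letter general category of a few characters:
# Latin capital A, plus sign, infinity, no-break space, a CJK ideograph, snowman.
probe = ["A", "+", "\u221e", "\u00a0", "\u4e2d", "\u2603"]
categories = {ch: unicodedata.category(ch) for ch in probe}
for ch, cat in categories.items():
    print(repr(ch), cat)

# Strip "Other Letter" (Lo) and math symbols (Sm) from a mixed string:
# "x <= y" written with U+2264, followed by a CJK ideograph.
mixed = "x \u2264 y \u4e2d"
stripped = strip_unicode_categories(mixed, {"Lo", "Sm"})
print(repr(stripped))
```

Note that the plus sign is itself category Sm, so stripping Sm also removes common ASCII operators such as + and =.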

Remember to consider the context and purpose of your text processing to determine which simplifications are necessary and how aggressive you want to be.

TF-IDF

From GPT

Here's a step-by-step approach to go from tokenized texts to cosine similarity scores using TF-IDF vectors:

  1. Convert tokenizations into bag-of-words token counts.
  2. Convert bag-of-words counts into TF-IDF vectors.
  3. Compute the cosine similarity between documents.

For this example, I'll be using the scikit-learn library. If you haven't installed it yet, you can do so with:

pip install scikit-learn

Here's a Python code example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assuming you have the texts tokenized like this (just for demonstration):
tokenized_texts = [
    [1, 2, 3, 4, 5],
    [2, 3, 4, 6, 7],
    [1, 3, 4, 8, 9]
]

# Convert token IDs back to strings (or words) to work with TfidfVectorizer
# In practice, you might have a mapping from token IDs to words.
# For this example, I'm simply converting the numbers to strings.
texts = [' '.join(map(str, text)) for text in tokenized_texts]

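# Convert tokenized texts into TF-IDF vectors.
# The default token_pattern only keeps tokens of two or more characters, which would
# drop every single-digit token here, so use a pattern that accepts one-character tokens.
tfidf_vectorizer = TfidfVectorizer(use_idf=True, token_pattern=r"(?u)\b\w+\b")
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)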

# Compute the cosine similarity between texts
cosine_sim = cosine_similarity(tfidf_matrix)

print(cosine_sim)

Here's a breakdown of the steps:

  • We first convert the tokenized texts into plain texts. In practice, you might have a mapping from your token IDs to actual words. For the sake of simplicity, I just converted the numbers to strings.
  • TfidfVectorizer from scikit-learn is used to convert the texts into TF-IDF vectors.
  • Finally, the cosine_similarity function from scikit-learn computes the cosine similarity between these vectors.

This will give you a matrix of cosine similarity scores, where each row and column correspond to a text, and the value at (i, j) is the cosine similarity score between text i and text j.
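To make the three steps concrete, here is a dependency-free sketch of the same pipeline on the same token IDs. It assumes the smoothed IDF formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization; the function names are made up for this illustration and the code is not a substitute for the library.

```python
import math
from collections import Counter

tokenized_texts = [
    [1, 2, 3, 4, 5],
    [2, 3, 4, 6, 7],
    [1, 3, 4, 8, 9],
]

def tfidf_vectors(docs):
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]             # step 1: bag-of-words counts
    df = Counter(tok for c in counts for tok in c)      # documents containing each token
    vocab = sorted(df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for c in counts:
        vec = [c[t] * idf[t] for t in vocab]            # step 2: tf * idf
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])         # L2-normalize each row
    return vectors

def cosine(u, v):
    # Rows are already unit length, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))

vectors = tfidf_vectors(tokenized_texts)
sim = [[cosine(u, v) for v in vectors] for u in vectors]  # step 3
for row in sim:
    print([round(x, 3) for x in row])
```

Texts 0 and 1 share three tokens while texts 1 and 2 share only two, both of which occur in every document, so the (0, 1) score comes out higher than the (1, 2) score.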

Stop Words / Long Tail

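To drop English stop words and trim both ends of the term-frequency distribution, pass stop_words, max_df, and min_df. Here max_df=0.85 ignores terms that appear in more than 85% of documents, and min_df=2 ignores terms that appear in fewer than two documents:

tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.85, min_df=2)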
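The effect of the two document-frequency cutoffs can be sketched without scikit-learn. The snippet below is an illustration only: filter_vocabulary is a hypothetical helper, the toy documents are invented, and it assumes a float max_df is read as a proportion of documents and an integer min_df as an absolute count.

```python
from collections import Counter

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]

def filter_vocabulary(docs, max_df=0.85, min_df=2):
    """Keep terms whose document frequency lies within both cutoffs."""
    n_docs = len(docs)
    # Count each term once per document, however often it repeats there.
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(
        term for term, d in df.items()
        if d >= min_df and d / n_docs <= max_df
    )

kept = filter_vocabulary(docs)
print(kept)
```

In this toy corpus "the" appears in every document and is cut by max_df, while the words that appear only once are cut by min_df.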

Fixed Vocabulary

You can fit a TfidfVectorizer on a set of documents to define the term mapping (vocabulary) and then use that same mapping for future documents without refitting. This is especially useful when you want to ensure the same feature representation is used for both your training and test datasets or any subsequent datasets.

1. Fit the TfidfVectorizer on the Initial Set of Documents

This will determine the vocabulary based on the provided documents.

from sklearn.feature_extraction.text import TfidfVectorizer

# Initial set of documents
initial_docs = ["This is a sample document.", "Another example document.", "Yet another example."]

# Initialize and fit the vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix_initial = tfidf_vectorizer.fit_transform(initial_docs)

2. Transform Future Documents Using the Same Vocabulary

After fitting on the initial set of documents, you can transform new/future documents using the transform method without refitting the vectorizer. The vocabulary remains fixed, and any new terms in the future documents that are not in the initial vocabulary will be ignored.

# Future set of documents
future_docs = ["This is a new document.", "Some other sample text.", "A completely different example."]

# Transform future documents using the existing vocabulary
tfidf_matrix_future = tfidf_vectorizer.transform(future_docs)

Because transform reuses the fitted vocabulary and IDF weights, tfidf_matrix_future has the same number of columns as tfidf_matrix_initial, so the two matrices can be compared or fed to the same model.
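The fit/transform split can be illustrated with a toy, dependency-free stand-in. TinyCountVectorizer below is hypothetical and produces raw counts rather than TF-IDF weights; it only shows how a vocabulary frozen at fit time causes unseen terms to be skipped at transform time.

```python
class TinyCountVectorizer:
    """Hypothetical stand-in used only to illustrate a frozen vocabulary."""

    def fit(self, docs):
        # Build the term -> column mapping once, from the fitting documents.
        terms = sorted({tok for doc in docs for tok in doc.lower().split()})
        self.vocabulary_ = {term: i for i, term in enumerate(terms)}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in doc.lower().split():
                idx = self.vocabulary_.get(tok)
                if idx is not None:      # terms unseen during fit are skipped
                    row[idx] += 1
            rows.append(row)
        return rows

vec = TinyCountVectorizer().fit(["another example document", "yet another example"])
rows = vec.transform(["a completely different example", "another new document"])
print(vec.vocabulary_)
print(rows)
```

Every transformed row has one column per fitted term, regardless of how many new words the later documents contain.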