(Text, Context, Rules) -> Reduced Text
Word embeddings are vector representations that capture a word's meaning based on the contexts in which it appears, positioning similar words close in a high-dimensional space. This allows embeddings to reflect relationships like synonyms, analogies, and usage patterns.
In this context, \( T = \{t_1, t_2, \dots, t_n\} \) represents a set of embedding vectors for words, sentences, or paragraphs from the text. Each element \( t_i \) in the set \( T \) is a vector that encodes some linguistic information about a particular word, sentence, or paragraph.
\( C = \{c_1, c_2, \dots, c_m\} \) represents a set of embedding vectors corresponding to context elements. Each element \( c_i \) in the set \( C \) is a vector that encodes some semantic or informational significance about the specific context element, which can be a word, sentence, or paragraph related to a larger text or domain.
Let \( I = \{I_1, I_2, \dots, I_m\} \) represent the weights of the context elements, one weight \( I_i \) per element \( c_i \), indicating their informational importance. Such weights could be based on the rarity of the elements: rarer elements (words, sentences, paragraphs) typically carry more informational importance because they are less expected and thus contribute more uniquely to the overall meaning.
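The text does not prescribe a formula for rarity-based weights. As a minimal sketch, one common choice is self-information, \( -\log_2 \) of an element's relative frequency, so that rarer elements receive larger weights. The function name `rarity_weights` and the toy word list are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter


def rarity_weights(elements):
    """Return one weight per element: -log2 of its relative frequency.

    Assumption: rarity is measured by frequency within `elements` itself;
    a real system might use corpus-wide statistics instead.
    """
    counts = Counter(elements)
    total = len(elements)
    return [-math.log2(counts[e] / total) for e in elements]


# Toy context: "the" appears 3 times out of 4, "embedding" once.
context_words = ["the", "the", "embedding", "the"]
weights = rarity_weights(context_words)
# The rare word "embedding" gets a larger weight than the frequent word "the".
```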
Between Words, Sentences, and Paragraphs
The weighted similarity \( V \) between context elements and text elements is: \[ V = \sum_{i=1}^{m} I_i \sum_{j=1}^{n} \text{sim}(c_i, t_j) \]
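The double sum can be sketched in a few lines of Python. The text leaves \( \text{sim} \) unspecified, so cosine similarity is assumed here; the helper names `cosine` and `weighted_similarity` and the two-dimensional toy vectors are illustrative, not part of the original method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_similarity(context, text, weights):
    """V = sum_i I_i * sum_j sim(c_i, t_j), with sim assumed to be cosine."""
    return sum(
        w * sum(cosine(c, t) for t in text)
        for c, w in zip(context, weights)
    )


# Toy data: m = 2 context vectors, n = 3 text vectors.
C = [[1.0, 0.0], [0.0, 1.0]]
T = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
I = [2.0, 1.0]

V = weighted_similarity(C, T, I)
```

In this toy case the first context vector matches two text vectors and the second matches one, so the weights 2 and 1 scale inner sums of 2 and 1 respectively.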
Let \( V_{\text{reduced}} \) be the same weighted similarity computed after text reduction, that is, with the embedding vectors of the reduced text in place of \( T \).
The understanding measurement \( S \) is the ratio of the similarity after reduction to the initial similarity: \[ S = \frac{V_{\text{reduced}}}{V} \leq 1 \]
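As a small worked example, the ratio can be computed from two similarity values that have already been obtained. The function name `understanding` and the numbers are hypothetical; the guard against a non-positive \( V \) is an added assumption, since the ratio is undefined when \( V = 0 \).

```python
def understanding(v_initial, v_reduced):
    """S = V_reduced / V for already computed similarity values."""
    if v_initial <= 0:
        # Assumption: the initial similarity is strictly positive.
        raise ValueError("initial similarity V must be positive")
    return v_reduced / v_initial


# Hypothetical values: the full text scores 5.0, the reduced text 3.0.
S = understanding(5.0, 3.0)
```

A value of \( S \) close to 1 means the reduced text keeps most of the weighted similarity to the context.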
The formula for evaluating the quality of text reduction is:
\[ O = S \times \log_2 \left( \frac{a}{b} \right), \, a < b \]
Where:
This formula measures the trade-off between reducing text size and retaining understanding. Higher \( O \) values indicate a better balance.
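The formula can be evaluated directly, as in the sketch below. The definitions of \( a \) and \( b \) are not given in the surviving text, so the code treats them only as positive numbers satisfying the stated constraint \( a < b \); the function name `reduction_quality` is hypothetical. Under that constraint \( \log_2(a/b) \) is negative, so \( O \) is at most zero whenever \( S \) is non-negative.

```python
import math


def reduction_quality(s, a, b):
    """O = S * log2(a / b), evaluated exactly as the formula is written.

    Assumption: a and b are positive numbers with a < b; their meaning
    is not defined here and is left to the caller.
    """
    if not (0 < a < b):
        raise ValueError("expected 0 < a < b")
    return s * math.log2(a / b)


# Hypothetical inputs: S = 0.9, a = 50, b = 100, so log2(a / b) = -1.
O = reduction_quality(0.9, 50, 100)
```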
Rules let you define allowed ranges for the measurements used during text reduction. They constrain how much understanding may be lost and how large the reduced text may be.
For example:
By combining rules for understanding and text size, you can control the quality and efficiency of the text reduction process. For example:
These rules guide the search toward the best reduced versions of the text: those that preserve meaning while staying concise.
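Rules of this kind can be sketched as a simple filter over candidate reductions. Everything below is hypothetical: the `Candidate` type, the use of character count as the size measure, and the thresholds (an understanding of at least 0.8 and a size of at most 60) are chosen only to illustrate how two rules combine.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    understanding: float  # the measurement S for this candidate
    size: int             # assumed size measure, e.g. character count


def satisfies_rules(candidate, min_understanding, max_size):
    """True when the candidate lies inside both allowed ranges."""
    return (
        candidate.understanding >= min_understanding
        and candidate.size <= max_size
    )


# Hypothetical candidates produced by some reduction search.
candidates = [
    Candidate("full version", understanding=1.0, size=100),
    Candidate("medium version", understanding=0.85, size=55),
    Candidate("short version", understanding=0.5, size=20),
]

allowed = [c for c in candidates if satisfies_rules(c, 0.8, 60)]
```

Here the full version fails the size rule and the short version fails the understanding rule, so only the medium version remains.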
I. Removing semantically repetitive elements optimizes the content without changing the information.
$$ I_i \cap I_j \neq \emptyset $$
II. The absence of an element that semantically contradicts another improves the comprehension of the content without altering the information.
$$ I_i = I_j^c $$