Infoveave Data Automation — Text & String
Where single-word tokenization misses context, n-grams capture the phrases that matter. Extract all bigrams and trigrams from customer text to find which two- and three-word combinations define your top feedback themes.
Individual word frequency analysis often misses the meaning carried by word combinations. A review containing the words "product" and "quality" separately tells you less than the bigram "product quality". Support tickets that mention "not" and "working" as separate tokens look identical to tickets that contain "not working" as a connected complaint phrase. N-gram extraction captures these multi-word combinations as single units of analysis. Extract N-Grams generates all n-gram sequences (bigrams, trigrams, or any custom size) from text columns, with optional stop word removal and stemming to clean the output before phrase frequency analysis, topic modeling preparation, or keyword pattern identification.
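The sliding-window idea behind n-gram extraction can be sketched in a few lines of Python. This is a minimal illustration, not Infoveave's implementation: the function name `ngrams`, the sample ticket text, and the whitespace tokenization are all assumptions made for the example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    # A window of size n fits len(tokens) - n + 1 times.
    # range() is empty when the text has fewer than n tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical ticket text, split on whitespace for simplicity.
tokens = "export is not working".split()

unigrams = ngrams(tokens, 1)  # single words lose the link between "not" and "working"
bigrams = ngrams(tokens, 2)   # adjacent pairs keep "not working" together
print(bigrams)
```

With single words, "not" and "working" are counted independently; with size 2, "not working" appears as one countable phrase.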
Extract n-gram sequences (bigrams, trigrams, and custom-length phrases) from text columns in Infoveave. Identify frequent phrase patterns in reviews, support tickets, and survey text, with stop word removal and stemming for clean NLP pipeline preparation.
Extract N-Grams is one step inside a multi-step Infoveave workflow. Chain it with other activities — no code, no manual hand-offs.
Build this workflow visually in Infoveave Data Automation — drag, connect, and schedule with no infrastructure setup.
Real scenarios where this transformation saves hours of manual work.
A retail analytics team processes a product review dataset with a review text column and a product category column. Extract N-Grams with size 2 and stop word removal enabled produces bigrams from each review. The team groups the bigram output by product category and ranks bigrams by frequency to identify the most common two-word phrases associated with each category. Phrases like delivery delay, packaging damaged, and size incorrect emerge clearly from frequency ranking across thousands of reviews.
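The group-and-rank step in this scenario would normally be built with downstream workflow activities. As a rough sketch of the same logic in Python, assuming the one-per-row output is available as (category, bigram) pairs (the rows below are invented for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical one-per-row output: (product category, bigram) pairs.
rows = [
    ("Electronics", "product quality"),
    ("Electronics", "battery drains"),
    ("Electronics", "product quality"),
    ("Clothing", "fabric quality"),
    ("Clothing", "size incorrect"),
    ("Clothing", "size incorrect"),
]

# Count bigram occurrences separately for each category.
counts_by_category = defaultdict(Counter)
for category, bigram in rows:
    counts_by_category[category][bigram] += 1

# Rank bigrams within each category, most frequent first.
top_bigrams = {
    category: counts.most_common(2)
    for category, counts in counts_by_category.items()
}
print(top_bigrams)
```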
A platform support team processes customer ticket descriptions with Extract N-Grams to generate bigrams from each description after stop word removal. The bigram frequency table identifies recurring phrases like login fail, api timeout, data missing, and export error that appear consistently across unrelated tickets. These phrase patterns reveal systemic product issues more accurately than individual word frequencies, which mix verbs, nouns, and adjectives from different contexts without retaining their co-occurrence relationship.
A health data team applies Extract N-Grams with trigram extraction to anonymized patient intake form text to identify three-word symptom phrases that appear with increasing frequency across specific time windows. Trigrams capture three-word clinical phrases such as "persistent chest pain" and "shortness of breath" that single-word tokenization breaks into unrelated tokens. Trigram frequency trends across time periods help identify emerging shifts in symptom profiles for population health monitoring.
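Comparing trigram frequencies between time windows could look like the sketch below. The month labels, the rows, and the helper name `change` are hypothetical; a real workflow would derive the window from a date column.

```python
from collections import Counter, defaultdict

# Hypothetical one-per-row trigram output tagged with an intake month.
rows = [
    ("2024-01", "persistent chest pain"),
    ("2024-01", "shortness of breath"),
    ("2024-02", "persistent chest pain"),
    ("2024-02", "persistent chest pain"),
]

# Count trigram occurrences within each time window.
counts_by_month = defaultdict(Counter)
for month, trigram in rows:
    counts_by_month[month][trigram] += 1


def change(trigram, earlier, later):
    """Difference in a trigram's frequency between two time windows."""
    return counts_by_month[later][trigram] - counts_by_month[earlier][trigram]


rising = change("persistent chest pain", "2024-01", "2024-02")
falling = change("shortness of breath", "2024-01", "2024-02")
print(rising, falling)
```

A positive difference marks a phrase that is becoming more frequent; a trigram absent from a window simply counts as zero.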
Input data (left) is transformed using the configuration below. The output table (right) is ready for dashboards or downstream steps.
Configuration values: Review, One per row, ngrams, Yes, 2, Yes, No, No
Input Data
| Product ID | Category | Review |
|---|---|---|
| P001 | Electronics | The product quality is amazing and very durable. |
| P002 | Clothing | Great fabric quality but the size runs small. |
| P003 | Electronics | Poor product quality and the battery drains fast. |
Output Data
| Product ID | Category | Review | ngrams |
|---|---|---|---|
| P001 | Electronics | The product quality is amazing... | product quality |
| P001 | Electronics | The product quality is amazing... | quality amazing |
| P001 | Electronics | The product quality is amazing... | amazing durable |
| P002 | Clothing | Great fabric quality but the size runs small. | great fabric |
| P002 | Clothing | Great fabric quality but the size runs small. | fabric quality |
| P002 | Clothing | Great fabric quality but the size runs small. | quality size |
| P003 | Electronics | Poor product quality and the battery drains fast. | poor product |
| P003 | Electronics | Poor product quality and the battery drains fast. | product quality |
Key fields to configure in the Infoveave workflow builder. Full reference available in the documentation.
Column To Extract
Select the text column containing natural language content from which n-gram sequences should be extracted. The column is first cleaned (punctuation is removed and text is lowercased) before n-gram sequences are generated from the word sequence. Select the column that contains the most analytically valuable free text in your dataset.
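The cleaning step described above can be approximated as follows. This is a sketch under assumptions: the function name `clean` is hypothetical, and replacing punctuation with a space (rather than deleting it outright) is a choice made for this example, not documented Infoveave behavior.

```python
import re


def clean(text):
    """Lowercase the text, drop punctuation, and split into words."""
    lowered = text.lower()
    # Replace anything that is not a word character or whitespace with a space
    # (assumption: punctuation acts as a word separator).
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return no_punct.split()


words = clean("Poor product quality and the battery drains fast.")
print(words)
```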
Output Method
Choose how extracted n-grams are output. One per row creates a new row for each n-gram with all other columns from the original row repeated — this mode is standard for frequency counting via grouping. One per column places n-grams in sequential numbered columns across the row. JSON stores all n-grams as a JSON array in a single output column for programmatic or ML pipeline input.
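To make the three output shapes concrete, the sketch below builds each one from the same bigram list using plain Python dictionaries. The numbered column names (`ngrams_1`, `ngrams_2`, ...) are an assumption for illustration; the names Infoveave generates may differ.

```python
import json

# Hypothetical input row and the bigrams extracted from its Review column.
row = {"Product ID": "P001", "Category": "Electronics"}
bigrams = ["product quality", "quality amazing", "amazing durable"]

# One per row: repeat the other columns once for each n-gram.
one_per_row = [{**row, "ngrams": gram} for gram in bigrams]

# One per column: spread the n-grams across numbered columns (names assumed).
numbered = {f"ngrams_{i}": gram for i, gram in enumerate(bigrams, start=1)}
one_per_column = {**row, **numbered}

# JSON: keep a single row and store all n-grams as a JSON array string.
as_json = {**row, "ngrams": json.dumps(bigrams)}

print(len(one_per_row), one_per_column, as_json)
```

One per row multiplies the row count, which is what makes grouping and counting straightforward; the other two shapes keep one row per input record.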
Size
Specify the number of words in each n-gram sequence. Size 2 produces bigrams — two-word sequences. Size 3 produces trigrams — three-word sequences. Larger sizes produce fewer but longer phrase sequences per row. Choose the size that matches the phrase length most relevant to your analysis — bigrams are the most common starting point for customer text analysis.
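The trade-off between size and the number of sequences per row can be written as a simple count. The formula assumes a contiguous sliding window with no padding, which is consistent with the sample output (four remaining words in the P001 review yield three bigrams); here k is the number of words left in a row after cleaning and n is the configured size.

```latex
\[
\text{n-grams per row} = \max(0,\; k - n + 1)
\]
\[
k = 6:\quad n = 2 \Rightarrow 5 \text{ bigrams},\qquad n = 3 \Rightarrow 4 \text{ trigrams},\qquad n = 7 \Rightarrow 0
\]
```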
Clear Stop Words
Enable to remove common function words from the word sequence before n-grams are generated. Stop word removal ensures the generated n-grams contain only content-bearing words and avoids producing n-grams composed entirely of function words, such as "of the" or "and a". Enabling stop word removal is strongly recommended for most text analysis because it reduces noise and strengthens the phrase frequency signal.
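The effect of this setting can be seen by generating bigrams from the P001 sample review with and without stop word removal. The stop word list below is a tiny illustrative set chosen for this example; the list Infoveave applies is not specified here.

```python
# A tiny illustrative stop word list (assumption, not Infoveave's list).
STOP_WORDS = {"the", "is", "and", "very", "a", "of", "but"}


def bigrams(tokens):
    """Pair each token with the one that follows it."""
    return [f"{left} {right}" for left, right in zip(tokens, tokens[1:])]


# The P001 review after lowercasing and punctuation removal.
tokens = "the product quality is amazing and very durable".split()

with_stop_words = bigrams(tokens)
without_stop_words = bigrams([t for t in tokens if t not in STOP_WORDS])

print(with_stop_words)
print(without_stop_words)
```

Because stop words are dropped before the window slides, words that were separated only by function words become adjacent, which is how "quality amazing" arises from "quality is amazing".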
Stem Words
Enable to reduce words to their root stem before n-gram sequences are generated. Stemming ensures that "review" and "reviews", "fail" and "failing", and "damage" and "damaged" are treated as the same word form in n-gram sequences. This is useful when you want to aggregate n-gram frequencies across variant word forms and rank phrase patterns by concept frequency rather than by exact surface form.
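To show what stemming does to the word pairs mentioned above, here is a deliberately naive suffix stripper. It is a toy written for this example (the function name `naive_stem` and its rules are assumptions); production stemmers use far more rules, and the algorithm Infoveave uses is not stated here.

```python
def naive_stem(word):
    """Toy suffix stripper for illustration only."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words such as "red" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Drop a trailing "e" so "damage" and "damaged" share the stem "damag".
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word


pairs = [("review", "reviews"), ("fail", "failing"), ("damage", "damaged")]
stems = [(naive_stem(a), naive_stem(b)) for a, b in pairs]
print(stems)
```

Stems such as "damag" are not dictionary words, so stemmed n-grams are better suited to counting than to display.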
Part of Infoveave Data Automation
Extract N-Grams is one of over 80 transformation activities available inside Infoveave workflows. Chain transformations together — no code, no exports, no waiting for IT.