
Extract N-Grams

Infoveave Data Automation — Text & String

Where single-word tokenization misses context, n-grams capture the phrases that matter — extract all bigrams and trigrams from customer text to find which two-word combinations define your top feedback themes.

Individual word frequency analysis often misses the meaning carried by word combinations. A review containing the words product and quality separately tells you less than the bigram product quality. Support tickets mentioning not and working as separate tokens look identical to tickets mentioning not working as a connected complaint phrase. N-gram extraction captures these multi-word combinations that carry analytical signal as semantic units. Extract N-Grams generates all n-gram sequences — bigrams, trigrams, or any custom size — from text columns, applying stop word removal and stemming to clean the output before phrase frequency analysis, topic modeling preparation, or keyword pattern identification.

Input: Tabular dataset with one or more text columns containing natural language content such as product reviews, support tickets, survey responses, or customer feedback

Output: Tabular dataset with n-gram sequences extracted from text in one of three formats: one n-gram per row, one n-gram per column, or all n-grams as a JSON array in a single output column

What Extract N-Grams does

Extract N-Grams pulls n-gram sequences (bigrams, trigrams, and custom-length phrases) from text columns in Infoveave. Use it to identify frequent phrase patterns in reviews, support tickets, and survey text, with optional stop word removal and stemming for clean NLP pipeline preparation.

When to use Extract N-Grams

  • You are analyzing customer review text, survey responses, or support ticket descriptions and need to identify recurring two-word or three-word phrases that define the main feedback themes — individual word frequency misses the phrase-level meaning
  • You are building an NLP preprocessing pipeline for text classification or topic modeling and need n-gram features extracted from text columns as input for downstream ML model training or feature engineering
  • You have already tokenized text columns and removed stop words, and now need to generate bigrams or trigrams from the cleaned token sequence for phrase-level frequency ranking or pattern identification
  • You want to compare which multi-word phrases appear most frequently across different product categories, customer segments, or time periods to identify experience pattern differences that single-word analysis cannot reveal

When to avoid it

  • You only need individual word-level tokens rather than multi-word phrases — use Tokenizing Text for single-word token extraction with the same NLP preprocessing options including stop word removal and stemming
  • You need to extract specific structured patterns from text using regex — for example extracting product codes, order numbers, or email addresses — use Find Text for pattern-based extraction rather than n-gram sequence generation
  • Your text columns are very short — fewer than four or five words per record — and the n-gram size you want to use equals or exceeds the average record length, which would produce very few or no valid n-gram sequences per row
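
The last point comes down to simple counting: a record with k words left after cleaning yields at most k - n + 1 n-grams of size n, and none once n exceeds k. A minimal Python sketch of that arithmetic follows; the function name `ngram_count` is illustrative and not part of Infoveave.

```python
def ngram_count(num_words: int, size: int) -> int:
    """Number of contiguous n-grams of length `size` in `num_words` words."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # A window of `size` words fits at (num_words - size + 1) start positions.
    return max(0, num_words - size + 1)


# A four-word record: three bigrams, two trigrams, no 5-grams.
for size in (2, 3, 5):
    print(size, ngram_count(4, size))
```

Checking the typical word count of a column against the intended size before running the step is a quick way to spot columns that are too short to be useful.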

Where it fits in your Infoveave automation

Extract N-Grams is one step inside a multi-step Infoveave workflow. Chain it with other activities — no code, no manual hand-offs.

  • Connect: Load product review data, support ticket text, survey responses, or other natural language text into Infoveave
  • Tokenizing Text (optional): Optionally tokenize and clean text columns first if you want to apply stop word removal and stemming as a separate preprocessing step before n-gram extraction
  • Extract N-Grams (you are here): Generate n-gram sequences of the configured size from your text column with stop word removal and stemming options applied
  • Count and Rank: Group by the n-gram column and count occurrences to rank the most frequent phrases across your text dataset
  • Visualize and Automate: Feed phrase frequency rankings into dashboards or word cloud visualizations and schedule the pipeline to refresh automatically on new data

Build this workflow visually in Infoveave Data Automation — drag, connect, and schedule with no infrastructure setup.


How teams use Extract N-Grams

Real scenarios where this transformation saves hours of manual work.

Retail

Identify Recurring Phrase Patterns in Product Review Text

A retail analytics team processes a product review dataset with a review text column and a product category column. Extract N-Grams with size 2 and stop word removal enabled produces bigrams from each review. The team groups the bigram output by product category and ranks bigrams by frequency to identify the most common two-word phrases associated with each category. Phrases like delivery delay, packaging damaged, and size incorrect emerge clearly from frequency ranking across thousands of reviews.
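
The group-and-rank step in this scenario can be pictured with a short Python sketch. It is only an illustration of the counting logic, not how Infoveave implements it; the rows are hypothetical one-per-row output, reduced to category and bigram.

```python
from collections import Counter, defaultdict

# Hypothetical one-per-row output: (category, bigram) pairs.
rows = [
    ("Electronics", "product quality"),
    ("Electronics", "battery drains"),
    ("Electronics", "product quality"),
    ("Clothing", "fabric quality"),
    ("Clothing", "size runs"),
    ("Clothing", "fabric quality"),
]

# Keep one frequency counter per product category.
counts_by_category = defaultdict(Counter)
for category, bigram in rows:
    counts_by_category[category][bigram] += 1

for category, counts in counts_by_category.items():
    # most_common() orders phrases from most to least frequent.
    print(category, counts.most_common(2))
```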

Technology

Extract Recurring Issue Phrases from Support Ticket Descriptions

A platform support team processes customer ticket descriptions with Extract N-Grams to generate bigrams from each description after stop word removal. The bigram frequency table identifies recurring phrases like login fail, api timeout, data missing, and export error that appear consistently across unrelated tickets. These phrase patterns reveal systemic product issues more accurately than individual word frequencies, which mix verbs, nouns, and adjectives from different contexts without retaining their co-occurrence relationship.

Healthcare

Analyze Multi-Word Symptom Phrases in Anonymized Patient Intake Notes

A health data team applies Extract N-Grams with trigram extraction to anonymized patient intake form text to identify three-word symptom phrases that appear with increasing frequency across specific time windows. Trigrams capture clinical phrase patterns — persistent chest pain, shortness of breath, difficulty swallowing — that single-word tokenization dissects into unrelated tokens. Trigram frequency trends over time periods help identify emerging symptom profile shifts for population health monitoring.

See Extract N-Grams in action

The configuration below is applied to the input data to produce the output table, which is ready for dashboards or downstream steps.

Column To Extract: Review
Output Method: One per row
Output Column: ngrams
Include Original: Yes
Size: 2
Clear Stop Words: Yes
Stem Words: No
Sort Words: No

Input Data

Product ID | Category | Review
P001 | Electronics | The product quality is amazing and very durable.
P002 | Clothing | Great fabric quality but the size runs small.
P003 | Electronics | Poor product quality and the battery drains fast.

Output Data

Product ID | Category | Review | ngrams
P001 | Electronics | The product quality is amazing... | product quality
P001 | Electronics | The product quality is amazing... | quality amazing
P001 | Electronics | The product quality is amazing... | amazing durable
P002 | Clothing | Great fabric quality but the size runs small. | great fabric
P002 | Clothing | Great fabric quality but the size runs small. | fabric quality
P002 | Clothing | Great fabric quality but the size runs small. | quality size
P003 | Electronics | Poor product quality and the battery drains fast. | poor product
P003 | Electronics | Poor product quality and the battery drains fast. | product quality
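
For readers who want to trace the sample by hand, here is a rough Python sketch of the same configuration (size 2, stop words cleared, one n-gram per row, original columns kept). It is not Infoveave's implementation: the stop word list and the helper `extract_ngrams` are assumptions made for illustration, and the product's own list may differ.

```python
import re

# Assumed stop word list; the list Infoveave applies is not documented here.
STOP_WORDS = {"the", "is", "and", "very", "but", "a", "an", "of"}


def extract_ngrams(text, size=2, clear_stop_words=True):
    # Lowercase, drop punctuation, then split on whitespace.
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    if clear_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    # Slide a window of `size` words across the cleaned sequence.
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


reviews = [
    ("P001", "Electronics", "The product quality is amazing and very durable."),
    ("P002", "Clothing", "Great fabric quality but the size runs small."),
    ("P003", "Electronics", "Poor product quality and the battery drains fast."),
]

# One per row with Include Original = Yes: source columns repeat per n-gram.
output_rows = [
    (product_id, category, review, ngram)
    for product_id, category, review in reviews
    for ngram in extract_ngrams(review)
]
for row in output_rows:
    print(row)
```

Under these assumptions the sketch reproduces the bigrams listed for each product and also emits the later bigrams in each review (for example size runs for P002), so the sample output above may be an excerpt.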

Configuration

Key fields to configure in the Infoveave workflow builder. Full reference available in the documentation.

Column To Extract

Select the text column containing natural language content from which n-gram sequences should be extracted. The column is first cleaned (punctuation is removed and text is lowercased) before n-gram sequences are generated from the word sequence. Select the column that contains the most analytically valuable free text in your dataset.

Output Method

Choose how extracted n-grams are output. One per row creates a new row for each n-gram with all other columns from the original row repeated — this mode is standard for frequency counting via grouping. One per column places n-grams in sequential numbered columns across the row. JSON stores all n-grams as a JSON array in a single output column for programmatic or ML pipeline input.
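
The three shapes can be compared side by side in a small Python sketch. The row values are illustrative, and the numbered column names (`ngrams_1`, `ngrams_2`, ...) are an assumption; the names Infoveave generates may differ.

```python
import json

# Source columns and bigrams for one input row (illustrative values).
row = {"Product ID": "P001", "Category": "Electronics"}
ngrams = ["product quality", "quality amazing", "amazing durable"]

# One per row: the source columns repeat once for every n-gram.
one_per_row = [{**row, "ngrams": gram} for gram in ngrams]

# One per column: numbered columns across a single row (naming is assumed).
one_per_column = {
    **row,
    **{f"ngrams_{i}": gram for i, gram in enumerate(ngrams, start=1)},
}

# JSON: every n-gram serialized into one JSON array in a single column.
as_json = {**row, "ngrams": json.dumps(ngrams)}

print(one_per_row)
print(one_per_column)
print(as_json)
```

The row-based shape is the one that feeds directly into a group-and-count step; the other two keep one output row per input row.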

Size

Specify the number of words in each n-gram sequence. Size 2 produces bigrams — two-word sequences. Size 3 produces trigrams — three-word sequences. Larger sizes produce fewer but longer phrase sequences per row. Choose the size that matches the phrase length most relevant to your analysis — bigrams are the most common starting point for customer text analysis.
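
To see how size trades count against length, the sliding window can be sketched in a few lines of Python. The helper name `ngrams` and the word list are illustrative only.

```python
def ngrams(words, size):
    # Zipping `size` staggered views of the list pairs each word
    # with the words that follow it; zip stops at the shortest view.
    return [" ".join(gram) for gram in zip(*(words[i:] for i in range(size)))]


words = ["poor", "product", "quality", "battery", "drains", "fast"]

bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(len(bigrams), bigrams)
print(len(trigrams), trigrams)
```

With six words, size 2 gives five sequences and size 3 gives four, which is the "fewer but longer" effect described above.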

Clear Stop Words

Enable to remove common function words from the word sequence before n-grams are generated. Stop word removal ensures the generated n-grams contain only content-bearing words and avoids producing n-grams that are entirely composed of function words like of the or and a. Enabling stop word removal is strongly recommended for quality text analysis — it significantly reduces noise and improves phrase frequency signal.
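
A small before-and-after comparison in Python shows the effect. The stop word set here is an assumption for illustration, not the list Infoveave uses.

```python
# Assumed stop word list, for illustration only.
STOP_WORDS = {"the", "of", "and", "a", "is"}


def bigrams(words):
    # Pair each word with the word that follows it.
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


words = "the quality of the fabric is great".split()

with_stop_words = bigrams(words)
without_stop_words = bigrams([w for w in words if w not in STOP_WORDS])

print(with_stop_words)
print(without_stop_words)
```

Note that removing stop words before pairing joins content words that were not adjacent in the original sentence (quality and fabric here), which is usually what phrase frequency analysis wants but is worth keeping in mind when reading the output.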

Stem Words

Enable to reduce words to their root stem before n-gram sequences are generated. Stemming ensures that review and reviews, fail and failing, damage and damaged are treated as the same word form in n-gram sequences. This is useful when you want to aggregate n-gram frequencies across variant word forms and build phrase pattern rankings based on concept frequency rather than exact surface form frequency.
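
The aggregation effect can be illustrated with a deliberately naive suffix stripper in Python. This is a toy stand-in, not the stemming algorithm Infoveave applies (which is not specified here); real stemmers such as Porter's use far more careful rules.

```python
from collections import Counter

# Toy suffix list, checked in order; an assumption for illustration only.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def toy_stem(word, min_stem=3):
    for suffix in SUFFIXES:
        # Strip the first matching suffix if enough of the word remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


phrases = ["login fails", "login failing", "login failed"]

# Stem each word, then count the resulting bigrams.
stemmed = [" ".join(toy_stem(w) for w in phrase.split()) for phrase in phrases]
counts = Counter(stemmed)
print(counts)
```

Without stemming these would be three separate bigrams with a count of one each; after stemming they collapse into a single phrase counted three times.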



Part of Infoveave Data Automation

80+ transformations. Zero manual steps.

Extract N-Grams is one of over 80 transformation activities available inside Infoveave workflows. Chain transformations together — no code, no exports, no waiting for IT.


© 2026 Noesys Software Pvt Ltd

Infoveave® is a product of Noesys

All Rights Reserved