Tokenizing Text

Infoveave Data Automation — Text & String

Tokenizing Text breaks text content into individual words, removes noise like stop words, and optionally stems words to their root — giving NLP pipelines and word frequency visualizations clean, structured token data to work with.

Support tickets, product reviews, survey responses, customer feedback, and notes fields all contain analysis-critical information locked in unstructured prose. Word frequency analysis, word cloud generation, keyword extraction, and sentiment model input all require text to be broken into discrete, normalized tokens before analysis can begin. Tokenizing Text handles the complete preprocessing sequence: splitting on whitespace, removing stop words that add noise without meaning, optionally stemming words to group related forms, and outputting tokens in the format that downstream analysis requires. The one-token-per-row output is ready for grouping and counting. The JSON output is ready for programmatic processing.

Input: Tabular dataset with one or more text columns containing natural language content such as product reviews, support tickets, descriptions, or notes fields

Output: Tabular dataset with tokens extracted from text columns in one of three formats: one token per row, one token per column, or all tokens as a JSON array in a single output column

What Tokenizing Text does

Tokenizing Text splits text columns into individual word tokens, with NLP options for stop word removal and word stemming, and outputs the result as one token per row, one token per column, or a JSON array. Use it to prepare text data for word frequency, sentiment, and text analytics pipelines.

When to use Tokenizing Text

  • You have free-text columns — customer reviews, support ticket descriptions, survey responses, or product notes — that you want to analyze for word frequency, keyword patterns, or as input for word cloud visualizations
  • You are building an NLP preprocessing step in a no-code data pipeline and need to tokenize text columns and optionally remove stop words and apply word stemming before feeding tokens into downstream analysis or ML models
  • You want to compare which words appear most frequently across different customer segments, time periods, or product categories by preparing one-token-per-row data that can then be grouped and counted
  • You need to prepare text columns for n-gram extraction and want to remove stop words and stem words at the tokenization stage before applying Extract N-Grams in a subsequent step

When to avoid it

  • Your text columns contain numerical or categorical coded data rather than natural language prose — tokenization is designed for natural language text and will treat every code segment and delimited value as a word token
  • You only need to split text based on a specific character delimiter rather than on natural language word boundaries — use Split Column for delimiter-based splitting which does not apply NLP stop word or stemming logic
  • You need to extract specific patterns from text like email addresses, numbers, or codes using regex matching — use Find Text for pattern-based extraction rather than word-boundary tokenization

Where it fits in your Infoveave automation

Tokenizing Text is one step inside a multi-step Infoveave workflow. Chain it with other activities — no code, no manual hand-offs.

Connect: Load text data (reviews, ticket descriptions, survey responses, or notes fields) containing natural language text columns
Tokenizing Text (you are here): Break text columns into word tokens with stop word removal and optional stemming to produce clean, analysis-ready token data
Count or Group: Group by the token column and count occurrences to produce word frequency tables, or pass tokens to Extract N-Grams for phrase-level analysis
Visualize: Feed token frequency data into word cloud visualizations, bar charts, or trend analysis dashboards in Infoveave
Automate: Schedule the text preprocessing pipeline to update token frequency analysis automatically as new text records arrive

Build this workflow visually in Infoveave Data Automation — drag, connect, and schedule with no infrastructure setup.

How teams use Tokenizing Text

Real scenarios where this transformation saves hours of manual work.

Retail

Analyze Product Review Keywords by Star Rating

A retail analytics team has a product review dataset with a text column for review content and a star rating column. Tokenizing Text processes the review column with stop word removal enabled, producing one-token-per-row output. The team groups the tokenized output by star rating and counts token frequency to identify which words appear most in 1-star versus 5-star reviews — revealing recurring complaints in low ratings and consistent praise themes in high ratings without any manual content reading.
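
For readers who want to see the logic behind the group-and-count step in this scenario, here is a minimal Python sketch. It is an illustration only and not Infoveave's implementation: it assumes the one-token-per-row output has been exported as (star rating, token) pairs, and the sample rows and variable names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical one-token-per-row output: (star_rating, token) pairs.
rows = [
    (1, "broken"), (1, "late"), (1, "broken"), (1, "refund"),
    (5, "fast"), (5, "great"), (5, "fast"), (5, "quality"),
]

# Group tokens by star rating, then count occurrences within each group.
counts_by_rating = defaultdict(Counter)
for rating, token in rows:
    counts_by_rating[rating][token] += 1

# Show the most frequent token for each rating.
for rating in sorted(counts_by_rating):
    print(rating, counts_by_rating[rating].most_common(1))
```

Inside Infoveave the same result would come from a grouping step placed after Tokenizing Text, with no code required.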

Technology

Prepare Support Ticket Descriptions for Word Frequency Analysis

A customer success team processes a support ticket dataset with a description field containing the customer-reported issue in free text. Tokenizing Text with stop word removal and word stemming enabled produces one token per row with stems like connect for connection and connected. The team aggregates token counts to identify which issue themes dominate support volume each month, then uses those findings to prioritize documentation improvements and bug fix queues.

Healthcare

Extract Key Terms from Patient-Reported Symptoms for Trend Analysis

A health data analytics team processes anonymized patient intake forms containing a free-text symptoms field. Tokenizing Text removes stop words (common terms that appear uniformly and do not differentiate conditions) and stems words to group clinical term variants. The one-token-per-row output is then aggregated by time period and care category to identify emerging symptom term frequency patterns that may indicate seasonal health trend shifts.

See Tokenizing Text in action

The input data is transformed using the configuration below. The output table is ready for dashboards or downstream steps.

Column Names: Description
Option Mode: One token per row
Output Column: tokens
Include Original: Yes
Clear Stop Words: Yes
Stem Words: No
Sort Words: No

Input Data

Employee ID | Name | Description
E001 | John Doe | This is a sample text for tokenization.
E002 | Marie Dupont | Another example sentence for NLP analysis.
E003 | Carlos Gomez | Text mining requires clean preprocessing steps.

Output Data

Employee ID | Name | Description | tokens
E001 | John Doe | This is a sample text for tokenization. | sample
E001 | John Doe | This is a sample text for tokenization. | text
E001 | John Doe | This is a sample text for tokenization. | tokenization
E002 | Marie Dupont | Another example sentence for NLP analysis. | example
E002 | Marie Dupont | Another example sentence for NLP analysis. | sentence
E002 | Marie Dupont | Another example sentence for NLP analysis. | NLP
E002 | Marie Dupont | Another example sentence for NLP analysis. | analysis
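
The transformation shown above can be approximated in plain Python to make the mechanics concrete. This is a sketch under stated assumptions, not Infoveave's implementation: the stop word set, the punctuation-stripping rule, and the `tokenize` helper are hypothetical, chosen so that the sketch reproduces the E001 and E002 rows of the output table.

```python
# Assumed stop word list for illustration only; the product's actual list may differ.
STOP_WORDS = {"this", "is", "a", "for", "another", "the", "and", "of"}

def tokenize(text, clear_stop_words=True):
    """Split on whitespace, strip surrounding punctuation, optionally drop stop words."""
    tokens = []
    for raw in text.split():
        word = raw.strip(".,;:!?\"'()")
        if not word:
            continue
        if clear_stop_words and word.lower() in STOP_WORDS:
            continue
        tokens.append(word)
    return tokens

records = [
    {"Employee ID": "E001", "Name": "John Doe",
     "Description": "This is a sample text for tokenization."},
    {"Employee ID": "E002", "Name": "Marie Dupont",
     "Description": "Another example sentence for NLP analysis."},
]

# One token per row, keeping the original columns (Include Original: Yes).
output_rows = [
    {**record, "tokens": token}
    for record in records
    for token in tokenize(record["Description"])
]

for row in output_rows:
    print(row["Employee ID"], row["tokens"])
```

Each input record is repeated once per surviving token, which is why the output has more rows than the input.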

Configuration

Key fields to configure in the Infoveave workflow builder. Full reference available in the documentation.

Column Names

Select one or more text columns to tokenize. All selected columns are processed in the same step with the same NLP options applied consistently. For columns that require different processing options — for example one column with stemming and another without — add separate Tokenizing Text steps for each.

Option Mode

Choose how tokens are output. One token per row creates a new row for each token, making it suitable for grouping and counting word frequencies. One token per column places each token in a sequentially numbered output column — suitable when you need a fixed-width token array. JSON stores all tokens for a row as a JSON array in a single output column — suitable for programmatic or ML pipeline consumption.
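
To make the difference between the three modes concrete, the following Python sketch shapes the same token list in each of the three ways. It is illustrative only: the numbered column names (`tokens_1`, `tokens_2`, and so on) are an assumption, since the exact naming Infoveave uses for one-token-per-column output is not stated here.

```python
import json

tokens = ["sample", "text", "tokenization"]  # tokens for one input row
row_id = "E001"

# One token per row: the input row is repeated once per token.
per_row = [{"Employee ID": row_id, "tokens": t} for t in tokens]

# One token per column: sequentially numbered columns (naming is assumed).
per_column = {"Employee ID": row_id}
per_column.update({f"tokens_{i}": t for i, t in enumerate(tokens, start=1)})

# JSON: all tokens for the row as a JSON array in a single column.
as_json = {"Employee ID": row_id, "tokens": json.dumps(tokens)}

print(per_row)
print(per_column)
print(as_json)
```

The per-row shape suits grouping and counting, the per-column shape keeps one row per input record, and the JSON shape can be parsed back into a list by downstream code.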

Clear Stop Words

Enable stop word removal to filter out common function words — is, the, a, and, for, of — that appear frequently in text but carry no analytical signal. Stop word removal narrows the output to content-bearing words and significantly improves the quality of word frequency analysis and word cloud visualizations. The stop word list covers common English function words.

Stem Words

Enable stemming to reduce words to their root stem — connection becomes connect, running becomes run, analyses becomes analys. Stemming consolidates variant forms of the same concept so they aggregate together in frequency counts rather than appearing as separate tokens. Stemming is particularly useful before n-gram extraction or when building frequency rankings.
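
The consolidation effect of stemming can be illustrated with a deliberately small Python sketch. The `toy_stem` function below is a hypothetical suffix stripper written only for this illustration; it is not the algorithm Infoveave uses and it is far cruder than an established stemmer such as the Porter stemmer, but it reproduces the examples given above.

```python
from collections import Counter

SUFFIXES = ("ion", "ing", "ed", "es", "s")  # toy rule set, not a real stemming algorithm

def toy_stem(word):
    """Strip the first matching suffix, then undo a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "running" -> "runn" -> "run"; vowels and l, s, z are left doubled.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

tokens = ["connection", "connected", "connect", "running", "run", "analyses"]

# Variant forms collapse onto a shared stem, so their counts aggregate.
stem_counts = Counter(toy_stem(t) for t in tokens)
print(stem_counts)
```

Without stemming, the six tokens would be counted as six separate entries; with it, the three connect variants and the two run variants each roll up into a single entry.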

Sort Words

Enable alphabetical sorting of tokens before output. Sorting is mainly useful in JSON and one-per-column modes where a consistent token order across rows makes downstream processing more predictable. In one-token-per-row mode, sorting affects the row order within each original row's token group.

Part of Infoveave Data Automation

80+ transformations. Zero manual steps.

Tokenizing Text is one of over 80 transformation activities available inside Infoveave workflows. Chain transformations together — no code, no exports, no waiting for IT.

© 2026 Noesys Software Pvt Ltd

Infoveave® is a product of Noesys

All Rights Reserved