Infoveave Data Automation — Text & String
Tokenizing Text breaks text content into individual words, removes noise like stop words, and optionally stems words to their root — giving NLP pipelines and word frequency visualizations clean, structured token data to work with.
Support tickets, product reviews, survey responses, customer feedback, and notes fields all contain analysis-critical information locked in unstructured prose. Word frequency analysis, word cloud generation, keyword extraction, and sentiment model input all require text to be broken into discrete, normalized tokens before analysis can begin. Tokenizing Text handles the complete preprocessing sequence — splitting on whitespace, removing stop words that add noise without meaning, applying word stems to group related forms, and outputting in the format that downstream analysis requires. The one-token-per-row output is ready for grouping and counting. The JSON output is ready for programmatic processing.
Tokenizing Text is one step inside a multi-step Infoveave workflow. Chain it with other activities — no code, no manual hand-offs.
Build this workflow visually in Infoveave Data Automation — drag, connect, and schedule with no infrastructure setup.
Real scenarios where this transformation saves hours of manual work.
A retail analytics team has a product review dataset with a text column for review content and a star rating column. Tokenizing Text processes the review column with stop word removal enabled, producing one-token-per-row output. The team groups the tokenized output by star rating and counts token frequency to identify which words appear most in 1-star versus 5-star reviews — revealing recurring complaints in low ratings and consistent praise themes in high ratings without any manual content reading.
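The group-and-count step in this scenario can be sketched in a few lines of Python. This is only an illustration of the logic, not Infoveave's implementation: the `token_rows` data is made up, and the `(star_rating, token)` layout is an assumed stand-in for the one-token-per-row output.

```python
from collections import Counter, defaultdict

# Made-up sample of tokenized output: one (star_rating, token) pair per row.
token_rows = [
    (1, "broken"), (1, "refund"), (1, "broken"), (1, "late"),
    (5, "love"), (5, "fast"), (5, "love"), (5, "quality"),
]

# Group by star rating, then count how often each token occurs in that group.
counts_by_rating = defaultdict(Counter)
for rating, token in token_rows:
    counts_by_rating[rating][token.lower()] += 1

# Show the two most frequent tokens for each rating.
for rating in sorted(counts_by_rating):
    print(rating, counts_by_rating[rating].most_common(2))
```

Comparing the top tokens of the 1-star group with those of the 5-star group is what surfaces the complaint and praise themes described above.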
A customer success team processes a support ticket dataset with a description field containing the customer-reported issue in free text. Tokenizing Text with stop word removal and word stemming enabled produces one token per row with stems like connect for connection and connected. The team aggregates token counts to identify which issue themes dominate support volume each month, then uses those findings to prioritize documentation improvements and bug fix queues.
A health data analytics team processes anonymized patient intake forms containing a free-text symptoms field. Tokenizing Text removes stop words (common terms that appear uniformly and do not differentiate conditions) and stems words to group clinical term variants. The one-token-per-row output is then aggregated by time period and care category to identify emerging patterns in symptom term frequency that may indicate seasonal shifts in health trends.
The input data is transformed using the configuration below. The output table is ready for dashboards or downstream steps.
Configuration values: Description, One token per row, tokens, Yes, Yes, No, No
Input Data
| Employee ID | Name | Description |
|---|---|---|
| E001 | John Doe | This is a sample text for tokenization. |
| E002 | Marie Dupont | Another example sentence for NLP analysis. |
| E003 | Carlos Gomez | Text mining requires clean preprocessing steps. |
Output Data
| Employee ID | Name | Description | tokens |
|---|---|---|---|
| E001 | John Doe | This is a sample text for tokenization. | sample |
| E001 | John Doe | This is a sample text for tokenization. | text |
| E001 | John Doe | This is a sample text for tokenization. | tokenization |
| E002 | Marie Dupont | Another example sentence for NLP analysis. | example |
| E002 | Marie Dupont | Another example sentence for NLP analysis. | sentence |
| E002 | Marie Dupont | Another example sentence for NLP analysis. | NLP |
| E002 | Marie Dupont | Another example sentence for NLP analysis. | analysis |
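The E001 and E002 rows in the output table above can be reproduced with a short Python sketch. It is a rough approximation rather than Infoveave's actual logic: the `STOP_WORDS` set, the punctuation stripping, and the `tokenize` helper are all assumptions chosen to match the sample.

```python
# Assumed stop word list, chosen to match the sample output above;
# the product's actual list is not documented in this page.
STOP_WORDS = {"this", "is", "a", "for", "the", "and", "of", "another"}

def tokenize(text, clear_stop_words=True):
    """Split on whitespace, strip surrounding punctuation, optionally drop stop words."""
    tokens = []
    for raw in text.split():
        word = raw.strip(".,;:!?\"'()")
        if not word:
            continue
        if clear_stop_words and word.lower() in STOP_WORDS:
            continue
        tokens.append(word)
    return tokens

rows = [
    {"Employee ID": "E001", "Description": "This is a sample text for tokenization."},
    {"Employee ID": "E002", "Description": "Another example sentence for NLP analysis."},
]

# One token per row: repeat the source row once for every surviving token.
token_rows = [
    {**row, "tokens": token}
    for row in rows
    for token in tokenize(row["Description"])
]

for r in token_rows:
    print(r["Employee ID"], r["tokens"])
```

Each source row is duplicated once per token, which is why the original columns repeat in the output table.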
Key fields to configure in the Infoveave workflow builder. Full reference available in the documentation.
Column Names
Select one or more text columns to tokenize. All selected columns are processed in the same step with the same NLP options applied consistently. For columns that require different processing options — for example one column with stemming and another without — add separate Tokenizing Text steps for each.
Option Mode
Choose how tokens are output. One token per row creates a new row for each token, making it suitable for grouping and counting word frequencies. One token per column places each token in a sequentially numbered output column — suitable when you need a fixed-width token array. JSON stores all tokens for a row as a JSON array in a single output column — suitable for programmatic or ML pipeline consumption.
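To make the three shapes concrete, here is a minimal Python sketch of how one row's tokens might look in each mode. The numbered column names (`token_1`, `token_2`, and so on) are an assumption for illustration; the names Infoveave actually generates may differ.

```python
import json

tokens = ["sample", "text", "tokenization"]
source = {"Employee ID": "E001"}

# One token per row: the source row is repeated once per token.
per_row = [{**source, "tokens": t} for t in tokens]

# One token per column: tokens land in sequentially numbered columns.
# The "token_N" naming is hypothetical.
per_column = {**source, **{f"token_{i}": t for i, t in enumerate(tokens, start=1)}}

# JSON: all tokens for the row stored as a JSON array in a single column.
as_json = {**source, "tokens": json.dumps(tokens)}

print(per_row)
print(per_column)
print(as_json)
```

The per-row shape suits grouping and counting, while the JSON shape keeps one record per source row for code that parses the array itself.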
Clear Stop Words
Enable stop word removal to filter out common function words — is, the, a, and, for, of — that appear frequently in text but carry no analytical signal. Stop word removal narrows the output to content-bearing words and significantly improves the quality of word frequency analysis and word cloud visualizations. The stop word list covers common English function words.
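The effect on frequency counts can be seen in a small Python sketch. The stop word set below contains only the six examples named above, and the sample sentence is made up; the product's full list is assumed to be larger.

```python
from collections import Counter

# Only the function words named in the text; the real list is assumed to be longer.
STOP_WORDS = {"is", "the", "a", "and", "for", "of"}

text = "the battery is great and the price of the battery is fair"
tokens = text.split()

raw_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in STOP_WORDS)

print(raw_counts.most_common(1))      # dominated by a function word
print(content_counts.most_common(1))  # dominated by a content word
```

Without filtering, the most frequent token is a function word; with filtering, the top of the ranking is a content-bearing word.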
Stem Words
Enable stemming to reduce words to their root stem — connection becomes connect, running becomes run, analyses becomes analys. Stemming consolidates variant forms of the same concept so they aggregate together in frequency counts rather than appearing as separate tokens. Stemming is particularly useful before n-gram extraction or when building frequency rankings.
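The consolidation effect can be illustrated with a toy suffix stripper in Python. This `toy_stem` function is a deliberately simplified stand-in written for this example; it is not the stemming algorithm Infoveave uses, and it only handles a handful of suffixes.

```python
from collections import Counter

def toy_stem(word):
    """Very rough suffix stripping for illustration only (not a real stemmer)."""
    w = word.lower()
    for suffix in ("ion", "ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeioulsz":
                w = w[:-1]
            break
    return w

words = ["connection", "connected", "connect", "running", "run", "analyses"]
stem_counts = Counter(toy_stem(w) for w in words)
print(stem_counts)
```

Six distinct surface forms collapse into three stems, so variants of the same concept are counted together instead of splitting the frequency across separate tokens.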
Sort Words
Enable alphabetical sorting of tokens before output. Sorting is mainly useful in JSON and one-per-column modes where a consistent token order across rows makes downstream processing more predictable. In one-token-per-row mode, sorting affects the row order within each original row's token group.
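A short Python sketch shows why a consistent order helps in JSON mode. The two token lists are made-up examples, and Python's default string ordering is used here as an assumption; the exact collation Infoveave applies (for example, how it treats upper and lower case) is not stated.

```python
import json

# Two rows containing the same tokens in a different original order.
row_a = ["shipping", "delay", "refund"]
row_b = ["refund", "shipping", "delay"]

unsorted_json = (json.dumps(row_a), json.dumps(row_b))
sorted_json = (json.dumps(sorted(row_a)), json.dumps(sorted(row_b)))

print(unsorted_json[0] == unsorted_json[1])  # same words, different strings
print(sorted_json[0] == sorted_json[1])      # identical after sorting
```

With sorting enabled, rows that contain the same set of tokens serialize to the same JSON string, which makes comparison and deduplication downstream simpler.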
Part of Infoveave Data Automation
Tokenizing Text is one of over 80 transformation activities available inside Infoveave workflows. Chain transformations together — no code, no exports, no waiting for IT.