Chris Webb's BI Blog

Analysis Services, MDX, PowerPivot, DAX and anything BI-related

Removing Punctuation From Text In Power Query

with 6 comments

In one of my first posts on Power Query (and still my favourite) I found the top 100 words in the complete works of Shakespeare. As always when you’re learning a new tool, though, I look back at what I wrote then and realise there are better ways of doing things… one of which is removing all of the punctuation from a piece of text.

In the post I’m talking about I used the following expression to remove punctuation:

Text.Remove(RemoveLineBreaks,{"," ,"." ,"?" ,";" ,":" ,";" ,"’","@" ,"#" ,"~" ,"{" ,"[" ,"}" ,"]" ,"(" ,")", "*"})

However, as you can see, it only removes the punctuation could be bothered to hard-code in the list – luckily Shakespeare didn’t use too much exotic punctuation! There is a better way of doing this, though, and it relies on the Character.FromNumber() function which takes a number and returns the equivalent Unicode character; for example, the expression Character.FromNumber(38) returns an ampersand & symbol.

Looking at the list of Unicode characters here it’s easy to identify the ranges of numbers that represent punctuation characters. These ranges can then be represented in M as lists. Remembering that lists of ranges of numbers can be declared easily by using the notation {1..5}, which results in the list of numbers {1,2,3,4,5}, you can write a list of lists containing the numbers representing the Latin punctuation characters in Unicode like so:

{{0..31},{33..47},{58..64},{91..96},{123..191}}

and you can use List.Combine() to turn this list of lists into a single list of numbers.  Finally, you can use List.Transform() and Character.FromNumber() to get a list of the Unicode characters for these numbers, and pass that list to Text.Remove(). Here’s a query showing everything:

let

    //get a list of lists containing the numbers of Unicode punctuation characters

    numberlists = {{0..31},{33..47},{58..64},{91..96},{123..191}},

    //turn this into a single list

    combinedlist = List.Combine(numberlists),

    //get a list of all the punctuation characters that these numbers represent

    punctuationlist = List.Transform(combinedlist, each Character.FromNumber(_)),

    //some text to test this on

    inputtext = "Hello! My name is Chris, and I'm hoping that this *cool* post will help you!",

    //the text with punctuation removed

    outputtext = Text.Remove(inputtext, punctuationlist)

in

    outputtext

 

The output of this query is this:

image

You can download the sample workbook for this post here.

Written by Chris Webb

August 18, 2014 at 11:20 am

Posted in Power Query

6 Responses

Subscribe to comments with RSS.

  1. […] Removing Punctuation From Text In Power Query (Chris Webb) […]

  2. Hi Chris,

    Could you have an Excel Table with a list of punction and/or abbreviations (e.g. Ltd, Co, NZ, AU, UK) and feed that into Text.Remove() ?. Then if ever something new comes a long that you want to remove from the input, users can just add it to the table.

    Regards,

    John

    John

    August 20, 2014 at 11:34 pm

  3. Another option could be

    punctuationlist = {Character.FromNumber(0)..Character.FromNumber(33)}However, given the number of ranges involved, the transform function is probably cleaner.

    Colin Banfield

    August 27, 2014 at 6:22 pm

    • Crap, my concatenation with other lists was removed. Let’s try again:

      punctuationlist = {Character.FromNumber(0)..Character.FromNumber(33)}

      Colin Banfield

      August 27, 2014 at 6:26 pm

  4. Nope, must be the ampersand character (“&”)
    messing things up.

    punctuationlist = {Character.FromNumber(0)..Character.FromNumber(33)}
    {Character.FromNumber(47)..Character.FromNumber(64)}
    {Character.FromNumber(58)..Character.FromNumber(64)}
    {Character.FromNumber(91)..Character.FromNumber(96)}
    {Character.FromNumber(123)..Character.FromNumber(191)}

    There would be a concatenation character (ampersand) at the end of each line, or the whole thing could be wrapped in List.Combine, as you have already done.

    Colin Banfield

    August 27, 2014 at 6:31 pm


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 3,189 other followers

%d bloggers like this: