Removing Punctuation From Text In Power Query

In one of my first posts on Power Query (and still my favourite) I found the top 100 words in the complete works of Shakespeare. As always when you’re learning a new tool, though, I look back at what I wrote then and realise there are better ways of doing things… one of which is removing all of the punctuation from a piece of text.

In the post I’m talking about I used the following expression to remove punctuation:

Text.Remove(RemoveLineBreaks,{"," ,"." ,"?" ,";" ,":" ,";" ,"’","@" ,"#" ,"~" ,"{" ,"[" ,"}" ,"]" ,"(" ,")", "*"})

However, as you can see, it only removes the punctuation could be bothered to hard-code in the list – luckily Shakespeare didn’t use too much exotic punctuation! There is a better way of doing this, though, and it relies on the Character.FromNumber() function which takes a number and returns the equivalent Unicode character; for example, the expression Character.FromNumber(38) returns an ampersand & symbol.

Looking at the list of Unicode characters here it’s easy to identify the ranges of numbers that represent punctuation characters. These ranges can then be represented in M as lists. Remembering that lists of ranges of numbers can be declared easily by using the notation {1..5}, which results in the list of numbers {1,2,3,4,5}, you can write a list of lists containing the numbers representing the Latin punctuation characters in Unicode like so:

{{0..31},{33..47},{58..64},{91..96},{123..191}}

and you can use List.Combine() to turn this list of lists into a single list of numbers.  Finally, you can use List.Transform() and Character.FromNumber() to get a list of the Unicode characters for these numbers, and pass that list to Text.Remove(). Here’s a query showing everything:

let

    //get a list of lists containing the numbers of Unicode punctuation characters

    numberlists = {{0..31},{33..47},{58..64},{91..96},{123..191}},

    //turn this into a single list

    combinedlist = List.Combine(numberlists),

    //get a list of all the punctuation characters that these numbers represent

    punctuationlist = List.Transform(combinedlist, each Character.FromNumber(_)),

    //some text to test this on

    inputtext = "Hello! My name is Chris, and I'm hoping that this *cool* post will help you!",

    //the text with punctuation removed

    outputtext = Text.Remove(inputtext, punctuationlist)

in

    outputtext

 

The output of this query is this:

image

You can download the sample workbook for this post here.

8 thoughts on “Removing Punctuation From Text In Power Query

  1. Hi Chris,

    Could you have an Excel Table with a list of punction and/or abbreviations (e.g. Ltd, Co, NZ, AU, UK) and feed that into Text.Remove() ?. Then if ever something new comes a long that you want to remove from the input, users can just add it to the table.

    Regards,

    John

  2. Another option could be

    punctuationlist = {Character.FromNumber(0)..Character.FromNumber(33)}However, given the number of ranges involved, the transform function is probably cleaner.

    1. Crap, my concatenation with other lists was removed. Let’s try again:

      punctuationlist = {Character.FromNumber(0)..Character.FromNumber(33)}

  3. Nope, must be the ampersand character (“&”)
    messing things up.

    punctuationlist = {Character.FromNumber(0)..Character.FromNumber(33)}
    {Character.FromNumber(47)..Character.FromNumber(64)}
    {Character.FromNumber(58)..Character.FromNumber(64)}
    {Character.FromNumber(91)..Character.FromNumber(96)}
    {Character.FromNumber(123)..Character.FromNumber(191)}

    There would be a concatenation character (ampersand) at the end of each line, or the whole thing could be wrapped in List.Combine, as you have already done.

  4. Hi Chris,

    What if the unicodes exist all over a whole table? How can I utilise your technique to remove all punctuation characters from the whole table?

    Cheers
    John

  5. Hi Chris,

    Thank you for this post. It works good with Excel, but I found that in Power BI it sometimes are deleting spaces.
    I have checked charcode foe missed symbols and it was 32. It strange as it’s no in the list.
    The text language was Russian and Ukrainian.

Leave a Reply