Sunday, March 14, 2021

Software Algorithms to Solve Messy Data Problems

 By Steve Endow

In the software / consulting world, we're sometimes given data that is messy, and we, the software developers, consultants, or report writers, are tasked with making sense of that data and using it in a business process.  Sometimes we have to do "creative" and often unintuitive things to that data to make it usable.

For example, I was once asked to import employee time clock data into a payroll system.  Simple, right?  In the ideal world, an employee clocks in for work in the morning, and clocks out for work in the afternoon.  But in the real world, it's never this easy.

First, the customer's horrible timeclock system did not differentiate between a "clock in" or a "clock out".  I was simply given two fields:  Employee ID and Date Time.  It was up to me to figure out which entries were "clock in" and which entries were "clock out".

Second, sometimes the employees would take breaks or leave for lunch. They would therefore "clock out" when leaving, and "clock in" when returning to work. So an employee could have up to four records for a single shift.

Third, employees would regularly forget to clock in or clock out.  Maybe they would forget to clock in when they arrived in the morning, but would remember to clock out & in for lunch, and clock out when leaving for the day.  Or maybe they would forget to clock out for lunch.  Or maybe they would forget to clock out at the end of the day. I had to deal with missing records and reconstruct the employee's time.

Fourth, the customer had employees who worked night shifts. Some employees arrived at work at 8pm and left work at 4am.  So if an employee forgot to clock in or out, I could not use time of day to simply assume which record was missing.  I sometimes had to reconstruct a morning clock out to correspond to a clock in the prior evening

Fifth, the time clock data did not have any indication of pay periods. So I had to calculate when every pay period began and ended, and import the appropriate time clock data into payroll transactions that were categorized in the correct pay period.  So if an employee worked on both Saturday and Sunday, I had to put the Saturday time in pay period 1, and the Sunday time in pay period 2.  I also had to know when a work week spanned two fiscal periods--so work on Tuesday might go into pay period 1, while work on Wednesday might go into pay period 2.

(If you are asking why the customer would use such a useless time clock system, I asked that same question.)

Last, as if this wasn't complex enough, I had to clean up and categorize all of this data using only SQL.

Let's just say that it took me several hours and lots of white boarding to figure how to do all of this clean up using a single SQL stored procedure.  Thankfully that project is a distant memory.


I tell that story because this morning, Erik Darling, a SQL Server expert, posted this interesting question on Twitter:

I sense a software algorithm might solve this...

This is a simplified version of his actual situation, but it boils down to this.

He has a situation where he has 8 numbers.  Those numbers, when grouped in 4 pairs, will sum to 4 values.  Thankfully, each of the 8 numbers is only used once, so that helps to simplify the process.  And to further simplify the problem, I'm going to assume that each of the 8 numbers is unique and each of the 4 sums is unique--obviously that's a very big assumption that will likely not often hold in the real world.  But you have to start somewhere...

This is kind of a messy problem.  Which 2 of the 8 numbers should be added for sum 1?  Then which 2 of the remaining 6 numbers should be added for sum 2? etc.

I thought this was an interesting data problem.  At first, I had no idea how this could be solved.  Was there some type of linear algebra matrix that could magically solve this?  If so, my linear algebra class from 27 years ago would be of little use, as I can't remember what day of the week it is.

So, what does a computer software developer do when he or she does not know the "correct" way to do something?

Why, of course, we turn to brute force!

But we still have an opportunity to be clever.  In this case, since Erik works with SQL, I wanted to find a clever way to brute force this problem using SQL.

Given 8 numbers, we need to figure out which 4 pairs of those numbers will produce the desired 4 sums.  To do that, I figured that if we produce every permutation of those 8 numbers, we could then see which 4 pairs give us the "right answer" when added together.

When I think of generating all permutations (all combinations) of a set of data, I think of the SQL CROSS JOIN.  It's a magical tool for generating every combination of two or more data sets.

I mocked this up in Excel to think about it further, using only 4 sample values.

All permutations of adding 4 numbers

Since that seemed to make sense, I mocked up this SQL query to attempt to solve the problem.  One table contains 4 "sums".  Another table contains 8 "addends".  The query uses a CROSS JOIN to find all sum permutations, and exclude numbers from being added to themselves.  It then joins that result set to the desired sums to identify which pairs of numbers can be added to produce the sums.

Brute force SQL using a CROSS JOIN

Here is the GitHub Gist with the sample query.

My SQL CROSS JOIN approach is not elegant or fancy, but it appears to work.

Erik received another possible SQL solution from Daniel Hutmacher.

A different solution using completely different SQL technique

Daniel's SQL query uses CROSS APPLY and OUTER APPLY, which I don't think I've ever used before, so it results in a query that I don't understand, but does appear to produce similar results, showing which pairs of numbers can be added to produce the desired sums.

But, if my CROSS JOIN solution actually works and properly solves the problem, I think it might be more scalable than the CROSS APPLY + OUTER APPLY technique for a more complex scenario.

I scaled up the problem to handle 40 "addends", when placed in groups of 4, sum to produce 10 values.  For example, we need to find which 4 numbers solves A + B + C + D = E.  We have 40 numbers to choose from and 10 values to match against.

To handle this scenario, I essentially only had to add 2 more CROSS JOINs to my query and update the WHERE clause.  

The SQL algorithm seems to easily scale out

Here is the GitHub Gist for this sample query.

If you run my first query and then run my second query, you will notice a significant performance difference.  Obviously there are no indexes on the temp tables, but it likely shows that this solution does not scale well, so you have to be careful where you apply it.

The two different approaches show how you can go about solving these problems in completely different ways, and I'm guessing there are several other interesting ways to solve Erik's problem using SQL, and perhaps other tools.

Can you think of any other creative ways to solve the problem that Erik posed?


Steve Endow is a Microsoft MVP in Los Angeles.  He works with Dynamics 365 Business Central, Microsoft Power Automate, Power Apps, Azure, and .NET.

You can also find him on Twitter and YouTube

No comments:

Post a Comment

All comments must be reviewed and approved before being published. Your comment will not appear immediately.

Docker Engine Enterprise and Mirantis Container Runtime for Business Central Developers

by Steve Endow NOTE:   This post relates to Docker Engine Enterprise and Mirantis Container Runtime on Windows Server .  I don't believe...