Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Couldn't you validate the data on upload or if it's such a common problem just auto-fix it by left padding w/ zeros?


There are data sources that have a clear unambiguous specification of what information is in each field, and exactly how they are formatted, such that you could auto-parse the column. And even have no data entry errors, so that you can rely on that parsing and have clean error free imported data.

There are many more that don't have a clear spec, or even if they do, have every possible variation of data corruption / keying errors / user misunderstanding / total 'don't give a shit, I'll enter it how I want, that's what the computer is supposed to handle' problems in that source data, that make parsing an absolute nightmare.

I've had to manually review a list of 10k+ data readings monthly, from an automated recorder, because the guy who was supposed to copy the collected data files as is and upload them, instead opened each one and fixed what he (badly mistakenly) thought had been recorded wrong. Different changes in a dozen different ways based on no particular logic beyond "that doesn't look right". And un-fireable, of course.


auto-fix by padding assumes that the only way to get an ID that's too short is by excel removing the leading zeros - presumably it could also be human error when entering etc. (in theory I suppose that would be validated in whatever form the number is entered into in the app that generates the CSV, but hard to say for sure). Or some other issue in data processing pipeline erroneously truncating the ID number some how.

Validating the data would at least prevent getting invalid data into database (and presumably this is happening already), but it doesn't actually "fix" the problem, you still then need the original provider of the data to fix what's missing.


These aren't fixed-width fields with a known length. The length of the field is unique to the person sending the file and they don't tell us how long it's supposed to be.


The parent post says: "Excel interprets a number and then formats it as a number, but in healthcare, 10 digit numbers are really strings." - so I was under impression that the length for that particular column is fixed - probably just misunderstood the sentence.

I wonder how human operators figure out if the value is correct, or the Excel messed it up, or the input was invalid in the first place? If it's even possible to do it reliably then probably there is some set of patterns and methods that possibly could be turned into an algorithm... just thinking out loud here, but seems as an interesting problem to tackle...


Validating on upload is sensible and should be done anyway. However, left-padding is not always going to work, it depends on the data. Common problems I see are with phone numbers. UK mobiles begin 07nnn. Excel kindly drops the leading zero and then really helpfully treats the long number as being an exponential-format floating point number, which cannot be turned back to the original number. In our systems, we require country codes in front of the numbers, so UK mobiles should being 447nnn. That doesn't lose the leading zero but still gets the E-format mess when opened by Excel. Customers often miss out the 44 and use the national 07 format. If that leading zero is missing, it can be quite difficult for a system to reliable determine whether that number, beginning with a 7, is an international number for a destination in Russia, or a user error and should've been 447, from an 07 data value.


One wonders why Excel doesn't store data as strings (or equivalent lossless format) and process it as numbers like SQLite does for instance.


Because Excel is a spreadsheet, for calculating with numbers.

Then there's additional code for dealing with all the inconvenient ways people format things, or want to add text labels, or do things closer to numerical/financial analysis, or all the other extras wrapped around the core "put numbers in boxes and do math".

That misunderstanding is at the core of Excel misuse.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: