
I work in healthcare and by far the biggest production issues we've ever run into are people who run Excel spreadsheets through our file processing where the leading zeros have been removed from patient identifiers because that's Excel's default behavior and you CANNOT TURN IT OFF!

EDIT: I have no idea who downvoted my post because what I said is 100% true. We have to tell customers to stop opening CSVs in Excel and then uploading them to us because what they upload could be missing critical data. Excel interprets a number and then formats it as a number, but in healthcare, 10 digit numbers are really strings. Unless your IT team has created an Excel extension and had it preloaded onto your local copy, Excel will remove leading zeros and there isn't a way to get them back if you save/overwrite.
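The core of the problem, sketched in Python: once the digits are parsed as a number, the original identifier is unrecoverable.

```python
raw = "0012345678"            # a ten-character identifier, not a quantity
as_number = int(raw)          # what naive type inference effectively does
print(as_number)              # 12345678
print(str(as_number) == raw)  # False: the round trip is lossy
```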



Every single problem I've ever had with excel stems from excel trying to be too clever. Nearly all my problems with importing data into it could be solved if excel left everything as is until I asked it to format values a certain way.


>Every single problem I've ever had with excel stems from excel trying to be too clever. Nearly all my problems with importing data into it could be solved if excel left everything as is

We're biased to bash on Microsoft for being "too clever" but maybe we need a reality check by looking at the bigger picture.

Examples of other software not written by Microsoft that also drops the leading zeros and users asking questions on how to preserve them:

- Python Pandas import csv issue with leading zeros: https://stackoverflow.com/questions/13250046/how-to-keep-lea...

- R software import csv issue with leading zeros: https://stackoverflow.com/questions/31411119/r-reading-in-cs...

- Google Sheets issue with leading zeros: https://webapps.stackexchange.com/questions/120835/importdat...

Conclusion: For some compelling reason, we have a bunch of independent programmers who all want to remove leading zeros.


I'm certainly not singling out excel, this thread is just about excel so I thought I'd share my frustration with it. I've had similar experiences with pandas, certainly. I've not used the rest of your examples, but I do have similar issues with libreoffice calc. Regarding pandas, it's at least easy to override the assumptions it makes by passing dtype=str.
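For what it's worth, the pandas fix is one keyword argument (the column names here are made up for illustration):

```python
import io
import pandas as pd

data = "patient_id,name\n0012345678,Smith\n"

# Default type inference parses the ID as an integer: zeros gone.
inferred = pd.read_csv(io.StringIO(data))
print(inferred["patient_id"][0])   # 12345678

# dtype=str keeps every column exactly as written.
as_text = pd.read_csv(io.StringIO(data), dtype=str)
print(as_text["patient_id"][0])    # 0012345678
```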


The issue is that CSVs are untyped and software has to guess the type, sometimes unsuccessfully.


Does it? I think it should treat everything as a string since it can't be certain about the correct type. It could also suggest the correct type rather than applying it without user confirmation.
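Python's standard csv module is an example of that conservative approach: everything comes back as a string until the caller converts it.

```python
import csv
import io

# No type inference at all: every field is a string.
rows = list(csv.reader(io.StringIO("id,count\n0042,17\n")))
print(rows[1])  # ['0042', '17'] -- leading zero intact
```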


You want Excel to auto-interpret a table of numbers, that come from a CSV, as entirely text?

Looking forward to the first time anyone tries to use your excel on a table of numbers and then immediately has to multiply everything by *1 (in a separate table) just to get it back into numbers...


Oh yes please.

At least you would know what's happening and be in control of it

"Hey, is that a date? I bet that's a date!" - Aaaargh Noooo!


YES!! I seem to remember a few versions of Excel ago there was a way to import data and set the data type for each column in a preview window to see what it would look like. And to your point, it's very easy to select an existing column in Excel and change the data type anyway. So why not just make everything a string by default?


Maybe you recall the CSV import window in libreoffice


You are right on the cause of the issue. But the software does not HAVE to guess the type. If I paste something into Excel, I want what I pasted to be what's in Excel. Changing the type and inadvertently affecting the data isn't something any data software should do by default. And the fact that OP said they can't turn it off means that this is beyond a design mistake, it is a FEATURE. If Excel took itself seriously as a data platform, it would not make a change to any text by default.


But Excel has never been a data platform. It's a spreadsheet.

It's users' fault for using it in ways it was never designed for.

Excel has always been about sticking numbers in boxes and calculating with them.

If you want unmodified string input, input strings into a tool intended to handle them.

Project specifications can be hard. Using 1) .xls files after they were superseded, 2) ANY data transfer method without considering capacity or truncation issues, speaks of incompetence.


You can format the column as text before you paste or you can load data from text and set the types for every column. Excel has features to handle all of these issues. I think its biggest problem is that it's just too easy to use. It will allow novices to do all sorts of things with their data without providing guardrails for common mistakes.


Agreed with this. Excel does have a data import feature that allows us to fine-tune the import process to prevent this kind of mistake.

People just double-click the CSV and then complain that it didn't do it correctly. It's the same situation with scientific research data: researchers don't bother to use escape markers, or blindly open the file without going through the proper import process. Then they blame Excel for it without understanding how Excel works.

Yes, Excel has its quirks. But there are ways around those quirks, and there are thousands upon thousands of guides out there about Excel. There's no excuse for complaining that Excel didn't do what you wanted without looking up how it works.


Except that literally every end-user I've ever had the displeasure of dealing with did not know how to properly import data into excel.

Double clicking the CSV should open the data import dialog.


Yeah it does, and libreoffice has ways of recognizing numbers that are meant to be interpreted as strings as well. This is user error, and a programmer trying to blame the program rather than treating it as a process issue. The spreadsheet can't read your mind.


> If I paste something into Excel, I want what I pasted to be what's in Excel.

And that's what you have in Excel. What gets displayed is a separate issue.

And no, you don't want to see exactly what you typed in, not in the general case.

And no, I can't believe I am defending Excel!


I believe excel considered itself a finance system not a data system. Hence the clear bias towards numbers over strings.


> has to guess the type

I believe that's the point, it certainly does NOT need to.


Absolutely! I ran into the arff format when using Weka and it was a revelation:

https://www.cs.waikato.ac.nz/ml/weka/arff.html


Well Excel trying to be clever stems from people not trying to be clever. We truly reap what we sow.


As an intern I was once given a spreadsheet of manually entered daily data, stored by a human being over the previous couple of years. Obviously he'd entered the dates in whatever form he felt like at the time and Excel had parsed them in a variety of godawful ways not least of which was defaulting to American month/day assumptions before UK day/month.

I think it would have been far quicker to just manually write a new column interpreting the dates based on previous/next etc. Instead I spent God knows how long trying to be clever, failing, and being embarrassed that I could not solve this obviously trivial problem.


And the people who make excel will just tell you "excel is not a database" lol


Which sadly, is what I want to tell them too!


The right way to open CSVs in Excel is: Data -> From Text -> Browse to your csv -> Delimited -> Comma -> Next -> Select Text formatting for columns with leading zeroes -> Finish

You can either send xlsx with format or csv without format. If this would be disabled then we'd have another group of people complaining that their dates from CSV are not parsed.


and to enforce that, remove the file association which opens .CSV files in Excel by default. If every machine opened CSV files in notepad a huge percentage of 'Excel' problems would just go away. It could be worse though, try opening and saving a CSV in Word and see what that does.


I used to work with books, which obviously includes ISBNs. They are 13 digit numbers, helpfully turned into scientific notation by Excel.


I used to do migrations between library systems, and often they'd export from the old system as CSV. One rule I gave them was "never open this file in excel", because they'd always try to be helpful by tidying something and I'd end up with scientific notation ISBNs, dates that were in one form if they could be coerced into US-style infix, otherwise left in the original form, barcodes losing zeros, etc. etc.


The same problem happens with UK phone numbers, where the leading 0 gets chopped off. It's solvable if there's a space between the dialling code and the number, but when individuals enter their number, they don't always do this.


Don't ISBNs also contain an X sometimes?


The old ISBN10 standard could get X (meaning 10) as the last character (the last number is a check digit), but ISBN13 (which prefixes 978 or 979 to the ISBN10) is all numbers. The check digit is calculated differently. One obvious advantage is that ISBN13 is barcode compatible.

I would guess that most modern actors in the book business have been primarily using ISBN13 for at least the last decade.


Thanks! The last time I read about ISBNs was probably more than a decade ago.


X can be a check digit. From memory the check digit is calculated using a modulo 11 calculation.
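That memory is right: a minimal sketch of the ISBN-10 check-digit calculation (the weights run 10 down to 2 across the first nine digits, and a result of 10 is written as X):

```python
def isbn10_check_digit(first_nine: str) -> str:
    # Weighted sum: first digit x 10, second x 9, ..., ninth x 2.
    total = sum(int(d) * (10 - i) for i, d in enumerate(first_nine))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

print(isbn10_check_digit("030640615"))  # 2 (full ISBN: 0-306-40615-2)
print(isbn10_check_digit("020161622"))  # X -- the check "digit" isn't even a digit
```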


Interesting to see just how many responses to this real-world problem are suggesting that people rewrite CSV files to work with Excel.

They clearly do not understand system integration and the use of CSV text files for data interchange between multiple systems and applications. Hey, JSON and Javascript libraries are the answer to that, eh?

There are already enough potential issues with CSV interpretation on wrapping strings, escaping characters and so on, but changing the content when a delimiter is found should not be added to that list.

Your bold point is the most important: the default behaviour of Excel when opening a plain-text CSV file is to alter the content for display, applying magic and often-unwanted formatting rules. That should be optional.

It should be possible to open a text CSV file in Excel, view the contents in columns but in the same textual form, save the file in CSV format, open it in another viewer/editor, and still see the same content as the original file.


The problem is that Windows doesn't have a default double-click viewer for CSV other than Excel, if you have Office. The average user just double-clicks. If notepad or wordpad did any kind of formatting for viewing, it would be less of an issue. So as a dev you have to solve for Excel, knowing that's what 99% of people will use.


If you 'solved for Excel', it would be broken for the other applications that may want to use data from those Excel-specific CSV files, which is generally why CSV was chosen in the first place.


If you want it to be a string then put it in quotes. It's that simple. If the user is too dumb to slow down and look at the import options it is not excel's fault, it's a process issue. Excel can't read a cell and then read your mind and extract the correct option.


> there isn't a way to get them back if you save/overwrite.

Of course there is, in your case. If patient identifiers have a fixed, known length, then you can pad with leading zeros to recover them.

You only have a problem if 012345 and 12345 are distinct patient identifiers.
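If the width really is fixed, the recovery is a one-liner (a sketch, assuming 10-character IDs; the function name is made up):

```python
def restore_id(value, width=10):
    # Only safe when IDs have a known fixed width and "012345" and
    # "12345" can never both be valid identifiers.
    return str(value).zfill(width)

print(restore_id(12345678))  # 0012345678
```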

It is bone-headed in the first place to use numeric-looking identifiers (such as containing digits only) which are really strings, and then allow leading zeros. Identifiers which are really strings should start with a letter (which could be a common prefix). E.g. a patient ID could be a P00123.

This is useful for more than just protecting the 00. Anywhere in the system, including on any printed form, if you see P00123, you have a clue that it's a patient identifier. An input dialog can reject an input that is supposed to be a patient identifier if it is missing the leading P, or else include a fixed P in the UI to remind the user to look for P-something in whatever window or piece of paper they are copying from.

The main point in my comment is that instead of shaking your fist at the behavior of other users in the system, such as those who choose Excel because it's the only data munging thing they know how to use, you can look for ways that your own conventions and procedures are contributing to the issue.

If people are going to "Excel" your data, and then loop it back to you, maybe your system should be "Excel proofed".


The lengths of the identifiers are not known, man. We are dealing with like 200 hospitals. They all have various kinds of Electronic Medical Record systems like Epic or Cerner, or one of the 15 other third-party vendors that all have custom implementations. Healthcare is insane in our country.


First off, I "hate" Excel, but only because people misuse it. It, like so many other products of its kind, is for numbers! Yet I would be hard-pressed to find someone actually using it for numbers. People absolutely insist on using it as a generic table system for holding non-number data.

I recently had to show a 20+ year Excel-using fanatic how to import data from a CSV file so that they could select as type Text columns that contain leading zeros. The ability exists, but I have found scant few people who know how to actually use the product properly.

Oh, and I also work in healthcare.


Excel supports a variety of data types other than numbers and includes many built-in functions for dealing with non-numeric data. Whatever its history, it is not accurate to say that it is only for numbers.


Regardless of its _additional_ support for other types, those are only in support of its primary role as a spreadsheet. It is designed around the concept of plugging in _numbers_ and calculating with them. Everything else is to make those calculations well notated and human friendly.

Any other use of Excel is bending it into a role it wasn't intended for, and user beware.

And it is all too easy to just go there since there are soooo many convenience features for those who don't want to learn how to do the tasks well.


If notation were the sole purpose there'd be little need for anything but row and column labels or simple string input. Instead there's a full set of string manipulation tools, embedding of other objects, charts, images, drawings etc.

You are taking the application as it existed 35 years ago and saying it must still be that thing, yet it has had 35 years to evolve far beyond that. Microsoft itself, when it talks about Excel, talks about using it to organize "data", not just numerics data. It has become a more general purpose tool.


As a lay-person, what would you use for non numbers tabular data?


People love to use Excel to look at (with their eyes, or a screen reader, i.e., non programmatically) all kinds of data. They use it to list exports from data stores, and then take notes on what they discovered, show and hide data using the filter option, etc. It is just far too easy to use (and misuse) that it ends up being used for any type of data that fits into a table.


Excel has lots of different uses: inventory tracking, sales, catalogs, mileage, contact lists, and the list goes on. It is useful in many different fields and industries.

You have to remember, Excel is an extremely powerful beast. It has many specialized features to handle whatever data it encounters. I've used Excel for 15 years and I am still finding features that make the process quicker. Of course, Excel has its limits and I am well aware of that.


I mean I thought you said that they should not use excel for these scenarios, but something else. So I thought you had an alternative software in mind?


I use Splunk (our organization pays lots of money for it, and yet it just sits there because people would rather use Excel). There is also Tableau and other products like it. Most of these tools have a free version, as well. They take some learning, but it is well worth it. Of course, taking that same amount of time to actually learn how Excel works is worth the time, too. I guess the issue is that Excel is just too "easy" to use, and hence people using it for years and years and not actually knowing how to use it.

I am not trying to be an elitist about this. It is just that the misuse of Excel (because people do not know how to use it) causes massive issues on a daily basis.


I haven't used it much myself, but I think Access is meant to answer this question.


Excel is fine, but keep the data as xls and you won't have problems like dropping leading zeros. As soon as you convert to a simpler file format like csv, you will likely lose data unless you are very careful in both export and import.


I just upvoted you.

You are fully correct, I have seen plenty of stuff like that in life sciences projects.


> Excel interprets a number and then formats it as a number,

Unless you instruct it to interpret the field as a string.

> but in healthcare, 10 digit numbers are really strings.

I'm wondering, if you expect 10 digits and you get less than that, how difficult is it to add some padding zeroes?


With US ZIP (postal) codes, padding with zeroes can actually be ambiguous because we have "zip+4" which is often stored as two fields, but just as often concatenated into a single representation. If you have a zip+4 of "05555-0444" in two fields that each lose leading zeroes, and then concatenate to "5555444" you can't reconstruct it with certainty (00555-5444 or 05555-0444 or 55554-0044).
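A quick sketch of why the concatenated form can't be recovered: every split of the surviving digits that zero-pads back to a 5-digit ZIP plus a 4-digit extension is an equally valid candidate.

```python
digits = "5555444"  # "05555-0444" after both fields lost their leading zeros

candidates = set()
for split in range(len(digits) + 1):
    zip5, plus4 = digits[:split], digits[split:]
    if len(zip5) <= 5 and len(plus4) <= 4:
        candidates.add(zip5.zfill(5) + "-" + plus4.zfill(4))

print(sorted(candidates))
# ['00555-5444', '05555-0444', '55554-0044'] -- three equally plausible readings
```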

I had a package get seriously delayed one time because it kept being sent to a city whose 5 digit zip code was equal to [my 5 digit zip code, less its leading 0, and the first digit of my +4]. Fun times.


> Unless you instruct it to interpret the field as a string.

Except double clicking a CSV does not give you the option to do this. At this point Excel already decided to corrupt your data. And guess how most users open CSV files? That's right, they double click.

> I'm wondering, if you expect 10 digits and you get less than that, how difficult is it to add some padding zeroes?

If you expect 10 digits and get less than that, you have corrupted input data. Trying to "fix" this is exactly the sin excel is committing. Don't do that.


This would assume people (many of whom have over a decade of experience using Excel) actually know how to do that. The largest problem with Excel is people.


Excel rules. I still use it daily despite having access to big boy tools (python, SQL, R, etc.) Few tools make it easier to answer quick, simple questions about a dataset. Plus it's still the best way to browse SSAS cubes. Blaming excel for these sorts of problems is like blaming a hammer for not being a screwdriver.

I also work in healthcare.


Their database system afterwards could do that.

Or prepend a letter when producing the CSV to avoid EXCEL doing what it does.


Maybe it can be solved by prepending "ID:" for exports, and removing it on import. It could be proposed as a convention for all HC software companies.
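A sketch of what that convention could look like ("ID:" is hypothetical here, not an existing standard): prefixed, the value can no longer be parsed as a number, so it survives a round trip.

```python
import csv
import io

PREFIX = "ID:"  # proposed convention, illustrative only

def excel_proof(identifier: str) -> str:
    # "ID:0012345678" cannot be mistaken for a number.
    return PREFIX + identifier

def unproof(value: str) -> str:
    return value[len(PREFIX):] if value.startswith(PREFIX) else value

# Round-trip through a CSV file.
buf = io.StringIO()
csv.writer(buf).writerow(["name", excel_proof("0012345678")])
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(unproof(row[1]))  # 0012345678
```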


Introducing standards across many companies is incredibly difficult but your main problem is that a lot of healthcare IT systems are very old and not actively maintained, they're essentially on life support. So changing the format of a file, or requiring new import cleaning, is going to break integrations and data sharing in many places. Basically it'd create far more problems than it's attempting to solve.


That is understandable.

I guess it depends on which country and which specific part of healthcare you are active in. In the 'care' domain in the Netherlands I also see that some of our integrations are one-off, but the most important ones do have industry wide standardization that receives updates based on law.


Although make sure not to use that for the first column heading; a file starting with an uppercase "ID" will make Excel think it's a SYLK file and then refuse to open it.


Good catch! The Wikipedia page confirms it.

Nothing in software development is as straightforward as we might hope.

https://en.wikipedia.org/wiki/Symbolic_Link_(SYLK)


IME numeric identifiers are just asking for trouble. Some systems are sensitive to leading zeros, some systems will zero-pad, <input type="number"> loses zero-padding, etc. Worked with a client once with references to the same entity scattered across two databases, one of which zero-padded the identifiers and the other didn't. Oy vey.


> IME numeric identifiers are just asking for trouble

That's a failure of the system if it can't be told to not interpret the data. However you're saying the world is as it is; can't argue.


Agree with the problem reported here - the fact that Excel will drop leading zeroes and will turn phone numbers into Exponential format causes many live issues where customers use Excel to work on 'CSV' files :(


The issue I see is you have people who don't know the difference between a string data type and a number. I keep seeing "Listen to the scientists" and "Listen to the doctors". There are many facets to solving problems, and a mix of disciplines may be better at solving them. Had you/"healthcare" contacted folks with software expertise rather than being your own experts, this would have been a non-problem.


Also long numeric ids are rounded by Excel.

And fucking with dates.


Does it strip leading zeroes if you format the numbers as strings with quotes like this?

blah,blah,"00001553",blah


It is sometimes possible to write code to alter your CSVs to retain the strings.

The first thing would be to write them as entry1,"0123456789",entry2 rather than entry1,0123456789,entry2. This has worked for me in some instances in Excel where I have to escape certain things inside a string, but I would not be surprised if Excel still messes this up: for example, showing the warning-triangle box and then helpfully suggesting a conversion to number.

If you want to go further, you can write a routine that alters the CSV to produce something like entry1,hospitalString(0123456789),entry2. Sure, there are problems with this too, but Excel can break a lot of things and I do use the above examples in practise (in the first example, the double quotes escape single quotes in foreign-language Unicode text).

Another thing Excel can do is break your dates, by switching months (usually only for dates < 13th of the month, but often a partial conversion in your data for < 13th and >= 13th) or by converting dates to integers.


Working internationally with CSV holding dates is a recipe for disasters.


Couldn't you validate the data on upload or if it's such a common problem just auto-fix it by left padding w/ zeros?


There are data sources that have a clear unambiguous specification of what information is in each field, and exactly how they are formatted, such that you could auto-parse the column. And even have no data entry errors, so that you can rely on that parsing and have clean error free imported data.

There are many more that don't have a clear spec, or even if they do, have every possible variation of data corruption / keying errors / user misunderstanding / total 'don't give a shit, I'll enter it how I want, that's what the computer is supposed to handle' problems in that source data, that make parsing an absolute nightmare.

I've had to manually review a list of 10k+ data readings monthly, from an automated recorder, because the guy who was supposed to copy the collected data files as is and upload them, instead opened each one and fixed what he (badly mistakenly) thought had been recorded wrong. Different changes in a dozen different ways based on no particular logic beyond "that doesn't look right". And un-fireable, of course.


auto-fix by padding assumes that the only way to get an ID that's too short is by excel removing the leading zeros - presumably it could also be human error when entering etc. (in theory I suppose that would be validated in whatever form the number is entered into in the app that generates the CSV, but hard to say for sure). Or some other issue in the data processing pipeline erroneously truncating the ID number somehow.

Validating the data would at least prevent getting invalid data into database (and presumably this is happening already), but it doesn't actually "fix" the problem, you still then need the original provider of the data to fix what's missing.


These aren't fixed-width fields with a known length. The length of the field is unique to the person sending the file and they don't tell us how long it's supposed to be.


The parent post says: "Excel interprets a number and then formats it as a number, but in healthcare, 10 digit numbers are really strings." - so I was under impression that the length for that particular column is fixed - probably just misunderstood the sentence.

I wonder how human operators figure out if the value is correct, or the Excel messed it up, or the input was invalid in the first place? If it's even possible to do it reliably then probably there is some set of patterns and methods that possibly could be turned into an algorithm... just thinking out loud here, but seems as an interesting problem to tackle...


Validating on upload is sensible and should be done anyway. However, left-padding is not always going to work; it depends on the data. Common problems I see are with phone numbers. UK mobiles begin 07nnn. Excel kindly drops the leading zero and then really helpfully treats the long number as being an exponential-format floating point number, which cannot be turned back into the original number. In our systems, we require country codes in front of the numbers, so UK mobiles should begin 447nnn. That doesn't lose the leading zero but still gets the E-format mess when opened by Excel. Customers often miss out the 44 and use the national 07 format. If that leading zero is missing, it can be quite difficult for a system to reliably determine whether that number, beginning with a 7, is an international number for a destination in Russia, or a user error that should've been 447, from an 07 data value.
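One hedged sketch of what "validate on upload" could look like: flag values that already carry Excel's fingerprints rather than silently "fixing" them (the regex and function name are illustrative, not from any standard):

```python
import re

# Scientific notation like 4.47123E+11 is what a 447-prefixed UK mobile
# can become after a round trip through Excel's number formatting.
SCI_NOTATION = re.compile(r"^\d(\.\d+)?E\+\d+$", re.IGNORECASE)

def looks_excel_mangled(value: str) -> bool:
    return bool(SCI_NOTATION.match(value))

print(looks_excel_mangled("4.47123E+11"))   # True: reject and ask for a re-export
print(looks_excel_mangled("447123456789"))  # False
```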


One wonders why Excel doesn't store data as strings (or equivalent lossless format) and process it as numbers like SQLite does for instance.


Because Excel is a spreadsheet, for calculating with numbers.

Then there's additional code for dealing with all the inconvenient ways people format things, or want to add text labels, or do things closer to numerical/financial analysis, or all the other extras wrapped around the core "put numbers in boxes and do math".

That misunderstanding is at the core of Excel misuse.


Down-voted as it is factually incorrect. See comment by reportgunner below.


> I have no idea who downvoted my post

That would be the Excel devs working at Microsoft. They read HN. I can feel it.


Excel devs, do you hear me? Repent for your sins! Go back to Microsoft and fix those problems. The cost of Excel mishandling numbers and dates must be many orders of magnitude higher than your lifetime salary.

Thank you.


They can't. All those spreadsheets that depend on that behavior are out there. It makes "spell creat with an e" look easy.


Just format the identifier cells as text. I've also had this problem, this is how I solved it.


That's fine for an individual working on a specific set of data in a specific sheet, but this isn't just a problem for one individual or a small team that can solve it and move on. It's a systemic problem throughout many vast organisations, with continuous influxes of new personnel and constantly changing requirements. When you get an XLS sheet sent over to you from another team that already made this mistake, it's too late and this happens all the time.


> Just format the identifier cells as text

CSV is text. If you mean in Excel, if you opened it in Excel (rather than importing and choosing non-default options), you've already lost the data so formatting doesn't help you.


Yes, I mean Excel. We have CSV to XLS import scripts/forms that format identifier cells as text. The data format is standardised. Using templates to do the imports was the dumb part. Microsoft has a Power BI tool if one doesn't want to write or use scripts. Use that. I assume a government agency has the resources to pay for it and for data scientists.

https://powerbi.microsoft.com/en-us/

Thanks for bothering to respond instead of downvoting.


> I assume a government agency has the resources to pay for it and for data scientists

First, the upthread commenter said "healthcare", not "government agency".

Second, as someone who has worked in public sector healthcare: HA HA HA!

I mean, sure we have the resources to pay for data scientists (of which we have quite a few) and could probably afford to develop custom scripts for any CSV subformat that we decided we needed one for (though if it's a regular workflow, we're probably acquiring it and importing it into a database without nontechnical staff even touching it, and providing a reporting solution and/or native Excel exports for people who need it in Excel.)

The problem is that when people who aren't technical staff or data scientists encounter and try to use CSVs (often, without realizing that's what they are) and produce problems, it's usually well before the kind of analysis which would go into that. If it's a regular workflow that's been analyzed and planned for, we probably have either built specialized tools or at least the relevant unit has desk procedures. But the aggregate of the stuff outside of regularized workflows is...large.


Nearly all end-users open a CSV like this:

1. They see a file (they have file extensions turned off, which is the default, so they probably don't even know what a CSV is)

2. They double click it

Excel now corrupted the data. That is the problem. Good luck teaching all end-users how to use Excel properly.


> They have file extensions turned off, which is the default, so they probably don't even know what a CSV is

And if, also by default, Excel is set up with an association with CSVs, the CSV file will, in addition to not having an extension to identify it, have an icon which identifies it with Excel.


> I assume a government agency has the resources to pay for it and for data scientists.

Bold strategy there, let's see how that plays out.

Having been in and around military / DoD usage for a long time, I can tell you it's always an uphill battle to get processes to work well, instead of defaulting to whatever happened to get included in the original spec by someone who wasn't even aware of good practice.


So what's the big deal? Just read back the numbers, if it's less than 10 digits prepend zeros. Am I missing something (probably, which means you're leaving something out in the description of the issue)


The 10 digits was just an example. There are like 100 different ID types with different formats/lengths in healthcare.


How on earth is this the top rated comment? I would downvote it if I could. As other people have noted, you can import CSVs values as strings instead of numbers.

Furthermore, losing preceding zeroes in number-typed values is not unique to excel; it is a common feature in all typed programming languages.


> you can import CSVs values as strings instead of numbers

Of course, but the problem isn't that the person who posted the comment doesn't know this - it's that many users of their systems don't know it. Most people are just going to accept whatever defaults Excel suggests and not know any better, causing problems down the line.


No, that's not correct. The parent poster specifically made this false claim: "that's Excel's default behavior and you CANNOT TURN IT OFF!" The false claim is the part that the poster wrote in all caps. He said that this behavior can not be turned off, when in fact it can be turned off.


It can't be turned off, if you're opening a CSV in Excel.

It can be avoided if you go through the Data | Import tools. The complaint is that few Excel users know the import engine is available, or use it, instead of just opening the file and getting all the default interpretations. Which can't be avoided in the usual Open code path.


How? Where is the setting to turn it off when initially loading a file?


>How? Where is the setting to turn it off when initially loading a file?

The confusion comes from two different ways of getting a CSV into Excel:

- you, saying "can't turn this off": double-clicking a csv in File Explorer, or Excel's "File->Open" on a csv.

- others, saying "you can preserve leading zeros": clicking Excel 2019's Data tab and importing via the "From Text/CSV" button on the ribbon; a dialog pops up that provides the option "Do not detect data types" (earlier versions of Excel use different verbiage for interpreting numbers as text)


Quote your strings. The problem is more with how you generate the CSV (which is easy to break in the first place) than with Excel. It's like writing your identifiers as numbers in JSON instead of strings and then complaining that the leading zeros are removed.


Quoting doesn't work. Try opening this CSV with Excel:

    a,b
    "01",01
Excel interprets both as the same number: 1.
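For contrast, a plain CSV parser returns both fields verbatim, quoted or not; the coercion happens in Excel's type-guessing import step, not in the format itself. A minimal stdlib Python sketch:

```python
# Parse the same two-column CSV with Python's stdlib csv module.
import csv
import io

csv_data = 'a,b\n"01",01\n'
rows = list(csv.reader(io.StringIO(csv_data)))

# Every field comes back as a string, leading zeros intact,
# whether or not it was quoted in the source.
print(rows[1])  # ['01', '01']
```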


The correct way to generate a CSV cell with a leading 0 is

    ="01"
You can verify this with

    01,"01",="01"
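Generating that workaround is trivial; a Python sketch follows, with the caveat that the output is Excel-specific, and any other CSV consumer will see a literal `="01"` string rather than `01`:

```python
# Wrap each identifier in Excel's ="..." formula syntax.
# NOTE: this is a non-standard, Excel-only hack, not valid plain CSV.
ids = ["01", "0012345678"]
line = ",".join(f'="{v}"' for v in ids)
print(line)  # ="01",="0012345678"
```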


Perhaps by "correct way" you meant "dodgy hack to make Excel happy and risk breaking more sensible implementations"?

Excel may predate the RFC but AFAIK MS didn't invent or coin the term CSV, so you can't just say whatever Excel does is correct. The RFC is loose because of nonsense like this, it doesn't mean it was ever a good idea.


How in the world is Excel supposed to know which fields you want to be numbers and which to be strings? CSV doesn't have that info built in unless you surround the number with quotes and you choose the right import process. Excel isn't just a CSV importer; it reads all sorts of files, and it needs some help if you expect it to work. Whatever happened to process and a sense of responsibility and craft in your work?


Choosing based on whether it's in quotes seems like a better solution than using an equals sign. Not everything follows that convention admittedly, but not everything understands the '=' either. Or it could just treat everything as text until a user tells it otherwise.

But none of that is really the point. Because CSV files aren't just for importing into Excel. One of their main benefits is their portability. In other situations column types might be specified out of band, but even if not, putting equals signs before values is unconventional, so more likely to hurt than help. And in the cases it might help, i.e. when you only care about loading into Excel, then you have options other than CSV, rather than contorting CSV files for Excel's sake.

> What ever happened to process and a sense of responsibility and craft in your work?

I actually have no idea what you are on about. I'm talking about the "responsibility and craft" of not producing screwed up CSV files. Why do some people find that so offensive? Yes, it is not inconceivable that there could be some situation working with legacy systems where putting `="..."` in CSVs is, unfortunately, your best option. Sometimes you do have to put in a hack to get something done. But don't go around telling people (or yourself) that it is "the correct way".


[flagged]


I sure don't want to have to deal with people putting `="01"` in CSV files.


We have multiple legacy systems where I work, communicating via batch csv files, some of which are authored & curated by staff in Excel. I can confirm that doing this would be a very bad thing indeed and make our systems grind to a halt.


For CSV that isn’t solely targeting Excel, hacks around the way Excel works are useless.


The reason you use CSV is that you want to use it with software that ISN'T Excel. Otherwise you'd just use .xlsx. No other software uses this convention, and it's not correct CSV.


I guess the HN audience has voted. The result is: hacking around Excel stupidities is itself a stupid idea.

If you want Excel to understand your output, perhaps use a library which can write xlsx files.
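For instance, a sketch with openpyxl (assuming it is installed; any xlsx-writing library would do), where cell types are explicit and leading zeros survive:

```python
# Write identifiers as real string cells in an .xlsx workbook,
# so Excel has no type guessing to do when the file is opened.
from openpyxl import Workbook, load_workbook

wb = Workbook()
ws = wb.active
ws.append(["patient_id", "visit"])
ws.append(["0012345678", 1])  # the ID is a string cell; zeros are kept
wb.save("ids.xlsx")

# Round-trip check: the value comes back as the original string.
print(load_workbook("ids.xlsx").active["A2"].value)  # 0012345678
```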


“Correct” is a strong word. It isn’t defined in https://tools.ietf.org/html/rfc4180, so Excel should not try to be smart and add extensions only it supports.


Excel predates RFC4180 by nearly 20 years (RFC4180 is October 2005, Excel 1.0 was September 1985) and this behavior was already cemented when the RFC was written.

As for the actual RFC, it's worth taking a read. Any sort of value interpretation is left up to the implementation, to the extent that Excel's behavior in interpreting formulae is 100% in compliance with the spec.


CSV predates Excel, and other CSV implementations don't have this behavior.


What spec?

Anyway the RFC doesn't mandate any value interpretation IIRC.


If CSV were being used just to exchange data with Excel, we probably wouldn't be using CSV. Many systems neither need nor know that ="01" should be treated as the string "01".

If Excel were the only intended consumer, .xlsx would be a preferable file format. At least it's mostly unambiguous.


I ran into this last week with a UK bank. I was offered a CSV file. What I got was a CSV file with Excel formulae in it.

I actually wanted a CSV file, preferably without having to resort to sed to strip out the Excel formulae.


And now the csv parser (or downstream process) has to guess whether to interpret that as the raw string or as the eval'd value.
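A downstream process that wants the raw value ends up needing a heuristic like this Python sketch (illustrative only; the regex handles just the simple `="..."` wrapper, nothing else Excel might emit):

```python
import re

def unwrap_excel_formula(field: str) -> str:
    """Strip an Excel-style ="..." wrapper if present, else return as-is."""
    m = re.fullmatch(r'="(.*)"', field)
    return m.group(1) if m else field

print(unwrap_excel_formula('="01"'))  # 01
print(unwrap_excel_formula("01"))     # 01
```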


If you want to write a bespoke CSV generator for an application where you know for sure that the file is only ever going to go straight to an Excel instance, sure.

For all the other uses in the world, that's a breaking change.


I have a list of companies I'd like you to consult for. Coincidentally, they're companies I'd like to work for, and I've quite enjoyed building proper database solutions which replace incredibly hacky/terrible Excel (or Excel adjacent) solutions.


Not even Excel uses that syntax when exporting to CSV (at least by default).


Are you assuming creation of the CSV? Because most Excel users won't be aware of why they should store in CSV instead of Excel's proprietary formats.



