Show HN: Statwing is statistical analysis, simplified

taliesinb · on Aug 6, 2012

I'm glad to see other people working on this problem. We (Wolfram|Alpha) are doing this too, starting out with making the 'easy cases' nearly automatic.

Here's a blog post describing our effort: http://blog.wolframalpha.com/2012/02/09/launching-a-democrat...

You can also play around with our examples without having to sign up to Wolfram|Alpha Pro. My favorite is an automatic analysis of the Titanic data that nicely illustrates that while the motto "women and children first" applied, being rich certainly helped: http://www.wolframalpha.com/input/?i=+&examplefile=1&...

We cover other kinds of simple analysis and visualization too, like heat maps, Venn diagrams, graphs, and so on. As always, feedback welcome.

glaugh · on Aug 7, 2012

Agreed, glad to be a part of the community of folks trying to democratize data analysis. It feels like an important problem to work on, and we're passionate about it (as I'm sure you are, too).

Thanks for chiming in.

anonDataUser · on Aug 7, 2012

This is great. Do you have any solution for highly sensitive data?

aphyr · on Aug 7, 2012

The walkthrough took me through finding a correlation between voting preference and neuroticism. Great! But it's also worth noting that this dataset shows larger effect sizes at similar CIs for the correlation between [preference and age] and [age and neuroticism]. This, folks, is why ANOVA is important.

That aside, the product was clear, fast, and intuitive. Well-chosen visualizations and a clean emphasis on the important moments for basic covariate analysis. Well done.

glaugh · on Aug 7, 2012

Nice, well pointed out.

Unfortunately, we don't have regressions yet. But just to make sure these findings were still valid, we tossed this data into a program that does have them and checked that the effect remained (it did).

It's no substitute for regressions/ANOVA, but for now here's what Statwing can do: add a filter that excludes datapoints below, say, 40 years old. Looking only at folks older than 40, there's no relationship between [neuroticism and age], but the relationship between [neuroticism and preference] remains.

But, point taken. We'll count this as a vote for prioritizing regression. Thanks!
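The filter check described above can be sketched in a few lines of Python. This is only a rough illustration: the rows are made-up toy values (not the demo dataset), `pearson_r` is a hypothetical helper written for this sketch, and restricting to one age group is a crude stand-in for a proper regression or ANOVA.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Toy rows, invented for illustration: (age, neuroticism, prefers_candidate_a)
rows = [
    (22, 9, 1), (28, 8, 1), (35, 8, 0), (45, 6, 1),
    (52, 5, 0), (58, 6, 1), (63, 5, 0), (70, 6, 0),
]

# Apply the filter: keep only respondents older than 40.
older = [r for r in rows if r[0] > 40]

ages = [r[0] for r in older]
neuroticism = [r[1] for r in older]
preference = [r[2] for r in older]

# Within the filtered group, compare the two relationships side by side.
print("neuroticism vs age:       ", pearson_r(neuroticism, ages))
print("neuroticism vs preference:", pearson_r(neuroticism, preference))
```

If the neuroticism/preference correlation survives inside a narrow age band while the neuroticism/age one fades, age is less likely to be the whole story. A regression that includes age as a covariate would still be the more reliable check.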

danso · on Aug 6, 2012

1. Great looking product. I clicked through a little bit and liked the general polish, but didn't have time to explore everything.

2. For people whose jobs involve statistical analysis, how much need is there for something like this? The more analysis I do, though, the more I realize that the hard part is collecting the data and programmatically "piping" it from package to package...And from professionals I know in various numbers-based industries, their biggest blind spot seems to be that ability to gather data that doesn't come in a CSV/Excel sheet for them.

* edit: in addition to the challenges presented above, the challenge of cleaning data so that a package like Statwing can do a proper analysis

glaugh · on Aug 6, 2012

Thanks, really appreciate #1.

Agreed that quite often the hardest part is getting the data together (particularly on the web). But from our perspective, it's still true that conducting the actual analysis and visualizing the data should be a lot easier than it is. And that's particularly true if you're in our initial audience, the roughly 50% of SPSS/R/Minitab/etc. users who never use anything past the basic functionality of those programs.

I guess a simpler answer is that we think there's a need because this is a product that I badly wanted when I was an analyst/consultant, splitting time between Excel and the basic functionality of SPSS.

edit: Also, and this isn't very helpful, but we talk to a ton of people about their data analysis needs, and we hear a good chunk of them talk a lot about the pain of using highly technical solutions for relatively simple problems like analyzing a survey.

danso · on Aug 6, 2012

More notes:

#3 I have to say, my first impression was that the tutorial was a little annoying, but it's actually done pretty slickly and it introduces features, such as the multiple variable analysis, that I probably would not have stumbled upon in the first place. Well done.

#4 That's my SOPA project you're referencing! :) https://www.statwing.com/demos/sopa (though if you can edit the copy, credit should also go to the Center for Responsive Politics, from which the campaign finance data was collected)

glaugh · on Aug 6, 2012

Nice! It's a great dataset, really fun, and we're big fans of ProPublica.

We'll definitely edit the copy. I'll ping you unicast to make sure we did it right. Yay!

edit: Also, thanks for the feedback on the tour. Our goal is to make the interface so intuitive that it shouldn't require the tour to know what to do. We've got some updates in mind that should get us much closer to that goal.

swalsh · on Aug 6, 2012

"For people whose jobs involve statistical analysis, how much need is there for something like this?"

I don't know who the creators are targeting, but perhaps it's not experienced people? One of the great parts of the internet is the way it seems to lower the barrier of entry for just about everything. If this app can help people learn how to do simple statistical analysis, there just might be a load of value there. I didn't really click around, but something they might think about doing is allowing users to hyperlink directly to an "analysis session" so a blogger can not only link to a graph, but the data itself. I can imagine a scenario where a blogger writes a post about maybe housing prices, and draws some unreasonable conclusion. Then a reader goes, and adds in inflation data which changes the story. He then replies in the comments with a link to his new "analysis session" spurring a new conversation.

It's probably taking the tool in a different direction, but I really like what's here.

Someone · on Aug 7, 2012

"If this app can help people learn how to do simple statistical analysis"

IMO, that is a big IF. Statistical analysis is like writing security software: if you are not an expert, do not do it; if you are an expert, you already know you should not do it (but you also know that, sometimes, somebody has to do it anyway).

The challenge is in finding a feature set that is useful, yet foolproof, and beats what Excel provides.

recardona · on Aug 6, 2012

I do a lot of stat. analysis and I was impressed with the clarity of the analyses. However, I ran through the Obama v. Romney tutorial and was surprised to see that the software was averaging survey items (Likert-scale data). I thought that this was not allowed since it is troublesome to interpret the output (how do you interpret 8.36 Neuroticism?)

Aside from that, I can see this filling a need for those who are aware of the importance of statistical significance but do not have the time to look up the appropriate analysis function in R/SAS/SPSS/...

mgurlitz · on Aug 6, 2012

You're right, they shouldn't be making that average. Specifically, Likert data is ordinal, meaning 14 is less than 15 and greater than 13, but the gap between 15 and 14 may differ from the gap between 14 and 13.

For example, let's say people measure neuroticism exponentially, and an increase of one point means 10x perceived neuroticism. Because mean(log(x)) != log(mean(x)), the average won't be representative.

Everything else looks OK though: count, median, percentiles, a histogram.
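To make the mean(log(x)) != log(mean(x)) point concrete, here is a minimal Python sketch. The three scores are invented for illustration, and the "one point means 10x" scale is just the hypothetical from the comment above, not a claim about how people actually rate neuroticism.

```python
import math

# Hypothetical Likert scores; illustrative only, not from any real survey.
scores = [1, 2, 5]

# Assumption from the example above: each extra point means 10x perceived neuroticism.
perceived = [10 ** s for s in scores]

# Average of the raw scores (treating the Likert values as continuous).
mean_of_scores = sum(scores) / len(scores)

# Average on the perceived scale, then mapped back to a score.
score_of_mean_perceived = math.log10(sum(perceived) / len(perceived))

print(mean_of_scores)           # about 2.67
print(score_of_mean_perceived)  # about 4.52
```

Under that assumed scale the two "averages" disagree by almost two points, which is why the median and percentiles are the safer summaries for ordinal data.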

glaugh · on Aug 6, 2012

Thanks for the comment.

Agreed, Likerts are definitely an area of controversy.

A couple references from Wikipedia about Likert-as-continuous:

Pro: http://www.ncbi.nlm.nih.gov/pubmed/20146096

Con: http://xa.yimg.com/kq/groups/18751725/128169439/name/1Likert...

Our stance is generally a pragmatic one. It's very common practice to analyze data this way, so we enable it. But if you want to analyze your Likerts as ordinal data, you can change the variable type to Ranks (ie Ordinal), where we handle the result as you'd expect (nonparametric test, no averages). Note that this feature is disabled in the demo.

Thanks!

edit: formatting

talbina · on Aug 7, 2012

There was a company that applied to YC that wanted to do the "Google Docs for Statistics" but was rejected. They wrote about it in a blog post but I can't find it. They ended up not launching.

It will be worth it to connect with these people to see if there is anything that can be learned from them.

glaugh · on Aug 7, 2012

Definitely let us know if you think of their name or dig up their blog post. Sounds interesting.

TrevorBurnham · on Aug 7, 2012

I believe the company talbina's thinking of is mine: We applied as Theoryville, got interviewed, got rejected, applied to Betaspring (http://betaspring.com/), got accepted, changed our name to DataBraid, and proceeded to fall apart over the course of a summer.

I do think the idea has a lot of potential, and what StatWing has built is already more complete than what my team managed to build in 3 months. A few suggestions I'd offer based on that experience:

1. Parsing CSVs is easy in theory, but painful in practice, because CSVs in the wild tend to be full of junk. I would provide a JSON API that makes it easy for developers to put data in your system directly, allowing people to build their own CSV parsers for you.

2. Use GitHub as your model. You want people to collaborate around data the same way that developers collaborate around code. Just about every day when I was doing DataBraid, we'd discuss a use case and then say "Oh, GitHub already figured out the right way to do this." The most compelling use case here is that researchers can run different sets of tests on the same data and discuss which approach is the most valid/insightful.

3. Getting to revenue will be hard, but having paying customers will make it much, much easier to attract investment. So find the MVP that people will pay for and put everything else on a "nice-to-have" list.

Best of luck!

jenius · on Aug 6, 2012

Looks really great overall - props! One small design thing in there that bothered me was how the gradients reverse in the buttons on hover - this should never happen. Just lighten or darken the color on hover (moving the gradient up with background-position and adding a transition is a good trick), then consider reversing the gradient on active (or just adding an inset shadow).

Everything else in the design looks great and this is totally nitpicky, but hope it helps!

lejohnq · on Aug 6, 2012

I also work on Statwing so thanks very much for the comment.

Now that I look at it more, the front page buttons do look weird compared to all of our other buttons. We've become numb to it after looking at it so often. Most of our buttons do the design thing that you described, so we'll change that shortly! Thanks for the feedback.

Bill_Dimm · on Aug 6, 2012

Very nice. One tip: Don't require an email address to provide feedback and you'll get more feedback.

A bug that I found in the tour for "Politics and the Big 5":

The instruction bubble says: To run a different analysis, remove "Neuroticism" from the white box by clicking the X to the right of the variable name. But, "neuroticism" is not one of the variables I was using. It seems that something was hard-coded when it shouldn't have been.

glaugh · on Aug 7, 2012

Ah, thanks a bunch. Appreciate both the bug and the feedback tip. Have a good one, thanks for checking out Statwing.

kylemaxwell · on Aug 6, 2012

This looks great and I look forward to running some analyses of the same test data between Statwing and Wolfram|Alpha Pro in a mini-bakeoff.

EDIT: Can you talk about your business model any? Sort of a freemium service, or maybe charging for a future API, or something along those lines? Please don't say "ads".

glaugh · on Aug 7, 2012

Fortunately, people are pretty used to paying for this kind of a product. So we'll do freemium based on number/size of datasets uploaded and some as-yet-unreleased advanced features. Probably throw in some academic discounts for good measure.

Thanks for the question. Cheers!

kylemaxwell · on Aug 7, 2012

Good to hear. I always like seeing cool sites have a way to make money so I can have confidence they'll be around for a while. :)

grantjgordon · on Aug 6, 2012

Very nice. Who's the target audience for this? Students? Curious enthusiasts? Analysts within companies?

glaugh · on Aug 6, 2012

We think of our target audience in concentric circles. We'll likely have users from each circle at any given time, but we'll prioritize our product and marketing towards the inner circles then move outwards:

Circle 1. A few specific analysts in a few specific companies we're associated with. They analyze survey data, they use only basic functionality of the fancy tools, and they want a simpler solution.

Circle 2. People analyzing surveys generally. It's a straightforward application where existing tools are way too complicated.

Circle 3. The rest of the 50% of stats tool users that never use more than the core functionality of existing tools (that number is from our research).

Circle 4. People who analyze at work. In particular, Excel power-user analysts and marketing folks for whom the go-to tool for analysis is the pivot table. We want to ease them into the world of more powerful, statistical analysis. We do a lot of usability testing with these folks and we're excited about their reactions so far. But they're not in a lot of pain, so they're not a great initial audience for us.

Grand vision stuff: Tools like SPSS and the like were built in the 80s, and Excel pivot tables were built in the 90s. They've been updated but not overhauled, and there's a gaping hole between them in terms of ease of use and power. As small, rich datasets become ubiquitous, are people in 2020 really going to be using tools from 1990? We hope not.

grantjgordon · on Aug 7, 2012

Thanks! Very insightful.

kirillzubovsky · on Aug 6, 2012

A statistics application that runs in the cloud and looks good too? Yes please! Looking forward to playing around with the data to see what's possible. Where are you guys planning to take this software?

tel · on Aug 6, 2012

I'm worried for how quickly you can do tests with this interface. I feel my fingers urging for hypothesis hunting---do you have multiple comparison corrections in place?

glaugh · on Aug 6, 2012

Totally valid. We do multiple comparison protection on ANOVA post hoc tests, but not across all analyses.

Ultimately we'll need to address this. Hopefully doing so (automatically) will differentiate Statwing from other stats packages, where one is quite free to shoot oneself in the foot (and one often does).

We'll count this as a vote for the prioritization of that feature.

Thanks for the comment, really appreciate it.
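For readers unfamiliar with multiple comparison corrections, here is a minimal sketch of one standard approach, the Holm-Bonferroni step-down procedure. The thread does not say which correction Statwing applies to its post hoc tests, so this is an illustration of the general idea, not their method; the function name and the p-values are made up.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k from 0) is compared to alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail too
    return reject

# Made-up p-values from four hypothetical tests on the same dataset.
p_values = [0.001, 0.04, 0.03, 0.2]
print(holm_bonferroni(p_values))  # [True, False, False, False]
```

Without correction, three of these four tests would pass at the 0.05 level. After the correction only the strongest one does, which is the kind of protection that matters when an interface makes it easy to run many tests quickly.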

tel · on Aug 7, 2012

Honestly, I wrote it as a disguised compliment. Doing tests quickly makes statistical validation available and that is solidly better than winging it because you can't be bothered to do the math.

I did a bit of brainstorming previously about penalties and negative feedback controls for hypothesis testing in a medical context. The metaphor I liked was that you are buying hypothesis tests with data and therefore there is a penalty risk for each attempted test. I never worked out the math very thoroughly, but I'd love to see how a system like that would work live.

I think it'd be an amazing boon to your system to have this kind of feedback. You'd not only be easy and available but also trustworthy, since you make sure you never promise too much.

Very cool project.

hashpipers234 · on Aug 6, 2012

I can do everything they can do in MATLAB with your data in less time and with less hassle. My only price is an X-Men comic book and a 6-pack of Coke.

hokua · on Aug 7, 2012

Similar to what Swivel was trying to do. Great idea, nice execution, but really how will you monetize this? There is no real market for consumer grade "intuitive" statistical software. While this will appeal to casual data analyzers, these users aren't ready to spend much money on tools. And those doing data analysis for a living prefer their power tools: R, SAS, Matlab, NumPy, etc.

glaugh · on Aug 7, 2012

Agreed that if you spend most of your day most days doing analysis of large datasets you probably need a power tool.

But there's a whole class of overlooked folks who need to do statistical analysis on smaller datasets on more of a weekly or several-consecutive-intense-days-per-month basis. These folks, who split time between Excel and stats tools, make up a surprisingly high proportion of the user base of stats products. And they tell us they're willing to pay for something that makes their analysis and communication more efficient.

Thanks for the comments, and for the kind words RE the idea and execution.

edit: And to be fair to your point, we're sort of comparing apples and oranges insomuch as you're looking at what we have now (not nearly enough) and we're looking at our roadmap for what we'll have in six months, a year, etc.

jaylevitt · on Aug 7, 2012

What if they hire some stats bloggers and become the next Nate Silver - only with analyses that WE can all interact with and learn from?

What if this could improve statistical literacy?

doleson · on Aug 6, 2012

Are there any plans to add in any realtime feeds? Like, say, weather data and the Dow Jones close to see any correlations?

fywacro · on Aug 6, 2012

Depends on what you mean by realtime. In most cases (stats-wise, anyway) analyzing a non-stored infinite-length data stream is a very different challenge from analyzing a stored, finite-length data set.

Streaming algorithms do exist for many basic statistical measures. But in many other cases, the best streaming algos aren't cheap or accurate enough to be useful.

Bucketing can sometimes substitute for a bona fide streaming algorithm. But again, there's plenty of cases where bucketing won't work well enough to make it useful.

I haven't really looked at Statwing yet--the premise is really tantalizing, though. Gotta find an excuse to throw a spreadsheet in there and see what comes back.
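As an example of a streaming algorithm for basic statistical measures, here is a minimal sketch of Welford's online method for mean and variance. It processes one value at a time and never stores the stream. The class name and sample values are invented for illustration; nothing here reflects how Statwing works.

```python
class RunningStats:
    """Running mean and sample variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def push(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the updated mean.
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 with fewer than two values."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stand-in for a live feed
    stats.push(value)

print(stats.mean, stats.variance)
```

Mean and variance update in constant memory this way. Measures such as exact medians or percentiles do not, which is where approximations or the bucketing mentioned above come in.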

georgek · on Aug 6, 2012

+1. I really like the intuitive interface and the speed with which I can conduct analysis. It would be great to have a library of feeds for each user that is automatically curated / updated. This library could include both public datasets (for free) and proprietary feeds specific to my industry or even my company that are only accessible by me (which I would pay for).

glaugh · on Aug 7, 2012

This would indeed be super cool. Could definitely see us getting to this eventually.

jqueryin · on Aug 7, 2012

While I appreciate the graphs, I'd also like to see the numbers if I hover over wording that says "Very clearly significant". What confidence interval are we talking about? 95%?

If I were you, I'd hide this information from the average user but make it available in a tooltip to those of us who care.

glaugh · on Aug 7, 2012

Good call RE the tooltip.

Just for reference, everything's at 95% confidence. We do mention that in the Advanced output but it's perhaps a bit too hidden.

leeny · on Aug 7, 2012

The optional upgrade survey appears to be broken. After I submit the survey, I get redirected back to the login page with my username in the query string. After I click "login", I get the alert telling me I can take a survey to upgrade. Rinse. Repeat.

dlf · on Aug 9, 2012

I absolutely love this. I'm learning to code (slowly) and have an infatuation with data visualization. I've imagined what something like this might look like, and I think you guys absolutely nailed it. Well done!

dlf · on Aug 9, 2012

P.S. I shared this with the Maxwell School alumni group on LinkedIn, so hopefully that drives some traffic your way! I think that my fellow MPA alums will dig it.

duaneb · on Aug 6, 2012

Very cool. Why should I use this instead of R/gnuplot?

lejohnq · on Aug 6, 2012

Thanks!

We are trying to make Statwing automatically display the right analyses for the portions of your data you are most interested in.

If we can accomplish that, then hopefully we've helped make you faster at understanding the relationships in your data. Maybe that is enough so you don't need to break out R for basic analyses. Otherwise I would also use R. I made some graphs in R that I wouldn't be able to make in Statwing right now, but if we can output the right things based on your data then hopefully you could save some time with us.

leeny · on Aug 9, 2012

I'd like to throw in another vote for prioritizing regressions and specifically adding logistic regressions to the mix. Thanks!

mcarvin · on Aug 6, 2012

demo is very cool. love anything that can make pattern recognition in large datasets this much easier.

fredsters_s · on Aug 6, 2012

looks really awesome. interested to see what, if any, data analysis can be linked to current events.

Flenser · on Aug 7, 2012

"Female tends to have slightly higher values for Neuroticism than Male"