Submitted by terrykrohe t3_10mdxdd in dataisbeautiful
Comments
terrykrohe OP t1_j62imh0 wrote
best-fit lines, correlations: state+local ed spending VS evangelical
Purpose
To 'understand' the non-random, top/bottom, Rep/Dem differentiation of metric values, eight "response" metrics are correlated with three "predictor" metrics. This post presents the "response" variable state+local ed spending vs the evangelical "predictor" metric.
... the eight "response" metrics: GDP, state taxes; suicide rate, opioids; life expectancy, infant mortality; incarceration, state+local ed spending
... the three "predictor" metrics: 'rural-urban', evangelical, diversity*
the "big picture"
i) There is a non-random, top/bottom, Dem/Rep pattern. Patterns have reasons/causes and can be described mathematically.
ii) Rep states are always on the negative side (lower GDP, more suicides, lower life expectancy, etc.).
iii) How did 150 million voters, acting individually, separate the fifty states into two such disparate groups?
iv) Is there a "predictive" metric, or combination of metrics, which can be used to explain the characteristic Rep/Dem differences seen in the data?
other comments
i) the t-test value of 0.10 for EdSpending is the largest t-test value of the eight Response metrics – indicating that the data has a 10% probability that the sample means represent the same Population (a sketch of this comparison follows the list)
ii) the Ed Spending metric shows a 'typical' response to the evangelical predictor: increasing evangelical population correlates with decreased ed spending
iii) however, the impact values are small – indicating that the Evangelical-EdSpending relationship is not important
iv) that curious Rep state with the smallest evangelical population? Utah
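A minimal Wolfram Language sketch of the comparison referenced in item i), with made-up per-capita spending values standing in for the real figures from the source below; by default TTest returns a p-value:

    (* hypothetical per-capita state+local ed spending, split by 2020 vote *)
    demSpending = {3900, 3700, 3600, 3500, 3450, 3400, 3350, 3300, 3200, 3100};
    repSpending = {3600, 3450, 3350, 3250, 3200, 3100, 3050, 3000, 2900, 2800};

    (* two-sample t-test comparing the group means; returns the p-value *)
    pValue = TTest[{demSpending, repSpending}]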
coffeesharkpie t1_j62z0bw wrote
I don't get your interpretation of the t-value and the 10% probability. To the best of my understanding, the closer t is to 0, the more likely there isn't a significant difference between the two samples. To get the p-value from the t-value, we would also need the degrees of freedom. And the p-value doesn't tell us the actual probability of a difference; it only tells us how likely data like yours would be, assuming the null hypothesis is true (you'd need Bayesian statistics to get actual probabilities).
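For reference, a small Wolfram Language sketch of the relationship described here: given a t-statistic and its degrees of freedom (both made up below), the two-sided p-value is the tail probability of the Student t-distribution. With these particular inputs the result happens to land near 0.10; the real values would have to come from the actual state data.

    (* hypothetical t-statistic and degrees of freedom *)
    tStat = 1.68;
    df = 48;

    (* two-sided p-value: probability of a |T| at least this large under the null *)
    pValue = 2*(1 - CDF[StudentTDistribution[df], Abs[tStat]])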
terrykrohe OP t1_j633q99 wrote
https://reference.wolfram.com/language/ref/TTest.html
... open "Details and Options"
coffeesharkpie t1_j63k7fy wrote
That doesn't make clear whether you are reporting the t-value or the p-value of the t-test. If it's the t-value, you would at least need to report the related p-value to judge whether the mean difference is statistically significant. If it's the p-value, then depending on the chosen alpha level (commonly .05) and on whether the test is one- or two-sided, the difference is likely not statistically significant because the value is too high. And if it is the p-value, you would also not interpret it directly as the probability that the means differ (at least in a frequentist framework).
malachai926 t1_j63chqv wrote
Showing the first two plots as a sorted scatter plot is kind of an odd way to convey the data. You'd have been better served showing a histogram of sorts, with the fitted distribution for each of your two data sets overlaid. The overlap between those two curves is what gives you the best visual representation of a statistical difference.
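A sketch of the kind of display being suggested, again with made-up spending numbers, overlaying fitted normal curves (as a stand-in for the fitted distributions) on histograms of the two groups so the overlap is visible:

    demSpending = {3900, 3700, 3600, 3500, 3450, 3400, 3350, 3300, 3200, 3100};
    repSpending = {3600, 3450, 3350, 3250, 3200, 3100, 3050, 3000, 2900, 2800};

    Show[
     (* normalized histograms of the two samples *)
     Histogram[{demSpending, repSpending}, Automatic, "PDF"],
     (* fitted normal curves for each group *)
     Plot[{PDF[NormalDistribution[Mean[demSpending], StandardDeviation[demSpending]], x],
           PDF[NormalDistribution[Mean[repSpending], StandardDeviation[repSpending]], x]},
      {x, 2600, 4100}, PlotRange -> All]
    ]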
It's also not really clear how you are classifying the data. Is every data point a state? Are you classifying a state as "Democrat" or "Republican" based on the majority vote for president in some election? This info is necessary to properly interpret your results. If that's what you're doing, it's also kind of an odd analysis, since a state as a whole clearly doesn't represent just one party, not to mention that states with very different populations really ought to be weighted accordingly once they're classified this way. You're destroying the whole concept of "per capita" if this is what you are doing.
The "t-test" number on the top left is confusing. Is that the t-statistic or the p-value? And why doesn't the top right graph have a number, especially when it looks more likely to have a statistical difference?
Your bottom chart has a typo. "Evangelival."
I see you post stuff like this regularly. IMO you ought to clean up your presentation quite a bit and give it more thought. It's kind of a mess.
Signed, a biostatistician
terrykrohe OP t1_j643917 wrote
"Is every data point a state?"
50 states = 50 plot points
"Are you classifying a state as "Democrat" or "Republican" based on majority vote for president in some election?"
red = Rep states in 2020 election
blue = Dem states in 2020 election
"Is that the t-statistic or the p-value?
t-tests are usually reported using the p-value
"And why doesn't the top right graph have a number, especially when it looks more likely to have a statistical difference?"
... the t-test is sensitive to small mean variations: the top right plot shows the means separated by one SD, which is NOT a small difference (t-test = 0.000015).
malachai926 t1_j64oqam wrote
To be frank, it's just poor presentation. Statisticians like myself will see lots of problems with this. If I am confused, I guarantee that the layperson will be even more so.
>red = Rep states in 2020 election
blue = Dem states in 2020 election
Even here, you aren't being clear enough. Are they "republican" because their votes for president in the 2020 election were majority in favor of the Republican candidate? Republican because they elected more Republican House congresspeople / senators? I can infer that you're likely referring to the electoral college result, but when people have to infer what you mean with your data, that's just bad practice that is bound to get you in trouble in the future.
>t-tests are usually reported using the p-value
Not always, no. A lot of published research will tell you both the t-statistic AND the p-value. If you're giving us a p-value, you should say it's a p-value, end of story.
>the t-test is sensitive to small mean variations: the top right plot shows the means separated by a SD, which is NOT a small difference ( t-test = 0.000015).
That's great, but why didn't you state that result in the graph? And again, don't say "t-test equals", at least say "t-test p-value equals". It's nonsense to say that a test equals something. The test generates a statistic and a p-value which equal something, but the test itself is a test. It pays to be explicit with what you are saying, or else other statisticians could misinterpret what you are saying. In this case, if someone thought you meant the t-statistic was 0.000015, that would mean the results were highly non-significant, and they would think you screwed up your calculation.
You seem to have some idea in your mind of how things are "typically" interpreted by various groups of people, but you should NOT rely on those assumptions because inevitably someone will interpret gray area in a way you didn't intend. It is always far, far preferable to be as explicit as you can with your definitions of things.
Again I think showing this as a sorted scatterplot is just weird. You really ought to show this data as a histogram. You're using a t-test, yeah? So it's really incumbent on you to demonstrate that the data is roughly distributed the way the t-test assumes, to prove to your audience that such a test is acceptable. A histogram achieves that; this scatterplot does not.
Finally, maybe it's just me, but grouping these things together on a state level just feels like you're losing so much detail and misclassifying so much data that I really question the validity of your results. Maybe this is the best you have to work with, but you are classifying a state that went 51% in favor of the Democrat as 100% Democratic, and vice versa. That then classifies every single school district in that state, including the likely numerous rural school districts where people are more likely to be conservative, as a "Democratic" school district contributing however much money it contributed towards education. You'd get much more robust data, and far less of this kind of error, if you were able to get this data by school district. If you don't have that data, it is what it is, but the end result is that I'll weigh everything I said here, think "eh, this is kinda just bad analysis and is meaningless", and disregard it. And I imagine you wouldn't want the analysis you spent all of this time and effort on to be disregarded, yeah?
terrykrohe OP t1_j65asln wrote
1
I do not think that the "lay person" has trouble understanding the presentation:
i) Dem state residents spend $300 more per person on education than do Rep state residents
ii) Rep states are more evangelical than are Dem states
iii) for both Dem and Rep states, as the evangelical % increases, the state+local ed spending decreases
2
I do not think that the "lay person" mis-understands why a state is labelled Rep or Dem (note the "2020 election" in title)
3
I do not think that the "lay person" cares about the t-test reporting (the issue is a "tempest in a tea-pot"). I have never had a non-"lay person" ask if the t-test is the statistic or the p-value; the Mathematica documentation notes By default, a probability value or p-value is returned.
4
The data is a visualization of the tabular data presented by the source, produced with the Mathematica function "ListPlot":
https://reference.wolfram.com/language/ref/ListPlot.html
5
... you object to the "grouping at the state level": that is how the source presents the data.
6
There is NO analysis being done here; just data presentation. Inferences are the Reader's prerogative.
malachai926 t1_j65vrxn wrote
>I have never had a non-"lay person" ask if the t-test is the statistic or the p-value
Not everyone is as thorough as I am. You seem to have a strong interest in statistics, based on the content you typically post, and if you want to succeed in the field of statistics and get noticed, you'll have to start cleaning up your presentation.
>The data is a visualization of tabular data presented by the source. The data is visualized using the Mathematica function"ListPlot":
Then you really ought to use a different function. This is just a strange way of presenting your data. You've posted very similar types of graphs here often, and it seems like they don't get much of a response. The strange presentation is probably why.
>There is NO analysis being done here; just data presentation. Inferences are the Reader's prerogative.
A t-test is analysis.
terrykrohe OP t1_j66kodi wrote
... yeah, the t-test is analysis
It validates the separation of Rep and Dem states into distinct Sample populations, which permits the bottom plot: correlation of ed spending vs evangelical %, considered for Rep and Dem states separately.
(that "NO" violated the "avoid absolutes" dictum)
terrykrohe OP t1_j62i7ct wrote
sources
state+local ed spending
https://www.usgovernmentspending.com/compare_state_spending_2019b20a#copypaste
evangelical population
https://www.pewforum.org/religious-landscape-study/religious-tradition/evangelical-protestant/
tool: Mathematica
***************
top two plots:
– dashed lines are the mean values; the 'boxes' show one standard deviation from the mean
– "(3400 ± 630 (18%)" represents (mean ± 1 SD (relative SD); "relative SD" = SD/mean
bottom plot:
– the ellipses are centered on the Rep/Dem means; the standard deviations are represented by the ellipses' axes
– the 50 plot points represent the (evangelical, state+local ed spending) coordinates for each state and are colored according to their 2020 Electoral College vote
– "r" is the Pearson correlation value
– the lines are the 'best-fit' lines through the Dem and Rep data
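A sketch of how the per-party pieces of the bottom plot can be computed, with made-up (evangelical %, ed spending) pairs standing in for one party's states:

    (* hypothetical (evangelical %, per-capita ed spending) pairs for one group of states *)
    groupData = {{35, 3100}, {30, 3200}, {28, 3250}, {40, 2950}, {25, 3300}, {33, 3150}};

    r = Correlation[groupData[[All, 1]], groupData[[All, 2]]]  (* Pearson correlation *)
    fit = Fit[groupData, {1, x}, x]                            (* best-fit line for the group *)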