There is no reason to choose: you can use both languages, depending on your needs or the stage of processing. All of the steps below can be done in either language, but the effort required and the library support differ from case to case. For example:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt # roughly ~base R plotting functionality
import pandas as pd #roughly ~base R & tidyr functionality
import seaborn as sns #roughly ~ggplot2 functionality
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.filterwarnings('ignore')
# make the plots appear inline and save them in the notebook:
%matplotlib inline
sns.set_context("talk") # seaborn function to make plots according to purpose (talk, paper, poster, notebook)
# We'll show people what versions we use
import datetime
now = datetime.datetime.now().isoformat()
print('Ran on ' + now)
import IPython
print(IPython.sys_info())
!pip freeze | grep -E 'seaborn|matplotlib|pandas|statsmodels'
Matplotlib Version: 2.2.3
NumPy Version: 1.15.2
('Pandas Version:', u'0.23.0')
Python Version: 2.7.15+ (default, Oct 2 2018, 22:12:08)
[GCC 8.2.0]
Ran on 2018-11-14T18:39:29.291564
{'commit_hash': u'a0d6ad545',
'commit_source': 'installation',
'default_encoding': 'UTF-8',
'ipython_path': '/usr/local/lib/python2.7/dist-packages/IPython',
'ipython_version': '5.7.0',
'os_name': 'posix',
'platform': 'Linux-4.18.0-10-generic-x86_64-with-Ubuntu-18.10-cosmic',
'sys_executable': '/usr/bin/python',
'sys_platform': 'linux2',
'sys_version': '2.7.15+ (default, Oct 2 2018, 22:12:08) \n[GCC 8.2.0]'}
matplotlib==2.2.3
pandas==0.23.0
seaborn==0.8.1
statsmodels==0.8.0

Multiple great tutorials already exist on how to handle, clean, and shape data into the right format(s) for your visualizations and analyses in Python. I was helped by the 10 minutes to pandas tutorial and this brief overview, I still regularly consult this cheat sheet, and of course Google/Stack Overflow gives you access to the very bright and helpful Python data science community.
If you want a more thorough explanation of pandas, check this book on Python for data analysis or this Python Data Science Handbook.
Rather than rehash what's in those works, we will learn by example using the data from a Navon experiment with hierarchical letters probing precedence and interference of global processing. For more background and a description of the experiment, see Chamberlain et al. (2017):
"The Navon task was a modified version of a hierarchical letter paradigm (Navon, 1977), designed to reduce the potential influence of factors confounding global processing such as spatial location and variation in shape characteristics. Participants were required to distinguish whether a global letter shape made up of local letters or the local letters themselves were vowels or consonants (Fig. 1). Vowel and consonant letter shapes were kept as comparable as possible. There were 5 consonant types (C, D, F, H, T) and 5 vowel types (A, E, I, O, U). Trial congruency was defined by the category type (vowel/consonant). In some congruent trials the exact letter identity matched between local and global stimulus levels, whilst in all other congruent trials only the category type matched. Presentation location of the test stimulus was randomized on a trial-by-trial basis, in order to eliminate the ability of participants to fixate on local spatial locations to determine global shape. The stimulus was presented in one of four corners of a 100x100-pixel square around central fixation. There were 10 practice trials followed by two blocks of 100 experimental trials.
In alternate blocks whose order was randomized, participants were instructed to either focus on the global letter shape (global selective attention) or on the local letter shapes (local selective attention) and press the ‘J’ key if the letter was a vowel, and the ‘F’ key if the letter was a consonant. Each trial began with a fixation cross presented for 1000 ms. The fixation cross then disappeared and was followed by the experimental stimulus (a white letter shape on a black background). The stimuli were presented for 300 ms followed by a 4 s response window. Feedback was presented in the form of a coloured (red/green) fixation cross which also encouraged central fixation for the next trial. Both accuracy and reaction time(s) were recorded. Stimulus presentation and data collection were controlled using the Psychopy package (Peirce, 2007) and stimuli were created using the MATLAB toolbox GERT (v1.20) (Demeyer & Machilsen, 2012)."
!pwd
!ls data/ #we can use shell commands to see where the data is (ls=dir on windows)
/home/sander/Bureaublad/Dropbox/exp/Reproducible_workflow_tutorial/PythoninPsy CO10 demographicNavonGlobCO10.csv demographicNavonLocCO10.csv CO12 demographicNavonGlobCO12.csv demographicNavonLocCO12.csv
import glob, os  # to work with paths
df = pd.DataFrame()
folders = ['CO10', 'CO12']  # the raw data is in two different folders
for folder in folders:
    sep = ';' if folder == 'CO10' else ','  # the folders use different field separators
    all_files = glob.glob(os.path.join('data', folder, "*.csv"))  # list of individual data files
    df_from_each_file = (pd.read_csv(f, sep=sep, index_col=0) for f in all_files)
    concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
    df = df.append(concatenated_df, ignore_index=True)
df.head()
| acc_mean | acc_raw | acc_std | age | cond | condition | congruence | corrAns | expversion | gender | ... | n | order | participant | resp_raw | rt_mean | rt_raw | rt_std | stimFile | xPos | yPos | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 18.0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | ... | 1 | 18 | 12277 | 'f' | 0,746331811 | 0,746331811 | 0 | Global_C_local_C.png | NaN | -50 |
| 1 | 1 | 1 | 0 | 18.0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | ... | 1 | 41 | 12277 | 'f' | 0,687273145 | 0,687273145 | 0 | Global_C_local_D.png | NaN | 50 |
| 2 | 1 | 1 | 0 | 18.0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | ... | 1 | 46 | 12277 | 'f' | 0,705967069 | 0,705967069 | 0 | Global_C_local_F.png | NaN | -50 |
| 3 | 1 | 1 | 0 | 18.0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | ... | 1 | 78 | 12277 | 'f' | 0,661773503 | 0,661773503 | 0 | Global_C_local_H.png | NaN | -50 |
| 4 | 1 | 1 | 0 | 18.0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | ... | 1 | 74 | 12277 | 'f' | 0,794407368 | 0,794407368 | 0 | Global_C_local_T.png | NaN | -50 |
5 rows × 21 columns
# add demographics (note: `sep` is still ',' from the last loop iteration above)
all_files = glob.glob(os.path.join('data', "*.csv"))
df_from_each_file = (pd.read_csv(f, sep=sep, index_col=0) for f in all_files)
dfdemo = (pd.concat(df_from_each_file, ignore_index=True, axis=0)
          .drop_duplicates(['Subject'], keep='first')  # drop duplicate rows per subject
          .sort_values('Subject'))
dfdemo.head()
| Subject | age | gender | |
|---|---|---|---|
| 0 | 7576 | 19.0 | 1.0 |
| 1 | 10711 | 20.0 | 1.0 |
| 2 | 10771 | 20.0 | 1.0 |
| 3 | 11200 | 19.0 | 1.0 |
| 4 | 11212 | 18.0 | 1.0 |
Note the method chaining in the example above. Most pandas methods return a DataFrame, so another pandas method can be applied directly to the result (the dot formulation). With the pandas-ply library you can also use dplyr-style piping in pandas. Depending on how much you value code readability, performance, or ease of debugging, you'll prefer one style or the other.
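As a minimal illustration of the two styles on a toy frame (not the Navon data), the same summary can be written as a chain of method calls or routed through pandas' built-in `.pipe()`:

```python
import pandas as pd

toy = pd.DataFrame({'subj': [1, 1, 2, 2], 'rt': [0.5, 0.7, 0.6, 0.8]})

# method chaining: each call returns a DataFrame/Series for the next call
chained = (toy.groupby('subj')
              .rt.mean()
              .reset_index()
              .rename(columns={'rt': 'mean_rt'}))

# the same result via .pipe(), which passes the frame to an arbitrary function
def mean_rt(df):
    return (df.groupby('subj').rt.mean()
              .reset_index().rename(columns={'rt': 'mean_rt'}))

piped = toy.pipe(mean_rt)
print(chained.equals(piped))  # True
```

`.pipe()` is most useful when a step is a custom function that doesn't exist as a DataFrame method, so the chain can still read top to bottom.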
print dfdemo.age.value_counts()
print df.groupby(['participant']).age.apply(lambda x: x.iloc[0]).value_counts()
18.0 145 19.0 78 20.0 24 21.0 10 23.0 4 24.0 4 40.0 1 28.0 1 17.0 1 25.0 1 41.0 1 22.0 1 26.0 1 Name: age, dtype: int64 18.0 145 19.0 78 20.0 24 21.0 10 23.0 4 24.0 4 40.0 1 28.0 1 17.0 1 25.0 1 41.0 1 22.0 1 26.0 1 Name: age, dtype: int64
# age column is the same between dfs so we can drop it in the original data before merging
df=df.drop(['age'], axis=1)
df= pd.merge(df, dfdemo, how='left', left_on='participant', right_on='Subject')
df.head()
| acc_mean | acc_raw | acc_std | cond | condition | congruence | corrAns | expversion | gender_x | location | ... | resp_raw | rt_mean | rt_raw | rt_std | stimFile | xPos | yPos | Subject | age | gender_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | left_right | ... | 'f' | 0,746331811 | 0,746331811 | 0 | Global_C_local_C.png | NaN | -50 | 12277 | 18.0 | 1.0 |
| 1 | 1 | 1 | 0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | up_right | ... | 'f' | 0,687273145 | 0,687273145 | 0 | Global_C_local_D.png | NaN | 50 | 12277 | 18.0 | 1.0 |
| 2 | 1 | 1 | 0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | down_right | ... | 'f' | 0,705967069 | 0,705967069 | 0 | Global_C_local_F.png | NaN | -50 | 12277 | 18.0 | 1.0 |
| 3 | 1 | 1 | 0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | left_right | ... | 'f' | 0,661773503 | 0,661773503 | 0 | Global_C_local_H.png | NaN | -50 | 12277 | 18.0 | 1.0 |
| 4 | 1 | 1 | 0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | down_right | ... | 'f' | 0,794407368 | 0,794407368 | 0 | Global_C_local_T.png | NaN | -50 | 12277 | 18.0 | 1.0 |
5 rows × 23 columns
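When doing a left merge like this, it can pay to check that every trial actually found a demographics row. One way (a sketch with hypothetical mini data, using the `indicator` argument of `pd.merge`) is:

```python
import pandas as pd

# hypothetical mini data: participant 3 has no demographics row
trials = pd.DataFrame({'participant': [1, 1, 2, 3]})
demo = pd.DataFrame({'Subject': [1, 2], 'age': [18.0, 19.0]})

merged = pd.merge(trials, demo, how='left',
                  left_on='participant', right_on='Subject',
                  indicator=True)
# rows whose participant had no demographics are flagged 'left_only'
n_unmatched = (merged['_merge'] == 'left_only').sum()
print(n_unmatched)  # 1
```

A nonzero count here would explain missing `age`/`gender` values downstream.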
sns.distplot(df.age.dropna(), kde=False);
print 'Variables:\n', df.columns
print 'variables:\n', df.dtypes
print 'nb of participants:', len(df['participant'].unique())
print 'trials per participant:', len(df) / len(df['participant'].unique()) # approximate: integer division, and counts differ per participant
Variables:
Index([u'acc_mean', u'acc_raw', u'acc_std', u'age_x', u'cond', u'condition',
u'congruence', u'corrAns', u'expversion', u'gender_x', u'location',
u'n', u'order', u'participant', u'resp_raw', u'rt_mean', u'rt_raw',
u'rt_std', u'stimFile', u'xPos', u'yPos', u'Subject', u'age_y',
u'gender_y'],
dtype='object')
variables:
acc_mean int64
acc_raw int64
acc_std int64
age_x float64
cond object
condition object
congruence object
corrAns object
expversion object
gender_x object
location object
n int64
order int64
participant int64
resp_raw object
rt_mean object
rt_raw object
rt_std int64
stimFile object
xPos float64
yPos int64
Subject int64
age_y float64
gender_y float64
dtype: object
nb of participants: 275
trials per participant: 309
# rename variables/columns
df = df.rename(columns={'acc_raw': 'acc', 'rt_raw': 'rt'})
# properly format RTs (replace comma decimal mark with a dot)
df['rt'] = df['rt'].apply(lambda x: str(x).replace(',', '.')).astype('float64')
# mark invalid (negative) RTs as missing
df.loc[df.rt < 0, ['rt']] = np.nan
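An alternative is to let pandas handle the comma decimal mark at load time, via the `decimal` argument of `read_csv` (a sketch on a toy CSV string mimicking the CO10 format, which uses `;` as field separator):

```python
import io
import pandas as pd

# toy csv using ';' as field separator and ',' as decimal mark
raw = io.StringIO(u"participant;rt\n12277;0,746331811\n12277;0,687273145")
toy = pd.read_csv(raw, sep=';', decimal=',')
print(toy.rt.dtype)  # float64 -- no string replacement needed afterwards
```

This avoids the string round-trip entirely, at the cost of having to know each file's decimal convention up front.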
# we have 2 gender columns now, how do they compare?
df.gender_x.value_counts()
F 44800 M 15600 Name: gender_x, dtype: int64
# correct inconsistent values in the gender column ('vrouw' is Dutch for 'female')
df.loc[df['gender_x'] == 'vrouw', 'gender_x'] = 'F'
df.groupby('gender_x')['participant'].nunique()
count 151 unique 2 top F freq 112 Name: gender_x, dtype: object count 60400 unique 2 top F freq 44800 Name: gender_x, dtype: object F 44800 M 15600 Name: gender_x, dtype: int64
# check gender stats
df.groupby('gender_y')['participant'].nunique()
count 151.0 unique 2.0 top 1.0 freq 110.0 Name: gender_y, dtype: float64 count 60400.0 unique 2.0 top 1.0 freq 44000.0 Name: gender_y, dtype: float64 1.0 44000 0.0 16400 Name: gender_y, dtype: int64
# count missing values
df.isnull().sum()
acc_mean 0 acc 0 acc_std 0 age_x 600 cond 0 condition 0 congruence 0 corrAns 0 expversion 60400 gender_x 24800 location 0 n 0 order 0 participant 0 resp_raw 302 rt_mean 0 rt 302 rt_std 0 stimFile 0 xPos 60400 yPos 0 Subject 0 age_y 600 gender_y 24800 dtype: int64
# Save merged data
df.to_csv('dfmerged.csv')
# Load merged data (e.g., to resume the analysis from here later)
df = pd.read_csv('dfmerged.csv', index_col=0)
A common operation is split-apply-combine with groupby(): split the data into groups, apply a function per group, and combine the results. The result can be an aggregate, a transform, or a filter of the original rows.
It helps to start with the result you want, and work backwards from there. If you want to get a single value for each group, use aggregate(), apply() (or one of the shortcuts). If you want to get a subset of the original rows, use filter(). And if you want to get a new value for each original row, use transform(). Check this page for some more advanced groupby recipes. Note that apply() is most flexible because it can implement an aggregate or a transform, and anything in between. Examples follow...
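A minimal toy contrast of the three (hypothetical data, not the Navon set):

```python
import pandas as pd

toy = pd.DataFrame({'grp': list('aabb'), 'x': [1, 2, 3, 5]})
g = toy.groupby('grp')['x']

# aggregate(): one value per group
agg = g.mean()                                                 # a -> 1.5, b -> 4.0
# filter(): a subset of the original rows (whole groups kept or dropped)
kept = toy.groupby('grp').filter(lambda d: d['x'].mean() > 2)  # only group b's rows
# transform(): one value per original row, aligned with the input
centered = toy['x'] - g.transform('mean')                      # x minus its group mean
```

Note how `transform` keeps the original shape, which is what makes it useful for adding per-group standardized columns later on.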
df.groupby(['cond', 'congruence']).acc.mean() #.reset_index()
cond congruence
selAttGlob congruent 0.918967
incongruent 0.899812
selAttLoc congruent 0.936808
incongruent 0.894413
Name: acc, dtype: float64
Insert a column with standardized RT:
zscore = lambda x: (x - x.mean()) / x.std()
df.insert(17, 'Zrt', df.groupby(['participant'])['rt'].transform(zscore))
# proportion of incorrect trials
1-df.acc.mean()
0.08750000000000002
sns.distplot(df.groupby(['participant']).acc.mean(), bins=20);
?sns.distplot
Or just press shift-tab when cursor is on a method/module to view info. Note that you can find all jupyter lab commands in the left panel, under the Commands tab. You can also open a console (file-> view console for notebook) to quickly try out some code before entering them in a cell.
# exclude participants whose mean accuracy is more than 2.5 SD below the group mean
df = df.groupby('participant').filter(
    lambda x: x['acc'].mean() > df.acc.mean() - (2.5 * df.groupby(['participant']).acc.mean().std()))
print len(df.participant.unique())
266
sns.distplot(df.groupby(['participant']).acc.mean(), bins=20);
# discard incorrect trials for RT analyses
dfrt = df[df.acc==1]
sns.distplot(dfrt.rt);
# exclude participants whose mean RT is more than 2.5 SD above the group mean
dfrt = dfrt.groupby('participant').filter(
    lambda x: x['rt'].mean() < df.rt.mean() + (2.5 * df.groupby(['participant']).rt.mean().std()))
print len(dfrt.participant.unique())
258
sns.distplot(dfrt.rt);
len(dfrt)
85200
# trial-level trimming: drop trials slower than the participant's mean + 2.5 SD
dfrt = dfrt[dfrt.rt < dfrt.groupby('participant').rt.transform('mean')
            + 2.5 * dfrt.groupby('participant').rt.transform('std')]
len(dfrt)
82200
sns.distplot(dfrt.rt);
fig, ax = plt.subplots()
conditions = ['congruent', 'incongruent']
for condition in conditions:
    condition_data = dfrt[(dfrt['congruence'] == condition) & (dfrt['cond'] == 'selAttGlob')]['rt']
    sns.kdeplot(condition_data, shade=True, label=condition)
sns.despine()
More ways to explore RT distributions.
# summary statistics of RT per condition
dfrt.groupby(['condition','cond','congruence']).rt.describe()
| count | mean | std | min | 25% | 50% | 75% | max | |||
|---|---|---|---|---|---|---|---|---|---|---|
| condition | cond | congruence | ||||||||
| Cc | selAttGlob | congruent | 9009.0 | 0.542534 | 0.127964 | 0.207480 | 0.461660 | 0.523406 | 0.598791 | 1.582623 |
| selAttLoc | congruent | 9089.0 | 0.556194 | 0.135140 | 0.189890 | 0.466694 | 0.536255 | 0.615781 | 1.714776 | |
| Cv | selAttGlob | incongruent | 8721.0 | 0.559838 | 0.142280 | 0.102924 | 0.469102 | 0.538965 | 0.618073 | 1.845350 |
| selAttLoc | incongruent | 8708.0 | 0.539550 | 0.123220 | 0.068658 | 0.453851 | 0.524752 | 0.606011 | 1.735075 | |
| Vc | selAttGlob | incongruent | 8958.0 | 0.523678 | 0.125208 | 0.069782 | 0.439635 | 0.509356 | 0.581043 | 1.911829 |
| selAttLoc | incongruent | 8651.0 | 0.584175 | 0.142684 | 0.015780 | 0.491742 | 0.563984 | 0.648891 | 1.976961 | |
| Vv | selAttGlob | congruent | 9092.0 | 0.513652 | 0.115375 | 0.229930 | 0.437660 | 0.496987 | 0.570268 | 1.520877 |
| selAttLoc | congruent | 9234.0 | 0.516471 | 0.115800 | 0.199753 | 0.435849 | 0.500862 | 0.574013 | 1.470798 |
g = sns.factorplot(x="condition", y="rt", hue="congruence",
                   col="cond", units='participant', data=dfrt, kind="point");
g.set(xlabel="Condition", yticklabels=[y for y in np.linspace(0,.6,6)]);
# we could define a function for saving plots for use outside of notebook (e.g. in paper)
def save_fig(fig, figname):
    if not os.path.exists('figs'):
        os.makedirs('figs')
    fig.savefig("figs/%s.pdf" % figname, dpi=300)  # high dpi for printing
    fig.savefig("figs/%s.png" % figname, dpi=300)
    fig.savefig("figs/%s.svg" % figname, dpi=300)
dfsum = dfrt.groupby(['participant', 'congruence', 'cond']).rt.mean().reset_index()
g = sns.boxplot(x="cond", y="rt", hue="congruence", data=dfsum);
g.legend(loc="upper left");
save_fig(g.get_figure(),'box')
# interference = mean incongruent RT minus mean congruent RT, per participant and condition
def interference(group):
    return (group[group.congruence == 'incongruent'].rt.mean()
            - group[group.congruence == 'congruent'].rt.mean())
dfIF = dfrt.groupby(['participant', 'cond']).apply(interference).reset_index()
dfIF = dfIF.rename(columns={0: 'interference'})
dfIF.head()
| participant | cond | interference | |
|---|---|---|---|
| 0 | 7576 | selAttGlob | 0.008086 |
| 1 | 7576 | selAttLoc | 0.014542 |
| 2 | 10711 | selAttGlob | 0.013350 |
| 3 | 10711 | selAttLoc | 0.024520 |
| 4 | 10771 | selAttGlob | 0.066292 |
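An equivalent way to get the same difference score without a custom `apply` is to pivot congruent and incongruent means into columns and subtract (a sketch on a hypothetical mini dataset with the same column names):

```python
import pandas as pd

# hypothetical mini dataset: one participant, one attention condition
toy = pd.DataFrame({
    'participant': [1, 1, 1, 1],
    'cond': ['selAttGlob'] * 4,
    'congruence': ['congruent', 'congruent', 'incongruent', 'incongruent'],
    'rt': [0.50, 0.52, 0.60, 0.62],
})
# mean RT per cell, with congruence levels spread into columns
wide = toy.pivot_table(index=['participant', 'cond'],
                       columns='congruence', values='rt')
wide['interference'] = wide['incongruent'] - wide['congruent']
```

For this toy participant the interference is 0.61 - 0.51 = 0.10 s; the pivot form scales to the full data frame unchanged.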
sns.swarmplot(x="cond", y="interference", data=dfIF);
sns.set_context("poster", font_scale=1.3)
sns.set_palette('deep') # options: deep, muted, bright, pastel, dark, colorblind
fig, ax = plt.subplots(figsize=(10, 6))
conditions = ['selAttGlob', 'selAttLoc']
for i, condition in enumerate(conditions):
    condition_data = dfIF[dfIF['cond'] == condition]['interference']
    sns.distplot(condition_data, label=condition)
    ax.axvline(x=condition_data.mean(), linestyle='--', color=sns.color_palette()[i])
# embellish plot
sns.despine()
ax.set_title('Navon interference on reaction times')
ax.set_xlabel("Interference")
# Improve the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, ['local interference', 'global interference'], loc="best");
save_fig(fig, "interference")
For more customization options, see the seaborn tutorials. More flexibility is provided by the matplotlib core: a cheat sheet.
from statsmodels.stats.weightstats import ztest
out = ztest(dfIF[dfIF.cond=='selAttLoc']['interference'], value=0)
print "z-test for global interference different from zero:\nz = ", round(out[0],2), "p = ", round(out[1],4)
out = ztest(dfIF[dfIF.cond=='selAttGlob']['interference'], value=0)
print "z-test for local interference different from zero:\nz = ", round(out[0],2), "p = ", round(out[1],4)
z-test for global interference different from zero: z = 14.92 p = 0.0 z-test for local interference different from zero: z = 8.38 p = 0.0
Luckily these conclusions are consistent with those reported in the paper...
md = smf.mixedlm("rt ~ congruence * cond", dfrt, groups=dfrt["participant"])
mdf = md.fit()
print(mdf.summary())
Mixed Linear Model Regression Results
======================================================================================
Model: MixedLM Dependent Variable: rt
No. Observations: 71620 Method: REML
No. Groups: 258 Scale: 0.0136
Min. group size: 158 Likelihood: 51605.4400
Max. group size: 386 Converged: Yes
Mean group size: 277.6
--------------------------------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 0.529 0.004 142.079 0.000 0.522 0.537
congruence[T.incongruent] 0.013 0.001 10.854 0.000 0.011 0.016
cond[T.selAttLoc] 0.009 0.001 7.125 0.000 0.006 0.011
congruence[T.incongruent]:cond[T.selAttLoc] 0.011 0.002 6.556 0.000 0.008 0.015
groups RE 0.003 0.003
======================================================================================
sns.factorplot(x="condition", y="acc", hue="congruence",
               col="cond", units="participant", data=df, kind="point");
Incidentally, it is possible to use LaTeX in your markdown. For example, drop that Bayes!
$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $
Just kidding, I'm not going to show you how to do a Bayesian analysis on these data...
Statsmodels has an alternative method to deal with clustered data: Generalized Estimating Equations (GEE). GEEs have a few attractive advantages over hierarchical linear models ("mixed models"), for example fewer assumptions about the random effects and a more intuitive interpretation of the coefficients. Especially for discrete dependent variables (i.e., our binary accuracy data), likelihood-based mixed models can show difficult convergence and a lack of robustness to misspecification of the covariance structure. See also: McNeish, D., Stapleton, L. M., & Silverman, R. D. (2016). On the Unnecessary Ubiquity of Hierarchical Linear Modeling. Psychological Methods.
# model formulation
fml = "acc ~ cond * congruence"
# covariance structure
ex = sm.cov_struct.Exchangeable()
# link function
fa = sm.families.Binomial(sm.families.links.logit)
model = sm.GEE.from_formula(fml, "participant", df, cov_struct=ex, family=fa)
result = model.fit()
print(result.summary())
print(result.cov_struct.summary())
GEE Regression Results
===================================================================================
Dep. Variable: acc No. Observations: 82200
Model: GEE No. clusters: 266
Method: Generalized Min. cluster size: 200
Estimating Equations Max. cluster size: 400
Family: Binomial Mean cluster size: 309.0
Dependence structure: Exchangeable Num. iterations: 8
Date: Fri, 25 May 2018 Scale: 1.000
Covariance type: robust Time: 16:22:30
===============================================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------------------------------
Intercept 2.5646 0.049 51.960 0.000 2.468 2.661
cond[T.selAttLoc] 0.2534 0.057 4.446 0.000 0.142 0.365
congruence[T.incongruent] -0.2408 0.048 -4.998 0.000 -0.335 -0.146
cond[T.selAttLoc]:congruence[T.incongruent] -0.3669 0.072 -5.126 0.000 -0.507 -0.227
==============================================================================
Skew: -3.0706 Kurtosis: 7.4768
Centered skew: -2.9520 Centered kurtosis: 7.1066
==============================================================================
The correlation between two observations in the same cluster is 0.023
import rpy2.rinterface
%reload_ext rpy2.ipython
%R -n require(lme4);
%R -n require(tidyr); require(ggplot2)
A killer feature of Jupyter notebooks is magic commands.
%%R
# ^ Tells the notebook that this code cell is actually R
# Now we can write R code as if this were an R notebook
X_vals <- c(1, 2, 3, 4, 5)
y_vals <- c(1, 2, 4, 8, 16)
plot(X_vals, y_vals,
col='purple', pch=12,
main='R plot in Python')
np.random.seed(42)
# Make a pandas DataFrame
testdf = pd.DataFrame(np.random.normal(0,1,size=(100, 3)), columns=list('ABC'))
testdf['C'] = testdf['C'] + 2
testdf.head()
%%R -i testdf
testdf %>%
gather("Category", "X") %>%
ggplot(aes(x = Category, y = X, fill = Category)) +
geom_violin() +
stat_summary(fun.y=mean, color='black', geom='point', size = 3) +
theme_bw()
dfmelted = pd.melt(testdf, value_vars=["A","B","C"], var_name="category", value_name="X")
print dfmelted.head()
dfmelted["xbin"] = dfmelted.X > .5
sns.violinplot(x="category", y="X", data=dfmelted, inner="quart");
category X 0 A 0.496714 1 A 1.523030 2 A 1.579213 3 A 0.542560 4 A 0.241962
Note that we used the pandas melt function to go from a dataset in the wide form (testdf above) to long form (done in R with the melt or gather function). There are many more reshaping and pivoting functions in pandas, which we won't cover. The best way to get a feel for the pandas reshape and pivot functions is to go through this brief visual guide. Further details can be found in the pandas documentation.
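Going back the other way, from long to wide form, is `pivot` (a toy sketch; `pivot_table` would additionally aggregate duplicate index/column combinations):

```python
import pandas as pd

# long form: one row per (observation, category) pair
long_df = pd.DataFrame({'category': list('ABC') * 2,
                        'obs': [0, 0, 0, 1, 1, 1],
                        'X': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
# wide form: one row per observation, one column per category
wide_df = long_df.pivot(index='obs', columns='category', values='X')
print(wide_df.shape)  # (2, 3)
```

Plotting libraries like seaborn mostly want the long form, while some statistics routines (and spreadsheet-style inspection) are easier in the wide form, so moving between the two is routine.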
from plotnine import *
(ggplot(dfmelted, aes(x = 'category', y = 'X', fill = 'category'))
+ geom_violin()
+ stat_summary(fun_y=np.mean, color='black', geom='point', size = 3)
+ theme_bw())
<ggplot: (8743702045645)>
dfreduc=df.iloc[:5000]
dfreduc.head()
| acc_mean | acc | acc_std | cond | condition | congruence | corrAns | expversion | gender_x | location | ... | resp_raw | rt_mean | rt | rt_std | stimFile | xPos | yPos | Subject | age | gender_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | left_right | ... | 'f' | 0,746331811 | 0.746332 | 0 | Global_C_local_C.png | NaN | -50 | 12277 | 18.0 | 1.0 |
| 1 | 1 | 1 | 0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | up_right | ... | 'f' | 0,687273145 | 0.687273 | 0 | Global_C_local_D.png | NaN | 50 | 12277 | 18.0 | 1.0 |
| 2 | 1 | 1 | 0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | down_right | ... | 'f' | 0,705967069 | 0.705967 | 0 | Global_C_local_F.png | NaN | -50 | 12277 | 18.0 | 1.0 |
| 3 | 1 | 1 | 0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | left_right | ... | 'f' | 0,661773503 | 0.661774 | 0 | Global_C_local_H.png | NaN | -50 | 12277 | 18.0 | 1.0 |
| 4 | 1 | 1 | 0 | selAttGlob | Cc | congruent | ['f'] | NaN | F | down_right | ... | 'f' | 0,794407368 | 0.794407 | 0 | Global_C_local_T.png | NaN | -50 | 12277 | 18.0 | 1.0 |
5 rows × 23 columns
%%R
lr.test = function(m1, m2, name){
    print(summary(m1))
    print(summary(m2))
    out = anova(m1, m2)
    chi2 = out$Chisq[2]
    dof = out$"Chi Df"[2]
    p = out$"Pr(>Chisq)"[2]
    test_str = "Likelihood ratio test for %s:\n Chisq(%d) = %.2f; p = %.3g"
    writeLines(sprintf(test_str, name, dof, chi2, p))
}
%%R -i dfreduc
m = glmer(acc ~ cond * congruence + (1 | participant), dfreduc, family=binomial)
m.null = glmer(acc ~ cond + congruence + (1 | participant), dfreduc, family=binomial)
lr.test(m, m.null, "interaction effect of selective attention * congruence")
%%R -i dfrt
m = glmer(rt ~ cond * congruence + (1 | participant), dfrt)
m.null = glmer(rt ~ cond + congruence + (1 | participant), dfrt)
lr.test(m, m.null, "interaction effect of selective attention * congruence")
Aside from the examples, this tutorial is basically a mashup of sections of these sources:
https://github.com/ujjwalkarn/DataSciencePython