December 30, 2012

Four groups: a visual point of view


This post follows up on my earlier article about the groups identified in my study.

We start with the first map, which colors the first group. This group describes the poorest people, living in the hardest conditions, where we find violent crime problems.
We can see that the group is spread over the entire map. Some states are more strongly represented, such as California, New York, Michigan, Delaware and Kansas.

The group around violent crime

The fourth group is the group of the ideal family: white, with two children, and so on. This group is spread across the United States of America. The most representative states are in the north, while the south shows milder values.

The group of the ideal family

The group of workers. We find this group in every state, with equal representation.

The group of poor workers

The group of managers is really interesting, because its weight is especially high in California, New Jersey, Connecticut and Massachusetts. Idaho is among the least represented states, as are Kentucky, Louisiana, ...

The group of the managers 


Images of violent crimes by state in USA



The first image shows, for each state, the maximum violent crime value among its counties.
The second image shows the mean violent crime value across the counties of each state.
From the maximum violent crime value per county
From the mean violent crime value of the counties
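
The two per-state values behind these maps can be computed with a simple grouping step. The sketch below is only an illustration: the function name, the state codes and the sample values are made up, and the real study may have aggregated the data differently.

```python
from collections import defaultdict

def state_summaries(rows):
    """Group (state, county_value) pairs and return max and mean per state."""
    by_state = defaultdict(list)
    for state, value in rows:
        by_state[state].append(value)
    return {
        state: {"max": max(values), "mean": sum(values) / len(values)}
        for state, values in by_state.items()
    }

# Made-up county values, only to show the shape of the input.
sample_rows = [("CA", 0.30), ("CA", 0.70), ("NY", 0.20), ("NY", 0.40), ("NY", 0.60)]
summaries = state_summaries(sample_rows)
for state, summary in sorted(summaries.items()):
    print(state, summary)
```

The maximum highlights a state as soon as one of its counties has a high value, while the mean smooths such extremes, which is why the two maps can look different.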

Prediction and model estimation

Short summary

To study the data and estimate a good model, we need a scientific approach: the values are segmented into one set used to build the model and another set used to test it.
For some methods, such as neural networks, we also need validation values to decide when to stop training and to check the stability of the model.

The exercise is to extract from the existing data a stable model that predicts the number of violent crimes per 100k inhabitants. The typology of violent crimes is broad: each county, community and country has its own definition. It can include murders, sexual offenses, ...
Several approaches are considered in my study: regression, SVM and neural networks.


Segmentation of the values

The data are split as follows:

  • 1094 values for the model estimation
  • 401 values for the validation of the model
  • 101 values to verify and test the precision of the model
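
A split with these sizes can be sketched as follows. The post does not say how the rows were assigned to each set, so the random shuffle, the fixed seed and the function name are assumptions made for this example.

```python
import random

def split_indices(n_train, n_valid, n_test, seed=0):
    """Shuffle row indices and cut them into three disjoint sets."""
    n_total = n_train + n_valid + n_test
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # assumed: random assignment
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test

# Sizes taken from the list above.
train_idx, valid_idx, test_idx = split_indices(1094, 401, 101)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Keeping the three sets disjoint matters: the test values must never be seen while the model is estimated or while training is stopped on the validation values.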


The first strategies

The data are really hard to separate, which is why the classification used in our strategies may be limited to a few classes.

The first analysis estimates the models in the classical manner.
It is a good baseline for comparison with the data-reduction strategies.
A second analysis cuts the data into two or more classes, each with its own model. To get a more accurate estimate, we can split the model in this way and use an SVM to classify new values automatically.
A third approach removes unnecessary data, such as extreme values or unconnected individuals or variables.
The last study mixes these strategies.
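
The second strategy (cut the data into classes, fit one model per class, route new values with a classifier) can be sketched as follows. This is a simplified illustration and not the method used in the study: a nearest-centroid rule stands in for the SVM, each class model is a one-variable least-squares line, the cut is made on the target value, and the data, the threshold and all names are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x

def fit_two_class_model(xs, ys, threshold):
    """Cut the training data into 2 classes on the target, one line per class.

    Assumes both classes contain at least one training value.
    """
    model = {}
    rules = (("low", lambda y: y < threshold), ("high", lambda y: y >= threshold))
    for label, keep in rules:
        class_x = [x for x, y in zip(xs, ys) if keep(y)]
        class_y = [y for y in ys if keep(y)]
        model[label] = {
            "centroid": sum(class_x) / len(class_x),
            "line": fit_line(class_x, class_y),
        }
    return model

def predict(model, x):
    """Pick the class with the nearest centroid, then apply that class's line."""
    label = min(model, key=lambda name: abs(x - model[name]["centroid"]))
    slope, intercept = model[label]["line"]
    return label, slope * x + intercept

# Made-up training values with two visibly different regimes.
train_x = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
train_y = [0.05, 0.10, 0.15, 0.60, 0.70, 0.80]
model = fit_two_class_model(train_x, train_y, threshold=0.4)
print(predict(model, 0.25))
print(predict(model, 0.85))
```

The classifier is needed because the target is unknown for a new value: the class can only be guessed from the explanatory variables, so a classification error sends the value to the wrong model.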

The groups resulting from my classification study

Short intro

I want to present the final elements resulting from the review of the variables in my study on violent crimes.
Why group variables? It is a fair question: we need to know how variables interact and coexist. Understanding them also makes it possible to apply research strategies such as vectorization or map-reduce.
When I started my study, I was not trying to define groups of people. But the study led me to identify some human groups. By human groups, I mean groups defined by social relationships.
It was a real surprise to me to be able to distinguish social and cultural groups so clearly.
I will let you discover my final analysis.

Results

The first group: grouped around the violent crimes per 100k inhabitants variable

HousVacant, LandArea, LemasPctOfficDrugUn, numbUrban, NumIlleg, NumImmig, NumInShelters, NumStreet, NumUnderPov, PctForreignBorn, PctHousNoPhone, PctIlleg, PctLargHouseFam, PctLargHouseOccup, PctLess9thGrade, PctNotSpeakEnglWell, PctPersDenseHous, PctPopUnderPov, PctRecentImmig, PctRecImmig10, PctRecImmig5, PctRecImmig8, PctVacantBoarded, PctWOFullPlumb, pctWPubAsst, PopDens, population, racepctblack, racePctHisp, ViolentCrimesPerPop.

I know that this group is not easy to read, but I can give some information.
This group describes people living in areas where houses are vacant and/or boarded up, which are highly urban and very dense, with children born to unmarried parents and immigrants, where people do not have a phone, live in large households, receive public assistance and have little schooling. The Black and Hispanic populations are represented.
This group gathers all the main variables around violent crime. Reducing any one of these factors could have a real impact on criminal activity.

The second group: the poor workers

agePct12t21, agePct12t29, agePct16t24, agePct65up, FemalePctDiv, householdsize, indianPerCap, MalePctDivorce, MalePctNevMar, MedOwnCostPctInc, MedOwnCostPctIncNoMtg, MedRentPctHousInc, MedYrHousBuilt, PctEmplManu, PctEmplProfServ, PctHousLess3BR, PctImmigRec10, PctImmigRec5, PctImmigRec8, PctImmigRecent, PctNotHSGrad, PctOccupManu, PctUnemployed, PctUsePubTrans, PctVacMore6Mo, pctWFarmSelf, pctWSocSec, PersPerFam, PersPerOccupHous, PersPerOwnOccHous, PersPerRentOccHous, racePctAsian, TotalPctDiv

This group corresponds to the average population: people who use public transport, do manual work or are unemployed, have no social security, are immigrants, and have no diploma. The Indian and Asian populations are represented.


The third group: the managers

AsianPerCap, blackPerCap, HispPerCap, medFamInc, medIncome, MedNumBR, MedRent, OwnOccHiQuart, OwnOccLowQuart, OwnOccMedVal, PctBSorMore, PctOccuptMgmtProf, perCapInc, RenLowQ, RentHighQ, RentMedian, whitePerCap

This group is interesting because it mixes several races: white, Black, Asian and Hispanic. It is composed of managers who live in their own house or rent one. We can say that this group manages the second group.


The fourth group: the ideal family

PctBornSameState, PctEmploy, PctFam2Par, PctKids2Par, PctSameCity85, PctSameHouse85, PctSameState85, PctSpeakEnglOnly, PctTeen2Par, pctUrban, pctWInvInc, PctWorkMom, PctWorkMomYoungKids, PctWRetire, pctWWage, PctYoungKids2Par, racePctWhite, PctHouseOccup, PctHouseOwnOccup, PctPersOwnOccup

This incredible group is the perfect family as depicted in idealistic literature:
two kids, living in the same area for a long time, speaking English, working or retired rather than unemployed, and white.
The stability of the group in its area supports employment, and television dictates ideas such as two children per family, ...






November 19, 2012

French study

Prediction by mixing strategies

The results of my latest prediction analysis are:


Method / data set                 RMSE      MAE       MSE       ARV

likelihood, gaussian mixture      0.1241    0.087554  0.015401  0.42164
  Full data set                   0.13475   0.098797  0.018157  0.49709
  Full data set cut in 2 classes  0.13686   0.097323  0.01873   0.51276
  Full data set cut in 3 classes  0.13521   0.094228  0.018282  0.50052
  Removed variables               0.13406   0.097274  0.017972  0.49202
  Removed communities             0.12757   0.092739  0.016275  0.44557
  Mixed with 2 classes            0.1241    0.087554  0.015401  0.42164

linear regression                 0.12437   0.087327  0.015467  0.42344
  Full data set                   0.13499   0.099144  0.018222  0.49888
  Full data set cut in 2 classes  0.13763   0.099092  0.018942  0.51857
  Full data set cut in 3 classes  0.13501   0.096144  0.018227  0.49899
  Removed variables               0.134     0.097173  0.017957  0.49161
  Removed communities             0.12747   0.092553  0.016248  0.44483
  Mixed with 2 classes            0.12437   0.087327  0.015467  0.42344

PLS regression 1st                0.12438   0.08572   0.015472  0.42357
  Full data set                   0.13347   0.09774   0.017815  0.48772
  Full data set cut in 2 classes  0.13245   0.094019  0.017542  0.48025
  Full data set cut in 3 classes  0.13047   0.091678  0.017021  0.466
  Removed variables               0.13291   0.09554   0.017665  0.48362
  Removed communities             0.12764   0.091114  0.016292  0.44602
  Mixed with 2 classes            0.12438   0.08572   0.015472  0.42357

PLS regression advanced           0.1207    0.085773  0.01457   0.39888
  Full data set                   0.12743   0.093526  0.016238  0.44455
  Full data set cut in 2 classes  0.12396   0.089755  0.015366  0.42067
  Full data set cut in 3 classes  0.12021   0.087285  0.014451  0.39562
  Removed variables               0.12829   0.094293  0.016458  0.45057
  Removed communities             0.12429   0.088444  0.015448  0.4229
  Mixed with 2 classes            0.1207    0.085773  0.01457   0.39888

SVM Polynomial                    0.12175   0.08589   0.014822  0.40579
  Full data set                   0.12985   0.092377  0.01686   0.46268
  Full data set cut in 2 classes  0.12911   0.088887  0.01667   0.45637
  Full data set cut in 3 classes  0.13302   0.092129  0.017695  0.48444
  Removed variables               0.12925   0.089951  0.016705  0.45733
  Removed communities             0.12797   0.090735  0.017175  0.47019
  Mixed with 2 classes            0.12175   0.08589   0.014822  0.40579

Neural network                    0.11787   0.086258  0.013893  0.40909
  Full data set                   0.11787   0.086258  0.013893  0.40909
  Full data set cut in 2 classes  0.13692   0.10066   0.018747  0.51323
  Full data set cut in 3 classes  0.13393   0.094034  0.017938  0.4911
  Removed variables               0.13351   0.095503  0.017824  0.48797
  Removed communities             0.13552   0.094944  0.018367  0.50283
  Mixed with 2 classes            0.13283   0.097711  0.017645  0.48306
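
The four error measures reported above can be computed as in the sketch below. The post does not define ARV: it is assumed here to be the average relative variance, that is the mean squared error divided by the variance of the observed values. The function name and the sample values are made up for the example.

```python
import math

def regression_errors(observed, predicted):
    """Return RMSE, MAE, MSE and ARV for two equally long lists of values."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_observed = sum(observed) / n
    variance = sum((o - mean_observed) ** 2 for o in observed) / n
    return {
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "MSE": mse,
        "ARV": mse / variance,  # assumed definition, see the note above
    }

# Made-up observed and predicted values.
observed = [0.10, 0.40, 0.25, 0.80]
predicted = [0.15, 0.35, 0.30, 0.70]
scores = regression_errors(observed, predicted)
for name, value in scores.items():
    print(name, round(value, 5))
```

With this definition, an ARV of 1 means the model does no better than always predicting the mean of the observed values, and lower is better. RMSE is the square root of MSE, so the two always rank the models in the same order.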