In this post, let's look at pandas, one of the most widely used tools in data science, built on top of NumPy. Pandas is used to clean, manipulate, and analyze data. It has built-in plotting methods for drawing graphs, and it lets you build a table (a DataFrame) with rows and columns.
Applications of pandas:
1. Pandas is used in statistics and neuroscience.
Pandas presents data in the form of tables. It is built mainly around two data structures: Series and DataFrame.
To install pandas (in your command prompt):
pip install pandas
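What is a Series?
A Series is like a single column of a table: a one-dimensional array of values, each with an index label. Its constructor takes parameters such as data, index, dtype, name, and copy. Make sure you write Series with a capital S to avoid errors. The example below prints a Series; np.nan represents a missing (null) value.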
import numpy as np
import pandas as pd
x = pd.Series(['A','B','C',np.nan,'D'])
print(x)
output:
0 A
1 B
2 C
3 NaN
4 D
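As a small sketch of the index and name parameters mentioned above, the Series below uses its own labels instead of the default 0 to 4. The labels and the name 'letters' are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and name, chosen only for illustration.
s = pd.Series(['A', 'B', 'C', np.nan, 'D'],
              index=['v', 'w', 'x', 'y', 'z'],
              name='letters')

print(s['w'])          # look up a value by its label
print(s.isna().sum())  # count the missing (NaN) entries
print(s.dropna())      # a copy with the missing entries removed
```

Here isna() and dropna() are the usual ways to find and drop the np.nan entries.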
DataFrame in pandas:
import numpy as np
import pandas as pd
# values for the single column
x = ['A','B','C','D','E','F','G','H','I','J']
# row labels (the index), starting at 1 instead of the default 0
y = [1,2,3,4,5,6,7,8,9,10]
df = pd.DataFrame(data=x, index=y, columns=["i"])
print(df)
output:
i
1 A
2 B
3 C
4 D
5 E
6 F
7 G
8 H
9 I
10 J
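Once the DataFrame exists, individual values can be read back by label or by position. This is a minimal sketch that rebuilds the same table as above; shape, loc, and iloc are standard pandas accessors.

```python
import pandas as pd

x = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df = pd.DataFrame(data=x, index=y, columns=["i"])

print(df.shape)        # (number of rows, number of columns)
print(df.loc[3, "i"])  # value at index label 3 in column "i"
print(df.iloc[0, 0])   # value at the first row and first column, by position
```

Note that loc uses the labels from y, so label 3 is the third row here, while iloc always counts positions from 0.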
To print a date range using pandas:
import pandas as pd
d = pd.date_range('20210601',periods=10)
print(d)
output:
DatetimeIndex(['2021-06-01', '2021-06-02', '2021-06-03', '2021-06-04',
'2021-06-05', '2021-06-06', '2021-06-07', '2021-06-08',
'2021-06-09', '2021-06-10'],
dtype='datetime64[ns]', freq='D')
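The step between dates does not have to be one day. As a rough sketch, the freq parameter sets the step, and a range can also be given by its start and end dates. The variable names weekly and span are only for this example.

```python
import pandas as pd

# Same start date as above, but stepping seven days at a time.
weekly = pd.date_range('20210601', periods=4, freq='7D')
print(weekly)

# A range can also be described by its start and end dates (both included).
span = pd.date_range(start='2021-06-01', end='2021-06-05')
print(len(span))
```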
Conversion of a dictionary to a DataFrame:
import numpy as np
import pandas as pd
print('---------------------------------')
# each key becomes a column name; a scalar value is repeated for every row
cf = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['a', 'b', 'c', 'd'],
                   'c': pd.Series('ii', index=range(4)),
                   'D': np.array([5] * 4),
                   'E': 'techlanguage'})
print(cf)
output:
A B c D E
0 1 a ii 5 techlanguage
1 2 b ii 5 techlanguage
2 3 c ii 5 techlanguage
3 4 d ii 5 techlanguage
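Each column of this DataFrame can hold a different type. A quick way to check, sketched here on the same dictionary, is to look at dtypes and shape.

```python
import numpy as np
import pandas as pd

cf = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['a', 'b', 'c', 'd'],
                   'c': pd.Series('ii', index=range(4)),
                   'D': np.array([5] * 4),
                   'E': 'techlanguage'})

print(cf.dtypes)  # one data type per column
print(cf.shape)   # (4, 5): four rows, five columns
```

The exact dtype names printed for the text columns can vary between pandas versions, so they are not listed here.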
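How to view data in pandas?
Let's create a DataFrame first, and then see how to view its data. The values come from np.random.randn, so the numbers change on every run; that is why the outputs in the sections below do not match each other.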
import numpy as np
import pandas as pd
d = pd.date_range('20210601',periods=10)
print('----------------------------------')
cf=pd.DataFrame(np.random.randn(10,4),index=d,columns=['A','B','C','D'])
print(cf)
output:
-----------------------------------------------
A B C D
2021-06-01 0.196218 0.692613 -0.516274 -1.166615
2021-06-02 -0.757079 -0.875631 1.412703 1.272985
2021-06-03 0.734642 -0.081268 -0.017365 0.635054
2021-06-04 -0.021272 -1.020569 0.268014 0.750294
2021-06-05 1.154764 0.216445 0.916505 -1.017789
2021-06-06 0.778009 -1.151132 1.177456 -0.464986
2021-06-07 -0.051266 -1.401961 -1.232411 -0.552707
2021-06-08 -0.740522 -0.052965 -0.124883 -0.009026
2021-06-09 0.390379 1.128178 -0.793724 1.340582
2021-06-10 -1.101693 -0.365774 -1.188283 -0.644097
head() and tail():
Used to print the first five rows and the last five rows (five is the default).
print('---------------------------------')
print(cf.head())
print('---------------------------------')
print(cf.tail())
output:
-----------------------------------------------
A B C D
2021-06-01 -0.352654 -1.545496 0.241450 -0.718301
2021-06-02 0.145778 -0.193142 1.254713 1.296125
2021-06-03 -2.447800 -0.231696 0.881387 1.253856
2021-06-04 0.642593 0.915895 0.483134 -0.620165
2021-06-05 -0.389417 0.464068 -0.950464 1.234421
-----------------------------------------------
A B C D
2021-06-06 0.001256 0.633361 1.955822 1.491964
2021-06-07 3.114453 0.284696 -0.694232 -1.950772
2021-06-08 0.124894 0.805616 0.534558 -0.415105
2021-06-09 0.814426 -0.831956 0.103323 -0.487111
2021-06-10 -0.483753 -1.583053 -0.958189 0.255926
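Both methods also accept a row count if five is not what you want. This is a small sketch; it uses a seeded NumPy generator (an assumption of this example, not part of the code above) so the data is the same on every run.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only to make the example repeatable
d = pd.date_range('20210601', periods=10)
cf = pd.DataFrame(rng.standard_normal((10, 4)), index=d,
                  columns=['A', 'B', 'C', 'D'])

first_three = cf.head(3)  # first three rows
last_two = cf.tail(2)     # last two rows
print(first_three)
print(last_two)
```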
index:
Used to print all the index values (the row labels).
print(cf.index)
output:
DatetimeIndex(['2021-06-01', '2021-06-02',
'2021-06-03', '2021-06-04',
'2021-06-05', '2021-06-06',
'2021-06-07', '2021-06-08',
'2021-06-09', '2021-06-10'],
dtype='datetime64[ns]', freq='D')
columns:
Used to print all the column names.
print('--------------------------------')
print(cf.columns)
output:
----------------------------------------
Index(['A', 'B', 'C', 'D'], dtype='object')
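The index and columns attributes are not just for printing; they behave like sequences. The sketch below, with variable names invented for this example, turns them into ordinary Python values.

```python
import numpy as np
import pandas as pd

d = pd.date_range('20210601', periods=10)
cf = pd.DataFrame(np.random.randn(10, 4), index=d,
                  columns=['A', 'B', 'C', 'D'])

col_names = list(cf.columns)  # plain Python list of the column names
first_label = cf.index[0]     # first row label, a pandas Timestamp
n_rows, n_cols = len(cf.index), len(cf.columns)
print(col_names, first_label, n_rows, n_cols)
```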
describe():
Used to show summary statistics such as count, mean, standard deviation, min, max, and quartiles.
print('---------------------------------')
print(cf.describe())
output:
-----------------------------------------------
A B C D
count 10.000000 10.000000 10.000000 10.000000
mean 0.062032 -0.376059 -0.004030 0.161761
std 0.746521 0.777607 0.913673 1.110062
min -0.961250 -1.373856 -1.960989 -2.219939
25% -0.583649 -0.725754 -0.403223 -0.289852
50% 0.169691 -0.464967 0.125751 0.443269
75% 0.517888 -0.179336 0.684998 0.831990
max 1.119663 1.432168 1.092677 1.559623
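The result of describe() is itself a DataFrame, so single statistics can be picked out by label. Here is a minimal sketch on a tiny hand-made table (the name small and its values are only for illustration), where the numbers are easy to check by hand.

```python
import pandas as pd

small = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]})
summary = small.describe()

print(summary.loc['mean', 'A'])  # same value as small['A'].mean()
print(summary.loc['50%', 'A'])   # the median, same as small['A'].median()
print(summary.loc['max', 'A'])   # the largest value
```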
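Sorting:
sort_index() sorts by labels and sort_values() sorts by the values in a column. Here axis=1 with ascending=False puts the column labels in descending order (D, C, B, A), and by='A' sorts the rows by the values in column A, smallest first.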
print(cf.sort_index(axis=1,ascending=False))
print('---------------------------------')
print(cf.sort_values(by='A'))
output:
D C B A
2021-06-01 -0.015767 -0.477356 0.906313 0.740205
2021-06-02 -0.328280 -0.558549 -1.070898 -0.701623
2021-06-03 -0.214263 0.835058 -1.905962 0.078338
2021-06-04 0.417488 -0.433438 0.575620 0.516499
2021-06-05 1.407020 -1.363445 1.755119 -0.709050
2021-06-06 2.684388 -1.008185 -0.156115 0.096735
2021-06-07 0.358821 -0.096272 -0.971703 0.204794
2021-06-08 1.111937 -0.706687 -1.402163 -1.127931
2021-06-09 -1.100631 -0.308159 -0.263300 1.796101
2021-06-10 -1.529672 0.703853 -0.820730 -0.542098
-----------------------------------------------
A B C D
2021-06-08 -1.127931 -1.402163 -0.706687 1.111937
2021-06-05 -0.709050 1.755119 -1.363445 1.407020
2021-06-02 -0.701623 -1.070898 -0.558549 -0.328280
2021-06-10 -0.542098 -0.820730 0.703853 -1.529672
2021-06-03 0.078338 -1.905962 0.835058 -0.214263
2021-06-06 0.096735 -0.156115 -1.008185 2.684388
2021-06-07 0.204794 -0.971703 -0.096272 0.358821
2021-06-04 0.516499 0.575620 -0.433438 0.417488
2021-06-01 0.740205 0.906313 -0.477356 -0.015767
2021-06-09 1.796101 -0.263300 -0.308159 -1.100631
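Sorting is easier to follow on a table small enough to check by eye. The sketch below uses a made-up DataFrame called scores; it also shows that both methods return a new sorted copy and leave the original unchanged.

```python
import pandas as pd

# Hypothetical data, deliberately out of order.
scores = pd.DataFrame({'B': [3, 1, 2], 'A': [30, 10, 20]},
                      index=['z', 'x', 'y'])

by_row_label = scores.sort_index()        # rows ordered x, y, z
by_col_label = scores.sort_index(axis=1)  # columns ordered A, B
by_value_desc = scores.sort_values(by='A', ascending=False)  # largest A first

print(by_row_label)
print(by_col_label)
print(by_value_desc)
print(scores)  # the original is still in its starting order
```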
In the next post, let's look at slicing in pandas. I hope this was clear; feel free to comment!