
Python Data Science and the Rocket MultiValue Database (Part 1 of 3)

Michael Rajkowski

March 21, 2018

In many of my Rocket MultiValue Python presentations, I have mentioned that one of the benefits of Python is the easy access it gives us to Data Science.

There are several Python libraries used for data science.

( see: https://www.datasciencecentral.com/profiles/blogs/9-python-analytics-libraries-1 )

While I do not think I can teach you everything you need to know about Data Science in Python in one blog post, I want to touch on NumPy, the primary Python package for Data Science, and show how it can both draw on and serve a MultiValue Python solution.

To get you started we will cover the following:

  • What is NumPy
  • Simple NumPy Example
  • Moving data between NumPy array and a Rocket MultiValue database

While this may not be enough to have you predicting the future, it will get you started down the exciting road of Data Science.

What is NumPy

NumPy, as described on the numpy.org site:

“NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.”

Simple NumPy Example

Many of the Python packages used for Data Science build on the NumPy package and its multi-dimensional array.

A 1-dimensional array is a list of elements.  In the following example we create a numpy.array from a list of numbers:

python> import numpy as np
python> simple = np.array( [ 11, 99, 55, 95, 44, 66 ] )
python> simple
array([11, 99, 55, 95, 44, 66])

The numpy.array has a lot of methods and functions.  The following are just a few of the aggregate functions, along with the sort method:

python> simple.mean()
61.666666666666664
python> np.median(simple)
60.5
python> np.std(simple)
30.131194614367494
python> simple.sort()
python> simple
array([11, 44, 55, 66, 95, 99])
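Two details of the calls above are easy to miss, so here is a minimal sketch of them. np.std returns the population standard deviation unless you pass ddof=1, and the sort() method reorders the array in place, whereas np.sort returns a sorted copy. The variable names are only for illustration.

```python
import numpy as np

simple = np.array([11, 99, 55, 95, 44, 66])

# np.std defaults to ddof=0, the population standard deviation.
population_std = np.std(simple)

# ddof=1 divides by (n - 1) instead, giving the sample standard deviation.
sample_std = np.std(simple, ddof=1)

# np.sort returns a sorted copy and leaves the original order untouched,
# unlike simple.sort(), which reorders the array in place.
sorted_copy = np.sort(simple)

print(population_std, sample_std)
print(simple)
print(sorted_copy)
```

If the original order matters later, for example because it lines up with another multivalued field, the copy returned by np.sort is the safer choice.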

Moving data between NumPy array and a Rocket MultiValue Database

The common denominator between the u2py.DynArray and the numpy.array is the Python list.  Both objects can be created from a list and/or nested list, and both objects can return a list or nested list.

Expanding on our previous example, we can create a u2py.DynArray from the list representation of the NumPy Array:

python> import u2py
python> rec = u2py.DynArray(simple.tolist())
python> rec
<u2py.DynArray value=b'11\xfe44\xfe55\xfe66\xfe95\xfe99'>

We can also create a NumPy Array from the u2py.DynArray:

python> new_np_array = np.array( rec.to_list() )
python> new_np_array
array(['11', '44', '55', '66', '95', '99'],
dtype='<U2')

Note that the values come back as strings (dtype='<U2').  To ensure we can use our aggregate functions, we should set the dtype as we create the numpy.array:

python> new_np_array = np.array( rec.to_list(), dtype=np.float64 )
python> new_np_array
array([ 11.,  44.,  55.,  66.,  95.,  99.])
python> new_np_array.mean()
61.666666666666664
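Real records may not convert this cleanly. As a rough sketch, assuming a field can contain empty strings that float() cannot parse, a small helper can map them to NaN before the array is built. The helper name to_float_array and the sample list are hypothetical, and the plain Python list stands in for the result of rec.to_list() so the example runs without u2py.

```python
import numpy as np

def to_float_array(values, empty=np.nan):
    """Hypothetical helper: convert a flat list of strings to float64.

    Empty strings are replaced by `empty` rather than being passed to
    float(), which cannot parse them.
    """
    cleaned = [empty if v == "" else float(v) for v in values]
    return np.array(cleaned, dtype=np.float64)

# A plain list standing in for rec.to_list(), with one empty value added.
raw = ["11", "44", "", "66", "95", "99"]
arr = to_float_array(raw)

print(arr)
# np.nanmean skips the NaN entry when averaging.
print(np.nanmean(arr))
```

Whether an empty value should become NaN, zero, or be dropped depends on what the field means in your file, so treat the default here as one option rather than a rule.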

A nested list can be imported into NumPy when it is shaped like a multi-dimensional array, that is, when every sub-list has the same length.

Once loaded, you can view the dimensions of the array by checking its shape:

python> nested_list = [ [ 1, 2, 3 ], [ 4, 5, 6] ]
python> print(nested_list)
[[1, 2, 3], [4, 5, 6]]
python> two_d_np_array = np.array( nested_list )
python> two_d_np_array
array([[1, 2, 3],
[4, 5, 6]])
python> two_d_np_array.shape
(2, 3)

Since the numpy.array is a grid of values, the shape is returned as a tuple which gives the size of each dimension.  Because the numpy.array must be a grid, every row needs the same number of elements, and you must keep this in mind when moving data to and from a MultiValue Dynamic Array.
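To show what the grid shape buys you, here is a short sketch of aggregating along each dimension of the same 2 x 3 array. The variable names are only for illustration.

```python
import numpy as np

two_d = np.array([[1, 2, 3], [4, 5, 6]])

# axis=0 collapses the rows, giving one total per column.
column_totals = two_d.sum(axis=0)

# axis=1 collapses the columns, giving one total per row.
row_totals = two_d.sum(axis=1)

print(two_d.shape)      # (2, 3)
print(column_totals)    # [5 7 9]
print(row_totals)       # [ 6 15]
```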

Consider a simple dynamic array:

python> dynamic_array = u2py.DynArray( [ "Widgets", [ "red", "blue", "green", "yellow", "orange" ], [ 11, 33, 22, 42, 6 ] ] )
python> dynamic_array
<u2py.DynArray value=b'Widgets\xfered\xfdblue\xfdgreen\xfdyellow\xfdorange\xfe11\xfd33\xfd22\xfd42\xfd6'>
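The bytes in the value above show how the nesting is stored: \xfe separates fields and \xfd separates the values within a field. The following sketch splits such a byte string back into a nested list using only plain Python. It is an illustration of the layout, not how u2py does it internally; the function and constant names are hypothetical, and sub-values are ignored.

```python
FIELD_MARK = b"\xfe"   # separates fields
VALUE_MARK = b"\xfd"   # separates values within a field

def marks_to_nested_list(raw):
    """Illustrative only: split a delimited byte string into a nested list."""
    result = []
    for field in raw.split(FIELD_MARK):
        if VALUE_MARK in field:
            # A multivalued field becomes a nested list of strings.
            result.append([v.decode() for v in field.split(VALUE_MARK)])
        else:
            # A single-valued field stays a plain string.
            result.append(field.decode())
    return result

raw = b"Widgets\xfered\xfdblue\xfdgreen\xfdyellow\xfdorange\xfe11\xfd33\xfd22\xfd42\xfd6"
nested = marks_to_nested_list(raw)
print(nested)
```

Note that the numbers come back as strings, and that the fields have different depths: one plain string and two lists of five.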

We are able to create the Dynamic Array from a Python list, where some of the list elements are nested lists.  We cannot do the same with the constructor for the numpy.array which expects the fields to fill a grid.

python> np_array = np.array( dynamic_array.to_list() )
Traceback (most recent call last):
File "<console>", line 1, in <module>
ValueError: setting an array element with a sequence

The above error is thrown because the first element, “Widgets”, does not have the same depth as the nested lists in the other elements of the list.  We can get the data to load by explicitly setting the dtype to object (older NumPy releases also accepted the alias np.object, which has since been removed), which then allows us to create a one-dimensional numpy.array:

python> np_array = np.array( dynamic_array.to_list(), dtype=object )
python> np_array
array(['Widgets', ['red', 'blue', 'green', 'yellow', 'orange'],
['11', '33', '22', '42', '6']], dtype=object)
python> np_array.shape
(3,)

While this gets the data into one numpy.array, you have a bit of work to do before you can use it.  A simple workaround, when working with multivalues (and multisubvalues), is to import only the associated fields of a MultiValue array:

python> widgets_sold = np.array( [ dynamic_array.extract(2).to_list(), dynamic_array.extract(3).to_list() ] )
python> widgets_sold
array([[['red', 'blue', 'green', 'yellow', 'orange']],
[['11', '33', '22', '42', '6']]], dtype='<U6')
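The widgets_sold array above still holds everything as strings and carries an extra level of nesting. One possible next step, sketched here under the assumption that each extracted field arrives as a singly nested list like the output shown, is to flatten each field into its own one-dimensional array and give the quantities a numeric dtype. The plain lists stand in for the extract(...).to_list() results so the example runs without u2py, and the variable names are hypothetical.

```python
import numpy as np

# Plain lists standing in for dynamic_array.extract(2).to_list()
# and dynamic_array.extract(3).to_list().
colors_raw = [["red", "blue", "green", "yellow", "orange"]]
sold_raw = [["11", "33", "22", "42", "6"]]

# ravel() flattens the (1, 5) arrays to shape (5,).
colors = np.array(colors_raw).ravel()

# Convert the quantity strings to integers so aggregates work.
sold = np.array(sold_raw).ravel().astype(np.int64)

total_sold = sold.sum()
best_color = colors[sold.argmax()]

print(colors.shape, sold.shape)
print(total_sold)
print(best_color)
```

Keeping the associated fields in separate arrays of the same length lets each one have its own dtype, while the shared index position still ties a color to its quantity.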

We have just scratched the surface of Python Data Science for the MultiValue Developer.  In part 2, I will delve into building convenience functions in Python to assist with loading and storing MultiValue data into NumPy, and will introduce the Python Pandas package.