Mapping Files

PyPads uses mapping files to define which functions should be logged. These files are written in YAML (YAML Ain’t Markup Language), a human-readable data serialization language. Features such as comments and anchors make YAML well suited to this purpose.

A mapping file is divided into three parts: metadata, fragments and mappings. Each part is explained in detail below. The following excerpts show two possible mapping files. The keras file uses the implicit syntax for path matchers, marked by a leading :, while the sklearn file shows how to use YAML typing with !!python/pPath, !!python/rSeg or !!python/pSeg.

metadata:
  author: "Thomas Weißgerber"
  version: "0.1.0"
  library:
    name: "keras"
    version: "2.3.1"

mappings:
  :keras.metrics.Metric:
    hooks: ["pypads_metric"]
    data:
      concepts: ["keras classification metrics"]

  :keras.engine.training.Model:
    :__init__:
      hooks: ["pypads_init"]
    :{re:(fit|fit_generator)$}:
      hooks: ["pypads_fit"]
    :predict_classes:
      hooks: ["pypads_predict"]

metadata:
  author: "Thomas Weißgerber"
  version: "0.1.0"
  library:
    name: "sklearn"
    version: ">= 0.19.1"

fragments:
  default_model:
    !!python/pPath __init__:
      hooks: "pypads_init"
    !!python/rSeg (fit|.fit_predict|fit_transform)$:
      hooks: "pypads_fit"
    !!python/rSeg (fit_predict|predict|score)$:
      hooks: "pypads_predict"
    !!python/rSeg (fit_transform|transform)$:
      hooks: "pypads_transform"

mappings:
  !!python/pPath sklearn:
    !!python/pPath base.BaseEstimator:
      ;default_model: ~
      data:
        concepts: ["algorithm"]
    !!python/pPath metrics.classification:
      !!python/rSeg .*:
        hooks: "pypads_metric"
        data:
          concepts: ["Sklearn provided metric"]
    !!python/pPath tree.tree.DecisionTreeClassifier:
      ;default_model: ~
      data:
        name: decision tree classifier
        other_names: []
        type: Classification
        hyper_parameters:
          model_parameters:
            - name: split_quality
              kind_of_value: "{'gini', 'entropy'}"
              optional: 'True'
              description: The function to measure the quality of a split.
              default_value: "'gini'"
              path: criterion
            - name: splitting_strategy
              kind_of_value: "{'best', 'random'}"
              optional: 'True'
              description: The strategy used to choose the split at each node.
              default_value: "'best'"
              path: splitter
            - name: max_depth_tree
              kind_of_value: integer
              optional: 'True'
              description: The maximum depth of the tree.
              default_value: None
              path: max_depth
            - name: min_samples_split
              kind_of_value: "{integer, float}"
              optional: 'True'
              description: The minimum number of samples required to split an internal node.
              default_value: '2'
              path: min_samples_split
            - name: min_samples_leaf
              kind_of_value: "{integer, float}"
              optional: 'True'
              description: The minimum number of samples required to be at a leaf node.
              default_value: '1'
              path: min_samples_leaf
            - name: min_weight_fraction_leaf
              kind_of_value: float
              optional: 'True'
              description: The minimum weighted fraction of the sum total of weights (of all
                the input samples) required to be at a leaf node.
              default_value: '0'
              path: min_weight_fraction_leaf
            - name: max_features
              kind_of_value: "{integer, float, 'auto', 'sqrt', 'log2', None}"
              optional: 'True'
              description: The number of features to consider when looking for the best split.
              default_value: None
              path: max_features
            - name: random_state
              kind_of_value: "{integer, RandomState instance, None}"
              optional: 'True'
              description: The seed of the pseudo random number generator to use when shuffling
                the data. If int, random_state is the seed used by the random number generator;
                If RandomState instance, random_state is the random number generator; If None,
                the random number generator is the RandomState instance used by np.random.
              default_value: None
              path: random_state
            - name: max_leaf_nodes
              kind_of_value: integer
              optional: 'True'
              description: Grow a tree with max_leaf_nodes in best-first fashion.
              default_value: None
              path: max_leaf_nodes
            - name: min_impurity_decrease
              kind_of_value: float
              optional: 'True'
              description: A node will be split if this split induces a decrease of the impurity
                greater than or equal to this value.
              default_value: '0'
              path: min_impurity_decrease
            - name: class_weight
              kind_of_value: "{dict, list of dicts, 'balanced', None}"
              optional: 'False'
              description: Weights associated with classes.
              default_value: None
              path: class_weight
          optimisation_parameters:
            - name: presort
              kind_of_value: "{boolean, 'auto'}"
              optional: 'True'
              description: Whether to presort the data to speed up the finding of best splits
                in fitting.
              default_value: "'auto'"
              path: presort
          execution_parameters: []

Metadata

The metadata part contains the author, the version of the mapping file and information about the library. The mapping file version is required so that a change in tracking behavior can be traced back to a specific version of the mapping file. Even for the same library version, a user can modify the mapping file to track additional functions of the library or to remove tracking from others. Such changes have to be recorded to keep experiment tracking reproducible, and PyPads does this by versioning the mapping file. The library key holds the name and the version of the library that the mapping file addresses. With this information PyPads can track different versions of a library without conflicts.

metadata:
  author: "Thomas Weißgerber"
  version: "0.1.0"
  library:
    name: "sklearn"
    version: "0.19.1"
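
How PyPads compares the library version declared in the metadata with the installed one is not described here. The following Python sketch only illustrates the idea for a specifier such as ">= 0.19.1". The helpers parse_version and satisfies are hypothetical names chosen for this example and are not part of the PyPads API.

```python
import operator
import re

# Comparison operators a version specifier such as ">= 0.19.1" may start with.
_OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """Turn "0.19.1" into (0, 19, 1). Only plain numeric versions are handled."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(installed, spec):
    """Check an installed version against a specifier (hypothetical helper).

    A bare version without an operator is treated as an equality check.
    """
    match = re.match(r"\s*(>=|<=|==|>|<)?\s*([\d.]+)\s*$", spec)
    if match is None:
        raise ValueError(f"unsupported version specifier: {spec!r}")
    compare = _OPERATORS[match.group(1) or "=="]
    return compare(parse_version(installed), parse_version(match.group(2)))


# Metadata as it would look after parsing the sklearn excerpt into a dict.
metadata = {
    "author": "Thomas Weißgerber",
    "version": "0.1.0",
    "library": {"name": "sklearn", "version": ">= 0.19.1"},
}

print(satisfies("0.22.1", metadata["library"]["version"]))  # True
print(satisfies("0.18.0", metadata["library"]["version"]))  # False
```

The sketch deliberately ignores pre-release suffixes and compound specifiers. A real implementation would more likely rely on a dedicated version-parsing library.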

Fragments

Repeated patterns in a library can be collected in the fragments section of the mapping file. Fragments allow users to link functions across classes. In scikit-learn, for example, fit is the function for fitting an estimator, and all classification and regression estimators have one. In such a scenario, the user does not have to write a mapping for every single estimator. Instead, the user adds the function to the fragments part once and PyPads will automatically log those functions.

fragments:
  default_model:
    !!python/pPath __init__:
      hooks: "pypads_init"
    !!python/rSeg (fit|.fit_predict|fit_transform)$:
      hooks: "pypads_fit"
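
A fragment reference such as ;default_model: ~ in the sklearn excerpt can be pictured as a copy-and-merge step. The sketch below works on plain Python dicts with simplified string keys. The function expand_fragments and its treatment of the ; prefix are assumptions made for illustration and do not claim to reproduce how PyPads resolves fragments internally.

```python
import copy


def expand_fragments(node, fragments):
    """Return a copy of `node` in which every ";name" key is replaced by the
    entries of the fragment called `name` (hypothetical resolution rule)."""
    if not isinstance(node, dict):
        return node
    expanded = {}
    for key, value in node.items():
        if isinstance(key, str) and key.startswith(";"):
            # Merge the fragment's entries; keys set explicitly on the node win.
            fragment = copy.deepcopy(fragments[key[1:]])
            for frag_key, frag_value in fragment.items():
                expanded.setdefault(frag_key, frag_value)
        else:
            expanded[key] = expand_fragments(value, fragments)
    return expanded


fragments = {
    "default_model": {
        "__init__": {"hooks": "pypads_init"},
        "(fit|fit_transform)$": {"hooks": "pypads_fit"},
    }
}

mappings = {
    "sklearn": {
        "base.BaseEstimator": {
            ";default_model": None,
            "data": {"concepts": ["algorithm"]},
        }
    }
}

resolved = expand_fragments(mappings, fragments)
for key, value in resolved["sklearn"]["base.BaseEstimator"].items():
    print(key, "->", value)
```

After expansion, the BaseEstimator entry carries the hooks of the fragment next to its own data key, which is the effect the fragments section is meant to achieve.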

Mappings

This part of the mapping file tells PyPads which functions to track. In the example, we use the sklearn base estimator to cover all logging functionality from a single point. The user can add further classes, as shown with the DecisionTreeClassifier. In that case the user also has to provide all hyperparameters so that PyPads knows what to track. For each hyperparameter the user provides its name, the kind of value, whether it is optional, a description, the default value and its path.

Concepts

PyPads mapping files contain keys called concepts. A main key in the mappings section can describe anything, such as a metric, a dataset, a splitting strategy or an algorithm. The concepts key within the main key links it to predefined categories such as metric, dataset or algorithm. This helps PyPads recognize what type the main key is and how to process it.

Notations

PyPads accepts different notations through the YAML parser. Users can use regular expressions to specify groups of functions that should trigger specific events. In the example below, we hook all functions in sklearn.metrics.classification to “pypads_metric”. Using the concepts key, we also tell PyPads that all of these functions are sklearn-provided metrics.

mappings:
  !!python/pPath sklearn:
    !!python/pPath metrics.classification:
      !!python/rSeg .*:
        hooks: "pypads_metric"
        data:
          concepts: ["Sklearn provided metric"]
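
To see which function names a regular-expression segment would select, the pattern can be tried out with Python's re module. Whether PyPads applies a pattern with re.match, re.search or re.fullmatch is not stated here. The sketch assumes re.match, which anchors the pattern at the start of the name, and the helper matching_names is a hypothetical name used only in this example.

```python
import re


def matching_names(pattern, names):
    """Return the names matched by `pattern`, anchored at the start (assumption)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.match(name)]


candidates = ["fit", "fit_generator", "fit_transform", "predict", "refit"]

# Pattern from the keras excerpt: the trailing $ rules out "fit_transform".
print(matching_names(r"(fit|fit_generator)$", candidates))
# prints ['fit', 'fit_generator']

# Catch-all pattern used for the functions in sklearn.metrics.classification.
print(matching_names(r".*", ["accuracy_score", "f1_score"]))
# prints ['accuracy_score', 'f1_score']
```

The trailing $ matters: without it, the keras pattern would also select fit_transform, because fit alone already matches the beginning of that name.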

Adding a new mapping file

Users who want to add their own mapping file have to follow these steps:

1. Create a YAML mapping file in the path pypads/bindings/resources/mapping with an appropriate name and version number.
2. Add a metadata part containing the author, the version of the mapping file and the library information.
3. Add fragments if general function names are present. Regular expressions can be used to specify the patterns.
4. Add mappings for metrics, datasets etc. if they are present.
5. Restart PyPads. It will pick up the information when it is restarted.
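
Before restarting PyPads, it can help to check that a new mapping file has the layout described above. The following sketch assumes the file has already been parsed into a Python dict. The function check_mapping_file and its rules are illustrative assumptions derived from the sections of this page, not a validator shipped with PyPads.

```python
# Top-level sections described on this page.
KNOWN_SECTIONS = {"metadata", "fragments", "mappings"}


def check_mapping_file(content):
    """Return a list of problems found in a parsed mapping file (empty if none)."""
    problems = []
    for section in content:
        if section not in KNOWN_SECTIONS:
            problems.append(f"unknown top-level section '{section}'")
    metadata = content.get("metadata")
    if not isinstance(metadata, dict):
        problems.append("missing 'metadata' section")
        return problems
    for key in ("author", "version", "library"):
        if key not in metadata:
            problems.append(f"metadata lacks '{key}'")
    library = metadata.get("library")
    if isinstance(library, dict):
        for key in ("name", "version"):
            if key not in library:
                problems.append(f"metadata.library lacks '{key}'")
    return problems


draft = {
    "metadata": {
        "author": "Jane Doe",  # placeholder author for the example
        "version": "0.1.0",
        "library": {"name": "sklearn"},  # library version forgotten on purpose
    },
    "mappings": {},
}

for problem in check_mapping_file(draft):
    print(problem)
```

For the draft above, the only reported problem is the missing library version.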